Writing SIMD in Crystal with Inline Assembly

Introduction

In this article, we explore how to write SIMD instructions—SSE for x86_64 and NEON for AArch64—using inline assembly in the Crystal programming language.
Crystal uses LLVM as its backend, but the compiler does not yet take full advantage of SIMD auto-vectorization.
This is not a performance tuning guide, but rather a fun exploration into low-level programming with Crystal.

asm Syntax

Crystal provides the asm keyword for writing inline assembly. The syntax is based on LLVM's integrated assembler.

asm("template" : outputs : inputs : clobbers : flags) 
Enter fullscreen mode Exit fullscreen mode

Each section:

  • template: LLVM-style assembly code
  • outputs: Output operands
  • inputs: Input operands
  • clobbers: Registers that will be modified
  • flags: Optional (e.g., "volatile")

For a detailed explanation, see the official docs.
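
As a warm-up, here is a minimal x86_64-only example adapted from the Crystal reference: it moves an immediate into whatever register LLVM assigns to `$0`.

```crystal
dst = 0
asm("mov $$1234, $0" : "=r"(dst)) # $$ escapes a literal $ (AT&T immediate syntax)
puts dst # => 1234
```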

Types of SIMD Instructions

  • SSE / AVX for Intel and AMD CPUs (x86_64)
  • NEON for ARM CPUs (like Apple Silicon)

Types of Registers

Registers Used in x86_64

  • General-purpose: rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp, r8–r15
  • SIMD:

| Name | Width | Instruction Set | Usage |
| --- | --- | --- | --- |
| xmm0–xmm15 | 128-bit | SSE | Floats, ints |
| ymm0–ymm15 | 256-bit | AVX | Wider SIMD |
| zmm0–zmm31 | 512-bit | AVX-512 | Used in newer CPUs |

Registers Used in AArch64 (NEON)

  • Vector registers: v0–v31

    • v0.4s = 4 × 32-bit single-precision floats
    • v1.8h = 8 × 16-bit lanes (halfwords: Int16 values or half-precision floats)

Examples of Register Specification

  • SSE: xmm0, xmm1, etc.
  • NEON: v0.4s, v1.8h, etc.

Note:

  • LLVM assigns SSE registers automatically (see the sketch below)
  • NEON requires explicit register naming in inline assembly
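
A minimal sketch of the first point, assuming an x86_64 target and that Crystal forwards LLVM's "x" and matching ("0") constraint strings unchanged: the template below never names an xmm register, LLVM substitutes the ones it picks.

```crystal
# Hypothetical example (ours): scalar SSE addition with LLVM-chosen registers.
# "=x" puts the output in an SSE register of LLVM's choosing;
# "0" ties input `a` to the same register as output $0.
def scalar_add(a : Float32, b : Float32) : Float32
  result = uninitialized Float32
  asm(
    "addss $2, $0" # $0 += $2; LLVM fills in the registers it allocated
    : "=x"(result)
    : "0"(a), "x"(b)
    :: "volatile"
  )
  result
end

puts scalar_add(1.5_f32, 2.5_f32) # => 4.0
```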

Prerequisites

To follow along:

  • Emit LLVM IR: `crystal build --emit llvm-ir foo.cr`
  • Emit assembly: `crystal build --emit asm foo.cr`
  • Benchmarking tool: hyperfine
  • Use of `uninitialized` and `to_unsafe` for low-level memory access (illustrated below)
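
The last item is worth a quick illustration, since every example below leans on it (variable names are ours):

```crystal
arr = StaticArray[1.0_f32, 2.0_f32, 3.0_f32, 4.0_f32]

ptr = arr.to_unsafe # Pointer(Float32) to the first element, usable as an asm operand
buf = uninitialized StaticArray(Float32, 4) # stack allocation with no zero-fill

# The asm block writes through buf.to_unsafe, which is what initializes it.
```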


Basic Vector Operations

Vector Addition

SSE (x86_64)

```crystal
a = StaticArray[1.0_f32, 2.0_f32, 3.0_f32, 4.0_f32]
b = StaticArray[5.0_f32, 6.0_f32, 7.0_f32, 8.0_f32]

def simd_vector_add(a : StaticArray(Float32, 4), b : StaticArray(Float32, 4)) : StaticArray(Float32, 4)
  result = uninitialized StaticArray(Float32, 4)
  a_ptr = a.to_unsafe
  b_ptr = b.to_unsafe
  result_ptr = result.to_unsafe

  asm(
    "movups ($1), %xmm0 // load vector a into xmm0
     movups ($2), %xmm1 // load vector b into xmm1
     addps %xmm1, %xmm0 // perform parallel addition of four 32-bit floats
     movups %xmm0, ($0) // store result to memory"
    :: "r"(result_ptr), "r"(a_ptr), "r"(b_ptr)
    : "xmm0", "xmm1", "memory"
    : "volatile"
  )

  result
end

puts "Vector addition: #{simd_vector_add(a, b)}"
```

NEON (AArch64)

```crystal
a = StaticArray[1.0_f32, 2.0_f32, 3.0_f32, 4.0_f32]
b = StaticArray[5.0_f32, 6.0_f32, 7.0_f32, 8.0_f32]

def simd_vector_add(a : StaticArray(Float32, 4), b : StaticArray(Float32, 4)) : StaticArray(Float32, 4)
  result = uninitialized StaticArray(Float32, 4)
  a_ptr = a.to_unsafe
  b_ptr = b.to_unsafe
  result_ptr = result.to_unsafe

  asm(
    "ld1 {v0.4s}, [$1]        // load vector a
     ld1 {v1.4s}, [$2]        // load vector b
     fadd v2.4s, v0.4s, v1.4s // add each element
     st1 {v2.4s}, [$0]        // store the result"
    :: "r"(result_ptr), "r"(a_ptr), "r"(b_ptr)
    : "v0", "v1", "v2", "memory"
    : "volatile"
  )

  result
end

puts "Vector addition: #{simd_vector_add(a, b)}"
```
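
For reference, here is a plain Crystal version of the same operation (a sketch of ours; whether LLVM vectorizes it depends on optimization settings):

```crystal
def scalar_vector_add(a : StaticArray(Float32, 4), b : StaticArray(Float32, 4)) : StaticArray(Float32, 4)
  StaticArray(Float32, 4).new { |i| a[i] + b[i] }
end

puts "Vector addition: #{scalar_vector_add(a, b)}" # same result: [6.0, 8.0, 10.0, 12.0]
```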

Vector Multiplication

SSE (x86_64)

```crystal
a = StaticArray[1.0_f32, 2.0_f32, 3.0_f32, 4.0_f32]
b = StaticArray[5.0_f32, 6.0_f32, 7.0_f32, 8.0_f32]

def simd_vector_multiply(a : StaticArray(Float32, 4), b : StaticArray(Float32, 4)) : StaticArray(Float32, 4)
  result = uninitialized StaticArray(Float32, 4)
  a_ptr = a.to_unsafe
  b_ptr = b.to_unsafe
  result_ptr = result.to_unsafe

  asm(
    "movups ($1), %xmm0 // load vector a into xmm0
     movups ($2), %xmm1 // load vector b into xmm1
     mulps %xmm1, %xmm0 // perform parallel multiplication of four 32-bit floats
     movups %xmm0, ($0) // store result to memory"
    :: "r"(result_ptr), "r"(a_ptr), "r"(b_ptr)
    : "xmm0", "xmm1", "memory"
    : "volatile"
  )

  result
end

puts "Vector multiplication: #{simd_vector_multiply(a, b)}"
```

NEON (AArch64)

```crystal
a = StaticArray[1.0_f32, 2.0_f32, 3.0_f32, 4.0_f32]
b = StaticArray[5.0_f32, 6.0_f32, 7.0_f32, 8.0_f32]

def simd_vector_multiply(a : StaticArray(Float32, 4), b : StaticArray(Float32, 4)) : StaticArray(Float32, 4)
  result = uninitialized StaticArray(Float32, 4)
  a_ptr = a.to_unsafe
  b_ptr = b.to_unsafe
  result_ptr = result.to_unsafe

  asm(
    "ld1 {v0.4s}, [$1]        // load vector a
     ld1 {v1.4s}, [$2]        // load vector b
     fmul v2.4s, v0.4s, v1.4s // multiply each element
     st1 {v2.4s}, [$0]        // store the result"
    :: "r"(result_ptr), "r"(a_ptr), "r"(b_ptr)
    : "v0", "v1", "v2", "memory"
    : "volatile"
  )

  result
end

puts "Vector multiplication: #{simd_vector_multiply(a, b)}"
```
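
Multiply and add are often combined: NEON's fmla performs a fused multiply-accumulate (acc += a * b per lane, with a single rounding) in one instruction. A hypothetical sketch in the same style as the examples above (function name is ours):

```crystal
def simd_vector_fma(a : StaticArray(Float32, 4), b : StaticArray(Float32, 4), acc : StaticArray(Float32, 4)) : StaticArray(Float32, 4)
  result = uninitialized StaticArray(Float32, 4)
  a_ptr = a.to_unsafe
  b_ptr = b.to_unsafe
  acc_ptr = acc.to_unsafe
  result_ptr = result.to_unsafe

  asm(
    "ld1 {v0.4s}, [$1]        // load vector a
     ld1 {v1.4s}, [$2]        // load vector b
     ld1 {v2.4s}, [$3]        // load the accumulator
     fmla v2.4s, v0.4s, v1.4s // v2 += v0 * v1 (fused multiply-add)
     st1 {v2.4s}, [$0]        // store the result"
    :: "r"(result_ptr), "r"(a_ptr), "r"(b_ptr), "r"(acc_ptr)
    : "v0", "v1", "v2", "memory"
    : "volatile"
  )

  result
end

acc = StaticArray[0.5_f32, 0.5_f32, 0.5_f32, 0.5_f32]
puts "FMA: #{simd_vector_fma(a, b, acc)}" # => acc[i] + a[i] * b[i] per lane
```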

Aggregation Operations

Vector Sum

SSE (x86_64)

```crystal
a = StaticArray[1.0_f32, 2.0_f32, 3.0_f32, 4.0_f32]

def simd_vector_sum(vec : StaticArray(Float32, 4)) : Float32
  result = uninitialized Float32
  vec_ptr = vec.to_unsafe
  result_ptr = pointerof(result)

  asm(
    "movups ($1), %xmm0 // load vector into xmm0
     haddps %xmm0, %xmm0 // horizontal add: [a+b, c+d, a+b, c+d]
     haddps %xmm0, %xmm0 // horizontal add again: [a+b+c+d, *, *, *]
     movss %xmm0, ($0)   // store the first element of result"
    :: "r"(result_ptr), "r"(vec_ptr)
    : "xmm0", "memory"
    : "volatile"
  )

  result
end

puts "Vector sum: #{simd_vector_sum(a)}"
```

NEON (AArch64)

```crystal
a = StaticArray[1.0_f32, 2.0_f32, 3.0_f32, 4.0_f32]

def simd_vector_sum(vec : StaticArray(Float32, 4)) : Float32
  result = uninitialized Float32
  vec_ptr = vec.to_unsafe
  result_ptr = pointerof(result)

  asm(
    "ld1 {v0.4s}, [$1]         // load vector
     faddp v1.4s, v0.4s, v0.4s // pairwise add: [a+b, c+d, a+b, c+d]
     faddp v2.2s, v1.2s, v1.2s // pairwise add again: [a+b+c+d, *]
     str s2, [$0]              // store the final sum"
    :: "r"(result_ptr), "r"(vec_ptr)
    : "v0", "v1", "v2", "memory"
    : "volatile"
  )

  result
end

puts "Vector sum: #{simd_vector_sum(a)}"
```
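
A quick cross-check against plain Crystal's Enumerable reduction:

```crystal
puts a.sum # => 10.0, the same value both SIMD reductions produce
```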

Finding Maximum Value

SSE (x86_64)

```crystal
a = StaticArray[1.0_f32, 2.0_f32, 3.0_f32, 4.0_f32]

def simd_vector_max(vec : StaticArray(Float32, 4)) : Float32
  result = uninitialized Float32
  vec_ptr = vec.to_unsafe
  result_ptr = pointerof(result)

  asm(
    "movups ($1), %xmm0         // load vector into xmm0
     movaps %xmm0, %xmm1        // copy xmm0 to xmm1
     shufps $$0x4E, %xmm1, %xmm1 // swap upper and lower pairs
     maxps %xmm1, %xmm0         // compute max of each pair
     movaps %xmm0, %xmm1        // copy result to xmm1
     shufps $$0x01, %xmm1, %xmm1 // shuffle adjacent elements
     maxps %xmm1, %xmm0         // compute final max
     movss %xmm0, ($0)          // store the result"
    :: "r"(result_ptr), "r"(vec_ptr)
    : "xmm0", "xmm1", "memory"
    : "volatile"
  )

  result
end

puts "Vector max: #{simd_vector_max(a)}"
```

NEON (AArch64)

```crystal
a = StaticArray[1.0_f32, 2.0_f32, 3.0_f32, 4.0_f32]

def simd_vector_max(vec : StaticArray(Float32, 4)) : Float32
  result = uninitialized Float32
  vec_ptr = vec.to_unsafe
  result_ptr = pointerof(result)

  asm(
    "ld1 {v0.4s}, [$1]         // load vector
     fmaxp v1.4s, v0.4s, v0.4s // pairwise max: [max(a, b), max(c, d), ...]
     fmaxp v2.2s, v1.2s, v1.2s // final pairwise max
     str s2, [$0]              // store result"
    :: "r"(result_ptr), "r"(vec_ptr)
    : "v0", "v1", "v2", "memory"
    : "volatile"
  )

  result
end

puts "Vector max: #{simd_vector_max(a)}"
```
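
AArch64 also has fmaxv (listed in the appendix), which collapses the two pairwise steps into a single vector-wide reduction. A variant sketch (function name is ours):

```crystal
def simd_vector_max_v(vec : StaticArray(Float32, 4)) : Float32
  result = uninitialized Float32
  vec_ptr = vec.to_unsafe
  result_ptr = pointerof(result)

  asm(
    "ld1 {v0.4s}, [$1] // load vector
     fmaxv s1, v0.4s   // vector-wide max straight into s1
     str s1, [$0]      // store result"
    :: "r"(result_ptr), "r"(vec_ptr)
    : "v0", "v1", "memory"
    : "volatile"
  )

  result
end

puts "Vector max (fmaxv): #{simd_vector_max_v(a)}" # => 4.0, same as a.max
```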

Integer Operations

Integer Addition

SSE (x86_64)

```crystal
int_a = StaticArray[1, 2, 3, 4]
int_b = StaticArray[10, 20, 30, 40]

def simd_int_add(a : StaticArray(Int32, 4), b : StaticArray(Int32, 4)) : StaticArray(Int32, 4)
  result = uninitialized StaticArray(Int32, 4)
  a_ptr = a.to_unsafe
  b_ptr = b.to_unsafe
  result_ptr = result.to_unsafe

  asm(
    "movdqu ($1), %xmm0 // load integer vector a into xmm0
     movdqu ($2), %xmm1 // load integer vector b into xmm1
     paddd %xmm1, %xmm0 // perform parallel addition of four 32-bit integers
     movdqu %xmm0, ($0) // store result to memory"
    :: "r"(result_ptr), "r"(a_ptr), "r"(b_ptr)
    : "xmm0", "xmm1", "memory"
    : "volatile"
  )

  result
end

puts "Integer addition: #{simd_int_add(int_a, int_b)}"
```

NEON (AArch64)

```crystal
int_a = StaticArray[1, 2, 3, 4]
int_b = StaticArray[10, 20, 30, 40]

def simd_int_add(a : StaticArray(Int32, 4), b : StaticArray(Int32, 4)) : StaticArray(Int32, 4)
  result = uninitialized StaticArray(Int32, 4)
  a_ptr = a.to_unsafe
  b_ptr = b.to_unsafe
  result_ptr = result.to_unsafe

  asm(
    "ld1 {v0.4s}, [$1]       // load integer vector a
     ld1 {v1.4s}, [$2]       // load integer vector b
     add v2.4s, v0.4s, v1.4s // perform element-wise addition
     st1 {v2.4s}, [$0]       // store result to memory"
    :: "r"(result_ptr), "r"(a_ptr), "r"(b_ptr)
    : "v0", "v1", "v2", "memory"
    : "volatile"
  )

  result
end

puts "Integer addition: #{simd_int_add(int_a, int_b)}"
```

Saturated Addition

SSE (x86_64)

```crystal
sat_a = StaticArray[29_000_i16, 30_000_i16, 31_000_i16, 32_000_i16, 32_000_i16, 32_000_i16, 32_000_i16, 32_000_i16]
sat_b = StaticArray[1_000_i16, 1_000_i16, 1_000_i16, 1_000_i16, 500_i16, 600_i16, 700_i16, 800_i16]

def simd_saturated_add(a : StaticArray(Int16, 8), b : StaticArray(Int16, 8)) : StaticArray(Int16, 8)
  result = uninitialized StaticArray(Int16, 8)
  a_ptr = a.to_unsafe
  b_ptr = b.to_unsafe
  result_ptr = result.to_unsafe

  asm(
    "movdqu ($1), %xmm0 // load 8 × 16-bit integers into xmm0
     movdqu ($2), %xmm1 // load 8 × 16-bit integers into xmm1
     paddsw %xmm1, %xmm0 // perform saturated addition
     movdqu %xmm0, ($0) // store result to memory"
    :: "r"(result_ptr), "r"(a_ptr), "r"(b_ptr)
    : "xmm0", "xmm1", "memory"
    : "volatile"
  )

  result
end

puts "Saturated addition: #{simd_saturated_add(sat_a, sat_b)}"
```

NEON (AArch64)

```crystal
sat_a = StaticArray[29_000_i16, 30_000_i16, 31_000_i16, 32_000_i16, 32_000_i16, 32_000_i16, 32_000_i16, 32_000_i16]
sat_b = StaticArray[1_000_i16, 1_000_i16, 1_000_i16, 1_000_i16, 500_i16, 600_i16, 700_i16, 800_i16]

def simd_saturated_add(a : StaticArray(Int16, 8), b : StaticArray(Int16, 8)) : StaticArray(Int16, 8)
  result = uninitialized StaticArray(Int16, 8)
  a_ptr = a.to_unsafe
  b_ptr = b.to_unsafe
  result_ptr = result.to_unsafe

  asm(
    "ld1 {v0.8h}, [$1]         // load 8 × 16-bit integers from a into v0
     ld1 {v1.8h}, [$2]         // load 8 × 16-bit integers from b into v1
     sqadd v2.8h, v0.8h, v1.8h // perform saturated addition
     st1 {v2.8h}, [$0]         // store result to memory"
    :: "r"(result_ptr), "r"(a_ptr), "r"(b_ptr)
    : "v0", "v1", "v2", "memory"
    : "volatile"
  )

  result
end

puts "Saturated addition: #{simd_saturated_add(sat_a, sat_b)}"
```
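
Int16 tops out at 32767, so the fourth lane (32_000 + 1_000) and the last lane (32_000 + 800) clamp to 32767 instead of wrapping. Plain Crystal's wrapping arithmetic behaves differently:

```crystal
puts 32_000_i16 &+ 1_000_i16 # => -32536 (Crystal's wrapping operator &+)
# The paddsw / sqadd lanes above clamp to 32767 instead.
```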

Examining LLVM-IR and Assembly

To inspect LLVM IR output:

```
crystal build your_file.cr --emit llvm-ir --no-debug
```

To inspect raw assembly:

```
crystal build your_file.cr --emit asm --no-debug
```

You’ll see that your inline asm blocks are preserved as-is, even with optimizations (-O3).

```llvm
__crystal_once.exit.i.i:                          ; preds = %else.i.i.i, %.noexc98
  call void @llvm.lifetime.start.p0(i64 16, ptr nonnull %path.i.i.i.i.i)
  call void @llvm.lifetime.start.p0(i64 16, ptr nonnull %obj1.i.i.i.i)
  call void @llvm.lifetime.start.p0(i64 16, ptr nonnull %b2.i.i.i)
  store <4 x float> <float 1.000000e+00, float 2.000000e+00, float 3.000000e+00, float 4.000000e+00>, ptr %obj1.i.i.i.i, align 16
  store <4 x float> <float 5.000000e+00, float 6.000000e+00, float 7.000000e+00, float 8.000000e+00>, ptr %b2.i.i.i, align 16
  call void asm sideeffect "ld1 {v0.4s}, [$1] \0Ald1 {v1.4s}, [$2] \0Afadd v2.4s, v0.4s, v1.4s \0Ast1 {v2.4s}, [$0]", "r,r,r,~{v0},~{v1},~{v2},~{memory}"(ptr nonnull %path.i.i.i.i.i, ptr nonnull %obj1.i.i.i.i, ptr nonnull %b2.i.i.i) #30
  %314 = load <4 x float>, ptr %path.i.i.i.i.i, align 16
  call void @llvm.lifetime.end.p0(i64 16, ptr nonnull %path.i.i.i.i.i)
  call void @llvm.lifetime.end.p0(i64 16, ptr nonnull %obj1.i.i.i.i)
  call void @llvm.lifetime.end.p0(i64 16, ptr nonnull %b2.i.i.i)
  %315 = invoke ptr @GC_malloc(i64 80)
          to label %.noexc100 unwind label %rescue2.loopexit.split-lp.loopexit.split-lp.loopexit.split-lp
```
```asm
Lloh2300:
  ldr q1, [x9, lCPI312_43@PAGEOFF]
  add x8, sp, #164
  add x9, sp, #128
  str q0, [sp, #128]
  stur q1, [x29, #-128]
  ; InlineAsm Start
  ld1.4s { v0 }, [x9]
  ld1.4s { v1 }, [x10]
  fadd.4s v2, v0, v1
  st1.4s { v2 }, [x8]
  ; InlineAsm End
  ldr q0, [x25]
  str q0, [sp, #16]
```

Miscellaneous

When using SIMD with parallelism, memory bandwidth can become the bottleneck.
Although Crystal currently runs single-threaded by default, true parallelism is in progress, and memory limitations may become relevant in the future.
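
To actually measure any of this, hyperfine (from the prerequisites) can compare the compiled binaries; the file names below are placeholders:

```
crystal build --release simd_add.cr
crystal build --release scalar_add.cr
hyperfine ./simd_add ./scalar_add
```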

Conclusion

We’ve explored how to write SIMD operations in Crystal using inline asm, and examined how those instructions are lowered into LLVM IR and eventually into assembly.

This was a deep dive into low-level Crystal.


Appendix: SIMD Instruction Reference

SSE (x86_64)

| Instruction | Description |
| --- | --- |
| movups | Load/store 4 × Float32 (unaligned) |
| movaps | Load/store 4 × Float32 (aligned) |
| movdqu | Load/store 4 × Int32 or 8 × Int16 (unaligned) |
| movss | Store scalar Float32 (lowest lane) |
| addps | Add 4 × Float32 |
| mulps | Multiply 4 × Float32 |
| paddd | Add 4 × Int32 |
| paddsw | Saturated add 8 × Int16 |
| haddps | Horizontal add of Float32 pairs (SSE3) |
| maxps | Element-wise max (Float32) |
| shufps | Shuffle Float32 lanes (for reduction) |

NEON (AArch64)

| Instruction | Description |
| --- | --- |
| ld1 | Load vector (e.g. v0.4s, v0.8h) |
| st1 | Store vector |
| add | Add 4 × Int32 |
| sqadd | Saturated add 8 × Int16 |
| fadd | Add 4 × Float32 |
| fmul | Multiply 4 × Float32 |
| faddp | Pairwise add (Float32 reduction) |
| fmaxp | Pairwise max (Float32 reduction) |
| faddv | Vector-wide float add (SVE only, not base NEON) |
| fmaxv | Vector-wide max (replaces two fmaxp steps) |

Notes

  • SSE's movaps and movdqa require 16-byte alignment.
  • NEON's faddp, fmaxp reduce in two steps: 4 → 2 → 1.
  • shufps is used with masks like 0x4E and 0x01 to reorder lanes during reduction (decoded in the sketch below).
  • Saturated arithmetic (paddsw, sqadd) clamps values on overflow.
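
To decode those shufps masks: the immediate packs four 2-bit lane selectors, with the selector for lane 0 in the lowest bits. A small helper of ours shows which source lanes each mask picks:

```crystal
# Decode a shufps immediate into its four 2-bit lane selectors.
def shufps_lanes(mask : UInt8) : Array(Int32)
  (0..3).map { |i| ((mask >> (2 * i)) & 0b11).to_i }
end

puts shufps_lanes(0x4E_u8) # => [2, 3, 0, 1]  swaps the upper and lower 64-bit halves
puts shufps_lanes(0x01_u8) # => [1, 0, 0, 0]  moves lane 1 into lane 0
```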

Thanks for reading — and happy crystaling! 💎
