14 — Floating Point and SIMD
Modern x86-64 processors support two floating-point systems (x87 FPU and SSE/AVX) and SIMD (Single Instruction, Multiple Data) execution. SIMD lets one instruction operate on multiple values simultaneously — critical for audio, video, scientific, and ML workloads.
Floating-Point Representations
IEEE 754 (the standard)
| Type | C Type | Bits | Sign | Exponent | Mantissa | Range (normal magnitudes) |
|---|---|---|---|---|---|---|
| Single | float | 32 | 1 | 8 | 23 | ~1.2×10⁻³⁸ to ~3.4×10³⁸ |
| Double | double | 64 | 1 | 11 | 52 | ~2.2×10⁻³⁰⁸ to ~1.8×10³⁰⁸ |
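Value = (−1)^sign × 2^(exp−bias) × 1.mantissa (normalized numbers; bias is 127 for single, 1023 for double)
Special values: ±Inf, NaN, and −0. An all-zero exponent field encodes zero and subnormal numbers; an all-one exponent field encodes ±Inf and NaN.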
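To make the field layout concrete, here is a minimal sketch that splits a single-precision value into sign, exponent, and mantissa using integer instructions. It assumes Linux x86-64 and a build with `nasm -f elf64` followed by `ld`; the label `value` and the choice of 5.75 are illustrative only.

```nasm
; 5.75 = 1.0111b × 2^2, stored as 0x40B80000:
;   sign 0 | exponent 10000001b (129) | mantissa 0111000...0b
section .data
value   dd 5.75

section .text
global _start
_start:
    mov     eax, [value]        ; raw 32-bit pattern of the float
    mov     ebx, eax
    shr     ebx, 31             ; ebx = sign bit (0)
    mov     ecx, eax
    shr     ecx, 23
    and     ecx, 0xFF           ; ecx = biased exponent (129)
    mov     edx, eax
    and     edx, 0x7FFFFF       ; edx = 23-bit mantissa field (0x380000)
    sub     ecx, 127            ; remove the bias: true exponent = 2

    mov     edi, ecx            ; exit status = true exponent
    mov     eax, 60             ; sys_exit
    syscall
```

Running the program and then `echo $?` should print 2, the unbiased exponent of 5.75.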
x87 FPU (Legacy)
The original 80-bit floating-point stack from the 8087 coprocessor. Still valid but largely replaced by SSE.
x87 Register Stack
8 registers: ST(0) through ST(7), organized as a stack. ST(0) is the top.
fld qword [src] ; push double from memory onto FP stack
fld1 ; push 1.0
fldz ; push 0.0
fldpi ; push π
fadd qword [src] ; ST(0) += mem
fadd st0, st1 ; ST(0) += ST(1)
fmul qword [src] ; ST(0) *= mem
fsub qword [src] ; ST(0) -= mem
fdiv qword [src] ; ST(0) /= mem
fsqrt ; ST(0) = √ST(0)
fsin ; ST(0) = sin(ST(0)) (radians)
fcos ; ST(0) = cos(ST(0))
fptan ; ST(1) = tan(ST(0)), ST(0) = 1.0
fstp qword [dst] ; pop ST(0), store to memory
fst qword [dst] ; store ST(0) without popping
x87 Example: Hypotenuse
; hypotenuse(a, b) = sqrt(a² + b²)
; a in [rbp-8], b in [rbp-16]
fld qword [rbp - 8] ; ST(0) = a
fmul st0, st0 ; ST(0) = a²
fld qword [rbp - 16] ; ST(0) = b, ST(1) = a²
fmul st0, st0 ; ST(0) = b²
faddp st1, st0 ; ST(1) += ST(0), then pop: ST(0) = a² + b²
fsqrt ; ST(0) = √(a²+b²)
fstp qword [rbp - 24] ; store result and pop, leaving the FP stack empty
x87 is largely superseded by SSE2 for scalar float/double. Prefer SSE in new code.
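SSE and SSE2 — Scalar Float/Double
SSE (Streaming SIMD Extensions) operates on 128-bit XMM registers; 64-bit mode provides sixteen of them, XMM0–XMM15. SSE handles single precision and SSE2 adds double precision. Both are part of the x86-64 baseline, so they are always available in 64-bit code.
For scalar (single value) operations, only the lowest element of the register takes part:
Scalar Single-Precision (float — 32-bit)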
movss xmm0, [src] ; load 32-bit float
addss xmm0, xmm1 ; xmm0 += xmm1 (scalar single)
subss xmm0, xmm1
mulss xmm0, xmm1
divss xmm0, xmm1
sqrtss xmm0, xmm1 ; xmm0 = sqrt(xmm1)
movss [dst], xmm0 ; store 32-bit float
Scalar Double-Precision (double — 64-bit)
movsd xmm0, [src] ; load 64-bit double
addsd xmm0, xmm1 ; xmm0 += xmm1 (scalar double)
subsd xmm0, xmm1
mulsd xmm0, xmm1
divsd xmm0, xmm1
sqrtsd xmm0, xmm1
movsd [dst], xmm0
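For comparison with the x87 hypotenuse example, the same computation can be sketched with scalar SSE2. This is a minimal, self-contained program assuming Linux x86-64 (build with `nasm -f elf64` and `ld`); the labels `side_a` and `side_b` and the 3-4-5 triangle are illustrative.

```nasm
; hypotenuse(a, b) = sqrt(a*a + b*b) using scalar double-precision SSE2
section .data
side_a  dq 3.0
side_b  dq 4.0

section .text
global _start
_start:
    movsd   xmm0, [side_a]      ; xmm0 = a
    mulsd   xmm0, xmm0          ; xmm0 = a*a = 9.0
    movsd   xmm1, [side_b]      ; xmm1 = b
    mulsd   xmm1, xmm1          ; xmm1 = b*b = 16.0
    addsd   xmm0, xmm1          ; xmm0 = 25.0
    sqrtsd  xmm0, xmm0          ; xmm0 = 5.0

    cvttsd2si edi, xmm0         ; truncate to int32 for the exit status
    mov     eax, 60             ; sys_exit
    syscall
```

The exit status (`echo $?`) should be 5. Unlike the x87 version, there is no register stack to keep balanced: each XMM register is addressed directly.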
Floating-Point Comparison (SSE)
ucomisd xmm0, xmm1 ; compare doubles (unordered), sets ZF/PF/CF
; Test PF first: an unordered result sets ZF=PF=CF=1, so je and jb would also fire on NaN
jp .unordered ; PF=1 → NaN involved
je .equal ; ZF=1 → xmm0 == xmm1
jb .less_than ; CF=1 → xmm0 < xmm1
ja .greater_than ; CF=0 and ZF=0 → xmm0 > xmm1
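The flag behaviour with NaN is easy to get wrong, so here is a small sketch that compares a number against a quiet NaN and reports the outcome through the exit status. It assumes Linux x86-64 with NASM and `ld`; the labels and the status encoding (0 equal, 1 less, 2 greater, 3 unordered) are made up for this example.

```nasm
section .data
x       dq 1.5
not_num dq 0x7FF8000000000000   ; bit pattern of a quiet NaN

section .text
global _start
_start:
    movsd   xmm0, [x]
    movsd   xmm1, [not_num]
    ucomisd xmm0, xmm1          ; unordered: ZF = PF = CF = 1
    jp      .unordered          ; must be tested before je/jb
    je      .equal
    jb      .less
    mov     edi, 2              ; xmm0 > xmm1
    jmp     .exit
.unordered:
    mov     edi, 3              ; at least one operand is NaN
    jmp     .exit
.equal:
    mov     edi, 0
    jmp     .exit
.less:
    mov     edi, 1
.exit:
    mov     eax, 60             ; sys_exit
    syscall
```

With the NaN operand the exit status should be 3. If the `jp` test were placed after `je`, the same input would be reported as equal.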
Type Conversion
; Integer ↔ Float
cvtsi2sd xmm0, rax ; convert int64 → double
cvtsi2ss xmm0, eax ; convert int32 → float
cvttsd2si rax, xmm0 ; convert double → int64 (truncate)
cvttss2si eax, xmm0 ; convert float → int32 (truncate)
; Float ↔ Double
cvtss2sd xmm0, xmm1 ; float → double
cvtsd2ss xmm0, xmm1 ; double → float (precision loss)
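The extra `t` in `cvttsd2si` means truncation toward zero. The sketch below contrasts it with `cvtsd2si` (not listed above), which rounds according to the current MXCSR rounding mode. It assumes Linux x86-64, NASM and `ld`, and the default round-to-nearest mode; the label `value` is illustrative.

```nasm
section .data
value   dq 2.7

section .text
global _start
_start:
    movsd     xmm0, [value]
    cvttsd2si eax, xmm0         ; truncate toward zero: eax = 2
    cvtsd2si  ebx, xmm0         ; round per MXCSR (nearest by default): ebx = 3

    imul      edi, eax, 10      ; pack both results into the exit status
    add       edi, ebx          ; edi = 2*10 + 3 = 23
    mov       eax, 60           ; sys_exit
    syscall
```

Under the default rounding mode the exit status should be 23. C-style casts from double to int correspond to the truncating form.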
SIMD — Packed Operations
SIMD processes multiple values in parallel using one instruction.
XMM Register Layout
An XMM register (128 bits) can hold:
4 × 32-bit floats [float3 | float2 | float1 | float0]
2 × 64-bit doubles [double1 | double0 ]
16 × 8-bit integers [b15|b14|...|b1|b0]
8 × 16-bit integers [w7|w6|w5|w4|w3|w2|w1|w0]
4 × 32-bit integers [d3|d2|d1|d0]
2 × 64-bit integers [q1|q0]
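YMM Register Layout (AVX, 256 bits)
A YMM register (256 bits) can hold:
8 × 32-bit floats [float7 | float6 | ... | float1 | float0]
4 × 64-bit doubles [double3 | double2 | double1 | double0]
The low 128 bits of each YMM register are the corresponding XMM register.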
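Packed Arithmetic Naming Convention
[v][operation][p|s][s|d]
; p = packed, s = scalar; final s = single, d = double; leading v = AVX encoding
addps ; packed single (4 floats in XMM)
addpd ; packed double (2 doubles in XMM)
addss ; scalar single (lowest float only)
vaddss ; scalar single (AVX version)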
SSE Packed Float Examples
Load, Add, Store
section .data
align 16 ; movaps requires 16-byte-aligned memory operands
a dd 1.0, 2.0, 3.0, 4.0 ; 4 floats
b dd 5.0, 6.0, 7.0, 8.0 ; starts 16 bytes after a, so it is aligned too
section .text
movaps xmm0, [a] ; load 4 aligned floats into xmm0
movaps xmm1, [b] ; load 4 aligned floats into xmm1
addps xmm0, xmm1 ; xmm0 = {6.0, 8.0, 10.0, 12.0}
movaps [a], xmm0 ; store result
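Alignment Note
- MOVAPS/MOVAPD: aligned; the memory address must be a multiple of 16, otherwise the instruction raises a general-protection fault and the program crashes
- MOVUPS/MOVUPD: unaligned; works at any address (historically slower, though the penalty on recent CPUs is small)
- In the .data section, put align 16 before SIMD data.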
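As an illustration of the alignment rules, the following sketch aligns a buffer with `align 16` and then reads it once at an aligned address and once at an address that is off by 4 bytes. It assumes Linux x86-64, NASM and `ld`; the label `vec` is illustrative.

```nasm
section .data
align 16
vec     dd 1.0, 2.0, 3.0, 4.0, 5.0   ; vec is 16-byte aligned, vec + 4 is not

section .text
global _start
_start:
    movaps  xmm0, [vec]         ; aligned load: {1, 2, 3, 4}
    movups  xmm1, [vec + 4]     ; unaligned load: {2, 3, 4, 5}
                                ; (movaps on vec + 4 would fault)
    addps   xmm0, xmm1          ; {3, 5, 7, 9}

    cvttss2si edi, xmm0         ; lowest lane as integer: 3
    mov     eax, 60             ; sys_exit
    syscall
```

The exit status should be 3. Swapping the `movups` for `movaps` is a quick way to observe the fault described above.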
Packed Integer Instructions (SSE2)
movdqu xmm0, [src] ; load 128-bit unaligned (integers)
paddb xmm0, xmm1 ; add 16 bytes (8-bit each)
paddw xmm0, xmm1 ; add 8 words (16-bit each)
paddd xmm0, xmm1 ; add 4 dwords (32-bit each)
paddq xmm0, xmm1 ; add 2 qwords (64-bit each)
psubb xmm0, xmm1 ; subtract bytes
pmullw xmm0, xmm1 ; multiply 8 × 16-bit (low result)
pand xmm0, xmm1 ; bitwise AND
por xmm0, xmm1 ; bitwise OR
pxor xmm0, xmm1 ; bitwise XOR
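One detail worth seeing in action is that the `padd*` family wraps around on overflow, independently in each lane. The sketch below adds 10 to sixteen bytes holding 250. It assumes Linux x86-64, NASM and `ld`; the labels are illustrative, and `movdqa` (the aligned counterpart of `movdqu`) is used because the data is aligned.

```nasm
section .data
align 16
bytes_a times 16 db 250
bytes_b times 16 db 10

section .text
global _start
_start:
    movdqa  xmm0, [bytes_a]     ; 16 lanes of 250
    paddb   xmm0, [bytes_b]     ; each lane: (250 + 10) mod 256 = 4
                                ; no carry leaks into the neighbouring lane

    movd    eax, xmm0           ; low 4 lanes: 0x04040404
    movzx   edi, al             ; exit status = lane 0 = 4
    mov     eax, 60             ; sys_exit
    syscall
```

The exit status should be 4. When clamping is wanted instead of wraparound (for example with pixel data), SSE2 also provides saturating forms such as `paddusb`.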
AVX — 256-bit Operations
AVX uses 256-bit YMM registers (8 floats or 4 doubles at once). AVX instructions carry a V prefix (VEX encoding).
; Requires AVX: check CPUID before use, or build only for AVX-capable targets (e.g. gcc -mavx)
vmovaps ymm0, [src] ; load 8 floats (address must be 32-byte aligned)
vaddps ymm0, ymm1, ymm2 ; ymm0 = ymm1 + ymm2 (3-operand, non-destructive)
vmulps ymm0, ymm1, ymm2 ; ymm0 = ymm1 * ymm2
vdivps ymm0, ymm1, ymm2
vsqrtps ymm0, ymm1 ; element-wise sqrt
vmovaps [dst], ymm0 ; store
3-operand form: AVX allows dst ≠ src1, unlike SSE which always overwrites an input.
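A complete AVX version of the load, add, store pattern might look like the sketch below. It assumes Linux x86-64, NASM and `ld`, and a CPU and OS that support AVX (it will fault with an illegal instruction otherwise); the labels `va` and `vb` are illustrative.

```nasm
section .data
align 32                        ; vmovaps on YMM needs 32-byte alignment
va      dd 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0
vb      dd 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0

section .text
global _start
_start:
    vmovaps ymm1, [va]
    vmovaps ymm2, [vb]
    vaddps  ymm0, ymm1, ymm2    ; all 8 lanes = 9.0; ymm1 and ymm2 are unchanged

    vcvttss2si edi, xmm0        ; lowest lane as integer: 9
    vzeroupper                  ; clear upper YMM halves before leaving AVX code
    mov     eax, 60             ; sys_exit
    syscall
```

On an AVX-capable machine the exit status should be 9. Because the destination is separate, both inputs remain available for later instructions without an extra register copy.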
Horizontal Operations
Operating across lanes within a single register:
; Sum all 4 floats in XMM0 = {a, b, c, d} (HADDPS requires SSE3)
haddps xmm0, xmm0 ; {a+b, c+d, a+b, c+d}
haddps xmm0, xmm0 ; {a+b+c+d, a+b+c+d, ...} → XMM0[0] = sum
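`haddps` is an SSE3 instruction. Where only baseline SSE can be assumed, the same sum can be built from shuffles; the following is one possible sketch, not the only way to do it. It assumes Linux x86-64, NASM and `ld`; the label `vals` is illustrative.

```nasm
section .data
align 16
vals    dd 1.0, 2.0, 3.0, 4.0

section .text
global _start
_start:
    movaps  xmm0, [vals]        ; xmm0 = {1, 2, 3, 4} (lane 0 listed first)
    movaps  xmm1, xmm0
    movhlps xmm1, xmm0          ; xmm1 = {3, 4, 3, 4} (high half copied to low half)
    addps   xmm0, xmm1          ; xmm0 = {4, 6, 6, 8}
    movaps  xmm1, xmm0
    shufps  xmm1, xmm1, 0x55    ; broadcast lane 1: xmm1 = {6, 6, 6, 6}
    addss   xmm0, xmm1          ; lane 0 = 4 + 6 = 10

    cvttss2si edi, xmm0         ; exit status = 10
    mov     eax, 60             ; sys_exit
    syscall
```

The exit status should be 10. Only lane 0 holds the total here, whereas the double `haddps` sequence leaves it in every lane.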
Complete Example: Dot Product
; dot_product: sum of element-wise products
; a, b: arrays of 4 floats (16-byte aligned)
; result in xmm0
section .data
align 16
vec_a dd 1.0, 2.0, 3.0, 4.0
align 16
vec_b dd 5.0, 6.0, 7.0, 8.0
section .text
global _start
_start:
movaps xmm0, [vec_a] ; xmm0 = {1, 2, 3, 4}
movaps xmm1, [vec_b] ; xmm1 = {5, 6, 7, 8}
mulps xmm0, xmm1 ; xmm0 = {5, 12, 21, 32}
; Horizontal sum: {5+12, 21+32, 5+12, 21+32}
haddps xmm0, xmm0 ; xmm0 = {17, 53, 17, 53}
haddps xmm0, xmm0 ; xmm0 = {70, 70, 70, 70}
; xmm0[0] = 70 = dot product
; Or use SSE4.1 DPPS instruction:
; movaps xmm0, [vec_a]
; movaps xmm1, [vec_b]
; dpps xmm0, xmm1, 0xFF ; full dot product in xmm0[0]
mov rax, 60
xor rdi, rdi
syscall
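For vectors longer than four elements, a common pattern is to keep four partial sums in one register and do the horizontal reduction once at the end. The sketch below assumes Linux x86-64, NASM and `ld`, an element count that is a multiple of 4, and SSE3 for `haddps`; the labels `arr_a`, `arr_b`, and `count` are illustrative.

```nasm
section .data
align 16
arr_a   dd 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0
arr_b   dd 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0
count   equ 8                   ; number of floats (multiple of 4)

section .text
global _start
_start:
    xorps   xmm0, xmm0          ; four running partial sums = 0
    xor     ecx, ecx            ; rcx = byte offset into both arrays
.loop:
    movaps  xmm1, [arr_a + rcx] ; next 4 floats of a
    mulps   xmm1, [arr_b + rcx] ; multiply by the matching 4 floats of b
    addps   xmm0, xmm1          ; accumulate per lane
    add     rcx, 16
    cmp     rcx, count * 4      ; processed all bytes?
    jb      .loop

    haddps  xmm0, xmm0          ; reduce the 4 partial sums once
    haddps  xmm0, xmm0          ; xmm0[0] = 62.0

    cvttss2si edi, xmm0         ; exit status = 62
    mov     eax, 60             ; sys_exit
    syscall
```

The partial sums after the loop are {11, 14, 17, 20}, so the exit status should be 62. Keeping the reduction outside the loop matters because horizontal instructions are typically slower than vertical ones.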
CPUID — Detecting SIMD Support
SSE and SSE2 are guaranteed on every x86-64 CPU. Before using later extensions (SSE3, SSE4.1, AVX), check CPU capability:
; Check for SSE4.1 support
mov eax, 1
cpuid ; eax=1: returns feature flags in ecx/edx
test ecx, (1 << 19) ; bit 19 of ECX = SSE4.1
jz .no_sse41
; SSE4.1 is available
.no_sse41:
Common feature bits (CPUID EAX=1):
- EDX bit 25: SSE
- EDX bit 26: SSE2
- ECX bit 0: SSE3
- ECX bit 19: SSE4.1
- ECX bit 28: AVX
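For AVX the CPUID bit alone is not sufficient: the operating system must also save and restore the YMM state. A fuller check therefore tests the OSXSAVE bit (ECX bit 27) and then reads XCR0 with `xgetbv`. The following is a sketch of that sequence, assuming Linux x86-64, NASM and `ld`; the exit-status convention (1 = usable, 0 = not usable) is made up for this example.

```nasm
section .text
global _start
_start:
    mov     eax, 1
    cpuid                       ; clobbers eax, ebx, ecx, edx
    and     ecx, (1 << 27) | (1 << 28)
    cmp     ecx, (1 << 27) | (1 << 28)
    jne     .no_avx             ; need both OSXSAVE (27) and AVX (28)

    xor     ecx, ecx            ; select XCR0
    xgetbv                      ; edx:eax = XCR0
    and     eax, 6              ; bit 1 = SSE state, bit 2 = AVX state
    cmp     eax, 6
    jne     .no_avx             ; OS does not preserve YMM registers

    mov     edi, 1              ; AVX is usable
    jmp     .exit
.no_avx:
    xor     edi, edi            ; AVX is not usable
.exit:
    mov     eax, 60             ; sys_exit
    syscall
```

The result depends on the machine. Note that `cpuid` overwrites EBX, which is callee-saved in the System V ABI, so a function performing this check would need to preserve it.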
Key Takeaways
| Feature | Registers | Width | Typical Use |
|---|---|---|---|
| x87 FPU | ST(0)–ST(7) | 80-bit | Legacy; avoid in new code |
| SSE scalar | XMM0–XMM15 | 32/64-bit | Single float/double |
| SSE packed | XMM0–XMM15 | 128-bit | 4 floats / 2 doubles / 16 bytes |
| AVX packed | YMM0–YMM15 | 256-bit | 8 floats / 4 doubles |
| AVX-512 | ZMM0–ZMM31 | 512-bit | 16 floats / 8 doubles |
- Use SS/SD suffix for scalar SSE ops
- Use PS/PD suffix for packed SSE ops
- Data must be 16-byte aligned for MOVAPS; use MOVUPS for unaligned
- AVX uses 3-operand non-destructive form: vop dst, src1, src2