
14 — Floating Point and SIMD

Modern x86-64 processors support two floating-point systems (x87 FPU and SSE/AVX) and SIMD (Single Instruction, Multiple Data) execution. SIMD lets one instruction operate on multiple values simultaneously — critical for audio, video, scientific, and ML workloads.


Floating-Point Representations

IEEE 754 (the standard)

Type     C Type   Bits   Sign   Exponent   Mantissa   Range
Single   float    32     1      8          23         ~1.2×10⁻³⁸ to ~3.4×10³⁸
Double   double   64     1      11         52         ~2.2×10⁻³⁰⁸ to ~1.8×10³⁰⁸

Value = (−1)^sign × 2^(exp−bias) × 1.mantissa, where the bias is 127 for single precision and 1023 for double.

Special values: ±Inf, NaN, ±0, and subnormals.


x87 FPU (Legacy)

The original 80-bit floating-point stack from the 8087 coprocessor. Still valid but largely replaced by SSE.

x87 Register Stack

8 registers: ST(0) through ST(7), organized as a stack. ST(0) is the top.

fld   qword [src]    ; push double from memory onto FP stack
fld1                 ; push 1.0
fldz                 ; push 0.0
fldpi                ; push π

fadd  qword [src]    ; ST(0) += mem
fadd  st0, st1       ; ST(0) += ST(1)
fmul  qword [src]    ; ST(0) *= mem
fsub  qword [src]    ; ST(0) -= mem
fdiv  qword [src]    ; ST(0) /= mem

fsqrt               ; ST(0) = √ST(0)
fsin                ; ST(0) = sin(ST(0))  (radians)
fcos                ; ST(0) = cos(ST(0))
fptan               ; ST(1) = tan(ST(0)), ST(0) = 1.0

fstp  qword [dst]    ; pop ST(0), store to memory
fst   qword [dst]    ; store ST(0) without popping

x87 Example: Hypotenuse

; hypotenuse(a, b) = sqrt(a² + b²)
; a in [rbp-8], b in [rbp-16]

fld   qword [rbp - 8]    ; ST(0) = a
fmul  st0, st0           ; ST(0) = a²
fld   qword [rbp - 16]   ; ST(0) = b, ST(1) = a²
fmul  st0, st0           ; ST(0) = b²
faddp st1, st0           ; ST(0) = a² + b² (add-and-pop keeps the stack balanced)
fsqrt                    ; ST(0) = √(a²+b²)
fstp  qword [rbp - 24]   ; pop ST(0), store result — stack is empty again

x87 is largely superseded by SSE2 for scalar float/double. Prefer SSE in new code.


SSE and SSE2 — Scalar Float/Double

SSE (Streaming SIMD Extensions) operates on XMM0–XMM15 (128-bit registers).

For scalar (single value) operations:

Scalar Single-Precision (float — 32-bit)

movss xmm0, [src]         ; load 32-bit float
addss xmm0, xmm1          ; xmm0 += xmm1 (scalar single)
subss xmm0, xmm1
mulss xmm0, xmm1
divss xmm0, xmm1
sqrtss xmm0, xmm1         ; xmm0 = sqrt(xmm1)
movss [dst], xmm0         ; store 32-bit float

Scalar Double-Precision (double — 64-bit)

movsd xmm0, [src]         ; load 64-bit double
addsd xmm0, xmm1          ; xmm0 += xmm1 (scalar double)
subsd xmm0, xmm1
mulsd xmm0, xmm1
divsd xmm0, xmm1
sqrtsd xmm0, xmm1
movsd [dst], xmm0

Floating-Point Comparison (SSE)

ucomisd xmm0, xmm1    ; compare doubles (unordered), sets ZF/PF/CF
; NaN makes the comparison "unordered" and sets ZF=PF=CF=1,
; so test PF first — otherwise je would also be taken for NaN:
jp   .unordered      ; PF=1 → NaN involved
je   .equal          ; ZF=1 → xmm0 == xmm1
jb   .less_than      ; CF=1 → xmm0 < xmm1
ja   .greater_than   ; CF=0 and ZF=0 → xmm0 > xmm1

Type Conversion

; Integer ↔ Float
cvtsi2sd xmm0, rax        ; convert int64 → double
cvtsi2ss xmm0, eax        ; convert int32 → float
cvttsd2si rax, xmm0       ; convert double → int64 (truncate)
cvttss2si eax, xmm0       ; convert float  → int32 (truncate)

; Float ↔ Double
cvtss2sd xmm0, xmm1       ; float → double
cvtsd2ss xmm0, xmm1       ; double → float (precision loss)

SIMD — Packed Operations

SIMD processes multiple values in parallel using one instruction.

XMM Register Layout

An XMM register (128 bits) can hold:

4 × 32-bit floats    [float3 | float2 | float1 | float0]
2 × 64-bit doubles   [double1          | double0        ]
16 × 8-bit integers  [b15|b14|...|b1|b0]
8 × 16-bit integers  [w7|w6|w5|w4|w3|w2|w1|w0]
4 × 32-bit integers  [d3|d2|d1|d0]
2 × 64-bit integers  [q1|q0]

YMM Register Layout (AVX, 256 bits)

8 × 32-bit floats
4 × 64-bit doubles

Packed Arithmetic Naming Convention

[operation][s = scalar / p = packed][s = single / d = double]

 addps  — add, packed, single   (4 floats in XMM)
 addpd  — add, packed, double   (2 doubles in XMM)
 vaddss — add, scalar, single   (AVX-encoded form; note the V prefix)

SSE Packed Float Examples

Load, Add, Store

section .data
    a  dd 1.0, 2.0, 3.0, 4.0    ; 4 floats
    b  dd 5.0, 6.0, 7.0, 8.0

section .text
    movaps xmm0, [a]    ; load 4 aligned floats into xmm0
    movaps xmm1, [b]    ; load 4 aligned floats into xmm1
    addps  xmm0, xmm1   ; xmm0 = {6.0, 8.0, 10.0, 12.0}
    movaps [a],  xmm0   ; store result

Alignment Note

  • MOVAPS / MOVAPD — aligned (16-byte boundary required; faults if misaligned)
  • MOVUPS / MOVUPD — unaligned (works anywhere; on modern CPUs it costs nothing extra when the data happens to be aligned)
  • In .data section, use align 16 before SIMD data:
section .data
    align 16
    vec_a  dd 1.0, 2.0, 3.0, 4.0

Packed Integer Instructions (SSE2)

movdqu xmm0, [src]        ; load 128-bit unaligned (integers)
paddb  xmm0, xmm1         ; add 16 bytes (8-bit each)
paddw  xmm0, xmm1         ; add 8 words (16-bit each)
paddd  xmm0, xmm1         ; add 4 dwords (32-bit each)
paddq  xmm0, xmm1         ; add 2 qwords (64-bit each)
psubb  xmm0, xmm1         ; subtract bytes
pmullw xmm0, xmm1         ; multiply 8 × 16-bit (low result)
pand   xmm0, xmm1         ; bitwise AND
por    xmm0, xmm1         ; bitwise OR
pxor   xmm0, xmm1         ; bitwise XOR

AVX — 256-bit Operations

AVX uses 256-bit YMM registers (8 floats or 4 doubles at once). Prefix V is required.

; Requires: assembling/compiling for AVX (e.g. gcc -mavx) or checking CPUID for AVX support
vmovaps ymm0, [src]       ; load 8 aligned floats
vaddps  ymm0, ymm1, ymm2  ; ymm0 = ymm1 + ymm2 (3-operand, non-destructive)
vmulps  ymm0, ymm1, ymm2  ; ymm0 = ymm1 * ymm2
vdivps  ymm0, ymm1, ymm2
vsqrtps ymm0, ymm1        ; element-wise sqrt
vmovaps [dst], ymm0       ; store

3-operand form: AVX allows dst ≠ src1, unlike SSE which always overwrites an input.


Horizontal Operations

Operating across lanes within a single register:

; Sum all 4 floats in XMM0
haddps xmm0, xmm0    ; {a+b, c+d, a+b, c+d}
haddps xmm0, xmm0    ; {a+b+c+d, a+b+c+d, ...}  → XMM0[0] = sum

Complete Example: Dot Product

; dot_product: sum of element-wise products
; a, b: arrays of 4 floats (16-byte aligned)
; result in xmm0

section .data
    align 16
    vec_a  dd 1.0, 2.0, 3.0, 4.0
    align 16
    vec_b  dd 5.0, 6.0, 7.0, 8.0

section .text
global _start

_start:
    movaps xmm0, [vec_a]    ; xmm0 = {1, 2, 3, 4}
    movaps xmm1, [vec_b]    ; xmm1 = {5, 6, 7, 8}
    mulps  xmm0, xmm1       ; xmm0 = {5, 12, 21, 32}

    ; Horizontal sum: {5+12, 21+32, 5+12, 21+32}
    haddps xmm0, xmm0       ; xmm0 = {17, 53, 17, 53}
    haddps xmm0, xmm0       ; xmm0 = {70, 70, 70, 70}
    ; xmm0[0] = 70 = dot product

    ; Or use SSE4.1 DPPS instruction:
    ; movaps xmm0, [vec_a]
    ; movaps xmm1, [vec_b]
    ; dpps   xmm0, xmm1, 0xFF  ; full dot product in xmm0[0]

    mov rax, 60
    xor rdi, rdi
    syscall

CPUID — Detecting SIMD Support

Before using SSE/AVX, check CPU capability:

; Check for SSE4.1 support
mov  eax, 1
cpuid               ; eax=1: returns feature flags in ecx/edx
test ecx, (1 << 19) ; bit 19 of ECX = SSE4.1
jz   .no_sse41
; SSE4.1 is available
.no_sse41:

Common feature bits (CPUID EAX=1):
  • EDX bit 25: SSE
  • EDX bit 26: SSE2
  • ECX bit 0:  SSE3
  • ECX bit 19: SSE4.1
  • ECX bit 28: AVX


Key Takeaways

Feature      Registers      Width      Typical Use
x87 FPU      ST(0)–ST(7)    80-bit     Legacy; avoid in new code
SSE scalar   XMM0–XMM15     32/64-bit  Single float/double
SSE packed   XMM0–XMM15     128-bit    4 floats / 2 doubles / 16 bytes
AVX packed   YMM0–YMM15     256-bit    8 floats / 4 doubles
AVX-512      ZMM0–ZMM31     512-bit    16 floats / 8 doubles
  • Use SS/SD suffix for scalar SSE ops
  • Use PS/PD suffix for packed SSE ops
  • Data must be 16-byte aligned for MOVAPS; use MOVUPS for unaligned
  • AVX uses 3-operand non-destructive form: vop dst, src1, src2

Next: 15 — Inline Assembly