
14 — Floating Point and SIMD

Modern x86-64 processors support two floating-point systems (x87 FPU and SSE/AVX) and SIMD (Single Instruction, Multiple Data) execution. SIMD lets one instruction operate on multiple values simultaneously — critical for audio, video, scientific, and ML workloads.


Floating-Point Representations

IEEE 754 (the standard)

Type     C Type   Bits   Sign   Exponent   Mantissa   Range
Single   float    32     1      8          23         ~1.2×10⁻³⁸ to ~3.4×10³⁸
Double   double   64     1      11         52         ~2.2×10⁻³⁰⁸ to ~1.8×10³⁰⁸

Value = (−1)^sign × 2^(exp−bias) × 1.mantissa, where the bias is 127 for single precision and 1023 for double.

Special values: ±Inf, NaN, ±0, and subnormals.


x87 FPU (Legacy)

The original 80-bit floating-point stack from the 8087 coprocessor. Still valid but largely replaced by SSE.

x87 Register Stack

8 registers: ST(0) through ST(7), organized as a stack. ST(0) is the top.

fld   qword [src]    ; push double from memory onto FP stack
fld1                 ; push 1.0
fldz                 ; push 0.0
fldpi                ; push π

fadd  qword [src]    ; ST(0) += mem
fadd  st0, st1       ; ST(0) += ST(1)
fmul  qword [src]    ; ST(0) *= mem
fsub  qword [src]    ; ST(0) -= mem
fdiv  qword [src]    ; ST(0) /= mem

fsqrt               ; ST(0) = √ST(0)
fsin                ; ST(0) = sin(ST(0))  (radians)
fcos                ; ST(0) = cos(ST(0))
fptan               ; ST(1) = tan(ST(0)), ST(0) = 1.0

fstp  qword [dst]    ; pop ST(0), store to memory
fst   qword [dst]    ; store ST(0) without popping

x87 Example: Hypotenuse

; hypotenuse(a, b) = sqrt(a² + b²)
; a in [rbp-8], b in [rbp-16]

fld   qword [rbp - 8]    ; ST(0) = a
fmul  st0, st0           ; ST(0) = a²
fld   qword [rbp - 16]   ; ST(0) = b, ST(1) = a²
fmul  st0, st0           ; ST(0) = b²
faddp st1, st0           ; ST(0) = a² + b² (add-and-pop keeps the stack balanced)
fsqrt                    ; ST(0) = √(a²+b²)
fstp  qword [rbp - 24]   ; pop ST(0), store result — stack is empty again

x87 is largely superseded by SSE2 for scalar float/double. Prefer SSE in new code.


SSE and SSE2 — Scalar Float/Double

SSE (Streaming SIMD Extensions) operates on XMM0–XMM15 (128-bit registers).

For scalar (single value) operations:

Scalar Single-Precision (float — 32-bit)

movss xmm0, [src]         ; load 32-bit float
addss xmm0, xmm1          ; xmm0 += xmm1 (scalar single)
subss xmm0, xmm1
mulss xmm0, xmm1
divss xmm0, xmm1
sqrtss xmm0, xmm1         ; xmm0 = sqrt(xmm1)
movss [dst], xmm0         ; store 32-bit float

Scalar Double-Precision (double — 64-bit)

movsd xmm0, [src]         ; load 64-bit double
addsd xmm0, xmm1          ; xmm0 += xmm1 (scalar double)
subsd xmm0, xmm1
mulsd xmm0, xmm1
divsd xmm0, xmm1
sqrtsd xmm0, xmm1
movsd [dst], xmm0

Floating-Point Comparison (SSE)

ucomisd xmm0, xmm1    ; compare doubles (unordered), sets ZF/PF/CF
; NaN makes the comparison "unordered" and sets ZF=PF=CF=1,
; so test PF first — otherwise je would also be taken for NaN:
jp   .unordered      ; PF=1 → NaN involved
je   .equal          ; ZF=1 → xmm0 == xmm1
jb   .less_than      ; CF=1 → xmm0 < xmm1
ja   .greater_than   ; CF=0 and ZF=0 → xmm0 > xmm1

Type Conversion

; Integer ↔ Float
cvtsi2sd xmm0, rax        ; convert int64 → double
cvtsi2ss xmm0, eax        ; convert int32 → float
cvttsd2si rax, xmm0       ; convert double → int64 (truncate)
cvttss2si eax, xmm0       ; convert float  → int32 (truncate)

; Float ↔ Double
cvtss2sd xmm0, xmm1       ; float → double
cvtsd2ss xmm0, xmm1       ; double → float (precision loss)

SIMD — Packed Operations

SIMD processes multiple values in parallel using one instruction.

XMM Register Layout

An XMM register (128 bits) can hold:

4 × 32-bit floats    [float3 | float2 | float1 | float0]
2 × 64-bit doubles   [double1          | double0        ]
16 × 8-bit integers  [b15|b14|...|b1|b0]
8 × 16-bit integers  [w7|w6|w5|w4|w3|w2|w1|w0]
4 × 32-bit integers  [d3|d2|d1|d0]
2 × 64-bit integers  [q1|q0]

YMM Register Layout (AVX, 256 bits)

8 × 32-bit floats
4 × 64-bit doubles

Packed Arithmetic Naming Convention

[operation][s = scalar / p = packed][s = single / d = double]

 addps  — add, packed, single   (4 floats in XMM)
 addpd  — add, packed, double   (2 doubles in XMM)
 vaddss — add, scalar, single   (AVX-encoded form; note the V prefix)

SSE Packed Float Examples

Load, Add, Store

section .data
    a  dd 1.0, 2.0, 3.0, 4.0    ; 4 floats
    b  dd 5.0, 6.0, 7.0, 8.0

section .text
    movaps xmm0, [a]    ; load 4 aligned floats into xmm0
    movaps xmm1, [b]    ; load 4 aligned floats into xmm1
    addps  xmm0, xmm1   ; xmm0 = {6.0, 8.0, 10.0, 12.0}
    movaps [a],  xmm0   ; store result

Alignment Note

  • MOVAPS / MOVAPD — aligned (16-byte boundary required; faults if misaligned)
  • MOVUPS / MOVUPD — unaligned (works anywhere; on modern CPUs it costs nothing extra when the data happens to be aligned)
  • In .data section, use align 16 before SIMD data:
section .data
    align 16
    vec_a  dd 1.0, 2.0, 3.0, 4.0

Packed Integer Instructions (SSE2)

movdqu xmm0, [src]        ; load 128-bit unaligned (integers)
paddb  xmm0, xmm1         ; add 16 bytes (8-bit each)
paddw  xmm0, xmm1         ; add 8 words (16-bit each)
paddd  xmm0, xmm1         ; add 4 dwords (32-bit each)
paddq  xmm0, xmm1         ; add 2 qwords (64-bit each)
psubb  xmm0, xmm1         ; subtract bytes
pmullw xmm0, xmm1         ; multiply 8 × 16-bit (low result)
pand   xmm0, xmm1         ; bitwise AND
por    xmm0, xmm1         ; bitwise OR
pxor   xmm0, xmm1         ; bitwise XOR

AVX — 256-bit Operations

AVX uses 256-bit YMM registers (8 floats or 4 doubles at once). Prefix V is required.

; Requires: assembling/compiling for AVX (e.g. gcc -mavx) or checking CPUID for AVX support
vmovaps ymm0, [src]       ; load 8 aligned floats
vaddps  ymm0, ymm1, ymm2  ; ymm0 = ymm1 + ymm2 (3-operand, non-destructive)
vmulps  ymm0, ymm1, ymm2  ; ymm0 = ymm1 * ymm2
vdivps  ymm0, ymm1, ymm2
vsqrtps ymm0, ymm1        ; element-wise sqrt
vmovaps [dst], ymm0       ; store

3-operand form: AVX allows dst ≠ src1, unlike SSE which always overwrites an input.


Horizontal Operations

Operating across lanes within a single register:

; Sum all 4 floats in XMM0
haddps xmm0, xmm0    ; {a+b, c+d, a+b, c+d}
haddps xmm0, xmm0    ; {a+b+c+d, a+b+c+d, ...}  → XMM0[0] = sum

Complete Example: Dot Product

; dot_product: sum of element-wise products
; a, b: arrays of 4 floats (16-byte aligned)
; result in xmm0

section .data
    align 16
    vec_a  dd 1.0, 2.0, 3.0, 4.0
    align 16
    vec_b  dd 5.0, 6.0, 7.0, 8.0

section .text
global _start

_start:
    movaps xmm0, [vec_a]    ; xmm0 = {1, 2, 3, 4}
    movaps xmm1, [vec_b]    ; xmm1 = {5, 6, 7, 8}
    mulps  xmm0, xmm1       ; xmm0 = {5, 12, 21, 32}

    ; Horizontal sum: {5+12, 21+32, 5+12, 21+32}
    haddps xmm0, xmm0       ; xmm0 = {17, 53, 17, 53}
    haddps xmm0, xmm0       ; xmm0 = {70, 70, 70, 70}
    ; xmm0[0] = 70 = dot product

    ; Or use SSE4.1 DPPS instruction:
    ; movaps xmm0, [vec_a]
    ; movaps xmm1, [vec_b]
    ; dpps   xmm0, xmm1, 0xFF  ; full dot product in xmm0[0]

    mov rax, 60
    xor rdi, rdi
    syscall

CPUID — Detecting SIMD Support

Before using SSE/AVX, check CPU capability:

; Check for SSE4.1 support
mov  eax, 1
cpuid               ; eax=1: returns feature flags in ecx/edx
test ecx, (1 << 19) ; bit 19 of ECX = SSE4.1
jz   .no_sse41
; SSE4.1 is available
.no_sse41:

Common feature bits (CPUID EAX=1):
  • EDX bit 25: SSE
  • EDX bit 26: SSE2
  • ECX bit 0:  SSE3
  • ECX bit 19: SSE4.1
  • ECX bit 28: AVX


Key Takeaways

Feature      Registers      Width      Typical Use
x87 FPU      ST(0)–ST(7)    80-bit     Legacy; avoid in new code
SSE scalar   XMM0–XMM15     32/64-bit  Single float/double
SSE packed   XMM0–XMM15     128-bit    4 floats / 2 doubles / 16 bytes
AVX packed   YMM0–YMM15     256-bit    8 floats / 4 doubles
AVX-512      ZMM0–ZMM31     512-bit    16 floats / 8 doubles
  • Use SS/SD suffix for scalar SSE ops
  • Use PS/PD suffix for packed SSE ops
  • Data must be 16-byte aligned for MOVAPS; use MOVUPS for unaligned
  • AVX uses 3-operand non-destructive form: vop dst, src1, src2

Next: 15 — Inline Assembly