16 — Optimization

Writing correct assembly is necessary. Writing fast assembly requires understanding the CPU's internal pipeline, memory hierarchy, and instruction costs. This module covers the mental model and techniques used by compiler backends and performance engineers.


The Performance Model

CPUs are far more complex than a simple fetch-decode-execute machine suggests:

  • Out-of-order execution — CPU executes instructions in a different order than written if dependencies allow
  • Superscalar — multiple execution units running in parallel
  • Speculative execution — CPU predicts branches and executes ahead
  • Pipeline — multiple stages of multiple instructions overlapped

This means the "cost" of an instruction is not just its own latency — it depends on what surrounds it.


Latency vs. Throughput

For every instruction, two numbers matter:

Metric      Definition
Latency     Cycles until the result is available to a dependent instruction
Throughput  Average cycles per instruction when independent copies run back-to-back (0.25 means four execution ports can each run one per cycle)

Example (Intel Skylake):

Instruction        Latency   Throughput
add r, r           1         0.25 (4/cycle)
imul r, r          3         1
div r              35–90     21–74
movsd xmm, mem     5         0.5
addsd xmm, xmm     4         0.5
mulsd xmm, xmm     4         0.5
divsd xmm, xmm     13–14     4–5
sqrtsd xmm, xmm    15–16     7–8

Reference: Agner Fog's instruction tables — essential reading.


Critical Path and ILP

Instruction-Level Parallelism (ILP): independent instructions can execute simultaneously.

; Sequential dependency chain — SLOW (3 cycles: each add waits for the previous result)
add rax, rbx    ; rax ready after 1 cycle
add rax, rcx    ; must wait for rax
add rax, rdx    ; must wait for rax

; Parallel chains — FAST (only 2 cycles, limited by throughput)
add rax, rbx    ; chain 1
add rcx, rdx    ; chain 2 — runs in parallel with chain 1
add rax, rcx    ; combine results

Break dependency chains by using separate registers for independent computations.
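The same idea in C, using a floating-point reduction where the addsd latency really bites (hypothetical `sum_serial`/`sum_2chain` helpers). Note that compilers will not reassociate floating-point sums like this on their own at default settings, because it changes rounding:

```c
#include <stddef.h>

/* One chain: each addition waits ~4 cycles for the previous sum. */
double sum_serial(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Two independent chains overlap their latencies. */
double sum_2chain(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        s0 += a[i];      /* chain 1 */
        s1 += a[i + 1];  /* chain 2 — independent of chain 1 */
    }
    if (i < n)           /* odd leftover element */
        s0 += a[i];
    return s0 + s1;      /* combine at the end */
}
```

With integer-valued inputs both functions return identical results; on large arrays the two-chain version runs roughly twice as fast until memory bandwidth takes over.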


Loop Optimizations

Unrolling

Process multiple elements per iteration to reduce loop overhead (branch + counter update):

; Original loop (1 element per iteration)
.loop:
    add rax, [rsi]
    add rsi, 8
    dec rcx
    jnz .loop

; Unrolled (4 elements per iteration)
.loop:
    add rax, [rsi]
    add rax, [rsi + 8]
    add rax, [rsi + 16]
    add rax, [rsi + 24]
    add rsi, 32
    sub rcx, 4
    jnz .loop       ; rcx must start as a positive multiple of 4
; Handle the remaining 0–3 elements with a scalar loop afterwards
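A C sketch of the same structure (hypothetical `sum_unrolled` helper), including the remainder handling the asm version leaves as an exercise:

```c
#include <stddef.h>
#include <stdint.h>

/* Unrolled by 4: one branch and one counter update per 4 elements
   instead of per element. The scalar tail handles n % 4 leftovers. */
uint64_t sum_unrolled(const uint64_t *a, size_t n) {
    uint64_t s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; i++)          /* remaining 0–3 elements */
        s += a[i];
    return s;
}
```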

Strength Reduction

Replace expensive operations with cheaper equivalents:

; SLOW: multiply in loop
imul rax, rcx, 8    ; every iteration

; FAST: accumulate with add
add rcx, 8          ; stride instead of multiply
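In C terms, strength reduction turns a scaled index into a running pointer (a hand-written sketch; compilers at -O1 and above do this automatically):

```c
#include <stddef.h>

/* Instead of recomputing base + i*8 each iteration (a multiply),
   keep a pointer and add the stride (an add). */
long sum_strided(const long *base, size_t n) {
    const long *p = base;
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        s += *p;
        p++;            /* strength-reduced: add, not multiply */
    }
    return s;
}
```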

Loop-Invariant Hoisting

Move computations that don't change each iteration outside the loop:

; BEFORE: recomputes [base + offset] every iteration
.loop:
    lea rbx, [base + offset]    ; recomputed every iteration
    mov rax, [rbx + rcx*8]
    ...

; AFTER: compute once before the loop
lea rbx, [base + offset]
.loop:
    mov rax, [rbx + rcx*8]     ; rbx is constant
    ...
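The classic C instance of the same mistake is calling an invariant function inside the loop condition (hypothetical `count_spaces_*` helpers):

```c
#include <stddef.h>
#include <string.h>

/* Before: strlen(s) walks the whole string on every iteration,
   making the loop O(n^2) overall. */
size_t count_spaces_slow(const char *s) {
    size_t count = 0;
    for (size_t i = 0; i < strlen(s); i++)
        if (s[i] == ' ') count++;
    return count;
}

/* After: hoist the invariant call out of the loop — O(n). */
size_t count_spaces_fast(const char *s) {
    size_t count = 0;
    size_t n = strlen(s);       /* computed once */
    for (size_t i = 0; i < n; i++)
        if (s[i] == ' ') count++;
    return count;
}
```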

Branch Prediction

Modern CPUs predict which path a branch takes and speculatively execute it. A misprediction costs ~15 cycles (branch flush + pipeline refill).

Techniques to Reduce Mispredictions

  1. Predictable branches — loops are well-predicted (taken repeatedly, then not taken once)
  2. Avoid data-dependent branches — use CMOVcc or arithmetic tricks instead
  3. Profile-guided layout — put the hot path in the fall-through direction

; Branch version — may mispredict
cmp rax, 0
jg  .positive
neg rax         ; rax = abs(rax) if rax < 0
.positive:

; Branchless version — always correct, no prediction needed
mov  rbx, rax
sar  rbx, 63    ; arithmetic shift: rbx = 0 if positive, -1 if negative
xor  rax, rbx
sub  rax, rbx   ; rax = |rax|
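The same mask trick in C (a sketch; right-shifting a negative signed integer is implementation-defined in C, but arithmetic on all mainstream x86-64 compilers):

```c
#include <stdint.h>

/* Branchless absolute value: mask is 0 for non-negative x and -1
   (all ones) for negative x, so xor+sub conditionally negates
   without a branch. Like neg, this wraps for INT64_MIN. */
int64_t branchless_abs(int64_t x) {
    int64_t mask = x >> 63;     /* arithmetic shift: 0 or -1 */
    return (x ^ mask) - mask;
}
```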

Memory Access Optimization

Cache Alignment

section .data
    align 64      ; align to cache line (64 bytes on x86-64)
    hot_array  dq 0, 0, 0, 0, 0, 0, 0, 0    ; fits in one cache line

Minimize Memory Accesses

A load that misses every cache level and goes to DRAM costs roughly 100–300 cycles (an L1 hit is ~4–5). Keep hot data in registers.

; BAD: unnecessary reloads
add qword [counter], 1    ; load + add + store every iteration

; GOOD: accumulate in register
inc rax               ; just increment register
; ...later...
mov [counter], rax    ; store once
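The C equivalent is to accumulate in a local variable and write the memory location once (a sketch with a hypothetical `counter` global; compilers do this themselves unless the variable is volatile or may alias):

```c
#include <stddef.h>

long counter;                    /* memory-resident global */

/* Keep the running count in a local (i.e., a register in the hot
   loop) and perform a single store at the end. */
void count_matches(const long *a, size_t n, long key) {
    long local = 0;
    for (size_t i = 0; i < n; i++)
        if (a[i] == key)
            local++;
    counter = local;             /* one store, not n */
}
```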

Sequential Access Pattern

Hardware prefetchers excel at sequential (strided) access. Avoid random access patterns for large arrays.

; Cache-friendly: sequential
.loop:
    mov rax, [rsi]       ; access next element
    add rsi, 8
    dec rcx
    jnz .loop

; Cache-unfriendly: random (linked list traversal)
.loop:
    mov rsi, [rsi]       ; pointer chasing — cache miss every time
    dec rcx
    jnz .loop

Prefetching

Hint to the CPU to preload data into cache before it's needed:

prefetcht0 [rsi + 256]    ; prefetch the cache line 256 bytes ahead
; ... process [rsi] ...
add rsi, 8
  • prefetcht0 — into all cache levels (including L1)
  • prefetcht1 — into L2 and lower levels (not L1)
  • prefetcht2 — into L3 and lower levels
  • prefetchnta — non-temporal: minimize cache pollution for streaming data used once
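From C, the same hints are available through the `_mm_prefetch` intrinsic (a sketch, assuming GCC/Clang on x86-64; `_MM_HINT_T0` maps to prefetcht0):

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>   /* _mm_prefetch */

/* Sum with a software prefetch ~4 cache lines (256 bytes) ahead.
   Prefetching past the end of the array is harmless: prefetch
   instructions never fault. Only worth it when the hardware
   prefetcher can't keep up — measure before keeping it. */
uint64_t sum_prefetch(const uint64_t *a, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++) {
        _mm_prefetch((const char *)&a[i] + 256, _MM_HINT_T0);
        s += a[i];
    }
    return s;
}
```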

Instruction Selection

Avoid DIV — It's Expensive

; SLOW: division
mov  rax, x
xor  rdx, rdx
mov  rbx, 4
div  rbx        ; ~40 cycles

; FAST: shift for power-of-2 divisor
shr  rax, 2     ; rax /= 4  (1 cycle)

; For compile-time constant non-power-of-2 divisors,
; compilers use a multiply-shift trick (magic numbers):
; x / 7 → (x * 0x2492492492492493) >> 64  (approximately)
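The multiply-shift trick can be checked directly in C with a 128-bit multiply (a sketch; 0x2492492492492493 is ceil(2^64 / 7), and the simple form below is exact for x up to roughly 2^61 — compilers add a small fix-up sequence to cover the full range):

```c
#include <stdint.h>

/* Division by the constant 7 via multiply-high + shift,
   the pattern compilers emit instead of a div instruction. */
uint64_t div7(uint64_t x) {
    return (uint64_t)(((unsigned __int128)x * 0x2492492492492493ULL) >> 64);
}
```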

Use LEA for Arithmetic

; Multiply rax by 3 (no imul needed)
lea rax, [rax + rax*2]    ; rax = rax + 2*rax = 3*rax

; Multiply by 5
lea rax, [rax + rax*4]    ; rax = 5*rax

; Multiply by 9
lea rax, [rax + rax*8]    ; rax = 9*rax

; Add two values and multiply: (a + b) * 4
lea rax, [rax + rbx]
shl rax, 2
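The lea pattern can be exercised from C with extended inline asm (a sketch assuming GCC/Clang on x86-64; AT&T syntax, so the operand order is reversed):

```c
#include <stdint.h>

/* One lea computes x + 2*x = 3*x without imul.
   (%1,%1,2) is base + index*scale with the same register twice. */
uint64_t times3(uint64_t x) {
    uint64_t r;
    __asm__ ("leaq (%1,%1,2), %0" : "=r"(r) : "r"(x));
    return r;
}
```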

XOR to Zero a Register

xor eax, eax     ; faster than mov rax, 0 (2 bytes vs 7 bytes, breaks dependency)

Writing eax also zeros the upper 32 bits of rax (x86-64 zero-extension).


SIMD for Throughput

Vectorize loops to process N elements per instruction:

; Scalar: add two arrays of 8 floats, one element per iteration
.scalar_loop:
    movss xmm0, [rsi]        ; load a[i]
    addss xmm0, [rdi]        ; add b[i]
    movss [rdx], xmm0        ; store c[i]
    add rsi, 4
    add rdi, 4
    add rdx, 4
    dec rcx
    jnz .scalar_loop

; SIMD (AVX): add 8 floats in ONE instruction
vmovups ymm0, [a]       ; load 8 floats
vmovups ymm1, [b]       ; load 8 floats
vaddps  ymm0, ymm0, ymm1  ; add 8 floats simultaneously
vmovups [c], ymm0       ; store 8 floats

Up to 8× throughput for this operation (in practice often limited by memory bandwidth).
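You rarely need to write the vector instructions by hand: a plain C loop with non-aliasing pointers typically auto-vectorizes (a sketch; GCC/Clang at -O2/-O3 with AVX enabled will usually emit vaddps here):

```c
#include <stddef.h>

/* Elementwise c[i] = a[i] + b[i]. 'restrict' promises the arrays
   don't overlap, which is what lets the compiler vectorize. */
void add_arrays(float *restrict c, const float *restrict a,
                const float *restrict b, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Check the generated asm (`gcc -O3 -S`) to confirm the loop actually vectorized.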


Measuring Performance

RDTSC — Cycle Counter

#include <stdint.h>

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    /* Note: rdtsc is not serializing; for precise measurements,
       pair it with lfence or use rdtscp. */
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

uint64_t start = rdtsc();
// ... code to measure ...
uint64_t cycles = rdtsc() - start;

Linux perf

perf stat ./program          # high-level stats (cycles, IPC, cache misses)
perf record ./program        # sample execution profile
perf report                  # view hotspots
perf stat -e cache-misses,branch-misses,instructions ./program

Microarchitecture-Specific Counters

perf stat -e cycles,instructions,L1-dcache-misses,LLC-load-misses ./program

Optimization Checklist

# Technique Impact
1 Reduce memory accesses — keep data in registers High
2 Sequential access patterns — maximize cache hits High
3 Break dependency chains — use independent registers High
4 Replace DIV with SHR/multiply trick High
5 Use SIMD for data-parallel loops Very High
6 Branchless code for unpredictable conditions Medium
7 Align hot data to cache lines (64 bytes) Medium
8 Loop unrolling — reduce loop overhead Low–Medium
9 Prefetch future data Low–Medium
10 Use LEA for multiply-add Low


Series Complete

Congratulations on completing the Assembly Language Tutorial Series. You now have the knowledge to:

Level         Skills Acquired
Beginner      CPU architecture, number systems, registers, memory, basic instructions
Intermediate  Control flow, procedures, system calls, bitwise ops, strings/arrays
Advanced      Calling conventions, SIMD, inline asm, performance optimization

Suggested Next Steps

  1. Read compiler output — compile C with gcc -O2 -S and study the generated asm
  2. Reverse engineer a binary — use objdump or radare2 on any ELF executable
  3. Write a toy OS — implement a bootloader in assembly
  4. Contribute to a project — look for hot paths in open-source software that benefit from hand-written asm
  5. Study exploit techniques — buffer overflows, ROP chains, and shellcode require deep assembly knowledge
