16 — Optimization

Writing correct assembly is necessary. Writing fast assembly requires understanding the CPU's internal pipeline, memory hierarchy, and instruction costs. This module covers the mental model and techniques used by compiler backends and performance engineers.


The Performance Model

CPUs are far more complex than a simple fetch-decode-execute machine suggests:

  • Out-of-order execution — CPU executes instructions in a different order than written if dependencies allow
  • Superscalar — multiple execution units running in parallel
  • Speculative execution — CPU predicts branches and executes ahead
  • Pipeline — multiple stages of multiple instructions overlapped

This means the "cost" of an instruction is not just its own latency — it depends on what surrounds it.


Latency vs. Throughput

For every instruction, two numbers matter:

Metric      Definition
Latency     Cycles until the result is available to a dependent instruction
Throughput  Average cycles per instruction when independent copies run back-to-back (0.25 means four execution ports can each run one per cycle)

Example (Intel Skylake):

Instruction        Latency   Throughput
add r, r           1         0.25 (4/cycle)
imul r, r          3         1
div r              35–90     21–74
movsd xmm, mem     5         0.5
addsd xmm, xmm     4         0.5
mulsd xmm, xmm     4         0.5
divsd xmm, xmm     13–14     4–5
sqrtsd xmm, xmm    15–16     7–8

Reference: Agner Fog's instruction tables — essential reading.


Critical Path and ILP

Instruction-Level Parallelism (ILP): independent instructions can execute simultaneously.

; Sequential dependency chain — SLOW (3 cycles: each add waits for the previous result)
add rax, rbx    ; rax ready after 1 cycle
add rax, rcx    ; must wait for rax
add rax, rdx    ; must wait for rax

; Parallel chains — FAST (only 2 cycles, limited by throughput)
add rax, rbx    ; chain 1
add rcx, rdx    ; chain 2 — runs in parallel with chain 1
add rax, rcx    ; combine results

Break dependency chains by using separate registers for independent computations.
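The same idea in C, using a floating-point reduction where the addsd latency really bites (hypothetical `sum_serial`/`sum_2chain` helpers). Note that compilers will not reassociate floating-point sums like this on their own at default settings, because it changes rounding:

```c
#include <stddef.h>

/* One chain: each addition waits ~4 cycles for the previous sum. */
double sum_serial(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Two independent chains overlap their latencies. */
double sum_2chain(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        s0 += a[i];      /* chain 1 */
        s1 += a[i + 1];  /* chain 2 — independent of chain 1 */
    }
    if (i < n)           /* odd leftover element */
        s0 += a[i];
    return s0 + s1;      /* combine at the end */
}
```

With integer-valued inputs both functions return identical results; on large arrays the two-chain version runs roughly twice as fast until memory bandwidth takes over.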


Loop Optimizations

Unrolling

Process multiple elements per iteration to reduce loop overhead (branch + counter update):

; Original loop (1 element per iteration)
.loop:
    add rax, [rsi]
    add rsi, 8
    dec rcx
    jnz .loop

; Unrolled (4 elements per iteration)
.loop:
    add rax, [rsi]
    add rax, [rsi + 8]
    add rax, [rsi + 16]
    add rax, [rsi + 24]
    add rsi, 32
    sub rcx, 4
    jnz .loop       ; rcx must start as a positive multiple of 4
; Handle the remaining 0–3 elements with a scalar loop afterwards
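A C sketch of the same structure (hypothetical `sum_unrolled` helper), including the remainder handling the asm version leaves as an exercise:

```c
#include <stddef.h>
#include <stdint.h>

/* Unrolled by 4: one branch and one counter update per 4 elements
   instead of per element. The scalar tail handles n % 4 leftovers. */
uint64_t sum_unrolled(const uint64_t *a, size_t n) {
    uint64_t s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; i++)          /* remaining 0–3 elements */
        s += a[i];
    return s;
}
```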

Strength Reduction

Replace expensive operations with cheaper equivalents:

; SLOW: multiply in loop
imul rax, rcx, 8    ; every iteration

; FAST: accumulate with add
add rcx, 8          ; stride instead of multiply
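In C terms, strength reduction turns a scaled index into a running pointer (a hand-written sketch; compilers at -O1 and above do this automatically):

```c
#include <stddef.h>

/* Instead of recomputing base + i*8 each iteration (a multiply),
   keep a pointer and add the stride (an add). */
long sum_strided(const long *base, size_t n) {
    const long *p = base;
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        s += *p;
        p++;            /* strength-reduced: add, not multiply */
    }
    return s;
}
```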

Loop-Invariant Hoisting

Move computations that don't change each iteration outside the loop:

; BEFORE: recomputes [base + offset] every iteration
.loop:
    lea rbx, [base + offset]    ; recomputed every iteration
    mov rax, [rbx + rcx*8]
    ...

; AFTER: compute once before the loop
lea rbx, [base + offset]
.loop:
    mov rax, [rbx + rcx*8]     ; rbx is constant
    ...
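The classic C instance of the same mistake is calling an invariant function inside the loop condition (hypothetical `count_spaces_*` helpers):

```c
#include <stddef.h>
#include <string.h>

/* Before: strlen(s) walks the whole string on every iteration,
   making the loop O(n^2) overall. */
size_t count_spaces_slow(const char *s) {
    size_t count = 0;
    for (size_t i = 0; i < strlen(s); i++)
        if (s[i] == ' ') count++;
    return count;
}

/* After: hoist the invariant call out of the loop — O(n). */
size_t count_spaces_fast(const char *s) {
    size_t count = 0;
    size_t n = strlen(s);       /* computed once */
    for (size_t i = 0; i < n; i++)
        if (s[i] == ' ') count++;
    return count;
}
```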

Branch Prediction

Modern CPUs predict which path a branch takes and speculatively execute it. A misprediction costs ~15 cycles (branch flush + pipeline refill).

Techniques to Reduce Mispredictions

  1. Predictable branches — loops are well-predicted (taken repeatedly, then not taken once)
  2. Avoid data-dependent branches — use CMOVcc or arithmetic tricks instead
  3. Profile-guided layout — put the hot path in the fall-through direction

; Branch version — may mispredict
cmp rax, 0
jg  .positive
neg rax         ; rax = abs(rax) if rax < 0
.positive:

; Branchless version — always correct, no prediction needed
mov  rbx, rax
sar  rbx, 63    ; arithmetic shift: rbx = 0 if positive, -1 if negative
xor  rax, rbx
sub  rax, rbx   ; rax = |rax|
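The same mask trick in C (a sketch; right-shifting a negative signed integer is implementation-defined in C, but arithmetic on all mainstream x86-64 compilers):

```c
#include <stdint.h>

/* Branchless absolute value: mask is 0 for non-negative x and -1
   (all ones) for negative x, so xor+sub conditionally negates
   without a branch. Like neg, this wraps for INT64_MIN. */
int64_t branchless_abs(int64_t x) {
    int64_t mask = x >> 63;     /* arithmetic shift: 0 or -1 */
    return (x ^ mask) - mask;
}
```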

Memory Access Optimization

Cache Alignment

section .data
    align 64      ; align to cache line (64 bytes on x86-64)
    hot_array  dq 0, 0, 0, 0, 0, 0, 0, 0    ; fits in one cache line

Minimize Memory Accesses

A load that misses every cache level and goes to DRAM costs roughly 100–300 cycles (an L1 hit is ~4–5). Keep hot data in registers.

; BAD: unnecessary reloads
add qword [counter], 1    ; load + add + store every iteration

; GOOD: accumulate in register
inc rax               ; just increment register
; ...later...
mov [counter], rax    ; store once
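The C equivalent is to accumulate in a local variable and write the memory location once (a sketch with a hypothetical `counter` global; compilers do this themselves unless the variable is volatile or may alias):

```c
#include <stddef.h>

long counter;                    /* memory-resident global */

/* Keep the running count in a local (i.e., a register in the hot
   loop) and perform a single store at the end. */
void count_matches(const long *a, size_t n, long key) {
    long local = 0;
    for (size_t i = 0; i < n; i++)
        if (a[i] == key)
            local++;
    counter = local;             /* one store, not n */
}
```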

Sequential Access Pattern

Hardware prefetchers excel at sequential (strided) access. Avoid random access patterns for large arrays.

; Cache-friendly: sequential
.loop:
    mov rax, [rsi]       ; access next element
    add rsi, 8
    dec rcx
    jnz .loop

; Cache-unfriendly: random (linked list traversal)
.loop:
    mov rsi, [rsi]       ; pointer chasing — cache miss every time
    dec rcx
    jnz .loop

Prefetching

Hint to the CPU to preload data into cache before it's needed:

prefetcht0 [rsi + 256]    ; prefetch the cache line 256 bytes ahead
; ... process [rsi] ...
add rsi, 8
  • prefetcht0 — into all cache levels (including L1)
  • prefetcht1 — into L2 and lower levels (not L1)
  • prefetcht2 — into L3 and lower levels
  • prefetchnta — non-temporal: minimize cache pollution for streaming data used once
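From C, the same hints are available through the `_mm_prefetch` intrinsic (a sketch, assuming GCC/Clang on x86-64; `_MM_HINT_T0` maps to prefetcht0):

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>   /* _mm_prefetch */

/* Sum with a software prefetch ~4 cache lines (256 bytes) ahead.
   Prefetching past the end of the array is harmless: prefetch
   instructions never fault. Only worth it when the hardware
   prefetcher can't keep up — measure before keeping it. */
uint64_t sum_prefetch(const uint64_t *a, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++) {
        _mm_prefetch((const char *)&a[i] + 256, _MM_HINT_T0);
        s += a[i];
    }
    return s;
}
```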

Instruction Selection

Avoid DIV — It's Expensive

; SLOW: division
mov  rax, x
xor  rdx, rdx
mov  rbx, 4
div  rbx        ; ~40 cycles

; FAST: shift for power-of-2 divisor
shr  rax, 2     ; rax /= 4  (1 cycle)

; For compile-time constant non-power-of-2 divisors,
; compilers use a multiply-shift trick (magic numbers):
; x / 7 → (x * 0x2492492492492493) >> 64  (approximately)
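The multiply-shift trick can be checked directly in C with a 128-bit multiply (a sketch; 0x2492492492492493 is ceil(2^64 / 7), and the simple form below is exact for x up to roughly 2^61 — compilers add a small fix-up sequence to cover the full range):

```c
#include <stdint.h>

/* Division by the constant 7 via multiply-high + shift,
   the pattern compilers emit instead of a div instruction. */
uint64_t div7(uint64_t x) {
    return (uint64_t)(((unsigned __int128)x * 0x2492492492492493ULL) >> 64);
}
```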

Use LEA for Arithmetic

; Multiply rax by 3 (no imul needed)
lea rax, [rax + rax*2]    ; rax = rax + 2*rax = 3*rax

; Multiply by 5
lea rax, [rax + rax*4]    ; rax = 5*rax

; Multiply by 9
lea rax, [rax + rax*8]    ; rax = 9*rax

; Add two values and multiply: (a + b) * 4
lea rax, [rax + rbx]
shl rax, 2
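The lea pattern can be exercised from C with extended inline asm (a sketch assuming GCC/Clang on x86-64; AT&T syntax, so the operand order is reversed):

```c
#include <stdint.h>

/* One lea computes x + 2*x = 3*x without imul.
   (%1,%1,2) is base + index*scale with the same register twice. */
uint64_t times3(uint64_t x) {
    uint64_t r;
    __asm__ ("leaq (%1,%1,2), %0" : "=r"(r) : "r"(x));
    return r;
}
```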

XOR to Zero a Register

xor eax, eax     ; faster than mov rax, 0 (2 bytes vs 7 bytes, breaks dependency)

Writing eax also zeros the upper 32 bits of rax (x86-64 zero-extension).


SIMD for Throughput

Vectorize loops to process N elements per instruction:

; Scalar: add two arrays of 8 floats, one element per iteration
.scalar_loop:
    movss xmm0, [rsi]        ; load a[i]
    addss xmm0, [rdi]        ; add b[i]
    movss [rdx], xmm0        ; store c[i]
    add rsi, 4
    add rdi, 4
    add rdx, 4
    dec rcx
    jnz .scalar_loop

; SIMD (AVX): add 8 floats in ONE instruction
vmovups ymm0, [a]       ; load 8 floats
vmovups ymm1, [b]       ; load 8 floats
vaddps  ymm0, ymm0, ymm1  ; add 8 floats simultaneously
vmovups [c], ymm0       ; store 8 floats

Up to 8× throughput for this operation (in practice often limited by memory bandwidth).
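You rarely need to write the vector instructions by hand: a plain C loop with non-aliasing pointers typically auto-vectorizes (a sketch; GCC/Clang at -O2/-O3 with AVX enabled will usually emit vaddps here):

```c
#include <stddef.h>

/* Elementwise c[i] = a[i] + b[i]. 'restrict' promises the arrays
   don't overlap, which is what lets the compiler vectorize. */
void add_arrays(float *restrict c, const float *restrict a,
                const float *restrict b, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Check the generated asm (`gcc -O3 -S`) to confirm the loop actually vectorized.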


Measuring Performance

RDTSC — Cycle Counter

#include <stdint.h>

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    /* Note: rdtsc is not serializing; for precise measurements,
       pair it with lfence or use rdtscp. */
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

uint64_t start = rdtsc();
// ... code to measure ...
uint64_t cycles = rdtsc() - start;

Linux perf

perf stat ./program          # high-level stats (cycles, IPC, cache misses)
perf record ./program        # sample execution profile
perf report                  # view hotspots
perf stat -e cache-misses,branch-misses,instructions ./program

Microarchitecture-Specific Counters

perf stat -e cycles,instructions,L1-dcache-misses,LLC-load-misses ./program

Optimization Checklist

# Technique Impact
1 Reduce memory accesses — keep data in registers High
2 Sequential access patterns — maximize cache hits High
3 Break dependency chains — use independent registers High
4 Replace DIV with SHR/multiply trick High
5 Use SIMD for data-parallel loops Very High
6 Branchless code for unpredictable conditions Medium
7 Align hot data to cache lines (64 bytes) Medium
8 Loop unrolling — reduce loop overhead Low–Medium
9 Prefetch future data Low–Medium
10 Use LEA for multiply-add Low


Series Complete

Congratulations on completing the Assembly Language Tutorial Series. You now have the knowledge to:

Level         Skills Acquired
Beginner      CPU architecture, number systems, registers, memory, basic instructions
Intermediate  Control flow, procedures, system calls, bitwise ops, strings/arrays
Advanced      Calling conventions, SIMD, inline asm, performance optimization

Suggested Next Steps

  1. Read compiler output — compile C with gcc -O2 -S and study the generated asm
  2. Reverse engineer a binary — use objdump or radare2 on any ELF executable
  3. Write a toy OS — implement a bootloader in assembly
  4. Contribute to a project — look for hot paths in open-source software that benefit from hand-written asm
  5. Study exploit techniques — buffer overflows, ROP chains, and shellcode require deep assembly knowledge
