16 — Optimization¶
Writing correct assembly is necessary. Writing fast assembly requires understanding the CPU's internal pipeline, memory hierarchy, and instruction costs. This module covers the mental model and techniques used by compiler backends and performance engineers.
The Performance Model¶
CPUs are far more complex than a simple fetch-decode-execute machine suggests:
- Out-of-order execution — CPU executes instructions in a different order than written if dependencies allow
- Superscalar — multiple execution units running in parallel
- Speculative execution — CPU predicts branches and executes ahead
- Pipeline — multiple stages of multiple instructions overlapped
This means the "cost" of an instruction is not just its own latency — it depends on what surrounds it.
Latency vs. Throughput¶
For every instruction, two numbers matter:
| Metric | Definition |
|---|---|
| Latency | Cycles until result is available for a dependent instruction |
| Throughput | Reciprocal of how many can execute per cycle (how many ports handle it) |
Example (Intel Skylake):
| Instruction | Latency | Throughput |
|---|---|---|
| add r, r | 1 | 0.25 (4/cycle) |
| imul r, r | 3 | 1 |
| div r | 35–90 | 21–74 |
| movsd xmm, mem | 5 | 0.5 |
| addsd xmm, xmm | 4 | 0.5 |
| mulsd xmm, xmm | 4 | 0.5 |
| divsd xmm, xmm | 13–14 | 4–5 |
| sqrtsd xmm, xmm | 15–16 | 7–8 |
Reference: Agner Fog's instruction tables — essential reading.
Critical Path and ILP¶
Instruction-Level Parallelism (ILP): independent instructions can execute simultaneously.
; Sequential dependency chain — SLOW (3 cycles: each add must wait for the previous result)
add rax, rbx ; rax ready in 1 cycle
add rax, rcx ; must wait for rax
add rax, rdx ; must wait for rax
; Parallel chains — FAST (only 2 cycles, limited by throughput)
add rax, rbx ; chain 1
add rcx, rdx ; chain 2 — runs in parallel with chain 1
add rax, rcx ; combine results
Break dependency chains by using separate registers for independent computations.
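The same principle applies from C. A minimal sketch (function names are illustrative): splitting one accumulator into two halves the critical path of a reduction, since the two chains of adds can issue in parallel.

```c
#include <stddef.h>

/* Single accumulator: every add depends on the previous one — one long chain. */
long sum_single(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Two independent accumulators: the adds interleave in the pipeline,
   roughly halving the length of the dependency chain. */
long sum_dual(const long *a, size_t n) {
    long s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n)                 /* odd leftover element */
        s0 += a[i];
    return s0 + s1;
}
```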
Loop Optimizations¶
Unrolling¶
Process multiple elements per iteration to reduce loop overhead (branch + counter update):
; Original loop (1 element per iteration)
.loop:
add rax, [rsi]
add rsi, 8
dec rcx
jnz .loop
; Unrolled (4 elements per iteration)
.loop:
add rax, [rsi]
add rax, [rsi + 8]
add rax, [rsi + 16]
add rax, [rsi + 24]
add rsi, 32
sub rcx, 4
jnz .loop ; assumes the count is a multiple of 4
; Otherwise, handle the remaining 0–3 elements in a scalar tail loop
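The same structure in C (a sketch; the function name is illustrative): an unrolled main body plus a scalar tail that picks up the 0–3 leftover elements.

```c
#include <stddef.h>

/* Sums n elements, 4 per iteration of the main loop;
   the scalar tail handles the remaining 0–3 elements. */
long sum_unrolled(const long *a, size_t n) {
    long s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)                      /* unrolled body */
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; i++)                              /* 0–3 leftovers */
        s += a[i];
    return s;
}
```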
Strength Reduction¶
Replace expensive operations with cheaper equivalents:
; SLOW: multiply in loop
imul rax, rcx, 8 ; every iteration
; FAST: accumulate with add
add rcx, 8 ; stride instead of multiply
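The transformation in C terms (a sketch with illustrative names): replacing a per-iteration `i * 8` with a pointer advanced by 8 bytes each step.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply version: the byte offset i * 8 is recomputed every iteration. */
uint64_t checksum_mul(const uint64_t *base, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += *(const uint64_t *)((const uint8_t *)base + i * 8);
    return sum;
}

/* Strength-reduced version: a running pointer replaces the multiply. */
uint64_t checksum_add(const uint64_t *base, size_t n) {
    uint64_t sum = 0;
    const uint64_t *p = base;
    for (size_t i = 0; i < n; i++, p++)   /* p advances by 8 bytes */
        sum += *p;
    return sum;
}
```

Compilers perform this automatically (it is the classic induction-variable optimization), but the idea matters when writing asm by hand.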
Loop-Invariant Hoisting¶
Move computations that don't change each iteration outside the loop:
; BEFORE: recomputes [base + offset] every iteration
.loop:
lea rbx, [base + offset] ; recomputed every iteration
mov rax, [rbx + rcx*8]
...
; AFTER: compute once before the loop
lea rbx, [base + offset]
.loop:
mov rax, [rbx + rcx*8] ; rbx is constant
...
Branch Prediction¶
Modern CPUs predict which path a branch takes and speculatively execute it. A misprediction costs ~15 cycles (branch flush + pipeline refill).
Techniques to Reduce Mispredictions¶
- Predictable branches — loops are well-predicted (taken repeatedly, then not taken once)
- Avoid data-dependent branches — use CMOVcc or arithmetic tricks instead
- Profile-guided layout — put the hot path in the fall-through direction
; Branch version — may mispredict
cmp rax, 0
jg .positive
neg rax ; rax = abs(rax) if rax < 0
.positive:
; Branchless version — always correct, no prediction needed
mov rbx, rax
sar rbx, 63 ; arithmetic shift: rbx = 0 if positive, -1 if negative
xor rax, rbx
sub rax, rbx ; rax = |rax|
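The same trick expressed in C (a sketch; note that right-shifting a negative signed value is implementation-defined in ISO C, though every mainstream x86-64 compiler implements it as the arithmetic shift used here):

```c
#include <stdint.h>

/* Branchless absolute value, mirroring the sar/xor/sub sequence:
   mask is 0 for non-negative x, all-ones (-1) for negative x. */
int64_t abs_branchless(int64_t x) {
    int64_t mask = x >> 63;        /* arithmetic shift: sign mask */
    return (x ^ mask) - mask;      /* identity if mask == 0, negation if mask == -1 */
}
```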
Memory Access Optimization¶
Cache Alignment¶
section .data
align 64 ; align to cache line (64 bytes on x86-64)
hot_array dq 0, 0, 0, 0, 0, 0, 0, 0 ; fits in one cache line
Minimize Memory Accesses¶
A load that misses every cache level and goes to RAM costs on the order of 200 cycles. Keep hot data in registers.
; BAD: unnecessary reloads
add [counter], 1 ; load + add + store every iteration
; GOOD: accumulate in register
inc rax ; just increment register
; ...later...
mov [counter], rax ; store once
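The C analogue of the pattern above, using a hypothetical global counter. `volatile` forces the load + add + store on every iteration (like `add [counter], 1`), while the local variable lives in a register:

```c
#include <stddef.h>

volatile long counter;   /* volatile: every update is a real memory access */

/* BAD: read-modify-write of memory on every iteration */
void count_in_memory(size_t n) {
    for (size_t i = 0; i < n; i++)
        counter += 1;
}

/* GOOD: accumulate in a register-resident local, store once at the end */
void count_in_register(size_t n) {
    long local = 0;
    for (size_t i = 0; i < n; i++)
        local += 1;
    counter += local;
}
```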
Sequential Access Pattern¶
Hardware prefetchers excel at sequential (strided) access. Avoid random access patterns for large arrays.
; Cache-friendly: sequential
.loop:
mov rax, [rsi] ; access next element
add rsi, 8
dec rcx
jnz .loop
; Cache-unfriendly: random (linked list traversal)
.loop:
mov rsi, [rsi] ; pointer chasing — cache miss every time
dec rcx
jnz .loop
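A C sketch of the two patterns (types and names are illustrative). Both compute the same sum, but in the list version the address of the next node is unknown until the current load completes, so every cache miss pays full memory latency:

```c
#include <stddef.h>

struct node { long value; struct node *next; };

/* Sequential: the hardware prefetcher sees the stride and hides latency. */
long sum_array(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Pointer chasing: each load depends on the previous one completing. */
long sum_list(const struct node *head) {
    long s = 0;
    for (; head; head = head->next)
        s += head->value;
    return s;
}
```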
Prefetching¶
Hint to the CPU to preload data into cache before it's needed:
- prefetcht0 — into L1 cache
- prefetcht1 — into L2 cache
- prefetcht2 — into L3 cache
- prefetchnta — non-temporal (minimize cache pollution, useful for streaming data)
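From C, GCC and Clang expose these hints through `__builtin_prefetch(addr, rw, locality)`, where locality 3 maps to prefetcht0. A sketch that prefetches one cache line ahead of a summation (the distance of 8 elements, one 64-byte line of longs, is a tuning knob, not a universal constant):

```c
#include <stddef.h>

/* Sums an array while prefetching a cache line ahead of the current element. */
long sum_with_prefetch(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&a[i + 8], 0, 3);  /* read access, high locality */
        s += a[i];
    }
    return s;
}
```

On fully sequential access like this the hardware prefetcher usually wins already; software prefetch pays off mainly for irregular but predictable access patterns.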
Instruction Selection¶
Avoid DIV — It's Expensive¶
; SLOW: division
mov rax, x
xor rdx, rdx
mov rbx, 4
div rbx ; ~40 cycles
; FAST: shift for power-of-2 divisor
shr rax, 2 ; rax /= 4 (1 cycle)
; For compile-time constant non-power-of-2 divisors,
; compilers use a multiply-shift trick (magic numbers):
; x / 7 → (x * 0x2492492492492493) >> 64 (approximately)
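The multiply-shift trick can be checked directly in C using a 128-bit product (a sketch; the simple form below is exact for moderate dividends, and compilers emit a small fix-up sequence to cover the full 64-bit range):

```c
#include <stdint.h>

/* Divide by 7 without DIV: multiply by ceil(2^64 / 7) and keep the
   high 64 bits of the 128-bit product. */
uint64_t div7(uint64_t x) {
    const uint64_t magic = 0x2492492492492493ULL;  /* ceil(2^64 / 7) */
    return (uint64_t)(((unsigned __int128)x * magic) >> 64);
}
```

A single mul plus a shift replaces a ~40-cycle div.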
Use LEA for Arithmetic¶
; Multiply rax by 3 (no imul needed)
lea rax, [rax + rax*2] ; rax = rax + 2*rax = 3*rax
; Multiply by 5
lea rax, [rax + rax*4] ; rax = 5*rax
; Multiply by 9
lea rax, [rax + rax*8] ; rax = 9*rax
; Add two values and multiply: (a + b) * 4
lea rax, [rax + rbx]
shl rax, 2
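The same shift-add identities written out in C (illustrative function names); these are exactly the forms compilers lower to LEA:

```c
#include <stdint.h>

uint64_t mul3(uint64_t x)  { return x + (x << 1); }  /* lea rax, [rax + rax*2] */
uint64_t mul9(uint64_t x)  { return x + (x << 3); }  /* lea rax, [rax + rax*8] */
uint64_t sum_times4(uint64_t a, uint64_t b) {
    return (a + b) << 2;                             /* lea + shl */
}
```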
XOR to Zero a Register¶
xor eax, eax ; preferred over "mov eax, 0": shorter encoding, recognized by the CPU as a dependency-breaking zeroing idiom
Writing eax also zeros the upper 32 bits of rax (x86-64 zero-extension).
SIMD for Throughput¶
Vectorize loops to process N elements per instruction:
; Scalar: add 8 float pairs one at a time (8 iterations)
.scalar_loop:
movss xmm0, [rsi] ; load a[i]
addss xmm0, [rdi] ; add b[i]
movss [rdx], xmm0 ; store c[i]
add rsi, 4
add rdi, 4
add rdx, 4
dec rcx
jnz .scalar_loop
; SIMD (AVX): add 8 floats in ONE instruction
vmovups ymm0, [a] ; load 8 floats
vmovups ymm1, [b] ; load 8 floats
vaddps ymm0, ymm0, ymm1 ; add 8 floats simultaneously
vmovups [c], ymm0 ; store 8 floats
8× throughput improvement for this operation.
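Hand-written intrinsics or asm are not always required: written as plain C, the same loop is auto-vectorized into vaddps by gcc/clang at -O2/-O3 when AVX is enabled (e.g. `gcc -O3 -mavx`). A sketch:

```c
#include <stddef.h>

/* Element-wise float add; with AVX enabled, compilers vectorize this
   loop to process 8 floats per vaddps instruction. */
void add_arrays(float *c, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```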
Measuring Performance¶
RDTSC — Cycle Counter¶
#include <stdint.h>

static inline uint64_t rdtsc(void) {
uint32_t lo, hi;
__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
return ((uint64_t)hi << 32) | lo;
}
uint64_t start = rdtsc();
// ... code to measure ...
uint64_t cycles = rdtsc() - start;
// Note: RDTSC is not serializing — pair it with LFENCE (or use RDTSCP)
// to keep the measured code from being reordered around the reads.
Linux perf¶
perf stat ./program # high-level stats (cycles, IPC, cache misses)
perf record ./program # sample execution profile
perf report # view hotspots
Microarchitecture-Specific Counters¶
perf stat -e cache-misses,branch-misses,instructions ./program
Optimization Checklist¶
| # | Technique | Impact |
|---|---|---|
| 1 | Reduce memory accesses — keep data in registers | High |
| 2 | Sequential access patterns — maximize cache hits | High |
| 3 | Break dependency chains — use independent registers | High |
| 4 | Replace DIV with SHR/multiply trick | High |
| 5 | Use SIMD for data-parallel loops | Very High |
| 6 | Branchless code for unpredictable conditions | Medium |
| 7 | Align hot data to cache lines (64 bytes) | Medium |
| 8 | Loop unrolling — reduce loop overhead | Low–Medium |
| 9 | Prefetch future data | Low–Medium |
| 10 | Use LEA for multiply-add | Low |
Further Reading¶
- Agner Fog — Optimizing Assembly: agner.org/optimize — comprehensive guide to x86 optimization
- Intel Intrinsics Guide: software.intel.com/sites/landingpage/IntrinsicsGuide
- uops.info — per-instruction latency/throughput data for all major CPUs
- LLVM-MCA — static analysis tool that simulates the CPU pipeline to estimate throughput and identify bottlenecks
Series Complete¶
Congratulations on completing the Assembly Language Tutorial Series. You now have the knowledge to:
| Level | Skills Acquired |
|---|---|
| Beginner | CPU architecture, number systems, registers, memory, basic instructions |
| Intermediate | Control flow, procedures, system calls, bitwise ops, strings/arrays |
| Advanced | Calling conventions, SIMD, inline asm, performance optimization |
Suggested Next Steps¶
- Read compiler output — compile C with gcc -O2 -S and study the generated asm
- Reverse engineer a binary — use objdump or radare2 on any ELF executable
- Write a toy OS — implement a bootloader in assembly
- Contribute to a project — look for hot paths in open-source software that benefit from hand-written asm
- Study exploit techniques — buffer overflows, ROP chains, and shellcode require deep assembly knowledge