15 — Inline Assembly¶
Inline assembly lets you embed assembly instructions directly inside C or C++ code. This gives you low-level control without writing a separate .asm file — ideal for performance-critical sections, hardware intrinsics, or accessing instructions unavailable in C.
GCC Extended Inline Assembly Syntax¶
GCC uses the AT&T/GAS syntax for inline assembly by default. The general form:
asm volatile (
"assembly template" // instruction(s), one per line
: output operands // [optional]
: input operands // [optional]
: clobbers // [optional]
);
volatiletells the compiler: do not optimize away or reorder this block- Output operands: C variables that will be written by the asm
- Input operands: C variables that provide values to the asm
- Clobbers: registers/memory that the asm modifies (so compiler doesn't assume they're unchanged)
Basic Syntax¶
Inline NOP¶
Move instruction¶
int x = 5;
asm("movl %0, %%eax" : : "r"(x));
// %0 refers to first operand, %% escapes to a single %
Operand Constraints¶
Operands use constraint strings to tell GCC where to place variables.
Common Constraints¶
| Constraint | Meaning |
|---|---|
r |
Any general-purpose register |
m |
Memory operand |
i |
Immediate integer constant |
n |
Immediate integer (known at compile time) |
a |
RAX/EAX specifically |
b |
RBX/EBX |
c |
RCX/ECX |
d |
RDX/EDX |
S |
RSI/ESI |
D |
RDI/EDI |
0–9 |
Same location as Nth operand |
Modifiers on Output Constraints¶
| Modifier | Meaning |
|---|---|
= |
Write-only (value overwritten) |
+ |
Read-write (value read and modified) |
& |
Early clobber (written before inputs are consumed) |
Practical Examples¶
Add two integers¶
int a = 10, b = 20, result;
asm ("addl %2, %1" // %1 += %2 (AT&T: src, dst)
: "=r"(result) // output: result in any register
: "0"(a), "r"(b) // inputs: %0=a (same reg as result), %1=b
);
// result = 30
Swap two variables¶
Read the timestamp counter (RDTSC)¶
uint64_t read_tsc(void) {
uint32_t lo, hi;
asm volatile ("rdtsc"
: "=a"(lo), "=d"(hi)); // rdtsc: result in EDX:EAX
return ((uint64_t)hi << 32) | lo;
}
CPUID¶
void cpuid(uint32_t leaf, uint32_t *eax, uint32_t *ebx,
uint32_t *ecx, uint32_t *edx) {
asm volatile ("cpuid"
: "=a"(*eax), "=b"(*ebx), "=c"(*ecx), "=d"(*edx)
: "a"(leaf));
}
Bit scan (find lowest set bit)¶
int lowest_set_bit(unsigned int x) {
int pos;
asm ("bsfl %1, %0" // BSF: bit scan forward
: "=r"(pos)
: "r"(x));
return pos;
}
Clobber List¶
Tell GCC which registers your asm destroys (other than the explicit operands):
asm volatile (
"push %%rbx\n\t"
"mov $42, %%rbx\n\t"
"pop %%rbx"
:
:
: "rbx", "memory" // rbx is clobbered; memory may be modified
);
Special clobbers:
| Clobber | Meaning |
|---|---|
"memory" |
Asm may read or write any memory — compiler must reload all cached values |
"cc" |
Asm modifies the flags register (RFLAGS) |
"rax", "rbx", ... |
Specific registers clobbered |
Use "memory" whenever your asm stores to arbitrary memory (not a named output).
Intel Syntax in GCC Inline Asm¶
GCC defaults to AT&T syntax. To use Intel syntax:
Or compile with -masm=intel:
Clang Extended ASM¶
Clang supports the same GCC inline asm syntax. No changes needed for basic usage.
MSVC Inline Assembly (Windows, 32-bit only)¶
MSVC has its own inline asm syntax using __asm:
// MSVC — 32-bit only (not supported in 64-bit MSVC!)
int x = 5;
__asm {
mov eax, x
add eax, 10
mov x, eax
}
// x = 15
MSVC dropped inline asm support for 64-bit targets. Use intrinsics or separate .asm files instead.
Compiler Intrinsics (Preferred Alternative)¶
For SIMD and special instructions, intrinsics are safer than inline asm — they look like function calls but map directly to instructions, and the compiler handles register allocation.
#include <immintrin.h> // SSE/AVX intrinsics
// Add 4 floats in parallel
__m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
__m128 b = _mm_set_ps(8.0f, 7.0f, 6.0f, 5.0f);
__m128 c = _mm_add_ps(a, b); // {12, 10, 8, 6}
| Header | Extension |
|---|---|
<mmintrin.h> |
MMX |
<xmmintrin.h> |
SSE |
<emmintrin.h> |
SSE2 |
<pmmintrin.h> |
SSE3 |
<smmintrin.h> |
SSE4.1 |
<nmmintrin.h> |
SSE4.2 |
<immintrin.h> |
AVX, AVX2, AVX-512 |
Common intrinsics:
// Load / store
__m256 _mm256_load_ps(float const *p); // aligned
__m256 _mm256_loadu_ps(float const *p); // unaligned
void _mm256_store_ps(float *p, __m256 a);
// Arithmetic (8 floats in parallel)
__m256 _mm256_add_ps(__m256 a, __m256 b);
__m256 _mm256_mul_ps(__m256 a, __m256 b);
__m256 _mm256_sqrt_ps(__m256 a);
// Comparison
__m256 _mm256_cmp_ps(__m256 a, __m256 b, int imm8); // returns mask
// Horizontal
__m128 _mm_hadd_ps(__m128 a, __m128 b);
When to Use Inline Assembly vs. Intrinsics¶
| Situation | Use |
|---|---|
| SIMD/vectorization | Intrinsics — compiler handles allocation |
| Specific CPU instructions (CPUID, RDTSC) | Inline asm or intrinsics |
| Atomic operations | Use <stdatomic.h> or <atomic> |
| Syscalls from C | glibc wrappers (or inline asm as last resort) |
| Full control, no C overhead | Separate .asm file |
Complete Example: Fast Absolute Value (SIMD)¶
#include <immintrin.h>
#include <stdio.h>
// Compute absolute value of 8 floats using AVX
void abs_vec8(float *dst, const float *src) {
__m256 v = _mm256_loadu_ps(src);
__m256 mask = _mm256_set1_ps(-0.0f); // sign bit mask
__m256 result = _mm256_andnot_ps(mask, v); // clear sign bit
_mm256_storeu_ps(dst, result);
}
int main(void) {
float src[] = {-1.0f, 2.0f, -3.0f, 4.0f, -5.0f, 6.0f, -7.0f, 8.0f};
float dst[8];
abs_vec8(dst, src);
for (int i = 0; i < 8; i++) printf("%.1f ", dst[i]);
// 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0
}
Build:
Key Takeaways¶
- GCC inline asm format:
asm("instructions" : outputs : inputs : clobbers) - Operands use constraint strings (
r,m,a,b, etc.) to specify placement - Use
volatileto prevent the compiler from removing or reordering your asm - Add
"memory"clobber when asm reads/writes arbitrary memory "cc"clobber when asm modifies flags- Prefer intrinsics over inline asm for SIMD — safer and more portable
- MSVC does not support 64-bit inline asm; use intrinsics or
.asmfiles