Attention Is All You Need
Overview¶
The Transformer is a novel network architecture proposed by researchers at Google that relies entirely on self-attention mechanisms, dispensing with recurrence (RNNs) and convolutions altogether.
Key Achievements¶
- State-of-the-Art Results: Achieved 28.4 BLEU on the WMT 2014 English-to-German task and 41.0 BLEU on the English-to-French task.
- Efficiency: Significant reduction in training time and cost compared to previous models.
- Parallelization: Unlike RNNs, the architecture allows for massive parallelization during training.
Model Architecture¶
The Transformer follows a classic encoder-decoder structure using stacked self-attention and point-wise, fully connected layers.
1. The Encoder¶
- Composed of a stack of identical layers.
- Each layer has two sub-layers:
  - Multi-head self-attention mechanism.
  - Position-wise fully connected feed-forward network.
- Employs residual connections followed by layer normalization around each sub-layer.
2. The Decoder¶
- Also consists of a stack of identical layers.
- Includes a third sub-layer that performs multi-head attention over the encoder output.
- Uses masking in its self-attention layer to ensure that predictions for position \(i\) can depend only on the known outputs at positions less than \(i\).
Core Mechanisms¶
Scaled Dot-Product Attention¶
The attention function can be described as mapping a query \((Q)\) and a set of key \((K)\)-value \((V)\) pairs to an output, where the query, keys, values, and output are all vectors.
The dot products are scaled by \(\frac{1}{\sqrt{d_k}}\) to prevent them from growing too large and pushing the softmax into regions with small gradients.
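Putting the two pieces together, the paper defines the attention output as:

```latex
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V
\]
```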
Multi-Head Attention¶
Instead of a single attention function, the model runs \(h = 8\) attention layers ("heads") in parallel.
- Allows the model to jointly attend to information from different representation subspaces at different positions.
- Each head uses reduced dimensions \((d_k = d_v = 64)\), keeping total computational cost similar to single-head attention.
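In the paper's notation, each head applies attention to its own learned projections of the queries, keys, and values, and the results are concatenated and projected once more:

```latex
\[
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O,
\qquad
\mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V)
\]
```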
Positional Encoding¶
Since there is no recurrence or convolution, the model uses sine and cosine functions of different frequencies to inject information about the relative or absolute position of tokens.
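The specific functions used in the paper, for position \(pos\) and dimension index \(i\), are:

```latex
\[
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right),
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)
\]
```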
Why Self-Attention?¶
The paper identifies three main reasons for choosing self-attention over recurrent or convolutional layers:
- Computational Complexity: Self-attention layers are faster than recurrent layers when the sequence length \(n\) is smaller than the representation dimensionality \(d\).
- Parallelization: Connects all positions with a constant number of sequentially executed operations.
- Long-Range Dependencies: The maximum path length between any two positions is \(O(1)\), making it easier to learn dependencies regardless of distance.
| Layer Type | Complexity per Layer | Sequential Operations | Max Path Length |
|---|---|---|---|
| Self-Attention | O(n² · d) | O(1) | O(1) |
| Recurrent | O(n · d²) | O(n) | O(n) |
| Convolutional | O(k · n · d²) | O(1) | O(logₖ(n)) |
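The crossover between the first two rows can be made concrete with a back-of-the-envelope comparison. The sketch below (an illustration of leading-order operation counts, not a benchmark) uses the base model's \(d = 512\):

```python
# Leading-order per-layer operation counts from the complexity table above
# (constant factors ignored; purely illustrative).
def self_attention_ops(n, d):
    return n * n * d      # O(n^2 * d)

def recurrent_ops(n, d):
    return n * d * d      # O(n * d^2)

d = 512  # d_model in the base Transformer
for n in (100, 2000):
    cheaper = "self-attention" if self_attention_ops(n, d) < recurrent_ops(n, d) else "recurrent"
    print(f"n={n}: {cheaper} layer is cheaper")
```

For typical machine-translation sentence lengths (n well under 512), self-attention wins on raw operation count as well as on sequential depth.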
Training Hardware¶
The base models were trained for 12 hours on 8 NVIDIA P100 GPUs. The big models were trained for 3.5 days.
To implement the Transformer architecture from the paper "Attention Is All You Need", we will use Python and PyTorch. This implementation focuses on the Scaled Dot-Product Attention and Multi-Head Attention mechanisms described in the paper.
1. Scaled Dot-Product Attention¶
The output is computed as a weighted sum of the values \((V)\), where the weight assigned to each value is computed by a compatibility function of the query \((Q)\) with its corresponding key \((K)\).
```python
import torch
import torch.nn as nn
import math

class ScaledDotProductAttention(nn.Module):
    def __init__(self, d_k):
        super().__init__()
        self.scale = math.sqrt(d_k)

    def forward(self, q, k, v, mask=None):
        # Q @ K^T / sqrt(d_k)
        scores = torch.matmul(q, k.transpose(-2, -1)) / self.scale
        if mask is not None:
            # Masking for decoder self-attention to prevent leftward info flow
            scores = scores.masked_fill(mask == 0, -1e9)
        weights = torch.softmax(scores, dim=-1)
        return torch.matmul(weights, v), weights
```
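A quick sanity check helps confirm the shapes. The standalone function below re-implements the same computation (the function name and tensor sizes are illustrative, not from the paper):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # scores: (batch, seq_q, seq_k)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, v), weights

q = torch.randn(2, 5, 64)   # (batch, seq_len, d_k)
k = torch.randn(2, 5, 64)
v = torch.randn(2, 5, 64)
out, w = scaled_dot_product_attention(q, k, v)
print(out.shape, w.shape)   # output matches V's shape; weights are (batch, seq_q, seq_k)
```

Each row of the weight matrix is a softmax distribution over the keys, so it sums to 1.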
2. Multi-Head Attention¶
Instead of one attention function, the model linearly projects into parallel "heads" to attend to information from different representation subspaces.
```python
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        self.h = h
        self.d_k = d_model // h
        # Projection matrices W_Q, W_K, W_V and W_O
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)
        self.attention = ScaledDotProductAttention(self.d_k)

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)
        # 1. Linear projections, then split into h heads
        q = self.w_q(q).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        k = self.w_k(k).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        v = self.w_v(v).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        # 2. Apply Scaled Dot-Product Attention per head
        x, self.weights = self.attention(q, k, v, mask)
        # 3. Concatenate heads and project
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.h * self.d_k)
        return self.w_o(x)
```
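The `view`/`transpose` pair that splits and re-merges heads is easy to get wrong, so it is worth checking in isolation that the round trip is lossless (the tensor sizes here are illustrative):

```python
import torch

d_model, h = 512, 8
d_k = d_model // h                      # 64 per head
x = torch.randn(2, 10, d_model)         # (batch, seq_len, d_model)

# Split: (batch, seq_len, d_model) -> (batch, h, seq_len, d_k)
heads = x.view(2, 10, h, d_k).transpose(1, 2)

# Merge: the inverse reshape used after attention
merged = heads.transpose(1, 2).contiguous().view(2, 10, d_model)
print(heads.shape, torch.equal(merged, x))
```

Because only reshapes are involved, `merged` is bit-for-bit identical to `x`.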
3. Position-wise Feed-Forward Network¶
The Position-Wise Feed-Forward Network (FFN) in a Transformer applies the same two-layer MLP independently to each position.
Given:
- Input tensor \(x \in \mathbb{R}^{n \times d_{\text{model}}}\)
- First weight matrix \(W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}\)
- First bias \(b_1 \in \mathbb{R}^{d_{\text{ff}}}\)
- Second weight matrix \(W_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}\)
- Second bias \(b_2 \in \mathbb{R}^{d_{\text{model}}}\)
Step-by-Step Mathematical Expansion¶
For a single position vector \(x_i \in \mathbb{R}^{d_{\text{model}}}\):
First Linear Transformation¶
\[u_i = x_i W_1 + b_1, \qquad u_i \in \mathbb{R}^{d_{\text{ff}}}\]
Expanded element-wise:
\[u_{i,j} = \sum_{m=1}^{d_{\text{model}}} x_{i,m}\,(W_1)_{m,j} + (b_1)_j\]
ReLU Activation¶
\[h_i = \max(0,\, u_i)\]
Second Linear Transformation¶
\[y_i = h_i W_2 + b_2, \qquad y_i \in \mathbb{R}^{d_{\text{model}}}\]
Expanded element-wise:
\[y_{i,j} = \sum_{m=1}^{d_{\text{ff}}} h_{i,m}\,(W_2)_{m,j} + (b_2)_j\]
Compact Form¶
The full transformation for each position:
\[\text{FFN}(x_i) = \max(0,\; x_i W_1 + b_1)\, W_2 + b_2\]
Or for the full sequence matrix:
\[\text{FFN}(X) = \max(0,\; X W_1 + b_1)\, W_2 + b_2\]
Intuition (Very Important in Transformers)¶
- This operation is position-wise → no interaction between tokens.
- It expands the dimension from \(d_{\text{model}}\) to \(d_{\text{ff}}\) (512 → 2048 in the base model) and projects it back down.
- Acts like a learned non-linear feature transformation.
Each layer in the encoder and decoder contains a fully connected feed-forward network applied to each position separately and identically.
```python
class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        # Two linear transformations with a ReLU activation in between
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.w_2(self.relu(self.w_1(x)))
```
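The "applied to each position separately and identically" property can be verified directly: permuting the positions before the FFN gives the same result as permuting them after. The sketch below uses a standalone `nn.Sequential` with the same shapes as the module above:

```python
import torch
import torch.nn as nn

# Standalone FFN with the same dimensions as PositionWiseFeedForward
ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))

x = torch.randn(1, 10, 512)
perm = torch.randperm(10)

# Position-wise means no interaction between tokens:
# permute-then-apply equals apply-then-permute.
y1 = ffn(x[:, perm])
y2 = ffn(x)[:, perm]
print(torch.allclose(y1, y2, atol=1e-6))
```

Attention, by contrast, would fail this test, since each output position depends on all input positions.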
4. Positional Encoding¶
To utilize sequence order without recurrence, sinusoidal positional encodings are added to input embeddings.
```python
class PositionalEncoding(nn.Module):
    def __init__(self, d_model=512, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        # Geometric progression of frequencies for the sine/cosine pairs
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        # Add positional encoding to the embedding
        return x + self.pe[:, :x.size(1)]
```
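A quick standalone check of the encoding table (recomputing the same `pe` buffer outside the module): at position 0 the sine channels are 0 and the cosine channels are 1, and every entry lies in [-1, 1].

```python
import math
import torch

max_len, d_model = 50, 512
position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)   # even dims: sine
pe[:, 1::2] = torch.cos(position * div_term)   # odd dims: cosine

print(pe[0, 0].item(), pe[0, 1].item())  # position 0: sin(0) = 0, cos(0) = 1
```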
Summary of Parameters (Base Model)¶
As per the Attention Is All You Need paper, these are the default configurations used for the base Transformer architecture:
- \(d_{model}\) (Model Dimension): 512
- \(N\) (Number of Layers): 6
- \(h\) (Number of Heads): 8
- \(d_{ff}\) (Feed-Forward Network Dimension): 2048
- Dropout Rate: 0.1
Calculated Sub-Parameters¶
Beyond the base configuration, the paper also defines dimensions for each individual head to ensure the total dimension remains consistent:
- \(d_{k}\) (Key Dimension): 64
- \(d_{v}\) (Value Dimension): 64
- Formula: \(d_{k} = d_{v} = d_{model} / h\)
5. The Encoder Layer¶
Each encoder layer consists of two sub-layers: Multi-Head Attention and a Feed-Forward Network.
```python
class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, h=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, h)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        # Sub-layer 1: Multi-Head Attention + residual connection
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        # Sub-layer 2: Feed-Forward + residual connection
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x
```
6. The Decoder Layer¶
The decoder adds a third sub-layer to perform attention over the encoder's output.
```python
class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, h=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, h)
        self.encoder_attn = MultiHeadAttention(d_model, h)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_output, src_mask, tgt_mask):
        # Sub-layer 1: Masked self-attention
        x = self.norm1(x + self.dropout(self.self_attn(x, x, x, tgt_mask)))
        # Sub-layer 2: Encoder-decoder attention
        # (queries from the decoder, keys/values from the encoder)
        x = self.norm2(x + self.dropout(self.encoder_attn(x, enc_output, enc_output, src_mask)))
        # Sub-layer 3: Feed-forward
        x = self.norm3(x + self.dropout(self.feed_forward(x)))
        return x
```
Full Model Summary (Base Configuration)¶
When you instantiate these, remember the specific dimensions used in the research:
- \(N = 6\) layers for both the Encoder and Decoder stacks.
- \(d_{model} = 512\) for all sub-layers and embedding layers.
- \(h = 8\) parallel attention heads.
- \(d_{k} = d_{v} = 64\) dimensionality per head (calculated as \(d_{model} / h\)).
- \(d_{ff} = 2048\) inner-layer dimensionality for the Position-wise Feed-Forward Networks (FFN).
7. The Full Transformer Model¶
This implementation follows the architecture where the output of the encoder serves as the "memory" (Keys and Values) for the decoder's multi-head attention sub-layer.
```python
class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model=512, n_layers=6, h=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        # 1. Embeddings and positional encoding
        self.encoder_embedding = nn.Embedding(src_vocab_size, d_model)
        self.decoder_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model)
        # 2. Encoder stack (N=6 identical layers)
        self.encoder_layers = nn.ModuleList([
            EncoderLayer(d_model, h, d_ff, dropout) for _ in range(n_layers)
        ])
        # 3. Decoder stack (N=6 identical layers)
        self.decoder_layers = nn.ModuleList([
            DecoderLayer(d_model, h, d_ff, dropout) for _ in range(n_layers)
        ])
        # 4. Final linear projection to vocabulary logits
        self.fc_out = nn.Linear(d_model, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def encode(self, src, src_mask):
        # Embedding (scaled by sqrt(d_model), as in the paper) + positional encoding
        x = self.encoder_embedding(src) * math.sqrt(self.d_model)
        x = self.dropout(self.positional_encoding(x))
        # Pass through each layer in the stack
        for layer in self.encoder_layers:
            x = layer(x, src_mask)
        return x

    def decode(self, tgt, enc_output, src_mask, tgt_mask):
        x = self.decoder_embedding(tgt) * math.sqrt(self.d_model)
        x = self.dropout(self.positional_encoding(x))
        for layer in self.decoder_layers:
            x = layer(x, enc_output, src_mask, tgt_mask)
        return x

    def forward(self, src, tgt, src_mask, tgt_mask):
        # Step 1: Encode the source sequence
        enc_output = self.encode(src, src_mask)
        # Step 2: Decode into the target representation
        dec_output = self.decode(tgt, enc_output, src_mask, tgt_mask)
        # Step 3: Project to vocabulary size for token prediction
        return self.fc_out(dec_output)
```
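The forward pass expects `src_mask` and `tgt_mask`, but the paper does not prescribe a mask API. The helpers below are one common way to build masks compatible with the `masked_fill(mask == 0, -1e9)` convention used above; the `pad_idx` default and broadcast shapes are assumptions, not from the paper:

```python
import torch

def make_src_mask(src, pad_idx=0):
    # (batch, 1, 1, src_len): hides padding keys, broadcasts over heads and queries
    return (src != pad_idx).unsqueeze(1).unsqueeze(2)

def make_tgt_mask(tgt, pad_idx=0):
    # Padding mask combined with a lower-triangular (causal) mask
    pad_mask = (tgt != pad_idx).unsqueeze(1).unsqueeze(2)   # (batch, 1, 1, tgt_len)
    seq_len = tgt.size(1)
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=tgt.device))
    return pad_mask & causal                                 # (batch, 1, tgt_len, tgt_len)

tgt = torch.tensor([[5, 7, 2, 0]])   # last token is padding
mask = make_tgt_mask(tgt)
print(mask.shape)
```

Row \(i\) of the causal component allows attention only to positions \(\le i\), which is exactly the "leftward information flow" restriction the decoder needs.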
Note
- Weight Sharing: The paper shares the same weight matrix between the two embedding layers and the pre-softmax linear transformation. You can implement this by setting `self.fc_out.weight = self.decoder_embedding.weight`.
- Dimensionality: All sub-layers produce outputs of dimension \(d_{model} = 512\) to facilitate the residual connections.
- Parallelization: Because there is no recurrence, this model processes all tokens in the `src` sequence simultaneously.