Attention Is All You Need

Overview

The Transformer is a novel network architecture proposed by researchers at Google that relies entirely on self-attention mechanisms, dispensing with recurrence (RNNs) and convolutions.

Key Achievements

  • State-of-the-Art Results: Achieved 28.4 BLEU on the WMT 2014 English-to-German and 41.8 BLEU on the English-to-French translation tasks.

  • Efficiency: Significant reduction in training time and costs compared to previous models.

  • Parallelization: Unlike RNNs, the architecture allows for massive parallelization during training.


Model Architecture

The Transformer follows a classic encoder-decoder structure using stacked self-attention and point-wise, fully connected layers.

1. The Encoder

  • Composed of a stack of \(N = 6\) identical layers.

  • Each layer has two sub-layers:

    1. Multi-head self-attention mechanism.
    2. Position-wise fully connected feed-forward network.
  • Employs a residual connection around each of the two sub-layers, followed by layer normalization.

2. The Decoder

  • Also consists of a stack of \(N = 6\) identical layers.

  • Includes a third sub-layer that performs multi-head attention over the encoder output.

  • Uses masking in its self-attention layer to ensure that predictions for position \(i\) can depend only on the known outputs at positions less than \(i\).
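This masking is typically realized as a lower-triangular ("subsequent") mask. A minimal sketch (the helper name `subsequent_mask` is ours, not from the paper):

```python
import torch

def subsequent_mask(size):
    # Lower-triangular matrix: entry (i, j) is True iff j <= i,
    # so position i may attend only to itself and earlier positions.
    return torch.tril(torch.ones(size, size)).bool()

mask = subsequent_mask(4)
# row 0 attends only to position 0; row 3 attends to positions 0..3
```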


Core Mechanisms

Scaled Dot-Product Attention

The attention function can be described as mapping a query \(Q\) and a set of key–value pairs \((K, V)\) to an output, where the query, keys, values, and output are all vectors.

\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]

The dot products are scaled by \(\frac{1}{\sqrt{d_k}}\) to prevent them from growing too large and pushing the softmax into regions with small gradients.
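A quick way to see why the scaling matters: for vector components with zero mean and unit variance, the dot product of two \(d_k\)-dimensional vectors has variance \(d_k\). A small illustrative sketch (the sample sizes are our choice, not from the paper):

```python
import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(10000, d_k)  # components ~ N(0, 1)
k = torch.randn(10000, d_k)

dots = (q * k).sum(dim=-1)    # raw dot products
scaled = dots / d_k ** 0.5    # scaled as in the paper

# dots.std() comes out near sqrt(d_k), while scaled.std() is near 1,
# keeping the softmax inputs in a range with useful gradients.
```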

Multi-Head Attention

Instead of a single attention function, the model runs \(h = 8\) attention layers (heads) in parallel.

  • Allows the model to jointly attend to information from different representation subspaces at different positions.
  • Each head uses reduced dimensions \((d_k = d_v = 64)\), keeping total computational cost similar to single-head attention.

Positional Encoding

Since there is no recurrence or convolution, the model uses sine and cosine functions of different frequencies to inject information about the relative or absolute position of tokens.


Why Self-Attention?

The paper identifies three main reasons for choosing self-attention over recurrent or convolutional layers:

  1. Computational Complexity: Self-attention layers are faster than recurrent layers when the sequence length \(n\) is smaller than the representation dimensionality \(d\).

  2. Parallelization: Connects all positions with a constant number of sequentially executed operations.

  3. Long-Range Dependencies: The maximum path length between any two positions is \(O(1)\), making it easier to learn dependencies regardless of distance.

Layer Type       Complexity per Layer   Sequential Operations   Max Path Length
Self-Attention   O(n² · d)              O(1)                    O(1)
Recurrent        O(n · d²)              O(n)                    O(n)
Convolutional    O(k · n · d²)          O(1)                    O(logₖ(n))

Training Hardware

The base models were trained for 12 hours on 8 NVIDIA P100 GPUs. The big models were trained for 3.5 days.


To implement the Transformer architecture from the paper "Attention Is All You Need", we will use Python and PyTorch. This implementation focuses on the Scaled Dot-Product Attention and Multi-Head Attention mechanisms described in the document.

1. Scaled Dot-Product Attention

The output is computed as a weighted sum of the values \((V)\), where the weight assigned to each value is computed by a compatibility function of the query \((Q)\) with its corresponding key \((K)\).

import torch
import torch.nn as nn
import math

class ScaledDotProductAttention(nn.Module):
    def __init__(self, d_k):
        super().__init__()
        self.scale = math.sqrt(d_k)

    def forward(self, q, k, v, mask=None):
        # Q*K^T / sqrt(d_k)
        scores = torch.matmul(q, k.transpose(-2, -1)) / self.scale

        if mask is not None:
            # Masking for decoder self-attention to prevent leftward info flow
            scores = scores.masked_fill(mask == 0, -1e9) 

        weights = torch.softmax(scores, dim=-1)
        return torch.matmul(weights, v), weights
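As a tiny sanity check of the formula itself, independent of the module above (the values are ours, chosen for illustration): with keys equal to the queries, each query should attend most strongly to its own key.

```python
import math
import torch

# Two orthogonal queries; keys equal the queries; d_k = 4.
q = torch.tensor([[1., 0., 0., 0.],
                  [0., 1., 0., 0.]])
k = q.clone()
v = torch.tensor([[1., 0.],
                  [0., 1.]])

scores = q @ k.T / math.sqrt(q.size(-1))
weights = torch.softmax(scores, dim=-1)
out = weights @ v
# each row of `weights` is a probability distribution over the keys,
# peaked on the matching key
```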

2. Multi-Head Attention

Instead of one attention function, the model linearly projects into parallel "heads" to attend to information from different representation subspaces.

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        self.h = h
        self.d_k = d_model // h 

        # Projection matrices W_Q, W_K, W_V and W_O
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

        self.attention = ScaledDotProductAttention(self.d_k)

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)

        # 1. Linear projections and split into h heads
        q = self.w_q(q).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        k = self.w_k(k).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        v = self.w_v(v).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)

        # 2. Apply Scaled Dot-Product Attention
        x, self.weights = self.attention(q, k, v, mask)

        # 3. Concatenate and project
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.h * self.d_k)
        return self.w_o(x)

3. Position-wise Feed-Forward Network

The Position-Wise Feed-Forward Network (FFN) in a Transformer applies the same two-layer MLP independently to each position.

Given:

  • Input tensor \(x \in \mathbb{R}^{n \times d_{\text{model}}}\)
  • First weight matrix \(W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}\)
  • First bias \(b_1 \in \mathbb{R}^{d_{\text{ff}}}\)
  • Second weight matrix \(W_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}\)
  • Second bias \(b_2 \in \mathbb{R}^{d_{\text{model}}}\)

Step-by-Step Mathematical Expansion

For a single position vector \(x_i \in \mathbb{R}^{d_{\text{model}}}\):

First Linear Transformation

\[ z_i^{(1)} = x_i W_1 + b_1 \]

Expanded element-wise:

\[ z_{i,j}^{(1)} = \sum_{k=1}^{d_{\text{model}}} x_{i,k} W_{1,kj} + b_{1,j} \]

ReLU Activation

\[ a_{i,j}^{(1)} = \text{ReLU}(z_{i,j}^{(1)}) = \max(0, z_{i,j}^{(1)}) \]

Second Linear Transformation

\[ y_i = a_i^{(1)} W_2 + b_2 \]

Expanded element-wise:

\[ y_{i,m} = \sum_{j=1}^{d_{\text{ff}}} a_{i,j}^{(1)} W_{2,jm} + b_{2,m} \]

Compact Form

The full transformation for each position:

\[ \text{FFN}(x_i) = \text{ReLU}(x_i W_1 + b_1) W_2 + b_2 \]

Or for the full sequence matrix:

\[ \text{FFN}(X) = \text{ReLU}(X W_1 + b_1) W_2 + b_2 \]

Intuition (Very Important in Transformers)

  • This operation is position-wise → no interaction between tokens.
  • It expands dimension:
\[ d_{\text{model}} \rightarrow d_{\text{ff}} \rightarrow d_{\text{model}} \]
  • Acts like a learned non-linear feature transformation.
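The position-wise property can be checked directly: running the FFN over the whole sequence gives the same result as running it one position at a time. A small sketch with toy dimensions of our choosing:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff, n = 8, 32, 5
w1, w2 = nn.Linear(d_model, d_ff), nn.Linear(d_ff, d_model)

def ffn(x):
    # the same two-layer MLP, applied to whatever positions x contains
    return w2(torch.relu(w1(x)))

x = torch.randn(n, d_model)
full = ffn(x)                                          # whole sequence at once
per_pos = torch.stack([ffn(x[i]) for i in range(n)])   # one position at a time
# identical results: the FFN never mixes information across positions
```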


Each layer in the encoder and decoder contains a fully connected feed-forward network applied to each position separately and identically.

class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        # Two linear transformations with a ReLU activation in between
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.w_2(self.relu(self.w_1(x)))

4. Positional Encoding

To utilize sequence order without recurrence, sinusoidal positional encodings are added to input embeddings.

class PositionalEncoding(nn.Module):
    def __init__(self, d_model=512, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        # Using frequencies of sine and cosine
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        # Add positional encoding to embedding
        return x + self.pe[:, :x.size(1)]
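To double-check the buffer built above, the sinusoids can be recomputed standalone and spot-checked against the closed form from the paper, \(PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})\) (the particular `pos` and `i` below are arbitrary):

```python
import math
import torch

d_model, max_len = 512, 100
position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

# spot-check one even-indexed entry against the closed form
pos, i = 3, 3
expected = math.sin(pos / 10000 ** (2 * i / d_model))
# pe[pos, 2 * i] matches `expected`, and every entry lies in [-1, 1]
```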

Summary of Parameters (Base Model)

As per the Attention Is All You Need paper, these are the default configurations used for the base Transformer architecture:

  • \(d_{model}\) (Model Dimension): 512
  • \(N\) (Number of Layers): 6
  • \(h\) (Number of Heads): 8
  • \(d_{ff}\) (Feed-Forward Network Dimension): 2048
  • Dropout Rate: 0.1

Calculated Sub-Parameters

Beyond the base configuration, the paper also defines dimensions for each individual head to ensure the total dimension remains consistent:

  • \(d_{k}\) (Key Dimension): 64
  • \(d_{v}\) (Value Dimension): 64
  • Formula: \(d_{k} = d_{v} = d_{model} / h\)

5. The Encoder Layer

Each encoder layer consists of two sub-layers: Multi-Head Attention and a Feed-Forward Network.

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, h=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, h)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        # Sub-layer 1: Multi-Head Attention + Residual Connection
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))

        # Sub-layer 2: Feed Forward + Residual Connection
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x

6. The Decoder Layer

The decoder adds a third sub-layer to perform attention over the encoder's output.

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, h=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, h)
        self.encoder_attn = MultiHeadAttention(d_model, h)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_output, src_mask, tgt_mask):
        # Sub-layer 1: Masked Self-Attention
        x = self.norm1(x + self.dropout(self.self_attn(x, x, x, tgt_mask)))

        # Sub-layer 2: Encoder-Decoder Attention
        # Queries from decoder, Keys/Values from encoder
        x = self.norm2(x + self.dropout(self.encoder_attn(x, enc_output, enc_output, src_mask)))

        # Sub-layer 3: Feed Forward
        x = self.norm3(x + self.dropout(self.feed_forward(x)))
        return x

Full Model Summary (Base Configuration)

When you instantiate these, remember the specific dimensions used in the research:

  • \(N = 6\) layers for both the Encoder and Decoder stacks.
  • \(d_{model} = 512\) for all sub-layers and embedding layers.
  • \(h = 8\) parallel attention heads.
  • \(d_{k} = d_{v} = 64\) dimensionality per head (calculated as \(d_{model} / h\)).
  • \(d_{ff} = 2048\) inner-layer dimensionality for the Position-wise Feed-Forward Networks (FFN).

7. The Full Transformer Model

This implementation follows the architecture where the output of the encoder serves as the "memory" (Keys and Values) for the decoder's multi-head attention sub-layer.

class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model=512, n_layers=6, h=8, d_ff=2048, dropout=0.1):
        super().__init__()

        # 1. Embeddings and Positional Encoding
        self.encoder_embedding = nn.Embedding(src_vocab_size, d_model)
        self.decoder_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model)

        # 2. Encoder Stack (N=6 identical layers)
        self.encoder_layers = nn.ModuleList([
            EncoderLayer(d_model, h, d_ff, dropout) for _ in range(n_layers)
        ])

        # 3. Decoder Stack (N=6 identical layers)
        self.decoder_layers = nn.ModuleList([
            DecoderLayer(d_model, h, d_ff, dropout) for _ in range(n_layers)
        ])

        # 4. Final linear projection to vocabulary logits (softmax is applied in the loss)
        self.fc_out = nn.Linear(d_model, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def encode(self, src, src_mask):
        # Initial embedding + positional encoding
        x = self.dropout(self.positional_encoding(self.encoder_embedding(src)))
        # Pass through each layer in the stack
        for layer in self.encoder_layers:
            x = layer(x, src_mask)
        return x

    def decode(self, tgt, enc_output, src_mask, tgt_mask):
        # Initial embedding + positional encoding
        x = self.dropout(self.positional_encoding(self.decoder_embedding(tgt)))
        # Pass through each layer in the stack
        for layer in self.decoder_layers:
            x = layer(x, enc_output, src_mask, tgt_mask)
        return x

    def forward(self, src, tgt, src_mask, tgt_mask):
        # Step 1: Encode the source sequence
        enc_output = self.encode(src, src_mask)
        # Step 2: Decode into the target representation
        dec_output = self.decode(tgt, enc_output, src_mask, tgt_mask)
        # Step 3: Project to vocabulary size for token prediction
        return self.fc_out(dec_output)

Note

  • Weight Sharing: The paper mentions sharing the same weight matrix between the two embedding layers and the pre-softmax linear transformation. You can implement this by setting self.fc_out.weight = self.decoder_embedding.weight.

  • Dimensionality: Note that all sub-layers produce outputs of dimension \(d_{model} = 512\) to facilitate the residual connections.

  • Parallelization: Because there is no recurrence, this model processes all input tokens in the src sequence simultaneously.
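The weight sharing mentioned in the first note can be sketched in isolation (toy sizes, chosen for illustration): after tying, the embedding matrix and the pre-softmax projection are literally the same parameter tensor.

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 1000
embedding = nn.Embedding(vocab_size, d_model)
projection = nn.Linear(d_model, vocab_size, bias=False)

# Tie the weights: nn.Linear stores its weight as (out_features, in_features),
# which matches the (vocab_size, d_model) embedding matrix exactly.
projection.weight = embedding.weight

# Any update to one is immediately visible through the other.
with torch.no_grad():
    embedding.weight[0, 0] = 42.0
```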