# Dropout: A Simple Way to Prevent Neural Networks from Overfitting
Srivastava et al., Journal of Machine Learning Research, 15(56):1929–1958 (2014)
## Introduction
Deep neural nets with many parameters often overfit when training data is limited. With dropout, each training presentation randomly drops (i.e., sets to zero) a subset of neurons (hidden units) independently, thereby training many "thinned" networks that share weights. Dropout approximates model averaging over an exponential number of sub-networks while keeping training computation reasonable.
## Mathematical Description of Dropout
### Feedforward with Dropout
Consider a standard feedforward neural network with \(L\) hidden layers. Let:
- \(y^{(l)}\) be the vector of neuron outputs at layer \(l\)
- \(W^{(l)}\) and \(b^{(l)}\) be weights and biases at layer \(l\)
- \(f(\cdot)\) be an activation function (e.g., sigmoid, ReLU)
The standard feedforward operation for unit \(i\) in layer \(l + 1\) is:

$$ z_i^{(l+1)} = \mathbf{w}_i^{(l+1)} y^{(l)} + b_i^{(l+1)}, \qquad y_i^{(l+1)} = f\big(z_i^{(l+1)}\big) $$

With dropout, define a binary mask vector \(r^{(l)}\) of independent Bernoulli random variables:

$$ r_j^{(l)} \sim \mathrm{Bernoulli}(p), $$

where \(p\) is the retention probability (the probability that a unit remains active). Then the input to layer \(l + 1\) becomes:

$$ \tilde{y}^{(l)} = r^{(l)} \odot y^{(l)}, $$

where \(\odot\) denotes element-wise multiplication. The dropout feedforward becomes:

$$ z_i^{(l+1)} = \mathbf{w}_i^{(l+1)} \tilde{y}^{(l)} + b_i^{(l+1)}, \qquad y_i^{(l+1)} = f\big(z_i^{(l+1)}\big) $$
The mask \(r^{(l)}\) randomly zeroes out units independently during training.
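As a concrete sketch, sampling the mask and thinning the activations can be written in a few lines of NumPy (the function name `dropout_layer` is ours, not the paper's):

```python
import numpy as np

def dropout_layer(y, p=0.5, rng=None):
    # Sample r^(l): one independent Bernoulli(p) draw per unit,
    # then form the thinned activations y_tilde = r * y.
    rng = rng or np.random.default_rng(0)
    r = (rng.random(y.shape) < p).astype(y.dtype)
    return r * y, r

y = np.array([1.0, 2.0, 3.0, 4.0])
y_thin, r = dropout_layer(y)
# Surviving entries keep their original value; dropped entries become zero.
```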
### Expected Output Matching at Test Time

At test time dropout is not applied; instead, the weights are scaled. To maintain the same expected input to each unit as during training, the outgoing weights are multiplied by the retention probability \(p\):

$$ W_{\text{test}}^{(l)} = p \, W^{(l)} $$

This ensures that the expected output of each unit under dropout matches its deterministic output at test time.

At training time: $$ \mathbb{E}\big[r_j^{(l)} \, y_j^{(l)}\big] = p \, y_j^{(l)} $$

At test time, the scaled weights reproduce exactly this expectation in a single deterministic pass.
This scaling effectively performs approximate model averaging across all thinned networks.
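A quick numerical check of this equivalence (our sketch, not from the paper): averaging a unit's masked output over many sampled masks converges to \(p \, y_j\), which is exactly what the test-time weight scaling computes deterministically.

```python
import numpy as np

rng = np.random.default_rng(42)
p = 0.5
y = np.array([1.0, -2.0, 3.0])           # a unit's training-time activations
# Monte Carlo: average the masked activation over many Bernoulli(p) masks.
masks = rng.random((100_000, y.size)) < p
mc_mean = (masks * y).mean(axis=0)
# Test-time scaling computes p * y in one deterministic pass.
assert np.allclose(mc_mean, p * y, atol=0.05)
```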
## Training with Dropout
### Backpropagation with Dropout
Training a dropout network uses standard stochastic gradient descent (SGD), with a fresh dropout mask sampled for every training case. For each example in a mini-batch:

1. Sample dropout masks \(r^{(l)}\) for all layers.
2. Forward propagate using the masks.
3. Backward propagate errors through the resulting thinned network.
4. Accumulate gradients (a weight not used by the thinned network contributes zero gradient).
The cost function remains unchanged; only the structure of the network varies stochastically.
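The "zero contribution" point can be verified directly: backpropagating through \(\tilde{y} = r \odot y\) multiplies the upstream gradient by the same mask, so dropped units receive exactly zero gradient. A minimal sketch with made-up values:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=(4, 5))            # activations for a mini-batch of 4
mask = rng.random(y.shape) < 0.5       # Bernoulli(p = 0.5) retain mask
upstream = np.ones_like(y)             # gradient arriving from the layer above
grad_y = upstream * mask               # chain rule through y_tilde = mask * y
assert np.all(grad_y[~mask] == 0)      # dropped units get no gradient
```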
### Max-norm Regularization

To stabilize training with dropout, a max-norm constraint can be enforced on the incoming weight vector \(\mathbf{w}\) of each hidden unit:

$$ \|\mathbf{w}\|_2 \le c $$

for a fixed constant \(c\). After each SGD update, any violating weight vector is projected back onto the ball of radius \(c\). The paper finds that this improves generalization when combined with dropout.
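The projection step can be sketched as follows, assuming weights are stored so that column \(j\) holds unit \(j\)'s incoming weight vector (the function name `max_norm_project` is ours):

```python
import numpy as np

def max_norm_project(W, c=3.0):
    # Rescale any column whose L2 norm exceeds c back onto the ball of
    # radius c; columns already inside the ball are left untouched.
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale

W = np.random.default_rng(1).normal(size=(64, 32)) * 5.0
W_proj = max_norm_project(W, c=3.0)
assert np.all(np.linalg.norm(W_proj, axis=0) <= 3.0 + 1e-9)
```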
### Hyperparameters and Practical Choices
Typical settings from the paper:
- Hidden layer retention probability: \(p = 0.5\)
- Input layer retention probability: \(p \approx 0.8\)
- Weight constraint radius \(c\): tuned via validation
- Activations: ReLU often outperforms sigmoid
- Combining dropout with max-norm regularization, high momentum, and a decaying learning rate yields the best results.
## Experimental Results Summary

Dropout consistently improves performance across domains:

| Dataset | Task | Improvement (approx.) |
|---|---|---|
| MNIST | Handwritten digits | error ~1.60% → 0.95% |
| CIFAR-10 | Image classification | error ~14.98% → 12.61% |
| CIFAR-100 | Image classification | error ~43.48% → 37.20% |
| SVHN | Street-view house numbers | error ~3.95% → 2.55% |
| ImageNet | Large-scale vision | top-5 error ~16% (state of the art at the time) |

(Full experimental tables are available in the appendix of the paper.)
## Pure Python Implementation

Below is a minimal NumPy example of a feedforward net trained with dropout (vectorized but not optimized):
```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def dropout_forward(X, W, b, p=0.5):
    # Inverted dropout: divide the mask by p so activations keep the same
    # expected value and no weight rescaling is needed at test time.
    mask = (np.random.rand(*X.shape) < p) / p
    return (X * mask) @ W + b, mask

def train_dropout(X, y, layers, p=0.5, lr=1e-3, epochs=1000):
    # layers: list of (W, b) tuples
    for epoch in range(epochs):
        idx = np.random.permutation(len(X))
        Xb, yb = X[idx], y[idx]
        activations, pre_acts, masks = [Xb], [], []
        # Forward pass: drop the input to each layer, then affine + ReLU.
        for W, b in layers:
            out, mask = dropout_forward(activations[-1], W, b, p)
            masks.append(mask)
            pre_acts.append(out)
            activations.append(relu(out))
        # Backward pass (MSE loss); grad holds dL/d(pre-activation).
        grad = (activations[-1] - yb) * (pre_acts[-1] > 0)
        for i in reversed(range(len(layers))):
            W, b = layers[i]
            dropped_in = activations[i] * masks[i]  # input actually used forward
            grad_W = dropped_in.T @ grad
            grad_b = grad.sum(axis=0)
            if i > 0:  # propagate through the mask and the ReLU below
                grad = (grad @ W.T) * masks[i] * (pre_acts[i - 1] > 0)
            layers[i] = (W - lr * grad_W, b - lr * grad_b)
    return layers
```
This code is for illustrative purposes only and omits many practical details (e.g., mini-batch splitting, loss functions such as cross-entropy, numerical stability, additional regularization such as max-norm). Note that it uses "inverted" dropout, dividing the mask by \(p\) during training, so no weight scaling is needed at test time. It illustrates the core idea of dropout forward propagation and the corresponding update rule.
## Discussion and Insights
Dropout acts as a stochastic regularizer that:
- Reduces co-adaptation of features by forcing units to learn robust representations
- Implicitly performs model averaging over \(2^n\) sub-networks
- Encourages sparsity and diverse feature detectors
- Works synergistically with other regularizers (e.g., max-norm)
Limitations:
- The training objective becomes stochastic while the test-time network is deterministic; gradient checking must disable dropout (i.e., set \(p = 1\)) for consistency.
- Low retention probabilities (\(p\) near 0, i.e., aggressive dropout) can cause underfitting.