# Dropout: A Simple Way to Prevent Neural Networks from Overfitting
Srivastava et al., Journal of Machine Learning Research, 15(56):1929–1958 (2014)
## Introduction
Deep neural nets with many parameters often overfit when training data is limited. With dropout, each training presentation randomly drops (i.e., sets to zero) a subset of neurons (hidden units) independently, thereby training many "thinned" networks that share weights. Dropout approximates model averaging over an exponential number of sub-networks while keeping training computation reasonable.
## Mathematical Description of Dropout
### Feedforward with Dropout
Consider a standard feedforward neural network with \(L\) hidden layers. Let:
- \(y^{(l)}\) be the vector of neuron outputs at layer \(l\)
- \(W^{(l)}\) and \(b^{(l)}\) be weights and biases at layer \(l\)
- \(f(\cdot)\) be an activation function (e.g., sigmoid, ReLU)
The standard feedforward operation for unit \(i\) in layer \(l + 1\) is:

$$ z_i^{(l+1)} = \mathbf{w}_i^{(l+1)} y^{(l)} + b_i^{(l+1)}, \qquad y_i^{(l+1)} = f\big(z_i^{(l+1)}\big) $$

With dropout, define a binary mask vector \(r^{(l)}\) of independent Bernoulli random variables:

$$ r_j^{(l)} \sim \mathrm{Bernoulli}(p), $$

where \(p\) is the retention probability (the probability that a unit remains active). Then the input to layer \(l + 1\) becomes:

$$ \tilde{y}^{(l)} = r^{(l)} \odot y^{(l)}, $$

where \(\odot\) denotes element-wise multiplication. The dropout feedforward becomes:

$$ z_i^{(l+1)} = \mathbf{w}_i^{(l+1)} \tilde{y}^{(l)} + b_i^{(l+1)}, \qquad y_i^{(l+1)} = f\big(z_i^{(l+1)}\big) $$
The mask \(r^{(l)}\) randomly zeroes out units independently during training.
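As a concrete sketch, sampling the mask and thinning the activations can be written in a few lines of NumPy (the function name `dropout_layer` is ours, not the paper's):

```python
import numpy as np

def dropout_layer(y, p=0.5, rng=None):
    # Sample r^(l): one independent Bernoulli(p) draw per unit,
    # then form the thinned activations y_tilde = r * y.
    rng = rng or np.random.default_rng(0)
    r = (rng.random(y.shape) < p).astype(y.dtype)
    return r * y, r

y = np.array([1.0, 2.0, 3.0, 4.0])
y_thin, r = dropout_layer(y)
# Surviving entries keep their original value; dropped entries become zero.
```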
### Expected Output Matching at Test Time

At test time dropout is not applied; instead, the weights are scaled. To maintain the same expected input to each unit as during training, the outgoing weights are multiplied by the retention probability \(p\):

$$ W_{\text{test}}^{(l)} = p \, W^{(l)} $$

This ensures that the expected output of each unit under dropout matches its deterministic output at test time.

At training time: $$ \mathbb{E}\big[r_j^{(l)} \, y_j^{(l)}\big] = p \, y_j^{(l)} $$

At test time, the scaled weights reproduce exactly this expectation in a single deterministic pass.
This scaling effectively performs approximate model averaging across all thinned networks.
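A quick numerical check of this equivalence (our sketch, not from the paper): averaging a unit's masked output over many sampled masks converges to \(p \, y_j\), which is exactly what the test-time weight scaling computes deterministically.

```python
import numpy as np

rng = np.random.default_rng(42)
p = 0.5
y = np.array([1.0, -2.0, 3.0])           # a unit's training-time activations
# Monte Carlo: average the masked activation over many Bernoulli(p) masks.
masks = rng.random((100_000, y.size)) < p
mc_mean = (masks * y).mean(axis=0)
# Test-time scaling computes p * y in one deterministic pass.
assert np.allclose(mc_mean, p * y, atol=0.05)
```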
## Training with Dropout
### Backpropagation with Dropout
Training a dropout network uses standard stochastic gradient descent (SGD), with a fresh dropout mask sampled for every training case. For each example in a mini-batch:

1. Sample dropout masks \(r^{(l)}\) for all layers.
2. Forward propagate using the masks.
3. Backward propagate errors through the resulting thinned network.
4. Accumulate gradients (a weight not used by the thinned network contributes zero gradient).
The cost function remains unchanged; only the structure of the network varies stochastically.
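The "zero contribution" point can be verified directly: backpropagating through \(\tilde{y} = r \odot y\) multiplies the upstream gradient by the same mask, so dropped units receive exactly zero gradient. A minimal sketch with made-up values:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=(4, 5))            # activations for a mini-batch of 4
mask = rng.random(y.shape) < 0.5       # Bernoulli(p = 0.5) retain mask
upstream = np.ones_like(y)             # gradient arriving from the layer above
grad_y = upstream * mask               # chain rule through y_tilde = mask * y
assert np.all(grad_y[~mask] == 0)      # dropped units get no gradient
```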
### Max-norm Regularization

To stabilize training with dropout, a max-norm constraint can be enforced on the incoming weight vector \(\mathbf{w}\) of each hidden unit:

$$ \|\mathbf{w}\|_2 \le c $$

for a fixed constant \(c\). After each SGD update, any violating weight vector is projected back onto the ball of radius \(c\). The paper finds that this improves generalization when combined with dropout.
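The projection step can be sketched as follows, assuming weights are stored so that column \(j\) holds unit \(j\)'s incoming weight vector (the function name `max_norm_project` is ours):

```python
import numpy as np

def max_norm_project(W, c=3.0):
    # Rescale any column whose L2 norm exceeds c back onto the ball of
    # radius c; columns already inside the ball are left untouched.
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale

W = np.random.default_rng(1).normal(size=(64, 32)) * 5.0
W_proj = max_norm_project(W, c=3.0)
assert np.all(np.linalg.norm(W_proj, axis=0) <= 3.0 + 1e-9)
```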
### Hyperparameters and Practical Choices
Typical settings from the paper:
- Hidden layer retention probability: \(p = 0.5\)
- Input layer retention probability: \(p \approx 0.8\)
- Weight constraint radius \(c\): tuned via validation
- Activations: ReLU often outperforms sigmoid
- Combining dropout with max-norm regularization, high momentum, and a decaying learning rate yields the best results.
## Experimental Results Summary

Dropout consistently improves performance across domains:

| Dataset | Task | Improvement (approx.) |
|---|---|---|
| MNIST | Handwritten digits | error ~1.60% → 0.95% |
| CIFAR-10 | Image classification | error ~14.98% → 12.61% |
| CIFAR-100 | Image classification | error ~43.48% → 37.20% |
| SVHN | Street-view house numbers | error ~3.95% → 2.55% |
| ImageNet | Large-scale vision | top-5 error ~16% (state of the art at the time) |

(Full experimental tables are available in the appendix of the paper.)
## Pure Python Implementation

Below is a minimal NumPy example of a feedforward net trained with dropout (vectorized but not optimized):
```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def dropout_forward(X, W, b, p=0.5):
    # Inverted dropout: divide the mask by p so activations keep the same
    # expected value and no weight rescaling is needed at test time.
    mask = (np.random.rand(*X.shape) < p) / p
    return (X * mask) @ W + b, mask

def train_dropout(X, y, layers, p=0.5, lr=1e-3, epochs=1000):
    # layers: list of (W, b) tuples
    for epoch in range(epochs):
        idx = np.random.permutation(len(X))
        Xb, yb = X[idx], y[idx]
        activations, pre_acts, masks = [Xb], [], []
        # Forward pass: drop the input to each layer, then affine + ReLU.
        for W, b in layers:
            out, mask = dropout_forward(activations[-1], W, b, p)
            masks.append(mask)
            pre_acts.append(out)
            activations.append(relu(out))
        # Backward pass (MSE loss); grad holds dL/d(pre-activation).
        grad = (activations[-1] - yb) * (pre_acts[-1] > 0)
        for i in reversed(range(len(layers))):
            W, b = layers[i]
            dropped_in = activations[i] * masks[i]  # input actually used forward
            grad_W = dropped_in.T @ grad
            grad_b = grad.sum(axis=0)
            if i > 0:  # propagate through the mask and the ReLU below
                grad = (grad @ W.T) * masks[i] * (pre_acts[i - 1] > 0)
            layers[i] = (W - lr * grad_W, b - lr * grad_b)
    return layers
```
This code is for illustrative purposes only and omits many practical details (e.g., mini-batch splitting, loss functions such as cross-entropy, numerical stability, additional regularization such as max-norm). Note that it uses "inverted" dropout, dividing the mask by \(p\) during training, so no weight scaling is needed at test time. It illustrates the core idea of dropout forward propagation and the corresponding update rule.
## Discussion and Insights
Dropout acts as a stochastic regularizer that:
- Reduces co-adaptation of features by forcing units to learn robust representations
- Implicitly performs model averaging over \(2^n\) sub-networks
- Encourages sparsity and diverse feature detectors
- Works synergistically with other regularizers (e.g., max-norm)
Limitations:
- The training objective becomes stochastic while the test-time network is deterministic; gradient checking must disable dropout (i.e., set \(p = 1\)) for consistency.
- Low retention probabilities (\(p\) near 0, i.e., aggressive dropout) can cause underfitting.