ANN Series 11 - Regularization in Neural Networks

Overfitting

Regularization in neural networks is a crucial technique used to prevent overfitting, which occurs when a model learns the training data too well, including its noise and outliers, leading to poor performance on unseen data. Overfitting happens especially when the network is too complex relative to the amount and variety of the training data. 

Regularization techniques modify the learning process to reduce the complexity of the model, encouraging it to learn more general patterns that can generalize better to new, unseen data. 

Common Techniques

Here are some common regularization techniques used in neural networks:

1. L1 Regularization (Lasso Regression): 

Adds a penalty equal to the absolute value of the magnitude of coefficients. This can lead to sparse models where some weights become exactly zero, effectively removing some features/weights. Lasso can struggle in situations where the number of predictors is much larger than the number of observations or when several predictors are highly correlated. 
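
As a minimal sketch, an L1 penalty can be added to the training loss in PyTorch by summing the absolute values of the parameters; the model, data, and penalty strength below are illustrative placeholders.

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                 # placeholder model; any nn.Module works
criterion = nn.MSELoss()
l1_lambda = 1e-4                         # illustrative penalty strength

x, y = torch.randn(32, 10), torch.randn(32, 1)
data_loss = criterion(model(x), y)

# L1 penalty: sum of absolute values of all parameters
# (in practice, bias terms are often excluded)
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = data_loss + l1_lambda * l1_penalty
loss.backward()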

2. L2 Regularization (Ridge Regression):

Adds a penalty equal to the square of the magnitude of coefficients. This discourages large weights but does not set them to zero, leading to models where weights are evenly distributed and smaller.

Shrinkage: 

The L2 penalty term in the ridge regression cost function encourages the coefficients of the regression model to shrink towards zero, but not exactly zero. This shrinkage effect is stronger on coefficients of less important variables, helping to reduce model variance without substantially increasing bias.

Correlated Variables: 

When variables are highly correlated, ordinary least squares (OLS) can lead to large coefficients as it tries to fit the data closely. However, these large coefficients can vary widely for small changes in the model or the data, leading to overfitting. Ridge regression counteracts this by penalizing large coefficients, effectively distributing the weight among correlated variables and reducing the variance in their estimated coefficients. This results in a more stable model where the influence of any single variable is moderated.

Bias-Variance Trade-off: 

By introducing the regularization term, ridge regression accepts a slight increase in bias (through shrinkage of coefficients) in exchange for a significant decrease in variance of coefficient estimates. This trade-off often leads to better model performance on unseen data.
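
In PyTorch, an L2 penalty is most often applied through the optimizer's weight_decay argument; a minimal sketch, with an illustrative model and hyperparameters:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                 # placeholder model

# weight_decay adds weight_decay * w to each parameter's gradient,
# which is equivalent to an L2 penalty on the weights
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)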

3. Elastic Net Regularization: 

A combination of L1 and L2 regularization. It adds penalties from both L1 and L2, controlling the mixture via a ratio parameter. This method combines the feature selection from L1 with the regularization of L2.

The first 3 techniques add a penalty term to the loss function. 
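
For example, an elastic-net-style penalty can be written as a weighted mix of the L1 and L2 terms and added to the data loss. The sketch below is illustrative; the mixing ratio and penalty strength are placeholders.

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                 # placeholder model
criterion = nn.MSELoss()
alpha, l1_ratio = 1e-4, 0.5              # overall strength and L1/L2 mixing ratio

x, y = torch.randn(32, 10), torch.randn(32, 1)
l1 = sum(p.abs().sum() for p in model.parameters())
l2 = sum(p.pow(2).sum() for p in model.parameters())

# Elastic net: convex combination of the L1 and L2 penalties
loss = criterion(model(x), y) + alpha * (l1_ratio * l1 + (1 - l1_ratio) * l2)
loss.backward()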

4. Dropout: 

Dropout is a regularization technique that randomly drops (sets to zero) a proportion of the neurons in the network during training. Each time a batch of data is passed through the network (i.e., during each forward pass), dropout randomly selects a different subset of neurons to "drop" by setting their outputs to zero. This means that the specific neurons dropped change with every iteration of the training process.

Impact of Dropout

Reduction of Co-adaptations: By randomly dropping out neurons, dropout reduces the chances of neurons becoming overly reliant on the presence of particular other neurons. This encourages each neuron to independently extract useful features, making the model more robust.

Co-Adaptations: 

In the context of neural networks, co-adaptations occur when neurons in a layer adjust their weights during training in such a way that they become highly specialized to recognize specific patterns or features in combination with other neurons. While this might sound beneficial, it can lead to a model that performs well on the training data by memorizing these complex patterns (including noise) but fails to generalize to new, unseen data.

Complex Co-Adaptations: 

These are intricate dependencies that form among neurons over the course of training. For example, one neuron's output might become highly reliant on the specific outputs of several other neurons in a way that is unique to the training set. Such dependencies make the network less flexible in adapting to new data that doesn't exactly match the training set's patterns.

Ensemble Effect: 

Dropout can be interpreted as a way of training a large ensemble of neural networks with shared weights. Each forward pass uses a different "thinned" network, and the final model can be seen as an averaging of these thinned networks. This ensemble approach helps in reducing overfitting.

Improved Generalization: 

Because the network cannot rely on any single set of neurons, dropout forces the network to learn more generalized representations that are useful across different "versions" of the network seen during training. This leads to better generalization to unseen data.

Usually in an MLP, dropout is applied to the input and hidden layers.

Dropout at Input Layer

This forces the network to not rely too heavily on any input feature since each feature can be zeroed with a certain probability. The dropout rate for the input layer is usually set to a lower value compared to hidden layers (e.g., 0.1 to 0.2) to avoid losing too much input information.

Dropout at Hidden Layers

Dropout is most frequently and effectively applied to hidden layers in an MLP. By randomly omitting a subset of neurons within the hidden layers, dropout prevents the network from becoming too dependent on any specific neuron, encouraging the network to learn more robust features that are generalizable across different subsets of the data. The dropout rate for hidden layers can vary, but typical values range from 0.2 to 0.5. It's important to tune this rate based on the specific dataset and model architecture.

Example for dropout rate

Suppose a hidden layer has 5 nodes and a dropout rate of 0.2. Practically, for every batch of data processed during training, a random 20% of the nodes in that hidden layer will be turned off. So, on average, 1 out of the 5 nodes (20% of 5) will be zeroed out for any given training step, although the specific node(s) that are dropped change randomly with each iteration.

Note that dropout is applied only during the training phase and not during the testing phase.
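
A minimal sketch of dropout in a PyTorch MLP; the layer sizes and dropout rates below are illustrative.

import torch
import torch.nn as nn

mlp = nn.Sequential(
    nn.Dropout(p=0.1),                   # dropout on the input features
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),                   # dropout on the hidden layer
    nn.Linear(64, 1),
)

mlp.train()                              # dropout is active during training
out_train = mlp(torch.randn(8, 20))

mlp.eval()                               # dropout is disabled at test time
out_test = mlp(torch.randn(8, 20))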

Visual Example

Consider the following MLP.


Let us apply a dropout rate of 0.5 in the first hidden layer. On average, 2 of its 4 nodes will be dropped, since 50% of 4 is 2. Similarly, let the dropout rate at the second hidden layer be 0.2; on average, 1 of its 5 nodes will be dropped during every iteration, since 20% of 5 is 1, resulting in the following network.




5. Early Stopping: 

Monitors the model's performance on a validation set and stops training when performance starts to degrade (e.g., when the validation loss starts to increase), even if the training loss continues to decrease. This helps in preventing overfitting by not allowing the model to train for too long.

Common Parameters to consider in Early Stopping:

monitor: 

This parameter specifies the metric to be monitored, such as 'val_loss' for validation loss or 'val_accuracy' for validation accuracy.

 min_delta: 

The minimum change in the monitored metric to qualify as an improvement. This parameter accounts for noise in the training process: changes smaller than this threshold are treated as no improvement.

 patience: 

The number of epochs with no improvement after which training will be stopped. Setting this parameter requires balancing the desire to train long enough to reach an optimal model against the risk of overfitting.

 mode: 

Determines whether the monitored metric should be minimized or maximized. Common values include 'min' for metrics like loss, where a lower value is better, and 'max' for metrics like accuracy, where a higher value is better.

 baseline: 

An optional value for the monitored metric. If set, training will stop if the model doesn't achieve a performance better than this baseline within the patience period.

 restore_best_weights: 

Whether to restore model weights from the epoch with the best value of the monitored metric. When training is stopped early, the model weights may be at a state that is not optimal; restoring the best weights helps to ensure the model retains the best learned features.
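
The parameter names above follow the Keras EarlyStopping callback; a minimal sketch with illustrative values (the model and data in the commented call are placeholders):

from tensorflow import keras

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",                  # metric to watch
    min_delta=1e-4,                      # smallest change that counts as an improvement
    patience=5,                          # epochs without improvement before stopping
    mode="min",                          # lower validation loss is better
    restore_best_weights=True,           # roll back to the best epoch's weights
)

# model.fit(x_train, y_train, validation_split=0.2,
#           epochs=100, callbacks=[early_stop])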

6. Batch Normalization: 

Although primarily used to help with training speed and stability, batch normalization can also have a regularizing effect. Batch Normalization helps in stabilizing the learning process and significantly reduces the number of training epochs required to train deep networks. 

It normalizes the output of the preceding layer (typically the pre-activation values of a linear or convolutional layer) by subtracting the batch mean and dividing by the batch standard deviation, thereby ensuring that the inputs to the activation functions are standardized. This can also allow for higher learning rates and smoother convergence.

Steps

For each mini-batch, batch normalization performs the following steps:

1. Compute the mean of the activations in the current mini-batch.
2. Compute the variance of the activations in the current mini-batch.
3. Normalize each activation by subtracting the batch mean and dividing by the square root of the batch variance (plus a small epsilon for numerical stability).
4. Scale and shift the normalized values using the learnable parameters gamma and beta.

Why Apply Batch Normalization Before Activation?

Stabilizes Learning: 

By normalizing the inputs to the activation functions, batch normalization helps to ensure that these inputs do not become too high or too low, which can lead to vanishing or exploding gradients and hinder the learning process.

Improves Optimization: 

It makes the optimization landscape smoother. This can lead to faster convergence during training and allows the use of higher learning rates.

Reduces Internal Covariate Shift: 

By normalizing the inputs to each layer, batch normalization reduces the internal covariate shift, which is the change in the distribution of network activations due to the change in network parameters during training. This helps to stabilize and speed up training.
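
A minimal sketch of batch normalization placed between a linear layer and its activation in PyTorch; the layer sizes are illustrative.

import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),                  # normalize the pre-activation values
    nn.ReLU(),
    nn.Linear(64, 1),
)

net.train()                              # uses batch statistics during training
out = net(torch.randn(8, 20))

net.eval()                               # uses running mean/variance at test time
out = net(torch.randn(8, 20))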

7. Weight Constraints: 

Limiting the size of weights directly by imposing constraints on their magnitude (e.g., a maximum norm constraint) can also serve as a form of regularization, ensuring that no weight can grow excessively large.

Steps for Maximum Norm Constraint

Step 1: Initialize Weights:

Create a sample weights matrix.

Step 2: Calculate L2 Norm:

Compute the L2 norm of the weights.

Step 3: Apply Maximum Norm Constraint: 

If the L2 norm exceeds max_val, scale down the weights so that their new L2 norm equals max_val.
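
A minimal sketch of these steps in PyTorch; the weight matrix and max_val are illustrative.

import torch

max_val = 3.0

# Step 1: a sample weight matrix
weights = torch.randn(5, 10)

# Step 2: L2 norm of the weights
norm = weights.norm(p=2)

# Step 3: rescale so the new L2 norm equals max_val when it is exceeded
if norm > max_val:
    weights = weights * (max_val / norm)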

8. Weight Clipping

Weight clipping is a straightforward method where weights are directly clipped to be within a specific range [−c,c] after each gradient update. 

Code

import torch

# After each gradient update, clip every parameter to the range [-c, c]
with torch.no_grad():
    for param in model.parameters():
        param.data = torch.clamp(param.data, -c, c)

Other Techniques

 1. Dataset Augmentation

 Dataset augmentation increases the diversity of the training data through transformations like rotation, scaling, cropping, or flipping images, and even textual modifications for NLP tasks. This helps in regularization by effectively increasing the size and variability of the training dataset, which encourages the model to learn more general features rather than memorizing the training data.

- Effect: By training on this augmented dataset, the model becomes more robust to variations in the input data, improving its generalization ability.
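
A minimal sketch of image augmentation with torchvision transforms; the specific transforms and parameters are illustrative.

from torchvision import transforms

# Randomly flip, rotate, and crop each training image on the fly
train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=15),
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
])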

 2. Parameter Sharing and Tying

 Parameter sharing reduces the total number of free parameters, forcing the model to represent the data with fewer parameters. This is commonly seen in convolutional neural networks (CNNs) where the same weights are used for different parts of the input, significantly reducing the number of unique weights. Parameter tying, often used in autoencoders, involves using the same parameters (weights) in multiple parts of a model.

- Effect: Both techniques reduce the complexity of the model and the likelihood of overfitting by constraining the model to learn more general features.
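
A minimal sketch of parameter tying in a toy autoencoder, where the decoder reuses the transposed encoder weights; the layer sizes are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(20, 8)
x = torch.randn(4, 20)

z = torch.relu(encoder(x))
# Weight tying: the decoder reuses the encoder's weights (transposed),
# halving the number of free parameters
x_hat = F.linear(z, encoder.weight.t())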

 3. Ensemble Methods

 Ensemble methods combine the predictions of several base estimators (e.g., different neural networks) to improve robustness and generalization over a single estimator. Techniques like bagging, boosting, and stacking are ways to create ensembles.

- Effect: They help in regularization by averaging out biases, reducing variance, and making the model less likely to overfit to the training data. The diversity among the models in the ensemble leads to more reliable predictions.
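
A minimal sketch of the simplest neural-network ensemble, averaging the predicted probabilities of several models; the models below are untrained placeholders.

import torch
import torch.nn as nn

# Two independently initialized (and, in practice, independently trained) models
models = [nn.Linear(10, 3) for _ in range(2)]

x = torch.randn(4, 10)
with torch.no_grad():
    # Average the class probabilities across ensemble members
    probs = torch.stack([m(x).softmax(dim=1) for m in models]).mean(dim=0)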

 4. Adding Noise to Input/Output

 Adding noise to inputs or outputs during training can make the model more robust to slight variations and imperfections in the data. For inputs, this might involve adding random noise to the training data. For outputs, it could mean adding noise to the targets in regression tasks or using techniques like label smoothing in classification.

- Effect: This encourages the model not to rely too heavily on any single feature or pattern in the data, promoting the learning of more general patterns that are invariant to small changes, thereby improving generalization.
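
A minimal sketch of both ideas in PyTorch: Gaussian noise added to the inputs, and label smoothing applied through CrossEntropyLoss (available in recent PyTorch versions); the noise level and smoothing factor are illustrative.

import torch
import torch.nn as nn

model = nn.Linear(10, 3)                 # placeholder classifier
# Label smoothing softens the one-hot targets in classification
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

x, y = torch.randn(32, 10), torch.randint(0, 3, (32,))
x_noisy = x + 0.05 * torch.randn_like(x) # add small Gaussian noise to the inputs

loss = criterion(model(x_noisy), y)
loss.backward()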

Each of these techniques can be used alone or in combination to prevent overfitting. The choice of regularization method(s) can depend on the specific problem, the nature of the data, and the neural network architecture. Effective use of regularization can significantly enhance the generalization ability of neural networks.
