# Batch Normalization and Its Role in Training Stability
## Introduction to Neural Network Optimization
Neural network optimization is a crucial aspect of machine learning that focuses on improving the training process. This section delves into batch normalization: its mathematical foundation, implementation details, and impact on model stability during training. We'll also provide practical examples with Python code snippets to illustrate the concepts.
## What is Batch Normalization?
Batch normalization (BN) is a technique designed to improve the speed, performance, and stability of neural networks by standardizing the inputs across each mini-batch during training. The goal is to ensure that the distribution of input values remains consistent throughout the training process, which helps in accelerating convergence and reducing internal covariate shift.
Introduced by Sergey Ioffe and Christian Szegedy in 2015, BN has since become standard practice among deep learning practitioners. For the values of a single feature over a mini-batch, the transformation can be written as:
$$
\begin{aligned}
&\text{Let } \mathcal{B} = \{x_1, x_2, \dots, x_m\} \text{ be the values of one feature over a mini-batch of size } m, \\
&\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} \left(x_i - \mu_B\right)^2, \\
&\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \, \hat{x}_i + \beta,
\end{aligned}
$$
where $\mu_B$ and $\sigma_B^2$ are the mini-batch mean and variance, $\epsilon$ is a small constant added for numerical stability, and the learned parameters $\gamma$ (scale) and $\beta$ (shift) let the network rescale and shift the normalized output.
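To make the formula concrete, here is a small, purely illustrative sketch in PyTorch that applies the transformation by hand (with $\gamma = 1$, $\beta = 0$) to each feature of a random mini-batch and checks that it matches the built-in `nn.BatchNorm1d` layer in training mode:
```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 3)  # mini-batch of 4 samples with 3 features

# Manual batch normalization, computed per feature (gamma=1, beta=0)
eps = 1e-5
mu = x.mean(dim=0)                        # mini-batch mean per feature
var = x.var(dim=0, unbiased=False)        # biased mini-batch variance per feature
x_hat = (x - mu) / torch.sqrt(var + eps)  # normalized activations

# The built-in layer in training mode should produce the same values
bn = nn.BatchNorm1d(num_features=3)
bn.train()
print(torch.allclose(x_hat, bn(x), atol=1e-6))              # True
print(x_hat.mean(dim=0), x_hat.std(dim=0, unbiased=False))  # ~0 and ~1 per feature
```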
## Implementation Details
Batch normalization is available as a ready-made layer in deep learning frameworks such as TensorFlow and PyTorch. Here's a simple example adding a BN layer after a convolution in TensorFlow/Keras:
```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential()
# Convolutional layer whose activations are then batch-normalized
model.add(layers.Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.BatchNormalization())
```
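One detail the snippet glosses over: for convolutional inputs, Keras normalizes over the channel axis by default, so the layer keeps one scale, shift, and running-statistics entry per channel. A small illustrative check (exact variable names depend on the Keras version):
```python
import tensorflow as tf
from tensorflow.keras import layers, models

# BatchNormalization after a Conv2D with 32 filters keeps four parameter
# vectors (gamma, beta, moving mean, moving variance), one entry per channel.
model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, kernel_size=(3, 3), activation='relu'),
    layers.BatchNormalization(),
])
bn = model.layers[-1]
for w in bn.weights:
    print(w.name, w.shape)  # each has shape (32,)
```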
In PyTorch, the BN layer can be added using `nn.BatchNorm2d`:
```python
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv = nn.Conv2d(3, 64, kernel_size=(3, 3))
        self.bn = nn.BatchNorm2d(num_features=64)  # one scale/shift pair per channel

    def forward(self, x):
        x = self.conv(x)
        return self.bn(x)
```
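One behavioral detail worth remembering: in training mode, `nn.BatchNorm2d` normalizes with the current mini-batch's statistics and updates its running estimates, while in evaluation mode it uses those running estimates instead. Continuing from the snippet above, a minimal sketch of switching between the two modes:
```python
# BN behaves differently in train vs. eval mode
model = MyModel()
x = torch.randn(8, 3, 32, 32)  # dummy batch of 8 RGB images

model.train()        # uses the batch's own mean/variance and updates running stats
out_train = model(x)

model.eval()         # uses the accumulated running mean/variance
with torch.no_grad():
    out_eval = model(x)

print(out_train.shape, out_eval.shape)  # both torch.Size([8, 64, 30, 30])
```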
## Impact on Training Stability and Convergence
By normalizing the inputs to each layer, BN helps stabilize training and mitigates issues such as exploding or vanishing gradients. It also allows higher learning rates without the optimization diverging, and the noise introduced by mini-batch statistics acts as a mild regularizer. In addition, BN often accelerates convergence and reduces sensitivity to the choice of weight initialization.
## Experiment: Comparing Training Performance with and Without Batch Normalization
To demonstrate the impact of batch normalization on training stability and performance, let's compare two ResNet-18 variants on a small image-classification task (the code below uses CIFAR-10 as a readily available stand-in, since torchvision does not ship a Mini-ImageNet loader). One model keeps the batch normalization layer after each convolution, while the other has those layers removed:
```python
import torch
from torch import nn
from torchvision.models import resnet18

# Model with Batch Normalization (standard ResNet-18, which uses BN after every convolution)
class BN_ResNet(nn.Module):
    def __init__(self, num_classes=1000):
        super(BN_ResNet, self).__init__()
        model = resnet18(weights=None)  # randomly initialized weights
        self.features = nn.Sequential(*list(model.children())[:-1])  # drop the final FC layer
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

# Model without Batch Normalization (every BN layer replaced by an identity mapping)
class No_BN_ResNet(nn.Module):
    def __init__(self, num_classes=1000):
        super(No_BN_ResNet, self).__init__()
        model = resnet18(weights=None, norm_layer=nn.Identity)  # no-op in place of BN
        self.features = nn.Sequential(*list(model.children())[:-1])
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)
```
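Before training, a quick (illustrative) sanity check confirms that the second variant really contains no batch normalization layers:
```python
# Count the BatchNorm2d modules in each variant; the no-BN model should report zero
bn_layers = sum(isinstance(m, nn.BatchNorm2d) for m in BN_ResNet(num_classes=10).modules())
no_bn_layers = sum(isinstance(m, nn.BatchNorm2d) for m in No_BN_ResNet(num_classes=10).modules())
print(bn_layers, no_bn_layers)  # e.g. 20 and 0 for ResNet-18
```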
With both models defined, we can train them side by side. In this comparison, the BN_ResNet model typically converges faster and reaches higher validation accuracy than No_BN_ResNet:

```python
import torch
import torch.nn.functional as F
from torchvision import datasets, transforms
from tqdm import tqdm

# Load data (CIFAR-10; torchvision does not provide a Mini-ImageNet dataset out of the box)
transform = transforms.Compose([transforms.ToTensor()])
train_data = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
val_data = datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_data, batch_size=128, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_data, batch_size=128, shuffle=False)

# Define models and optimizers
bn_resnet = BN_ResNet(num_classes=10)
no_bn_resnet = No_BN_ResNet(num_classes=10)
optimizer_bn = torch.optim.Adam(bn_resnet.parameters(), lr=0.001)
optimizer_no_bn = torch.optim.Adam(no_bn_resnet.parameters(), lr=0.001)

# Train both models and evaluate the BN model after every epoch
for epoch in range(5):
    bn_resnet.train()
    no_bn_resnet.train()
    for images, labels in tqdm(train_loader):
        # BN ResNet
        optimizer_bn.zero_grad()
        outputs = bn_resnet(images)
        loss = F.cross_entropy(outputs, labels)
        loss.backward()
        optimizer_bn.step()

        # No BN ResNet
        optimizer_no_bn.zero_grad()
        outputs = no_bn_resnet(images)
        loss = F.cross_entropy(outputs, labels)
        loss.backward()
        optimizer_no_bn.step()

    # Evaluate the BN model on the validation set
    bn_resnet.eval()
    val_loss_bn = 0.0
    val_acc_bn = 0
    with torch.no_grad():
        for images, labels in tqdm(val_loader):
            outputs = bn_resnet(images)
            val_loss_bn += F.cross_entropy(outputs, labels).item() * len(labels)
            _, predicted = torch.max(outputs, 1)
            val_acc_bn += (predicted == labels).sum().item()

    # Print results for the current epoch
    print('Epoch:', epoch + 1,
          'Validation Loss:', val_loss_bn / len(val_loader.dataset),
          'Validation Accuracy:', val_acc_bn / len(val_loader.dataset))
```
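The loop above only tracks validation metrics for the BN model; for a fair comparison, the same evaluation should also be run on `no_bn_resnet`. A small helper (hypothetical, continuing from the snippet above) makes that straightforward:
```python
def evaluate(model, loader):
    """Return average cross-entropy loss and accuracy of `model` on `loader`."""
    model.eval()
    total_loss, correct = 0.0, 0
    with torch.no_grad():
        for images, labels in loader:
            outputs = model(images)
            total_loss += F.cross_entropy(outputs, labels).item() * len(labels)
            correct += (outputs.argmax(dim=1) == labels).sum().item()
    n = len(loader.dataset)
    return total_loss / n, correct / n

print('With BN   :', evaluate(bn_resnet, val_loader))
print('Without BN:', evaluate(no_bn_resnet, val_loader))
```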
In conclusion, batch normalization is a powerful technique that can significantly improve the stability and performance of deep learning models: it mitigates exploding and vanishing gradients, reduces sensitivity to weight initialization, and acts as an implicit regularizer. Incorporating BN layers in convolutional neural networks helps them converge faster and reach better accuracy on a wide range of tasks, including the image-classification comparison above.