Motivation
Fully connected layers don’t exploit spatial structure: every output unit is connected to every input pixel. Convolutional neural networks (CNNs) instead use local connectivity and weight sharing, which sharply reduces parameter count and lets the same feature detector apply anywhere in the image.
Convolution Operation
A 2D convolution slides a small kernel over the input, computing a weighted sum of the pixels under the kernel at each spatial position.
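As a sketch, for an input I and a k x k kernel K, the output at position (i, j) can be written as:

```latex
S(i, j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} I(i+m,\, j+n)\, K(m, n)
```

Strictly speaking this is cross-correlation (true convolution flips the kernel), but it is the operation most deep-learning frameworks implement, and the distinction is irrelevant when the kernel weights are learned.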
Key Components
Convolutional Layer
Applies learnable filters to extract features (edges, textures, shapes).
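The sliding-window computation behind a single filter can be sketched in pure Python. This is an illustrative single-channel version with no padding or stride; the function name and test data are made up for the example.

```python
def conv2d(image, kernel):
    """Valid (no-padding) 2D cross-correlation on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the kh x kw patch under the kernel.
            s = sum(image[i + m][j + n] * kernel[m][n]
                    for m in range(kh) for n in range(kw))
            row.append(s)
        out.append(row)
    return out

# A 2x2 kernel over a 4x4 image yields a 3x3 output.
image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2],
         [1, 1, 1, 1]]
kernel = [[1, 0],
          [0, -1]]
```

A real layer does this for many filters and channels at once, with the kernel entries as learnable weights.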
Pooling Layer
Reduces spatial dimensions, cutting computation and giving limited translation invariance. Max pooling takes the maximum value in each window.
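Max pooling over non-overlapping windows can be sketched the same way; this is a minimal pure-Python illustration (the function name and feature map are made up for the example).

```python
def max_pool2d(x, size=2):
    """Non-overlapping max pooling on a nested-list feature map."""
    out = []
    for i in range(0, len(x) - size + 1, size):
        row = []
        for j in range(0, len(x[0]) - size + 1, size):
            # Keep only the largest activation in each size x size window.
            row.append(max(x[i + m][j + n]
                           for m in range(size) for n in range(size)))
        out.append(row)
    return out

fmap = [[1, 3, 2, 0],
        [4, 6, 1, 5],
        [7, 2, 9, 8],
        [0, 1, 3, 4]]
```

With a 2x2 window, each halving of width and height discards three quarters of the activations while keeping the strongest response in each region.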
Batch Normalization
Normalizes each channel's activations to zero mean and unit variance over the batch, then applies a learnable scale and shift. This stabilizes and often accelerates training.
Classic Architectures
- LeNet (1998): pioneered CNNs for digit recognition
- AlexNet (2012): deeper, used ReLU and dropout
- VGG (2014): uniform 3x3 convolutions, very deep
- ResNet (2015): skip connections, enabled 100+ layer networks
PyTorch Example
import torch.nn as nn

# Assumes a 28x28 single-channel input (e.g. MNIST).
model = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1),   # 28x28 -> 28x28, 32 channels
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 28x28 -> 14x14
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # 14x14 -> 14x14, 64 channels
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 14x14 -> 7x7
    nn.Flatten(),                                 # 64 * 7 * 7 = 3136 features
    nn.Linear(64 * 7 * 7, 10),                    # 10 class logits
)
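The 64 * 7 * 7 input size of the final Linear layer follows from the standard output-size formula, assuming a 28x28 input as above (helper name is made up for the example):

```python
def conv_out(size, kernel=3, padding=1, stride=1):
    """Spatial output size of a conv layer: (n + 2p - k) // s + 1."""
    return (size + 2 * padding - kernel) // stride + 1

size = 28              # assumed 28x28 input
size = conv_out(size)  # 3x3 conv, padding=1 keeps 28
size = size // 2       # 2x2 max pool -> 14
size = conv_out(size)  # second conv keeps 14
size = size // 2       # second pool -> 7
flat = 64 * size * size  # features entering the final Linear layer
```

Changing the input resolution changes this number, which is why the Linear layer's input size must be recomputed for other image sizes.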