Convolutional Neural Network
ConvNet · CNN
A neural network architecture that uses learned convolution filters over local regions of an input grid — images, audio spectrograms, or any tensor with spatial structure.
In one line
A network built from convolutional layers — small learned filters slid across the input — so nearby pixels share structure and the model is translation-equivariant.
What it actually means
A convolution layer applies the same small kernel (say 3x3) at every spatial position of the input, producing a feature map. Stacking convs with strides or pooling shrinks the spatial dimensions and grows the channel count, so later layers see larger receptive fields and more abstract features. Early layers learn edges and textures, middle layers learn parts, top layers learn objects. Weight sharing across positions means CNNs have far fewer parameters than a dense network for the same image, and they generalize better because a cat is a cat whether it’s in the top-left or bottom-right.
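The parameter-savings claim is easy to check directly. A minimal sketch, assuming a 32×32 RGB input (the sizes here are illustrative, not from the text): one 3×3 conv producing 32 channels versus a dense layer producing the same number of output values.

```python
import torch.nn as nn

# Same 3x3 kernel reused at every spatial position: 32 filters x (3x3x3) weights + 32 biases
conv = nn.Conv2d(3, 32, kernel_size=3, padding=1)

# One weight per input-output pair for an equivalent dense mapping
dense = nn.Linear(32 * 32 * 3, 32 * 32 * 32)

conv_params = sum(p.numel() for p in conv.parameters())
dense_params = sum(p.numel() for p in dense.parameters())
print(conv_params)   # 896
print(dense_params)  # 100696064 (~100M)
```

Weight sharing buys a factor of roughly 100,000 here, and the gap only widens with larger images.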
Why it matters
From 2012 (AlexNet) to roughly 2020, CNNs owned computer vision. Vision Transformers have taken the frontier since, but CNNs remain the right default for most real-world CV tasks — they’re data-efficient, fast, and run fine on CPU. If you’re shipping an image classifier tomorrow, a pretrained ResNet or EfficientNet is still the sensible starting point.
Example
import torch.nn as nn

# Two conv blocks: each 3x3 conv (padding=1) keeps the spatial size,
# each MaxPool2d halves it, and the channel count grows 3 -> 32 -> 64.
backbone = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)
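A quick shape check of the backbone above makes the "spatial dims shrink, channels grow" pattern concrete — assuming a 32×32 RGB input for illustration:

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)

x = torch.randn(1, 3, 32, 32)   # batch of one 32x32 RGB image
out = backbone(x)
print(out.shape)                # torch.Size([1, 64, 8, 8])
```

Each pooling step halves height and width (32 → 16 → 8), so every unit in the final feature map summarizes a progressively larger patch of the input.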
You’ll hear it when
- Picking a backbone for image classification, detection, or segmentation.
- Reading older CV papers (everything pre-ViT).
- Discussing receptive fields, stride, padding, and pooling.
- Comparing CNN vs ViT on a small dataset.