Problem statement & goal
Very deep CNNs should get better, but in practice optimization gets harder: accuracy can saturate or worsen as you stack more layers. The paper asks: how can we go much deeper without that degradation?
Very deep networks
He et al. · CVPR 2016
These notes are written in plain language for this specific paper, so you can grasp the ideas before you wrestle with the authors' formal wording.
Instead of asking each stack of layers to learn a full mapping, the layers learn a residual: a small correction added to the input, which is carried forward unchanged by an identity skip connection. When dimensions change, a projection shortcut aligns shapes. Bottleneck blocks keep compute under control.
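As a minimal sketch of that idea, with fully connected matrices standing in for the paper's conv + batch-norm blocks (`residual_block` and the shapes here are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2, P=None):
    """One residual block: output = relu(F(x) + shortcut(x)).

    F(x) = W2 @ relu(W1 @ x) is the learned correction; the shortcut is
    the identity when shapes match, or a projection P when they change.
    """
    fx = W2 @ relu(W1 @ x)                  # residual branch F(x)
    shortcut = x if P is None else P @ x    # identity or projection shortcut
    return relu(fx + shortcut)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
# Same-shape block: the shortcut is a plain identity.
y = residual_block(x, rng.standard_normal((8, 8)), rng.standard_normal((8, 8)))
# Shape-changing block: a 16x8 projection aligns the shortcut with F(x).
z = residual_block(x, rng.standard_normal((16, 8)), rng.standard_normal((16, 16)),
                   P=rng.standard_normal((16, 8)))
print(y.shape, z.shape)   # (8,) (16,)
```

Note that the projection `P` is only used where shapes change; everywhere else the shortcut adds nothing to parameter count.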
Training uses ImageNet classification; they also transfer to detection and segmentation benchmarks (COCO, etc.) to show the backbone helps beyond one task.
Top-1/top-5 error drops as depth increases (e.g., from 18 to 34 layers) where plain nets get worse. Students should look at the curves: residual connections fix the "deeper is worse" degradation on the training error as well, not just test error.
Depth and width trade off compute and memory. Extremely deep nets need careful initialization, batch normalization, and learning-rate schedules (later work refines these further). Transfer isn't free: you still need task-specific heads and tuning.
They compare to VGG-style nets and earlier ImageNet winners and cite highway nets and other shortcut ideas. ResNets become the default backbone for years because they’re simple and reliable.
Architectures, stride rules, and projection shortcuts are specified in detail; the ImageNet training recipe follows common practice of the era. Teams routinely reproduce the ResNet families in PyTorch tutorials.
Eight highlights per paper—why each part matters before you read dense notation and proofs.
Adding layers to a plain conv stack can increase training error—not just test error. That motivates residual reformulation instead of blindly stacking depth.
Each block learns a perturbation F(x) added to the identity x, so the block computes H(x) = F(x) + x. If the optimal mapping is near the identity, driving F toward zero is easier than reproducing x from scratch in every block.
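A tiny demo of why F ≈ 0 is the easy case: with the residual branch's weights at zero, the block is exactly the identity mapping. (A NumPy sketch with made-up names, omitting the post-addition ReLU so the identity is exact.)

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def block(x, W1, W2):
    # H(x) = F(x) + x with F(x) = W2 @ relu(W1 @ x); no nonlinearity
    # after the addition, so zero weights give the exact identity.
    return W2 @ relu(W1 @ x) + x

x = np.arange(4.0)
zeros = np.zeros((4, 4))
# With F's weights at zero, the block passes x through unchanged.
assert np.allclose(block(x, zeros, zeros), x)
```

A plain (shortcut-free) stack would instead have to learn weights that reproduce x exactly, which is a much harder optimization target.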
Identity paths carry gradients across blocks, easing optimization in very deep nets. Core intuition behind modern deep vision and many non-vision architectures.
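A toy scalar calculation of the gradient-flow intuition (the constant slope `s` is an assumption for illustration, not a trained network): each plain layer multiplies the backward gradient by its local slope, while a residual block multiplies by 1 plus that slope, so the product cannot collapse to zero.

```python
# Toy chain rule through 50 layers, each with local slope s = 0.5.
depth, s = 50, 0.5
plain_grad = s ** depth        # plain net: product of slopes vanishes (~8.9e-16)
resid_grad = (1 + s) ** depth  # residual net: the identity adds 1 to each factor
print(plain_grad, resid_grad)
```

This is a caricature (real Jacobians are not constant scalars), but it captures why identity paths keep a useful gradient signal alive at depth.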
Empirical curves show ResNets train deeper without the degradation cliff. Use them to explain why “just add layers” failed before shortcuts.
1×1–3×3–1×1 bottleneck stacks cut FLOPs while preserving depth. The ResNet-50/101/152 names map to counts of these blocks, which is useful when papers say "ResNet-50 backbone."
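Two bits of arithmetic behind that highlight, sketched in Python (the 256→64→256 channel sizes follow the paper's bottleneck design; the helper name is mine): the per-stage block counts explain the model names, and a quick weight count shows the bottleneck's savings over two full-width 3×3 convs.

```python
# Each bottleneck block holds three conv layers (1x1, 3x3, 1x1); the stem
# conv and the final fc layer add two more, recovering the model names.
def resnet_depth(blocks_per_stage):
    return 3 * sum(blocks_per_stage) + 2

print(resnet_depth([3, 4, 6, 3]))    # 50  -> ResNet-50
print(resnet_depth([3, 4, 23, 3]))   # 101 -> ResNet-101
print(resnet_depth([3, 8, 36, 3]))   # 152 -> ResNet-152

# Weight count (ignoring biases/BN) at 256 channels: one bottleneck vs.
# two plain 3x3 convs at full width.
bottleneck = 256 * 64 + 64 * 64 * 9 + 64 * 256   # 1x1 down, 3x3, 1x1 up
plain_pair = 2 * 256 * 256 * 9                   # two full-width 3x3 convs
print(bottleneck, plain_pair)                    # 69632 vs 1179648
```

The 1×1 convs squeeze to 64 channels before the expensive 3×3, which is why the bottleneck uses roughly 17× fewer weights here.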
When spatial size or channels change, shortcuts use projection convs to match shapes. Trace one block diagram through tensor sizes to demystify code.
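The shape tracing can be done on paper with the standard conv output-size formula (the 56×56 → 28×28 stage transition is a common ResNet example; the helper function is an illustrative sketch):

```python
# Standard conv output size: floor((size + 2*pad - kernel) / stride) + 1.
def conv_out(size, kernel, stride, pad):
    return (size + 2 * pad - kernel) // stride + 1

# Main path: 3x3 conv, stride 2, pad 1.  Shortcut: 1x1 projection, stride 2.
main = conv_out(56, kernel=3, stride=2, pad=1)
short = conv_out(56, kernel=1, stride=2, pad=0)
print(main, short)   # 28 28 -- both paths land on the same spatial size
```

The projection's stride matches the main path's, and its output channels match F(x)'s, so the addition is well defined after the transition.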
ResNet backbones power Faster R-CNN, FPN, and many medical imaging nets. Understanding blocks here transfers to almost any ResNet backbone line in methods sections.
Residual connections appear in Transformers (pre-norm), audio, and video models. Stable signal paths through depth are architectural hygiene, not a CV-only trick.