Very deep networks

Deep Residual Learning for Image Recognition

He et al. · CVPR 2016


Reading map

These notes are written in plain language for this specific paper—so you can grasp the ideas before you wrestle with the authors’ formal wording. Each section below maps to a part of the paper.

Problem statement & goal

Very deep CNNs should get better, but in practice optimization gets harder: accuracy can saturate or worsen as you stack more layers. The paper asks: how can we go much deeper without that degradation?

Methodology & architecture

Instead of asking each stack to learn a full mapping, layers learn a residual—a small correction added to the identity of the input (skip connections). When dimensions change, a projection aligns shapes. Bottleneck blocks keep compute under control.
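
The block structure above can be sketched in a few lines. This is a toy version with dense layers standing in for convolutions—the function and variable names are illustrative, not from the paper’s code:

```python
import numpy as np

def residual_block(x, W1, W2, W_proj=None):
    """One residual block, simplified to dense layers.

    Computes y = F(x) + shortcut(x), where F is two weight layers with a
    ReLU between them. If the output width differs from the input width,
    a linear projection (W_proj) aligns shapes, mirroring the paper's
    projection shortcut; otherwise the shortcut is the identity.
    """
    h = np.maximum(0, W1 @ x)               # first layer + ReLU
    f = W2 @ h                              # second layer: the residual F(x)
    shortcut = x if W_proj is None else W_proj @ x
    return np.maximum(0, f + shortcut)      # add, then the final ReLU

# With F's weights at zero, the block passes a non-negative input through
# unchanged: the identity mapping is the "easy" solution.
x = np.array([1.0, 2.0, 3.0])
W1 = np.zeros((4, 3))
W2 = np.zeros((3, 4))
out = residual_block(x, W1, W2)             # equals x
```

The point of the sketch: the block only has to learn the correction `F(x)`, and doing nothing at all already yields a sensible mapping.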

Datasets & benchmarks

Training uses ImageNet classification; they also transfer to detection and segmentation benchmarks (COCO, etc.) to show the backbone helps beyond one task.

Results & evaluation metrics

Top-1 / top-5 error drops as depth increases (e.g., from 18 to 34 layers), where plain nets get worse instead. Look at the curves: residuals fix the “deeper is worse” phenomenon on the training side too, not just at test time.

Limitations & future work

Depth and width trade off compute and memory. Extremely deep nets need careful initialization, batch norm, and schedules (later work refines further). Transfer isn’t free—you still need task-specific heads and tuning.

Reproducibility

Architectures, stride rules, and projection shortcuts are specified in depth; ImageNet training recipe follows common practice of the era. Teams routinely reproduce ResNet families in PyTorch tutorials.

What to focus on

Eight highlights—why each part matters before you read the dense notation and proofs.

Degradation problem

Adding layers to a plain conv stack can increase training error—not just test error. That motivates residual reformulation instead of blindly stacking depth.

Residual mapping

Each block computes H(x) = F(x) + x: the layers learn a perturbation F(x) added to the identity x. If the optimal mapping is near identity, driving F toward zero is easier than reproducing x from scratch in each block.
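
A one-dimensional toy example makes the reparameterization concrete (the numbers here are invented for illustration). Suppose the optimal mapping is close to identity, H(x) = 1.01·x:

```python
# A plain layer must learn the full weight 1.01 to represent H directly;
# a residual layer only learns the small correction 0.01, because the
# block computes H(x) = x + F(x).
def plain(x, w):
    return w * x            # layer learns H itself

def residual(x, w):
    return x + w * x        # layer learns only the residual F

x = 10.0
assert abs(plain(x, 1.01) - residual(x, 0.01)) < 1e-9
```

Both forms can express the same function; the residual form just puts the solution near zero weights, which gradient descent reaches easily.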

Shortcut gradients

Identity paths carry gradients across blocks, easing optimization in very deep nets. Core intuition behind modern deep vision and many non-vision architectures.
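
A toy one-dimensional model (not the paper’s analysis) shows why the identity path matters for backprop. In a plain stack, the gradient is a product of per-layer slopes and can vanish; with shortcuts, each block contributes (1 + slope), so the identity term keeps gradients alive:

```python
# Backprop through n stacked blocks, reduced to scalar slopes.
def plain_grad(slopes):
    g = 1.0
    for s in slopes:
        g *= s              # product of slopes: shrinks fast when s < 1
    return g

def residual_grad(slopes):
    g = 1.0
    for s in slopes:
        g *= (1.0 + s)      # identity contributes the 1; never collapses to 0
    return g

slopes = [0.1] * 20          # twenty weak layers, slope 0.1 each
print(plain_grad(slopes))    # ~1e-20: vanished
print(residual_grad(slopes)) # (1.1)**20 ≈ 6.7: healthy
```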

ImageNet & CIFAR evidence

Empirical curves show ResNets train deeper without the degradation cliff. Use them to explain why “just add layers” failed before shortcuts.

Bottleneck design

1×1–3×3–1×1 stacks cut FLOPs while preserving depth. The 50/101/152 in ResNet-50/101/152 count the layers built from these bottleneck blocks—useful when papers say “ResNet-50 backbone.”
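
The savings are easy to verify by counting weights. This compares a bottleneck block at 256 input channels (1×1 down to 64, 3×3 at 64, 1×1 back to 256) against two plain 3×3 layers at full width, ignoring biases and batch norm:

```python
# Parameter count of a conv layer: kernel_h * kernel_w * c_in * c_out.
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

bottleneck = (conv_params(1, 256, 64)     # 1x1 reduce: 16,384
              + conv_params(3, 64, 64)    # 3x3 at reduced width: 36,864
              + conv_params(1, 64, 256))  # 1x1 restore: 16,384

plain_3x3 = 2 * conv_params(3, 256, 256)  # two 3x3 layers at full width

print(bottleneck)   # 69,632
print(plain_3x3)    # 1,179,648 — roughly 17x more
```

Three layers of depth for a fraction of the cost of two full-width 3×3 layers is what makes ResNet-50 and deeper variants practical.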

Stride & projection shortcuts

When spatial size or channels change, shortcuts use projection convs to match shapes. Trace one block diagram through tensor sizes to demystify code.
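
Tracing one downsampling block with the standard convolution output-size formula shows why the shapes work out. The sizes below match a stride-2 transition stage (56×56 feature maps halved to 28×28):

```python
# Output spatial size of a convolution: floor((n + 2p - k) / s) + 1.
def conv_out(n, k, s, p):
    return (n + 2 * p - k) // s + 1

n = 56
n_main = conv_out(n, k=3, s=2, p=1)       # 3x3, stride 2, pad 1 -> 28
n_main = conv_out(n_main, k=3, s=1, p=1)  # 3x3, stride 1, pad 1 -> 28

# The projection shortcut is a 1x1 conv with the same stride 2, so the
# shortcut's spatial size (and, via its output channels, the channel
# count) matches the main path and F(x) + x is well defined.
n_short = conv_out(n, k=1, s=2, p=0)      # -> 28
assert n_main == n_short == 28
```

Doing this once by hand makes the `downsample` branch you see in most ResNet implementations unsurprising.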

Detection & segmentation reuse

ResNet backbones power Faster R-CNN, FPN, and many medical imaging nets. Understanding blocks here transfers to almost any ResNet backbone line in methods sections.

Beyond vision

Residual connections appear in Transformers (pre-norm), audio, and video models. Stable signal paths through depth are architectural hygiene, not a CV-only trick.
