Transformers

Attention Is All You Need

Vaswani et al. · NeurIPS 2017

Paper PDF

Open in new tab

If the viewer is blank (blocked by the publisher or your network), use Open in new tab. Scrolling inside the frame moves through the PDF pages when embedding is supported.

Reading map

These notes are written in plain language for this specific paper—so you can grasp the ideas before you wrestle with the authors’ formal wording. Use the button to open the PDF near the matching section (approximate page; Chromium-style viewers support #page=, otherwise we open a new tab).

Problem statement & goal

Recurrent models process text step by step, which limits parallel training on long sequences, and convolutional models need many stacked layers to connect distant positions. The authors propose a model built only on attention—no recurrence—so an entire sequence can be processed in parallel during training while still capturing long-range dependencies.

Methodology & architecture

Multi-head self-attention lets each position look at all others to build context. Add positional encodings (there is no recurrence to convey order), residual connections, and layer normalization, then stack the layers into an encoder–decoder (e.g., for translation). The famous Figure 1 is the map of the data flow.

Datasets & benchmarks

They train on the WMT English–German and English–French tasks—standard MT benchmarks—so BLEU scores compare directly to prior published systems.

Results & evaluation metrics

BLEU improves over strong RNN-plus-attention baselines, and training cost drops sharply because attention parallelizes across the sequence. Look at the training-cost vs. quality tables, not only peak BLEU.

Limitations & future work

Attention is quadratic in sequence length (every token attends to every token), so very long documents get expensive in both compute and memory. Relative position encodings and later sparse/long-context methods address this.

Reproducibility

Model size, warmup schedule, and hyperparameters are spelled out; the appendix is detailed. Modern reimplementations are everywhere (e.g., The Annotated Transformer), so students can line up code with the paper line by line.

What to focus on

Eight highlights per paper—why each part matters before you read dense notation and proofs.

Why drop recurrence

RNNs serialize over time steps; self-attention lets every position attend to every other within a single layer. That unlocks massive parallel training on accelerators.

Scaled dot-product attention

Softmax(QKᵀ/√dₖ)V is the workhorse. Scaling by 1/√dₖ keeps the dot products from growing with dimension and saturating the softmax: a small detail with a large stability impact.
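The formula above can be sketched in a few lines of NumPy (a toy single-head version with random inputs for illustration, not the paper's trained model):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V — the core operation of the paper.

    Q, K: (seq_len, d_k); V: (seq_len, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_q, seq_k) logits
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Try removing the `/ np.sqrt(d_k)` with a larger `d_k` and watch the softmax collapse toward one-hot weights—that is the saturation the scaling prevents.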

Multi-head attention

Several attention heads in parallel learn different relationship patterns; concatenation and a final projection mix them. Read it as an ensemble of cheap pairwise routers.
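A minimal NumPy sketch of the split–attend–concat–project pattern (random matrices stand in for learned weights; the sizes are arbitrary assumptions):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """Toy single-batch multi-head self-attention."""
    n, d_model = X.shape
    d_k = d_model // h
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # split each (n, d_model) matrix into h heads of width d_k: (h, n, d_k)
    split = lambda M: M.reshape(n, h, d_k).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)    # (h, n, n)
    heads = softmax(scores) @ Vh                          # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model) # re-join heads
    return concat @ Wo                                    # final projection

rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
X = rng.normal(size=(n, d_model))
W = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4)]
out = multi_head_attention(X, *W, h=h)
print(out.shape)  # (5, 16)
```

Note that each head works in a subspace of width d_model/h, so the total cost is about the same as one full-width head.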

Encoder vs. decoder masks

Encoder uses full self-attention; decoder masks future tokens so generation stays causal. Confusing the two breaks autoregressive inference.
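The causal mask is easy to demonstrate: set the logits for future positions to −∞ before the softmax, and their attention weights become exactly zero (a toy 4-token example):

```python
import numpy as np

n = 4
# decoder self-attention: position i may only attend to positions <= i
mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above diagonal
scores = np.zeros((n, n))
scores[mask] = -np.inf              # -inf logits become 0 after softmax
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))
# lower-triangular: row i spreads weight uniformly over positions 0..i
```

The encoder simply skips the masking step, so every row attends over the full sequence.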

Positional encodings

Attention is permutation-invariant without position info. Sinusoidal or learned embeddings inject order—critical for language and later adapted in vision patches.
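The sinusoidal variant from the paper is easy to reproduce: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos of the same angle (toy sizes below):

```python
import numpy as np

def sinusoidal_positions(n_pos, d_model):
    """Sinusoidal positional encodings as defined in the paper."""
    pos = np.arange(n_pos)[:, None]          # (n_pos, 1) positions
    i = np.arange(0, d_model, 2)[None, :]    # even embedding dimensions
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dims get sine
    pe[:, 1::2] = np.cos(angles)             # odd dims get cosine
    return pe

pe = sinusoidal_positions(50, 16)
print(pe.shape)   # (50, 16)
print(pe[0, :4])  # position 0: [0, 1, 0, 1] — sin(0), cos(0) pattern
```

Each dimension oscillates at a different wavelength, so nearby positions get similar codes while distant ones stay distinguishable.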

Residual + layer norm

Pre/post-norm variants differ, but the theme matches ResNet: stabilize deep stacks. Most modern LLM stacks are variations on this sandwich.
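The two orderings can be sketched side by side (a toy NumPy version without learnable gain/bias; `tanh` stands in for the attention or feed-forward sublayer):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    # original paper: LayerNorm(x + Sublayer(x))
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # common modern variant: x + Sublayer(LayerNorm(x))
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
f = np.tanh  # stand-in sublayer
print(post_norm_block(x, f).shape, pre_norm_block(x, f).shape)
```

Pre-norm keeps an unnormalized residual stream flowing end to end, which is part of why it trains more stably at depth.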

Complexity trade-offs

Full self-attention is O(n²) in time and memory with sequence length n. Compare to the per-step cost of an RNN and note why long-context methods (sparse, linear, sliding-window) exist.
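A quick back-of-the-envelope calculation shows why this matters: storing one full n×n attention matrix in float32, per head, grows fast (the sequence lengths below are illustrative):

```python
# one float32 attention matrix: n * n entries * 4 bytes each, per head
for n in (512, 4096, 32768):
    mib = n * n * 4 / 2**20
    print(f"n = {n:>6}: {mib:8.1f} MiB")
# n =    512:      1.0 MiB
# n =   4096:     64.0 MiB
# n =  32768:   4096.0 MiB
```

Multiply by heads, layers, and batch size and full attention over book-length inputs becomes impractical without the long-context tricks above.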

Vocabulary for everything after

Method and experiments define terms reused in BERT, GPT, ViT, and diffusion transformers. Map Figure 1 to tensor shapes once—it pays off across papers.
