
Large language models

Language Models are Few-Shot Learners

Brown et al. · NeurIPS 2020

Paper PDF

Open in new tab


If the viewer is blank (blocked by the publisher or your network), use Open in new tab. Scrolling inside the frame moves through the PDF pages when embedding is supported.

Reading map

These notes are written in plain language for this specific paper, so you can grasp the ideas before wrestling with the authors' formal wording. Use the button to open the PDF near the matching section (the page is approximate; Chromium-style viewers support the #page= fragment, other browsers open a new tab).

Problem statement & goal

Large pretrained language models were usually fine-tuned per task. The authors ask: if you scale model, data, and compute enough, can one model handle many tasks without any weight updates, purely by reading examples in the prompt (few-shot)?

Methodology & architecture

Same core idea as GPT-2: a Transformer decoder predicting the next token, trained on a massive mix of filtered web text and books. At inference time you condition on a prompt (instructions plus a few input-output examples) and generate the answer; no gradient steps are taken on the task.
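That loop can be sketched in a few lines, with an invented stand-in scorer in place of the real 175B-parameter model (the vocabulary and function names here are illustrative, not the paper's):

```python
import random

VOCAB = ["yes", "no", "maybe", "<eos>"]

def next_token_logits(tokens):
    """Stand-in for the Transformer decoder's output layer.
    A real model scores the whole vocabulary given the context."""
    random.seed(len(tokens))  # deterministic toy scores
    return [random.random() for _ in VOCAB]

def generate(prompt_tokens, max_new=5):
    """Condition on the prompt, then greedily append next tokens.
    Note: no parameter updates happen anywhere in this loop."""
    tokens = list(prompt_tokens)
    for _ in range(max_new):
        logits = next_token_logits(tokens)
        nxt = VOCAB[max(range(len(VOCAB)), key=logits.__getitem__)]
        if nxt == "<eos>":
            break
        tokens.append(nxt)
    return tokens
```

The point of the sketch is structural: the task is specified entirely by `prompt_tokens`, and "learning" happens only inside the forward pass.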

Datasets & benchmarks

Training data is a quality-filtered Common Crawl plus curated corpora (WebText2, books, Wikipedia); evaluation spans dozens of datasets (translation, QA, arithmetic, and more), with held-out test sets where possible.

Results & evaluation metrics

Many tasks show smooth improvement with scale—larger models do better at in-context learning, though some tasks stay flat or brittle. Read which skills scale and which don’t; that shapes how you’d deploy LLMs today.

Limitations & future work

Bias, toxicity, and misuse are discussed openly. Environmental cost and uneven access to compute are real. Long prompts cost tokens; hallucinations aren’t solved here.

Reproducibility

Full 175B-parameter training is out of reach for most labs; the paper gives enough of the recipe for researchers, but the original release did not include public weights. Smaller open models later democratized pieces of the story.

What to focus on

Eight highlights: why each part matters before you dive into dense notation and proofs.

In-context learning

No weight updates at task time, only prompt tokens. The model infers the task from examples in the context window, a new axis beyond the pre-train/fine-tune paradigm.

Scaling laws preview

Performance trends with model size, data, and compute. Smooth curves on some benchmarks support the bet-on-scale story; jagged failure modes temper it.
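A quick way to see why "smooth curves" matter: scaling-law plots are roughly straight lines on log-log axes, i.e. loss follows a power law in model size. The constants below are invented for illustration, not the paper's fitted values:

```python
import numpy as np

a, b = 2.0, 0.07  # hypothetical constants, NOT fitted to real data
N = np.array([1e8, 1e9, 1e10, 1e11])  # parameter counts
loss = a * N ** (-b)  # power law: L(N) = a * N^(-b)

# On log-log axes this is a straight line with slope -b,
# which is why scaling plots are read as straight-line fits:
slope, intercept = np.polyfit(np.log(N), np.log(loss), 1)
```

The slope recovered from the log-log fit is exactly -b, so extrapolating the line is the same as betting the power law keeps holding at larger N.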

Zero-, one-, few-shot

The paper systematically varies demonstration count. Compare to fine-tuned baselines to see where prompting wins, ties, or loses on 2020-era tasks.
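The zero-/one-/few-shot distinction is just how many demonstrations precede the query. A sketch using an invented `input => output` template (the paper uses task-specific formats):

```python
def k_shot_prompt(demos, query, k):
    """Build a zero- (k=0), one- (k=1), or few-shot (k>1) prompt
    from the first k demonstrations. Template is illustrative."""
    blocks = [f"{x} => {y}" for x, y in demos[:k]]
    blocks.append(f"{query} =>")
    return "\n".join(blocks)

demos = [("cheese", "fromage"), ("dog", "chien"), ("sea", "mer")]
zero_shot = k_shot_prompt(demos, "cat", k=0)
few_shot = k_shot_prompt(demos, "cat", k=3)
```

Sweeping k while holding the model fixed is exactly the experiment the paper runs across its benchmark suite.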

Architecture continuity

GPT-3 is largely a scale-up of Transformer language models (the paper alternates dense and locally banded sparse attention layers, as in the Sparse Transformer). Understanding Attention Is All You Need and the earlier GPT papers demystifies the stack.
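For reference, the dense causal attention being scaled up is scaled dot-product attention with a triangular mask; a minimal single-head NumPy sketch:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask:
    position i attends only to positions <= i."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # upper triangle (future positions) is masked out
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)
    # numerically stable softmax over each row
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = causal_attention(Q, K, V)
```

Because of the mask, the first position can only attend to itself, so its output equals its own value vector; that property is what makes left-to-right generation consistent.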

Evaluation breadth

Translation, QA, cloze, Winograd, commonsense, and more. Note which tasks improve smoothly with scale vs. which stay brittle or need retrieval/tooling later.

Memorization & contamination

Large web corpora overlap with test sets. The paper discusses detection and caveats—still central when claiming benchmark breakthroughs.
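The detection idea reduces to checking n-gram overlap between training documents and test examples. A crude sketch in that spirit (the paper's actual analysis uses 13-gram collisions with more careful normalization; function names and the n value here are illustrative):

```python
def ngrams(text, n):
    """All word-level n-grams of a text, lowercased."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlaps(train_doc, test_example, n=8):
    """Flag a test example if any n-gram also appears in training text."""
    return bool(ngrams(train_doc, n) & ngrams(test_example, n))
```

Flagged examples are then removed or reported separately, which is why contamination audits remain standard practice when claiming benchmark breakthroughs.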

Ethics & deployment

Bias, misuse, and energy costs show up explicitly. Compare to later model cards, red-teaming practice, and organizational responsible-AI playbooks.

Bridge to ChatGPT era

Instruction tuning and RLHF came after, but GPT-3 established that base LMs + scale + prompts could look “general” without task-specific training.
