Problem statement & goal
Large language models were typically fine-tuned per task. The authors ask: if you scale model, data, and compute enough, can one model do many tasks without weight updates, learning only from examples in the prompt (few-shot)?
Large language models
Brown et al. · NeurIPS 2020
These notes are written in plain language for this specific paper, so you can grasp the ideas before you wrestle with the authors’ formal wording.
Same core idea as GPT-2: a Transformer decoder predicting the next token, trained on a massive web-plus-book mix. At inference time, you condition on a prompt (instructions plus a few input–output examples) and generate the answer; there are no gradient steps on the task.
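To make "the task lives in the prompt" concrete, here is a minimal sketch of assembling a few-shot prompt. The template is illustrative only; the paper does not prescribe one fixed format:

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an in-context prompt: instruction, k demonstrations, new input."""
    lines = [instruction, ""]
    for inp, out in examples:
        lines += [f"Input: {inp}", f"Output: {out}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("dog", "chien")],  # the k "shots"
    "house",
)
print(prompt)
```

The model then completes the text after the final "Output:"; changing the task means changing this string, not the weights.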
Training data is broad web text filtered for quality; evaluation spans dozens of datasets (translation, QA, arithmetic, etc.) with held-out tests where possible.
Many tasks show smooth improvement with scale—larger models do better at in-context learning, though some tasks stay flat or brittle. Read which skills scale and which don’t; that shapes how you’d deploy LLMs today.
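"Smooth improvement with scale" is usually summarized as a power law in parameter count N, loss ≈ a·N^(−b). A toy fit in log-log space; the (size, loss) points below are invented for illustration, not the paper's measurements:

```python
import math

# Hypothetical (parameter count, validation loss) points shaped like a
# smooth scaling curve; illustrative only, not numbers from the paper.
points = [(1.3e8, 3.0), (3.5e8, 2.8), (1.3e9, 2.5), (6.7e9, 2.2), (1.75e11, 1.9)]

xs = [math.log(n) for n, _ in points]
ys = [math.log(loss) for _, loss in points]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)

# Least-squares slope in log-log space: loss ~ a * N**slope.
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
print(f"power-law exponent: {slope:.3f}")  # negative: loss falls with scale
```

A straight line in log-log space is what "smooth scaling" looks like; tasks that stay flat or jagged break this fit.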
Bias, toxicity, and misuse are discussed openly. Environmental cost and uneven access to compute are real. Long prompts cost tokens; hallucinations aren’t solved here.
They position against BERT-style fine-tuning, smaller GPTs, and task-specific architectures. The narrative: scale + in-context learning changes the engineering trade-off between one big model and many small specialists.
Full 175B training is out of reach for most labs; the paper describes the training recipe, but the original release did not include public weights. Smaller open models later democratized pieces of the story.
Eight highlights per paper—why each part matters before you read dense notation and proofs.
No weight updates at task time—only prompt tokens. The model infers the task from examples in the context window, a new axis beyond pre-train/fine-tune.
Performance trends with model size, data, and compute. Smooth curves on some benchmarks support bet-on-scale; jagged failure modes temper that story.
The paper systematically varies demonstration count (zero-, one-, and few-shot). Compare to fine-tuned baselines to see where prompting wins, ties, or loses on 2020-era tasks.
GPT-3 is largely a scale-up of earlier GPT-style Transformer LMs (with alternating dense and locally banded sparse attention layers). Understanding Attention Is All You Need and the earlier GPT papers demystifies the stack.
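As a reminder of the stack's core operation, here is a minimal single-head causal self-attention in plain Python: no batching, no learned projections, dense masking only. A sketch of the mechanism, not the paper's implementation:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def causal_self_attention(q, k, v):
    """Position i attends only to positions j <= i (the autoregressive mask)."""
    d = len(q[0])
    out = []
    for i in range(len(q)):
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in range(i + 1)]  # causal mask: only j <= i
        w = softmax(scores)
        out.append([sum(w[j] * v[j][c] for j in range(i + 1))
                    for c in range(len(v[0]))])
    return out

q = k = v = [[1.0, 0.0], [0.0, 1.0]]
out = causal_self_attention(q, k, v)
print(out[0])  # position 0 sees only itself, so this equals v[0]
```

This mask is what makes the model a next-token predictor: each position's output can depend only on earlier tokens.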
Translation, QA, cloze, Winograd, commonsense, and more. Note which tasks improve smoothly with scale vs. which stay brittle or need retrieval/tooling later.
Large web corpora overlap with test sets. The paper discusses detection and caveats—still central when claiming benchmark breakthroughs.
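The core of the contamination check is n-gram overlap between test examples and training text (the paper's filter is based on 13-grams plus additional heuristics). A simplified sketch; the function names and the small n in the demo are ours, for illustration:

```python
def ngrams(text, n):
    """Set of lowercased word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated(test_example, training_docs, n=13):
    """Flag a test example if any of its n-grams also occurs in training text."""
    test_grams = ngrams(test_example, n)
    return any(test_grams & ngrams(doc, n) for doc in training_docs)

train = ["the quick brown fox jumps over the lazy dog near the river bank today"]
print(contaminated("quick brown fox jumps over the lazy dog", train, n=5))            # True
print(contaminated("an unrelated sentence about something else entirely here", train, n=5))  # False
```

At web scale the training side would be an index rather than a rescan per example, but the flag-on-overlap logic is the same.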
Bias, misuse, and energy show up explicitly. Compare to later model cards, red-teaming, and organizational responsible-AI playbooks.
Instruction tuning and RLHF came after, but GPT-3 established that base LMs + scale + prompts could look “general” without task-specific training.