Problem statement & goal
Go has vastly more positions than chess; brute-force search alone fails. The aim: combine deep learning with tree search so a program can reach strong amateur / professional play using human games plus self-play.
Games & planning
Silver et al. · Nature 2016
These notes are written in plain language for this specific paper, so you can grasp the ideas before you wrestle with the authors' formal wording.
Policy networks suggest promising moves; value networks estimate who is winning from a position. Monte Carlo tree search rolls out many simulated games using those networks to pick the next move. Training mixes supervised learning from human games with RL from self-play.
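The loop above can be sketched in miniature. This is an illustrative toy, not the authors' code: `Node`, `select_child`, and the `policy_fn`/`value_fn` stand-ins are assumed names, and the "board" is just a tuple of moves. It shows the shape of one simulation: descend by a PUCT-style rule, expand the leaf with policy-network priors, back up the value estimate.

```python
import math

class Node:
    def __init__(self, prior):
        self.prior = prior      # P(s, a) from the policy network
        self.visits = 0         # N(s, a)
        self.value_sum = 0.0    # W(s, a)
        self.children = {}      # action -> Node

    def q(self):
        # Mean action value Q(s, a); zero before any visit.
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node, c_puct=1.0):
    # Pick the child maximizing Q + U, where U favors high-prior,
    # rarely visited moves (the exploration term).
    total = sum(ch.visits for ch in node.children.values())
    def score(ch):
        u = c_puct * ch.prior * math.sqrt(total) / (1 + ch.visits)
        return ch.q() + u
    return max(node.children.items(), key=lambda kv: score(kv[1]))

def simulate(root, policy_fn, value_fn, state):
    # One simulation: select down to a leaf, expand it with policy
    # priors, then back the value estimate up the visited path.
    path, node = [root], root
    while node.children:
        action, node = select_child(node)
        state = state + (action,)   # toy "apply move"
        path.append(node)
    for action, p in policy_fn(state).items():
        node.children[action] = Node(prior=p)
    v = value_fn(state)
    for n in path:
        n.visits += 1
        n.value_sum += v
    return v
```

After many such simulations, the move played is typically the most-visited child of the root, not the highest-Q one; visit counts are a more robust statistic.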
Data comes from online human games (KGS) and later self-play games the system plays against itself. Evaluation includes matches against strong humans and other Go programs—not just a static test set.
The headline result: AlphaGo beats a top professional under match conditions. For students, notice both the Elo-style ratings and the head-to-head evidence; this is as much a systems-and-RL story as a single-metric one.
The method is compute-heavy and engineering-intensive; domain knowledge (rules, symmetries) still matters. Transfer to other games isn’t automatic—AlphaZero later removes human data but keeps the search + learning pattern.
They relate to earlier Go programs, chess engines, and RL successes. The novelty is the tight loop between deep nets and large-scale MCTS at superhuman scale.
A full replication needs massive compute and distributed training—not a weekend script. The paper and follow-ups describe architecture and training stages, but this is closer to a lab + cluster project than a single-GPU notebook.
Eight highlights: why each part matters before you read the dense notation and proofs.
Enormous branching factor and long games defeat brute-force search alone. AlphaGo shows learned priors and value estimates can shrink the search tree intelligently.
MCTS balances exploration and exploitation over possible moves. Deep nets provide move probabilities and position values so each simulation is more informative.
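A small numeric illustration of that balance (not the paper's code): the PUCT-style score Q + c·P·√N_total/(1+N) lets a rarely visited move with a strong policy prior outrank a well-explored move with a slightly better empirical value.

```python
import math

def puct(q, prior, parent_visits, child_visits, c_puct=1.0):
    # Exploitation (q) plus an exploration bonus that grows with the
    # policy prior and shrinks as the move accumulates visits.
    return q + c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)

# Heavily explored move, decent value, weak prior:
explored = puct(q=0.55, prior=0.1, parent_visits=100, child_visits=50)
# Barely explored move with a strong prior from the policy network:
fresh = puct(q=0.50, prior=0.4, parent_visits=100, child_visits=2)
```

Here `fresh` scores higher, so the search spends its next simulation on the promising-but-unexplored branch; as its visit count grows, the bonus decays and Q takes over.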
Supervised learning on expert games seeds a strong prior over moves. That prior guides which branches MCTS expands first—critical for compute efficiency.
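The supervised stage amounts to cross-entropy training on (position, expert move) pairs: raise the log-probability of the move the human actually played. A minimal sketch with a bare softmax over three toy moves stands in for the paper's convolutional policy network; `sl_step` and the learning rate are illustrative choices.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sl_step(logits, expert_move, lr=0.5):
    # Cross-entropy gradient w.r.t. logits is (p - onehot(expert_move)),
    # so we step each logit toward the expert's choice.
    p = softmax(logits)
    return [x - lr * (pi - (1.0 if i == expert_move else 0.0))
            for i, (x, pi) in enumerate(zip(logits, p))]

logits = [0.0, 0.0, 0.0]            # uniform prior over three toy moves
for _ in range(50):                  # repeatedly show the same expert move
    logits = sl_step(logits, expert_move=1)
```

After a few dozen steps the policy concentrates on move 1; at Go scale the same objective, over millions of KGS positions, yields the prior that steers MCTS expansion.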
A value network predicts the win probability directly from a board state; AlphaGo mixes it with fast rollouts, and later variants drop rollouts entirely. Faster evaluation enables deeper search within the same budget.
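The paper's leaf evaluation blends the two signals: V(s_L) = (1 − λ)·v_θ(s_L) + λ·z_L, where v_θ is the value network's estimate and z_L the fast rollout's outcome. The sketch below uses plain numbers as stand-ins for the real networks.

```python
def leaf_value(v_net, z_rollout, lam=0.5):
    # lam = 0 trusts the value network alone; lam = 1 trusts the
    # fast rollout alone; the paper found a mix works best.
    return (1.0 - lam) * v_net + lam * z_rollout

# Value net says 0.7, rollout ended in a win (z = 1.0):
mixed = leaf_value(v_net=0.7, z_rollout=1.0)
```

With the default mix this leaf backs up 0.85; setting λ = 0 recovers the rollout-free evaluation that later variants adopt.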
Games against itself generate fresh training targets beyond human databases. The system can improve past the best human data—a key shift for game AI.
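How self-play manufactures training targets can be shown on a toy game: play the current policy against itself, then pair every visited state with the final outcome z, sign-flipped for the player to move. The game, policy, and outcome rule below are placeholders, not the paper's Go engine.

```python
import random

def play_toy_game(policy, length=5, seed=0):
    # Play `length` moves of a trivial game, recording each state.
    rng = random.Random(seed)
    states, state = [], ()
    for _ in range(length):
        move = policy(state, rng)
        states.append(state)
        state = state + (move,)
    # Arbitrary toy outcome rule standing in for "who won the game".
    z = 1.0 if sum(state) % 2 == 0 else -1.0
    # Each state becomes a value-regression target: the final result
    # from the perspective of the player to move at that state.
    return [(s, z if i % 2 == 0 else -z) for i, s in enumerate(states)]

policy = lambda state, rng: rng.choice([0, 1])
examples = play_toy_game(policy)
```

Every game played yields a fresh batch of (state, outcome) pairs, so the training distribution keeps tracking the current policy instead of a fixed human database.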
Distributed rollouts and GPUs made large-scale search plus training feasible in reasonable wall-clock time. Engineering constraints shaped the algorithm as much as theory.
Results vs. Fan Hui and Lee Sedol illustrate reliability and failure modes (e.g., unusual lines). Real competition stress-tests what lab Elo curves might hide.
Removing human games and training from self-play alone generalizes the recipe to chess and shogi. AlphaGo is the hinge between handcrafted features and tabula-rasa mastery.