Problem statement & goal
Go has vastly more positions than chess; brute-force search alone fails. The aim: combine deep learning with tree search so a program can reach strong amateur / professional play using human games plus self-play.
Games & planning
Silver et al. · Nature 2016
These notes are written in plain language for this specific paper, so you can grasp the ideas before you wrestle with the authors' formal wording.
Policy networks suggest promising moves; value networks estimate who is winning from a position. Monte Carlo tree search rolls out many simulated games using those networks to pick the next move. Training mixes supervised learning from human games with RL from self-play.
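The loop above can be sketched in miniature. This is an illustrative toy, not the authors' code: `Node`, `select_child`, and the `policy_fn`/`value_fn` stand-ins are assumed names, and the "board" is just a tuple of moves. It shows the shape of one simulation: descend by a PUCT-style rule, expand the leaf with policy-network priors, back up the value estimate.

```python
import math

class Node:
    def __init__(self, prior):
        self.prior = prior      # P(s, a) from the policy network
        self.visits = 0         # N(s, a)
        self.value_sum = 0.0    # W(s, a)
        self.children = {}      # action -> Node

    def q(self):
        # Mean action value Q(s, a); zero before any visit.
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node, c_puct=1.0):
    # Pick the child maximizing Q + U, where U favors high-prior,
    # rarely visited moves (the exploration term).
    total = sum(ch.visits for ch in node.children.values())
    def score(ch):
        u = c_puct * ch.prior * math.sqrt(total) / (1 + ch.visits)
        return ch.q() + u
    return max(node.children.items(), key=lambda kv: score(kv[1]))

def simulate(root, policy_fn, value_fn, state):
    # One simulation: select down to a leaf, expand it with policy
    # priors, then back the value estimate up the visited path.
    path, node = [root], root
    while node.children:
        action, node = select_child(node)
        state = state + (action,)   # toy "apply move"
        path.append(node)
    for action, p in policy_fn(state).items():
        node.children[action] = Node(prior=p)
    v = value_fn(state)
    for n in path:
        n.visits += 1
        n.value_sum += v
    return v
```

After many such simulations, the move played is typically the most-visited child of the root, not the highest-Q one; visit counts are a more robust statistic.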
Data comes from online human games (KGS) and later self-play games the system plays against itself. Evaluation includes matches against strong humans and other Go programs—not just a static test set.
The headline result: AlphaGo beats a top professional under match conditions. For students, notice both the Elo-style ratings and the head-to-head evidence; this is as much a systems-and-RL story as a single-metric one.
The method is compute-heavy and engineering-intensive; domain knowledge (rules, symmetries) still matters. Transfer to other games isn’t automatic—AlphaZero later removes human data but keeps the search + learning pattern.
They relate to earlier Go programs, chess engines, and RL successes. The novelty is the tight loop between deep nets and large-scale MCTS at superhuman scale.
A full replication needs massive compute and distributed training—not a weekend script. The paper and follow-ups describe architecture and training stages, but this is closer to a lab + cluster project than a single-GPU notebook.
Eight highlights: why each part matters before you read the dense notation and proofs.
Enormous branching factor and long games defeat brute-force search alone. AlphaGo shows learned priors and value estimates can shrink the search tree intelligently.
MCTS balances exploration and exploitation over possible moves. Deep nets provide move probabilities and position values so each simulation is more informative.
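A small numeric illustration of that balance (not the paper's code): the PUCT-style score Q + c·P·√N_total/(1+N) lets a rarely visited move with a strong policy prior outrank a well-explored move with a slightly better empirical value.

```python
import math

def puct(q, prior, parent_visits, child_visits, c_puct=1.0):
    # Exploitation (q) plus an exploration bonus that grows with the
    # policy prior and shrinks as the move accumulates visits.
    return q + c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)

# Heavily explored move, decent value, weak prior:
explored = puct(q=0.55, prior=0.1, parent_visits=100, child_visits=50)
# Barely explored move with a strong prior from the policy network:
fresh = puct(q=0.50, prior=0.4, parent_visits=100, child_visits=2)
```

Here `fresh` scores higher, so the search spends its next simulation on the promising-but-unexplored branch; as its visit count grows, the bonus decays and Q takes over.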
Supervised learning on expert games seeds a strong prior over moves. That prior guides which branches MCTS expands first—critical for compute efficiency.
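The supervised stage amounts to cross-entropy training on (position, expert move) pairs: raise the log-probability of the move the human actually played. A minimal sketch with a bare softmax over three toy moves stands in for the paper's convolutional policy network; `sl_step` and the learning rate are illustrative choices.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sl_step(logits, expert_move, lr=0.5):
    # Cross-entropy gradient w.r.t. logits is (p - onehot(expert_move)),
    # so we step each logit toward the expert's choice.
    p = softmax(logits)
    return [x - lr * (pi - (1.0 if i == expert_move else 0.0))
            for i, (x, pi) in enumerate(zip(logits, p))]

logits = [0.0, 0.0, 0.0]            # uniform prior over three toy moves
for _ in range(50):                  # repeatedly show the same expert move
    logits = sl_step(logits, expert_move=1)
```

After a few dozen steps the policy concentrates on move 1; at Go scale the same objective, over millions of KGS positions, yields the prior that steers MCTS expansion.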
A value network predicts the win probability directly from a board state; AlphaGo mixes it with fast rollouts, and later variants drop rollouts entirely. Faster evaluation enables deeper search within the same budget.
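The paper's leaf evaluation blends the two signals: V(s_L) = (1 − λ)·v_θ(s_L) + λ·z_L, where v_θ is the value network's estimate and z_L the fast rollout's outcome. The sketch below uses plain numbers as stand-ins for the real networks.

```python
def leaf_value(v_net, z_rollout, lam=0.5):
    # lam = 0 trusts the value network alone; lam = 1 trusts the
    # fast rollout alone; the paper found a mix works best.
    return (1.0 - lam) * v_net + lam * z_rollout

# Value net says 0.7, rollout ended in a win (z = 1.0):
mixed = leaf_value(v_net=0.7, z_rollout=1.0)
```

With the default mix this leaf backs up 0.85; setting λ = 0 recovers the rollout-free evaluation that later variants adopt.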
Games against itself generate fresh training targets beyond human databases. The system can improve past the best human data—a key shift for game AI.
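How self-play manufactures training targets can be shown on a toy game: play the current policy against itself, then pair every visited state with the final outcome z, sign-flipped for the player to move. The game, policy, and outcome rule below are placeholders, not the paper's Go engine.

```python
import random

def play_toy_game(policy, length=5, seed=0):
    # Play `length` moves of a trivial game, recording each state.
    rng = random.Random(seed)
    states, state = [], ()
    for _ in range(length):
        move = policy(state, rng)
        states.append(state)
        state = state + (move,)
    # Arbitrary toy outcome rule standing in for "who won the game".
    z = 1.0 if sum(state) % 2 == 0 else -1.0
    # Each state becomes a value-regression target: the final result
    # from the perspective of the player to move at that state.
    return [(s, z if i % 2 == 0 else -z) for i, s in enumerate(states)]

policy = lambda state, rng: rng.choice([0, 1])
examples = play_toy_game(policy)
```

Every game played yields a fresh batch of (state, outcome) pairs, so the training distribution keeps tracking the current policy instead of a fixed human database.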
Distributed rollouts and GPUs made large-scale search plus training feasible in reasonable wall-clock time. Engineering constraints shaped the algorithm as much as theory.
Results vs. Fan Hui and Lee Sedol illustrate reliability and failure modes (e.g., unusual lines). Real competition stress-tests what lab Elo curves might hide.
Removing human games and training from self-play alone generalizes the recipe to chess and shogi. AlphaGo is the hinge between handcrafted features and tabula-rasa mastery.