
Reinforcement learning

Playing Atari with Deep Reinforcement Learning

Mnih et al. · 2013 (NIPS Deep Learning Workshop)

Paper PDF

If the viewer is blank (blocked by the publisher or your network), use Open in new tab. Scrolling inside the frame moves through the PDF pages when embedding is supported.

Reading map

These notes are written in plain language for this specific paper—so you can grasp the ideas before you wrestle with the authors’ formal wording. Use the button to open the PDF near the matching section (approximate page; Chromium-style viewers support #page=, otherwise we open a new tab).

Problem statement & goal

Classic reinforcement learning struggled when the “state” was raw pixels and the action space was large. The question: can a single deep network learn to play Atari from screen input only, using rewards as the only supervision?

Methodology & architecture

A CNN approximates the Q-function (expected discounted future reward per action). Experience replay stores past transitions and samples them uniformly at random so training doesn’t chase correlated frames. A slowly updated target network stabilizes the moving Bellman target; strictly speaking, this 2013 workshop paper trains without one, and the target network arrived in the 2015 Nature follow-up.
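The pieces above can be sketched as a single update step. This is a minimal illustration, not the paper's implementation: a hypothetical linear Q-network (one weight row per action) stands in for the CNN, and the minibatch is fabricated rather than replayed from real Atari experience.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS, GAMMA, LR = 4, 3, 0.99, 0.01

# Hypothetical linear Q-network standing in for the paper's CNN:
# Q(s, a) = W[a] @ s, one weight row per action.
W_online = 0.1 * rng.normal(size=(N_ACTIONS, STATE_DIM))
W_target = W_online.copy()  # lagged copy (the 2015 refinement)

def q_values(W, s):
    return W @ s  # one Q-value per action

def dqn_step(W_online, W_target, batch, lr=LR):
    """One semi-gradient step on a replayed minibatch of (s, a, r, s', done)."""
    for s, a, r, s_next, done in batch:
        # The Bellman target comes from the slow-moving target weights.
        target = r if done else r + GAMMA * np.max(q_values(W_target, s_next))
        td_error = target - q_values(W_online, s)[a]
        W_online[a] += lr * td_error * s  # only the taken action's row moves
    return W_online

# Fabricated transitions in place of replayed Atari experience.
batch = [(rng.normal(size=STATE_DIM), int(rng.integers(N_ACTIONS)), 1.0,
          rng.normal(size=STATE_DIM), False) for _ in range(8)]
W_online = dqn_step(W_online, W_target, batch)
```

The real algorithm interleaves this update with acting in the emulator and pushing new transitions into the replay buffer.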

Datasets & benchmarks

The Arcade Learning Environment provides dozens of Atari games—each a benchmark with the same interface. The agent sees stacks of frames, skips frames for speed, and clips rewards to keep scales stable.
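Frame skipping is simple enough to sketch: the agent picks an action every k emulator frames and the action is repeated in between, with rewards summed. Here `env_step` is a hypothetical stand-in for the ALE step call, and the toy emulator exists only to exercise the loop.

```python
def frame_skip_step(env_step, action, k=4):
    """Repeat one chosen action for k emulator frames, summing reward.

    The paper acts on every 4th frame (every 3rd for Space Invaders,
    where lasers blink out of phase with k=4) to cut compute roughly k-fold.
    """
    total_reward, frame, done = 0.0, None, False
    for _ in range(k):
        frame, reward, done = env_step(action)
        total_reward += reward
        if done:
            break
    return frame, total_reward, done

# Toy emulator: reward 1 per frame, episode ends after 10 frames.
state = {"t": 0}
def toy_step(action):
    state["t"] += 1
    return state["t"], 1.0, state["t"] >= 10

frame, reward, done = frame_skip_step(toy_step, action=0)
```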

Results & evaluation metrics

They show human-level or near-human play on several games, with learning curves that improve over training. Not every game is solved (some remain hard), but the paper demonstrates that deep RL from raw pixels works at all.

Limitations & future work

Sample efficiency is poor by today’s standards: millions of frames per game. Training can be unstable; hyperparameters matter. These limits spurred Double DQN, Rainbow, and many improvements.

Reproducibility

The paper spells out network shape, replay buffer, reward clipping, and frame preprocessing. Researchers reproduced DQN quickly; it’s a standard homework baseline—though full Atari runs need GPU time.

What to focus on

Eight highlights for this paper: why each part matters before you read the dense notation.

Pixels to policy

Raw frames go through a conv stack to Q-values per action—no engineered game-specific features. Representation learning sits inside RL, not bolted on.
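The conv stack in the 2013 paper is small by today's standards: 16 8×8 filters with stride 4, then 32 4×4 filters with stride 2, then a 256-unit fully connected layer with one output per action. You can verify the spatial sizes with the standard valid-convolution formula:

```python
def conv_out(size, kernel, stride):
    """Output width/height of a valid (no-padding) convolution."""
    return (size - kernel) // stride + 1

h1 = conv_out(84, kernel=8, stride=4)  # first layer: 16 filters over the 84x84 input
h2 = conv_out(h1, kernel=4, stride=2)  # second layer: 32 filters
flat = h2 * h2 * 32                    # features feeding the 256-unit FC layer
```

Working out `h1`, `h2`, and `flat` by hand is a quick sanity check before reading the architecture table.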

Q-learning recap

Bellman backups estimate expected return per (state, action). The net predicts those values; argmax over actions yields a greedy policy. Ground the math before the hacks.
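The Bellman backup is easiest to see in tabular form, before any network enters the picture. A toy sketch on a made-up three-state chain (not from the paper): Q-learning converges to the known optimal values, and argmax over actions recovers the greedy policy.

```python
import numpy as np

GAMMA, LR = 0.9, 0.5
rng = np.random.default_rng(0)

# Toy deterministic chain: states 0, 1, 2 (state 2 is terminal);
# action 1 moves right (reward 1.0 on reaching the goal), action 0 stays put.
def step(s, a):
    if a == 0:
        return s, 0.0, False
    s_next = s + 1
    return s_next, (1.0 if s_next == 2 else 0.0), s_next == 2

Q = np.zeros((3, 2))
for _ in range(500):
    s, done = 0, False
    while not done:
        a = int(rng.integers(2))  # pure random exploration suffices here
        s_next, r, done = step(s, a)
        # Bellman backup toward r + gamma * max_a' Q(s', a')
        target = r if done else r + GAMMA * Q[s_next].max()
        Q[s, a] += LR * (target - Q[s, a])
        s = s_next

greedy = Q.argmax(axis=1)  # argmax over actions yields the greedy policy
```

Here Q(1, right) converges to 1.0 and Q(0, right) to gamma × 1.0 = 0.9; DQN replaces the table with a network predicting the same quantities.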

Why replay buffer

Consecutive frames are strongly correlated, while SGD assumes roughly i.i.d. samples; sampling stored transitions at random breaks that correlation. Replaying shuffled experience stabilizes gradients, much like shuffling a dataset in supervised learning.
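A replay buffer is a few lines of code: a fixed-capacity FIFO with uniform random sampling. This is a common minimal sketch, not the paper's exact implementation (which stored a million preprocessed frames).

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO store of transitions with uniform random sampling."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall off

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling decorrelates the minibatch, like dataset shuffling.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=5)
for t in range(8):
    buf.push(t, 0, 0.0, t + 1, False)
batch = buf.sample(3)
```

With capacity 5 and 8 pushes, only the newest five transitions survive; later work (prioritized replay) replaces the uniform `sample` with importance-weighted sampling.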

Target network

A lagged copy of weights defines stable targets for the Bellman update. Without it, chasing a moving target causes divergence—a common failure in naive deep Q-learning.
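Both common update schemes fit in a few lines. A sketch with weights as plain dicts of arrays standing in for real network parameters: the periodic hard copy is the DQN scheme, while Polyak averaging is a later alternative (popularized by DDPG), included here only for contrast.

```python
import numpy as np

def hard_update(target, online):
    """DQN-style: copy online weights into the target net every C steps."""
    for name in target:
        target[name] = online[name].copy()

def soft_update(target, online, tau=0.005):
    """Polyak averaging, a later alternative: nudge target toward online each step."""
    for name in target:
        target[name] = (1.0 - tau) * target[name] + tau * online[name]

# Dicts of arrays as a stand-in for real network parameters.
online = {"w": np.ones(3)}
target = {"w": np.zeros(3)}
soft_update(target, online)   # target moves slightly toward online
hard_update(target, online)   # now the two match exactly
```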

Reward clipping & preprocessing

Frame stacking, downsampling, and reward clipping normalize difficulty across games. Know which choices are algorithmic vs. engineering convenience.
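A simplified sketch of the two preprocessing steps, with nearest-neighbor resizing as a stand-in: the paper converts 210×160 RGB frames to luminance, downsamples to 110×84, then crops an 84×84 playing-area patch, whereas this version just resizes directly.

```python
import numpy as np

def preprocess(frame, out=84):
    """Grayscale + nearest-neighbor resize of one RGB frame (simplified)."""
    gray = frame.astype(np.float32).mean(axis=2)   # crude grayscale
    h, w = gray.shape
    rows = np.arange(out) * h // out               # nearest-neighbor row indices
    cols = np.arange(out) * w // out
    return gray[np.ix_(rows, cols)] / 255.0        # scale to [0, 1]

def clip_reward(r):
    """Map every reward to -1, 0, or +1 so one learning rate works across games."""
    return float(np.sign(r))

frame = np.full((210, 160, 3), 128, dtype=np.uint8)  # dummy Atari-sized frame
obs = preprocess(frame)
```

Reward clipping is the clearest example of an engineering-convenience choice: it normalizes scales across games but discards reward-magnitude information.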

ε-greedy exploration

Random actions early encourage coverage of the state space. Too little exploration misses good policies; too much slows exploitation of what was learned.
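The paper's schedule anneals ε linearly from 1.0 to 0.1 over the first million frames and holds it there. A minimal sketch (function names are mine):

```python
import random

def epsilon(step, eps_start=1.0, eps_end=0.1, anneal_steps=1_000_000):
    """Linearly anneal exploration over the first million frames, then hold."""
    frac = min(step / anneal_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def select_action(q_values, step, rng=random):
    """Explore with probability epsilon(step), otherwise act greedily."""
    if rng.random() < epsilon(step):
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

action = select_action([0.0, 2.0, 1.0], step=10**9)
```

Early on the agent acts almost entirely at random (coverage); late in training it mostly exploits, with a residual 10% exploration.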

Atari benchmark

One setup across diverse games tests robustness. Compare human-normalized scores to see which titles remain hard (sparse reward, long horizons).
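Human-normalized scoring (standardized in the 2015 Nature follow-up rather than this workshop paper) anchors each game so 0 means random play and 100 means human-level. The numbers below are hypothetical, purely for illustration.

```python
def human_normalized(agent_score, random_score, human_score):
    """100 * (agent - random) / (human - random): 0 = random, 100 = human-level."""
    return 100.0 * (agent_score - random_score) / (human_score - random_score)

# Hypothetical per-game scores, not taken from the paper.
score = human_normalized(agent_score=400.0, random_score=100.0, human_score=500.0)
```

This rescaling is what makes scores comparable across games whose raw reward magnitudes differ by orders of magnitude.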

What came next

Double DQN, dueling heads, prioritized replay, and policy-gradient methods address DQN weaknesses. DQN remains the canonical intro to function approximation in value-based RL.
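The Double DQN fix is small enough to show side by side. A sketch with made-up Q-values: vanilla DQN lets the target net both select and evaluate the next action, while Double DQN decouples the two, curbing the max-operator's overestimation bias.

```python
import numpy as np

def dqn_target(r, q_next_target, gamma=0.99):
    """Vanilla DQN: target net both selects and evaluates the next action."""
    return r + gamma * float(np.max(q_next_target))

def double_dqn_target(r, q_next_online, q_next_target, gamma=0.99):
    """Double DQN: online net selects the action, target net evaluates it."""
    a_star = int(np.argmax(q_next_online))
    return r + gamma * float(q_next_target[a_star])

# Made-up next-state Q-values for illustration.
q_online = np.array([1.0, 3.0, 2.0])
q_target = np.array([2.5, 1.0, 2.0])
```

Because the target net's value at the online net's argmax can never exceed the target net's own max, the Double DQN target is always at most the vanilla one.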
