Sequence modeling

Sequence to Sequence Learning with Neural Networks

Sutskever, Vinyals & Le · NIPS 2014

Paper PDF

Open in new tab


If the viewer is blank (blocked by the publisher or your network), use Open in new tab. Scrolling inside the frame moves through the PDF pages when embedding is supported.

Reading map

These notes are written in plain language for this specific paper—so you can grasp the ideas before you wrestle with the authors’ formal wording. Use the button to open the PDF near the matching section (approximate page; Chromium-style viewers support #page=, otherwise we open a new tab).

Problem statement & goal

Machine translation used many hand-tuned pieces. The goal here: one encoder–decoder LSTM that maps a variable-length source sentence to a variable-length target sentence—end-to-end, with minimal linguistic machinery.

Methodology & architecture

An encoder LSTM reads the source words and compresses them into a single fixed-size vector (the "thought vector"). A decoder LSTM then generates the translation one word at a time, conditioned on that vector. They also reverse the source sentence, a simple trick that markedly improves results by creating short-term dependencies between the start of the source and the start of the target.
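The split above can be sketched in a few lines. This toy is an assumption-laden illustration, not the paper's model (which uses 4-layer LSTMs with 1000 cells per layer); every "recurrence" below is a made-up arithmetic stand-in. What it does show is the shape of the idea: a variable-length source folded into one fixed-size state, then a decoder that emits tokens from that state alone.

```python
# Toy encoder-decoder sketch. All arithmetic here is an illustrative
# stand-in for learned LSTM updates, not the paper's architecture.

def encode(source_tokens):
    """Fold a variable-length source into one fixed-size state."""
    state = 0.0
    for tok in source_tokens:
        # Stand-in for an LSTM step: mix each token into the running state.
        state = 0.5 * state + 0.5 * ((sum(map(ord, tok)) % 100) / 100.0)
    return state  # the fixed-size bottleneck ("thought vector")

def decode(state, vocab, max_len=5):
    """Emit target tokens one at a time, conditioned only on the state."""
    out = []
    for _ in range(max_len):
        # Stand-in for the decoder LSTM + softmax: map state to a token.
        tok = vocab[int(state * len(vocab)) % len(vocab)]
        if tok == "<eos>":
            break
        out.append(tok)
        state = (state * 7.13) % 1.0  # advance the decoder state
    return out

translation = decode(encode(["le", "chat", "dort"]), ["the", "cat", "sleeps", "<eos>"])
```

The point of the sketch is the bottleneck: everything the decoder knows about the source must fit in `state`, which is exactly the limitation attention later relaxes.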

Datasets & benchmarks

They train on the WMT ’14 English–French task—a standard large translation benchmark—so numbers are comparable to other published systems. Beam search at decode time improves output quality.

Results & evaluation metrics

They beat a strong phrase-based SMT baseline on BLEU—a standard automatic translation score. For students: this was early proof that neural sequence models could win on a major benchmark.

Limitations & future work

The bottleneck is the single fixed vector between encoder and decoder: long sentences stress this design. Attention (a later paper) fixes much of that; here the limitation motivates the next wave of work.

Reproducibility

Training setup, reversal trick, and beam search are described concretely. Reproducing exact numbers needs the same data and compute, but the recipe is learnable in a course lab.

What to focus on

Eight highlights for this paper—why each part matters before you dig into the dense notation.

Sequence transduction

Many tasks map input sequences to output sequences with different lengths (translation, summarization). This paper formalizes one general neural recipe instead of pipeline hacks.

Encoder–decoder split

Encoder compresses the source into a representation the decoder consumes step by step. That separation is the template later attention and Transformer papers refine.

Why LSTMs

Vanishing gradients break plain RNNs on long sentences. LSTMs carry memory across many time steps—critical before self-attention made long-range mixing easier.
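The gating that carries memory across time steps can be shown with a single-unit LSTM cell in plain Python. This is a minimal sketch, assuming one scalar unit with shared toy weights for every gate; the real model learns separate weight matrices per gate and stacks four layers of 1000 cells.

```python
import math

# Single-unit LSTM step with made-up constant weights (w, u, b), purely
# illustrative. The key line is the cell update: c is modified additively,
# which is what lets information (and gradients) survive many steps.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, w=0.5, u=0.5, b=0.0):
    """One step: scalar input x, hidden state h, cell state c."""
    z = w * x + u * h + b
    f = sigmoid(z + 1.0)       # forget gate, biased toward "keep memory"
    i = sigmoid(z)             # input gate
    o = sigmoid(z)             # output gate
    g = math.tanh(z)           # candidate cell update
    c = f * c + i * g          # additive memory update: the core trick
    h = o * math.tanh(c)       # exposed hidden state
    return h, c

h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.2]:     # a short "sentence" of scalar inputs
    h, c = lstm_step(x, h, c)
```

Contrast this with a plain RNN, where the state is fully overwritten each step and repeated multiplication shrinks gradients toward zero on long sentences.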

Reverse source words

A simple reordering trick improved BLEU: reversing the source puts its first words next to the target's first words, creating short-term dependencies that make optimization easier. Read it as inductive bias, not magic.
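As data preparation the trick is one line: reverse only the source side of each training pair, leaving the target in its natural order. The example pair below is illustrative.

```python
# Source reversal: the first source word ends up adjacent (in the model's
# reading order) to the first target word it tends to align with.

def reverse_source(pair):
    src, tgt = pair
    return src[::-1], tgt  # target order is left untouched

src, tgt = reverse_source((["a", "b", "c"], ["x", "y", "z"]))
```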

Training objective

Maximum likelihood on parallel sentence pairs means training uses teacher forcing: the decoder always conditions on the true previous word. At inference it must consume its own predictions instead—the mismatch known as exposure bias.
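The objective can be made concrete with a small sketch. `toy_model` below is a hypothetical stand-in for the decoder's softmax, not the paper's network; the part to notice is that the loss always feeds the reference token back in, never the model's own pick.

```python
import math

def toy_model(prefix, vocab):
    """Fake next-token distribution that slightly favors the token after
    the last prefix token in vocab order. Purely illustrative."""
    last = vocab.index(prefix[-1]) if prefix and prefix[-1] in vocab else -1
    scores = [2.0 if i == (last + 1) % len(vocab) else 1.0
              for i in range(len(vocab))]
    total = sum(scores)
    return [s / total for s in scores]

def teacher_forced_nll(reference, vocab):
    """Training objective: -sum_t log p(y_t | y_<t) with the TRUE prefix."""
    nll, prefix = 0.0, []
    for tok in reference:
        probs = toy_model(prefix, vocab)
        nll -= math.log(probs[vocab.index(tok)])
        prefix.append(tok)  # feed the reference token, not the model's pick
    return nll

loss = teacher_forced_nll(["a", "b", "c"], ["a", "b", "c"])
```

At inference there is no reference to append, so `prefix` would grow from the model's own (possibly wrong) outputs—conditions the model never saw during training.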

Beam search decoding

Greedy decoding is brittle; beam search keeps multiple hypotheses. Trade-off: wider beams cost compute but often lift translation quality.
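A minimal beam search over a hypothetical next-token distribution makes the trade-off visible. `next_probs` and its numbers are invented for this sketch: greedy grabs the tempting first token, while a beam of width 2 keeps the alternative alive long enough to win.

```python
import math

def next_probs(prefix):
    # Made-up distribution: "b" looks best first (p=0.6), but the "a..."
    # continuation scores better overall. Illustrative numbers only.
    if not prefix:
        return {"a": 0.4, "b": 0.6}
    if prefix[-1] == "a":
        return {"x": 0.9, "<eos>": 0.1}
    if prefix[-1] == "b":
        return {"x": 0.5, "y": 0.5}
    return {"<eos>": 1.0}

def beam_search(width=2, max_len=3):
    beams = [([], 0.0)]                       # (tokens, log-prob)
    for _ in range(max_len):
        candidates = []
        for toks, lp in beams:
            if toks and toks[-1] == "<eos>":  # finished hypotheses persist
                candidates.append((toks, lp))
                continue
            for tok, p in next_probs(toks).items():
                candidates.append((toks + [tok], lp + math.log(p)))
        # keep only the `width` best partial hypotheses
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:width]
    return beams[0][0]

best = beam_search(width=2)   # finds the sequence starting with "a"
```

Width 1 reduces to greedy decoding and commits to "b" at the first step; widening the beam multiplies the per-step work by roughly the beam width, which is the compute/quality trade-off the notes mention.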

Scaling data & depth

WMT-scale data and deep stacks matter as much as the headline architecture. Neural MT wins when corpora and compute match the model class.

Bridge to Transformers

"Attention Is All You Need" replaces recurrence with self-attention—but the problem framing (seq2seq) and the practice of beam-search decoding come straight from this line of work.

← Back to Research Lab