Problem statement & goal
Machine translation used many hand-tuned pieces. The goal here: one encoder–decoder LSTM that maps a variable-length source sentence to a variable-length target sentence—end-to-end, with minimal linguistic machinery.
Sequence modeling
Sutskever, Vinyals & Le · NeurIPS 2014
These notes are written in plain language for this specific paper, so you can grasp the ideas before you wrestle with the authors' formal wording.
An encoder LSTM reads the source words and compresses them into a single fixed-size "thought vector." A decoder LSTM then generates the translation one word at a time, conditioned on that vector. They also reverse the order of the source words, which markedly improves results by putting the beginning of the source close to the beginning of the target.
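The encode-then-decode control flow fits in a few lines. This is a toy illustration, not the paper's model: `step` is a stand-in for a trained LSTM cell, and the integer `state` stands in for the fixed-size vector. All names and constants are my own.

```python
# Toy sketch of the seq2seq control flow. `step` is a placeholder for a
# trained LSTM cell; the integer `state` stands in for the fixed-size
# "thought vector". Everything here is illustrative, not the paper's model.

def step(state, token_id):
    # Placeholder recurrent update, NOT a real LSTM.
    return (31 * state + token_id) % 1000

def encode(source_ids):
    # The paper feeds the source sentence reversed; the final state is
    # the single vector handed to the decoder.
    state = 0
    for tok in reversed(source_ids):
        state = step(state, tok)
    return state

def decode(state, bos=1, eos=2, max_len=5):
    # Greedy generation: each emitted token is fed back as the next input,
    # until an end-of-sentence token or a length cap.
    out, tok = [], bos
    for _ in range(max_len):
        state = step(state, tok)
        tok = state % 10  # stand-in for argmax over the vocabulary
        if tok == eos:
            break
        out.append(tok)
    return out

translation = decode(encode([4, 8, 2]))
```

A real decoder takes an argmax (or a beam search) over a softmax; the modulo here just keeps the sketch deterministic.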
They train on the WMT ’14 English–French task—a standard large translation benchmark—so numbers are comparable to other published systems. Beam search at decode time improves output quality.
They beat a strong phrase-based SMT baseline on BLEU—a standard automatic translation score. For students: this was early proof that neural sequence models could win on a major benchmark.
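At the core of BLEU is clipped n-gram precision: how many of the candidate's n-grams appear in the reference, with each n-gram counted at most as often as the reference contains it. A simplified sketch follows; real BLEU combines n = 1..4 geometrically and adds a brevity penalty, both omitted here.

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    # Clipped n-gram precision, the core ingredient of BLEU. Real BLEU
    # combines n = 1..4 geometrically and multiplies by a brevity
    # penalty; this toy omits both.
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    # Clip each candidate n-gram's count by its count in the reference.
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
p1 = ngram_precision(cand, ref, 1)  # unigram precision: 5 of 6 match
p2 = ngram_precision(cand, ref, 2)  # bigram precision: 3 of 5 match
```

The clipping matters: without it, a degenerate candidate like "the the the" would score perfectly on unigrams.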
The bottleneck is the single fixed vector between encoder and decoder: long sentences stress this design. Attention (a later paper) fixes much of that; here the limitation motivates the next wave of work.
They situate against classic SMT and earlier neural MT attempts. The message: depth + data + end-to-end learning beats a pipeline of separate engineered parts.
Training setup, reversal trick, and beam search are described concretely. Reproducing exact numbers needs the same data and compute, but the recipe is learnable in a course lab.
Eight highlights follow: why each part matters before you read the paper's dense notation.
Many tasks map input sequences to output sequences with different lengths (translation, summarization). This paper formalizes one general neural recipe instead of pipeline hacks.
Encoder compresses the source into a representation the decoder consumes step by step. That separation is the template later attention and Transformer papers refine.
Vanishing gradients break plain RNNs on long sentences. LSTMs carry memory across many time steps—critical before self-attention made long-range mixing easier.
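The gate arithmetic behind that long-range memory is easy to show at scalar scale. A sketch with scalar weights (real cells use matrices and vectors); the weight values below are hand-picked to saturate the forget gate, not learned.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell(x, h_prev, c_prev, W):
    # One scalar LSTM step. W maps gate name -> (w_x, w_h, bias); these
    # weights are illustrative placeholders, not trained values.
    f = sigmoid(W["f"][0] * x + W["f"][1] * h_prev + W["f"][2])  # forget gate
    i = sigmoid(W["i"][0] * x + W["i"][1] * h_prev + W["i"][2])  # input gate
    o = sigmoid(W["o"][0] * x + W["o"][1] * h_prev + W["o"][2])  # output gate
    g = math.tanh(W["g"][0] * x + W["g"][1] * h_prev + W["g"][2])  # candidate
    c = f * c_prev + i * g  # additive memory path: gradients flow through c
    h = o * math.tanh(c)
    return h, c

# Saturate the forget gate (bias +10) and close the input gate (bias -10):
# the cell state should survive many steps almost unchanged.
W = {"f": (0.0, 0.0, 10.0), "i": (0.0, 0.0, -10.0),
     "o": (0.0, 0.0, 0.0), "g": (0.0, 0.0, 0.0)}
h, c = 0.0, 1.0
for _ in range(50):
    h, c = lstm_cell(0.0, h, c, W)
```

After 50 steps the cell state `c` is still close to its initial value of 1.0; a plain RNN's state, repeatedly squashed through a nonlinearity, would have lost it long before.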
A simple reordering trick improved BLEU by aligning early decoder outputs with more informative encoder states. Read it as inductive bias, not magic.
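One way to see the bias: count how many recurrent steps separate each source word from the first decoded word. This crude step-count model is my illustration, not the paper's analysis, but it matches the paper's observation that reversal leaves the average distance unchanged while making the earliest source words much closer to the earliest target words.

```python
def recurrence_gap(src_len, src_pos, reverse):
    # Number of recurrent steps between reading source word `src_pos`
    # (0-indexed) and emitting the first target word. Smaller gap means
    # easier credit assignment. Illustrative model, not from the paper.
    read_order = (src_len - 1 - src_pos) if reverse else src_pos
    return src_len - read_order  # steps remaining until decoding starts

n = 10
normal = [recurrence_gap(n, i, reverse=False) for i in range(n)]
flipped = [recurrence_gap(n, i, reverse=True) for i in range(n)]
```

In the normal order the first source word is 10 steps from the first target word; reversed, it is 1 step away, while the total (hence average) distance is identical in both orders.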
Maximum likelihood on parallel sentence pairs ties the model to teacher forcing. Notice where exposure bias appears versus how inference uses its own predictions.
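The train/inference mismatch is easy to demonstrate with a toy decoder: identical parameters, different histories. Everything below (`toy_decoder_step`, the arithmetic, the token ids) is illustrative, not from the paper.

```python
def toy_decoder_step(state, token):
    # Stand-in for one LSTM step plus argmax; purely illustrative.
    state = (state * 13 + token) % 50
    return state, state % 5  # (new state, predicted token id)

def train_step(target):
    # Teacher forcing: the GOLD previous token is fed in at every step,
    # regardless of what the model actually predicted.
    state, prev, preds = 0, 0, []  # prev = 0 acts as BOS
    for gold in target:
        state, pred = toy_decoder_step(state, prev)
        preds.append(pred)
        prev = gold  # gold history
    return preds

def generate(length):
    # Inference: the model's OWN previous prediction is fed back in,
    # so early mistakes compound -- the exposure-bias mismatch.
    state, prev, preds = 0, 0, []
    for _ in range(length):
        state, pred = toy_decoder_step(state, prev)
        preds.append(pred)
        prev = pred  # own history
    return preds
```

Same "weights", different trajectories: with gold history the toy model emits one sequence, on its own history another, which is exactly the gap between training and inference conditions.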
Greedy decoding is brittle; beam search keeps multiple hypotheses. Trade-off: wider beams cost compute but often lift translation quality.
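A minimal generic beam search, run on a hand-built toy distribution where the greedy first choice is locally best but globally worse. The `step_logprobs(prefix)` callback interface is an assumption of this sketch, not the paper's decoder API.

```python
import math

def beam_search(step_logprobs, beam_width, max_len, eos):
    # Keep the `beam_width` highest-scoring partial hypotheses at each step.
    # `step_logprobs(prefix)` returns {token: log-prob} for the next token.
    beams = [((), 0.0)]  # (token prefix, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:
                candidates.append((prefix, score))  # finished beam carries over
                continue
            for tok, lp in step_logprobs(prefix).items():
                candidates.append((prefix + (tok,), score + lp))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0]

# Toy model: "a" is the greedy first pick (0.6 > 0.4), but the "b" branch
# has the higher-probability complete sentence. Hand-built for illustration.
table = {
    (): {"a": math.log(0.6), "b": math.log(0.4)},
    ("a",): {"x": math.log(0.5), "<eos>": math.log(0.5)},
    ("b",): {"<eos>": math.log(0.99), "x": math.log(0.01)},
}
def toy_model(prefix):
    return table.get(prefix, {"<eos>": 0.0})

best, score = beam_search(toy_model, beam_width=2, max_len=3, eos="<eos>")
```

With width 2 the search keeps the "b" branch alive long enough to discover its near-certain ending (0.4 x 0.99 = 0.396 beats any "a" completion at 0.3), which greedy decoding would have discarded at the first step.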
WMT-scale data and deep stacks matter as much as the headline architecture. Neural MT wins when corpora and compute match the model class.
"Attention Is All You Need" later replaces recurrence with self-attention, but the seq2seq problem statement and the practice of beam-search decoding come straight from this line of work.