Problem statement & goal
Recurrent and convolutional models process text step by step, which limits parallel training on long sequences. The authors propose a model built only on attention—no RNN—so the whole sentence can be processed more in parallel while still capturing long-range dependencies.