T5: what the text-to-text paper actually says

Reading Raffel et al. (Google, 2020) while choosing between FLAN-T5 and a decoder-only model for a structured extraction pipeline.

The task was contract extraction: given a legal document of ~4,000 tokens, pull out 18 specific fields — governing law, parties, effective date, termination triggers, liability caps, payment terms. The output is a JSON object of about 300 tokens. We needed to run it at 50 documents per second and the GPU budget was fixed.

The question was which model family to fine-tune. Everyone on the team defaulted to a decoder-only LLM because "that's what everything uses now." But every paper I'd seen on structured extraction used encoder-decoder architectures. The argument kept going in circles until I read the original T5 paper — not for the architecture recommendation specifically, but for the systematic evidence about why any of these choices matter.

T5 ("Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer", Raffel, Shazeer, Roberts et al., Google 2020) is a 67-page ablation study with a model attached. Most people know the punchline (text-to-text framing) and miss the machinery. The machinery is the part worth understanding.

The problem T5 is actually solving

By 2019, NLP transfer learning was fragmented. BERT did masked language modeling with encoder-only architectures and was fine-tuned for classification, tagging, and extractive QA. GPT did causal language modeling with decoder-only architectures and was fine-tuned for generation. There were separate pretraining recipes for translation (encoder-decoder), summarization (encoder-decoder), and question answering (encoder-only or span extraction). Comparing results across papers was hard because the baselines, architectures, and pretraining data all differed.

The T5 paper asks a clean question: if you hold the training framework constant and vary only one thing at a time — pretraining objective, architecture, dataset, scale — which choices actually matter and by how much? The text-to-text framing is what makes this possible. By treating every task as "given this text, produce this text," you can apply the same training loop, loss function, and inference procedure everywhere.

The practical consequence: a model trained with this framing can be fine-tuned for translation, summarization, classification, and question answering with the same code. No task-specific output heads. No different fine-tuning procedures. The format encodes the task:

Input:  "summarize: [article text]"
Output: "[summary]"

Input:  "translate English to German: The model runs at inference time."
Output: "Das Modell läuft zur Inferenzzeit."

Input:  "mnli premise: The cat sat on the mat. hypothesis: An animal was on a surface."
Output: "entailment"

This looks obvious in retrospect. In 2019, it was not obvious. Most NLP frameworks had task-specific fine-tuning logic baked into the training loop itself.
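A minimal sketch of what "the format encodes the task" means in code. The helper name `to_text_pair` and its signature are my own illustration, not something from the paper or any library; the point is only that every task reduces to one (input string, target string) pair, so a single training loop can consume all of them.

```python
from typing import Optional, Tuple


def to_text_pair(prefix: str, target: str, text: Optional[str] = None,
                 **named_fields: str) -> Tuple[str, str]:
    """Render one example as (input_text, target_text).

    Single-input tasks use "prefix: text". Multi-input tasks list each
    named field after the prefix, in the order the fields were passed.
    """
    if text is not None:
        source = f"{prefix}: {text}"
    else:
        parts = " ".join(f"{name}: {value}" for name, value in named_fields.items())
        source = f"{prefix} {parts}"
    return source, target


examples = [
    to_text_pair("summarize", "[summary]", text="[article text]"),
    to_text_pair("mnli", "entailment",
                 premise="The cat sat on the mat.",
                 hypothesis="An animal was on a surface."),
]
for source, target in examples:
    print(source, "->", target)
```

Classification labels are emitted as literal strings ("entailment"), which is why no task-specific output head is needed.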

The architecture

T5 uses an encoder-decoder transformer, close to the original Vaswani et al. design but with three specific changes that matter for production behavior.

Pre-norm instead of post-norm. The original transformer applied LayerNorm after the sublayer (post-norm). T5 applies LayerNorm before (pre-norm). Pre-norm training is more stable at large scale — gradients don't blow up during early training when layer outputs are large. Most modern LLMs now use pre-norm. If you're debugging training instability in a post-norm model, this is worth knowing.
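The difference is easiest to see with the sublayer stubbed out. This is a pure-Python sketch on plain lists, not T5's implementation. The rescale-only normalization follows the paper's description of its simplified layer norm, and the learned gain is omitted for brevity (an assumption of this sketch).

```python
import math
from typing import Callable, List

Vector = List[float]


def rescale_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Rescale-only normalization: divide by the root mean square, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def post_norm_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Original transformer ordering: normalize after the residual add."""
    return rescale_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_norm_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """T5 ordering: normalize the sublayer input, leave the residual path alone."""
    return [a + b for a, b in zip(x, sublayer(rescale_norm(x)))]


def zero_sublayer(x: Vector) -> Vector:
    """Stand-in for attention or the FFN that contributes nothing."""
    return [0.0 for _ in x]


x = [3.0, 4.0]
print(pre_norm_block(x, zero_sublayer))   # residual stream passes through untouched
print(post_norm_block(x, zero_sublayer))  # residual stream is rescaled at every block
```

With pre-norm the residual path is an identity from input to output, which is the usual explanation for why deep stacks train more stably with it.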

Relative position embeddings. The original transformer used absolute sinusoidal position encodings: position 1 gets one vector, position 2 gets another. T5 uses relative position encodings: the attention weight between token $i$ and token $j$ gets a bias that depends on the distance $(i - j)$, not the absolute positions. The bias values are learned and shared across layers. The practical effect: T5 generalizes better to sequences longer than it saw during training, because the model sees relative distances not absolute indices. Not as good as RoPE for extrapolation, but a meaningful improvement over absolute positions.
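The paper does not learn one bias per distance. It maps distances into a small number of buckets (32 in the paper), exact for nearby tokens and logarithmically wider further out, with everything beyond an offset of 128 sharing one bucket. Below is a scalar sketch of that bucketing, modeled on the scheme used in public T5 implementations. The function name and the sign convention (key position minus query position) are assumptions of this sketch.

```python
import math


def relative_position_bucket(rel: int, bidirectional: bool = True,
                             num_buckets: int = 32, max_distance: int = 128) -> int:
    """Map a relative position (key index minus query index) to a bucket id."""
    bucket = 0
    if bidirectional:
        # Encoder: half the buckets for keys after the query, half for before.
        num_buckets //= 2
        if rel > 0:
            bucket += num_buckets
        rel = abs(rel)
    else:
        # Decoder self-attention: only keys at or before the query matter.
        rel = max(-rel, 0)

    max_exact = num_buckets // 2
    if rel < max_exact:
        # Small distances each get their own bucket.
        return bucket + rel

    # Larger distances share logarithmically sized buckets up to max_distance.
    log_ratio = math.log(rel / max_exact) / math.log(max_distance / max_exact)
    large = max_exact + int(log_ratio * (num_buckets - max_exact))
    return bucket + min(large, num_buckets - 1)


for distance in (0, -1, -7, -8, -20, -128, -5000):
    print(distance, relative_position_bucket(distance))
```

Because every distance past `max_distance` lands in the same bucket, a longer sequence never asks for a bias the model has not trained. That is the mechanism behind the length generalization described above, and also why very distant tokens are indistinguishable by position.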

No bias terms in weight matrices. T5 drops bias parameters from most linear layers (query, key, value, FFN weights). This reduces parameter count by a small amount, simplifies weight tying, and has no meaningful impact on downstream performance. Worth knowing if you're debugging weight initialization or loading pretrained weights — the shapes differ from a standard transformer.

Vocabulary and tokenization. SentencePiece with unigram language model, 32,000 token vocabulary, shared across encoder input, decoder input, and decoder output. Shared embeddings reduce parameter count and help with rare tokens that appear in both input and output.

Model sizes:

T5-Small: 60M parameters
T5-Base: 220M parameters
T5-Large: 770M parameters
T5-3B: 3B parameters (the size later checkpoints call XL)
T5-11B: 11B parameters (the size later checkpoints call XXL)

C4 and the pretraining data decisions

The paper introduces C4 — Colossal Clean Crawled Corpus — as its pretraining dataset. C4 is Common Crawl filtered through a sequence of heuristics:

Retained only lines ending in terminal punctuation (period, exclamation, question mark)
Removed pages with fewer than 3 sentences
Removed lines containing the word "javascript" (catches "enable JavaScript" warnings rather than actual prose)
Removed pages with curly braces (removes code)
Removed pages on a blocklist of adult content terms
Removed non-English content (using langdetect)
Removed duplicate 3-sentence spans across the corpus
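A rough sketch of the per-page heuristics, to show how cheap they are. This is not the C4 pipeline. The sentence splitter is a naive regex, `BLOCKLIST` is a placeholder set, and the terminal-punctuation and "javascript" rules are applied per line, which is how the paper describes them. Language detection and cross-corpus deduplication are left out because they need an external library and a global pass.

```python
import re
from typing import Optional

TERMINAL_PUNCTUATION = (".", "!", "?")
BLOCKLIST = {"placeholderbadword"}  # hypothetical stand-in for the real word list


def clean_page(text: str, min_sentences: int = 3) -> Optional[str]:
    """Return the cleaned page text, or None if the whole page is discarded."""
    # Page-level rejections: curly braces (likely code) and blocklisted words.
    if "{" in text or "}" in text:
        return None
    if set(re.findall(r"[a-z']+", text.lower())) & BLOCKLIST:
        return None

    # Line-level filters: keep prose-looking lines only.
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue  # menus, headers, truncated fragments
        if "javascript" in line.lower():
            continue  # "please enable JavaScript" warnings
        kept.append(line)

    cleaned = "\n".join(kept)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", cleaned) if s]
    if len(sentences) < min_sentences:
        return None
    return cleaned


page = (
    "Home | About | Contact\n"
    "The lease starts in May. It runs for two years.\n"
    "Please enable JavaScript to view this page.\n"
    "Rent is due monthly!"
)
print(clean_page(page))
```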

The result is about 750GB of text after filtering. The paper's ablation on data quality is one of its clearest results: pretraining on the filtered C4 beats pretraining on the same crawl with the heuristic filtering removed. The paper does not ablate the heuristics one at a time, so it can't tell you which filter matters most. Deduplication is a plausible candidate, since web-crawled text has enormous amounts of boilerplate (menus, headers, footers, cookie notices) that repeat across millions of pages.

One result that surprised me: training on the filtered C4 was competitive with but not dramatically better than training on "WebText"-style data (high-quality curated sources). The gap between a good general corpus and a highly curated one was smaller than expected. The gap between raw crawl and either filtered option was large.

The pretraining objective ablation

This is where the paper earns its length. The authors compare seven different pretraining objectives and measure downstream task performance across 23 benchmarks.

Standard language modeling (predict the next token left-to-right, GPT-style): competitive on generation tasks but weak on classification and understanding tasks. The problem is that a causal LM processes each token only with left context. Classification tasks often require bidirectional understanding — whether a sentence is an entailment of another depends on both the premise and hypothesis simultaneously.

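Masked language modeling (BERT-style, corrupt 15% of tokens, predict the originals): strong on classification and understanding, weaker on generation. BERT itself is encoder-only, so it can't directly generate output sequences without a separate decoder. To compare like with like, the paper adapts the objective to its encoder-decoder baseline, with the uncorrupted text as the decoder target.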

Prefix LM (a single transformer stack attends bidirectionally over a prefix and generates the suffix causally): a hybrid. Better at generation than encoder-only BERT, worse at pure classification than BERT, roughly competitive with encoder-decoder on generation tasks. Including it as a comparison point was prescient: the same attention pattern reappears in later work such as UL2, which trains with a prefix-LM mode. It is not what standard causal decoder-only models like GPT-NeoX use.

Span corruption (their final choice): instead of masking individual tokens, mask contiguous spans of tokens. Replace each span with a single sentinel token (<extra_id_0>, <extra_id_1>, ...). The model must predict the content of each sentinel in the decoder output.

Original: "The quick brown fox jumps over the lazy dog."
Input:    "The quick <extra_id_0> jumps over <extra_id_1> dog."
Target:   "<extra_id_0> brown fox <extra_id_1> the lazy <extra_id_2>"

The sentinel design means the target sequence is significantly shorter than the input — you're only predicting the masked content, not the full document. This improves training efficiency: the same compute budget trains on more input examples because each forward pass spends fewer decoder steps on unmasked tokens. The paper shows this is the best-performing objective across their benchmark suite and the most efficient in terms of compute per unit of downstream performance.
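The transformation itself is a few lines. Here is a sketch that takes the spans as given, so the output is deterministic. In pretraining the spans are sampled at random (the paper settles on corrupting 15% of tokens with a mean span length of 3). The function name is mine, and whitespace-split words stand in for real subword tokens.

```python
from typing import List, Sequence, Tuple


def span_corrupt(tokens: Sequence[str],
                 spans: Sequence[Tuple[int, int]]) -> Tuple[str, str]:
    """Replace each [start, end) span with a sentinel; return (input, target)."""
    source: List[str] = []
    target: List[str] = []
    cursor = 0
    for k, (start, end) in enumerate(sorted(spans)):
        if start < cursor or end <= start or end > len(tokens):
            raise ValueError(f"invalid or overlapping span: {(start, end)}")
        sentinel = f"<extra_id_{k}>"
        source.extend(tokens[cursor:start])  # untouched text before the span
        source.append(sentinel)              # one sentinel replaces the whole span
        target.append(sentinel)              # the target names the sentinel...
        target.extend(tokens[start:end])     # ...then spells out what it hid
        cursor = end
    source.extend(tokens[cursor:])
    target.append(f"<extra_id_{len(spans)}>")  # final sentinel closes the target
    return " ".join(source), " ".join(target)


tokens = "The quick brown fox jumps over the lazy dog.".split()
source, target = span_corrupt(tokens, [(2, 4), (6, 8)])
print(source)
print(target)
```

With these spans the target holds 4 content tokens and 3 sentinels against a 9-token original, which is where the decoder-side savings come from.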

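A practical note on sentinel tokens: T5 adds 100 extra tokens to the vocabulary (<extra_id_0> through <extra_id_99>) as sentinel IDs. During fine-tuning, these tokens are rarely seen and don't interfere with task-specific output. But if you're doing structured output (JSON, XML) with fine-tuned T5, watch for two things. First, make sure your format doesn't overlap with the sentinel pattern, and check outputs for leaked sentinels before parsing: a model that is confused about the task format can fall back to its pretraining behavior and emit them. Second, confirm that the tokenizer can round-trip your format's punctuation. The vocabulary was learned from C4, which had every page containing a curly brace filtered out, and braces are commonly reported to encode as the unknown token in the original T5 vocabulary, so verify this on your own checkpoint.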
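A small guard along these lines catches leaked sentinels before anything downstream consumes them. It is a sketch: the function name, the field names, and the choice to raise rather than retry are all assumptions for illustration.

```python
import json
import re
from typing import Any, Dict, Iterable

SENTINEL_PATTERN = re.compile(r"<extra_id_\d+>")


def parse_extraction(raw: str, required_fields: Iterable[str]) -> Dict[str, Any]:
    """Parse decoded model output as JSON, rejecting sentinel leakage."""
    if SENTINEL_PATTERN.search(raw):
        raise ValueError("sentinel token leaked into model output")
    record = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object at the top level")
    missing = [name for name in required_fields if name not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return record


good = '{"governing_law": "Delaware", "effective_date": "2021-03-01"}'
print(parse_extraction(good, ["governing_law", "effective_date"]))
```

Every failure mode raises `ValueError`, so a caller can handle malformed JSON, missing fields, and sentinel leakage in one place.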

What scale actually shows

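T5-11B set state-of-the-art on SuperGLUE (88.9), GLUE (90.3), SQuAD v1.1 (90.1 exact match), and CNN/DailyMail (21.55 ROUGE-2) at the time of publication in 2020. Translation was the exception: the paper reports that T5 did not reach state of the art on any of the WMT tasks, and points to the English-only pretraining data as a likely reason.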

The scaling results show the expected pattern: more parameters, more pretraining steps, and more data all help, with no clear plateau at the scales the paper tested. The T5-3B model performs meaningfully better than T5-Large, and T5-11B better than T5-3B, on almost every benchmark. This isn't surprising in retrospect — it's the same finding as the scaling laws papers — but T5 demonstrated it empirically across a diverse task suite rather than just perplexity on a held-out corpus.

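The compute-optimal analysis wasn't the paper's focus (Chinchilla does that better), but the paper does run a small version of the question: given 4× the baseline compute, is it better spent on more training steps, a larger batch, a larger model, or an ensemble? All of them helped. A larger model was generally the strongest single use of the budget, and model size and training time turned out to be complementary rather than competing.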

Production tradeoffs nobody mentions in the benchmark posts

Encoder KV can be precomputed. If your input is fixed or slowly changing (a document in storage, a system prompt that doesn't change per request), the encoder forward pass runs once and the encoder hidden states are cached. Every decoder step reads the same encoder keys and values via cross-attention. At high QPS, this is a real serving advantage: the expensive part (encoding 4K tokens) runs once per document, not once per query. The catch is that the cache is only valid while the encoder input is identical, so anything that varies per query has to be fed to the decoder rather than prepended to the encoder input. For the contract extraction task I started with, each document encodes once and the decoder runs 18 separate extraction queries against the same encoder cache.
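The serving pattern, reduced to its shape. `encode_fn` and `decode_fn` are hypothetical callables standing in for the real encoder pass and the real decoding loop. The sketch only shows where the cache sits and what it is keyed on.

```python
import hashlib
from typing import Any, Callable, Dict


class EncoderCache:
    """Cache encoder outputs by a hash of the exact encoder input."""

    def __init__(self, encode_fn: Callable[[str], Any]) -> None:
        self._encode_fn = encode_fn
        self._states: Dict[str, Any] = {}

    def states_for(self, document: str) -> Any:
        key = hashlib.sha256(document.encode("utf-8")).hexdigest()
        if key not in self._states:
            # Expensive path: runs once per distinct document.
            self._states[key] = self._encode_fn(document)
        return self._states[key]


def extract_fields(document: str, field_prompts: Dict[str, str],
                   cache: EncoderCache,
                   decode_fn: Callable[[Any, str], str]) -> Dict[str, str]:
    """Run one decode per field against a single cached encoding."""
    states = cache.states_for(document)
    # The per-field prompt goes to the decoder, so the encoder input never changes.
    return {name: decode_fn(states, prompt) for name, prompt in field_prompts.items()}
```

Keying on a hash of the full encoder input makes the invalidation rule explicit: change one character of the document and you pay for a new encoding.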

Cross-attention cache is separate from self-attention cache. The T5 decoder has two KV caches: the self-attention cache (over generated tokens so far, grows with output length) and the cross-attention cache (over encoder hidden states, fixed per request). Most modern serving frameworks have excellent KV cache management for self-attention — PagedAttention, continuous batching, chunked prefill. Cross-attention KV is typically held static in GPU memory for the duration of a request. This is usually fine, but it means the cross-attention cache doesn't benefit from PagedAttention's memory defragmentation the same way self-attention does. For workloads with many concurrent long documents, you can hit GPU memory pressure from cross-attention KV before self-attention becomes the bottleneck.
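A back-of-envelope estimate makes the memory pressure concrete. The dimensions below (24 decoder layers, 16 heads, head dimension 64) are taken from the public T5-Large configuration and are an assumption of this sketch, as are fp16 storage and a 2,048-token document. Check them against your own checkpoint.

```python
def cross_attention_kv_bytes(encoder_tokens: int, decoder_layers: int,
                             num_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes needed to hold cross-attention keys and values for one document.

    Each decoder layer stores one key and one value vector per encoder token.
    """
    keys_and_values = 2
    return (keys_and_values * decoder_layers * encoder_tokens
            * num_heads * head_dim * bytes_per_value)


per_document = cross_attention_kv_bytes(
    encoder_tokens=2048, decoder_layers=24, num_heads=16, head_dim=64)
print(per_document / 2**20, "MiB per document")
print(100 * per_document / 2**30, "GiB for 100 concurrent documents")
```

Under these assumptions that is 192 MiB per document, so 100 concurrent documents need 18.75 GiB of cross-attention cache alone, before model weights or any self-attention cache.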

Serving infrastructure prefers decoder-only. vLLM, TGI (text-generation-inference), TensorRT-LLM, and most modern serving stacks are deeply optimized for decoder-only transformer architectures. T5/encoder-decoder support exists but is often a secondary path, and the pieces that tend to lag are FlashAttention integration for the cross-attention layers, tensor parallelism for the encoder, and continuous batching across encoder and decoder steps. T5's learned relative position bias is added to the attention logits, and fused attention kernels don't always accept an arbitrary additive bias, so a silent fallback to an unfused kernel is easy to miss. If you're deploying FLAN-T5 through a standard serving stack, verify that the encoder is actually running with FlashAttention and that cross-attention isn't materializing full attention matrices. I've seen deployments where the decoder benefited from FA but the encoder ran a naive attention kernel. The encoder was the bottleneck and nobody noticed, because the profiler attributed wall clock time to the decoder steps.

FLAN-T5 is what you should actually use. Chung et al. (2022) applied instruction tuning on top of T5, training on over 1,800 NLP tasks described as natural language instructions. The result, FLAN-T5, dramatically outperforms raw T5 on zero-shot and few-shot benchmarks. FLAN-T5-Large (780M parameters) matches or beats GPT-3 (175B parameters) on several standard benchmarks under zero-shot conditions. If you're building any application on top of T5 today, start with FLAN-T5: the instruction tuning adds essentially no serving cost and meaningfully improves output quality and instruction following.

Encoder-decoder has different scaling behavior for generation than decoder-only. For fixed parameter count, encoder-decoder allocates roughly half its parameters to the encoder and half to the decoder. A 770M parameter encoder-decoder model has ~385M parameters of decoder, about the same as GPT-2 medium. For generation tasks, the effective "generator" inside the encoder-decoder is smaller than a decoder-only model of the same total parameter count. This matters when you're comparing FLAN-T5-Large (780M) to Mistral-7B (7B decoder): the headline parameter counts are misleading. On total parameters the gap is about 9×, but the 7B model has roughly 18× more parameters doing generation (7B against roughly 390M). For extraction and classification tasks, the encoder quality matters as much as decoder size, which is why T5 punches above its weight on structured tasks despite the parameter count asymmetry.

Failure modes in practice

The multitask interference problem. T5 experimented with multi-task training: training on many downstream tasks simultaneously rather than doing standard single-task fine-tuning. The results showed a "multitask tax": tasks with large datasets (translation, summarization) dominated training, hurting low-resource tasks (some GLUE tasks). The mitigation the paper tested was temperature-scaled mixing. Start from each task's examples-proportional rate r_m (its share of the total examples, with dataset sizes capped at a limit K), raise it to the power 1/T, and renormalize so the rates sum to 1. T = 1 is plain proportional sampling, and larger T flattens the mix toward equal sampling. If you're doing multi-task fine-tuning on T5 for your own tasks and one task is dominating loss curves, this is likely the cause. Sample your high-resource tasks at a lower rate.
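A sketch of temperature-scaled mixing as the paper describes it: each task's examples-proportional rate is raised to the power 1/T and the results are renormalized. The function name, the dictionary interface, and the example dataset sizes are mine. `size_limit` plays the role of the paper's cap on dataset size.

```python
from typing import Dict, Optional


def mixing_rates(dataset_sizes: Dict[str, int], temperature: float = 1.0,
                 size_limit: Optional[int] = None) -> Dict[str, float]:
    """Return the sampling probability of each task under temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Optionally cap very large datasets before computing proportions.
    capped = {task: (min(n, size_limit) if size_limit is not None else n)
              for task, n in dataset_sizes.items()}
    total = sum(capped.values())
    # Examples-proportional rate, flattened by the exponent 1/T.
    scaled = {task: (n / total) ** (1.0 / temperature) for task, n in capped.items()}
    norm = sum(scaled.values())
    return {task: value / norm for task, value in scaled.items()}


sizes = {"translation": 1_000_000, "rte": 2_500}
for t in (1.0, 2.0, 8.0):
    print(t, mixing_rates(sizes, temperature=t))
```

With these sizes the small task is sampled about 0.25% of the time at T = 1 and about 4.8% at T = 2. As T grows the mix approaches uniform, which risks overfitting the small task instead.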

Fine-tuning rate sensitivity. T5 fine-tuning is more sensitive to learning rate than BERT-style fine-tuning. The paper used AdaFactor throughout: an inverse square root schedule for pretraining (a constant 10^{-2} for the first 10^{4} steps, then decaying as 1/sqrt(step)) and a constant 10^{-3} for fine-tuning. But downstream task fine-tuning typically works better with 10^{-4} to 5 × 10^{-4}. I've seen T5-Large fine-tuning diverge with 10^{-3} on small datasets: the pre-norm architecture is more stable at large scale, but that doesn't mean the paper's rate carries over to fine-tuning on 1,000 examples.
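For reference, the pretraining schedule described in the paper is an inverse square root with a 10,000-step warmup constant, and fine-tuning uses a constant rate. A sketch follows, with function names of my own choosing. The fine-tuning default of 1e-4 reflects the range suggested here, not the paper's 1e-3.

```python
import math


def inverse_sqrt_lr(step: int, warmup_steps: int = 10_000) -> float:
    """Inverse square root schedule: flat during warmup, then 1/sqrt(step)."""
    return 1.0 / math.sqrt(max(step, warmup_steps))


def constant_lr(step: int, lr: float = 1e-4) -> float:
    """Constant fine-tuning rate; 1e-4 is a conservative starting point."""
    return lr


for step in (1, 10_000, 40_000, 1_000_000):
    print(step, inverse_sqrt_lr(step))
```

The flat segment sits at 0.01 and the rate has halved by step 40,000. When fine-tuning on a small dataset, a short sweep over a few constant rates is cheap and usually more informative than reusing a published value.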

Decoder length sensitivity. T5 uses teacher forcing during fine-tuning — the decoder is given the ground-truth output tokens at each step. At inference, it generates autoregressively. For tasks where output length varies widely, the model can get into degenerate length regimes if it hasn't seen sufficient length variation during fine-tuning. The paper's text-to-text framing means you're typically fine-tuning with specific output templates; if your fine-tuning outputs are consistently short (one-word class labels), the model learns a strong prior for short outputs that can cause premature EOS generation on tasks requiring longer outputs.

Silent truncation on the encoder side. T5 was trained with a maximum input length of 512 tokens at every model size (the paper used 512 for most experiments). The relative position scheme means this isn't a hard architectural limit, but most tokenizer and data-pipeline configurations default to it, and many of them truncate longer inputs without raising an error. For the contract extraction use case, 512 tokens is far shorter than most contracts. You either need T5 with long-context fine-tuning, FLAN-T5 run at a longer input length such as 2,048 tokens, or Longformer-style modifications. Verify your tokenizer configuration is actually accepting the full sequence before assuming the model sees your complete input.
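One way to make truncation loud instead of silent is to check lengths yourself before the batch reaches the model. This is a sketch that works on a plain list of token ids, however you produced them. The function name and the `allow_truncation` flag are hypothetical.

```python
from typing import List, Sequence


def check_input_length(token_ids: Sequence[int], max_length: int,
                       allow_truncation: bool = False) -> List[int]:
    """Return ids that fit in max_length, failing loudly on overflow by default."""
    overflow = len(token_ids) - max_length
    if overflow <= 0:
        return list(token_ids)
    if not allow_truncation:
        raise ValueError(
            f"input is {len(token_ids)} tokens but the limit is {max_length}; "
            f"{overflow} tokens would be dropped"
        )
    # Explicit opt-in: keep the head of the document, drop the tail.
    return list(token_ids[:max_length])


document_ids = list(range(4000))  # stand-in for a tokenized ~4,000-token contract
try:
    check_input_length(document_ids, max_length=512)
except ValueError as err:
    print(err)
```

For extraction this matters more than usual, because fields like termination triggers and liability caps tend to sit late in a contract, which is exactly the part a head-only truncation drops.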

When not to use T5

Long-form generation. Open-ended text generation — blog posts, creative writing, multi-turn conversation — is where decoder-only models dominate. Instruction-tuned decoder-only models (LLaMA, Mistral, Gemma) are far better calibrated for long-form generation than encoder-decoder T5. The encoder compresses the input into fixed representations; the decoder generates conditional on that compression. For tasks where the output is much longer than the input, this asymmetry hurts.

In-context learning without fine-tuning. T5's pretraining doesn't train for in-context learning the way GPT-style models do. Prepending examples to the encoder doesn't reliably produce the pattern-following behavior that decoder-only models exhibit. If you're prototyping without fine-tuning and need to evaluate few-shot performance, decoder-only models are a better starting point. FLAN-T5 improves this significantly but still lags behind GPT-3-scale models on complex few-shot tasks.

When your serving stack is optimized for decoder-only. If you're already running vLLM or TensorRT-LLM in production with a decoder-only model fleet, adding encoder-decoder T5 requires validating a different code path through the serving stack. The operational overhead of maintaining two serving architectures may not be worth the gain on the task at hand. Run the benchmark first.

When you need instruction following without fine-tuning. Raw T5 (not FLAN-T5) has weak instruction following. The model was pretrained with task prefixes like "summarize:" and "translate:", not natural language instructions. If you're calling T5 without fine-tuning and expecting it to follow complex natural language task descriptions, you'll get worse results than a smaller decoder-only instruction-tuned model.

When your fine-tuning data has high output diversity. T5's span corruption pretraining trains the model to recover structured masked spans. If your fine-tuning task requires highly diverse, long, open-ended outputs that differ substantially from the fine-tuning distribution, T5 may underfit or produce formulaic outputs. Structured extraction with well-defined output schemas is a good fit. Open-ended generation with high variance is not.

What the paper actually gives you

T5's lasting contribution isn't the specific architecture or even the SOTA benchmarks from 2020. It's a demonstration that systematic ablation at scale is possible, and that many assumptions in NLP transfer learning at the time were wrong or underspecified.

The paper showed that span corruption matches or beats BERT-style masked LM at lower training cost, and the objective carried over to mT5, ByT5, and UL2. (BART, often grouped with these, was concurrent work rather than a descendant.) It showed that encoder-decoder outperforms encoder-only or decoder-only for sequence-to-sequence tasks, a finding that was partially overturned as decoder-only models scaled further, but which still holds at matched parameter counts for structured tasks. It showed that filtering a web crawl matters: the cleaned corpus beat the unfiltered one across diverse benchmarks, a result that the LLM training community has repeatedly re-confirmed.

The text-to-text framing proved durable not as an architecture constraint but as a design principle. Every instruction-tuned model (InstructGPT, FLAN, ChatGPT, Claude, LLaMA-chat) uses the same core idea: format the task as a sequence-to-sequence problem with natural language instructions. Most of these models are decoder-only, so the architecture is different, but the training signal comes from the same insight T5 operationalized at scale.

For the contract extraction task: we used FLAN-T5-Large with a 2,048-token encoder limit, fine-tuned on ~8,000 labeled contracts. Encoder KV was precomputed and cached per document at ingestion time. Inference was 12ms median latency for the 18-field extraction, on a single A10G. A decoder-only model of competitive accuracy on this task would have been 3-5× larger, slower at inference, and harder to constrain to the output schema. The encoder-decoder tradeoff was the right call — but only because we had fine-tuning data and a structured output requirement. Without both, I'd have used a smaller decoder-only model.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer — Raffel, Shazeer, Roberts, Lee, Narang, Matena, Zhou, Li, Liu. JMLR 2020.
Scaling Instruction-Finetuned Language Models (FLAN) — Chung, Hou, Longpre, Zoph, Tay, et al. 2022.