ZeRO: how Microsoft trained 100B-parameter models without model parallelism

Reading Rajbhandari et al. (Microsoft Research, SC'20) after watching an 8-GPU training job OOM at 6B parameters.

The job failed at step one. Eight A100s, 80 GB each, 640 GB aggregate GPU memory. The model was 6 billion parameters. The math looked fine — 6B parameters at 4 bytes each is 24 GB, well within the budget. We'd run 3B models without issue. The OOM came from somewhere else.

This is the standard trap: model parameter count is not model memory footprint during training. The parameters are only part of what lives on GPU. Once you add the optimizer state, the gradients, and the master-weight copies required for mixed-precision training, a "6B parameter model" consumes closer to 96 GB — per GPU, replicated identically across all eight. You have 640 GB of GPU memory storing the same 96 GB eight times over.

ZeRO — "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models," Rajbhandari, Rasley, Ruwase, and He, Microsoft Research, SC'20 — is the paper that names this problem precisely and eliminates it systematically. It's implemented in DeepSpeed and is now the default training strategy for any serious large-model work. If you're training anything above a few billion parameters without it, you're burning memory budget on redundancy that doesn't need to exist.

The problem the paper is actually solving

Standard data parallelism works by replicating the complete model state across every GPU. Each device processes a different batch, computes gradients independently, and then synchronizes via allreduce. The aggregated gradient update is identical to what you'd get from a single device with a batch N times larger. This is the entire strategy: replicate state, process independently, sync gradients.

The replication is the problem. Consider mixed-precision training with Adam, which is the default for large models. For a model with Ψ parameters:

Parameters: 2Ψ bytes (FP16 for forward and backward passes)

Gradients: 2Ψ bytes (FP16, accumulated during backward)

Optimizer state (Adam):

4Ψ bytes: FP32 master copy of parameters (required for numerical stability)
4Ψ bytes: first moment estimate (momentum m)
4Ψ bytes: second moment estimate (variance v)

Total optimizer state: 12Ψ bytes. Total per GPU: 16Ψ bytes.

For a 7.5B-parameter model: 16 × 7.5B = 120 GB per GPU. An A100 has 80 GB. The model doesn't fit on one GPU, and naïvely it doesn't fit on two. You need model parallelism — tensor parallelism, pipeline parallelism, or both — which introduces significant engineering complexity and communication overhead.

But look at what's actually happening. If you run this across 64 GPUs in standard data parallelism, every GPU stores all 120 GB. You're using 64 × 120 GB = 7.68 TB of aggregate GPU memory to store what is effectively one copy of the model state. The aggregate memory budget scales with GPU count, but the utilization is nearly zero — you're paying for 64 copies of the same data.

ZeRO's observation: none of this redundancy is necessary. The optimizer state, gradients, and parameters all have known access patterns. Every GPU needs all parameters for the forward pass, but only each GPU's assigned parameters need their full optimizer state at any given point. The paper partitions the state rather than replicating it.

The three stages of ZeRO

ZeRO introduces three progressive stages, each partitioning a different component of the model state.

Stage 1: partition the optimizer state

ZeRO-1 divides the optimizer state across all N data-parallel GPUs. Each GPU maintains only 1/N of the Adam first and second moments and FP32 master weights. Parameters and gradients are still replicated in full.

Memory per GPU:

2Ψ (FP16 params) + 2Ψ (FP16 grads) + 12Ψ/N (optimizer state)

At N=64: 4Ψ + 0.19Ψ ≈ 4.19Ψ bytes versus 16Ψ baseline — roughly 4× reduction.

The communication pattern is identical to standard data parallelism. After the backward pass, each GPU has computed gradients for the full model. Rather than allreduce, ZeRO-1 uses reduce-scatter: each GPU only reduces the gradient slice it owns (corresponding to its optimizer state partition). Then each GPU applies its optimizer update locally, producing an updated FP16 parameter slice. Finally, an allgather reconstructs the full parameter set across all GPUs before the next forward pass.

Total communication volume: same as allreduce. No regression.

Stage 2: partition gradients

ZeRO-2 extends Stage 1 by also partitioning gradients. Each GPU only stores the gradients for the parameters in its assigned optimizer state partition. Gradients for other partitions are reduced and discarded after the reduce-scatter.

Memory per GPU:

2Ψ (FP16 params) + 2Ψ/N (FP16 grads) + 12Ψ/N (optimizer state)
= 2Ψ + 14Ψ/N

At N=64: 2Ψ + 0.22Ψ ≈ 2.22Ψ bytes — roughly 8× reduction from baseline.

Communication volume remains the same as standard data parallelism. Gradients are reduced and discarded on the fly during the backward pass rather than accumulated in full, which means peak gradient memory is also lower than the static formula suggests.

Stage 3: partition parameters

ZeRO-3 goes further by partitioning the model parameters themselves. No GPU stores the full parameter set at any time. Each GPU owns 1/N of the parameters persistently.

Memory per GPU:

(2 + 2 + 12)Ψ/N = 16Ψ/N

At N=64: 0.25Ψ bytes — 64× reduction, scaling linearly with device count.

The catch is the forward pass. Attention and feed-forward layers need the full parameter tensor to compute. ZeRO-3 handles this with an allgather before each layer: all GPUs collectively reconstruct the current layer's parameters from their shards, compute the forward pass, then immediately discard the reconstructed weights to free memory. This repeats for every layer, forward and backward.

Communication volume: roughly 1.5× that of standard data parallelism. The allgather in the forward pass adds a new communication round that ZeRO-1 and ZeRO-2 don't incur. For a model with L layers, you have L allgathers (forward) + L reduce-scatters (backward) + L allgathers (parameter update) instead of the one allreduce from vanilla data parallelism.

What the paper actually reported

The headline result: ZeRO-2 achieved 8× the throughput of the previous state-of-the-art (Megatron-LM tensor parallelism) on a 7.5B-parameter model, running at 15.1 petaflops on 512 V100 GPUs.

More importantly: ZeRO-2 enabled training a 7.5B parameter model without any model parallelism on 32 V100 GPUs with 32 GB each. The same job would require tensor parallelism with 16-way splits on standard data parallelism — which involves significant code changes, careful layer slicing, and substantial communication overhead within each node.

ZeRO also produced Turing-NLG, then the largest language model in the world at 17B parameters. The paper demonstrated training of models up to 100B parameters, and the ZeRO analysis shows the theoretical path to 1T parameters with approximately 1024 GPUs.

Production tradeoffs no one mentions in the DeepSpeed tutorial

Interconnect bandwidth is everything for ZeRO-3. ZeRO-1 and ZeRO-2 have the same communication volume as standard data parallelism, so they work at whatever bandwidth you have. ZeRO-3 adds 50% more communication — but that 50% is allgathers that happen every single layer during both forward and backward passes. On NVLink within a node (600 GB/s bidirectional on A100), this is fine. Across nodes on InfiniBand HDR100 (200 Gb/s), ZeRO-3 throughput degrades significantly at large model sizes. On cloud instances connected via Ethernet, ZeRO-3 will often be slower than ZeRO-2 with model parallelism. The speedup numbers in the paper were measured on specialized clusters with high-bandwidth interconnect. Your cloud VMs are not that cluster.

ZeRO does not partition activation memory. The paper is specifically about the persistent model state: parameters, gradients, optimizer state. Activation memory — the intermediate tensors stored during the forward pass for use in backward — scales with batch size and sequence length and is entirely separate. For long-context training (8K+ tokens), activation memory can dominate over model state memory. ZeRO does nothing about this. You need activation checkpointing (gradient checkpointing) separately, which trades memory for compute by recomputing activations during the backward pass rather than storing them.

ZeRO-Offload is a different regime. DeepSpeed's ZeRO-Offload moves the optimizer state and gradients to CPU RAM, enabling training of very large models on limited GPU count. CPU RAM is cheap. The problem: CPU-GPU bandwidth (PCIe 4.0 at ~32 GB/s) is 60× slower than GPU HBM bandwidth. Every optimizer step involves a data movement between CPU and GPU. On long training runs, this introduces per-step overhead that is tolerable for large-model research (where you're compute-bound anyway) and unacceptable for training smaller models quickly. Teams that enable ZeRO-Offload on a 3B-parameter model and find training slower than baseline are hitting this overhead.

Mixing ZeRO with other parallelism strategies is non-trivial. For models that don't fit even with ZeRO-3 (100B+), you combine ZeRO with tensor parallelism and pipeline parallelism — the "3D parallelism" setup. The interaction between ZeRO's parameter partitioning and tensor parallelism's layer splitting requires careful configuration and can produce subtle correctness bugs if the partitioning is wrong. DeepSpeed has defaults, but they don't compose automatically for all architectures. MoE layers in particular require separate handling via ZeRO-Infinity or custom logic because expert weight routing creates non-uniform parameter access patterns.

The stage 3 → stage 2 decision is often wrong in practice. The theoretical 64× memory reduction from ZeRO-3 at 64 GPUs looks compelling. But ZeRO-2 at the same scale gives 8× reduction with no additional communication overhead, and for most models that fit within ZeRO-2's memory budget, ZeRO-2 will have materially better MFU (model FLOPs utilization). Teams that deploy ZeRO-3 because "it uses less memory" and then see 20-30% lower throughput versus ZeRO-2 have optimized the wrong metric. Use the weakest ZeRO stage that fits.

Failure modes in practice

The most common failure I've encountered: silent throughput collapse from misconfigured reduce_bucket_size.

ZeRO's gradient reduction is bucketed — gradients accumulate in a buffer that is flushed (reduced) when it fills. The default reduce_bucket_size in DeepSpeed is 500M parameters. For small models or shallow layers, this means gradients from the entire model may accumulate before any reduction happens, which defeats much of ZeRO-2's benefit by requiring peak gradient memory close to 2Ψ rather than 2Ψ/N. The reduction only overlaps with compute if gradients are flushed incrementally during the backward pass. Teams that benchmark ZeRO-2 and see less memory saving than expected have often not tuned this parameter.

The second failure mode: ZeRO-3 correctness on tied weights. Many language models tie the embedding and output projection weights — the same tensor is used in two different places in the computation graph. ZeRO-3's allgather mechanism assumes each parameter tensor appears once in the model's parameter list. Tied weights break this assumption because the same allgather result is consumed at two different points. DeepSpeed has explicit handling for this, but it requires calling deepspeed.zero.register_external_parameter() for tied weights. If you miss this, the backward pass produces incorrect gradients silently — the optimizer state for the tied parameter gets updated only once when it should be updated twice, causing subtle model quality degradation that doesn't appear as a training crash.

The third failure mode: ZeRO-Infinity on NVMe in a high-throughput training run. ZeRO-Infinity extends ZeRO-3 to offload model state to NVMe storage, theoretically enabling trillion-parameter training on small clusters. NVMe sequential read bandwidth is around 3-7 GB/s per drive. For a 100B model, moving the optimizer state to NVMe and back each step takes tens of seconds. This is appropriate for exploratory research where throughput isn't the goal. Teams that treat ZeRO-Infinity as a path to cheap large-model training get throughput numbers 10-100× worse than GPU-resident training and abandon it without understanding why.

When not to use ZeRO

When your model fits on one GPU with headroom. ZeRO adds a layer of distributed state management. For models under 1-2B parameters where all model state fits on a single high-memory GPU, training with gradient accumulation on a single GPU is simpler, faster to debug, and doesn't require distributed infrastructure. ZeRO is a solution to a specific resource constraint; if the constraint doesn't exist, don't introduce the complexity.

When you're memory-limited by activations, not model state. If your OOM is happening because you're training with long sequences (>4K tokens) at large batch size, ZeRO won't help. The activation memory during a forward pass is roughly proportional to batch_size × seq_len × hidden_dim × num_layers, and ZeRO partitions none of it by default. Activation checkpointing is the right tool here, and it works independently of ZeRO.

When your interconnect can't support ZeRO-3's communication pattern. If you're training across nodes connected by 10GbE or standard cloud networking, ZeRO-3's per-layer allgathers will serialize your forward pass. In this regime, keeping model state in GPU memory and using ZeRO-2 (same communication as standard data parallelism) with more nodes will beat ZeRO-3 with fewer nodes.

When you're debugging a training correctness issue. ZeRO-3 partitions parameters at runtime, allgathers them layer by layer, and discards them immediately after. Any bug that manifests differently depending on whether the full parameter tensor is resident — initialization logic, tied weights, weight norm constraints, custom backward hooks — will be harder to reproduce and inspect. Start debugging in ZeRO-1 or no-ZeRO mode, confirm the bug, then switch back.

When you need deterministic gradient computation for analysis. ZeRO uses different reduction patterns than standard allreduce. Floating-point reduction order affects rounding. If you're doing gradient analysis, gradient SNR monitoring, or loss surface investigation that depends on exact gradient values, ZeRO may produce different numerical results than baseline. The difference is usually negligible for training, but it's not zero.

What the paper actually gives you

ZeRO is a recognition that distributed training's memory problem was largely self-inflicted. Standard data parallelism was designed for an era when models fit comfortably on one GPU; the redundant replication was a side effect of the simplicity, not a requirement. Once you analyze exactly when each component of the model state needs to be where, the redundancy dissolves.

The staged design — partition optimizer state first, then gradients, then parameters — matters for adoption. ZeRO-1 and ZeRO-2 require zero changes to model code, have no communication overhead versus data parallelism, and recover 4-8× of memory. For most teams working in the 7B-30B parameter range today, ZeRO-2 is the default choice: same code, same communication, half to one-eighth the memory. Stage 3 is for the cases where 8× isn't enough, and it brings real engineering tradeoffs.

The 8-GPU training job that started this post? It was ZeRO-1 with Adam. The optimizer state per GPU dropped from 90 GB to under 12 GB at 8-way partitioning. The job ran. The training throughput was identical to standard data parallelism because ZeRO-1 doesn't change the communication at all.

That's the paper's actual contribution: it demonstrates that the memory wall in large-model training is mostly a software problem, not a hardware constraint. The GPUs had enough aggregate memory all along.

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models — Rajbhandari, Rasley, Ruwase, He. SC'20 (2020).