Mixture of experts: what the architecture actually does to your inference budget

Reading Fedus et al. (2022) and Jiang et al. (2024) while planning a move from dense to sparse models.

You're running Mixtral 8x7B. The name implies 56B parameters — eight experts, each a 7B model. The actual parameter count is 46.7B, because the experts share embedding layers and attention weights. But here's what matters operationally: for any given token, only two of the eight experts activate. Your per-token compute is closer to a 12.9B dense model, not 46.7B.

This is the MoE trade: you get the stored capacity of a large model with the runtime cost of a smaller one. "Mixture of Experts" sounds like an ensemble technique, but it isn't — it's a parameter efficiency strategy. The model learns to route different kinds of computation to different weight partitions, and only pays for the partition it needs at each step.

Understanding exactly how this works — and where it breaks — changes how you provision hardware, batch requests, and interpret inference benchmarks.

Why dense scaling hits a wall

The standard transformer FFN is a two-layer network applied identically to every token position:

FFN(x) = max(0, xW₁ + b₁)W₂ + b₂

Every token, every position, uses exactly the same W₁ and W₂. The model can't decide to use different parameters for code tokens versus prose tokens versus math tokens. All capacity is applied uniformly to all inputs. This is computationally clean but wasteful — a model that's excellent at code uses its "prose parameters" for code inputs whether or not they contribute anything.

Scaling a dense model means scaling W₁ and W₂. Doubling parameter count doubles compute and memory. The compute-to-capacity ratio is fixed by design.

MoE breaks that ratio. You can multiply the parameter count without multiplying the compute cost, because not all parameters are active for every token.

The routing mechanism

The foundational idea comes from Shazeer et al., 2017 — "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." The practical version used in production models is from Switch Transformers (Fedus et al., 2022) and Mixtral (Jiang et al., 2024).

In an MoE layer, the dense FFN is replaced by E expert networks — each independently parameterized — plus a router:

import torch
import torch.nn.functional as F

def moe_layer(x, experts, router):
    # x: (batch_size * seq_len, d_model)
    router_logits = router(x)  # (tokens, num_experts)

    # Top-K selection: activate only K experts per token
    top_k_logits, top_k_indices = torch.topk(router_logits, k=2, dim=-1)
    routing_weights = F.softmax(top_k_logits, dim=-1)  # normalize selected weights

    output = torch.zeros_like(x)
    for expert_idx, expert in enumerate(experts):
        # Tokens assigned to this expert (either slot 0 or slot 1)
        token_mask = (top_k_indices == expert_idx).any(dim=-1)
        if not token_mask.any():
            continue

        expert_out = expert(x[token_mask])

        # Each selected token may have expert_idx in slot 0 or slot 1
        for slot in range(2):
            slot_mask = token_mask & (top_k_indices[:, slot] == expert_idx)
            weight = routing_weights[slot_mask, slot]
            output[slot_mask] += weight.unsqueeze(-1) * expert_out[slot_mask[token_mask]]

    return output

The router is a linear projection from d_model to E outputs — a cheap operation relative to the expert itself. For each token, it scores all E experts and selects the top K (K=2 in Mixtral, K=1 in Switch Transformers). The selected experts process the token; the others are skipped entirely.

The router weights are trained jointly with the expert parameters. The model learns both what to compute (the experts) and what to compute it on (the router). Routing is not hand-designed — it emerges from training.
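To make the top-K step concrete for a single token, here is a minimal pure-Python sketch. The logit values are made up for illustration, and `route_token` is a hypothetical helper, not a function from any library.

```python
import math

def route_token(logits, k=2):
    """Pick the top-k experts for one token and softmax over just those logits."""
    # Indices of the k largest router logits, highest first
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to the selected logits (subtract the max for stability)
    m = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Illustrative router output for one token over 8 experts
logits = [0.1, 2.0, -1.3, 0.4, 1.0, -0.2, 0.0, 0.7]
selected = route_token(logits, k=2)
print(selected)  # experts 1 and 4, with weights that sum to 1
```

The six unselected experts never appear in the result, which is the whole point: their FFNs are not evaluated for this token.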

How Mixtral actually works

Mixtral places MoE layers only at the FFN positions — attention layers are dense and shared across all tokens. This is the standard design: attention handles "where to look in the sequence," experts handle "what computation to perform on each position." The FFN is where most of a transformer's parameters live (roughly 2/3 of total for typical d_ff/d_model ratios), so this is where the efficiency gain is largest.

Mixtral 8x7B specifics from the technical report:

8 experts per MoE layer, each a standard SwiGLU FFN with d_ff = 14,336
Top-2 routing: every token activates exactly 2 experts per layer
32 MoE layers replacing the standard FFN blocks
Shared components: embeddings, attention (Q/K/V/O projections), RMSNorm; these are not replicated per expert

Total stored parameters: 46.7B. Active parameters per token: 12.9B. The 3.6× ratio between stored and activated parameters is the MoE efficiency ratio. Compute tracks activated parameters; memory tracks stored parameters.
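The two headline numbers can be reproduced with a few lines of arithmetic. The sketch below assumes hyperparameters from the published Mixtral 8x7B configuration that are not all stated in this article (d_model = 4096, 32 query heads and 8 key/value heads of dimension 128, a 32,000-token vocabulary, untied input and output embeddings), so treat it as a sanity check rather than an official breakdown.

```python
# Assumed from the published Mixtral 8x7B configuration; only d_ff, the expert
# count, top-k and the layer count are stated in the text above.
d_model, n_layers, d_ff = 4096, 32, 14336
n_heads, n_kv_heads, head_dim = 32, 8, 128
vocab_size, n_experts, top_k = 32000, 8, 2

expert = 3 * d_model * d_ff                           # SwiGLU: gate, up, down projections
attention = (2 * d_model * n_heads * head_dim         # Q and O projections
             + 2 * d_model * n_kv_heads * head_dim)   # K and V (grouped-query attention)
router = d_model * n_experts                          # linear map from d_model to E scores
norms = 2 * d_model                                   # two norm weight vectors per layer
shared_per_layer = attention + router + norms
outside_layers = 2 * vocab_size * d_model + d_model   # input embedding, output head, final norm

stored = n_layers * (shared_per_layer + n_experts * expert) + outside_layers
active = n_layers * (shared_per_layer + top_k * expert) + outside_layers

print(f"stored: {stored / 1e9:.1f}B, active: {active / 1e9:.1f}B, ratio: {stored / active:.1f}x")
# prints: stored: 46.7B, active: 12.9B, ratio: 3.6x
```

The only term that differs between the two totals is the expert count multiplying `expert`, which is why the stored-to-active ratio lands well below 8/2 = 4: the shared attention and embedding weights are counted once in both.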

The load balancing problem

Here's where theory diverges from production reality: without constraints, the router learns to favor a small subset of experts.

Why? During early training, some experts are marginally better at certain inputs by random initialization. The router routes more tokens to them, they receive more gradient signal, they improve more, the router routes even more tokens to them. Left unchecked, this collapses to 1-2 dominant experts with 6-7 idle ones — you've paid for 8 experts and are running 2.

Switch Transformers introduces an auxiliary load-balancing loss to prevent collapse:

import torch

def load_balancing_loss(router_probs, top_k_indices, num_experts):
    # router_probs: (tokens, num_experts), softmax over all experts
    # top_k_indices: (tokens, K), integer expert ids chosen per token

    # f_i: fraction of routing assignments dispatched to expert i
    counts = torch.bincount(top_k_indices.flatten(), minlength=num_experts)
    f = counts.float() / top_k_indices.numel()

    # P_i: mean routing probability mass assigned to expert i
    P = router_probs.mean(dim=0)  # (num_experts,)

    # Equals 1 under uniform routing; grows as load concentrates on few experts
    loss = num_experts * (f * P).sum()
    return loss

# In the training loop (task_loss and alpha come from the surrounding code):
total_loss = task_loss + alpha * load_balancing_loss(router_probs, top_k_indices, num_experts)

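The loss couples the dispatch fractions (f_i) with the routing probability mass (P_i): it evaluates to 1 under perfectly uniform routing and grows as dispatch and probability mass concentrate on the same experts. Because f_i comes from a hard top-K selection, the gradient flows only through P_i. The coefficient alpha controls the trade-off: too high and you force artificial uniformity that hurts quality; too low and you risk expert collapse. Switch Transformers settled on alpha = 0.01 after sweeping values from 1e-5 to 1e-1, which makes 0.01 a reasonable starting point to tune from.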
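A small worked example shows the scale of the loss at the two extremes. This is a pure-Python restatement of the same formula for top-1 routing, with hand-written toy inputs; `balance_loss` is a hypothetical name used only here.

```python
def balance_loss(assignments, router_probs, num_experts):
    """Switch-style auxiliary loss, N * sum_i f_i * P_i, for top-1 routing."""
    n_tokens = len(assignments)
    # f_i: fraction of tokens dispatched to expert i
    f = [assignments.count(i) / n_tokens for i in range(num_experts)]
    # P_i: mean router probability assigned to expert i
    P = [sum(p[i] for p in router_probs) / n_tokens for i in range(num_experts)]
    return num_experts * sum(fi * Pi for fi, Pi in zip(f, P))

E = 4
# Balanced: each expert gets one token and the router is uniform
balanced = balance_loss([0, 1, 2, 3], [[0.25] * E] * 4, E)
# Collapsed: every token goes to expert 0 with probability 1
collapsed = balance_loss([0, 0, 0, 0], [[1.0, 0.0, 0.0, 0.0]] * 4, E)
print(balanced, collapsed)  # 1.0 4.0
```

Full collapse onto one expert drives the loss to the number of experts, so the unscaled term ranges from about 1 to E. That range is worth keeping in mind when choosing alpha relative to the task loss.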

In deployment, expert utilization is worth monitoring. With top-2 routing over eight experts, perfectly balanced routing sends each expert 25% of tokens. If one expert is receiving more than 40% of tokens in a production trace, either training regularization was insufficient or the incoming distribution has shifted significantly from the training data.
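A utilization check of this kind can be a few lines over logged routing decisions. The sketch below assumes you can export the selected expert ids per token from your serving stack, which is an assumption about your tooling; the function names and the toy trace are illustrative.

```python
from collections import Counter

def expert_token_share(top_k_indices, num_experts):
    """Fraction of tokens that selected each expert (shares sum to K, not 1)."""
    counts = Counter(e for chosen in top_k_indices for e in chosen)
    n_tokens = len(top_k_indices)
    return [counts.get(i, 0) / n_tokens for i in range(num_experts)]

def overloaded_experts(top_k_indices, num_experts, threshold=0.40):
    """Expert ids whose token share exceeds the alert threshold."""
    shares = expert_token_share(top_k_indices, num_experts)
    return [i for i, share in enumerate(shares) if share > threshold]

# Toy trace: 5 tokens, top-2 of 8 experts; expert 0 is picked by 3 of 5 tokens
trace = [(0, 1), (0, 2), (0, 3), (4, 5), (6, 7)]
hot = overloaded_experts(trace, num_experts=8)
print(hot)  # [0]
```

In practice you would compute this per layer and over a sliding window, since a single short batch can look skewed for benign reasons.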

Expert parallelism: the deployment topology

For dense models, tensor parallelism splits weight matrices across GPUs, requiring an all-reduce after each attention and FFN block. MoE introduces a different axis: expert parallelism (EP).

In expert parallelism, each expert lives on a different device. A token's forward pass requires:

1. Compute router logits (all devices, cheap dense operation)
2. All-to-all dispatch: send each token to the device(s) holding its selected experts
3. Run the expert FFN locally on each device
4. All-to-all gather: return expert outputs to the originating device
5. Weighted combination and continue
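The sequence above can be simulated in a single process to see what actually moves between devices. This is a toy sketch, not a distributed implementation: hidden states are scalars, expert i lives on device i, and the experts are stand-in functions. All names are hypothetical.

```python
def moe_expert_parallel(tokens, top_k_indices, routing_weights, experts):
    """Single-process simulation of dispatch, local expert compute, and gather."""
    # Dispatch (all-to-all): bucket each (token, slot) by the device holding its expert
    inbox = {device: [] for device in range(len(experts))}
    for token_id, chosen in enumerate(top_k_indices):
        for slot, expert_id in enumerate(chosen):
            inbox[expert_id].append((token_id, slot))

    # Local compute, then gather (all-to-all): results return keyed by token
    returned = []
    for device, items in inbox.items():
        for token_id, slot in items:
            returned.append((token_id, slot, experts[device](tokens[token_id])))

    # Weighted combination on the originating side
    output = [0.0] * len(tokens)
    for token_id, slot, value in returned:
        output[token_id] += routing_weights[token_id][slot] * value

    load = {device: len(items) for device, items in inbox.items()}
    return output, load

# Toy setup: expert i multiplies its input by (i + 1)
experts = [lambda x, s=i + 1: s * x for i in range(4)]
tokens = [1.0, 2.0, 3.0]
top_k_indices = [(0, 1), (0, 2), (0, 3)]
routing_weights = [(0.5, 0.5), (0.75, 0.25), (1.0, 0.0)]
output, load = moe_expert_parallel(tokens, top_k_indices, routing_weights, experts)
print(output, load)  # [1.5, 3.0, 3.0] {0: 3, 1: 1, 2: 1, 3: 1}
```

The `load` dictionary is the part to watch: device 0 receives three token copies while the others receive one each, so the step takes as long as device 0 does.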

The two all-to-all operations are the communication cost you're trading for the compute savings. This is fundamentally different from tensor parallelism's all-reduce. All-to-all is harder to optimize because:

Communication volume is variable per step (depends on routing decisions)
Load imbalance means some devices finish early and wait for others
Network topology matters acutely: NVLink within a node is roughly 10× faster than InfiniBand across nodes, so expert parallelism crossing node boundaries is expensive

For Mixtral 8x7B on 8 GPUs with EP=8, you get one expert per GPU. Throughput scales well; per-request latency is dominated by the two all-to-all hops unless you're on a tightly coupled NVLink fabric.

What the benchmark numbers actually mean

Mixtral 8x7B is reported as matching or outperforming Llama 2 70B on most benchmarks while using roughly one-fifth the active parameters per token (12.9B versus 70B). That comparison is real but requires unpacking.

The 12.9B active parameters per token in Mixtral are not equivalent to 12.9B parameters in a dense model. All 46.7B parameters must be resident in GPU memory, because any expert can be selected for the next token; they're not swapped in and out per token at inference time. The compute is reduced; the memory footprint is not.

For fp16, you need ~93GB of GPU RAM for the Mixtral 8x7B weights alone, before KV cache and activations. For fp8/int8, roughly 47GB. A single A100 (80GB) can hold the quantized model but not the full fp16 version. Compare to Llama 2 13B at ~26GB fp16: if your constraint is GPU memory rather than compute, MoE doesn't help you.
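These figures are just parameter count times bytes per parameter. A minimal sketch of that arithmetic follows, counting weights only and using 1 GB = 1e9 bytes; the helper name is made up for this example.

```python
def weight_memory_gb(n_params_billion, bytes_per_param):
    """Weights-only footprint in GB (1 GB = 1e9 bytes); excludes KV cache and activations."""
    return n_params_billion * bytes_per_param

footprints = {
    "mixtral_fp16": weight_memory_gb(46.7, 2),    # about 93 GB
    "mixtral_int8": weight_memory_gb(46.7, 1),    # about 47 GB
    "llama2_13b_fp16": weight_memory_gb(13.0, 2), # about 26 GB
}
# Which of these fit, weights only, on a single 80 GB device?
fits_80gb = {name: gb < 80 for name, gb in footprints.items()}
print(fits_80gb)
```

Note that the calculation uses the stored parameter count throughout. Substituting the 12.9B active count here would understate the requirement by more than 3x.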

Throughput vs. latency also diverges for MoE:

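High-batch, throughput workloads: MoE wins clearly. Active compute per token is lower, so you process more tokens per second for the same FLOP budget, and the all-to-all communication amortizes across the batch.

Low-batch, latency workloads: MoE's advantage shrinks. Small-batch decoding is bound by memory bandwidth rather than FLOPs, so the cost of a step is set by how many weights are read. At batch size 1 that is roughly the 12.9B active parameters, about the same as a dense 13B model, so the sparse model has no compute edge to cash in. With even a few concurrent sequences, tokens select different experts and most of the 46.7B parameters are read each step. Add routing overhead, and Mixtral will often be no faster than a dense 13B model at small batch sizes, and can be slower.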
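One way to see how quickly weight traffic grows with batch size is to estimate how many distinct experts a decode step touches per layer. The sketch below assumes each token picks its experts uniformly and independently, which real routers do not do, so read the numbers as a rough guide only.

```python
def expected_experts_touched(batch_tokens, num_experts=8, top_k=2):
    """Expected distinct experts hit per layer, assuming uniform, independent routing."""
    # Probability that a given expert is selected by none of the tokens
    p_untouched = (1 - top_k / num_experts) ** batch_tokens
    return num_experts * (1 - p_untouched)

for b in (1, 2, 4, 8, 16):
    print(b, round(expected_experts_touched(b), 2))
```

Under these assumptions a single token touches 2 of 8 experts per layer, but 8 concurrent tokens already touch more than 7 on average. Per-step weight reads therefore climb toward the full stored size long before the batch is large enough to become compute-bound.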

Failure modes in production

Expert collapse during fine-tuning. Retraining or fine-tuning a pretrained MoE model without the load-balancing loss can collapse routing in a few hundred steps. Always include the auxiliary loss during any continued training, even for short fine-tuning runs.

Routing brittleness to prompt changes. MoE routing is input-dependent and discrete (top-K creates hard selection). Small prompt changes — adding a system prompt, changing delimiter format, language switching — can shift which experts are selected and meaningfully change output behavior in ways that aren't predictable from dense model behavior. A/B testing prompt formats on MoE models should treat routing as a hidden variable affecting results.

Expert capacity overflow in batched inference. If a batch contains many semantically similar inputs, one expert may receive far more tokens than expected. Implementations with a fixed per-expert capacity handle this by dropping overflow tokens or by reserving extra room through a larger capacity factor. A dropped token gets no expert output at that layer and passes through on the residual connection alone, which degrades that position without raising an error. Check whether your serving stack enforces capacity limits at all; if it does, monitor per-expert capacity utilization per batch and handle overflow explicitly. This failure is silent by default.
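A capacity limit and its overflow are easy to sketch. The version below assumes a Switch-style capacity of capacity_factor × (assignments / experts), first-come-first-served admission, and an illustrative capacity factor of 1.25; actual serving stacks may use a different policy or no limit at all.

```python
import math

def apply_capacity(top_k_indices, num_experts, capacity_factor=1.25):
    """Admit assignments in arrival order until an expert is full; report the rest."""
    n_assignments = sum(len(chosen) for chosen in top_k_indices)
    capacity = math.ceil(capacity_factor * n_assignments / num_experts)
    used = [0] * num_experts
    kept, dropped = [], []
    for token_id, chosen in enumerate(top_k_indices):
        for expert_id in chosen:
            if used[expert_id] < capacity:
                used[expert_id] += 1
                kept.append((token_id, expert_id))
            else:
                dropped.append((token_id, expert_id))
    return capacity, kept, dropped

# 8 tokens, top-1 over 4 experts; six of them want expert 0
routing = [(0,), (0,), (0,), (0,), (0,), (0,), (1,), (2,)]
capacity, kept, dropped = apply_capacity(routing, num_experts=4)
print(capacity, dropped)  # 3 [(3, 0), (4, 0), (5, 0)]
```

Logging `len(dropped)` per batch is the cheapest way to turn this silent failure into a visible metric.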

Suboptimal performance on non-MoE-aware serving stacks. Expert parallelism, capacity management, and load-balancing monitoring require infrastructure beyond standard dense model serving. Running Mixtral on inference software designed for dense models — without MoE-specific dispatch optimizations — leaves significant performance on the table while paying the full memory cost. vLLM and TGI both have MoE-specific optimizations; verify they're enabled.

When NOT to use MoE

Latency-sensitive, single-request workloads. If p50 latency on individual requests is your primary constraint, MoE's memory footprint (full stored size, not active size) with routing overhead often makes dense models faster at batch size 1. A well-optimized dense 13B will usually serve individual requests faster than Mixtral 8x7B.

Memory-constrained deployments. If you can fit a 13B dense model but not a 47B MoE model, the trade doesn't exist for you. MoE saves compute, not memory.

Narrow, homogeneous task fine-tuning. MoE's advantage comes from routing different computation to different input types. If you're fine-tuning for a single, uniform task (SQL generation, structured extraction, a specific domain), much of the routing diversity goes unused. A smaller dense model fine-tuned on your distribution can often match a larger MoE on that narrow task, with less serving complexity.

Teams without MoE-aware serving infrastructure. Expert parallelism, overflow handling, and load-balancing observability are non-trivial to implement correctly. If your team is standing up inference infrastructure for the first time, start with a dense model. The operational complexity of MoE is real and scales with team inexperience.

What actually changed between the papers

Shazeer et al. (2017) showed that sparsely gated MoE layers work at scale, inserting them between stacked LSTM layers rather than in a transformer. But their implementation required noisy top-K gating with a learned noise term to encourage exploration, and training was unstable at scale.

Switch Transformers (2022) simplified routing: K=1 (one expert per token), standard softmax gating, auxiliary load-balancing loss. Simpler routing, more stable training, and they scaled to trillion-parameter models on TPUs. The quality cost of K=1 relative to K=2 is real but acceptable for throughput-dominated workloads.

Mixtral (2024) returned to K=2, recovering quality from the K=1 simplification while keeping the expert count manageable at eight per layer. The report says little about the training recipe, but the released architecture is a standard decoder-only transformer with a softmax over the top-2 router logits and no exotic routing machinery.

The trend is toward simplicity. The gap between paper descriptions and production requirements has narrowed considerably.

The bottom line

MoE is a genuine architectural improvement for large-scale inference where throughput matters and memory footprint is acceptable. You get the model quality of a much larger dense model at a fraction of the per-token compute cost. The trade-offs are concrete: memory doesn't scale down with sparse activation, routing infrastructure adds operational complexity, and load balancing requires active monitoring.

The right frame: MoE doesn't make a model cheaper to host — it makes hosted capacity more computationally efficient. If your bottleneck is GPU hours per output token, MoE is compelling. If your bottleneck is GPU count or GPU RAM, the case is weaker.

Read the Switch Transformers paper for the theoretical grounding — particularly the capacity factor analysis and the auxiliary loss derivation. Read the Mixtral technical report for the practical implementation decisions that work in production. Both are shorter than you'd expect.