LoRA: what Microsoft's fine-tuning paper actually says about low-rank adaptation

Reading Hu et al. (ICLR 2022) while deciding how to fine-tune production language models.

We needed our model to follow the company's output format reliably. The base model was close: it had the reasoning capability, but it kept producing prose when we needed structured JSON, and it ignored domain-specific terminology we'd spent months defining. The obvious fix was fine-tuning. The obvious problem was that full fine-tuning of a 70B model means holding the 70B weights, a gradient for each of them, and optimizer state for each of them in memory at once. On a single A100 (80GB), that's not possible. Across a rack of GPUs, it's expensive.

LoRA gave us a better trade. After reading the paper, I understood why it works, when it doesn't, and what the implementation choices actually control. Most LoRA tutorials explain what to set without explaining why those settings exist.

The insight the paper is built on

LoRA — Low-Rank Adaptation of Large Language Models, Hu et al., ICLR 2022 — starts from an observation by Aghajanyan et al. (2021): pre-trained language models have a low "intrinsic dimensionality." When you fine-tune a large model, the changes to the weights don't explore the full high-dimensional weight space; they stay close to a much lower-dimensional subspace.

If that's true, you shouldn't need to update all the weights explicitly. You could parameterize the update as a low-rank matrix and recover most of the fine-tuning performance with far fewer parameters.

The paper tests this hypothesis on RoBERTa, DeBERTa, GPT-2, and GPT-3. On the benchmarks it reports, it holds.

What LoRA actually does to the weight matrices

Take a pretrained weight matrix W₀ ∈ ℝ^(d×k). Full fine-tuning updates it to W₀ + ΔW, training all d×k values in ΔW.

LoRA instead decomposes the update:

ΔW = BA

where B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k), and r << min(d, k). The rank r is a hyperparameter you set — typically 4, 8, 16, or 64.

The modified forward pass becomes:

h = W₀x + (α/r) BAx

W₀ is frozen. Only A and B are trained. α is a scaling constant (more on this later).

Initialization: A is initialized with random Gaussian values. B is initialized to zero. This means BA = 0 at the start of training — the adapted model is identical to the base model at initialization. Training from a known starting point (the pretrained model's behavior) rather than from random noise is the right inductive bias for adaptation.
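The forward pass and initialization described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the paper's implementation: the class name `LoRALinear`, the toy dimensions, and the 0.01 standard deviation used for A are all assumptions made for the example.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W0 plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, W0, r, alpha, seed=0):
        d, k = W0.shape
        rng = np.random.default_rng(seed)
        self.W0 = W0                                  # frozen, shape (d, k)
        self.A = rng.normal(0.0, 0.01, size=(r, k))   # random Gaussian init (std is an assumption)
        self.B = np.zeros((d, r))                     # zero init, so B @ A == 0 at step 0
        self.scale = alpha / r

    def forward(self, x):
        # h = W0 x + (alpha / r) * B (A x); computing A x first avoids forming the d x k product
        return self.W0 @ x + self.scale * (self.B @ (self.A @ x))


rng = np.random.default_rng(1)
W0 = rng.normal(size=(64, 32))   # toy sizes standing in for d = k = 4096
x = rng.normal(size=32)

layer = LoRALinear(W0, r=4, alpha=4)
base_out = W0 @ x
lora_out = layer.forward(x)
print(np.allclose(base_out, lora_out))  # True: the adapter is a no-op before training
```

Only `A` and `B` would receive gradients in a real training loop; `W0` is never updated.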

Parameter count: For a single weight matrix with d=k=4096 (typical in transformer attention), full fine-tuning trains 4096×4096 = 16,777,216 parameters (about 16.8M). LoRA with r=4 trains 4×(4096+4096) = 32,768 parameters, which is 512x fewer for that layer alone.
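The same arithmetic, written out so it can be rerun for other shapes and ranks. The function names are hypothetical helpers for this sketch.

```python
def full_params(d, k):
    """Trainable values when every entry of a d x k matrix is updated."""
    return d * k


def lora_params(d, k, r):
    """Trainable values in B (d x r) plus A (r x k)."""
    return r * (d + k)


d = k = 4096
for r in (4, 8, 16, 64):
    ratio = full_params(d, k) / lora_params(d, k, r)
    print(f"r={r:>2}: {lora_params(d, k, r):>9,} trainable params, {ratio:.0f}x fewer")
```

For r=4 this gives 32,768 trainable parameters and a ratio of 512. The ratio shrinks linearly as r grows.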

Which weight matrices to apply it to

A transformer's attention layer has four weight matrices: W_q (query), W_k (key), W_v (value), W_o (output projection). The FFN layers have two more.

The paper compares applying LoRA to different combinations. Key findings:

Under a fixed budget of trainable parameters (the paper's GPT-3 ablation), adapting W_q and W_v together gives the best overall results. Spending the whole budget on a single matrix type at a higher rank (r=8) does worse than splitting it across W_q and W_v at r=4, and spreading it across all four attention matrices at r=2 performs about as well as the W_q/W_v split. The takeaway: it is better to adapt more matrix types at a low rank than one type at a high rank.

The paper does not ablate the FFN layers. It freezes them and adapts only the attention weights, for simplicity and parameter efficiency. Most implementations default to attention-only.

In practice: start with W_q and W_v. Add W_k and W_o if you have the parameter budget or see that W_q/W_v adaptation isn't sufficient. Add FFN layers last.

The hyperparameters that actually matter

Rank (r): This is the single most consequential setting. For GPT-3 with adapters on W_q and W_v, the paper finds that r=4 performs comparably to r=64 on the tasks it tests (WikiSQL and MultiNLI). This is the empirical support for the low intrinsic dimensionality hypothesis: if the weight updates are truly low-rank, increasing r past 4–8 adds parameters without adding signal.

Where higher r helps: tasks that require the model to learn genuinely new behavioral patterns rather than just adapting existing ones. A model learning to consistently produce JSON is usually well served by r=4. A model learning to reason about a specialized domain it wasn't trained on at all may need r=16–64. In practice, r>64 rarely justifies its cost on standard fine-tuning tasks.

Alpha (α): The scaling factor in (α/r) BAx. The paper sets α to the first r it tries and does not tune it further. It also notes that, with Adam, tuning α is roughly the same as tuning the learning rate, so α/r behaves like a multiplier on the adapter's learning rate. If you double r without changing α, the multiplier on BA halves; the 1/r factor is there to reduce the need to retune other hyperparameters each time you change r. Setting α=r for every run instead pins the multiplier at 1, which also makes for clean ablations.

Most implementations default to α=r or α=2r. If you're getting underfitting (model barely changed from base), increase α. If you're getting catastrophic forgetting (model overwrites base behavior), decrease α or lower the learning rate.
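The two conventions for α mentioned above produce different multipliers during a rank sweep. A small sketch, where `lora_scale` is a hypothetical helper and the starting rank of 4 is an assumption:

```python
def lora_scale(alpha, r):
    """Multiplier applied to B @ A in the forward pass."""
    return alpha / r


ranks = (4, 8, 16, 64)

# Convention 1: fix alpha at the first rank tried (here 4) and sweep r.
fixed_alpha = {r: lora_scale(4, r) for r in ranks}

# Convention 2: set alpha = r for every run, so the multiplier is always 1.
alpha_equals_r = {r: lora_scale(r, r) for r in ranks}

print(fixed_alpha)      # {4: 1.0, 8: 0.5, 16: 0.25, 64: 0.0625}
print(alpha_equals_r)   # {4: 1.0, 8: 1.0, 16: 1.0, 64: 1.0}
```

Whichever convention you pick, record it with the run: two experiments at the same r are not comparable if their α/r differs.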

Target modules: Which weight matrices to apply LoRA to. Defaults: q_proj, v_proj in most Hugging Face PEFT configurations. This maps to the paper's recommendation.
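As a configuration sketch, the three settings above map onto Hugging Face PEFT's `LoraConfig` roughly as follows. This assumes the `peft` package is installed and that the model names its projections `q_proj` and `v_proj`, as Llama-style models do; other architectures use different module names. The wrapper function `add_lora_adapters` and the specific values are illustrative, not a recommendation.

```python
from peft import LoraConfig, get_peft_model


def add_lora_adapters(base_model):
    """Wrap an already-loaded causal LM with LoRA adapters on the query and value projections."""
    config = LoraConfig(
        r=8,                                   # rank of the update
        lora_alpha=8,                          # alpha; alpha / r = 1 here
        target_modules=["q_proj", "v_proj"],   # module names vary by architecture
        lora_dropout=0.0,
        bias="none",                           # leave bias terms frozen
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base_model, config)
    model.print_trainable_parameters()         # prints trainable vs. total parameter counts
    return model
```

Adding `k_proj` and `o_proj` to `target_modules` is the one-line change for the wider attention-only setup.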

Production serving: merged vs. unmerged adapters

At inference, you have two options:

Merged: Compute W = W₀ + (α/r) BA once, store the merged weights. The forward pass is identical to an unmodified model — no runtime overhead. This is the right choice if you're serving a single fine-tuned model permanently.

Unmerged (live adapters): Keep W₀ and the LoRA matrices B and A separate. At runtime, compute W₀x + (α/r) BAx. There's a small overhead for the two extra low-rank matrix multiplies, but you can swap adapters without reloading the base model.
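A quick numerical check that the two serving modes compute the same thing. This is a toy NumPy sketch: the sizes are arbitrary and `B` and `A` are random stand-ins for trained adapter matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 32, 4, 8   # toy sizes; real layers are far larger

W0 = rng.normal(size=(d, k))    # frozen base weight
B = rng.normal(size=(d, r))     # stand-ins for trained adapter matrices
A = rng.normal(size=(r, k))
x = rng.normal(size=k)

# Unmerged: base path plus a separate low-rank path on every forward pass.
h_unmerged = W0 @ x + (alpha / r) * (B @ (A @ x))

# Merged: fold the update into the weight once, then serve a plain matmul.
W_merged = W0 + (alpha / r) * (B @ A)
h_merged = W_merged @ x

print(np.allclose(h_unmerged, h_merged))  # True, up to floating-point rounding

# Unmerging subtracts the same term, recovering the base weight for adapter swaps.
W_restored = W_merged - (alpha / r) * (B @ A)
print(np.allclose(W_restored, W0))
```

The merge is a one-time cost of one d×k addition per adapted matrix. In low-precision formats, repeated merge and unmerge cycles can accumulate rounding error, so keeping a clean copy of W₀ is the safer design.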

Unmerged adapters unlock a serving architecture that's operationally significant: one copy of the base model weights in GPU memory, N LoRA adapter sets loaded alongside it. You route different users or tasks to different adapters without the memory cost of N full model copies. This is how multi-tenant fine-tuned model serving works at scale.

The constraint: adapter hot-swapping mid-request doesn't work — you commit to an adapter at request start. Cross-adapter batching requires careful scheduling (requests using the same adapter can batch; mixed-adapter batches require separate forward passes or purpose-built kernels).
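The simplest scheduling policy that respects this constraint is to group pending requests by adapter before batching. A minimal sketch follows; the request tuple layout and the adapter names are hypothetical, and a production scheduler would also weigh queue age and batch size.

```python
from collections import defaultdict


def group_by_adapter(requests):
    """Group pending requests so each sub-batch uses exactly one adapter.

    `requests` is a list of (request_id, adapter_id) pairs; both fields are
    hypothetical names for this sketch.
    """
    batches = defaultdict(list)
    for request_id, adapter_id in requests:
        batches[adapter_id].append(request_id)
    return dict(batches)


pending = [
    ("req-1", "support-json"),
    ("req-2", "legal-summaries"),
    ("req-3", "support-json"),
]

for adapter_id, request_ids in group_by_adapter(pending).items():
    # A real server would run one forward pass per group against the shared base weights.
    print(adapter_id, request_ids)
```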

Training memory reduction

This is where LoRA's practical value is clearest. Full fine-tuning of a 70B model requires:

70B × 2 bytes = 140GB for weights (BF16)
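70B × 8 bytes = 560GB for Adam optimizer states (two FP32 moments per parameter)
70B × 2 bytes = 140GB for gradients

That's roughly 840GB before activations, and more if the optimizer also keeps an FP32 master copy of the weights. A standard 8×80GB node (640GB) does not hold it.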

LoRA with r=8 applied to attention Q and V matrices:

140GB for frozen base weights (no gradients needed for W₀)
~50M trainable parameters for adapters → ~400MB for optimizer states
Gradients only for adapter parameters → ~200MB

The base model still occupies GPU memory, but the training overhead is nearly eliminated. When the frozen weights fit on a single 80GB A100, you can fine-tune on that one GPU a model that full fine-tuning would need a cluster for.
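The LoRA figures above can be reproduced with a back-of-the-envelope calculation. The byte counts are assumptions that match those figures (BF16 frozen weights, two FP32 Adam moments per trainable parameter, FP32 adapter gradients), and activations are ignored. The function name is a hypothetical helper.

```python
GB = 1e9


def lora_training_memory_gb(total_params, adapter_params,
                            weight_bytes=2, adam_bytes=8, grad_bytes=4):
    """Rough memory split for LoRA training, ignoring activations.

    Assumed byte counts: BF16 frozen weights (2), two FP32 Adam moments per
    trainable parameter (8), FP32 adapter gradients (4).
    """
    return {
        "frozen_weights": total_params * weight_bytes / GB,
        "optimizer_states": adapter_params * adam_bytes / GB,
        "gradients": adapter_params * grad_bytes / GB,
    }


estimate = lora_training_memory_gb(total_params=70e9, adapter_params=50e6)
for name, gb in estimate.items():
    print(f"{name:>16}: {gb:8.2f} GB")
print(f"{'total':>16}: {sum(estimate.values()):8.2f} GB")
```

The frozen weights dominate: the adapter's optimizer states and gradients together come to well under 1GB in this estimate.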

The catch: the base weights still need to fit in memory for the forward pass. LoRA doesn't help if the model doesn't fit at all, and the 70B example above (140GB in BF16) does not fit on one 80GB card. That's where quantized variants like QLoRA (Dettmers et al., 2023) come in: they load the base model in 4-bit and train the adapters in 16-bit precision.

Production tradeoffs

LoRA vs. full fine-tuning on quality: The paper shows parity on standard benchmarks. Production experience is more nuanced. For behavioral adaptation — output format, tone, domain terminology — LoRA matches full fine-tuning. For deep domain knowledge injection — teaching a model something genuinely outside its pretraining distribution — full fine-tuning consistently wins.

Catastrophic forgetting: LoRA tends to reduce forgetting compared to full fine-tuning. The base weights are frozen and the update is constrained to low rank, which limits how far the adapted model can move from the base, and removing the adapter recovers the base model exactly. The model retains its general capabilities while adapting on the target distribution. This is usually what you want in production: a specialist that's still a generalist. Full fine-tuning on a narrow dataset can degrade performance on everything else.

Checkpoint size: A LoRA checkpoint is the adapter parameters only — typically 10–100MB rather than tens of gigabytes. For environments where you're iterating quickly or deploying many variants, this matters.

Hyperparameter sensitivity: LoRA introduces r and α on top of standard fine-tuning hyperparameters (learning rate, batch size, epochs). The paper's practice of setting α to the first r and leaving it is pragmatically useful for initial experiments. The rank sweep (try r=4, 8, 16, compare on a validation set) is necessary, and it is cheap relative to full fine-tuning because each run trains and checkpoints only the adapter parameters.

Gradient checkpointing compatibility: LoRA works with gradient checkpointing (recomputing activations during backward pass to save memory). Frozen base weights + LoRA adapters + gradient checkpointing is the practical recipe for fine-tuning large models on limited hardware.

When not to use LoRA

When you need to inject genuinely new knowledge. LoRA adapts how the model uses what it already knows. If you're fine-tuning on a domain where the base model has essentially no pretrained signal — proprietary internal terminology, a new programming language, specialized mathematical notation — the low-rank parameterization may not have the capacity to represent what's needed. Full fine-tuning or RAG is the right answer.

When your task requires very few examples. If you have fewer than a few hundred examples, few-shot prompting often outperforms fine-tuning. The fine-tuning signal isn't sufficient to make gradient-based adaptation reliable. Test prompting first.

When adapter overhead matters at inference. Unmerged adapter inference adds extra matrix operations to every adapted layer. If your application has a tight p99 latency budget, those operations count against it. Merge the adapters or benchmark the overhead explicitly before going to production.

When you need tight guarantees on output format. LoRA is a soft constraint — it nudges the model's behavior, but the base model's priors still influence outputs significantly. If you need structured extraction that must parse reliably or code generation that must compile, constrained decoding or output validation is a more reliable mechanism than hoping fine-tuning covers all edge cases.
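Output validation can be as small as a parse-and-check step between the model and the caller. A minimal sketch using only the standard library; the function name and the two-field schema are hypothetical.

```python
import json


def parse_structured_output(raw_text, required_keys):
    """Return the parsed object if raw_text is a JSON object with all required keys, else None."""
    try:
        parsed = json.loads(raw_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None
    if not all(key in parsed for key in required_keys):
        return None
    return parsed


required = ("ticket_id", "severity")   # hypothetical schema for illustration

good = '{"ticket_id": "T-1", "severity": "high"}'
bad = "Sure! The ticket T-1 looks like a high severity issue."

print(parse_structured_output(good, required))  # parsed dict
print(parse_structured_output(bad, required))   # None, so the caller can retry or fall back
```

A check like this catches the failures fine-tuning leaves behind. Constrained decoding goes further by preventing them at generation time.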

When you're fine-tuning an already-aligned model for sensitive use cases. Fine-tuning a model that's been through RLHF can degrade its safety properties. The base model's refusal behaviors and harmlessness training are part of its weight distribution; LoRA over an aligned model can suppress those behaviors if your fine-tuning data contains outputs the aligned model would normally decline. This is a real concern in production deployments that serve fine-tuned models at scale.

What the paper gets right that implementations miss

The rank sensitivity result deserves more attention than it gets. Most LoRA tutorials say "try r=8 or r=16" without noting that the paper shows r=4 is nearly identical to r=64 on most tasks. The implication: if you're seeing meaningful quality degradation at r=4, it's more likely a learning rate or data quality issue than a rank capacity issue. Chase the data and the learning rate before increasing r.

The α/r scaling convention is there for ablation cleanliness. Once you've settled on a rank, α can be tuned like a learning rate multiplier. The practical guidance — set α=r, don't touch it — is correct for initial experiments; it's not a permanent constraint.

The choice of which weight matrices to adapt is not a default to copy-paste. The paper ran the ablation. For GPT-3, under a fixed parameter budget, adapting W_q and W_v together outperformed spending the same budget on a single matrix type at a higher rank. Your model architecture may behave differently. Run a small ablation if you care about maximizing quality per trainable parameter.

The broader point: LoRA's simplicity is its strength. A low-rank reparameterization of the weight update, zero-initialized so training starts from the pretrained model, scaled to control adaptation magnitude. The implementations that add complexity on top of this (multiple adapter ranks per layer, dynamic rank allocation, adapter merging strategies) are solving real problems, but they're solving problems that the base LoRA formulation mostly doesn't have for standard fine-tuning tasks. Start simple.

References:

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022.
Aghajanyan, A., Zettlemoyer, L., & Gupta, S. (2021). Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. ACL 2021.
Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023.