LoRA
Low-Rank Adaptation
In one line
A parameter-efficient fine-tuning method that freezes the base model and trains small low-rank update matrices on top.
What it actually means
For each weight matrix W you want to adapt, LoRA adds a learned update ΔW = B @ A, where A is r × k and B is d × r, with r much smaller than d or k (typical r is 8–64). W stays frozen; only A and B are trained. At inference you can either keep A and B as a separate adapter or merge ΔW back into W at no extra cost. The trainable parameter count drops by 100–1000x, training fits on a single GPU, and adapters are tiny files you can swap in and out per task or per customer.
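The parameter savings are easy to check by hand. As a worked sketch with hypothetical dimensions (d = k = 4096, as in a mid-size LLM projection, and rank r = 16):

```python
d = k = 4096  # hypothetical weight-matrix dimensions
r = 16        # LoRA rank

full = d * k           # full fine-tuning trains every entry of W
lora = r * k + d * r   # LoRA trains only A (r x k) and B (d x r)

print(full, lora, full // lora)  # 16777216 131072 128
```

So at this size a rank-16 adapter trains roughly 128x fewer parameters; smaller ranks on larger matrices push the ratio toward the 1000x end.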
Why it matters
LoRA is the default way to fine-tune open-weight LLMs in 2026. It’s the only practical option for most teams that want a custom model without renting an H100 cluster, and it pairs beautifully with quantized base models (QLoRA). It also enables multi-tenant serving: one base model in memory, hundreds of small adapters loaded on demand.
Example
W' = W + (B @ A) * (alpha / r)
shape(W) = (d, k)
shape(A) = (r, k), shape(B) = (d, r), r << min(d, k)
You’ll hear it when
- Fine-tuning Llama, Mistral, Qwen, or any open model.
- Discussing QLoRA and 4-bit base models.
- Designing per-tenant or per-task adapter serving.
- Comparing LoRA to full fine-tuning or DPO.
- Reading about parameter-efficient fine-tuning (PEFT).