LoRA

Low-Rank Adaptation
LLMs

In one line

A parameter-efficient fine-tuning method that freezes the base model and trains small low-rank update matrices on top.

What it actually means

For each weight matrix W you want to adapt (shape d × k), LoRA adds a learned low-rank update ΔW = B @ A, where A is r × k and B is d × r, with r much smaller than d or k (typical r is 8–64). W stays frozen; only A and B are trained. B is usually initialized to zero so that ΔW starts at zero and training begins from the base model's exact behavior. At inference you can either keep A and B as a separate adapter or merge B @ A back into W, at no extra serving cost. The trainable parameter count drops by 100–1000x, training fits on a single GPU, and adapters are tiny files you can swap in and out per task or per customer.
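
The shapes and parameter savings above can be sketched in a few lines of numpy. The layer sizes, rank, and alpha below are illustrative assumptions, not values from this entry:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 512, 512, 8                   # hypothetical layer sizes; r is the LoRA rank
alpha = 16                              # common scaling hyperparameter

W = rng.standard_normal((d, k))         # frozen base weight
A = rng.standard_normal((r, k)) * 0.01  # trainable, small random init
B = np.zeros((d, r))                    # trainable, zero init so ΔW starts at 0

def lora_forward(x):
    # y = W x + (alpha / r) * B (A x): frozen base path plus low-rank update
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(k)
y = lora_forward(x)                     # equals W @ x at init, since B is zero

full_params = d * k                     # what full fine-tuning would train
lora_params = r * k + d * r             # what LoRA trains
print(full_params // lora_params)       # 32x fewer trainable parameters here
```

Note that the savings ratio scales with the layer size: at the d and k of a real LLM projection matrix, the same small r yields the 100–1000x reduction mentioned above.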

Why it matters

LoRA is the default way to fine-tune open-weight LLMs in 2026. For most teams that want a custom model without renting an H100 cluster, it is often the only practical option, and it pairs well with quantized base models (QLoRA). It also enables multi-tenant serving: one base model in memory, hundreds of small adapters loaded on demand.

Example

W' = W + (B @ A) * (alpha / r)
shape(W) = (d, k)
shape(A) = (r, k), shape(B) = (d, r), r << min(d, k)
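
A small numerical check of the merge identity above, showing that keeping the adapter separate and folding B @ A into W produce the same outputs. Sizes are arbitrary assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r, alpha = 64, 64, 4, 8           # small hypothetical sizes
W = rng.standard_normal((d, k))
A = rng.standard_normal((r, k))
B = rng.standard_normal((d, r))

x = rng.standard_normal(k)

# Adapter path: keep A and B separate at inference (swappable per task)
y_adapter = W @ x + (alpha / r) * (B @ (A @ x))

# Merged path: fold the scaled update into W once, then serve a plain linear layer
W_merged = W + (alpha / r) * (B @ A)
y_merged = W_merged @ x

print(np.allclose(y_adapter, y_merged))  # True
```

This is why merging is "free": after the one-time fold, inference is identical in cost and code path to the unadapted base layer.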

You’ll hear it when

  • Fine-tuning Llama, Mistral, Qwen, or any open model.
  • Discussing QLoRA and 4-bit base models.
  • Designing per-tenant or per-task adapter serving.
  • Comparing LoRA to full fine-tuning or DPO.
  • Reading about parameter-efficient fine-tuning (PEFT).

Related terms

  • QLoRA — LoRA trained on top of a 4-bit quantized base model.
  • PEFT (parameter-efficient fine-tuning) — the broader family LoRA belongs to.
  • Full fine-tuning, DPO — approaches LoRA is commonly compared against.