Fine-Tuning LLMs
When RAG isn't enough and you actually need to teach the model something new. How to decide, how to do it, and what it'll cost you.
The question you should ask first
Do you actually need to fine-tune?
- Need to inject facts the model doesn’t know? → RAG first. Cheaper, faster, easier to update.
- Need a specific output format or style the model keeps violating? → Better prompts and few-shot examples first.
- Need lower latency or cost at high volume? → Fine-tuning might help (smaller, specialized model).
- Need the model to learn a completely new task or domain language? → Fine-tuning is probably the right tool.
- Compliance requires an on-prem model you control end-to-end? → Fine-tuning is the right tool.
Most teams that ask “should we fine-tune?” should not fine-tune. Start with RAG and prompting.
The fine-tuning spectrum
From cheapest to most expensive:
- Prompt tuning / soft prompts — learn a few continuous prompt vectors. Rare in practice now.
- LoRA / QLoRA (adapter-based) — freeze the base model, train small low-rank matrices. 99% of modern fine-tuning.
- Full fine-tuning — update all weights. Expensive, rarely worth it.
- Continued pre-training — feed the model a huge domain corpus before any instruction tuning. Only useful when the base model has weak coverage of your domain.
LoRA in one paragraph
Instead of updating the weight matrix W directly, you add two small matrices A and B such that the new weight is W + BA, where A and B are much smaller than W. You train A and B while W stays frozen. Result: you update a tiny fraction of the total parameters, saving memory and compute, and you can swap different LoRA adapters on top of the same base model. For most supervised fine-tuning, LoRA gets you 95% of the full fine-tuning quality at 5% of the cost.
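The W + BA idea above can be sketched in a few lines of NumPy. This is illustrative only, not a training loop; the dimensions and rank are made up, and in practice the adapters live inside a library like peft:

```python
import numpy as np

# Sketch of a LoRA layer: W is the frozen base weight, A and B are the
# small trainable low-rank factors. Rank r is much smaller than d_out/d_in.
d_out, d_in, r = 512, 512, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))        # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01     # trainable, small random init
B = np.zeros((d_out, r))                      # trainable, zero init so BA = 0 at start

x = rng.standard_normal(d_in)
y = (W + B @ A) @ x                           # effective weight is W + BA

# Parameter savings: train d_out*r + r*d_in values instead of d_out*d_in.
full = d_out * d_in                           # 262,144
lora = d_out * r + r * d_in                   # 8,192 (~3% of full)
print(full, lora)
```

The zero init for B is the standard trick: at step zero the adapter contributes nothing, so training starts exactly from the base model's behavior.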
QLoRA adds 4-bit quantization of the base model on top, so you can fine-tune a 70B model on a single A100.
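The memory win from 4-bit quantization is just arithmetic. A back-of-the-envelope sketch (base weights only; it ignores the LoRA adapters, optimizer state, and activations, which add overhead on top):

```python
# Rough memory footprint of the frozen base weights for a 70B-parameter model.
params = 70e9

fp16_gb = params * 2 / 1e9    # 2 bytes/param -> 140 GB, doesn't fit on one GPU
int4_gb = params * 0.5 / 1e9  # 0.5 bytes/param -> 35 GB, fits on an 80 GB A100

print(fp16_gb, int4_gb)
```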
The data matters more than the method
The quality of your training examples determines 80% of the outcome. You can have perfect hyperparameters and still end up with a model worse than the base if your data is noisy.
Rules of thumb:
- For instruction tuning: 500-5,000 high-quality (prompt, response) pairs often beat 50,000 scraped ones.
- For classification: a few hundred per class is a reasonable starting point.
- Always hold out an eval set before you start iterating.
- Write the eval prompts first. If you can’t articulate what “good” looks like, fine-tuning won’t fix it.
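The "hold out an eval set before you start iterating" rule can be made concrete with a minimal sketch. The function name and the 10% split are illustrative choices, not a standard API; the point is the fixed seed, so every run is scored against the same held-out examples:

```python
import random

def split_holdout(examples, eval_fraction=0.1, seed=42):
    """Carve out a fixed eval set before any training run."""
    rng = random.Random(seed)              # fixed seed -> identical split every time
    shuffled = examples[:]                 # don't mutate the caller's list
    rng.shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]   # (train, eval)

pairs = [{"prompt": f"q{i}", "response": f"a{i}"} for i in range(1000)]
train, evalset = split_holdout(pairs)
print(len(train), len(evalset))            # 900 100
```

Write the split to disk once and never regenerate it; a fresh shuffle per experiment quietly leaks eval examples into training.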
The tooling
- HuggingFace transformers + peft + trl — the standard open-weight fine-tuning stack.
- Axolotl — a config-driven wrapper that’s easier for small teams.
- OpenAI fine-tuning API — easiest possible path for GPT-3.5/4 fine-tuning if you can’t self-host.
- Together, Fireworks, Modal — cloud fine-tuning for open models without managing GPUs.
The mistakes that cost you a week
- Overfitting on too few examples. Train for too many epochs on 200 rows and the model forgets everything else.
- Catastrophic forgetting. Fine-tune on just your domain and the model loses general capability. Mix in some general instructions.
- Evaluating on the training distribution. You’ll think it’s amazing; users will find it’s brittle.
- Forgetting to set a seed. You can’t reproduce your best run.
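A minimal seeding sketch for the last point, covering only the stdlib and NumPy; in a real PyTorch run you would also call `torch.manual_seed(seed)` (and pin dataloader workers), which is omitted here:

```python
import random
import numpy as np

def seed_everything(seed=1234):
    """Pin every RNG you touch so the run can be reproduced."""
    random.seed(seed)
    np.random.seed(seed)
    return seed

# Same seed, same draws: the run is reproducible.
seed_everything()
a = np.random.rand(3)
seed_everything()
b = np.random.rand(3)
print(np.allclose(a, b))   # True
```

Log the seed alongside your metrics; a reproducible best run is useless if you don't know which seed produced it.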
Cost reality check
- LoRA on a 7B model with 1,000 examples: $5-20 on spot GPUs.
- Full fine-tuning of a 7B model: $100-500 per run.
- Fine-tuning a 70B model: budget-dependent, start small.
- Iteration count to get something decent: usually 5-20 runs. Budget accordingly.
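Putting the last two bullets together: the real budget is cost per run times iteration count, not the cost of one run. A quick worked example using the LoRA-on-7B figures above (illustrative assumptions, not price quotes):

```python
# Budget range = per-run cost range x expected number of iterations.
cost_per_run = (5, 20)   # USD, LoRA on a 7B model, spot GPUs
runs = (5, 20)           # typical iterations to get something decent

low = cost_per_run[0] * runs[0]     # best case:  $25
high = cost_per_run[1] * runs[1]    # worst case: $400
print(low, high)
```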