Fine-Tuning LLMs

When RAG isn't enough and you actually need to teach the model something new. How to decide, how to do it, and what it'll cost you.

LLMs & Generative AI intermediate #llms #fine-tuning #lora #peft
Prereqs: Transformer basics, PyTorch basics

The question you should ask first

Do you actually need to fine-tune?

  • Need to inject facts the model doesn’t know? → RAG first. Cheaper, faster, easier to update.
  • Need a specific output format or style the model keeps violating? → Better prompts and few-shot examples first.
  • Need lower latency or cost at high volume? → Fine-tuning might help (smaller, specialized model).
  • Need the model to learn a completely new task or domain language? → Fine-tuning is probably the right tool.
  • Compliance requires an on-prem model you control end-to-end? → Fine-tuning is the right tool.

Most teams that ask “should we fine-tune?” should not fine-tune. Start with RAG and prompting.

The fine-tuning spectrum

From cheapest to most expensive:

  1. Prompt tuning / soft prompts — learn a few continuous prompt vectors. Rare in practice now.
  2. LoRA / QLoRA (adapter-based) — freeze the base model, train small low-rank matrices. 99% of modern fine-tuning.
  3. Full fine-tuning — update all weights. Expensive, rarely worth it.
  4. Continued pre-training — feed the model a huge domain corpus before any instruction tuning. Only useful when the base model has weak coverage of your domain.

LoRA in one paragraph

Instead of updating a weight matrix W directly, you add two small low-rank matrices A and B and use W + BA in the forward pass: for a d_out × d_in matrix W, A is r × d_in and B is d_out × r, with the rank r far smaller than either dimension. You train A and B while W stays frozen. Result: you update a tiny fraction of the total parameters, saving memory and compute, and you can swap different LoRA adapters on top of the same base model. For most supervised fine-tuning, LoRA gets you 95% of full fine-tuning quality at 5% of the cost.
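To make "a tiny fraction of the parameters" concrete, here is the back-of-the-envelope count in plain Python. The 4096×4096 layer size and rank r=8 are illustrative assumptions, not numbers from any particular model:

```python
# LoRA replaces the update to a frozen d_out x d_in weight W with B @ A,
# where A is r x d_in and B is d_out x r, and only A and B are trained.
d_out, d_in, r = 4096, 4096, 8  # illustrative: one attention projection, rank 8

full_params = d_out * d_in          # parameters updated by full fine-tuning
lora_params = r * d_in + d_out * r  # parameters updated by LoRA

print(full_params)                # 16777216
print(lora_params)                # 65536
print(lora_params / full_params)  # 0.00390625 -> under 0.4% of this layer
```

The same ratio holds layer by layer across the model, which is why adapter checkpoints are megabytes while base checkpoints are gigabytes.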

QLoRA adds 4-bit quantization of the base model on top, so you can fine-tune a 70B model on a single 80 GB A100.
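The rough memory arithmetic behind the single-A100 claim (assuming the 80 GB variant). Ballpark only: this ignores activation memory, the KV cache, and the small optimizer state for the adapters:

```python
# Why 4-bit quantization makes a 70B model fit on one 80 GB A100.
# Ballpark figures: weights only, ignoring activations and optimizer state.
params_b = 70e9

fp16_gb = params_b * 2 / 1e9    # 2 bytes/param  -> 140 GB: doesn't fit
int4_gb = params_b * 0.5 / 1e9  # 0.5 bytes/param -> 35 GB: fits with headroom

print(fp16_gb)  # 140.0
print(int4_gb)  # 35.0
```

The headroom between 35 GB and 80 GB is what the LoRA adapters, gradients, and activations live in during training.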

The data matters more than the method

The quality of your training examples determines 80% of the outcome. You can have perfect hyperparameters and still end up with a model worse than the base if your data is noisy.

Rules of thumb:

  • For instruction tuning: 500-5,000 high-quality (prompt, response) pairs often beat 50,000 scraped ones.
  • For classification: a few hundred per class is a reasonable starting point.
  • Always hold out an eval set before you start iterating.
  • Write the eval prompts first. If you can’t articulate what “good” looks like, fine-tuning won’t fix it.
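The "hold out an eval set before you start iterating" rule is mechanical but easy to skip. A minimal sketch using only the standard library; the 90/10 split, field names, and seed are assumptions for illustration:

```python
import random

def split_holdout(examples, eval_frac=0.1, seed=42):
    """Shuffle once with a fixed seed, then carve off an eval set.
    Do this before the first training run and never touch eval again."""
    rng = random.Random(seed)
    shuffled = examples[:]  # copy so the caller's list isn't mutated
    rng.shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_frac))
    return shuffled[n_eval:], shuffled[:n_eval]  # (train, eval)

# Hypothetical instruction-tuning data in (prompt, response) form.
data = [{"prompt": f"q{i}", "response": f"a{i}"} for i in range(1000)]
train, eval_set = split_holdout(data)
print(len(train), len(eval_set))  # 900 100
```

Because the seed is fixed, the split is reproducible across runs, so every fine-tuning iteration is scored against the same held-out examples.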

The tooling

  • HuggingFace transformers + peft + trl — the standard open-weight fine-tuning stack.
  • Axolotl — a config-driven wrapper that’s easier for small teams.
  • OpenAI fine-tuning API — easiest possible path for GPT-3.5/4 fine-tuning if you can’t self-host.
  • Together, Fireworks, Modal — cloud fine-tuning for open models without managing GPUs.
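With the transformers + peft + trl stack, the core of a LoRA fine-tune is short. This is a sketch, not a tested script: the model name, dataset file, and hyperparameters are placeholders, and the trl API shifts between versions, so check the current docs before running it:

```python
# Sketch of a LoRA fine-tune with the HF stack. Model name, data file,
# and hyperparameters are placeholders -- adjust for your setup.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="train.jsonl", split="train")

peft_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,  # common starting points
    target_modules=["q_proj", "v_proj"],    # attention projections
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",  # placeholder base model
    train_dataset=dataset,
    peft_config=peft_config,
    args=SFTConfig(
        output_dir="out",
        num_train_epochs=2,
        per_device_train_batch_size=4,
    ),
)
trainer.train()
trainer.save_model("out/adapter")  # saves only the small LoRA adapter
```

Note that only the adapter is saved; at inference time you load the same base model and attach the adapter on top.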

The mistakes that cost you a week

  1. Overfitting on too few examples. Train for too many epochs on 200 rows and the model forgets everything else.
  2. Catastrophic forgetting. Fine-tune on just your domain and the model loses general capability. Mix in some general instructions.
  3. Evaluating on the training distribution. You’ll think it’s amazing; users will find it’s brittle.
  4. Forgetting to set a seed. You can’t reproduce your best run.

Cost reality check

  • LoRA on a 7B model with 1,000 examples: $5-20 on spot GPUs.
  • Full fine-tuning of a 7B model: $100-500 per run.
  • Fine-tuning a 70B model: budget-dependent, start small.
  • Iteration count to get something decent: usually 5-20 runs. Budget accordingly.