MervCodes

Tech Reviews From A Programmer

Fine-Tuning LLMs: Practical Developer Guide

Fine-tuning has a reputation for being either a magic bullet or a money pit, and both reputations are earned depending on how you go about it. This guide cuts through the hype and walks through what fine-tuning actually buys you, when to reach for it, and how to execute a project end to end without burning weeks of GPU time on a model that ends up worse than a good prompt.

What Fine-Tuning Actually Does

Fine-tuning continues training a pre-trained model on your own examples so it internalizes a behavior, format, or domain style. It does not reliably teach the model new facts, and it is not a substitute for giving the model context at inference time.

Keep this mental model:

  • Prompting changes what you ask in the moment.
  • Retrieval (RAG) changes what the model knows right now by injecting documents.
  • Fine-tuning changes how the model behaves by default.

If your problem is "the model doesn't know our internal docs," that is a retrieval problem. If your problem is "the model knows the answer but won't format it as clean JSON, in our brand voice, every single time," that is a fine-tuning problem.

When to Fine-Tune (and When Not To)

Before you spend anything, exhaust the cheaper options. The order of operations that saves the most money:

  1. Better prompting — clear instructions, few-shot examples, structured output schemas.
  2. Retrieval augmentation — pull in the right context dynamically.
  3. Prompt + RAG combined — most production systems live here happily.
  4. Fine-tuning — only when the above plateau.

Good candidates for fine-tuning:

  • Consistent structured output where prompt-based JSON still drifts.
  • Tone and style that's hard to describe but easy to demonstrate.
  • Latency and cost reduction — a fine-tuned smaller model can replace a large model with a long prompt, cutting per-call cost dramatically.
  • Narrow, repetitive classification or extraction tasks at high volume.

Poor candidates:

  • Injecting fresh or frequently changing facts (use RAG).
  • Tasks where you have fewer than a few hundred quality examples.
  • Anything you haven't first tried to solve with a strong prompt.
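The latency and cost point in the list of good candidates comes down to simple arithmetic. Here is a rough sketch of it. Every number in it is a made-up placeholder (token counts and per-million-token prices alike), and `cost_per_call` is a hypothetical helper, so plug in your own provider's figures before drawing conclusions.

```python
def cost_per_call(prompt_tokens: int, output_tokens: int,
                  price_in_per_million: float, price_out_per_million: float) -> float:
    """Dollar cost of one request given token counts and per-million-token prices."""
    return (prompt_tokens * price_in_per_million
            + output_tokens * price_out_per_million) / 1_000_000


# All figures below are illustrative placeholders, not real provider prices.
# Scenario 1: large model, long few-shot prompt.
large_with_long_prompt = cost_per_call(
    3_000, 200, price_in_per_million=5.0, price_out_per_million=15.0
)
# Scenario 2: smaller fine-tuned model, short prompt because the behavior is baked in.
small_fine_tuned = cost_per_call(
    300, 200, price_in_per_million=0.5, price_out_per_million=1.5
)

if __name__ == "__main__":
    print(f"large + long prompt: ${large_with_long_prompt:.5f} per call")
    print(f"small fine-tuned:    ${small_fine_tuned:.5f} per call")
```

The saving comes from two places at once: a cheaper model and a shorter prompt. If only one of those applies to your workload, rerun the numbers before assuming fine-tuning pays for itself.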

Choosing Full Fine-Tuning vs. LoRA

You almost never want full fine-tuning. Parameter-efficient fine-tuning (PEFT), especially LoRA and QLoRA, updates a small set of adapter weights while freezing the base model. The trade-offs:

  • Full fine-tuning: updates all weights, needs the most VRAM, risks catastrophic forgetting, hard to host many variants. Reserve for when you truly need deep behavioral change and have the hardware.
  • LoRA: trains small low-rank matrices while the base stays frozen, is roughly 10–100x cheaper to train than full fine-tuning, and lets you swap adapters at serve time. This is the default for most teams.
  • QLoRA: LoRA on top of a 4-bit quantized base. Lets you fine-tune surprisingly large models on a single consumer or mid-tier GPU.

Start with QLoRA. Move up only if results demand it.
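To see why the adapter approach is so much lighter, it helps to count trainable weights. The sketch below assumes a single weight matrix of hypothetical size 4096 x 4096 and uses illustrative function names that are not from any library. A LoRA adapter of rank r on a d_out x d_in matrix adds two small matrices, for r * (d_out + d_in) trainable weights in total.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable weights when the whole matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable weights for a LoRA adapter: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_out + d_in)


if __name__ == "__main__":
    d = 4096  # hypothetical hidden size, chosen only for illustration
    for r in (8, 16):
        full = full_params(d, d)
        lora = lora_params(d, d, r)
        print(f"rank={r}: {lora:,} adapter weights vs {full:,} full "
              f"({full // lora}x fewer)")
```

At rank 8 that is 65,536 adapter weights against 16,777,216 for the full matrix, a 256x reduction for this one layer. Treat that as a ratio of trainable weights, not of total cost: the frozen base model still has to run in the forward and backward pass, which is why real-world savings are smaller than the raw parameter ratio.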

Preparing Your Dataset

This is where the majority of your success is decided. Models learn what you show them, including your mistakes.

Quality beats quantity. A clean set of 500–2,000 examples usually beats 50,000 noisy ones. Each example should reflect exactly the input/output behavior you want in production.

Use a consistent format. For instruction tuning, the chat/messages format is standard:

{
  "messages": [
    {"role": "system", "content": "You extract invoice fields as JSON."},
    {"role": "user", "content": "Invoice #1042, total [amount], due [due_date]"},
    {"role": "assistant", "content": "{\"invoice_id\": \"1042\", \"total\": \"[amount]\", \"due_date\": \"[due_date]\"}"}
  ]
}
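Format bugs are cheap to catch before training. Here is a minimal checker for records in the shape shown above. It assumes one JSON record per line (JSONL) and that every example should end with the assistant's target message; `validate_record` and the rules it enforces are illustrative, not a requirement of any particular training library.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(line: str) -> list[str]:
    """Return a list of problems found in one JSONL training record."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]

    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        role = msg.get("role")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            problems.append(f"message {i} has unknown role {role!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty content")

    # Assumption: the final message is the output the model should learn.
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message should be the assistant target")
    return problems
```

Run it over every line of the training file and refuse to start a job while any record returns a non-empty list.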

Dataset checklist:

  • Deduplicate aggressively — repeated examples skew the model.
  • Balance your classes or output types so the model doesn't over-predict the majority case.
  • Hold out a meaningful portion of examples as a validation/test set the model never trains on.
  • Include edge cases you actually see in production, not just the happy path.
  • Match inference exactly — the system prompt, formatting, and structure used in training must mirror what you'll send at serve time.
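Two items on the checklist, deduplication and the hold-out split, are easy to automate. The sketch below is one way to do it under stated assumptions: records use the messages format, "duplicate" means identical after lowercasing and collapsing whitespace, and the hold-out fraction defaults to 10%. The guide does not prescribe a number, so that default and the helper names are placeholders.

```python
import hashlib
import random


def fingerprint(record: dict) -> str:
    """Hash a record after normalizing case and whitespace in every message."""
    parts = [
        f"{m['role']}:{' '.join(m['content'].lower().split())}"
        for m in record["messages"]
    ]
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


def dedupe(records: list[dict]) -> list[dict]:
    """Keep the first occurrence of each normalized record."""
    seen, unique = set(), []
    for rec in records:
        key = fingerprint(rec)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def split_holdout(records: list[dict], holdout_fraction: float = 0.1,
                  seed: int = 13) -> tuple[list[dict], list[dict]]:
    """Shuffle deterministically and return (train, holdout)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]
```

Deduplicate before splitting. If you split first, near-identical examples can land on both sides and the hold-out score will flatter the model.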

Setting Hyperparameters Without Guessing

You don't need to tune dozens of knobs. Focus on a few:

  • Learning rate: the most important. For LoRA, start around 1e-4 to 2e-4. Too high and the model degrades; too low and it barely learns.
  • Epochs: 1–3 is typical. More than 3 usually means memorization, not learning.
  • Batch size: as large as memory allows; use gradient accumulation to simulate bigger batches.
  • LoRA rank (r): 8–16 is a fine starting point. Higher rank = more capacity and more overfitting risk.
  • Warmup: a short warmup (a few percent of steps) stabilizes early training.

Run a small pilot — a few hundred examples, one epoch — before committing to a full run. It catches format bugs and bad learning rates in minutes instead of hours.
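Several of these knobs interact through plain arithmetic, so it can help to write the plan down before launching anything. The sketch below is a hypothetical `TrainPlan` container, not the config class of any training framework. The defaults for learning rate, epochs, rank, and warmup follow the ranges in this guide; the micro-batch size and accumulation steps are placeholders, and the step math assumes a single device.

```python
import math
from dataclasses import dataclass


@dataclass
class TrainPlan:
    num_examples: int
    epochs: int = 2                # guide suggests 1-3
    learning_rate: float = 2e-4    # guide's LoRA starting range is 1e-4 to 2e-4
    micro_batch_size: int = 4      # placeholder: whatever fits in memory
    grad_accum_steps: int = 8      # placeholder: simulates a larger batch
    lora_rank: int = 16            # guide suggests 8-16
    warmup_ratio: float = 0.03     # "a few percent of steps"

    @property
    def effective_batch_size(self) -> int:
        """Examples consumed per optimizer step (single device assumed)."""
        return self.micro_batch_size * self.grad_accum_steps

    @property
    def total_steps(self) -> int:
        steps_per_epoch = math.ceil(self.num_examples / self.effective_batch_size)
        return steps_per_epoch * self.epochs

    @property
    def warmup_steps(self) -> int:
        return max(1, round(self.total_steps * self.warmup_ratio))


if __name__ == "__main__":
    pilot = TrainPlan(num_examples=300, epochs=1)
    full = TrainPlan(num_examples=2000)
    for name, plan in (("pilot", pilot), ("full", full)):
        print(name, plan.effective_batch_size, plan.total_steps, plan.warmup_steps)
```

With these placeholder values, a 300-example pilot is only 10 optimizer steps, while the 2,000-example run is 126 steps with 4 warmup steps. That gap is why the pilot is worth doing first.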

Evaluating the Result

"It looks better" is not an evaluation. Set up measurement before you train so you can compare honestly.

  • Build a fixed eval set of representative prompts with known-good answers.
  • Use task-appropriate metrics: exact-match or F1 for extraction, accuracy for classification, and an LLM-as-judge rubric for open-ended generation.
  • Always compare against baselines: the base model with your best prompt, and the base model with RAG. If fine-tuning doesn't beat both on quality, or match them at meaningfully lower cost or latency, don't ship it.
  • Watch for regressions: a model tuned for one task can get worse at general reasoning. Test capabilities you care about preserving.
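For an extraction task like the invoice example, the metrics in this list fit in a few lines. The sketch below assumes model outputs are JSON strings and that extracted records are flat dictionaries with hashable values such as strings. `exact_match`, `field_f1`, and `beats_baselines` are illustrative names, and scoring F1 over key/value pairs is one reasonable choice among several.

```python
import json


def exact_match(predicted: str, expected: str) -> bool:
    """True when both strings parse to the same JSON value (key order ignored)."""
    try:
        return json.loads(predicted) == json.loads(expected)
    except json.JSONDecodeError:
        return False


def field_f1(predicted: dict, expected: dict) -> float:
    """F1 over (key, value) pairs for one flat extraction result."""
    pred_items = set(predicted.items())
    gold_items = set(expected.items())
    correct = len(pred_items & gold_items)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_items)
    recall = correct / len(gold_items)
    return 2 * precision * recall / (precision + recall)


def beats_baselines(candidate: float, baselines: dict[str, float],
                    margin: float = 0.0) -> bool:
    """True only if the candidate score exceeds every baseline by the margin."""
    return all(candidate > score + margin for score in baselines.values())
```

Average `field_f1` over the fixed eval set for the fine-tuned model and for each baseline, then pass the averages to `beats_baselines`. Setting `margin` above zero guards against shipping on a difference that is just noise.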

Deploying and Maintaining

Once you have a winner:

  • Version everything — dataset, base model, hyperparameters, and the resulting adapter. Reproducibility matters when something regresses.
  • Serve LoRA adapters alongside the base model so you can host multiple tuned behaviors cheaply.
  • Monitor in production — log inputs and outputs (with privacy controls), and sample them for quality drift.
  • Plan to retune when the base model is upgraded or your data distribution shifts. Fine-tuned models are perishable, not permanent.
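"Version everything" is easier to stick to when a single manifest file is written next to each adapter. Here is a minimal sketch using only the standard library. The manifest fields, file layout, and function names are assumptions made for illustration, and a dedicated experiment tracker would do the same job with more features.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path: Path) -> str:
    """Stream a file through SHA-256 so large files never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(dataset: Path, adapter: Path, base_model: str,
                   hyperparameters: dict) -> dict:
    """Collect what is needed to reproduce or audit one fine-tuning run."""
    return {
        "base_model": base_model,  # whatever identifier you use for the base
        "dataset_sha256": sha256_of_file(dataset),
        "adapter_sha256": sha256_of_file(adapter),
        "hyperparameters": hyperparameters,
    }


def write_manifest(manifest: dict, out_path: Path) -> None:
    """Write the manifest as stable, diff-friendly JSON."""
    text = json.dumps(manifest, indent=2, sort_keys=True) + "\n"
    out_path.write_text(text, encoding="utf-8")
```

When a regression shows up in production, comparing two manifests tells you quickly whether the data, the base model, or the hyperparameters changed between runs.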

A Realistic Workflow

A pragmatic end-to-end path looks like this:

  1. Define the behavior and write an eval set first.
  2. Try hard with prompting and RAG. Record the baseline scores.
  3. Collect 500–2,000 clean examples that mirror production.
  4. Run a QLoRA pilot, fix data/format issues, then a full run.
  5. Evaluate against baselines; iterate on data, not just hyperparameters.
  6. Ship the adapter, monitor, and schedule a retune cadence.

Notice how much of this is data and measurement, not training. That ratio is correct.

FAQ

How many examples do I need? For style and format adaptation, a few hundred high-quality examples can work. For more complex tasks, aim for 1,000–10,000. Past that, returns diminish quickly unless the task is genuinely diverse — quality and coverage matter more than raw count.

Can fine-tuning teach the model new facts? Unreliably. It may memorize some, but recall is inconsistent and it can still hallucinate confidently. For factual knowledge that must be accurate or stay current, use retrieval instead.

How much does it cost? With QLoRA, you can fine-tune mid-sized open models on a single rented GPU for a few dollars to low double digits per run. Hosted fine-tuning APIs charge by tokens trained. The bigger hidden cost is your time building and cleaning the dataset.

Will fine-tuning make the model dumber at other things? It can — this is catastrophic forgetting. Mitigate it by using LoRA (which freezes the base), training for fewer epochs, mixing in some general examples, and testing for regressions on tasks you want to preserve.

Should I fine-tune an open model or use a provider's fine-tuning API? Use a provider's API when you want minimal infrastructure and are fine with their hosting. Fine-tune open models when you need full control, on-prem deployment, lower long-run cost at scale, or the ability to host many adapters yourself.

My fine-tuned model is worse than the base model. What happened? Usually one of: learning rate too high, too many epochs (overfitting), a training format that doesn't match inference, or noisy/inconsistent data. Re-check your dataset first — it's almost always the data.

Key Takeaways

Fine-tuning is a precision tool, not a default. Reach for it after prompting and retrieval plateau, invest most of your effort in clean data and honest evaluation, prefer QLoRA to start, and treat the tuned model as something you'll measure, monitor, and eventually retrain. Do that, and fine-tuning stops being a gamble and becomes a dependable part of your stack.
