Fine-tuning · 2025-12-12

LLM Fine-Tuning Best Practices & Techniques (LoRA, QLoRA, SFT, DPO)

A practical, end-to-end guide to fine-tuning LLMs: choosing LoRA vs QLoRA vs full tuning, data formatting, evals, costs, and deployment pitfalls.

Fine-tuning isn’t magic, and it’s not a rite of passage. It’s an engineering trade: you’re paying compute + data curation + evaluation to buy more reliable model behavior. This guide is written for builders who want a clean, repeatable path to shipping a tuned model—without turning their team into “prompt whisperers.”

Quick answer (for Gemini-style summaries)

  • Use prompting when a few examples solve it and you can tolerate some variance.
  • Use RAG when the “truth” changes and you need citations or per-customer knowledge.
  • Fine-tune when you need stable behavior: format, tool use, tone, domain patterns, and lower prompt complexity.
  • Default to LoRA/QLoRA (PEFT). Try full fine-tuning only after you’ve proven PEFT can’t hit your target.

1) What fine-tuning actually changes (and what it doesn’t)

Fine-tuning teaches the model a mapping: input → preferred output. If your dataset consistently encodes a behavior (structure, style, decision logic), the model will learn it. But fine-tuning is not a great way to “store” a changing knowledge base, and it won’t automatically make the model truthful.

  • Great for: consistent JSON, tool calling patterns, tone/voice, domain-specific workflows, reducing prompt bloat.
  • Bad for: frequently changing facts, per-tenant knowledge, explainability/citations, “just learn our docs.”

2) Decision framework: Prompting vs RAG vs fine-tuning

If you remember one thing, remember this: RAG is for knowledge, fine-tuning is for behavior.

Start here: do you mainly need the model to access changing facts, cite sources, or serve many tenants? If yes, use RAG: the truth lives outside the model. If a few-shot prompt with good instructions solves it, use prompting. If the behavior itself needs to live in the weights, fine-tune.
A practical default: start with prompting or RAG, then fine-tune once you can prove what “better” means and you have labeled examples.

3) Techniques: full fine-tuning vs LoRA vs QLoRA (what to pick)

Most teams should start with PEFT—Parameter-Efficient Fine-Tuning—because it’s cheaper and easier to iterate. LoRA and QLoRA are the two workhorses.

Fast selection guide

LoRA: best default when you have enough VRAM.
  • Frozen base + small adapters
  • Strong quality/cost tradeoff
  • Easy to manage variants

QLoRA: when GPU memory is the bottleneck.
  • Base model kept in 4-bit
  • Enables larger models on smaller GPUs
  • Slightly more finicky setup

Full fine-tuning: when PEFT can’t reach target quality.
  • Highest cost/complexity
  • Higher risk of catastrophic forgetting
  • Harder rollback and variant management
If you’re unsure, do LoRA first. If you run out of VRAM, switch to QLoRA. Only consider full fine-tuning after you’ve tried PEFT and can measure the gap.
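
Example: loading a base model in 4-bit for QLoRA

To make the QLoRA option concrete: the base model is loaded in 4-bit via bitsandbytes, and the same small LoRA adapters train on top of the frozen quantized weights. This is a minimal sketch using Hugging Face transformers; the model name is a placeholder, and defaults can differ between library versions.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit quantization of the frozen base model (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 quantization, as in the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 even though weights are stored in 4-bit
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",              # placeholder: use whatever base you're tuning
    quantization_config=bnb_config,
    device_map="auto",
)
# LoRA adapters (see the LoraConfig example in section 5) attach on top of this quantized base.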

4) Data: what “good training data” really means

The #1 reason fine-tunes fail isn’t hyperparameters. It’s the dataset. Your model learns exactly what you show it—ambiguity included.

Minimum viable dataset checklist

  • Unambiguous targets: If two annotators disagree, your model will wobble.
  • Representative distribution: Your training set should look like real traffic, including edge cases.
  • Hard negatives: Include “don’t do it” examples (e.g., refuse unsafe tool calls, reject invalid inputs).
  • Held-out eval set: Lock an evaluation set early and don’t “fix” it mid-training.
  • Leakage control: Deduplicate near-duplicates across train/eval (copy-paste kills honest metrics).

If you want a deeper dive into how to design labeling guidelines, source examples from production, and keep datasets high-quality over time, read Data Labeling & Dataset Quality: The Foundation of Reliable LLM Fine-Tuning.
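
Example: a first-pass dedup check for leakage control

The leakage-control item above is the easiest one to automate. The sketch below hashes normalized text to drop exact and lightly edited duplicates across train/eval; it assumes rows in the messages format shown in the next example, and it is a first pass rather than a full near-duplicate detector (for fuzzier matching, look at MinHash or embedding similarity).

import hashlib
import re

def normalize(text: str) -> str:
    # Lowercase and collapse punctuation/whitespace so lightly edited copies hash the same
    return re.sub(r"\W+", " ", text.lower()).strip()

def dedupe(rows):
    seen, unique = set(), []
    for row in rows:
        text = " ".join(m["content"] for m in row["messages"])
        key = hashlib.sha256(normalize(text).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Run this across train + eval together, then re-split, so no near-copy appears on both sides.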

Example: instruction SFT format (simple and readable)

Here’s a format that works well for many apps: a short system rule, a user instruction, and the ideal assistant answer.

{
  "messages": [
    {"role": "system", "content": "You are a support agent. Output valid JSON only."},
    {"role": "user", "content": "Extract intent + entities from: 'Cancel my Pro plan effective tomorrow'."},
    {"role": "assistant", "content": "{"intent":"cancel_subscription","entities":{"plan":"pro","effective_date":"tomorrow"}}"}
  ]
}

Pro tip: If JSON correctness matters, include plenty of counterexamples (missing fields, invalid dates, contradictory text) and require the model to return a stable error object for every failure.
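
Example: a counterexample row with a stable error object

A counterexample row can look like the one below. The error schema itself is only an illustration; what matters is that every failure case in your dataset returns the same shape.

{
  "messages": [
    {"role": "system", "content": "You are a support agent. Output valid JSON only."},
    {"role": "user", "content": "Extract intent + entities from: 'asdkfj 123 !!'."},
    {"role": "assistant", "content": "{\"error\":{\"code\":\"unrecognized_request\",\"message\":\"No intent could be identified in the input.\"}}"}
  ]
}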

5) A practical training recipe (SFT with LoRA/QLoRA)

This is a reliable flow that works for most teams:

  1. Baseline: Evaluate the base model with your exact prompt template.
  2. Small pilot: 200–1,000 examples, run 1–2 short experiments.
  3. Scale data: Improve coverage and labeling, not just volume.
  4. Lock eval: Keep an untouched eval set + a “hard set.”
  5. Train: LoRA/QLoRA SFT, then compare to baseline.
  6. Iterate: Add failure cases to the dataset, re-run.

Key hyperparameters that actually matter

  • Learning rate: Too high = the model “forgets” and gets brittle. Too low = no movement.
  • Epochs: More epochs is not “more better.” Watch eval quality and stop early.
  • Sequence length: Short contexts can train “short attention” habits. Match production length.
  • LoRA target modules: If quality is stuck, consider targeting more linear layers (higher compute, sometimes better adaptation).

Real example: LoRA config (plain-English defaults)

from peft import LoraConfig

peft_config = LoraConfig(
    r=16,                 # capacity: 4–16 is a common starting band
    lora_alpha=32,        # scaling
    lora_dropout=0.05,    # regularization
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
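
Real example: wiring the config into a trainer (sketch)

To turn that config into a training run, most teams reach for TRL's SFTTrainer. The sketch below assumes a JSONL file of rows in the messages format from section 4; the model name, file paths, and hyperparameter values are placeholders, and argument names shift between TRL versions, so check the docs for the version you've pinned.

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train_ds = load_dataset("json", data_files="data/train.jsonl", split="train")

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",     # placeholder; can also be a preloaded (4-bit) model object
    train_dataset=train_ds,
    peft_config=peft_config,             # the LoraConfig defined above
    args=SFTConfig(
        output_dir="out/lora-pilot",
        num_train_epochs=1,              # start low; watch eval quality before adding epochs
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        logging_steps=10,
    ),
)
trainer.train()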

6) Evaluation: what to measure so you don’t lie to yourself

Fine-tuning evaluation should answer one question: Is this model safer, more correct, and more consistent on real traffic?

Minimum eval suite

  • Format correctness: JSON parses, schema validates, tool calls match spec.
  • Task success: Accuracy/F1 or exact-match for structured targets.
  • Regression checks: A stable set the base model already did well on.
  • Edge cases: Ambiguous inputs, noisy text, adversarial prompt injection attempts.

A simple scoring harness (example)

// Score one model output: parse first, then validate against the expected schema
function score(example, modelOutput) {
  let value;
  try {
    value = JSON.parse(modelOutput);
  } catch {
    return { ok: false, reason: "invalid_json" };
  }
  // schemaValidate = your schema check for this task (e.g. a compiled Ajv validator)
  if (!schemaValidate(value)) return { ok: false, reason: "schema_fail" };
  return { ok: true, reason: "pass" };
}

7) Common failure modes (and how to fix them)

Symptom → likely cause → fix:

  • Model got worse vs baseline → leakage/duplicates → dedupe and rebuild the eval set.
  • Great on eval, bad in prod → eval not representative → add real traffic and hard cases.
  • Format breaks intermittently → inconsistent targets → tighten labeling rules and add validators.
  • Over-refuses or under-refuses → bad safety examples → add explicit refusal and escalation patterns.

8) Deployment: adapters, merging, and operational reality

Deployment is where “research fine-tuning” becomes “production fine-tuning.” Your choices affect latency, memory, and rollback.

  • Base + adapter loading: flexible, supports many variants, slightly more complexity.
  • Merging adapter weights: simpler deployment, often slightly faster, but you lose the “one base / many adapters” pattern (both options are sketched in code below).
  • Versioning: treat datasets, prompts, and weights as versioned artifacts. If you can’t reproduce it, it’s not shipped.
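
Example: loading vs merging an adapter

Here is what the two options look like with peft. A minimal sketch; the base model name and adapter path are placeholders.

from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # placeholder base

# Option 1: keep base and adapter separate (one base, many adapters)
model = PeftModel.from_pretrained(base, "out/lora-pilot")  # adapter directory from training

# Option 2: merge the adapter into the base weights and ship a standalone model
merged = model.merge_and_unload()
merged.save_pretrained("out/merged-model")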

9) How-to: build a fine-tuning dataset that actually works

Most teams start by collecting “a bunch of examples.” A better approach is to design your dataset like an API contract: clear inputs, clear outputs, and explicit failure behavior.

Step-by-step (a practical 2–4 hour first pass)

  1. Pick 1 behavior to improve (start narrow). Examples: “valid JSON every time”, “tool calls follow spec”, “support tone + escalation rules”.
  2. Write a scoring rule before training. If you can’t score it, you can’t iterate it.
  3. Collect 50 real inputs from logs or tickets (anonymize). Don’t cherry-pick “easy” examples.
  4. Write ideal outputs with the same style and schema every time.
  5. Add 20 failure cases (invalid inputs, missing fields, conflicting requirements) and define the correct refusal/error output.
  6. Lock a 20-example eval set and never train on it.
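
Example: locking the eval split

One way to keep that eval set locked as the dataset grows is to split on a stable hash of the input instead of a random shuffle, so newly added rows can never drift into eval. A minimal sketch; the 10% threshold and the assumption of one user turn per row are placeholders.

import hashlib

def assign_split(row, eval_percent: int = 10) -> str:
    # Hash the raw user input (stable across re-labeling) so the same example
    # always lands in the same split, no matter when it was added.
    user_text = next(m["content"] for m in row["messages"] if m["role"] == "user")
    bucket = int(hashlib.sha256(user_text.encode()).hexdigest(), 16) % 100
    return "eval" if bucket < eval_percent else "train"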

Real example: tool-calling dataset row

Let’s say your model can call a billing tool. You want it to call the tool only when it has the required fields, and otherwise ask a clarifying question.

{
  "messages": [
    {"role": "system", "content": "You are a billing assistant. Use tools when needed. Never guess missing identifiers."},
    {"role": "user", "content": "Refund my last invoice."},
    {"role": "assistant", "content": "Before I can refund an invoice, I need your invoice ID or the email on the account. Which one can you share?"}
  ]
}

Dataset design rules that prevent pain

  • One intent per example (at first). Mixed intents produce mixed behavior.
  • Make the “right format” boring: consistent keys, consistent ordering, consistent error object.
  • Include counterexamples: invalid IDs, missing params, disallowed actions, prompt injection attempts.
  • Deduplicate aggressively: near-duplicates inflate eval and teach copy-paste responses.

10) How-to: create preference data for DPO (without overcomplicating it)

DPO is easiest when you can produce two plausible answers and reliably label which is better. You don’t need perfection—you need consistency.

  1. Start with prompts where the model already produces two distinct outputs (temperature helps for generating candidates).
  2. Label one as preferred based on a short rubric (format correctness, groundedness, helpfulness, tone).
  3. Keep examples small and specific; preference learning is sensitive to vague labels.
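
Example: a preference pair + DPO trainer (sketch)

Concretely, a preference row is just a (prompt, chosen, rejected) triple, and TRL's DPOTrainer consumes a dataset of them. This is a minimal sketch: the checkpoint path and the pair below are illustrative, the beta value is a common starting point rather than a recommendation, and argument names vary across TRL versions.

from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Placeholder: point this at the SFT-tuned checkpoint from the previous step
checkpoint = "out/lora-pilot-merged"
model = AutoModelForCausalLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

pairs = Dataset.from_list([{
    "prompt": "Refund my last invoice.",
    "chosen": "Before I can refund an invoice, I need your invoice ID or the email on the account. Which one can you share?",
    "rejected": "Done! I refunded your latest invoice.",  # guesses a missing identifier -- fails the rubric below
}])

trainer = DPOTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=pairs,
    args=DPOConfig(output_dir="out/dpo", beta=0.1),  # beta limits how far DPO can pull away from the SFT model
)
trainer.train()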

Mini rubric (copy/paste)

  • 1) Correct format: parses + validates (no exceptions).
  • 2) Follows policy: refuses prohibited actions, asks for missing identifiers.
  • 3) Completeness: answers the question without rambling.
  • 4) Tone: calm, professional, not “overconfident.”

11) Hardware + cost: quick sizing math (so you don’t guess)

You don’t need to be perfect—you need to be in the right order of magnitude. The biggest knobs are: model size, sequence length, batch size, and whether you use QLoRA.

  • If you hit OOM: reduce sequence length, reduce batch size, increase gradient accumulation, or switch to QLoRA.
  • If training is slow: shorten sequences for the pilot, reduce eval frequency, or use smaller models until your dataset is stable.
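
Example: back-of-envelope memory math

For the sizing itself, a back-of-envelope that gets you in the right ballpark is bytes per parameter: roughly 2 bytes/param for 16-bit weights, roughly 0.5 bytes/param in 4-bit, and a much larger multiple for full fine-tuning because of gradients and Adam optimizer states. The sketch below encodes those rules of thumb; they are rough assumptions, and activations/KV cache (which grow with sequence length and batch size) come on top.

def rough_weight_memory_gb(params_billion: float, mode: str) -> float:
    # Rule-of-thumb bytes per parameter for weights + training state.
    # Activations are NOT included and can dominate at long sequence lengths.
    bytes_per_param = {
        "full_ft_bf16": 16,   # ~2 weights + ~2 grads + ~12 fp32 master copy / Adam states
        "lora_bf16": 2,       # frozen bf16 base; adapter params and their optimizer state are tiny
        "qlora_4bit": 0.5,    # frozen 4-bit base
    }[mode]
    return params_billion * bytes_per_param  # 1B params x 1 byte/param ~= 1 GB

# An 8B model: ~128 GB for full fine-tuning, ~16 GB frozen in bf16, ~4 GB frozen in 4-bit --
# which is why QLoRA on a single 24 GB GPU is realistic and full fine-tuning is not.
print(rough_weight_memory_gb(8, "qlora_4bit"))  # -> 4.0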

Rule-of-thumb table (starting points)

For each goal: suggested approach, typical data size, and the common gotcha.

  • Strict JSON / schema → LoRA SFT + validators; 200–2,000 examples; gotcha: missing negative cases.
  • Tool calling reliability → LoRA SFT + “ask clarifying” examples; 500–5,000 examples; gotcha: model guesses missing IDs.
  • Tone / brand voice → SFT + light DPO; 500–10,000 examples; gotcha: inconsistent style targets.
  • Domain workflow reasoning → SFT with hard cases + eval; 2,000–50,000+ examples; gotcha: eval doesn’t match prod.

12) “Picture” overview: the production fine-tuning loop

Data (real inputs + labels, hard negatives) → Train (LoRA/QLoRA SFT, optional DPO) → Eval (format + task checks, regression suite) → Deploy (adapter or merged, versioned artifacts) → Monitor (drift + failures) → feed new failures back into the data.
This loop is the secret: ship a version, watch failures, add examples, retrain. Fine-tuning is not one-and-done.

13) Deployment checklist (what experienced teams don’t skip)

  • Artifact versioning: dataset hash, prompt template version, base model ID, adapter weights version.
  • Rollback plan: a single config change should revert to the previous model.
  • Safety gates: refusal tests, tool-call allowlist, schema validation in production.
  • Canary routing: route 1–5% of traffic to the new model and compare metrics (a minimal routing sketch follows this list).
  • Observability: log inputs/outputs + parse failures + tool errors (with privacy controls).
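
Example: deterministic canary routing

Canary routing does not require special infrastructure to start; a deterministic hash on a stable ID is enough. A minimal sketch, where the model IDs, the user-ID key, and the 5% split are all assumptions.

import hashlib

def pick_model(user_id: str, canary_percent: int = 5) -> str:
    # Deterministic assignment: the same user always sees the same model,
    # which keeps metric comparisons clean and makes rollback a config change.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "support-extractor-v2" if bucket < canary_percent else "support-extractor-v1"  # placeholder model IDs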

FAQ (direct answers)

How much data do I need?

For behavior tuning (format/style/tool use), start with 200–1,000 high-quality examples, then grow with failures. For robust reasoning shifts, expect more data plus stronger eval design.

What’s the fastest path to a successful fine-tune?

Build a small eval set first, run a small LoRA pilot, then invest in data quality and edge cases. Iteration speed wins.

Should I use DPO?

Use DPO when you can express preference pairs (A is better than B) and you want to shape “style/choice” behavior. Start with SFT; add DPO once you have a clear preference signal.

Bottom line

  • Fine-tuning is a behavior upgrade, not a knowledge base.
  • Default to LoRA/QLoRA for iteration speed and cost.
  • Your biggest lever is dataset quality + evaluation, not “secret hyperparameters.”

Related Topics

Fine-tuning · LoRA · QLoRA · PEFT · SFT · DPO · RLHF · TRL · Hugging Face · Evaluation

Ready to put this into practice?

Start building your AI pipeline with our visual DAG builder today.