LLM Fine-Tuning Best Practices & Techniques (LoRA, QLoRA, SFT, DPO)
A practical, end-to-end guide to fine-tuning LLMs: choosing LoRA vs QLoRA vs full tuning, data formatting, evals, costs, and deployment pitfalls.
Fine-tuning isn’t magic, and it’s not a rite of passage. It’s an engineering trade: you’re paying compute + data curation + evaluation to buy more reliable model behavior. This guide is written for builders who want a clean, repeatable path to shipping a tuned model—without turning their team into “prompt whisperers.”
Quick answer (for Gemini-style summaries)
- Use prompting when a few examples solve it and you can tolerate some variance.
- Use RAG when the “truth” changes and you need citations or per-customer knowledge.
- Fine-tune when you need stable behavior: format, tool use, tone, domain patterns, and lower prompt complexity.
- Default to LoRA/QLoRA (PEFT). Try full fine-tuning only after you’ve proven PEFT can’t hit your target.
1) What fine-tuning actually changes (and what it doesn’t)
Fine-tuning teaches the model a mapping: input → preferred output. If your dataset consistently encodes a behavior (structure, style, decision logic), the model will learn it. But fine-tuning is not a great way to “store” a changing knowledge base, and it won’t automatically make the model truthful.
- Great for: consistent JSON, tool calling patterns, tone/voice, domain-specific workflows, reducing prompt bloat.
- Bad for: frequently changing facts, per-tenant knowledge, explainability/citations, “just learn our docs.”
2) Decision framework: Prompting vs RAG vs fine-tuning
If you remember one thing, remember this: RAG is for knowledge, fine-tuning is for behavior.
3) Techniques: full fine-tuning vs LoRA vs QLoRA (what to pick)
Most teams should start with PEFT—Parameter-Efficient Fine-Tuning—because it’s cheaper and easier to iterate. LoRA and QLoRA are the two workhorses.
Fast selection guide
| Approach | Key traits |
|---|---|
| LoRA | Frozen base + small adapters; strong quality/cost tradeoff; easy to manage variants |
| QLoRA | Base model kept in 4-bit; enables larger models on smaller GPUs; slightly more finicky setup |
| Full fine-tuning | Highest cost/complexity; higher risk of catastrophic forgetting; harder rollback/variant management |
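For intuition on why the adapters are small: LoRA learns a low-rank update for each targeted weight matrix, so the trainable parameter count per matrix is r × (d_in + d_out) instead of d_in × d_out. A quick back-of-envelope check (the 4096-wide projection below is purely illustrative, not tied to any specific model):
# Rough LoRA parameter count for one targeted projection matrix.
d_in, d_out = 4096, 4096          # hypothetical attention projection size
r = 16                            # LoRA rank, matching the config later in this guide
full_params = d_in * d_out        # ~16.8M weights in the frozen matrix
lora_params = r * (d_in + d_out)  # ~131K trainable adapter weights
print(f"full: {full_params:,}  lora: {lora_params:,}  ratio: {lora_params / full_params:.2%}")  # ~0.78%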
4) Data: what “good training data” really means
The #1 reason fine-tunes fail isn’t hyperparameters. It’s the dataset. Your model learns exactly what you show it—ambiguity included.
Minimum viable dataset checklist
- Unambiguous targets: If two annotators disagree, your model will wobble.
- Representative distribution: Your training set should look like real traffic, including edge cases.
- Hard negatives: Include “don’t do it” examples (e.g., refuse unsafe tool calls, reject invalid inputs).
- Held-out eval set: Lock an evaluation set early and don’t “fix” it mid-training.
- Leakage control: Deduplicate near-duplicates across train/eval (copy-paste kills honest metrics).
If you want a deeper dive into how to design labeling guidelines, source examples from production, and keep datasets high-quality over time, read Data Labeling & Dataset Quality: The Foundation of Reliable LLM Fine-Tuning.
Example: instruction SFT format (simple and readable)
Here’s a format that works well for many apps: a short system rule, a user instruction, and the ideal assistant answer.
{
  "messages": [
    {"role": "system", "content": "You are a support agent. Output valid JSON only."},
    {"role": "user", "content": "Extract intent + entities from: 'Cancel my Pro plan effective tomorrow'."},
    {"role": "assistant", "content": "{\"intent\":\"cancel_subscription\",\"entities\":{\"plan\":\"pro\",\"effective_date\":\"tomorrow\"}}"}
  ]
}
Pro tip: If JSON correctness matters, include plenty of counterexamples: missing fields, invalid dates, contradictory text, and require the model to return a stable error object.
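For instance, a counterexample row might look like this; the exact error object is up to you, as long as it is stable and machine-readable:
{
  "messages": [
    {"role": "system", "content": "You are a support agent. Output valid JSON only."},
    {"role": "user", "content": "Cancel my plan, but also don't cancel anything yet."},
    {"role": "assistant", "content": "{\"error\":{\"code\":\"ambiguous_request\",\"message\":\"The request is contradictory; confirm whether the user wants to cancel.\"}}"}
  ]
}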
5) A practical training recipe (SFT with LoRA/QLoRA)
This is a reliable flow that works for most teams:
- Baseline: Evaluate the base model with your exact prompt template.
- Small pilot: 200–1,000 examples, run 1–2 short experiments.
- Scale data: Improve coverage and labeling, not just volume.
- Lock eval: Keep an untouched eval set + a “hard set.”
- Train: LoRA/QLoRA SFT, then compare to baseline.
- Iterate: Add failure cases to the dataset, re-run.
Key hyperparameters that actually matter
- Learning rate: Too high = the model “forgets” and gets brittle. Too low = no movement.
- Epochs: More epochs is not “more better.” Watch eval quality and stop early.
- Sequence length: Short contexts can train “short attention” habits. Match production length.
- LoRA target modules: If quality is stuck, consider targeting more linear layers (higher compute, sometimes better adaptation).
Real example: LoRA config (plain-English defaults)
from peft import LoraConfig
peft_config = LoraConfig(
r=16, # capacity: 4–16 is a common starting band
lora_alpha=32, # scaling
lora_dropout=0.05, # regularization
bias="none",
task_type="CAUSAL_LM",
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
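To put the config above to work, here is a minimal QLoRA + SFT sketch using trl. Treat it as a sketch, not a drop-in script: the model ID, dataset file, and output paths are placeholders, and some argument names (especially around sequence length and tokenizer handling) differ between trl versions.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer
model_id = "your-org/your-base-model"  # placeholder
# QLoRA: keep the frozen base in 4-bit (NF4) so larger models fit on smaller GPUs
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
# Rows in the "messages" format shown earlier; older trl versions may need a
# pre-rendered text field instead of a messages column.
dataset = load_dataset("json", data_files="train.jsonl", split="train")
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,              # the LoraConfig defined above
    args=SFTConfig(
        output_dir="out/qlora-pilot",
        num_train_epochs=1,               # start small; watch eval before adding epochs
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,    # effective batch size of 16
        learning_rate=2e-4,               # a common LoRA starting point; tune from here
        logging_steps=10,
    ),
)
trainer.train()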
6) Evaluation: what to measure so you don’t lie to yourself
Fine-tuning evaluation should answer one question: Is this model safer, more correct, and more consistent on real traffic?
Minimum eval suite
- Format correctness: JSON parses, schema validates, tool calls match spec.
- Task success: Accuracy/F1 or exact-match for structured targets.
- Regression checks: A stable set the base model already did well on.
- Edge cases: Ambiguous inputs, noisy text, adversarial prompt injection attempts.
A simple scoring harness (example)
// Score one model output: it must parse as JSON and pass your schema check.
// schemaValidate is assumed to be your own validator (e.g. a compiled JSON Schema).
function safeJsonParse(text) {
  try { return { ok: true, value: JSON.parse(text) }; }
  catch { return { ok: false }; }
}
function score(example, modelOutput) {
  const parsed = safeJsonParse(modelOutput);
  if (!parsed.ok) return { ok: false, reason: "invalid_json" };
  if (!schemaValidate(parsed.value)) return { ok: false, reason: "schema_fail" };
  return { ok: true, reason: "pass" };
}
7) Common failure modes (and how to fix them)
| Symptom | Likely cause → fix |
|---|---|
| Model got worse vs baseline | Leakage/duplicates → dedupe + rebuild eval |
| Great on eval, bad in prod | Eval not representative → add real traffic + hard cases |
| Format breaks intermittently | Inconsistent targets → tighten labeling rules + add validators |
| Over-refuses or under-refuses | Bad safety examples → add explicit refusal + escalation patterns |
8) Deployment: adapters, merging, and operational reality
Deployment is where “research fine-tuning” becomes “production fine-tuning.” Your choices affect latency, memory, and rollback.
- Base + adapter loading: flexible, supports many variants, slightly more complexity.
- Merging adapter weights: simpler deployment, often slightly faster, but you lose the “one base / many adapters” pattern (see the sketch after this list).
- Versioning: treat datasets, prompts, and weights as versioned artifacts. If you can’t reproduce it, it’s not shipped.
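A minimal sketch of the two loading options above with peft (model IDs and paths are placeholders):
from peft import PeftModel
from transformers import AutoModelForCausalLM
base = AutoModelForCausalLM.from_pretrained("your-org/your-base-model")  # placeholder
# Option A: keep the base frozen and load the adapter at serving time.
# One base can serve many variants; adds a little load/management complexity.
model = PeftModel.from_pretrained(base, "out/qlora-pilot")  # placeholder adapter path
# Option B: merge the adapter into the base weights and ship a single artifact.
# Simpler to deploy, often marginally faster, but you give up easy per-variant swapping.
merged = model.merge_and_unload()
merged.save_pretrained("out/merged-model")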
9) How-to: build a fine-tuning dataset that actually works
Most teams start by collecting “a bunch of examples.” A better approach is to design your dataset like an API contract: clear inputs, clear outputs, and explicit failure behavior.
Step-by-step (a practical 2–4 hour first pass)
- Pick 1 behavior to improve (start narrow). Examples: “valid JSON every time”, “tool calls follow spec”, “support tone + escalation rules”.
- Write a scoring rule before training. If you can’t score it, you can’t iterate it.
- Collect 50 real inputs from logs or tickets (anonymize). Don’t cherry-pick “easy” examples.
- Write ideal outputs with the same style and schema every time.
- Add 20 failure cases (invalid inputs, missing fields, conflicting requirements) and define the correct refusal/error output.
- Lock a 20-example eval set and never train on it.
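A minimal sketch of the dedupe-and-lock step, assuming your rows live in a JSONL file of "messages" objects (filenames and the crude normalization rule are placeholders):
import json, random
# Drop near-duplicates (here: exact match after crude normalization),
# then carve off a fixed eval set that never gets trained on.
def normalize(row):
    text = " ".join(m["content"] for m in row["messages"])
    return " ".join(text.lower().split())
rows = [json.loads(line) for line in open("all_examples.jsonl")]
seen, unique = set(), []
for row in rows:
    key = normalize(row)
    if key not in seen:
        seen.add(key)
        unique.append(row)
random.seed(0)  # deterministic split so the eval set stays locked
random.shuffle(unique)
eval_set, train_set = unique[:20], unique[20:]
for name, split in [("eval.jsonl", eval_set), ("train.jsonl", train_set)]:
    with open(name, "w") as f:
        f.writelines(json.dumps(r) + "\n" for r in split)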
Real example: tool-calling dataset row
Let’s say your model can call a billing tool. You want it to call the tool only when it has the required fields, and to ask a clarifying question otherwise.
{
  "messages": [
    {"role": "system", "content": "You are a billing assistant. Use tools when needed. Never guess missing identifiers."},
    {"role": "user", "content": "Refund my last invoice."},
    {"role": "assistant", "content": "Before I can refund an invoice, I need your invoice ID or the email on the account. Which one can you share?"}
  ]
}
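The matching happy-path row, where the user supplies the identifier, should show the actual tool call. The tool-call encoding below is hypothetical; use whatever shape your training framework and serving stack expect:
{
  "messages": [
    {"role": "system", "content": "You are a billing assistant. Use tools when needed. Never guess missing identifiers."},
    {"role": "user", "content": "Refund invoice INV-20931, please."},
    {"role": "assistant", "content": null, "tool_calls": [
      {"name": "refund_invoice", "arguments": {"invoice_id": "INV-20931"}}
    ]}
  ]
}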
Dataset design rules that prevent pain
- One intent per example (at first). Mixed intents produce mixed behavior.
- Make the “right format” boring: consistent keys, consistent ordering, consistent error object.
- Include counterexamples: invalid IDs, missing params, disallowed actions, prompt injection attempts.
- Deduplicate aggressively: near-duplicates inflate eval and teach copy-paste responses.
10) How-to: create preference data for DPO (without overcomplicating it)
DPO is easiest when you can produce two plausible answers and reliably label which is better. You don’t need perfection—you need consistency.
- Start with prompts where the model already produces two distinct outputs (temperature helps for generating candidates).
- Label one as preferred based on a short rubric (format correctness, groundedness, helpfulness, tone).
- Keep examples small and specific; preference learning is sensitive to vague labels.
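A preference row can be as small as a prompt plus a chosen and a rejected completion. The prompt/chosen/rejected field names below follow the common convention used by libraries like trl; adapt them to whatever your trainer expects:
{
  "prompt": "Extract intent + entities from: 'Cancel my Pro plan effective tomorrow'.",
  "chosen": "{\"intent\":\"cancel_subscription\",\"entities\":{\"plan\":\"pro\",\"effective_date\":\"tomorrow\"}}",
  "rejected": "Sure! I went ahead and assumed you mean the Pro plan and that you want it cancelled right away."
}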
Mini rubric (copy/paste)
1) Correct format: parses + validates (no exceptions).
2) Follows policy: refuses prohibited actions, asks for missing identifiers.
3) Completeness: answers the question without rambling.
4) Tone: calm, professional, not “overconfident.”
11) Hardware + cost: quick sizing math (so you don’t guess)
You don’t need to be perfect—you need to be in the right order of magnitude. The biggest knobs are: model size, sequence length, batch size, and whether you use QLoRA.
- If you hit OOM: reduce sequence length, reduce batch size, increase gradient accumulation, or switch to QLoRA.
- If training is slow: shorten sequences for the pilot, reduce eval frequency, or use smaller models until your dataset is stable.
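For a rough weight-memory estimate, multiply parameter count by bytes per parameter: about 2 bytes/param in bf16/fp16 and roughly 0.5 bytes/param in 4-bit. The sketch below counts model weights only; activations, adapter optimizer state, and framework overhead come on top, so leave headroom:
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate GPU memory for model weights only (no activations, optimizer, or overhead)."""
    return params_billion * 1e9 * bytes_per_param / 1024**3
# Illustrative 7B-parameter model:
print(round(weight_memory_gb(7, 2.0), 1))   # ~13.0 GiB in bf16 (LoRA on a full-precision base)
print(round(weight_memory_gb(7, 0.5), 1))   # ~3.3 GiB in 4-bit (QLoRA)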
Rule-of-thumb table (starting points)
| Goal | Suggested approach | Typical data size | Common gotcha |
|---|---|---|---|
| Strict JSON / schema | LoRA SFT + validators | 200–2,000 examples | Missing negative cases |
| Tool calling reliability | LoRA SFT + “ask clarifying” examples | 500–5,000 examples | Model guesses missing IDs |
| Tone / brand voice | SFT + light DPO | 500–10,000 examples | Inconsistent style targets |
| Domain workflow reasoning | SFT with hard cases + eval | 2,000–50,000+ | Eval doesn’t match prod |
12) “Picture” overview: the production fine-tuning loop
In one sentence: collect real traffic → curate and label → train (LoRA/QLoRA SFT) → evaluate against the locked eval and hard sets → canary deploy → monitor failures → feed them back into the dataset and repeat.
13) Deployment checklist (what experienced teams don’t skip)
- Artifact versioning: dataset hash, prompt template version, base model ID, adapter weights version.
- Rollback plan: a single config change should revert to the previous model.
- Safety gates: refusal tests, tool-call allowlist, schema validation in production (example after this checklist).
- Canary routing: route 1–5% of traffic to the new model and compare metrics.
- Observability: log inputs/outputs + parse failures + tool errors (with privacy controls).
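A minimal version of the schema gate from the safety-gates item, using the jsonschema library (the schema itself is illustrative):
import json
from jsonschema import ValidationError, validate
# Hypothetical schema for the intent-extraction output used earlier in this guide.
INTENT_SCHEMA = {
    "type": "object",
    "required": ["intent", "entities"],
    "properties": {
        "intent": {"type": "string"},
        "entities": {"type": "object"},
    },
    "additionalProperties": False,
}
def gate(model_output: str) -> dict:
    """Reject anything that doesn't parse and validate; never pass raw output downstream."""
    try:
        payload = json.loads(model_output)
        validate(payload, INTENT_SCHEMA)
        return {"ok": True, "payload": payload}
    except (json.JSONDecodeError, ValidationError) as err:
        return {"ok": False, "error": type(err).__name__}  # log it and route to a fallback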
FAQ (direct answers)
How much data do I need?
For behavior tuning (format/style/tool use), start with 200–1,000 high-quality examples, then grow with failures. For robust reasoning shifts, expect more data plus stronger eval design.
What’s the fastest path to a successful fine-tune?
Build a small eval set first, run a small LoRA pilot, then invest in data quality and edge cases. Iteration speed wins.
Should I use DPO?
Use DPO when you can express preference pairs (A is better than B) and you want to shape “style/choice” behavior. Start with SFT; add DPO once you have a clear preference signal.
Bottom line
- Fine-tuning is a behavior upgrade, not a knowledge base.
- Default to LoRA/QLoRA for iteration speed and cost.
- Your biggest lever is dataset quality + evaluation, not “secret hyperparameters.”