📚 Guides

Step-by-step tutorials for common workflows

1. Complete Training Workflow

End-to-end walkthrough from dataset to deployed model

🎯 What you'll learn:

  • Upload and prepare training data
  • Create and validate training configuration
  • Start and monitor training jobs
  • Download and deploy trained models

Step 1: Prepare Your Dataset

First, create a JSONL file with your training examples. Each line should be a valid JSON object.

Example: training_data.jsonl

{"messages": [{"role": "user", "content": "What is machine learning?"}, {"role": "assistant", "content": "Machine learning is..."}]}
{"messages": [{"role": "user", "content": "Explain neural networks"}, {"role": "assistant", "content": "Neural networks are..."}]}

Step 2: Upload Dataset

Upload via API:

curl -X POST https://finetunelab.ai/api/training/datasets \
  -H "Authorization: Bearer YOUR_ACCESS_TOKEN" \
  -F "file=@training_data.jsonl" \
  -F "name=My Training Dataset"

Step 3: Create Training Config

POST /api/training

curl -X POST https://finetunelab.ai/api/training \
  -H "Authorization: Bearer YOUR_ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-finetuned-model",
    "base_model": "meta-llama/Llama-3.2-1B",
    "dataset_id": "dataset-123",
    "epochs": 3
  }'

Step 4: Start Training

POST /api/training/execute

curl -X POST https://finetunelab.ai/api/training/execute \
  -H "Authorization: Bearer YOUR_ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"id": "config-456"}'

Step 5: Monitor Progress

GET /api/training/metrics/:id

curl https://finetunelab.ai/api/training/metrics/job-789 \
  -H "Authorization: Bearer YOUR_ACCESS_TOKEN"

Tell your AI assistant:

"Walk me through fine-tuning a model on my custom dataset"

2. Dataset Creation Best Practices

Professional guide for creating high-quality training datasets

⭐ The Golden Rule

Quality Over Quantity: A smaller, high-quality dataset is often more effective than a large, noisy one.

Research shows: For every 1% increase in training data error, you may see a ~4% increase in model error (quadratic impact).

Dataset Size Guidelines

| Use Case | Minimum | Recommended | Optimal |
|---|---|---|---|
| Simple tasks (formatting, tone) | 50-100 | 200-500 | 500-1,000 |
| Domain-specific assistant | 500 | 1,000-2,000 | 5,000-10,000 |
| Technical support bot | 1,000 | 2,000-5,000 | 10,000-20,000 |
| Complex reasoning | 5,000 | 10,000-50,000 | 100,000+ |

The 7-Category Framework

A well-balanced dataset should include these categories in the following ratios:

1. Positive Examples

40-50%

Standard Q&As about features, capabilities, and workflows. What the system DOES do.

2. Adversarial/Negative Examples

15-20%

Explicitly state what the system does NOT support. Critical to prevent hallucinations.

Example: "Q: Can I train on AWS? A: No. Training only runs on RunPod infrastructure."

3. Boundary Examples

10-15%

Define edges of functionality - when it works, when it doesn't. Prevents overgeneralization.

4. Disambiguation Examples

5-10%

Clarify concepts that could be confused. Distinguish similar concepts.

5. Procedural/Navigation Examples

10-15%

Granular UI navigation and step-by-step workflows. Enable self-service.

6. Insider Knowledge Examples

5-10%

Power user details, pricing nuances, performance tips. Expert insights and gotchas.

7. Failure Mode Examples

5-10%

Common errors, troubleshooting, and solutions. Prevent user frustration.
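
To turn the ratios above into concrete per-category counts, a small helper can be used. A sketch using the lower bound of each range (they sum to 90%, so the remainder is left flexible to spend where your dataset is weakest; the category names are shorthand for the seven above):

```python
# Lower-bound fractions of the 7-category target ranges above.
CATEGORY_MINIMUMS = {
    "positive": 0.40,
    "adversarial_negative": 0.15,
    "boundary": 0.10,
    "disambiguation": 0.05,
    "procedural_navigation": 0.10,
    "insider_knowledge": 0.05,
    "failure_mode": 0.05,
}

def category_targets(total_examples: int) -> dict:
    """Minimum example count per category, plus the flexible remainder."""
    targets = {name: round(total_examples * frac)
               for name, frac in CATEGORY_MINIMUMS.items()}
    targets["flexible"] = total_examples - sum(targets.values())
    return targets
```

For a 1,000-example dataset this yields 400 positive, 150 adversarial, and so on, with 100 examples left to allocate toward underrepresented categories found during Phase 2.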

Enterprise Dataset Creation Workflow

Phase 1: Planning (Week 1)

  • Define objectives and success criteria
  • Identify data sources (docs, support tickets, expert knowledge)
  • Estimate target dataset size

Phase 2: Collection (Week 2-3)

  • Gather raw data from multiple sources
  • Initial categorization into 7 categories
  • Flag gaps where categories are underrepresented

Phase 3: Curation (Week 3-4)

  • Remove duplicates and low-quality examples
  • Validate labels and schema integrity
  • Ensure diversity and edge case coverage

Phase 4: Assembly (Week 4)

  • Balance categories to match target ratios
  • Create 80/20 train/validation split (stratified)
  • Never leak validation examples into training
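
The stratified 80/20 split in Phase 4 can be sketched as follows (a sketch, not a library call; the `key` function is whatever extracts your category label, and shuffling per category keeps the split random while preserving ratios):

```python
import random

def stratified_split(examples, key, train_frac=0.8, seed=42):
    """Split examples into train/validation, preserving category proportions.

    `key` maps an example to its category label; a fixed seed makes the
    split reproducible so validation examples never drift into training.
    """
    by_category = {}
    for ex in examples:
        by_category.setdefault(key(ex), []).append(ex)

    rng = random.Random(seed)
    train, val = [], []
    for items in by_category.values():
        rng.shuffle(items)
        cut = int(len(items) * train_frac)
        train.extend(items[:cut])
        val.extend(items[cut:])
    return train, val
```

Because each category is split independently, a category with 50 examples contributes exactly 40 to training and 10 to validation, rather than whatever a global shuffle happens to produce.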

Phase 5: Validation (Week 5)

  • Calculate quality metrics (aim for <1% error rate)
  • Pilot test on small subset (100-200 examples)
  • Iterate based on results

⚠️ Common Pitfalls to Avoid

❌ Focusing Only on Positives (60%+ positive examples)

Model fills gaps with prior knowledge, creates plausible but wrong answers. Maintain 15-20% adversarial examples.

❌ Ignoring Data Quality

1% error in data → ~4% error in model (quadratic). Invest in cleaning upfront.

❌ Not Testing Edge Cases

Dedicate 15-20% of dataset to edge cases and boundary examples. Model fails on unusual inputs otherwise.

❌ Insufficient Adversarial Examples

For every major feature, include "what we DON'T do" examples to prevent confident wrong answers.

🎯 Key Takeaways

  • Quality > Quantity: 500 perfect examples beat 5,000 messy ones
  • Balance Categories: 40-50% positive, 15-20% adversarial minimum
  • Adversarial Examples Critical: Prevent gap-filling with prior knowledge
  • Start Small, Iterate: 100-500 examples → pilot → scale based on results
  • Clean Data = 4x Impact: 1% error in data → ~4% error in model
  • Test Real Responses: Don't trust metrics alone

Tell your AI assistant:

"Help me create a balanced training dataset using the 7-category framework"

📚 Complete Reference Guide

This section is based on comprehensive research into 2025 enterprise best practices. For the full detailed guide with sources and advanced techniques, see:

Professional Dataset Creation Guide →

3. Dataset Preparation

Format, validate, and optimize your training data

⚠️ Dataset Quality Matters

Poor quality data leads to poor models. Spend time cleaning and formatting your dataset properly.

Required Format

Training data must be in JSONL format (JSON Lines), with each line containing a single JSON object with a messages array. The example below is pretty-printed for readability; in the actual file, each object must occupy one line:

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "User question here"},
    {"role": "assistant", "content": "Model response here"}
  ]
}

Best Practices

✓ Size Guidelines

Minimum 50 examples, optimal 500-5000 examples

✓ Quality Over Quantity

Better to have 100 high-quality examples than 1000 poor ones

✓ Diverse Examples

Cover different scenarios and edge cases

✓ Consistent Format

Use the same structure and tone across all examples

Validation Checklist

  • □ Valid JSON on every line
  • □ Each line has a messages array
  • □ Messages have role and content fields
  • □ No empty content strings
  • □ Reasonable token length (<4000 tokens per example)
  • □ UTF-8 encoding
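
The checklist above can be automated. A minimal validator sketch (the 4-characters-per-token length estimate is a rough heuristic, not a real tokenizer):

```python
import json

MAX_TOKENS = 4000

def validate_jsonl_line(line: str) -> list:
    """Return a list of problems found in one JSONL line (empty = valid)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]

    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' array"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict) or "role" not in msg or "content" not in msg:
            problems.append(f"message {i}: missing 'role' or 'content'")
        elif not str(msg["content"]).strip():
            problems.append(f"message {i}: empty content")

    # Rough token estimate: ~4 characters per token for English text.
    if len(line) / 4 > MAX_TOKENS:
        problems.append("example likely exceeds 4000 tokens")
    return problems
```

Running this over every line before upload catches the failures that otherwise surface mid-training, when they are far more expensive to fix.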

4. Hyperparameter Tuning

Optimize training parameters for best results

🎯 Key Parameters:

Learning rate, batch size, and epochs are the most impactful parameters to tune.

Learning Rate

Controls how fast the model learns. Too high = unstable, too low = slow convergence.

| Learning Rate | Guidance |
|---|---|
| 1e-5 to 1e-4 | Recommended for most fine-tuning tasks |
| 1e-4 to 1e-3 | For smaller models or large datasets |
| > 1e-3 | Usually too high; may cause training instability |

Batch Size

Number of examples processed before updating model weights.

| Batch Size | GPU |
|---|---|
| 1-2 | Small GPU memory (<8GB VRAM) |
| 4-8 | Medium GPU (8-16GB VRAM) |
| 16-32 | Large GPU (>16GB VRAM) |

Epochs

Number of times the model sees the entire dataset.

| Epochs | Dataset Size |
|---|---|
| 1-3 | Large datasets (>1000 examples) |
| 3-5 | Medium datasets (100-1000 examples) |
| 5-10 | Small datasets (<100 examples) |

Example Configuration

Optimized Training Config

{
  "learning_rate": 0.0001,
  "batch_size": 4,
  "epochs": 3,
  "warmup_steps": 100,
  "eval_steps": 50,
  "save_steps": 100
}
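
It is worth sanity-checking a config's step arithmetic before launching. A sketch assuming one optimizer step per batch (no gradient accumulation), using the config above:

```python
import math

def training_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Total optimizer steps: batches per epoch times number of epochs."""
    return math.ceil(num_examples / batch_size) * epochs

# With the example config (batch_size=4, epochs=3) and a 500-example dataset:
total = training_steps(500, batch_size=4, epochs=3)  # 125 batches * 3 = 375
assert total > 100, "warmup_steps=100 would consume most of training otherwise"
```

If the computed total is close to or below `warmup_steps`, the learning rate never reaches its target value and the run will appear to learn nothing; raise epochs or lower `warmup_steps` accordingly.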

5. Deploy to Production Inference

Deploy your trained model to cloud inference with RunPod Serverless

🚀 Deployment Options:

  • RunPod Serverless - Auto-scaling cloud inference (A4000 to H100 GPUs)
  • Budget controls with real-time cost tracking
  • Auto-stop protection to prevent overspending
  • Production-ready API endpoints

Option A: RunPod Serverless (Recommended)

Deploy to cloud with auto-scaling, pay-per-request pricing, and built-in budget controls.

Step 1: Configure RunPod API Key

Go to Settings → Secrets and add your RunPod API key with name "runpod".

Step 2: Deploy via API

POST /api/inference/deploy

curl -X POST https://finetunelab.ai/api/inference/deploy \
  -H "Authorization: Bearer YOUR_ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "runpod-serverless",
    "deployment_name": "my-model-prod",
    "training_job_id": "your-job-id",
    "gpu_type": "NVIDIA RTX A4000",
    "budget_limit": 10.0
  }'

Step 3: Monitor Deployment

Navigate to /inference page to view deployment status, cost tracking, and manage your endpoint.

GET /api/inference/deployments/:id/status

curl https://finetunelab.ai/api/inference/deployments/dep-xyz789/status \
  -H "Authorization: Bearer YOUR_ACCESS_TOKEN"

Step 4: Make Inference Requests

Use the endpoint URL from the deployment response:

curl -X POST https://api.runpod.ai/v2/endpoint-id/runsync \
  -H "Authorization: Bearer YOUR_RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": {"prompt": "Hello", "max_tokens": 100}
  }'

✓ RunPod Benefits:

  • Auto-scaling: Scale to zero when idle, scale up automatically
  • Budget controls: Set limits, get alerts, auto-stop on budget
  • Cost tracking: Real-time spend monitoring per request
  • GPU variety: Choose from A4000 ($0.0004/req) to H100 ($0.0035/req)
  • No infrastructure: Fully managed, no server maintenance

💰 Budget & Cost Management

Budget Limits: The minimum budget is $1.00; $10-50 is recommended for production

Cost Tracking: View real-time spend, request count, and cost per request on /inference page

Budget Alerts: Automatic alerts at 50%, 80%, and 100% budget utilization

Auto-Stop: Deployment automatically stops when budget limit is reached (optional)

GPU Pricing: A4000 ($0.0004/req) | A5000 ($0.0006/req) | A6000 ($0.0008/req) | H100 ($0.0035/req)
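
Given the per-request prices and alert thresholds above, expected spend is easy to estimate up front. A sketch using the prices quoted in this section (illustrative, not a billing calculator):

```python
# Per-request prices quoted above (USD).
GPU_PRICE_PER_REQUEST = {
    "A4000": 0.0004,
    "A5000": 0.0006,
    "A6000": 0.0008,
    "H100": 0.0035,
}

def estimate_spend(gpu: str, requests: int) -> float:
    """Projected spend for a given GPU tier and request volume."""
    return GPU_PRICE_PER_REQUEST[gpu] * requests

def alert_thresholds(budget_limit: float) -> dict:
    """Dollar amounts at which the 50/80/100% budget alerts fire."""
    return {pct: budget_limit * pct / 100 for pct in (50, 80, 100)}
```

For example, 10,000 requests on an A4000 projects to about $4.00, comfortably inside the recommended $10 starting budget, while the same traffic on an H100 projects to $35.00 and would need a larger limit.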

Production Deployment Checklist

  • □ Training completed with best checkpoint available
  • □ RunPod API key configured in Settings → Secrets
  • □ Budget limit set appropriately for expected traffic
  • □ Auto-stop enabled to prevent overspending
  • □ Test deployment with small budget first ($1-5)
  • □ Monitor costs on /inference page regularly
  • □ Set up monitoring for 50% and 80% budget alerts
  • □ Have plan to scale or adjust budget as needed

6. Performance Monitoring

Track training progress and analyze results

📊 Key Metrics to Track:

  • Training loss (should decrease over time)
  • Evaluation loss (monitors overfitting)
  • Learning rate (changes with warmup/decay)
  • GPU utilization (efficiency indicator)

Real-Time Metrics

GET /api/training/metrics/:id

curl https://finetunelab.ai/api/training/metrics/job-789 \
  -H "Authorization: Bearer YOUR_ACCESS_TOKEN"

Example response:

{
  "job_id": "job-789",
  "current_step": 150,
  "total_steps": 300,
  "train_loss": 0.234,
  "eval_loss": 0.298,
  "learning_rate": 0.00009,
  "epoch": 1.5,
  "gpu_memory_used": "12.5GB"
}
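
A client can compute progress and watch the eval/train gap directly from this payload. A sketch assuming the field names shown above:

```python
def summarize_metrics(metrics: dict) -> str:
    """Render a one-line progress summary from a metrics response."""
    pct = 100 * metrics["current_step"] / metrics["total_steps"]
    gap = metrics["eval_loss"] - metrics["train_loss"]
    return (f"step {metrics['current_step']}/{metrics['total_steps']} "
            f"({pct:.0f}%), train_loss={metrics['train_loss']:.3f}, "
            f"eval/train gap={gap:+.3f}")
```

A gap that grows steadily while training loss keeps falling is the overfitting signature described below, so this single number is worth logging on every poll.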

Training Logs

GET /api/training/logs/:id

curl https://finetunelab.ai/api/training/logs/job-789 \
  -H "Authorization: Bearer YOUR_ACCESS_TOKEN"

Job Analytics

GET /api/training/analytics/:id

curl https://finetunelab.ai/api/training/analytics/job-789 \
  -H "Authorization: Bearer YOUR_ACCESS_TOKEN"

Understanding Loss Metrics

✓ Good Training

Training loss decreases steadily, eval loss follows similar pattern

⚠️ Overfitting

Training loss decreases but eval loss increases or plateaus

✗ Training Issues

Loss increases, stays flat, or shows erratic behavior
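
The three patterns above can be checked mechanically over a loss history. A heuristic sketch (the first-third vs last-third comparison is illustrative, not a tuned rule):

```python
def diagnose(train_losses: list, eval_losses: list) -> str:
    """Classify a run as 'good', 'overfitting', or 'training issue'.

    Heuristic: compare the mean of the first and last thirds of each
    curve; a negative trend means the loss is improving.
    """
    def trend(losses):
        third = max(1, len(losses) // 3)
        early = sum(losses[:third]) / third
        late = sum(losses[-third:]) / third
        return late - early

    if trend(train_losses) >= 0:
        return "training issue"   # training loss flat or increasing
    if trend(eval_losses) >= 0:
        return "overfitting"      # train improves but eval does not
    return "good"
```

This is only a first-pass triage; a borderline result should still be confirmed by inspecting the curves and sample outputs, per the diagnostic tests below.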

Diagnosing Model vs Dataset Issues

When training isn't going well, it's crucial to identify whether the problem is your model choice or your dataset. Here's how to tell:

🧠 Signs Your Model is the Problem

  • Loss plateaus immediately - Model may be too small to learn the task complexity
  • Training loss stays high (>2.0 for language tasks) - Model lacks capacity for your data
  • Consistent poor performance across all examples - Wrong model architecture for your task
  • Can't memorize even a single training example - Model fundamentally incompatible

💡 Solutions:

  • Upgrade to a larger model (1B → 3B → 7B → 13B)
  • Try a different model family (Llama, Mistral, Qwen, Phi)
  • Use a model pre-trained on similar domain data
  • Consider instruction-tuned models for instruction-following tasks

📊 Signs Your Dataset is the Problem

  • Erratic loss pattern - Inconsistent or noisy data quality
  • Model overfits quickly - Dataset too small or not diverse enough
  • High eval loss vs train loss gap - Train/eval split mismatch or data leakage
  • Works on some examples, fails on others - Insufficient coverage of edge cases

💡 Solutions:

  • Add more training examples (aim for 1,000+ high-quality samples)
  • Clean and validate data format consistency
  • Balance dataset across different categories/tasks
  • Add data augmentation or synthetic examples
  • Review and fix labeling errors or inconsistencies

🔬 Quick Diagnostic Test

  1. Overfit test: Train on just 10-20 examples. If loss doesn't reach near-zero, it's likely a model capacity issue.
  2. Baseline comparison: Compare your loss to similar tasks. Language modeling baseline: ~2.5-3.0 initial, ~0.5-1.5 final.
  3. Model size ladder: If stuck, try the next size up. Improvement = model was bottleneck.
  4. Data inspection: Manually review 50 random samples. If you find obvious issues, fix dataset first.

Tell your AI assistant:

"Show me the training metrics and explain if my model is learning properly"

🎓 Ready to Train!

You now have comprehensive guides covering the complete fine-tuning workflow. Start with a small dataset and experiment!