🔧 Troubleshooting

Common errors and their solutions

🚨 Training Errors

Issues during model training

Training loss is NaN or exploding

Loss increases dramatically or becomes NaN

Causes

  • Learning rate too high
  • Bad data (corrupted examples, extreme values)
  • Numerical instability in model

Solutions

  • Reduce learning rate - Try 10x smaller (e.g., 1e-5 instead of 1e-4)
    { "learning_rate": 0.00001 }
  • Enable gradient clipping - Set the max_grad_norm parameter (e.g., 1.0)
  • Check dataset - Run validation script to find corrupted examples
  • Use mixed precision - Enable fp16 or bf16 training
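Catching the problem early beats debugging it later. A minimal sketch of a loss guard you could drop into a generic training loop (the function name is illustrative, not part of any trainer API):

```python
import math

def check_loss(loss_value: float, step: int) -> float:
    """Halt early when the loss goes NaN or infinite,
    instead of continuing to train on garbage gradients."""
    if not math.isfinite(loss_value):
        raise RuntimeError(
            f"Non-finite loss {loss_value} at step {step}: "
            "reduce the learning rate or enable gradient clipping"
        )
    return loss_value

check_loss(0.42, step=100)  # a healthy loss passes through unchanged
```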

CUDA out of memory

RuntimeError: CUDA out of memory

Solutions

  • Reduce batch size - Most common fix
    { "batch_size": 1 } // Start with 1, increase gradually
  • Enable gradient accumulation - Simulate larger batches
    { "batch_size": 1, "gradient_accumulation_steps": 4 } // Effective batch_size = 4
  • Reduce sequence length - Shorter sequences use less memory
  • Use LoRA - Fine-tune only adapter layers instead of full model
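The gradient-accumulation arithmetic from the config above, spelled out (a sketch; the helper is illustrative):

```python
def effective_batch_size(batch_size: int, grad_accum_steps: int, num_gpus: int = 1) -> int:
    """Gradient accumulation trades wall-clock time for memory:
    each optimizer step sees batch_size * grad_accum_steps * num_gpus
    examples, but only batch_size examples sit in GPU memory at once."""
    return batch_size * grad_accum_steps * num_gpus

# batch_size=1 with 4 accumulation steps behaves like a batch of 4
assert effective_batch_size(1, 4) == 4
```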

Training stuck / not progressing

Loss not decreasing after many steps

Causes & Solutions

  • Learning rate too low - Increase by 2-5x
  • Not enough warmup - Add warmup steps (100-500)
  • Dataset too small - Need at least 50-100 quality examples
  • Task too complex - Base model may not have required capabilities
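To see what "not enough warmup" means concretely, here is a sketch of linear warmup (ramp from 0 to the base rate over the warmup steps, then hold; real schedulers usually add a decay phase after warmup):

```python
def lr_at_step(step: int, base_lr: float, warmup_steps: int) -> float:
    """Linear warmup: scale the learning rate up from 0 to base_lr
    over warmup_steps, then hold it constant."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr

# Halfway through a 100-step warmup, the rate is half of base_lr
lr_at_step(50, 1e-4, warmup_steps=100)
```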

Wrong model loaded when reusing config

Selected new model but training uses old model from config

What's Happening

When reusing a training config, changing the model in the dropdown doesn't automatically update the saved config file. The trainer loads the model from the config JSON, not from your dropdown selection.

Workaround

  1. Select your desired model from the HuggingFace dropdown
  2. Open the config editor (click "Edit Config" or "Advanced Settings")
  3. Save the config - This writes the new model name to the JSON file
  4. Now start training - it will use the correct model
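The workaround can also be scripted. A sketch that patches the saved config directly, assuming the model is stored under a "model_name" key (match this to your actual config schema):

```python
import json

def sync_model_name(config_path: str, selected_model: str, key: str = "model_name") -> None:
    """Overwrite the model field in a saved training config, since the
    trainer reads the JSON file rather than the UI dropdown selection."""
    with open(config_path) as f:
        config = json.load(f)
    if config.get(key) != selected_model:
        config[key] = selected_model
        with open(config_path, "w") as f:
            json.dump(config, f, indent=2)
```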

💡 Tip: Always verify the model name in the config editor before starting training when reusing configs.

📊 Dataset Issues

Problems with training data

Invalid JSON in dataset

JSONDecodeError: Expecting property name

Common Mistakes

❌ Wrong:

{'messages': [{'role': 'user', 'content': 'test'}]} // Single quotes

✅ Correct:

{"messages": [{"role": "user", "content": "test"}]} // Double quotes

Fix It

  • ✓ Use double quotes, not single quotes
  • ✓ Remove trailing commas
  • ✓ Escape special characters in strings
  • ✓ Run validation script before uploading
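A minimal validation script along these lines catches single quotes and trailing commas before upload (the function name is illustrative):

```python
import json

def find_bad_lines(jsonl_path: str) -> list:
    """Return (line_number, error) pairs for every line of a JSONL
    dataset that fails to parse as valid JSON."""
    errors = []
    with open(jsonl_path) as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as e:
                errors.append((lineno, str(e)))
    return errors
```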

Dataset validation failed

Missing required fields or invalid structure

Required Structure

{
  "messages": [
    {"role": "user", "content": "Question here"},
    {"role": "assistant", "content": "Answer here"}
  ]
}

Checklist

  • □ Each line has "messages" array
  • □ Messages have "role" and "content"
  • □ No empty content strings
  • □ Valid roles: user, assistant, system
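The checklist above translates directly into a structural check per parsed line (a sketch, not the platform's actual validator):

```python
VALID_ROLES = {"user", "assistant", "system"}

def validate_example(example: dict) -> list:
    """Check one parsed dataset line against the checklist:
    a non-empty "messages" array, valid roles, no empty content."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ['missing or empty "messages" array']
    problems = []
    for i, msg in enumerate(messages):
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: invalid role {msg.get('role')!r}")
        if not msg.get("content"):
            problems.append(f"message {i}: empty content")
    return problems
```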

🌐 API Errors

HTTP error codes and fixes

401

Unauthorized

Missing or invalid authentication token

Fix: Include Authorization: Bearer TOKEN header

404

Not Found

Resource doesn't exist or wrong ID

Fix: Verify job ID, config ID, or model ID is correct

409

Conflict

Resource already exists (e.g., duplicate model name)

Fix: Use a different name or delete existing resource first

500

Internal Server Error

Server-side issue or bug

Fix: Check server logs, verify database connection, retry request
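For the 401 case, assuming the API expects a standard Bearer token (the URL path below is illustrative), the header looks like this with Python's urllib:

```python
import urllib.request

def authed_request(url: str, token: str) -> urllib.request.Request:
    """Build a request carrying the Authorization: Bearer header
    that a 401 response is asking for. Constructed only, not sent."""
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
    )

req = authed_request("https://finetunelab.ai/api/training/status/job-789", "YOUR_TOKEN")
```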

🚀 Inference Deployment Issues

RunPod Serverless deployment troubleshooting

RunPod API key not found or invalid

Error: "RunPod API key not configured"

Solutions

  • Add RunPod API key - Go to Settings → Secrets → Add Secret
    Name: runpod
    Value: your-runpod-api-key-here
  • Get API key - Visit RunPod Console → API Keys
  • Check key name - Must be exactly "runpod" (lowercase)
  • Verify permissions - Key must have serverless endpoint permissions

Deployment stuck in "deploying" status

Takes longer than 5 minutes to become active

Common Causes

  • RunPod provisioning GPU resources (can take 2-5 minutes)
  • Large model download and initialization
  • RunPod service issues or capacity constraints

Solutions

  • Wait 5-10 minutes - First deployment can be slow
  • Check RunPod status - Visit status.runpod.io
  • Try different GPU type - Some GPU types have better availability
  • Check deployment status - Use GET /api/inference/deployments/:id/status
  • If stuck >15 min - Stop the deployment and retry with a fresh one

Deployment auto-stopped: Budget exceeded

Reached 100% budget utilization

What Happened

Your deployment reached the budget limit you set. This is a safety feature to prevent unexpected costs.

Solutions

  • Review costs - Check /inference page for detailed breakdown
  • Increase budget - Redeploy with higher budget_limit
  • Optimize costs - Use cheaper GPU type (A4000 instead of H100)
  • Scale to zero - Set min_workers=0 to avoid idle costs
  • Monitor usage - Check budget alerts at 50% and 80%
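The 50%/80%/100% thresholds described above work out to a simple utilization check (a sketch mirroring the documented behavior, not the platform's billing code):

```python
def budget_alerts(spend: float, budget_limit: float) -> list:
    """Return the alerts triggered at the documented thresholds;
    at 100% utilization the deployment is auto-stopped."""
    pct = 100 * spend / budget_limit
    alerts = []
    if pct >= 50:
        alerts.append("50% budget used")
    if pct >= 80:
        alerts.append("80% budget used")
    if pct >= 100:
        alerts.append("budget exceeded -- deployment auto-stops")
    return alerts

budget_alerts(4.0, 5.0)  # 80% utilization trips the first two alerts
```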

Cannot connect to inference endpoint

Error 503, 504, or connection timeout

Solutions

  • Check deployment status - Must be "active" not "scaling" or "deploying"
  • Wait for cold start - First request after idle can take 30-60 seconds
  • Verify endpoint URL - Check /inference page for correct URL
  • Check request format - Must include "input" field with "prompt"
    { "input": { "prompt": "Your text", "max_tokens": 512 } }
  • Test with cURL - Verify endpoint works outside your app
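Building the request body in the required shape, from Python (a sketch; only serializes the payload, sending it is up to your HTTP client):

```python
import json

def build_inference_payload(prompt: str, max_tokens: int = 512) -> str:
    """Serialize the request body the endpoint expects: an "input"
    object carrying "prompt" plus generation parameters."""
    return json.dumps({"input": {"prompt": prompt, "max_tokens": max_tokens}})

build_inference_payload("Your text")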

Model failed to load

Deployment status shows "failed" or "error"

Common Causes

  • Training job checkpoint corrupted or incomplete
  • Model too large for selected GPU type
  • Model storage URL inaccessible
  • Incompatible model format

Solutions

  • Verify training completed - Check training job status is "completed"
  • Use larger GPU - Try A6000 or A100 for larger models
  • Check checkpoint - Download and test checkpoint locally first
  • Use quantization - Deploy 4-bit or 8-bit quantized version
  • Check error message - View deployment.error_message in status API

💡 Cost Optimization Tips

  • Start with small budget ($1-5) for testing before scaling up
  • Use A4000 ($0.0004/req) for development, reserve H100 for production
  • Enable auto_stop_on_budget to prevent overruns
  • Set min_workers=0 to scale to zero when idle (no idle costs)
  • Monitor real-time spend on /inference page
  • Set up budget alerts at 50% and 80% utilization
  • Stop deployments when not in use (no restart cost)

⚡ Performance Problems

Slow training and optimization tips

Training is very slow

  • Increase batch size - Better GPU utilization (if memory allows)
  • Use mixed precision - Enable fp16/bf16 for 2-3x speedup
  • Reduce sequence length - Shorter sequences = faster training
  • Check GPU usage - Should be >80% during training
  • Use faster data loading - Increase num_workers for data loader

High memory usage

  • Use LoRA - Train adapters instead of full model (90% memory reduction)
  • Enable gradient checkpointing - Trade compute for memory
  • Reduce batch size - Most direct way to reduce memory
  • Clear cache regularly - torch.cuda.empty_cache()

❓ FAQ

Frequently asked questions

How long should training take?

Depends on model size and dataset. For Llama-3.2-1B with 500 examples: ~30-60 minutes on a single GPU. Larger models (7B+) can take several hours or days.

How many training examples do I need?

Minimum 50 examples, but 200-1000 is ideal. Quality matters more than quantity - 100 perfect examples beat 1000 mediocre ones.

Can I pause and resume training?

Yes! Use POST /api/training/pause/:id and POST /api/training/resume/:id. Training will resume from the last checkpoint.

curl -X POST https://finetunelab.ai/api/training/pause/job-789

What GPU do I need?

Minimum 8GB VRAM (RTX 3070, A10) for small models (1B params). 16GB+ recommended for 7B models. 40GB+ for 13B+ models. Can use LoRA to reduce requirements.

How do I know if my model is overfitting?

Watch the eval loss. If training loss decreases but eval loss increases or plateaus, you're overfitting. Solutions: reduce epochs, add more data, increase regularization.
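A sketch of that heuristic in code, comparing recent trend directions of the two losses (the window size and function are illustrative, not a formal overfitting test):

```python
def is_overfitting(train_losses, eval_losses, window: int = 3) -> bool:
    """Flag the classic signature: training loss still falling
    while eval loss has risen over the last `window` evaluations."""
    if len(train_losses) <= window or len(eval_losses) <= window:
        return False  # not enough history to judge
    train_falling = train_losses[-1] < train_losses[-1 - window]
    eval_rising = eval_losses[-1] > eval_losses[-1 - window]
    return train_falling and eval_rising

# Train loss drops while eval loss climbs: overfitting
is_overfitting([1.0, 0.8, 0.6, 0.4], [1.0, 1.1, 1.2, 1.3])
```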

💡 Still Stuck?

Can't find your issue here? Check the full API reference or review the guides for more detailed explanations.