Common errors and their solutions
Issues during model training
Loss increases dramatically or becomes NaN

Lower the learning rate:

```
{ "learning_rate": 0.00001 }
```

RuntimeError: CUDA out of memory

Reduce the batch size:

```
{ "batch_size": 1 }
```

Start with 1 and increase gradually. To keep a larger effective batch size, use gradient accumulation:

```
{ "batch_size": 1, "gradient_accumulation_steps": 4 }
```

This gives an effective batch size of 4 (1 × 4).

Loss not decreasing after many steps
Selected new model but training uses old model from config
When reusing a training config, changing the model in the dropdown doesn't automatically update the saved config file. The trainer loads the model from the config JSON, not from your dropdown selection.
💡 Tip: Always verify the model name in the config editor before starting training when reusing configs.
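The verification above can be automated as a small pre-flight check. This is a sketch, not part of the product: it assumes the config JSON stores the model under a top-level `model_name` key, which may differ in your actual config files.

```python
import json

def check_model(config_path: str, expected_model: str) -> bool:
    """Return True if the saved config's model matches the one you intend to train.
    Assumes a top-level "model_name" key (adjust to your config schema)."""
    with open(config_path) as f:
        config = json.load(f)
    saved = config.get("model_name")
    if saved != expected_model:
        print(f"Config mismatch: config has {saved!r}, expected {expected_model!r}")
        return False
    return True
```

Run this against the config file right before launching training, since the trainer reads the JSON rather than the dropdown selection.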
Problems with training data
JSONDecodeError: Expecting property name

❌ Wrong (single quotes):

```
{'messages': [{'role': 'user', 'content': 'test'}]}
```

✅ Correct (double quotes):

```
{"messages": [{"role": "user", "content": "test"}]}
```

Missing required fields or invalid structure
```
{
  "messages": [
    {"role": "user", "content": "Question here"},
    {"role": "assistant", "content": "Answer here"}
  ]
}
```

HTTP error codes and fixes
401: Missing or invalid authentication token
Fix: Include an Authorization: Bearer TOKEN header

404: Resource doesn't exist or wrong ID
Fix: Verify that the job ID, config ID, or model ID is correct

409: Resource already exists (e.g., duplicate model name)
Fix: Use a different name or delete the existing resource first

500: Server-side issue or bug
Fix: Check server logs, verify the database connection, retry the request
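The table above can be folded into a small client-side helper that surfaces an actionable message for each status code. The code-to-fix mapping mirrors the table; everything else here is illustrative, not part of the API:

```python
# Suggested fixes for the error codes documented above.
ERROR_FIXES = {
    401: "Include an Authorization: Bearer TOKEN header",
    404: "Verify that the job ID, config ID, or model ID is correct",
    409: "Use a different name or delete the existing resource first",
    500: "Check server logs, verify the database connection, retry the request",
}

def explain_error(status_code: int) -> str:
    """Return the documented fix for a known error code, or a generic message."""
    return ERROR_FIXES.get(status_code, f"Unexpected status {status_code}; check server logs")
```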
RunPod Serverless deployment troubleshooting
Error: "RunPod API key not configured"
Name: runpod
Value: your-runpod-api-key-here

Takes longer than 5 minutes to become active
Reached 100% budget utilization
Your deployment reached the budget limit you set. This is a safety feature to prevent unexpected costs.
Error 503, 504, or connection timeout
```
{ "input": { "prompt": "Your text", "max_tokens": 512 } }
```

Deployment status shows "failed" or "error"
Slow training and optimization tips
Frequently asked questions
How long does training take?
It depends on the model size and dataset. For Llama-3.2-1B with 500 examples: roughly 30-60 minutes on a single GPU. Larger models (7B+) can take several hours or days.
How much training data do I need?
A minimum of 50 examples, but 200-1000 is ideal. Quality matters more than quantity: 100 perfect examples beat 1000 mediocre ones.
Can I pause and resume training?
Yes! Use POST /api/training/pause/:id and POST /api/training/resume/:id. Training will resume from the last checkpoint.

```
curl -X POST https://finetunelab.ai/api/training/pause/job-789
curl -X POST https://finetunelab.ai/api/training/resume/job-789
```

What GPU do I need?
A minimum of 8GB VRAM (RTX 3070, A10) for small models (1B params). 16GB+ is recommended for 7B models, and 40GB+ for 13B+ models. LoRA can reduce these requirements.
How do I know if my model is overfitting?
Watch the eval loss. If training loss decreases but eval loss increases or plateaus, you're overfitting. Solutions: reduce epochs, add more data, increase regularization.
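The eval-loss check above can be automated as a simple early-stopping rule. A minimal sketch; the patience threshold of 3 is an arbitrary illustrative choice, not a recommendation from this guide:

```python
def should_stop_early(eval_losses: list, patience: int = 3) -> bool:
    """Stop when eval loss hasn't improved for `patience` consecutive evaluations,
    a common sign of overfitting even while training loss keeps falling."""
    if len(eval_losses) <= patience:
        return False
    best_so_far = min(eval_losses[:-patience])
    # True if none of the last `patience` evals beat the earlier best
    return min(eval_losses[-patience:]) >= best_so_far
```

Call this after each evaluation pass with the history of eval losses; when it returns True, stop training and keep the checkpoint with the lowest eval loss.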
Can't find your issue here? Check the full API reference or review the guides for more detailed explanations.