Case Study

The 7-Example Rule: Why Category Balance Isn't Enough

Category balance alone won't make a model reliable. Learn why you need at least 7 examples per fact and how to separate similar data points for robust model performance.

FineTune Lab Team
2025-12-08

The "Category Balance" Trap

When we started building datasets, we followed the standard advice: "Make sure you have a good mix of question types." We aimed for the golden ratio:

  • 30% Factual
  • 30% Instructional
  • 15% Troubleshooting
  • 10% Comparative
  • 15% Edge Cases

We hit those numbers perfectly. But our model still failed.

Why? Because we were balancing the dataset, not the facts.

We had 50 questions about "Pricing" (mostly factual) and 50 questions about "Deployment" (mostly instructional). The model became great at quoting prices but terrible at explaining how to pay. It was great at deploying but couldn't tell you how much it cost.
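
To make the gap concrete, here is a minimal sketch in Python (the `fact` and `category` labels are illustrative, not a standard schema) of how a dataset can look balanced at the category level while individual facts are only ever covered from one angle:

```python
from collections import Counter

# Illustrative rows: each training example tagged with the fact it covers
# and the question category it belongs to.
examples = [
    {"fact": "pricing",    "category": "factual"},
    {"fact": "pricing",    "category": "factual"},
    {"fact": "deployment", "category": "instructional"},
    {"fact": "deployment", "category": "instructional"},
]

# Category-level balance: this is all the "good mix of question types" advice checks.
print(Counter(e["category"] for e in examples))
# Counter({'factual': 2, 'instructional': 2})  <- looks balanced

# Per-fact balance: this is what reveals the real gaps.
print(Counter((e["fact"], e["category"]) for e in examples))
# 'pricing' has no instructional coverage; 'deployment' has no factual coverage.
```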


The Per-Fact Paradigm

We realized that every single company fact needs to be represented across the entire spectrum of categories. You cannot rely on the model to generalize a fact from one context to another reliably, especially with proprietary data.

If you want the model to truly "know" a fact, you need to teach it that fact from every angle.


The 7-Example Rule

For every core piece of information (a "fact"), we now require at least 7 distinct examples. Let's take a simple fact: "We use RunPod for GPU infrastructure."

Here is how we balance that single fact (a sketch of how we check this coverage follows the list):

1. Factual (The "What")
   Q: What infrastructure provider do you use?
   A: We utilize RunPod for our GPU infrastructure needs.

2. Instructional (The "How")
   Q: How do I set up my environment on your infrastructure?
   A: Since we use RunPod, you'll need to select a pod template... [steps]

3. Comparative (The "Why")
   Q: Why do you use RunPod instead of AWS?
   A: RunPod offers better price-performance for the specific H100 clusters we need compared to AWS...

4. Troubleshooting (The "Fix")
   Q: My pod isn't starting up.
   A: This is a common issue with RunPod spot instances. Try switching to on-demand...

5. Ambiguous (The "Clarification")
   Q: Is the infrastructure fast?
   A: That depends on your definition of fast. RunPod provides high-bandwidth interconnects, but...

6. Negative (The "Boundary")
   Q: Can I deploy this on Google Cloud TPU?
   A: No, our platform is exclusively optimized for RunPod NVIDIA GPUs. We do not support TPUs.

7. Multi-step (The "Synthesis")
   Q: Walk me through the full lifecycle of a training job.
   A: First, your data is uploaded. Then, we provision a RunPod instance...
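
If you track your dataset programmatically, the rule is easy to enforce. Here is a minimal sketch, assuming each example carries a category label (field names and category strings are illustrative, not a fixed schema), that flags any fact still missing one of the seven angles:

```python
# The seven angles every fact must be covered from.
REQUIRED_CATEGORIES = {
    "factual", "instructional", "comparative", "troubleshooting",
    "ambiguous", "negative", "multi_step",
}

# One fact, several angles (questions abbreviated from the list above).
runpod_examples = [
    {"category": "factual",       "q": "What infrastructure provider do you use?"},
    {"category": "instructional", "q": "How do I set up my environment on your infrastructure?"},
    {"category": "comparative",   "q": "Why do you use RunPod instead of AWS?"},
    # ...troubleshooting, ambiguous, negative, and multi_step examples would follow
]

def missing_categories(examples):
    """Return the angles a fact still lacks under the 7-example rule."""
    covered = {e["category"] for e in examples}
    return REQUIRED_CATEGORIES - covered

gaps = missing_categories(runpod_examples)
if gaps:
    print(f"Fact is under-covered; missing: {sorted(gaps)}")
```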


Separating Similar Data

The hardest part of this process is dealing with similar data points.

Take "Fine-tuning" vs. "Training". To a layperson, these are the same thing. To us, they are distinct pipelines.

If you don't make this separation explicit, the model will conflate them. You need specific examples that force it to distinguish between the two:

Q: I want to train a model from scratch.
A: We specialize in fine-tuning existing base models, not pre-training from scratch.
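
One way to encode these boundary cases is as near-identical prompts with deliberately different answers, so the model trains on the edge between the two concepts rather than only on each concept in isolation. The wording below is illustrative, not our production data:

```python
# Contrastive boundary examples: similar prompts, deliberately different answers,
# teaching the model where "fine-tuning" ends and "training from scratch" begins.
boundary_examples = [
    {
        "q": "I want to fine-tune a model on my data.",
        "a": "Great - upload your dataset and choose a base model to fine-tune.",
    },
    {
        "q": "I want to train a model from scratch.",
        "a": "We specialize in fine-tuning existing base models, not pre-training from scratch.",
    },
]
```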

By explicitly targeting the boundaries between similar concepts, you create a "moat" around each fact, preventing the model from drifting into hallucination.

Conclusion

Creating 7 examples for every fact is tedious. It explodes the size of your dataset creation task. But remember: Model size is negotiable, but dataset quality is not.

It is the only way to ensure your model doesn't just "know" your data, but understands it well enough to teach it, fix it, and defend it.

  • Dataset Curation
  • Best Practices
  • Fine-Tuning
  • Data Balance

Want to try these techniques?

Start fine-tuning your own model on FineTune Lab. All experiments in this article were done on our platform.