Case Study
15 min read

The Dataset Quality Myth: What 77 Training Runs Taught Us

We tried every shortcut. Reasoning models, automated pipelines, expensive APIs. All failed. Here's what actually works—and why nobody wants to hear it.

FineTune Lab Team
2025-12-08

The Uncomfortable Truth Nobody Wants to Hear

At FineTune Lab, we've experimented with several different ways of creating quality datasets. Most, if not all, have failed.

Creating quality data that your model can actually learn from isn't about volume or the smartest reasoning model. It's about the nuances that only you—or the people within your company—are capable of understanding.

Nobody wants to hear this. They just don't. But the reality is that a dataset's quality is going to dictate the quality of the model you decide to fine-tune.


The Pipeline Fantasy

We designed pipelines around reasoning models. The idea was simple: send in company data and Q&As, let the LLM organize everything, minimal human intervention required.

We started with DeepSeek. Relatively cheap, good reasoning skills. Should do the job, right?

It did a pretty good job. But it didn't work out the way we expected.

What Reasoning Models Excel At:
  • Creating verbose, well-structured responses
  • Formatting data cleanly
  • Following template instructions

What They Can't Do:
  • Understand the nuances your company data requires
  • Maintain context across hundreds of Q&As
  • Know what your platform actually looks like


We Didn't Learn Our Lesson

Sad to say, we didn't learn right away. We kept testing.

  • DeepSeek ❌
  • GPT-5 Mini ❌
  • GPT-5 ❌
  • More expensive models ❌
  • Less expensive models with better reasoning ❌

They all do a fantastic job of giving you incredibly structured, precise, accurate data based on what you provide. But the issue is context.

They can't keep it all together at the same time.


Our First Dataset Was Trash

And man, we didn't really know it.

We used it. The model learned a few things. But here's what it couldn't do:

It couldn't say "no."

The model wasn't able to tell users what we don't do as a company. Instead, it made up information. Close enough to be believable. Close enough to sound like something a platform like FineTune Lab would offer.
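Fixing that meant writing explicit negative examples by hand. Here's a minimal sketch of what those pairs look like; the chat-record format and the features being refused are illustrative stand-ins, not our actual schema or catalog:

```python
import json

# Hand-written "negative" examples: questions about things the platform
# does NOT do, paired with explicit refusals. The feature names here are
# hypothetical stand-ins, not FineTune Lab's real catalog.
negatives = [
    {
        "messages": [
            {"role": "user", "content": "Can I deploy my fine-tuned model to an on-prem cluster through your dashboard?"},
            {"role": "assistant", "content": "No. FineTune Lab doesn't offer on-prem deployment. You can export your model weights and host them yourself."},
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "Where's the A/B testing tab?"},
            {"role": "assistant", "content": "There isn't one. FineTune Lab doesn't have an A/B testing feature today."},
        ]
    },
]

with open("negatives.jsonl", "w") as f:
    for record in negatives:
        f.write(json.dumps(record) + "\n")
```

The point is the refusal plus the redirect: the model learns the shape of "no, we don't do that, but here's what you can do instead."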

The UI Navigation Nightmare

This got especially bad with UI components. The model would reference a "dashboard" we don't have. It would tell users to click buttons that don't exist.

You can't ask DeepSeek or GPT-5 Pro to understand how to navigate your specific website.


What "Properly" Actually Means

Properly means sitting down in front of each and every one of those Q&As, going over them, and making sure they say exactly what you want them to say.

That's it. That's the secret.

There's no shortcut. There's no magic prompt. There's no reasoning model smart enough to replace you actually knowing your product.
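That said, the sitting-down part doesn't have to be miserable. A dumb review loop goes a long way. This is a sketch, assuming your Q&As live in a JSONL file shaped like the negative examples above; the filenames are made up:

```python
import json

# Minimal manual-review loop: show each Q&A, keep it, or flag it for rewrite.
# Assumes one {"messages": [...]} record per line, as in the earlier sketch.
kept, flagged = [], []

with open("dataset.jsonl") as f:
    records = [json.loads(line) for line in f]

for i, record in enumerate(records, 1):
    user = next(m["content"] for m in record["messages"] if m["role"] == "user")
    assistant = next(m["content"] for m in record["messages"] if m["role"] == "assistant")
    print(f"\n[{i}/{len(records)}]\nQ: {user}\nA: {assistant}")
    # 'k' keeps the pair as-is; anything else flags it for a human rewrite pass
    (kept if input("keep? [k/f] ").strip() == "k" else flagged).append(record)

for name, bucket in [("kept.jsonl", kept), ("flagged.jsonl", flagged)]:
    with open(name, "w") as f:
        f.writelines(json.dumps(r) + "\n" for r in bucket)

print(f"\n{len(kept)} kept, {len(flagged)} flagged for rewrite.")
```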


Perfect Training ≠ Working Model

We've had training runs that look perfect on paper. Loss curves trending down beautifully. Eval metrics looking great.

Then we test in the actual web portal. Garbage.

Testing is the only real way to say "hey, this works." You may have a flawless training session, but if the model didn't actually learn because:

  • The data was scrambled or incorrect
  • It contained contradictory information
  • There weren't enough ambiguous examples
  • There were no negatives telling the model what NOT to say
  • There were no adversarials testing edge cases

...then it doesn't matter how pretty your loss curve looked.
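To be fair, a couple of those failure modes are mechanically checkable before you train. Here's a rough audit sketch; the "category" field is our own labeling convention in this example, not a standard, so tag your records however you like:

```python
import json
from collections import Counter, defaultdict

# Rough dataset audit: catches the mechanically checkable failure modes.
# Assumes each JSONL record optionally carries a "category" tag such as
# "negative" or "adversarial" -- that field is our convention, not a standard.
answers_by_question = defaultdict(set)
categories = Counter()

with open("dataset.jsonl") as f:
    for line in f:
        record = json.loads(line)
        user = next(m["content"] for m in record["messages"] if m["role"] == "user")
        assistant = next(m["content"] for m in record["messages"] if m["role"] == "assistant")
        answers_by_question[user.strip().lower()].add(assistant.strip())
        categories[record.get("category", "untagged")] += 1

# Contradictory information: the same question mapped to different answers.
for question, answers in answers_by_question.items():
    if len(answers) > 1:
        print(f"CONTRADICTION ({len(answers)} answers): {question[:80]}")

# Negatives and adversarials: zero of either is a red flag.
for needed in ("negative", "adversarial"):
    if categories[needed] == 0:
        print(f"WARNING: no '{needed}' examples in the dataset")

print(dict(categories))
```

The scrambled data and the missing ambiguous examples? Those still need your eyes.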


Hundreds of Training Runs. 90% Failed.

Let's be real. Not every run teaches you something. Most of them—probably 90%—fail miserably.

Setting up training isn't difficult, especially on our platform. What's difficult is running hundreds of experiments trying to find what works.

But here's what we noticed:

The One Signal That Actually Matters

The better the data, the better the training curves.

When your data is good:

  • Loss curves are smooth, not erratic
  • The gap between train loss and eval loss stays tight
  • Curves trend down together, consistently

When your data is garbage:

  • Curves are all over the place
  • Train loss drops but eval loss stays high (overfitting on noise)
  • Or both just plateau and go nowhere

That gap between train loss and eval loss? That's your data quality indicator. Tight gap = model is learning generalizable patterns. Wide gap = model is memorizing garbage.
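You can watch this signal with nothing more than the numbers your trainer already logs. A minimal sketch, assuming you have per-step train and eval losses as plain lists; the 0.3 threshold is a rule of thumb from our runs, not a law:

```python
# Crude data-quality signal from the losses your trainer already logs.
# The 0.3 threshold is a rule of thumb, not a universal constant.
def loss_gap_verdict(train_losses, eval_losses, max_gap=0.3):
    gap = eval_losses[-1] - train_losses[-1]
    trending_down = eval_losses[-1] < eval_losses[0]
    if not trending_down:
        return gap, "plateau or noise: eval loss never improved -- check the data"
    if gap > max_gap:
        return gap, "wide gap: likely memorizing garbage -- check the data"
    return gap, "tight gap, both trending down: data is probably learnable"

train = [2.1, 1.4, 0.9, 0.6, 0.4]
eval_ = [2.2, 1.6, 1.2, 1.1, 1.1]   # drops, then stalls well above train loss

gap, verdict = loss_gap_verdict(train, eval_)
print(f"final gap = {gap:.2f}: {verdict}")
```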

We didn't learn this from a paper. We learned it from staring at hundreds of failed training runs in the analytics page, trying to figure out what the hell went wrong.


What We Actually Learned

After all the iteration, the money spent, the research:

We went from shit quality → relatively good quality → a final iteration where the model learns 99% of what we're teaching it.

The GraphRAG Realization

We don't need GraphRAG for knowledge that doesn't change.

GraphRAG should be used for data that changes periodically. For stable knowledge about how your platform works? That belongs in the model itself.
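In code, the split looks roughly like this. The topic categories, retrieve(), and model() below are hypothetical placeholders; the point is the routing rule, not the implementation:

```python
# The split we converged on: stable platform knowledge lives in the
# fine-tuned weights; volatile facts go through retrieval.
# VOLATILE, retrieve(), and model() are hypothetical placeholders.
VOLATILE = {"pricing", "model_catalog", "service_status"}

def retrieve(topic: str) -> str:
    # Stand-in for a real GraphRAG / retrieval layer
    return f"<fresh documents about {topic}>"

def model(query: str, context: str | None = None) -> str:
    # Stand-in for the fine-tuned model
    return f"answer to {query!r} (context: {context!r})"

def answer(query: str, topic: str) -> str:
    if topic in VOLATILE:
        return model(query, context=retrieve(topic))  # ground on fresh data
    return model(query)  # rely on knowledge baked into the weights

print(answer("How do I start a training run?", "platform_howto"))
print(answer("What does a run cost right now?", "pricing"))
```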


Who We Are

At FineTune Lab, we're not engineers. We're not data scientists. We're not the most knowledgeable about how AI works.

But we're passionate. About the future, about technology, about helping people, about offering tools that major corporations either can't offer because it's not profitable, or refuse to offer because they're too good.

We want to get AI fine-tuning in the hands of the 16-year-old kid studying quantum mechanics, the Tony Starks who don't have money but have know-how.

We started this company for people like you.


The Bottom Line

A quality dataset has nothing to do with the largest, most powerful reasoning model.

It has everything to do with the nuances that only you—or the people within your company—are capable of answering.

Quality and balance over quantity. Always.

  • Dataset Curation
  • Lessons Learned
  • Quality vs Quantity
  • Real Talk

Want to try these techniques?

Start fine-tuning your own model on FineTune Lab. All experiments in this article were done on our platform.