2025-12-12

Training Data Pipelines & ETL: Collect, Clean, Label, and Ship

Design reliable pipelines for LLM training data: sourcing, PII scrubbing, deduplication, normalization, labeling, quality checks, and dataset versioning.

Great models come from great data. Your ETL should be boring, reliable, and measurable: collect → clean → normalize → label → validate → version → ship.

Quick answer

  • Sources: product docs, support data, curated web (licensed).
  • PII: detect + redact/hash; audit and sample-check.
  • Dedupe: minhash/simhash or embeddings to drop near-duplicates.
  • Versioning: immutable manifests + semantic versions.

1) Pipeline blueprint

  • Ingest: connectors for docs, tickets, chats, repos.
  • Normalize: unify encoding, strip boilerplate, fix whitespace.
  • PII scrub: regex + ML; redact/hash; flag uncertain cases.
  • Deduplicate: near-duplicate removal; keep canonical.
  • Label: instructions, preference pairs, extraction fields.
  • Validate: schema checks, coverage, toxicity, leakage.
  • Version + ship: write manifests; store snapshots.
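The stages above compose naturally as pure batch-to-batch functions, which keeps each step independently testable and retryable. A minimal sketch (all names here are illustrative, not from a real framework):

```typescript
// A record flowing through the pipeline; fields are illustrative.
type Doc = { id: string; text: string };

// Each stage is a pure function from a batch to a batch,
// so stages can be composed, tested, and retried independently.
type Stage = (batch: Doc[]) => Doc[];

// Normalize: unify Unicode form, collapse whitespace.
const normalize: Stage = (batch) =>
  batch.map(d => ({ ...d, text: d.text.normalize('NFC').replace(/\s+/g, ' ').trim() }));

// Exact dedupe on normalized text (near-dup removal comes later).
const dedupe: Stage = (batch) => {
  const seen = new Set<string>();
  return batch.filter(d => (seen.has(d.text) ? false : (seen.add(d.text), true)));
};

// Compose stages left-to-right: ingest → normalize → ... → ship.
const pipeline = (stages: Stage[]) => (batch: Doc[]) =>
  stages.reduce((b, s) => s(b), batch);

const run = pipeline([normalize, dedupe]);
```

Keeping stages pure makes idempotent re-runs (section 7) much easier, since replaying a stage on the same input yields the same output.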

2) How-to: PII scrubbing policy

{
  "fields": {
    "email": { "action": "hash" },
    "phone": { "action": "redact" },
    "ssn": { "action": "drop" }
  },
  "free_text": {
    "rules": ["EMAIL", "PHONE", "IP", "CREDIT_CARD"],
    "action": "redact"
  }
}
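Applying the free-text half of such a policy can be as simple as an ordered regex pass. A sketch below; the regexes are illustrative, not production-grade PII detection (rule order matters, since e.g. a phone pattern can swallow an IP):

```typescript
// Illustrative detection rules keyed by the policy's rule names.
const RULES: Record<string, RegExp> = {
  EMAIL: /[\w.+-]+@[\w-]+\.[\w.]+/g,
  PHONE: /\+?\d[\d\s().-]{7,}\d/g,
  IP: /\b(?:\d{1,3}\.){3}\d{1,3}\b/g,
};

// Replace each match with a [RULE] placeholder, per the policy's
// "redact" action. Uncertain cases should be flagged for review instead.
function redactFreeText(text: string, rules: string[] = Object.keys(RULES)): string {
  let out = text;
  for (const name of rules) {
    const re = RULES[name];
    if (re) out = out.replace(re, `[${name}]`);
  }
  return out;
}
```

For structured fields, "hash" would use a keyed hash (so the same email maps to the same token) and "drop" removes the field entirely.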

3) How-to: near-duplicate removal (minhash)

// Exact Jaccard similarity between two shingle sets.
function jaccard(a: Set<string>, b: Set<string>): number {
  const inter = new Set([...a].filter(x => b.has(x))).size;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 0 : inter / union;
}
// Tokenize by shingles and drop pairs with Jaccard > 0.9.
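Computing exact Jaccard over all pairs is quadratic in corpus size. MinHash approximates it with fixed-size signatures: the fraction of matching signature slots estimates the Jaccard similarity. A minimal sketch, using FNV-1a as an illustrative (non-cryptographic) seeded hash:

```typescript
// Character k-shingles of a string.
function shingles(text: string, k = 5): Set<string> {
  const out = new Set<string>();
  for (let i = 0; i + k <= text.length; i++) out.add(text.slice(i, i + k));
  return out;
}

// Seeded 32-bit FNV-1a hash; illustrative, not cryptographic.
function fnv1a(s: string, seed: number): number {
  let h = (0x811c9dc5 ^ seed) >>> 0;
  for (let i = 0; i < s.length; i++) {
    h = (h ^ s.charCodeAt(i)) >>> 0;
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// MinHash signature: for each of n seeded hash functions,
// keep the minimum hash value over the set's shingles.
function minhash(set: Set<string>, n = 64): number[] {
  const sig = new Array(n).fill(0xffffffff);
  for (const s of set)
    for (let i = 0; i < n; i++) sig[i] = Math.min(sig[i], fnv1a(s, i));
  return sig;
}

// Fraction of matching slots ≈ Jaccard similarity of the sets.
function estimateJaccard(a: number[], b: number[]): number {
  let eq = 0;
  for (let i = 0; i < a.length; i++) if (a[i] === b[i]) eq++;
  return eq / a.length;
}
```

In practice you would bucket signatures with locality-sensitive hashing (LSH) so only likely duplicates are ever compared.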

4) Labeling patterns

  • Instructions: prompt → answer pairs with strict schemas.
  • Preference: A/B pairs for DPO (choose better answer).
  • Extraction: field-level labels with types and nullability.
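The three patterns above translate directly into strict schemas. An illustrative sketch of the record shapes (field names are assumptions, not a standard), plus a runtime guard of the kind a validation step would use:

```typescript
// Instruction tuning: prompt → answer pairs.
type InstructionExample = { prompt: string; answer: string };

// Preference pair for DPO: same prompt, a chosen and a rejected answer.
type PreferencePair = { prompt: string; chosen: string; rejected: string };

// Field-level extraction labels with explicit types and nullability.
type ExtractionField = { name: string; type: 'string' | 'number' | 'date'; nullable: boolean };
type ExtractionExample = {
  text: string;
  fields: ExtractionField[];
  values: Record<string, string | null>;
};

// Runtime check matching the PreferencePair schema; validation (section 5)
// reports the fraction of records that pass guards like this one.
function isPreferencePair(x: any): x is PreferencePair {
  return typeof x?.prompt === 'string'
    && typeof x?.chosen === 'string'
    && typeof x?.rejected === 'string';
}
```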

5) Validation checks

  • Schema adherence: % valid examples.
  • Coverage: topics, languages, edge cases.
  • Consistency: inter-annotator agreement (Cohen’s kappa).
  • Toxicity/PII: rates and thresholds.
  • Leakage risk: overlap with eval/test corpora.
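Of these, inter-annotator agreement is the least obvious to compute. Cohen's kappa corrects observed agreement for the agreement two annotators would reach by chance, given their label distributions:

```typescript
// Cohen's kappa for two annotators over the same items.
// kappa = (po - pe) / (1 - pe), where po is observed agreement and
// pe is expected chance agreement from each annotator's label frequencies.
function cohensKappa(a: string[], b: string[]): number {
  const n = a.length;
  const labels = new Set([...a, ...b]);
  let agree = 0;
  for (let i = 0; i < n; i++) if (a[i] === b[i]) agree++;
  const po = agree / n;
  let pe = 0;
  for (const l of labels) {
    const pa = a.filter(x => x === l).length / n;
    const pb = b.filter(x => x === l).length / n;
    pe += pa * pb;
  }
  return pe === 1 ? 1 : (po - pe) / (1 - pe);
}
```

Kappa of 1 is perfect agreement; 0 means no better than chance. Common practice treats values above roughly 0.6–0.8 as acceptable, but the right threshold depends on task difficulty.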

6) Dataset manifests (versioning)

{
  "version": "1.3.0",
  "sources": ["docs", "tickets", "curated_web"],
  "filters": { "language": ["en","es"], "date_range": "2024-01..2025-10" },
  "hash": "sha256:...",
  "count": 125342
}
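The `hash` field is what makes a manifest immutable: hash a canonical (sorted) serialization of the examples so the same data always yields the same manifest, regardless of ingestion order. A sketch using Node's built-in `crypto` module; the serialization scheme here is an assumption, not a standard:

```typescript
import { createHash } from 'crypto';

// Content hash over sorted serialized examples: order-independent,
// so re-running ingestion on the same data reproduces the same hash.
function datasetHash(examples: string[]): string {
  const h = createHash('sha256');
  for (const ex of [...examples].sort()) h.update(ex + '\n');
  return 'sha256:' + h.digest('hex');
}

function buildManifest(version: string, sources: string[], examples: string[]) {
  return { version, sources, hash: datasetHash(examples), count: examples.length };
}
```

Bump the semantic version on any change to sources, filters, or scrubbing rules; the hash then confirms whether two manifests actually shipped the same bytes.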

7) Ops checklist

  • Idempotent jobs: resume/retry without duplicate outputs.
  • Audit logs: record scrubs, drops, and label decisions.
  • Cost/latency: batch heavy steps; cache embeddings/labels.
  • Observability: dashboards for throughput, error classes, quality metrics.

8) Try it: minimal ETL runner (idempotent + manifest)

type Manifest = { version: string; sources: string[]; filters: Record<string, unknown>; hash: string; count: number };
type JobState = { id: string; step: 'ingest'|'normalize'|'pii'|'dedupe'|'label'|'validate'|'done'; outputs: Record<string, any> };

const jobStore: Record<string, JobState> = {};

function resumeJob(id: string): JobState {
  if (!jobStore[id]) jobStore[id] = { id, step: 'ingest', outputs: {} };
  return jobStore[id];
}

async function runJob(id: string) {
  const job = resumeJob(id);
  if (job.step === 'ingest') {
    job.outputs.ingested = ['doc1','doc2'];
    job.step = 'normalize';
  }
  if (job.step === 'normalize') {
    job.outputs.normalized = job.outputs.ingested;
    job.step = 'pii';
  }
  if (job.step === 'pii') {
    job.outputs.scrubbed = job.outputs.normalized;
    job.step = 'dedupe';
  }
  if (job.step === 'dedupe') {
    job.outputs.unique = job.outputs.scrubbed;
    job.step = 'label';
  }
  if (job.step === 'label') {
    job.outputs.labeled = job.outputs.unique.map((id: string) => ({ id, label: 'ok' }));
    job.step = 'validate';
  }
  if (job.step === 'validate') {
    job.outputs.valid = true;
    job.step = 'done';
  }
  // Write manifest (pseudo)
  const manifest: Manifest = {
    version: '1.0.0',
    sources: ['docs','tickets'],
    filters: { language: ['en'] },
    hash: 'sha256:...',
    count: job.outputs.labeled.length
  };
  return { job, manifest };
}

FAQ (direct answers)

What’s a “good” dataset size?

Enough to cover your domain with balance and quality. More data is not better if it is noisy or duplicative. Start small, measure, and iterate.

Should we synthesize data?

Use synthesis to fill narrow gaps; validate rigorously and avoid copying evals into training.


Ready to put this into practice?

Start building your AI pipeline with our visual DAG builder today.