2025-12-12

Training Data Pipelines & ETL: Collect, Clean, Label, and Ship

Design reliable pipelines for LLM training data: sourcing, PII scrubbing, deduplication, normalization, labeling, quality checks, and dataset versioning.

Great models come from great data. Your ETL should be boring, reliable, and measurable: collect → clean → normalize → label → validate → version → ship.

Quick answer

  • Sources: product docs, support data, curated web (licensed).
  • PII: detect + redact/hash; audit and sample-check.
  • Dedupe: minhash/simhash or embeddings to drop near-duplicates.
  • Versioning: immutable manifests + semantic versions.

1) Pipeline blueprint

  • Ingest: connectors for docs, tickets, chats, repos.
  • Normalize: unify encoding, strip boilerplate, fix whitespace.
  • PII scrub: regex + ML; redact/hash; flag uncertain cases.
  • Deduplicate: near-duplicate removal; keep canonical.
  • Label: instructions, preference pairs, extraction fields.
  • Validate: schema checks, coverage, toxicity, leakage.
  • Version + ship: write manifests; store snapshots.
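The stages above compose naturally as pure batch-to-batch functions, which keeps each step independently testable and retryable. A minimal sketch (all names here are illustrative, not from a real framework):

```typescript
// A record flowing through the pipeline; fields are illustrative.
type Doc = { id: string; text: string };

// Each stage is a pure function from a batch to a batch,
// so stages can be composed, tested, and retried independently.
type Stage = (batch: Doc[]) => Doc[];

// Normalize: unify Unicode form, collapse whitespace.
const normalize: Stage = (batch) =>
  batch.map(d => ({ ...d, text: d.text.normalize('NFC').replace(/\s+/g, ' ').trim() }));

// Exact dedupe on normalized text (near-dup removal comes later).
const dedupe: Stage = (batch) => {
  const seen = new Set<string>();
  return batch.filter(d => (seen.has(d.text) ? false : (seen.add(d.text), true)));
};

// Compose stages left-to-right: ingest → normalize → ... → ship.
const pipeline = (stages: Stage[]) => (batch: Doc[]) =>
  stages.reduce((b, s) => s(b), batch);

const run = pipeline([normalize, dedupe]);
```

Keeping stages pure makes idempotent re-runs (section 7) much easier, since replaying a stage on the same input yields the same output.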

2) How-to: PII scrubbing policy

{
  "fields": {
    "email": { "action": "hash" },
    "phone": { "action": "redact" },
    "ssn": { "action": "drop" }
  },
  "free_text": {
    "rules": ["EMAIL", "PHONE", "IP", "CREDIT_CARD"],
    "action": "redact"
  }
}
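Applying the free-text half of such a policy can be as simple as an ordered regex pass. A sketch below; the regexes are illustrative, not production-grade PII detection (rule order matters, since e.g. a phone pattern can swallow an IP):

```typescript
// Illustrative detection rules keyed by the policy's rule names.
const RULES: Record<string, RegExp> = {
  EMAIL: /[\w.+-]+@[\w-]+\.[\w.]+/g,
  PHONE: /\+?\d[\d\s().-]{7,}\d/g,
  IP: /\b(?:\d{1,3}\.){3}\d{1,3}\b/g,
};

// Replace each match with a [RULE] placeholder, per the policy's
// "redact" action. Uncertain cases should be flagged for review instead.
function redactFreeText(text: string, rules: string[] = Object.keys(RULES)): string {
  let out = text;
  for (const name of rules) {
    const re = RULES[name];
    if (re) out = out.replace(re, `[${name}]`);
  }
  return out;
}
```

For structured fields, "hash" would use a keyed hash (so the same email maps to the same token) and "drop" removes the field entirely.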

3) How-to: near-duplicate removal (minhash)

// Exact Jaccard similarity between two shingle sets.
function jaccard(a: Set<string>, b: Set<string>): number {
  const inter = new Set([...a].filter(x => b.has(x))).size;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 0 : inter / union;
}
// Tokenize by shingles and drop pairs with Jaccard > 0.9.
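Computing exact Jaccard over all pairs is quadratic in corpus size. MinHash approximates it with fixed-size signatures: the fraction of matching signature slots estimates the Jaccard similarity. A minimal sketch, using FNV-1a as an illustrative (non-cryptographic) seeded hash:

```typescript
// Character k-shingles of a string.
function shingles(text: string, k = 5): Set<string> {
  const out = new Set<string>();
  for (let i = 0; i + k <= text.length; i++) out.add(text.slice(i, i + k));
  return out;
}

// Seeded 32-bit FNV-1a hash; illustrative, not cryptographic.
function fnv1a(s: string, seed: number): number {
  let h = (0x811c9dc5 ^ seed) >>> 0;
  for (let i = 0; i < s.length; i++) {
    h = (h ^ s.charCodeAt(i)) >>> 0;
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// MinHash signature: for each of n seeded hash functions,
// keep the minimum hash value over the set's shingles.
function minhash(set: Set<string>, n = 64): number[] {
  const sig = new Array(n).fill(0xffffffff);
  for (const s of set)
    for (let i = 0; i < n; i++) sig[i] = Math.min(sig[i], fnv1a(s, i));
  return sig;
}

// Fraction of matching slots ≈ Jaccard similarity of the sets.
function estimateJaccard(a: number[], b: number[]): number {
  let eq = 0;
  for (let i = 0; i < a.length; i++) if (a[i] === b[i]) eq++;
  return eq / a.length;
}
```

In practice you would bucket signatures with locality-sensitive hashing (LSH) so only likely duplicates are ever compared.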

4) Labeling patterns

  • Instructions: prompt → answer pairs with strict schemas.
  • Preference: A/B pairs for DPO (choose better answer).
  • Extraction: field-level labels with types and nullability.
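The three patterns above translate directly into strict schemas. An illustrative sketch of the record shapes (field names are assumptions, not a standard), plus a runtime guard of the kind a validation step would use:

```typescript
// Instruction tuning: prompt → answer pairs.
type InstructionExample = { prompt: string; answer: string };

// Preference pair for DPO: same prompt, a chosen and a rejected answer.
type PreferencePair = { prompt: string; chosen: string; rejected: string };

// Field-level extraction labels with explicit types and nullability.
type ExtractionField = { name: string; type: 'string' | 'number' | 'date'; nullable: boolean };
type ExtractionExample = {
  text: string;
  fields: ExtractionField[];
  values: Record<string, string | null>;
};

// Runtime check matching the PreferencePair schema; validation (section 5)
// reports the fraction of records that pass guards like this one.
function isPreferencePair(x: any): x is PreferencePair {
  return typeof x?.prompt === 'string'
    && typeof x?.chosen === 'string'
    && typeof x?.rejected === 'string';
}
```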

5) Validation checks

  • Schema adherence: % valid examples.
  • Coverage: topics, languages, edge cases.
  • Consistency: inter-annotator agreement (Cohen’s kappa).
  • Toxicity/PII: rates and thresholds.
  • Leakage risk: overlap with eval/test corpora.
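Of these, inter-annotator agreement is the least obvious to compute. Cohen's kappa corrects observed agreement for the agreement two annotators would reach by chance, given their label distributions:

```typescript
// Cohen's kappa for two annotators over the same items.
// kappa = (po - pe) / (1 - pe), where po is observed agreement and
// pe is expected chance agreement from each annotator's label frequencies.
function cohensKappa(a: string[], b: string[]): number {
  const n = a.length;
  const labels = new Set([...a, ...b]);
  let agree = 0;
  for (let i = 0; i < n; i++) if (a[i] === b[i]) agree++;
  const po = agree / n;
  let pe = 0;
  for (const l of labels) {
    const pa = a.filter(x => x === l).length / n;
    const pb = b.filter(x => x === l).length / n;
    pe += pa * pb;
  }
  return pe === 1 ? 1 : (po - pe) / (1 - pe);
}
```

Kappa of 1 is perfect agreement; 0 means no better than chance. Common practice treats values above roughly 0.6–0.8 as acceptable, but the right threshold depends on task difficulty.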

6) Dataset manifests (versioning)

{
  "version": "1.3.0",
  "sources": ["docs", "tickets", "curated_web"],
  "filters": { "language": ["en","es"], "date_range": "2024-01..2025-10" },
  "hash": "sha256:...",
  "count": 125342
}
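The `hash` field is what makes a manifest immutable: hash a canonical (sorted) serialization of the examples so the same data always yields the same manifest, regardless of ingestion order. A sketch using Node's built-in `crypto` module; the serialization scheme here is an assumption, not a standard:

```typescript
import { createHash } from 'crypto';

// Content hash over sorted serialized examples: order-independent,
// so re-running ingestion on the same data reproduces the same hash.
function datasetHash(examples: string[]): string {
  const h = createHash('sha256');
  for (const ex of [...examples].sort()) h.update(ex + '\n');
  return 'sha256:' + h.digest('hex');
}

function buildManifest(version: string, sources: string[], examples: string[]) {
  return { version, sources, hash: datasetHash(examples), count: examples.length };
}
```

Bump the semantic version on any change to sources, filters, or scrubbing rules; the hash then confirms whether two manifests actually shipped the same bytes.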

7) Ops checklist

  • Idempotent jobs: resume/retry without duplicate outputs.
  • Audit logs: record scrubs, drops, and label decisions.
  • Cost/latency: batch heavy steps; cache embeddings/labels.
  • Observability: dashboards for throughput, error classes, quality metrics.

8) Try it: minimal ETL runner (idempotent + manifest)

type Manifest = { version: string; sources: string[]; filters: Record<string, unknown>; hash: string; count: number };
type JobState = { id: string; step: 'ingest'|'normalize'|'pii'|'dedupe'|'label'|'validate'|'done'; outputs: Record<string, any> };

const jobStore: Record<string, JobState> = {};

function resumeJob(id: string): JobState {
  if (!jobStore[id]) jobStore[id] = { id, step: 'ingest', outputs: {} };
  return jobStore[id];
}

async function runJob(id: string) {
  const job = resumeJob(id);
  if (job.step === 'ingest') {
    job.outputs.ingested = ['doc1','doc2'];
    job.step = 'normalize';
  }
  if (job.step === 'normalize') {
    job.outputs.normalized = job.outputs.ingested;
    job.step = 'pii';
  }
  if (job.step === 'pii') {
    job.outputs.scrubbed = job.outputs.normalized;
    job.step = 'dedupe';
  }
  if (job.step === 'dedupe') {
    job.outputs.unique = job.outputs.scrubbed;
    job.step = 'label';
  }
  if (job.step === 'label') {
    job.outputs.labeled = job.outputs.unique.map((id: string) => ({ id, label: 'ok' }));
    job.step = 'validate';
  }
  if (job.step === 'validate') {
    job.outputs.valid = true;
    job.step = 'done';
  }
  // Write manifest (pseudo)
  const manifest: Manifest = {
    version: '1.0.0',
    sources: ['docs','tickets'],
    filters: { language: ['en'] },
    hash: 'sha256:...',
    count: job.outputs.labeled.length
  };
  return { job, manifest };
}

FAQ (direct answers)

What’s a “good” dataset size?

Enough to cover your domain with balance and quality. More data is not better if it is noisy or duplicative. Start small, measure, and iterate.

Should we synthesize data?

Use synthesis to fill narrow gaps; validate rigorously and avoid copying evals into training.


Ready to put this into practice?

Start building your AI pipeline with our visual DAG builder today.