
Training data that reflects production reality, not annotation convenience.

The annotation process is where production model quality is won or lost. Inter-annotator agreement measured too late, training distributions that do not cover the production long tail, and RLHF preference data collected without clear quality criteria — these are the data problems that produce models that pass evals and fail on real user inputs. We build annotation processes with the rigor production model quality requires.

AI Training & Data Annotation
The Problem

The annotation process is where model quality is determined — not at training time. A model trained on ambiguously annotated data learns the annotator's inconsistencies, not the underlying task. A training distribution that over-represents common cases and under-represents edge cases produces a model that looks strong on benchmarks and weak on the production long tail.

Inter-annotator agreement (IAA) — Cohen's kappa or Krippendorff's alpha — is the measurement that validates annotation quality. Low IAA is a leading indicator of poor model performance. It means different annotators answered the same question differently, and the training labels contain noise rather than signal. Most annotation projects measure IAA late — after the full dataset is annotated — when the cost of finding systematic disagreement is highest. We measure from the first validation batch.
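For two annotators and categorical labels, Cohen's kappa can be computed in a few lines. This is a minimal sketch using only the standard library; it assumes both annotators labeled the same samples in the same order, and does not guard against the degenerate case where expected agreement is exactly 1.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same samples.

    Corrects raw percent agreement for the agreement expected by chance,
    estimated from each annotator's marginal label frequencies.
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

Running this on the first validation batch, rather than the finished dataset, is what makes kappa actionable: a low value at that point costs a guideline revision, not a re-annotation.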

What makes training data fail in production
  • Annotation guidelines ambiguous on edge cases — different annotators label them differently
  • Training distribution that does not represent the production input distribution
  • Missing adversarial examples — model learns easy surface patterns, not robust signal
  • No long-tail coverage — model fails on rare but important cases
  • RLHF preference data collected without clear quality criteria produces noisy reward signals
  • Label noise from unresolved annotator disagreement
Our Approach

We design annotation processes as software engineering problems: guidelines are versioned, IAA is measured continuously, and edge cases are explicitly catalogued and covered. Annotation guidelines go through a small-batch validation round before full-scale annotation — a 50-100 sample batch with IAA measurement catches guideline ambiguity before it propagates through the full dataset at high cost.

Training dataset composition is designed against the production distribution, not the available data distribution. If production data contains a long tail of rare cases that available data under-represents, we design targeted collection for those cases before training starts. For LLM fine-tuning with RLHF, we design preference comparison workflows where annotators compare model output pairs on defined quality dimensions — the criteria that the reward model will learn from.
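The gap analysis behind this can be sketched as a comparison of category frequencies between a sample of production traffic and the candidate training set. The categories and the 50% threshold below are illustrative; real projects define both per task.

```python
from collections import Counter

def coverage_gaps(production_samples, training_samples, min_ratio=0.5):
    """Find input categories whose share of the training set falls well below
    their share of production traffic.

    Returns (category, production_share, training_share) tuples, largest
    shortfall first. min_ratio=0.5 flags categories with less than half
    their production share in training -- an illustrative threshold.
    """
    prod = Counter(production_samples)
    train = Counter(training_samples)
    n_prod, n_train = len(production_samples), len(training_samples)
    gaps = []
    for cat, count in prod.items():
        prod_share = count / n_prod
        train_share = train.get(cat, 0) / n_train
        if train_share < min_ratio * prod_share:
            gaps.append((cat, prod_share, train_share))
    return sorted(gaps, key=lambda g: g[1] - g[2], reverse=True)
```

The output of a gap analysis like this is the input to targeted collection: each flagged category becomes a collection quota before training starts.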

Training data development process

01
Task definition and guideline validation

Define the annotation task precisely. Guidelines go through a 50-100 sample validation batch with IAA measurement before full-scale annotation. Low IAA on the validation batch means guideline revision, not full-scale annotation with ambiguous instructions.

02
Production distribution analysis

Analyze what the model will encounter in production. Identify under-represented input types, rare but important cases, and adversarial examples relevant to the task. Design targeted data collection for coverage gaps.

03
Annotation workflow setup

Configure Label Studio or Prodigy for the task type. Set up IAA measurement. Establish annotator calibration — a shared batch where annotators discuss disagreements before independent annotation begins.
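The calibration step needs a concrete artifact to discuss: the samples where annotators disagreed. A minimal sketch, assuming annotations are collected as a mapping from sample ID to per-annotator labels (the data shape here is illustrative, not a specific platform's export format):

```python
def disagreement_report(annotations):
    """annotations: {sample_id: {annotator: label}} for a shared calibration batch.

    Returns only the samples where annotators disagree -- the agenda for the
    calibration discussion before independent annotation begins.
    """
    return {
        sample_id: labels
        for sample_id, labels in annotations.items()
        if len(set(labels.values())) > 1
    }
```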

04
Annotation with continuous quality control

Continuous IAA monitoring throughout annotation. Disagreements resolved through adjudication by a senior annotator or consensus. Samples below IAA threshold re-annotated. Guidelines updated when systematic disagreements reveal ambiguity.
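Continuous monitoring reduces to a per-batch check that fires before the next batch is assigned. The sketch below uses raw percent agreement to stay short; in practice you would use a chance-corrected metric like kappa, and the 0.8 threshold is illustrative.

```python
def monitor_batches(batches, threshold=0.8):
    """batches: list of (batch_id, labels_a, labels_b) for double-annotated batches.

    Flags any batch whose raw percent agreement drops below threshold --
    the trigger for adjudication and a guideline review.
    """
    flagged = []
    for batch_id, a, b in batches:
        agreement = sum(x == y for x, y in zip(a, b)) / len(a)
        if agreement < threshold:
            flagged.append((batch_id, agreement))
    return flagged
```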

05
Dataset validation before training

Validate class balance, edge case coverage, adversarial example inclusion, and final IAA metrics. RLHF preference data: validate criteria consistency and pair quality before reward model training.
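The pre-training gate can be expressed as a small set of automated checks that must pass before the dataset ships. A sketch covering two of the checks above; the thresholds are illustrative and set per project.

```python
from collections import Counter

def validate_dataset(labels, edge_case_flags, max_imbalance=10.0, min_edge_cases=50):
    """Pre-training checks: class balance ratio and edge-case coverage.

    labels: class label per sample. edge_case_flags: bool per sample marking
    catalogued edge cases. Returns a list of human-readable issues; an empty
    list means the checks passed.
    """
    counts = Counter(labels)
    issues = []
    if max(counts.values()) / min(counts.values()) > max_imbalance:
        issues.append(f"class imbalance ratio exceeds {max_imbalance}: {dict(counts)}")
    n_edge = sum(edge_case_flags)
    if n_edge < min_edge_cases:
        issues.append(f"only {n_edge} edge cases, need >= {min_edge_cases}")
    return issues
```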

What Is Included
01

IAA-validated annotation guidelines

Guidelines are validated with a small batch before full-scale annotation. IAA is measured throughout, not just at the end. Low-agreement samples are adjudicated and guidelines updated to reduce future ambiguity. The validation batch is the cheapest point to find systematic disagreements.

02

Production distribution coverage

We analyze the production input distribution and design data collection to cover it — including edge cases and rare inputs that available datasets under-represent. Models trained on production-representative data fail less often on the long tail.

03

Adversarial example inclusion

For classification and detection models, we include adversarial examples in the training data: inputs designed to be confusing, near-boundary cases with similar surface features but different labels, and common misclassification patterns observed in earlier model versions.

04

RLHF preference data workflows

For LLM fine-tuning with RLHF, we design preference comparison workflows: annotators compare output pairs and rate which is better on defined quality dimensions. Clear criteria for each dimension are essential — vague criteria produce noisy preference labels, and noisy labels produce a noisy reward model.
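Per-dimension ratings still need to be reduced to a single pairwise preference for reward model training. One way to do that is a weighted vote across dimensions; the dimension names and weights below are illustrative, not a fixed rubric.

```python
def aggregate_preference(ratings, weights=None):
    """ratings: {dimension: "A" | "B" | "tie"} from one annotator for an
    output pair. Returns the weighted overall winner ("A", "B", or "tie").

    Unlisted dimensions default to weight 1.0; "tie" votes contribute to
    neither side.
    """
    weights = weights or {}
    score = {"A": 0.0, "B": 0.0}
    for dim, pick in ratings.items():
        if pick in score:
            score[pick] += weights.get(dim, 1.0)
    if score["A"] == score["B"]:
        return "tie"
    return "A" if score["A"] > score["B"] else "B"
```

Keeping the per-dimension ratings alongside the aggregate also lets you audit where annotators systematically diverge — a safety-vs-helpfulness split looks very different from random noise.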

05

Active learning with selection review

Active learning selects the most informative unlabeled samples for annotation — samples where the current model is most uncertain. We implement selection review to catch distribution-edge artifacts where model uncertainty estimates are unreliable.
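A common uncertainty measure is predictive entropy over class probabilities. This sketch selects the top-k most uncertain samples; the selected IDs would then go to the human selection review described above, since high entropy at the edges of the distribution can reflect unreliable uncertainty estimates rather than informative samples.

```python
import math

def select_for_annotation(probs_by_sample, k):
    """Pick the k unlabeled samples with the highest predictive entropy.

    probs_by_sample: {sample_id: [class probabilities summing to 1]}.
    Higher entropy = model is less certain = sample is (nominally) more
    informative to annotate next.
    """
    def entropy(p):
        return -sum(x * math.log(x) for x in p if x > 0)

    ranked = sorted(probs_by_sample.items(), key=lambda kv: entropy(kv[1]), reverse=True)
    return [sample_id for sample_id, _ in ranked[:k]]
```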

Deliverables
  • Annotation guidelines document with edge case specifications, examples, and IAA validation results
  • Annotation platform setup (Label Studio or Prodigy) with IAA measurement and quality control
  • Annotated training dataset with IAA report and adjudication log
  • Production distribution analysis with coverage assessment and gap identification
  • Active learning pipeline setup with selection review process (if applicable)
  • RLHF preference data workflow with comparison criteria and quality validation (if LLM fine-tuning)
Projected Impact

Training data quality is the primary determinant of model quality for supervised learning tasks. Investment in rigorous annotation — IAA measurement, production-representative coverage, adversarial examples — produces models that perform consistently in production rather than regressing on cases the training data did not cover.

FAQ

Common questions about this service.

How do we know when we have enough training data?

Learning curves tell you. Train models on increasing subsets of your data and plot performance vs. dataset size. When the curve flattens — additional data produces diminishing improvement — you have likely reached data sufficiency for the current model architecture. If performance has not reached your target by that point, the problem is more likely model architecture, task framing, or annotation quality than data volume.
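The flattening check itself is mechanical once the curve exists. A minimal sketch: given (dataset_size, eval_score) points from training on increasing subsets, declare sufficiency when the most recent increment stopped paying for itself. The min_gain threshold is illustrative.

```python
def data_sufficiency(curve, min_gain=0.005):
    """curve: list of (dataset_size, eval_score), sorted by increasing size.

    Returns True when the latest increment improved the score by less than
    min_gain -- i.e. the learning curve has flattened for this architecture.
    """
    if len(curve) < 2:
        return False
    (_, prev_score), (_, last_score) = curve[-2], curve[-1]
    return (last_score - prev_score) < min_gain
```

Plotting the full curve matters more than the boolean: a curve that flattens below the target score is the signal to revisit architecture or annotation quality rather than collect more data.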

Should we use synthetic data generation?

Synthetic data is most valuable for augmenting rare cases in the training distribution — not replacing real data. LLM-generated synthetic examples for text tasks, augmented images for vision tasks. The risk is distributional mismatch: synthetic data that does not match real production inputs adds noise, not signal. Validate synthetic data quality against held-out real samples before including it in training.

What annotation platforms do you work with?

Label Studio (open-source, flexible, self-hostable), Prodigy (tight spaCy integration, active learning support), Scale AI (managed annotation at scale), and Labelbox (enterprise annotation management). Platform selection is driven by task type, budget, data sensitivity requirements, and whether in-house or outsourced annotation is appropriate.

How do you design RLHF preference data collection?

RLHF preference quality depends on the clarity of comparison criteria. We define explicit quality dimensions for each pairwise comparison: accuracy, helpfulness, safety, tone — whatever matters for your use case. Annotators rate which output is better on each dimension, not a single overall preference. Clear criteria reduce annotator variance and produce more useful reward signals.

Ready to get started?

Tell us what you are building. We will scope it, price it honestly, and give you a clear plan.

Start a Conversation

Free 30-minute scoping call. No obligation.