We treat annotation processes as software engineering problems: guidelines are versioned, inter-annotator agreement (IAA) is measured continuously, and edge cases are explicitly catalogued and covered. Annotation guidelines go through a small-batch validation round before full-scale annotation: a 50-100-sample batch with IAA measurement catches guideline ambiguity before it propagates through the full dataset at high cost.
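As a concrete illustration, a common IAA metric for a two-annotator validation batch is Cohen's kappa, which can be computed with the standard library alone. The labels and the 0.8 gate below are hypothetical placeholders, not values from the process described here.

```python
from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa for two annotators labeling the same samples."""
    assert len(a) == len(b) and a, "annotators must label the same non-empty batch"
    n = len(a)
    # Observed agreement: fraction of samples where the annotators agree.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement, from each annotator's marginal label distribution.
    freq_a, freq_b = Counter(a), Counter(b)
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n) for l in freq_a.keys() | freq_b.keys())
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

# Hypothetical 6-sample validation batch, two annotators:
ann_1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
ann_2 = ["pos", "neg", "neg", "neg", "pos", "neg"]
kappa = cohens_kappa(ann_1, ann_2)  # ~0.67 here: below a 0.8 gate, so revise the guidelines
```

A kappa below the project's gate on the validation batch is the signal to revise guidelines before scaling up, exactly the failure mode the small-batch round exists to catch.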
Training dataset composition is designed against the production distribution, not the available data distribution. If production data contains a long tail of rare cases that available data under-represents, we design targeted collection for those cases before training starts. For LLM fine-tuning with RLHF, we design preference comparison workflows where annotators compare model output pairs on defined quality dimensions — the criteria that the reward model will learn from.
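The preference-comparison workflow can be sketched as one record per annotated output pair. The dimension names and fields below are illustrative assumptions, not a fixed schema.

```python
from dataclasses import dataclass, field

# Hypothetical quality dimensions a guideline might define for RLHF comparisons.
QUALITY_DIMENSIONS = ("helpfulness", "factuality", "safety")

@dataclass
class PreferencePair:
    """One annotated comparison of two model outputs for the same prompt."""
    prompt: str
    output_a: str
    output_b: str
    # Per-dimension verdict: "a", "b", or "tie".
    verdicts: dict[str, str] = field(default_factory=dict)

    def overall_winner(self) -> str:
        """Majority vote across dimensions; 'tie' when neither output wins more."""
        a = sum(v == "a" for v in self.verdicts.values())
        b = sum(v == "b" for v in self.verdicts.values())
        return "a" if a > b else "b" if b > a else "tie"

pair = PreferencePair(
    prompt="Summarize the report.",
    output_a="...",
    output_b="...",
    verdicts={"helpfulness": "a", "factuality": "a", "safety": "tie"},
)
winner = pair.overall_winner()  # pairs with a clear winner become (chosen, rejected) examples
```

Keeping per-dimension verdicts rather than a single preference makes it possible to audit which criteria drive disagreements before the reward model trains on the data.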
Training data development process
01. Task definition and guideline validation: Define the annotation task precisely. Guidelines go through a 50-100-sample validation batch with IAA measurement before full-scale annotation. Low IAA on the validation batch means guideline revision, not full-scale annotation with ambiguous instructions.
02. Production distribution analysis: Analyze what the model will encounter in production. Identify under-represented input types, rare but important cases, and adversarial examples relevant to the task. Design targeted data collection for coverage gaps.
03. Annotation workflow setup: Configure Label Studio or Prodigy for the task type. Set up IAA measurement. Establish annotator calibration: a shared batch where annotators discuss disagreements before independent annotation begins.
04. Annotation with continuous quality control: Monitor IAA continuously throughout annotation. Resolve disagreements through adjudication by a senior annotator or by consensus. Re-annotate samples below the IAA threshold. Update guidelines when systematic disagreements reveal ambiguity.
05. Dataset validation before training: Validate class balance, edge case coverage, adversarial example inclusion, and final IAA metrics. For RLHF preference data, validate criteria consistency and pair quality before reward model training.
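The validation step before training can be expressed as an automated gate over the final dataset. The label names, coverage threshold, and IAA threshold below are hypothetical placeholders for project-specific values.

```python
from collections import Counter

def validate_dataset(labels, required_labels, min_fraction=0.05,
                     iaa=None, iaa_threshold=0.8):
    """Return a list of human-readable failures; an empty list means the gate passes.

    labels: final label per training sample
    required_labels: classes or edge-case tags that must each reach min_fraction coverage
    iaa: final inter-annotator agreement for the dataset, if measured
    """
    failures = []
    n = len(labels)
    counts = Counter(labels)
    for label in required_labels:
        fraction = counts.get(label, 0) / n if n else 0.0
        if fraction < min_fraction:
            failures.append(f"{label}: coverage {fraction:.1%} below {min_fraction:.0%}")
    if iaa is not None and iaa < iaa_threshold:
        failures.append(f"final IAA {iaa:.2f} below threshold {iaa_threshold:.2f}")
    return failures

# Hypothetical dataset where the rare 'adversarial' class is under-represented:
labels = ["benign"] * 95 + ["adversarial"] * 5
failures = validate_dataset(labels, ["benign", "adversarial"],
                            min_fraction=0.10, iaa=0.85)
# failures reports the adversarial coverage gap; training is blocked until it is fixed
```

Running a gate like this as the last step makes step 02's coverage requirements enforceable rather than aspirational.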