This case study describes a real engagement. Client identity, proprietary details, and specific metrics are anonymized or approximated under NDA.
KYC Document Verification Pipeline
Manual KYC verification was taking 48+ hours per application. The compliance team was reviewing 200+ documents daily with no structured tooling — document classification, data extraction, and sanctions screening were all performed manually, in sequence.
OCR and NLP pipeline for document classification, structured data extraction, and automated compliance checks including sanctions list screening. Human review retained for edge cases and final approval; the system handles the mechanical extraction and screening layer.
This engagement automated the intake and pre-screening stages of a KYC verification workflow, reducing the manual workload per application from approximately 45 minutes to under 6 minutes of human review time. The system processes identity documents (passports, national IDs, driver's licenses), proof of address documents, and source-of-funds documentation across 12 document types and 8 countries of issue. Sanctions screening runs against three lists simultaneously (OFAC, UN, and a proprietary watchlist) on every extracted name. The pipeline is designed with a hard-fail on low-confidence extractions: when extraction confidence falls below threshold, the application is routed to the human review queue with the low-confidence fields flagged rather than passed through with potentially incorrect data.
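The hard-fail routing rule can be sketched as a small pure function. The threshold value and field names below are illustrative assumptions, not the production configuration:

```python
# Sketch of the low-confidence routing rule: any field below threshold
# sends the whole application to human review with that field flagged.
# CONFIDENCE_THRESHOLD is an assumed value for illustration.

CONFIDENCE_THRESHOLD = 0.90

def route_application(extracted_fields: dict[str, tuple[str, float]]) -> dict:
    """Route to auto-verification or human review based on field confidence.

    extracted_fields maps field name -> (extracted value, confidence score).
    """
    low_confidence = [
        name for name, (_value, conf) in extracted_fields.items()
        if conf < CONFIDENCE_THRESHOLD
    ]
    if low_confidence:
        # Never pass potentially incorrect data through: flag and route.
        return {"route": "human_review", "flagged_fields": low_confidence}
    return {"route": "auto_verify", "flagged_fields": []}
```

The key property is that low-confidence data is never silently accepted; it is surfaced to a reviewer with the specific uncertain fields identified.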
The Challenge
Identity document processing is technically demanding because the documents span multiple countries with different layouts, security features, fonts, and languages. OCR accuracy on passports with machine-readable zones (MRZ) is high, but accuracy on handwritten national ID cards and poorly scanned proof-of-address documents is significantly lower. The extraction pipeline needed per-document-type handling rather than a single general approach. Sanctions screening required handling name transliterations, common spelling variants, and partial name matches — fuzzy matching that, if not tuned carefully, becomes either too loose (false positives that waste human review time) or too tight (false negatives that create compliance exposure). Regulatory requirements meant the pipeline had to produce a full audit trail for every decision, including the specific extracted values, confidence scores, and screening match results that led to either auto-verification or human review routing.
How We Built It
Document taxonomy and OCR pipeline (Weeks 1–3): We catalogued the 12 document types in scope across 8 countries of issue, noting the layout variations, language requirements, and expected field set for each. The OCR pipeline uses Tesseract as the base engine with document-type-specific pre-processing (deskew, contrast, resolution normalization) and post-processing (MRZ parsing for passports, field-boundary detection for structured forms). For documents with poor scan quality, a secondary pass using Anthropic Claude Vision handles handwritten or degraded text that Tesseract cannot reliably parse.
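One concrete piece of the passport post-processing is MRZ check-digit validation, which follows the standard ICAO 9303 scheme (character values 0–9 for digits, A=10 through Z=35, `<`=0, weights cycling 7, 3, 1). A minimal sketch:

```python
# ICAO 9303 MRZ check digit: weighted sum of character values mod 10.
# Used to verify that an OCR'd MRZ field (passport number, date of birth,
# expiry date) was read correctly before trusting the extraction.

def mrz_char_value(c: str) -> int:
    """Map an MRZ character to its numeric value."""
    if c.isdigit():
        return int(c)
    if c == "<":
        return 0
    return ord(c) - ord("A") + 10  # A=10 ... Z=35

def mrz_check_digit(field: str) -> int:
    """Compute the check digit for an MRZ field."""
    weights = (7, 3, 1)
    total = sum(mrz_char_value(c) * weights[i % 3] for i, c in enumerate(field))
    return total % 10
```

A field whose computed check digit disagrees with the digit printed in the MRZ is a strong signal of an OCR misread, which is one of the inputs to the confidence scoring described below.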
Structured extraction and validation (Weeks 4–6): Field extraction for each document type is handled by Anthropic Claude with per-type extraction prompts, returning structured JSON with field values and confidence scores. Validation rules apply field-level checks (date format, ID number length and checksum, country code validity) and cross-document consistency checks (name matching across identity document and proof of address, date-of-birth consistency). Applications that fail cross-document consistency checks are routed to human review with the inconsistency labeled, rather than auto-rejected, since some inconsistencies reflect legitimate name format differences.
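A cross-document name consistency check can be sketched as follows. The normalization steps (accent stripping, whitespace collapsing, tolerating re-ordered name parts) are illustrative assumptions about how legitimate format differences are absorbed before flagging an inconsistency:

```python
import unicodedata

def normalize_name(name: str) -> str:
    """Strip accents, collapse whitespace, uppercase."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return " ".join(ascii_only.upper().split())

def names_consistent(id_name: str, address_name: str) -> bool:
    """Compare names across documents, tolerating common format differences."""
    a, b = normalize_name(id_name), normalize_name(address_name)
    # Treat re-ordered name parts ("DOE JANE" vs "JANE DOE") as consistent,
    # since surname-first ordering varies by document and country.
    return sorted(a.split()) == sorted(b.split())
```

Applications that fail this check are routed to review with the mismatch labeled, matching the design above: an inconsistency is a question for a human, not an automatic rejection.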
Sanctions screening integration (Weeks 7–9): The screening module extracts all person names from the application, generates transliteration variants and common spelling alternatives, and runs fuzzy match screening against OFAC, UN, and the proprietary watchlist. Match scoring uses a combination of string similarity and phonetic matching (Soundex, Metaphone) to balance false positive and false negative rates. The screening result for each name variant is logged with match scores, and the routing decision (auto-clear, human review required, hard reject) follows configurable threshold rules that the compliance team can adjust without code changes.
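The blended match score can be sketched with a classic Soundex implementation plus string similarity. The 0.7/0.3 weights are illustrative, and Metaphone is omitted for brevity; the production thresholds are the configurable values the compliance team tunes:

```python
import difflib

def soundex(name: str) -> str:
    """Classic Soundex: first letter plus three digit codes."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = "".join(c for c in name.upper() if c.isalpha())
    if not name:
        return ""
    out, prev = name[0], codes.get(name[0], "")
    for c in name[1:]:
        code = codes.get(c, "")
        if code and code != prev:
            out += code
        if c not in "HW":  # H and W do not reset the previous code
            prev = code
    return (out + "000")[:4]

def match_score(candidate: str, list_entry: str) -> float:
    """Blend string similarity with a phonetic match bonus (example weights)."""
    sim = difflib.SequenceMatcher(
        None, candidate.upper(), list_entry.upper()).ratio()
    phonetic = 1.0 if soundex(candidate) == soundex(list_entry) else 0.0
    return 0.7 * sim + 0.3 * phonetic
```

Phonetic matching catches transliteration variants that pure edit distance misses ("Mohammed" vs "Muhammad"), while the string-similarity term keeps phonetically identical but visibly different names from auto-clearing or auto-flagging on phonetics alone.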
Audit trail, dashboard, and deployment (Weeks 10–12): Every pipeline execution generates a structured audit record: document IDs, extracted field values, confidence scores, validation outcomes, screening results, match scores, and routing decision with decision reason. These records are stored in PostgreSQL and queryable by the compliance team for regulatory reporting. A FastAPI endpoint serves the pipeline, with Docker containers deployed on the existing cloud infrastructure. Post-deployment, 91% of applications are being auto-verified within 6 minutes of submission, with the remaining 9% routed to human review for final assessment.
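The shape of a per-execution audit record might look like the following dataclass. Field names and types here are assumptions for illustration; the production PostgreSQL schema may differ:

```python
# Hypothetical audit record structure: one row per pipeline execution,
# capturing everything needed to reconstruct a routing decision later.

from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class AuditRecord:
    application_id: str
    document_ids: list[str]
    extracted_fields: dict[str, str]      # field name -> extracted value
    confidence_scores: dict[str, float]   # field name -> confidence
    validation_outcomes: dict[str, bool]  # check name -> passed?
    screening_matches: list[dict]         # per-name-variant match scores
    routing_decision: str                 # auto_verify | human_review | hard_reject
    decision_reason: str
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())
```

Because every decision input is persisted alongside the decision itself, a compliance report for any application is a query rather than a reconstruction effort.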
What We Delivered
Auto-verification rate reached 91% of submitted applications in post-deployment operation, with average processing time from document upload to verification decision of 6 minutes. The previous all-manual process averaged 48 hours. The compliance team's manual review effort is now focused on the 9% of cases that genuinely require human judgment rather than all applications.
Extraction accuracy across the 12 document types reached 99.2% on structured fields (MRZ data, printed names, dates) and 94.7% on semi-structured fields (handwritten entries, address parsing). The low-confidence routing mechanism ensures that applications below extraction confidence threshold reach human review rather than passing through with potentially incorrect data — the compliance team reports zero cases of incorrect auto-verification in the first 8 weeks of operation.
The audit trail structure has reduced regulatory reporting preparation time significantly. Previously, producing documentation for a compliance review required manual assembly of records from multiple systems. The pipeline's structured audit records now allow compliance reports to be generated directly from PostgreSQL queries, reducing preparation time from days to hours.
Ready to build something like this?
Tell us what you are building. We will scope it, price it honestly, and give you a clear plan.
Start a Conversation
Free 30-minute scoping call. No obligation.