This case study describes a real engagement. Client identity, proprietary details, and specific metrics are anonymized or approximated under NDA.
AI Content Moderation for User-Generated Platforms
50,000+ posts per day requiring moderation review. A manual moderation team averaging 4-hour review latency. Policy violations slipping through due to sheer volume, and moderators burning out from sustained exposure to harmful content.
Multi-modal content classifier handling text and image inputs, with confidence-based routing that auto-approves clearly safe content, auto-rejects clear violations, and routes edge cases to human review. Human review effort concentrated on ambiguous cases only.
This engagement built a content moderation pipeline that processes posts containing text, images, or both against a configurable policy ruleset. The classifier handles seven violation categories: graphic violence, sexual content, hate speech, harassment, spam, misinformation, and self-harm content. Routing logic applies confidence thresholds per category — auto-approve when all categories score below their safe threshold, auto-reject when any category scores above its reject threshold, and route to human review when any category scores in the ambiguous range. The threshold configuration is maintained in the operations dashboard, allowing policy teams to adjust sensitivity per category without code changes. At production volume, the pipeline processes 50,000+ posts per day with average processing time of 340ms per post.
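The per-category routing rule described above can be sketched as follows. The category names and threshold values here are illustrative placeholders, not the platform's actual configuration, which policy teams tune from the dashboard:

```python
# Illustrative per-category thresholds (placeholders, not production values).
# Each category has a "safe" floor and a "reject" ceiling; scores between
# the two fall into the ambiguous range that goes to human review.
THRESHOLDS = {
    "violence": {"safe": 0.15, "reject": 0.90},
    "spam":     {"safe": 0.30, "reject": 0.95},
}

def route(scores: dict, thresholds: dict) -> str:
    """Apply the three-way routing rule to per-category confidence scores."""
    # Auto-reject if ANY category exceeds its reject threshold.
    if any(scores[c] >= thresholds[c]["reject"] for c in scores):
        return "auto_reject"
    # Auto-approve only if ALL categories score below their safe threshold.
    if all(scores[c] < thresholds[c]["safe"] for c in scores):
        return "auto_approve"
    # Otherwise at least one category is in the ambiguous range.
    return "human_review"
```

A post scoring 0.05 on violence and 0.10 on spam would auto-approve; a 0.50 violence score would route to human review; a 0.95 violence score would auto-reject.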
The Challenge
Content moderation accuracy is uniquely consequential in both directions: false positives suppress legitimate speech, while false negatives allow harmful content to persist. The threshold configuration needed to differ by category — the cost asymmetry between false positives and false negatives is not the same for graphic violence as it is for spam. The multi-modal nature of the problem added complexity: text-only classifiers miss image-based violations, and image-only classifiers miss harmful text paired with benign imagery. The system needed to evaluate both modalities and combine their signals correctly. Processing latency was also a hard constraint — posts needed to be classified before public display, which meant the pipeline had to complete in under 500ms end-to-end even during peak posting periods (which create traffic spikes 3–4x above average).
How We Built It
Policy mapping and training data audit (Weeks 1–2): We worked with the platform's policy team to document the seven violation categories, their definitions, and example cases, including edge cases where context determines whether content violates policy. This document served as the specification for both the classifier and the human review guidelines. We audited the platform's existing moderation action history to extract a labeled training dataset: 80,000 post-level labels across the seven categories, with a class-balanced selection to avoid training the model to over-predict the dominant class (safe content, which represents 94% of all posts).
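The class-balanced selection can be sketched along these lines. The helper name and the simple down-sampling strategy are assumptions for illustration; the actual audit tooling used in the engagement is not described here:

```python
import random

def balanced_sample(labeled_posts, per_class, seed=0):
    """Down-sample each label to at most per_class examples so the dominant
    'safe' class (~94% of posts) does not swamp training.
    labeled_posts: iterable of (post_id, label) pairs."""
    rng = random.Random(seed)
    by_label = {}
    for post_id, label in labeled_posts:
        by_label.setdefault(label, []).append(post_id)
    sample = {}
    for label, ids in by_label.items():
        rng.shuffle(ids)          # random selection within each class
        sample[label] = ids[:per_class]
    return sample
```

With 100 safe posts and 10 spam posts in the raw history, `balanced_sample(data, per_class=10)` would return 10 of each, rather than the raw 10:1 ratio.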
Multi-modal classifier development (Weeks 3–6): We built separate text and image classification branches that feed a fusion layer. The text branch uses a fine-tuned BERT-based model for the five text-amenable categories (hate speech, harassment, spam, misinformation, self-harm) and OpenAI Moderation API as an auxiliary signal. The image branch uses OpenAI Vision for graphic violence and sexual content classification, with a custom rule-based layer for known hash-matched content. The fusion layer combines branch scores using a learned weighting that was calibrated against the labeled dataset. End-to-end evaluation on the held-out test set achieved 94% accuracy on the combined classification task.
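One plausible shape for the fusion layer's learned weighting is a logistic combination of the two branch scores. The weights and bias below are hypothetical placeholders; in the engagement they were calibrated against the labeled dataset:

```python
import math

def fuse(text_score, image_score, w_text, w_image, bias):
    """Combine per-category branch scores into a single fused confidence.
    Weights and bias are learned during calibration (placeholders here)."""
    z = w_text * text_score + w_image * image_score + bias
    # Squash to (0, 1) so the fused value behaves like a confidence score.
    return 1.0 / (1.0 + math.exp(-z))
```

The learned weights let the fusion layer trust one modality more than the other per category — e.g. weighting the image branch heavily for graphic violence while relying mostly on the text branch for spam.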
Kafka pipeline and processing infrastructure (Weeks 7–9): Post submission events are published to a Kafka topic. The moderation service consumes from this topic, runs classification, and produces routing decisions to a second Kafka topic that the content delivery service consumes. At peak posting volume, the classifier service scales horizontally — Kafka decouples ingestion from processing so that traffic spikes queue rather than causing processing delays or timeouts. Redis caches classification results for re-posts and identical content patterns. FastAPI serves the classifier service, with PostgreSQL storing classification results, confidence scores, and routing decisions for audit and model retraining.
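Caching classification results for re-posts and identical content requires a deterministic cache key. A minimal sketch, assuming keys are derived from normalized text plus an image digest — the actual key scheme is not specified in this write-up:

```python
import hashlib

def cache_key(text, image_bytes):
    """Deterministic Redis cache key for a post's content (sketch).
    Normalizes text casing and whitespace so trivially identical
    re-posts produce the same key and hit the cache."""
    h = hashlib.sha256()
    if text:
        h.update(" ".join(text.lower().split()).encode("utf-8"))
    if image_bytes:
        # Hash the image separately so a text/image boundary can't be forged.
        h.update(hashlib.sha256(image_bytes).digest())
    return h.hexdigest()
```

The classifier service would check Redis for this key before running the full pipeline, returning the cached routing decision on a hit.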
Human review interface and operations tooling (Week 10): The human review interface presents edge-case posts with the confidence scores and the specific classifier signals that triggered review, helping moderators understand the basis for escalation. Moderator decisions are logged and used for ongoing model calibration. The operations dashboard shows real-time metrics: auto-approve rate, auto-reject rate, human review volume, processing latency by category, and a per-category accuracy estimate based on moderator override rates. Category threshold adjustment is done from this dashboard without code deployment.
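The per-category accuracy estimate from moderator override rates could look like the following. The decision-record shape is a simplification of what the dashboard actually stores:

```python
def accuracy_from_overrides(decisions):
    """Estimate per-category accuracy as 1 - override rate on reviewed posts.
    decisions: list of {"category": str, "overridden": bool} records,
    where overridden=True means the moderator reversed the classifier."""
    counts, overrides = {}, {}
    for d in decisions:
        c = d["category"]
        counts[c] = counts.get(c, 0) + 1
        overrides[c] = overrides.get(c, 0) + (1 if d["overridden"] else 0)
    return {c: 1.0 - overrides[c] / counts[c] for c in counts}
```

This is a proxy metric, not ground truth — it only observes posts that reached human review — but it surfaces category-level drift cheaply between formal evaluations.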
What We Delivered
Auto-approve and auto-reject routing handles 78% of post volume, reducing the human review queue from 50,000+ posts per day to approximately 11,000 — a 78% reduction in manual review volume. Average review latency dropped from 4 hours to under 12 minutes for posts routed to human review, and to sub-second for auto-routed content. Classification accuracy on the held-out test set measured 94% across the seven violation categories.
Average processing time per post is 340ms end-to-end at production volume, within the 500ms pre-display requirement. Kafka-based architecture handles peak traffic spikes (which have reached 4.1x average load during major news events) without processing delays by queuing excess volume. P99 processing time during peak periods is 780ms, which routes the affected posts to a post-display moderation check rather than pre-display classification — a fallback the platform accepted as operationally preferable to pre-display blocking.
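The latency fallback amounts to a simple budget check at classification time. A minimal sketch, using the 500ms pre-display budget from the requirement above (the function name and shape are assumptions):

```python
def choose_path(elapsed_ms, budget_ms=500.0):
    """If classification finishes within the pre-display budget, the routing
    decision gates display; otherwise the post displays and is re-checked
    by the post-display moderation pass instead of being blocked."""
    return "pre_display" if elapsed_ms <= budget_ms else "post_display_check"
```

At the reported averages, a 340ms classification gates display normally, while a 780ms P99 outlier takes the post-display path.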
Moderator burnout metrics (tracked through HR surveys prior to and following deployment) showed a 34% improvement in reported burnout scores, attributed primarily to the reduction in exposure volume and the shift toward judgment-based review rather than high-speed triage. The platform's policy team has used the moderation data to identify emerging violation patterns that did not exist at the time of model training, and a quarterly retraining process using fresh moderator-labeled data is now part of the operational cadence.
Ready to build something like this?
Tell us what you are building. We will scope it, price it honestly, and give you a clear plan.
Start a Conversation. Free 30-minute scoping call. No obligation.