Post-transformer NLP — small models, structured output, function calling.
Function calling is the new NLP paradigm. Structured output (JSON mode) from a well-prompted LLM replaces most of what intent classification and slot-filling did. But high-volume classification at LLM prices rarely pencils out: fine-tuned SLMs run 10-100x cheaper on well-defined tasks. We build NLP systems that pick the right architecture — SLM, spaCy pipeline, or LLM with structured output — for each task.
The post-transformer NLP landscape has restructured the architecture decision space. Traditional NLP pipelines — rule-based entity extraction, intent classifiers, slot-filling models — have largely been replaced by two cleaner regimes. For complex, open-ended tasks: LLMs with structured output (JSON mode) or function calling. For high-volume, well-defined tasks: small language models (SLMs) fine-tuned on domain-specific data.
The architecture mistake in both directions is expensive. Using GPT-4o for a support ticket classifier that runs at high volume burns budget that a fine-tuned DeBERTa-v3 or Phi-3-mini could handle at 50x lower inference cost with equivalent accuracy on the specific task. Using a fine-tuned SLM for a task that requires multi-document reasoning or complex judgment produces poor results where a prompted LLM would handle it correctly. The function calling pattern — where you define a JSON schema and the model populates it — handles a large class of structured extraction tasks that previously required custom NLP pipelines.
- High-volume classification with labeled data → fine-tuned SLM (DeBERTa-v3, Phi-3-mini)
- Structured extraction from documents → LLM with JSON mode or function calling
- Named entity recognition in specialized domains → spaCy pipeline with custom components
- Complex multi-step reasoning → LLM with chain-of-thought prompting
- Semantic search → embedding models (text-embedding-3-large, BGE, E5) + vector index
- On-device NLP with privacy constraints → quantized SLM via ONNX or CoreML
We start every NLP engagement with task characterization: what is the input, what is the required output, what accuracy is acceptable, and what are the latency and throughput constraints. This determines whether the right approach is a fine-tuned SLM, a spaCy extraction pipeline, a RAG system, or a prompted LLM with structured output.
For structured extraction, we use the LLM function calling pattern with well-defined JSON schemas and field descriptions. For high-volume classification where cost is a concern, we fine-tune SLMs from the Hugging Face Hub on domain-specific labeled data. For entity extraction and NLP preprocessing, spaCy production pipelines handle throughput that LLMs cannot approach at viable cost.
NLP system build process
Define input/output specification, accuracy requirements, latency budget, throughput targets, and cost constraints. Select architecture based on requirements — not on what is most technically interesting.
Design JSON schemas for function calling or JSON mode with field descriptions that give the model semantic context. Validate against a sample dataset to catch schema ambiguities before production.
Fine-tune a base model from the Hugging Face Hub on domain-specific labeled data. Evaluate against task-appropriate metrics: F1 for sequence labeling, macro-F1 for classification. Report per-class performance — aggregate accuracy hides class imbalance.
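The per-class reporting point above can be made concrete with a small, self-contained sketch (pure Python, toy labels invented for illustration): on an imbalanced dataset, a classifier that ignores the rare class still scores 80% aggregate accuracy, but per-class F1 and macro-F1 expose the failure.

```python
def per_class_f1(y_true, y_pred):
    """Per-class precision/recall/F1 plus macro-F1, computed from scratch."""
    labels = sorted(set(y_true) | set(y_pred))
    report = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    macro_f1 = sum(m["f1"] for m in report.values()) / len(labels)
    return report, macro_f1

# Imbalanced toy data: 80% aggregate accuracy, but the rare class is never predicted.
y_true = ["billing"] * 8 + ["refund"] * 2
y_pred = ["billing"] * 10
report, macro = per_class_f1(y_true, y_pred)
# report["refund"]["f1"] is 0.0 even though accuracy is 0.8 — macro-F1 surfaces this.
```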
FastAPI serving endpoint with batch processing support, confidence scoring, and human escalation routing for low-confidence outputs. vLLM or TGI for high-throughput SLM serving.
Confidence score distributions and prediction class distributions tracked over time. Distribution shift triggers review before user-visible degradation.
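One common way to operationalize the distribution-shift trigger is the Population Stability Index over predicted-class frequencies. A minimal sketch (the class names and the 0.25 alert threshold are illustrative conventions, not fixed rules):

```python
import math
from collections import Counter

def psi(baseline, current, eps=1e-6):
    """Population Stability Index between two lists of predicted labels.
    Rough convention: < 0.1 stable, 0.1-0.25 drifting, > 0.25 shifted."""
    labels = set(baseline) | set(current)
    b_freq, c_freq = Counter(baseline), Counter(current)
    score = 0.0
    for label in labels:
        b = b_freq[label] / len(baseline) or eps  # floor zero bins to avoid log(0)
        c = c_freq[label] / len(current) or eps
        score += (c - b) * math.log(c / b)
    return score

# Escalation rate tripled week-over-week: PSI exceeds the alert threshold.
baseline = ["resolved"] * 90 + ["escalated"] * 10
current = ["resolved"] * 70 + ["escalated"] * 30
drift = psi(baseline, current)  # > 0.25: trigger review before users notice
```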
Function calling and JSON mode for structured extraction
The function calling pattern replaces most of what custom NLP pipelines did for structured extraction tasks. We design JSON schemas with field descriptions that give LLMs the semantic context to populate them accurately, and validate output against Zod or Pydantic schemas in production.
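A sketch of the pattern: a tool definition in the JSON-Schema shape that function calling APIs accept, plus a minimal stdlib validator standing in for Pydantic/Zod. The `extract_invoice` schema and its field names are invented for illustration.

```python
import json

# Tool definition in the JSON-Schema shape function calling APIs accept.
# Field descriptions give the model semantic context for each slot.
extract_invoice = {
    "name": "extract_invoice",
    "description": "Extract structured fields from an invoice.",
    "parameters": {
        "type": "object",
        "properties": {
            "vendor": {"type": "string", "description": "Legal name of the issuing vendor"},
            "total": {"type": "number", "description": "Grand total including tax"},
            "currency": {"type": "string", "description": "ISO 4217 code, e.g. USD"},
        },
        "required": ["vendor", "total", "currency"],
    },
}

def validate_arguments(raw_json, schema):
    """Minimal stand-in for Pydantic/Zod: check required fields and types."""
    type_map = {"string": str, "number": (int, float)}
    data = json.loads(raw_json)
    params = schema["parameters"]
    for field in params["required"]:
        if field not in data:
            raise ValueError(f"missing required field: {field}")
    for field, spec in params["properties"].items():
        if field in data and not isinstance(data[field], type_map[spec["type"]]):
            raise ValueError(f"wrong type for {field}")
    return data

# A model response populating the schema, validated before anything downstream runs:
args = validate_arguments('{"vendor": "Acme GmbH", "total": 1249.5, "currency": "EUR"}', extract_invoice)
```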
SLM fine-tuning for high-volume tasks
For classification, extraction, and labeling tasks that run at high volume, fine-tuned SLMs from the Hugging Face Hub (DeBERTa-v3, Phi-3-mini, Mistral-7B) deliver competitive accuracy at dramatically lower inference cost than GPT-4-class models. We fine-tune, evaluate, and serve using vLLM or TGI.
spaCy production pipelines
spaCy is the standard for production NLP requiring high throughput and low latency. We build custom spaCy components for domain-specific entity types, relation extraction, and text normalization — integrated into the same pipeline infrastructure as transformer-based components.
Semantic search infrastructure
Embedding-based semantic search using text-embedding-3-large, BGE, or E5 models with vector indices — pgvector for Postgres-native deployments, Pinecone or Weaviate for managed search. Search quality evaluated against human-relevance baselines.
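The retrieval core is embed-then-rank by cosine similarity. A self-contained sketch with hand-made 3-d vectors standing in for real embeddings (which would come from text-embedding-3-large, BGE, or E5 via a vector index):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, corpus, k=2):
    """Rank documents by cosine similarity to the query embedding."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in corpus.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:k]]

# Toy 3-d "embeddings"; in production these come from an embedding model
# and the ranking is done by pgvector, Pinecone, or Weaviate.
corpus = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "warranty-terms": [0.8, 0.2, 0.1],
}
results = top_k([1.0, 0.0, 0.0], corpus)  # nearest documents first
```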
On-device NLP
For privacy-sensitive or latency-critical NLP that cannot go to the cloud, we deploy quantized SLMs via ONNX Runtime or CoreML on Apple Silicon. Practical for classification and extraction tasks where accuracy trade-offs from quantization are acceptable.
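The quantization trade-off mentioned above can be shown in miniature. This is a sketch of symmetric int8 quantization, not the ONNX Runtime or CoreML tooling itself: weights map to one shared scale, and the round-trip error is bounded by half that scale.

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats to [-127, 127] with one scale."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.52, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
# max_err <= scale / 2 — the bounded accuracy loss on-device NLP accepts
# in exchange for 4x smaller weights and faster integer arithmetic.
```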
- Task characterization document with architecture recommendation and cost analysis
- JSON schema design for function calling or structured output (if LLM-based)
- Fine-tuned SLM with per-class evaluation metrics (if applicable)
- Production serving endpoint with confidence scoring and human escalation routing
- Batch processing pipeline for high-volume workloads
- Monitoring setup with prediction distribution and confidence tracking
NLP systems reduce manual effort in text-heavy workflows — document review, support triage, contract analysis — by handling high-confidence cases automatically and routing complex cases to humans. The cost advantage of SLMs over LLMs scales with processing volume and determines whether the unit economics work at scale.
Common questions about this service.
When does fine-tuning an SLM beat prompting a large LLM?
Fine-tuning wins on: high-volume tasks where inference cost matters, tasks requiring consistent structured output format, domain-specific vocabulary or reasoning patterns the base model handles poorly, and latency-sensitive applications. Prompting large LLMs wins on: tasks with insufficient training data, highly varied open-ended inputs, multi-step reasoning across long contexts, and low-volume tasks where engineering time is the dominant cost.
Is function calling reliable enough for production structured extraction?
Yes, for well-defined schemas with clear field descriptions. GPT-4o and Claude 3.5 Sonnet reliably populate well-designed schemas. The reliability drops when schemas are ambiguous, fields have overlapping semantics, or the input text is highly unstructured. We design schemas with field descriptions that eliminate ambiguity, validate outputs with Pydantic or Zod, and route failures to retry or human review.
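The retry-then-escalate routing described above looks roughly like this. The model call is a stub (`call_model`) and the field names are illustrative; the point is the control flow: validate, retry a bounded number of times, then hand off to human review rather than ship a bad extraction.

```python
import json

def extract_with_fallback(call_model, raw_text, required, max_retries=2):
    """Validate model output; retry on failure, then route to human review.
    `call_model` is a stand-in for the real LLM call."""
    for attempt in range(max_retries + 1):
        try:
            data = json.loads(call_model(raw_text, attempt))
            missing = [f for f in required if f not in data]
            if missing:
                raise ValueError(f"missing fields: {missing}")
            return {"status": "ok", "data": data}
        except (json.JSONDecodeError, ValueError):
            continue  # malformed or incomplete output: try again
    return {"status": "human_review", "input": raw_text}

# Stub model that fails its first attempt and succeeds on retry:
def flaky_model(text, attempt):
    return "not json" if attempt == 0 else '{"vendor": "Acme", "total": 12.5}'

result = extract_with_fallback(flaky_model, "invoice text...", ["vendor", "total"])
```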
What is RAG and when is it appropriate?
RAG (retrieval-augmented generation) combines a retrieval system — semantic search over your document corpus — with an LLM that generates answers grounded in the retrieved content. It is appropriate when you need LLM-quality responses about a knowledge base that changes frequently: internal documentation, product catalogs, regulatory content. It is not appropriate for tasks requiring reasoning across the entire corpus simultaneously rather than retrieving relevant passages.
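RAG reduces to two steps: retrieve relevant passages, then ground the prompt in them. In this sketch the retriever is a toy keyword-overlap scorer so the example stays self-contained; a real system would use embedding similarity over a vector index, and the corpus and prompt wording are invented for illustration.

```python
def retrieve(query, corpus, k=2):
    """Toy retriever: score passages by query-term overlap.
    Production systems rank by embedding similarity instead."""
    terms = set(query.lower().split())
    scored = sorted(corpus, key=lambda p: -len(terms & set(p.lower().split())))
    return scored[:k]

def build_grounded_prompt(query, passages):
    """Constrain the LLM to answer from retrieved content, not parametric memory."""
    context = "\n---\n".join(passages)
    return (
        "Answer using only the context below. If the answer is not in the "
        f"context, say so.\n\nContext:\n{context}\n\nQuestion: {query}"
    )

corpus = [
    "Refunds are issued within 14 days of a return being received.",
    "Standard shipping takes 3-5 business days.",
    "Warranty claims require proof of purchase.",
]
passages = retrieve("how long do refunds take", corpus, k=1)
prompt = build_grounded_prompt("How long do refunds take?", passages)
```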
How much labeled data do we need for SLM fine-tuning?
Classification tasks: hundreds to a few thousand examples per class can achieve strong performance with modern pre-trained models — fine-tuning starts from a model that already understands language. Named entity recognition: depends on entity type diversity and domain specificity. We assess data requirements during task characterization and give realistic estimates before committing to a fine-tuning approach.
Ready to get started?
Tell us what you are building. We will scope it, price it honestly, and give you a clear plan.
Start a Conversation
Free 30-minute scoping call. No obligation.
