
AI-specific due diligence — model risk, data rights, vendor lock-in, demo vs. production gap.

The demo vs. production gap is the most common AI due diligence failure. A system performs impressively against vendor-curated test cases and poorly against your actual inputs. We test claimed capabilities against inputs representative of your use case, audit MLOps maturity, assess training data provenance and data rights, and evaluate vendor dependency risk — before you close.

Technical Due Diligence
The Problem

AI system due diligence has failure modes that general software due diligence does not. A system can have clean code, strong test coverage, and well-documented infrastructure — and still have AI capabilities that significantly underperform their claimed benchmarks on real-world inputs. Vendor-provided benchmark results are measured on evaluation datasets chosen to show the system favorably. Independent testing against inputs representative of the acquiring party's use case almost always produces different results.

AI-specific technical debt is invisible to code review focused on application quality. MLOps debt — training pipelines that cannot be reproduced, models without lineage, evaluation frameworks that do not reflect production conditions — affects system quality and improvability in ways that application code review does not surface. Data rights issues — training data with unclear licensing, scraping that violated terms of service, or PII in training datasets — are legal and reputational risks that require explicit investigation.

AI due diligence dimensions that code review alone misses
  • Capability testing against your representative inputs — not vendor-curated benchmarks
  • MLOps maturity: can the model be retrained, updated, and rolled back reliably?
  • Training data provenance and rights: can the dataset be audited for licensing and compliance?
  • Evaluation methodology quality: does the offline evaluation predict production performance?
  • Vendor dependency risk: what happens if a model API is deprecated or repriced?
  • The demo vs. production gap: does the system work on real user inputs, not just curated test cases?
Our Approach

We conduct AI technical due diligence in four layers: capability assessment (does the system do what it claims on your inputs?), code and infrastructure quality (is it maintainable and scalable?), AI-specific technical debt (MLOps maturity, data lineage, evaluation quality, data rights), and risk assessment (vendor lock-in, integration risk, operational risk at scale).

Capability assessment is conducted using your specific test cases, not vendor-provided benchmarks. We design a test dataset representative of your intended use and run the system against it, measuring the metrics that matter for your use case. This is the only reliable basis for acquisition decisions — vendor benchmarks are systematically optimistic.
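As a minimal sketch of what this testing looks like in practice: a labelled test set drawn from your own traffic, a metric that matters for your use case, and a comparison against the vendor's claimed figure. Everything here is illustrative — `call_system` is a hypothetical stand-in for the vendor's API, and the cases and numbers are invented.

```python
# Sketch of independent capability testing. `call_system` is a
# hypothetical stand-in for the vendor's API; the test cases and the
# claimed benchmark figure are illustrative.

def exact_match_accuracy(system, test_cases):
    """Fraction of cases where the system's output matches the label."""
    correct = sum(1 for inp, expected in test_cases if system(inp) == expected)
    return correct / len(test_cases)

def call_system(text):
    # Placeholder for the vendor API call; a deterministic stub here.
    return text.strip().lower()

# Representative cases carry real-traffic noise the demo never shows.
representative_cases = [
    ("  Refund Policy ", "refund policy"),
    ("SHIPPING delay??", "shipping delay"),
]

measured = exact_match_accuracy(call_system, representative_cases)
claimed = 0.97  # vendor's benchmark figure, for comparison
print(f"measured={measured:.2f} claimed={claimed:.2f} gap={claimed - measured:.2f}")
```

The design choice that matters is that the test set is sampled from your query distribution, not the vendor's, so the measured number is the one your users will actually see.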

Due diligence engagement structure

01
Scope definition and access requirements

Define the system components in scope, the acquisition or investment thesis, and the specific capability claims to test. Establish access requirements: API access, code repository, infrastructure documentation, data documentation, interview time with technical leads.

02
Independent capability testing

Design and execute tests using inputs representative of your use case. Document performance against your test cases and compare to claimed benchmarks. Map the demo vs. production gap explicitly.

03
Infrastructure and code audit

Review system architecture, code quality, test coverage, deployment processes, and operational procedures. Assess scalability and identify infrastructure risks at target scale.

04
AI-specific debt and data rights assessment

Audit MLOps maturity: experiment tracking, model registry, retraining pipeline, monitoring. Audit training data provenance, annotation quality, and licensing. Identify data rights risks.

05
Risk register and findings report

Prioritized findings report separating deal-breaker issues from negotiation-relevant items. Vendor dependency risk, integration risks, and operational cost model at target scale.

What Is Included
01

Independent capability testing against your inputs

We test AI capabilities against inputs representative of your use case — not vendor-provided benchmarks. This produces an honest measure of what you are acquiring, independent of how the vendor chose to present their system. The demo vs. production gap is quantified explicitly.

02

MLOps maturity assessment

Can models be retrained reliably? Are experiments reproducible? Is there a model registry with documented promotion criteria? Is production monitoring in place? MLOps debt is expensive to retrofit and determines how quickly the system can improve after acquisition.
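The questions above can be treated as an explicit rubric during audit interviews. The sketch below shows one way to score them; the criteria and the pass/fail findings are illustrative, not a standard maturity model.

```python
# Illustrative MLOps maturity rubric. The criteria and findings below
# are examples, not a standardized scoring framework.

MATURITY_CRITERIA = {
    "experiment_tracking":   "Are training runs logged with parameters and metrics?",
    "model_registry":        "Is there a registry with documented promotion criteria?",
    "reproducible_training": "Can the production model be retrained from scratch?",
    "rollback":              "Can a previous model version be restored quickly?",
    "production_monitoring": "Are prediction quality and drift monitored live?",
}

def maturity_score(findings):
    """findings maps each criterion to True/False from the audit."""
    met = sum(1 for name in MATURITY_CRITERIA if findings.get(name, False))
    gaps = [name for name in MATURITY_CRITERIA if not findings.get(name, False)]
    return met / len(MATURITY_CRITERIA), gaps

score, gaps = maturity_score({
    "experiment_tracking": True,
    "model_registry": False,
    "reproducible_training": True,
    "rollback": False,
    "production_monitoring": True,
})
print(f"maturity={score:.0%}, gaps={gaps}")
```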

03

Training data provenance and rights audit

We audit training data provenance, annotation quality documentation, and compliance posture. Unlicensed training data, scraping that violated terms of service, or PII in training datasets are legal and operational risks that need to surface before close.

04

Vendor dependency and lock-in analysis

We identify which capabilities depend on specific model APIs (OpenAI, Anthropic, Google), assess the risk of deprecation or pricing changes, and evaluate the portability of the system to alternative providers. Abstraction layer quality determines how expensive migration would be.
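A sketch of the kind of abstraction layer we look for when assessing lock-in. The provider names are real companies, but the interfaces are illustrative — not the vendors' actual SDK signatures, which are stubbed out here.

```python
# Illustrative provider abstraction layer. Real code would call the
# providers' SDKs inside each class; the interfaces here are
# hypothetical, used only to show the structure.

from abc import ABC, abstractmethod

class CompletionProvider(ABC):
    @abstractmethod
    def complete(self, prompt: str) -> str: ...

class OpenAIProvider(CompletionProvider):
    def complete(self, prompt: str) -> str:
        return f"[openai] {prompt}"  # SDK call stubbed for illustration

class AnthropicProvider(CompletionProvider):
    def complete(self, prompt: str) -> str:
        return f"[anthropic] {prompt}"  # SDK call stubbed for illustration

def summarize(provider: CompletionProvider, text: str) -> str:
    # Application code depends only on the interface, so swapping
    # providers is a constructor change rather than a rewrite.
    return provider.complete(f"Summarize: {text}")
```

Systems whose application code calls a vendor SDK directly at every site are the ones where migration cost is highest; an interface like this concentrates the provider dependency in one place.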

05

Infrastructure scalability analysis

We model infrastructure cost and performance at target scale. Systems that work at current load may have architecture bottlenecks or cost structures that do not scale to the acquiring party's volume requirements.
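A simplified version of this kind of unit-economics extrapolation is sketched below. All figures are hypothetical inputs; a real model would also account for volume discounts, caching, and non-linear infrastructure costs.

```python
# Illustrative inference cost extrapolation. All figures are
# hypothetical; real engagements use the target's actual unit costs.

def monthly_inference_cost(requests_per_month, tokens_per_request,
                           price_per_1k_tokens):
    """Linear per-token cost model: requests x tokens x unit price."""
    return requests_per_month * tokens_per_request / 1000 * price_per_1k_tokens

current = monthly_inference_cost(200_000, 1_500, 0.01)      # target's load today
at_scale = monthly_inference_cost(20_000_000, 1_500, 0.01)  # acquirer's volume
print(f"current=${current:,.0f}/mo  at_scale=${at_scale:,.0f}/mo")
```

Even this naive linear model makes the point: a cost structure that is trivial at the target's current volume can become a dominant line item at the acquirer's volume, which is exactly the finding the risk register needs to surface.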

Deliverables
  • Independent capability assessment report with testing results against representative inputs
  • Demo vs. production gap analysis
  • Code and infrastructure quality assessment with scalability risk identification
  • MLOps maturity assessment: experiment tracking, model registry, retraining, monitoring
  • Training data provenance and data rights audit
  • Vendor dependency and lock-in risk assessment
  • Risk register with deal-breaker issues and negotiation-relevant findings
Projected Impact

Technical due diligence that independently tests AI capabilities and surfaces MLOps debt, data rights issues, and vendor lock-in risk gives you the evidence that should drive acquisition pricing, integration planning, and the post-acquisition roadmap. The cost of a due diligence engagement is small relative to the cost of discovering these issues after close.

FAQ

Common questions about this service.

What is the demo vs. production gap and how do you measure it?

The demo vs. production gap is the difference between how a system performs on vendor-curated test cases and how it performs on real user inputs that the vendor did not select. We measure it by designing a test dataset representative of your intended use — based on your user base, query distribution, and edge cases — and running the system against it. The gap is quantified in the same metrics the vendor used for their benchmark claims.
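Concretely, the quantification step reduces to a per-metric comparison. The sketch below assumes you already have scores on both the vendor benchmark and your representative set; the metric names and figures are illustrative.

```python
# Sketch of quantifying the demo vs. production gap in the vendor's
# own metrics. Scores here are illustrative.

def demo_production_gap(vendor_scores, representative_scores):
    """Per-metric difference: positive values mean the system degrades
    on representative inputs relative to the vendor benchmark."""
    return {m: round(vendor_scores[m] - representative_scores[m], 3)
            for m in vendor_scores}

gap = demo_production_gap(
    vendor_scores={"accuracy": 0.96, "f1": 0.94},
    representative_scores={"accuracy": 0.81, "f1": 0.72},
)
print(gap)
```

Reporting the gap in the same metrics the vendor used for its claims keeps the comparison apples-to-apples and makes the finding directly usable in price negotiation.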

What are the most common deal-breaker findings in AI due diligence?

We flag these explicitly: capability that significantly underperforms claimed benchmarks on representative inputs, training data with licensing or compliance issues, no model retraining capability (the system cannot improve or be updated), critical security vulnerabilities including exposed training data, or cost models that are not viable at required scale. These are reported separately from negotiation-relevant findings.

Can you assess both API-based AI systems and custom-trained models?

Yes, with different emphases. For API-based systems we focus on prompt engineering quality, output handling, vendor dependency risk, and the cost model at scale. Custom-trained models add MLOps maturity, training data quality, evaluation methodology quality, and model portability. We tailor the assessment to the architecture.

What access do we need from the target company?

At minimum: system documentation, architecture diagrams, and API access for capability testing. For full due diligence: code repository access (read-only), infrastructure documentation for cost analysis, data documentation, and interview time with technical leads. We scope the engagement based on available access under the NDA structure in place.

Ready to get started?

Tell us what you are building. We will scope it, price it honestly, and give you a clear plan.

Start a Conversation

Free 30-minute scoping call. No obligation.