Skip to main content
ServicesAI Safety & Red Teaming

Find what breaks your AI system before adversarial users do.

Prompt injection is the most exploited vulnerability in LLM-based systems right now. Indirect injection — malicious instructions embedded in documents your agent retrieves — is particularly dangerous for agentic systems because the attack does not require direct user interaction. NeMo Guardrails and Guardrails AI help, but they need to be configured against actual exploit attempts, not theoretical categories. We test first, then configure.

AI Safety & Red Teaming
The Problem

AI systems have attack surfaces that traditional software security testing does not cover. An LLM-based customer support agent can be manipulated via prompt injection to ignore its system prompt and respond as if it has no restrictions. A fine-tuned classifier can be fooled by adversarial examples — inputs crafted to produce a specific misclassification. A RAG system can be attacked via indirect prompt injection: malicious instructions embedded in retrieved documents that the LLM processes as content.

OWASP Top 10 for LLM Applications formalizes the most common vulnerabilities. Prompt injection (LLM01) is the most widely exploited: attacker input that overrides or supplements system prompt instructions. Insecure output handling (LLM02): downstream systems execute LLM output without validation — SQL injection via LLM-generated queries, HTML injection in rendered output. Training data poisoning (LLM03) and model denial of service (LLM04) round out the critical categories. NeMo Guardrails (NVIDIA) and Guardrails AI provide output filtering and policy enforcement — but they need to be configured against the actual exploits that work against your system, not generic categories.

AI attack surface categories we test
  • Direct prompt injection: user input that overrides or supplements system prompt instructions
  • Indirect prompt injection: malicious instructions embedded in retrieved or processed content
  • Jailbreaking: multi-turn, encoded, or role-playing inputs that bypass content filtering
  • Data extraction via inference: using model responses to reconstruct training data or system prompts
  • Adversarial examples: inputs crafted to produce specific misclassifications in detection models
  • Output handling vulnerabilities: LLM-generated content reaching security-relevant code paths without validation
Our Approach

Red teaming exercises follow a structured methodology adapted from offensive security practices. We establish scope (systems, attack categories, attacker personas), develop an attack playbook specific to your architecture, execute testing, and document every successful exploit with reproduction steps and impact assessment.

The output is not a theoretical vulnerability checklist — it is a prioritized set of actual findings from testing your specific system, with concrete remediation guidance including Guardrails AI and NeMo Guardrails configurations where they apply. What goes into your backlog is things that actually broke in testing, not things that might theoretically break.

Red team engagement process

01
Scope and threat model

Define systems in scope, worst-case outcomes (data exposure, unauthorized agent actions, compliance violations), and attacker personas most relevant to your threat model — internal users, external users, and automated attackers.

02
Attack playbook development

Develop a playbook of techniques relevant to your architecture: prompt injection variants, indirect injection via RAG retrieval, jailbreak attempts, adversarial input generation, data extraction probes, and workflow abuse scenarios specific to your agent's tool surface.

03
Adversarial testing execution

Execute the playbook against your systems. Document every successful exploit with reproduction steps, attack complexity, and impact severity. Run attacks multiple times to establish exploit rates — non-deterministic systems require statistical testing.

04
Findings report with remediation

Prioritized findings with CVSS-style severity ratings adapted for AI vulnerabilities. Each finding: description, exploit demonstration, business impact, remediation guidance including specific Guardrails AI or NeMo Guardrails configuration where applicable.

05
Remediation validation

Optional follow-up to validate that implemented remediations are effective and have not introduced new vulnerabilities. Re-test exploits from the original report.

What Is Included
01

Prompt injection testing (direct and indirect)

We test direct and indirect prompt injection using a comprehensive playbook: role-playing attacks, delimiter injection, instruction override, context manipulation, and indirect injection via content the LLM retrieves and processes. Indirect injection is particularly tested for agentic systems with RAG or email/document processing.

02

Agentic system adversarial testing

Agents with tool access have a larger attack surface than pure text generation. We test whether prompt injection can cause agents to take unintended actions — calling tools with crafted parameters, accessing out-of-scope data, sending unauthorized communications. The blast radius of each tool determines the severity of a successful injection.

03

Guardrails configuration

NeMo Guardrails and Guardrails AI provide output filtering and policy enforcement — but they need to be configured against actual exploit patterns. We configure guardrails based on the exploits that work against your specific system, not against generic categories.

04

Adversarial input generation for classifiers

For classification and detection models, we generate adversarial examples using black-box techniques. We measure model robustness and identify input regions where adversarial examples are most effective — surfaces to harden via adversarial training or confidence thresholds.

05

OWASP LLM Top 10 coverage

Our testing methodology covers all OWASP Top 10 for LLM Applications categories relevant to your architecture. We map findings to OWASP categories for compliance reporting and prioritization conversations.

Deliverables
  • Threat model covering system architecture and attacker personas
  • Attack playbook specific to your architecture and tool surface
  • Adversarial testing execution across all in-scope systems and attack categories
  • Findings report with prioritized exploits, reproduction steps, and business impact
  • Remediation guidance per finding — including guardrails configuration where applicable
  • Optional re-test to validate remediation effectiveness
Projected Impact

Red teaming surfaces exploitable vulnerabilities before adversarial users discover them. For agentic systems with tool access, this is high-stakes — an undetected prompt injection in an agent with write access to external systems is a significant operational and reputational risk that grows with agent autonomy.

FAQ

Common questions about this service.

Is AI red teaming different from traditional penetration testing?

Yes. Traditional pentesting looks for binary vulnerabilities in infrastructure, code, and protocols. AI red teaming tests probabilistic systems for failure modes that are often not binary: a prompt injection that works 30% of the time is still a documented vulnerability with a specific severity profile. The techniques — adversarial examples, jailbreaks, indirect injection — are specific to AI systems and require different methodology.

How do you handle non-determinism in LLM red teaming?

We run attacks multiple times to establish exploit rates rather than just presence or absence of vulnerability. A jailbreak that works one in ten attempts is a documented finding — with a different severity profile than one that works reliably. We use temperature 0 for reproducibility testing and document attack success rates across repeated attempts at production temperature.

How do we defend against prompt injection in agentic systems?

Defense in depth: input sanitization (flag or strip suspicious instruction patterns in user input), privilege separation (agent tool permissions are minimally scoped to what the task requires), output validation (tool call parameters are validated against schemas before execution), and content isolation (retrieved documents processed in a context separate from user instructions where possible). NeMo Guardrails or Guardrails AI configured against the specific injection patterns that succeed in testing. No single defense is sufficient.

Do you need model weights or just API access for testing?

For LLM-based system red teaming, API access is sufficient — we test the deployed system as an attacker would. For adversarial example testing of custom-trained classification models, access to model architecture and weights enables gradient-based attacks. We design engagement scope based on available access and the threat actors you are defending against.

Ready to get started?

Tell us what you are building. We will scope it, price it honestly, and give you a clear plan.

Start a Conversation

Free 30-minute scoping call. No obligation.