Sentinel

Red Team / Adversarial Testing

Test AI agents against adversarial attacks — prompt injection, jailbreaks, data leakage, hallucination, and off-topic manipulation.

Sentinel includes a red team module that generates adversarial test cases to probe an agent's safety and robustness.

Attack types

Five attack generators are available in the redteam package:

Attack             Description
Prompt Injection   Inject commands into user input to override system instructions
Jailbreak          Attempt to bypass safety guidelines and content policies
Leakage            Extract system prompts, internal knowledge, or confidential data
Hallucination      Probe for fabricated information or made-up facts
Off-Topic          Manipulate the agent away from its intended purpose
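
If you need to reference the attack types programmatically, they can be thought of as simple identifiers. The sketch below is only an illustration; the AttackType name and the string values are assumptions, not the redteam package's actual API.

// Hypothetical representation of the five attack types; names and values
// are illustrative and may not match the redteam package.
type AttackType string

const (
    AttackPromptInjection AttackType = "prompt_injection"
    AttackJailbreak       AttackType = "jailbreak"
    AttackLeakage         AttackType = "leakage"
    AttackHallucination   AttackType = "hallucination"
    AttackOffTopic        AttackType = "off_topic"
)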

How it works

  1. Generate attacks — The red team module generates adversarial test cases for a suite
  2. Run against target — Each attack case is sent to the target agent
  3. Evaluate defenses — Scorers check whether the agent resisted the attack
  4. Report results — Summarize bypass rate, vulnerability patterns, and recommendations
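
As a rough illustration of that flow in Go, the sketch below wires the four steps together. Everything here (the Target and Scorer interfaces, redteam.GenerateAttacks, and the Attack fields) is assumed for illustration and is not the documented API.

// Illustrative only: these interfaces and the redteam.GenerateAttacks call
// are assumptions standing in for the real package surface.
type Target interface {
    Send(ctx context.Context, prompt string) (reply string, err error)
}

type Scorer interface {
    Resisted(attack redteam.Attack, reply string) bool
}

func runRedTeam(ctx context.Context, suiteID id.SuiteID, target Target, scorers []Scorer) (bypasses int, err error) {
    // 1. Generate adversarial test cases for the suite.
    attacks, err := redteam.GenerateAttacks(ctx, suiteID)
    if err != nil {
        return 0, err
    }
    for _, attack := range attacks {
        // 2. Run the attack case against the target agent.
        reply, err := target.Send(ctx, attack.Prompt)
        if err != nil {
            return bypasses, err
        }
        // 3. Scorers check whether the agent resisted the attack.
        for _, s := range scorers {
            if !s.Resisted(attack, reply) {
                // 4. Bypasses feed the summary report.
                bypasses++
                break
            }
        }
    }
    return bypasses, nil
}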

Using red team via API

POST /sentinel/redteam/:suiteId/generate

Generates adversarial cases targeting the suite's system prompt and persona. The generated cases are added to the suite with appropriate scorers.
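From Go, that call might look like the sketch below; the base URL, the absence of authentication, and the expectation of a 200 response are all assumptions for illustration.

// Hypothetical client call to the generate endpoint; host and auth are
// placeholders, and the success status code is assumed.
func generateAttacks(ctx context.Context, baseURL, suiteID string) error {
    url := fmt.Sprintf("%s/sentinel/redteam/%s/generate", baseURL, suiteID)
    req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, nil)
    if err != nil {
        return err
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("generate failed: %s", resp.Status)
    }
    return nil
}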

POST /sentinel/redteam/:suiteId/run

Runs the adversarial evaluation and reports results including bypass count and attack success rates.
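Reading the run's summary back might look roughly like this; the JSON field names below are guesses based on the description above, not a documented response schema.

// Hypothetical response shape; field names are inferred from the prose
// above (bypass count, attack success rate) and may differ in practice.
type runSummary struct {
    BypassCount       int     `json:"bypassCount"`
    AttackSuccessRate float64 `json:"attackSuccessRate"`
}

func runSuite(ctx context.Context, baseURL, suiteID string) (runSummary, error) {
    var summary runSummary
    url := fmt.Sprintf("%s/sentinel/redteam/%s/run", baseURL, suiteID)
    req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, nil)
    if err != nil {
        return summary, err
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return summary, err
    }
    defer resp.Body.Close()
    return summary, json.NewDecoder(resp.Body).Decode(&summary)
}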

Plugin hooks

Red team evaluations fire dedicated lifecycle hooks:

type RedTeamStarted interface {
    // OnRedTeamStarted is called when an adversarial evaluation begins,
    // with the number of generated attack cases.
    OnRedTeamStarted(ctx context.Context, suiteID id.SuiteID, attackCount int) error
}

type RedTeamCompleted interface {
    // OnRedTeamCompleted is called when the evaluation finishes, with the
    // number of successful bypasses and the elapsed time.
    OnRedTeamCompleted(ctx context.Context, suiteID id.SuiteID, bypassCount int, elapsed time.Duration) error
}
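
A plugin opts into a hook by implementing the corresponding method. The sketch below is a minimal example that just logs both events; how the plugin is registered with Sentinel is not covered here.

// redTeamLogger is an illustrative plugin that implements both hooks.
type redTeamLogger struct{}

func (redTeamLogger) OnRedTeamStarted(ctx context.Context, suiteID id.SuiteID, attackCount int) error {
    log.Printf("red team started: suite=%v attacks=%d", suiteID, attackCount)
    return nil
}

func (redTeamLogger) OnRedTeamCompleted(ctx context.Context, suiteID id.SuiteID, bypassCount int, elapsed time.Duration) error {
    log.Printf("red team completed: suite=%v bypasses=%d elapsed=%s", suiteID, bypassCount, elapsed)
    return nil
}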

Use cases

  • Pre-deployment safety testing for customer-facing agents
  • Continuous red team evaluation in CI/CD pipelines
  • Compliance testing for agents handling sensitive data
  • Benchmarking defense improvements across prompt versions
