Red Team / Adversarial Testing
Test AI agents against adversarial attacks — prompt injection, jailbreaks, data leakage, hallucination, and off-topic manipulation.
Sentinel includes a red team module that generates adversarial test cases to probe an agent's safety and robustness.
Attack types
Five attack generators are available in the `redteam` package:
| Attack | Description |
|---|---|
| Prompt Injection | Inject commands into user input to override system instructions |
| Jailbreak | Attempt to bypass safety guidelines and content policies |
| Leakage | Extract system prompts, internal knowledge, or confidential data |
| Hallucination | Probe for fabricated information or made-up facts |
| Off-Topic | Manipulate the agent away from its intended purpose |
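For illustration, these categories could be modeled as a small enumeration. The sketch below is only an assumption about shape; the `AttackType` name and string values are placeholders, not identifiers from the `redteam` package itself.

```go
// Illustrative sketch only: the redteam package may name or structure these differently.
type AttackType string

const (
	AttackPromptInjection AttackType = "prompt_injection" // override system instructions via user input
	AttackJailbreak       AttackType = "jailbreak"        // bypass safety guidelines and content policies
	AttackLeakage         AttackType = "leakage"          // extract system prompts or confidential data
	AttackHallucination   AttackType = "hallucination"    // probe for fabricated facts
	AttackOffTopic        AttackType = "off_topic"        // pull the agent away from its intended purpose
)
```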
How it works
- Generate attacks — The red team module generates adversarial test cases for a suite
- Run against target — Each attack case is sent to the target agent
- Evaluate defenses — Scorers check whether the agent resisted the attack
- Report results — Summarize bypass rate, vulnerability patterns, and recommendations
Using red team via API
`POST /sentinel/redteam/:suiteId/generate`

Generates adversarial cases targeting the suite's system prompt and persona. The generated cases are added to the suite with appropriate scorers.

`POST /sentinel/redteam/:suiteId/run`

Runs the adversarial evaluation and reports results including bypass count and attack success rates.
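As a usage sketch, a Go client could drive both endpoints as shown below. The base URL, the empty request bodies, and the plain-text handling of the report are assumptions for illustration; adjust them to your deployment and the actual response schema.

```go
import (
	"fmt"
	"io"
	"log"
	"net/http"
)

// runRedTeamSuite generates adversarial cases for a suite, runs them, and
// prints the resulting report. Base URL and request bodies are placeholders.
func runRedTeamSuite(suiteID string) {
	base := "http://localhost:8080/sentinel/redteam/" + suiteID

	// Generate adversarial cases targeting the suite's system prompt and persona.
	resp, err := http.Post(base+"/generate", "application/json", nil)
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()

	// Run the adversarial evaluation and read back the report.
	resp, err = http.Post(base+"/run", "application/json", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	report, _ := io.ReadAll(resp.Body)
	fmt.Println(string(report)) // bypass count, attack success rates, recommendations
}
```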
Plugin hooks
Red team evaluations fire dedicated lifecycle hooks:
```go
type RedTeamStarted interface {
	OnRedTeamStarted(ctx context.Context, suiteID id.SuiteID, attackCount int) error
}

type RedTeamCompleted interface {
	OnRedTeamCompleted(ctx context.Context, suiteID id.SuiteID, bypassCount int, elapsed time.Duration) error
}
```
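For example, a plugin that simply logs these events could implement both interfaces as in the sketch below; how the plugin is registered with Sentinel is not shown, and the logging is only illustrative.

```go
// loggingPlugin is an illustrative plugin that records red team lifecycle events.
type loggingPlugin struct{}

func (p *loggingPlugin) OnRedTeamStarted(ctx context.Context, suiteID id.SuiteID, attackCount int) error {
	log.Printf("red team started: suite=%v attacks=%d", suiteID, attackCount)
	return nil
}

func (p *loggingPlugin) OnRedTeamCompleted(ctx context.Context, suiteID id.SuiteID, bypassCount int, elapsed time.Duration) error {
	log.Printf("red team completed: suite=%v bypassed=%d elapsed=%s", suiteID, bypassCount, elapsed)
	return nil
}
```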
Use cases
- Pre-deployment safety testing for customer-facing agents
- Continuous red team evaluation in CI/CD pipelines
- Compliance testing for agents handling sensitive data
- Benchmarking defense improvements across prompt versions