Sentinel

Red Team / Adversarial Testing

Test AI agents against adversarial attacks — prompt injection, jailbreaks, data leakage, hallucination, and off-topic manipulation.

Sentinel includes a red team module that generates adversarial test cases to probe an agent's safety and robustness.

Attack types

Five attack generators are available in the redteam package:

Attack             Description
Prompt Injection   Inject commands into user input to override system instructions
Jailbreak          Attempt to bypass safety guidelines and content policies
Leakage            Extract system prompts, internal knowledge, or confidential data
Hallucination      Probe for fabricated information or made-up facts
Off-Topic          Manipulate the agent away from its intended purpose
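
If you need to reference the attack types programmatically, they can be thought of as simple identifiers. The sketch below is only an illustration; the AttackType name and the string values are assumptions, not the redteam package's actual API.

// Hypothetical representation of the five attack types; names and values
// are illustrative and may not match the redteam package.
type AttackType string

const (
    AttackPromptInjection AttackType = "prompt_injection"
    AttackJailbreak       AttackType = "jailbreak"
    AttackLeakage         AttackType = "leakage"
    AttackHallucination   AttackType = "hallucination"
    AttackOffTopic        AttackType = "off_topic"
)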

How it works

  1. Generate attacks — The red team module generates adversarial test cases for a suite
  2. Run against target — Each attack case is sent to the target agent
  3. Evaluate defenses — Scorers check whether the agent resisted the attack
  4. Report results — Summarize bypass rate, vulnerability patterns, and recommendations
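
As a rough illustration of that flow in Go, the sketch below wires the four steps together. Everything here (the Target and Scorer interfaces, redteam.GenerateAttacks, and the Attack fields) is assumed for illustration and is not the documented API.

// Illustrative only: these interfaces and the redteam.GenerateAttacks call
// are assumptions standing in for the real package surface.
type Target interface {
    Send(ctx context.Context, prompt string) (reply string, err error)
}

type Scorer interface {
    Resisted(attack redteam.Attack, reply string) bool
}

func runRedTeam(ctx context.Context, suiteID id.SuiteID, target Target, scorers []Scorer) (bypasses int, err error) {
    // 1. Generate adversarial test cases for the suite.
    attacks, err := redteam.GenerateAttacks(ctx, suiteID)
    if err != nil {
        return 0, err
    }
    for _, attack := range attacks {
        // 2. Run the attack case against the target agent.
        reply, err := target.Send(ctx, attack.Prompt)
        if err != nil {
            return bypasses, err
        }
        // 3. Scorers check whether the agent resisted the attack.
        for _, s := range scorers {
            if !s.Resisted(attack, reply) {
                // 4. Bypasses feed the summary report.
                bypasses++
                break
            }
        }
    }
    return bypasses, nil
}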

Using red team via API

POST /sentinel/redteam/:suiteId/generate

Generates adversarial cases targeting the suite's system prompt and persona. The generated cases are added to the suite with appropriate scorers.
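From Go, that call might look like the sketch below; the base URL, the absence of authentication, and the expectation of a 200 response are all assumptions for illustration.

// Hypothetical client call to the generate endpoint; host and auth are
// placeholders, and the success status code is assumed.
func generateAttacks(ctx context.Context, baseURL, suiteID string) error {
    url := fmt.Sprintf("%s/sentinel/redteam/%s/generate", baseURL, suiteID)
    req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, nil)
    if err != nil {
        return err
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("generate failed: %s", resp.Status)
    }
    return nil
}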

POST /sentinel/redteam/:suiteId/run

Runs the adversarial evaluation and reports results including bypass count and attack success rates.
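Reading the run's summary back might look roughly like this; the JSON field names below are guesses based on the description above, not a documented response schema.

// Hypothetical response shape; field names are inferred from the prose
// above (bypass count, attack success rate) and may differ in practice.
type runSummary struct {
    BypassCount       int     `json:"bypassCount"`
    AttackSuccessRate float64 `json:"attackSuccessRate"`
}

func runSuite(ctx context.Context, baseURL, suiteID string) (runSummary, error) {
    var summary runSummary
    url := fmt.Sprintf("%s/sentinel/redteam/%s/run", baseURL, suiteID)
    req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, nil)
    if err != nil {
        return summary, err
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return summary, err
    }
    defer resp.Body.Close()
    return summary, json.NewDecoder(resp.Body).Decode(&summary)
}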

Plugin hooks

Red team evaluations fire dedicated lifecycle hooks:

type RedTeamStarted interface {
    // OnRedTeamStarted is called when an adversarial evaluation begins,
    // with the number of generated attack cases.
    OnRedTeamStarted(ctx context.Context, suiteID id.SuiteID, attackCount int) error
}

type RedTeamCompleted interface {
    // OnRedTeamCompleted is called when the evaluation finishes, with the
    // number of successful bypasses and the elapsed time.
    OnRedTeamCompleted(ctx context.Context, suiteID id.SuiteID, bypassCount int, elapsed time.Duration) error
}
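
A plugin opts into a hook by implementing the corresponding method. The sketch below is a minimal example that just logs both events; how the plugin is registered with Sentinel is not covered here.

// redTeamLogger is an illustrative plugin that implements both hooks.
type redTeamLogger struct{}

func (redTeamLogger) OnRedTeamStarted(ctx context.Context, suiteID id.SuiteID, attackCount int) error {
    log.Printf("red team started: suite=%v attacks=%d", suiteID, attackCount)
    return nil
}

func (redTeamLogger) OnRedTeamCompleted(ctx context.Context, suiteID id.SuiteID, bypassCount int, elapsed time.Duration) error {
    log.Printf("red team completed: suite=%v bypasses=%d elapsed=%s", suiteID, bypassCount, elapsed)
    return nil
}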

Use cases

  • Pre-deployment safety testing for customer-facing agents
  • Continuous red team evaluation in CI/CD pipelines
  • Compliance testing for agents handling sensitive data
  • Benchmarking defense improvements across prompt versions
