Introduction
Composable AI evaluation and testing framework for Go.
Sentinel is a Go library for testing AI agents the way you'd evaluate a human professional. Instead of simple input/output scoring, you evaluate across multiple dimensions — skills, traits, behaviors, cognition, communication, perception, and persona coherence — the same building blocks that make people unique.
Sentinel is a library — not a service. You bring your own LLM provider, database, and HTTP server. Sentinel provides the evaluation orchestration plumbing.
The Human-Like Testing Model
Sentinel evaluates AI agents the way you would evaluate a person:
| Dimension | What it tests | Scorer |
|---|---|---|
| Skill | Can the agent do the job? (tool selection, proficiency) | skill_usage |
| Trait | Who is the agent? (personality consistency) | trait_consistency |
| Behavior | How does it react? (trigger-action patterns) | behavior_trigger |
| Cognition | How does it think? (phase transitions, depth) | cognitive_phase |
| Communication | How does it talk? (tone, formality, verbosity) | communication_style |
| Perception | What does it notice? (attention focus, detail) | perception_focus |
| Persona | The whole person (end-to-end identity coherence) | persona_coherence |
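To make the multi-dimensional model concrete, here is a minimal self-contained sketch of how per-dimension scores might be combined into one overall result. The `DimensionScore` type and `WeightedAverage` function are illustrative assumptions, not Sentinel's actual API:

```go
package main

import "fmt"

// DimensionScore holds one dimension's result, in [0, 1].
// These names are illustrative, not Sentinel's real types.
type DimensionScore struct {
	Dimension string
	Score     float64
	Weight    float64
}

// WeightedAverage combines per-dimension scores into a single
// overall score, weighting dimensions by importance.
func WeightedAverage(scores []DimensionScore) float64 {
	var sum, total float64
	for _, s := range scores {
		sum += s.Score * s.Weight
		total += s.Weight
	}
	if total == 0 {
		return 0
	}
	return sum / total
}

func main() {
	scores := []DimensionScore{
		{"skill_usage", 0.9, 2},
		{"trait_consistency", 0.7, 1},
		{"communication_style", 0.8, 1},
	}
	fmt.Printf("overall: %.2f\n", WeightedAverage(scores))
}
```

The key point is that each scorer in the table produces an independent signal; how you weight them is a policy decision for your suite.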
What it does
- Multi-dimensional scoring — Score across 7 human-like dimensions simultaneously, or use traditional input/output scorers.
- 22 built-in scorers — From exact match and regex to LLM-as-judge, semantic similarity, and all 7 persona-aware scorers.
- Scenario generation — Auto-generate test cases targeting specific evaluation dimensions.
- Baseline & regression detection — Track performance over time and detect regressions.
- Adversarial testing (red team) — Test for prompt injection, jailbreaks, data leakage, hallucination, and off-topic responses.
- Multi-model comparison — Compare performance across different LLMs or agent configurations.
- Prompt versioning — Track system prompt iterations with performance metrics.
- Plugin system — 16 lifecycle hooks for metrics, audit trails, and custom processing.
- Three storage backends — PostgreSQL, SQLite, and in-memory.
- Forge integration — Drop-in `forge.Extension` with DI-injected Engine and auto-registered HTTP routes.
- REST API — 32+ endpoints for managing suites, cases, runs, baselines, red team, prompts, scenarios, and reports.
- go test integration — Assertion functions for CI/CD pipelines.
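As a sketch of what baseline and regression detection boils down to, here is a small self-contained check that could back a CI assertion. `CheckRegression` is a hypothetical helper written for illustration, not a Sentinel function:

```go
package main

import "fmt"

// CheckRegression reports an error when the current score falls below
// the baseline by more than the allowed tolerance. This is an
// illustrative sketch, not Sentinel's actual API.
func CheckRegression(baseline, current, tolerance float64) error {
	if current < baseline-tolerance {
		return fmt.Errorf("regression detected: baseline %.2f, current %.2f (tolerance %.2f)",
			baseline, current, tolerance)
	}
	return nil
}

func main() {
	// A 0.85 -> 0.78 drop exceeds the 0.05 tolerance and fails.
	if err := CheckRegression(0.85, 0.78, 0.05); err != nil {
		fmt.Println(err)
	}
}
```

In a CI pipeline, a check like this is what turns tracked run history into a pass/fail gate.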
Design philosophy
Library, not service. Sentinel is a set of Go packages you import. You control main, the database connection, and the process lifecycle.
Human-like evaluation. You don't just test output — you test skills, personality, thinking patterns, and communication style. Traditional LLM evals still work as a subset.
Interfaces over implementations. Every subsystem defines a Go interface. Swap any storage backend with a single type change.
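A pared-down sketch of this interface-first pattern, assuming a much smaller `Store` interface than Sentinel actually defines:

```go
package main

import "fmt"

// Store is a deliberately tiny stand-in for the kind of interface each
// subsystem defines; Sentinel's real Store interface is larger.
type Store interface {
	SaveResult(id string, score float64) error
	GetResult(id string) (float64, bool)
}

// MemoryStore is an in-memory implementation of Store.
type MemoryStore struct{ results map[string]float64 }

func NewMemoryStore() *MemoryStore {
	return &MemoryStore{results: map[string]float64{}}
}

func (m *MemoryStore) SaveResult(id string, score float64) error {
	m.results[id] = score
	return nil
}

func (m *MemoryStore) GetResult(id string) (float64, bool) {
	s, ok := m.results[id]
	return s, ok
}

func main() {
	// Swapping backends means changing only this one assignment.
	var store Store = NewMemoryStore()
	store.SaveResult("eres_01example", 0.91)
	if s, ok := store.GetResult("eres_01example"); ok {
		fmt.Printf("score: %.2f\n", s)
	}
}
```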
Tenant-scoped by design. sentinel.WithTenant and sentinel.WithApp inject context enforced at every layer.
TypeID everywhere. All entities use type-prefixed, K-sortable, UUIDv7-based identifiers (suite_, tcase_, erun_, eres_, base_, etc.).
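The prefix convention can be illustrated with a small parser. `SplitTypeID` below is a sketch of the naming scheme only, not the actual TypeID library's encoding or parser, and the example ID body is made up:

```go
package main

import (
	"fmt"
	"strings"
)

// SplitTypeID separates a type-prefixed ID into its prefix and body,
// e.g. "erun_01h9zk3xample" -> ("erun", "01h9zk3xample").
// Illustrative only; not the typeid library's real parser.
func SplitTypeID(id string) (prefix, body string, ok bool) {
	i := strings.LastIndex(id, "_")
	if i <= 0 || i == len(id)-1 {
		return "", "", false
	}
	return id[:i], id[i+1:], true
}

func main() {
	prefix, body, ok := SplitTypeID("erun_01h9zk3xample")
	fmt.Println(prefix, body, ok)
}
```

Because the body is UUIDv7-based, IDs created later sort after IDs created earlier, which keeps database indexes and pagination cheap.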
Quick look
package main

import (
	"context"
	"log"

	"github.com/xraph/sentinel/engine"
	"github.com/xraph/sentinel/store/memory"
)

func main() {
	ctx := context.Background()

	// Create an in-memory store for development.
	memStore := memory.New()

	// Build the Sentinel engine.
	eng, err := engine.New(
		engine.WithStore(memStore),
	)
	if err != nil {
		log.Fatal(err)
	}

	_ = eng
	_ = ctx
}