Sentinel your AI
Define evaluation suites, score AI outputs across human-like dimensions, detect regressions against baselines, and red team your models — tenant-scoped, plugin-extensible, and Forge-native.
```sh
go get github.com/xraph/sentinel
```

Everything you need for AI evaluation
Sentinel handles the hard parts — scoring, baselines, regression detection, red teaming, and multi-tenancy — so you can focus on your application.
Human-Like Scoring Pipeline
22 built-in scorers across 7 human-like dimensions. Score AI outputs for cognitive phase, perception focus, skill usage, behavior triggers, empathy, length, and LLM-as-judge quality.
```go
result, _ := engine.RunEval(ctx, suiteID, sentinel.RunEvalInput{
    Model: "gpt-4o",
    Scorers: []scorer.Scorer{
        scorer.Length(),
        scorer.LLMJudge(llmClient),
        scorer.CognitivePhase(),
    },
})
// result.PassRate = 0.92
```

Baseline & Regression Detection
Save evaluation baselines and automatically detect regressions when scores drop. Compare across prompt versions, models, and configurations.
```go
// Save a baseline snapshot
bl, _ := engine.SaveBaseline(ctx, suiteID)

// Later: detect regressions
report, _ := engine.DetectRegression(ctx, suiteID, bl.ID)
// report.Delta = -0.05 (5% drop)
```

Multi-Tenant Isolation
Every suite, case, and eval run is scoped to a tenant via context. Cross-tenant queries are structurally impossible.
```go
ctx = sentinel.WithTenant(ctx, "tenant-1")
ctx = sentinel.WithApp(ctx, "myapp")

// All suites, cases, and eval runs are
// automatically scoped to tenant-1
```

Pluggable Store Backends
Start with in-memory for development, swap to SQLite or PostgreSQL for production. Every subsystem is a Go interface.
```go
engine, _ := sentinel.NewEngine(
    sentinel.WithStore(postgres.New(pool)),
    sentinel.WithLogger(slog.Default()),
)
// Also: memory.New(), sqlite.New(db)
```

Red Team Testing
5 built-in attack generators — prompt injection, jailbreak, PII extraction, hallucination probes, and bias detection. Measure model resilience.
```go
report, _ := engine.RunRedTeam(ctx, suiteID, redteam.Config{
    Attacks: []redteam.AttackType{
        redteam.PromptInjection,
        redteam.Jailbreak,
        redteam.PIIExtraction,
    },
})
// report.BypassCount = 2
```

Scenario Types & Persona Evaluation
8 scenario types — factual, creative, safety, summarization, classification, extraction, conversation, and reasoning. Run persona-aware evaluations with dimension scoring.
```go
_, _ = engine.CreateCase(ctx, suiteID, sentinel.CreateCaseInput{
    Input:    "Summarize this article...",
    Expected: "Key points: ...",
    Scenario: "summarization",
    Tags:     []string{"news", "concise"},
})
// Scenarios: factual, creative, safety,
// summarization, classification, extraction,
// conversation, reasoning
```

From test case to confidence score.
Sentinel orchestrates the entire evaluation lifecycle — case invocation, multi-scorer aggregation, baseline comparison, and regression detection.
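The multi-scorer aggregation step can be pictured with a small self-contained sketch. This is not Sentinel's actual implementation — the `Scorer` type, `aggregate` function, and threshold here are illustrative assumptions — but it shows the idea of rolling several per-dimension scores into one pass/fail result:

```go
package main

import "fmt"

// Scorer is a hypothetical stand-in for a scoring function:
// it maps a model output to a score in [0, 1].
type Scorer func(output string) float64

// aggregate averages the scores from every scorer and compares the
// mean against a pass threshold, turning many dimension scores into
// a single pass/fail verdict.
func aggregate(output string, scorers []Scorer, threshold float64) (float64, bool) {
    total := 0.0
    for _, s := range scorers {
        total += s(output)
    }
    mean := total / float64(len(scorers))
    return mean, mean >= threshold
}

func main() {
    // Toy scorers: one checks output length, one is a fixed stub score.
    lengthScore := func(out string) float64 {
        if len(out) >= 10 {
            return 1.0
        }
        return 0.0
    }
    stubScore := func(out string) float64 { return 0.8 }

    mean, pass := aggregate("Go is a compiled language.",
        []Scorer{lengthScore, stubScore}, 0.7)
    fmt.Printf("mean=%.2f pass=%v\n", mean, pass)
}
```

A suite-level pass rate is then just the fraction of cases whose aggregated score clears the threshold.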
7 Evaluation Dimensions
Score AI outputs across cognitive phase, perception focus, skill usage, behavior triggers, empathy, length, and LLM-as-judge quality. Each dimension maps to a human evaluation trait.
Baseline Regression Detection
Save evaluation baselines and automatically detect when scores drop. Compare across prompt versions, model changes, and configuration updates with delta reporting.
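The core of delta reporting is a comparison of the current run against a saved baseline. The snippet below is a minimal sketch, assuming a simplified report shape (`RegressionReport`, `detectRegression`, and the `maxDrop` threshold are illustrative names, not Sentinel's API):

```go
package main

import "fmt"

// RegressionReport is a hypothetical, simplified report: the score
// delta against a saved baseline and whether it breaches a threshold.
type RegressionReport struct {
    Delta     float64
    Regressed bool
}

// detectRegression compares a current pass rate against a baseline
// pass rate and flags a regression when the drop exceeds maxDrop.
func detectRegression(baseline, current, maxDrop float64) RegressionReport {
    delta := current - baseline
    return RegressionReport{
        Delta:     delta,
        Regressed: delta < -maxDrop,
    }
}

func main() {
    // Baseline pass rate 0.92, current run 0.87: a 5-point drop.
    r := detectRegression(0.92, 0.87, 0.03)
    fmt.Printf("delta=%.2f regressed=%v\n", r.Delta, r.Regressed)
}
```

Running the same comparison per prompt version or per model is what makes the cross-configuration comparisons possible.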
16 Plugin Lifecycle Hooks
OnEvalRunStarted, OnCaseCompleted, OnRegressionDetected, and 13 other lifecycle events. Wire in metrics, audit trails, or custom processing logic without modifying engine code.
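A lifecycle-hook plugin might look roughly like this. The interface below is a self-contained sketch — Sentinel's real plugin surface may differ — but it shows the pattern: the engine calls hook methods at lifecycle events, and a plugin accumulates metrics without any engine changes:

```go
package main

import "fmt"

// EvalPlugin is a hypothetical sketch of a lifecycle-hook interface
// covering three of the events named above; the real surface has more.
type EvalPlugin interface {
    OnEvalRunStarted(suiteID string)
    OnCaseCompleted(caseID string, score float64)
    OnRegressionDetected(suiteID string, delta float64)
}

// MetricsPlugin counts completed cases and records the worst
// regression delta — metrics wired in purely via hooks.
type MetricsPlugin struct {
    CasesCompleted int
    WorstDelta     float64
}

func (m *MetricsPlugin) OnEvalRunStarted(suiteID string) {}

func (m *MetricsPlugin) OnCaseCompleted(caseID string, score float64) {
    m.CasesCompleted++
}

func (m *MetricsPlugin) OnRegressionDetected(suiteID string, delta float64) {
    if delta < m.WorstDelta {
        m.WorstDelta = delta
    }
}

func main() {
    // Simulate the engine firing hooks during an eval run.
    var p EvalPlugin = &MetricsPlugin{}
    p.OnEvalRunStarted("qa-eval")
    p.OnCaseCompleted("case-1", 0.9)
    p.OnCaseCompleted("case-2", 0.7)
    p.OnRegressionDetected("qa-eval", -0.05)

    m := p.(*MetricsPlugin)
    fmt.Printf("cases=%d worstDelta=%.2f\n", m.CasesCompleted, m.WorstDelta)
}
```

Audit trails and custom processing follow the same shape: implement the hooks you care about and leave the rest as no-ops.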
Simple API. Powerful evaluation.
Create an evaluation suite and score AI outputs in under 20 lines. Sentinel handles the rest.
```go
package main

import (
    "context"
    "log/slog"

    "github.com/xraph/sentinel"
    "github.com/xraph/sentinel/store/memory"
)

func main() {
    ctx := context.Background()

    engine, _ := sentinel.NewEngine(
        sentinel.WithStore(memory.New()),
        sentinel.WithLogger(slog.Default()),
    )

    ctx = sentinel.WithTenant(ctx, "tenant-1")
    ctx = sentinel.WithApp(ctx, "myapp")

    // Create a suite with test cases
    suite, _ := engine.CreateSuite(ctx,
        sentinel.CreateSuiteInput{
            Name: "qa-eval",
        })

    _, _ = engine.CreateCase(ctx, suite.ID,
        sentinel.CreateCaseInput{
            Input:    "What is Go?",
            Expected: "A compiled language.",
            Scenario: "factual",
        })
}
```

```go
package main

import (
    "context"
    "fmt"

    "github.com/xraph/sentinel"
    "github.com/xraph/sentinel/scorer"
)

func runEval(
    ctx context.Context,
    engine *sentinel.Engine,
    suiteID string,
) {
    ctx = sentinel.WithTenant(ctx, "tenant-1")

    // Run evaluation with multiple scorers
    result, _ := engine.RunEval(ctx, suiteID,
        sentinel.RunEvalInput{
            Model: "gpt-4o",
            Scorers: []scorer.Scorer{
                scorer.Length(),
                scorer.LLMJudge(llmClient), // llmClient: your configured LLM client
            },
        })

    fmt.Printf("Pass rate: %.0f%%\n",
        result.PassRate*100)
    // Pass rate: 92%

    // Save baseline for regression detection
    _, _ = engine.SaveBaseline(ctx, suiteID)
}
```

Start evaluating with Sentinel
Add production-grade AI evaluation to your Go service in minutes. Sentinel handles scoring, baselines, regression detection, and red team testing out of the box.
```sh
go get github.com/xraph/sentinel
```