Composable AI evaluation framework for Go

Sentinel your AI

Define evaluation suites, score AI outputs across human-like dimensions, detect regressions against baselines, and red team your models — tenant-scoped, plugin-extensible, and Forge-native.

$ go get github.com/xraph/sentinel
22 Scorers · Multi-Tenant · Forge-Native · Red Team

Features

Everything you need for AI evaluation

Sentinel handles the hard parts — scoring, baselines, regression detection, red teaming, and multi-tenancy — so you can focus on your application.

Human-Like Scoring Pipeline

22 built-in scorers across 7 human-like dimensions. Score AI outputs for cognitive phase, perception focus, skill usage, behavior triggers, empathy, length, and LLM-as-judge quality.

eval.go
result, _ := engine.RunEval(ctx, suiteID,
    sentinel.RunEvalInput{
        Model: "gpt-4o",
        Scorers: []scorer.Scorer{
            scorer.Length(),
            scorer.LLMJudge(llmClient),
            scorer.CognitivePhase(),
        },
    })
// result.PassRate = 0.92

Baseline & Regression Detection

Save evaluation baselines and automatically detect regressions when scores drop. Compare across prompt versions, models, and configurations.

baseline.go
"text-fd-muted-foreground/60 italic">// Save a baseline snapshot
bl, _ := engine.SaveBaseline(ctx, suiteID)
 
"text-fd-muted-foreground/60 italic">// Later: detect regressions
report, _ := engine.DetectRegression(ctx,
suiteID, bl.ID)
"text-fd-muted-foreground/60 italic">// report.Delta = -0.05 (5% drop)

Multi-Tenant Isolation

Every suite, case, and eval run is scoped to a tenant via context. Cross-tenant queries are structurally impossible.

scope.go
ctx = sentinel.WithTenant(ctx, "tenant-1")
ctx = sentinel.WithApp(ctx, "myapp")
 
"text-fd-muted-foreground/60 italic">// All suites, cases, and eval runs are
"text-fd-muted-foreground/60 italic">// automatically scoped to tenant-1

Pluggable Store Backends

Start with in-memory for development, swap to SQLite or PostgreSQL for production. Every subsystem is a Go interface.

main.go
engine, _ := sentinel.NewEngine(
    sentinel.WithStore(postgres.New(pool)),
    sentinel.WithLogger(slog.Default()),
)
// Also: memory.New(), sqlite.New(db)

Red Team Testing

5 built-in attack generators — prompt injection, jailbreak, PII extraction, hallucination probes, and bias detection. Measure model resilience.

redteam.go
report, _ := engine.RunRedTeam(ctx,
    suiteID, redteam.Config{
        Attacks: []redteam.AttackType{
            redteam.PromptInjection,
            redteam.Jailbreak,
            redteam.PIIExtraction,
        },
    })
// report.BypassCount = 2

Scenario Types & Persona Evaluation

8 scenario types — factual, creative, safety, summarization, classification, extraction, conversation, and reasoning. Run persona-aware evaluations with dimension scoring.

case.go
_, _ = engine.CreateCase(ctx, suiteID,
    sentinel.CreateCaseInput{
        Input:    "Summarize this article...",
        Expected: "Key points: ...",
        Scenario: "summarization",
        Tags:     []string{"news", "concise"},
    })
// Scenarios: factual, creative, safety,
// summarization, classification, extraction,
// conversation, reasoning

Evaluation Scoring Pipeline

From test case to confidence score.

Sentinel orchestrates the entire evaluation lifecycle — case invocation, multi-scorer aggregation, baseline comparison, and regression detection.

7 Evaluation Dimensions

Score AI outputs across cognitive phase, perception focus, skill usage, behavior triggers, empathy, length, and LLM-as-judge quality. Each dimension maps to a human evaluation trait.
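
A rough sketch of wiring one scorer per dimension into a run, reusing the engine, ctx, and suiteID from the snippets above. Length, LLMJudge, and CognitivePhase appear elsewhere on this page; the other four constructor names are assumptions and may differ from the actual scorer package.

dimensions.go
// Sketch: one scorer per dimension. Length, LLMJudge, and CognitivePhase
// are shown elsewhere on this page; the other constructor names below
// are assumed and may differ.
result, _ := engine.RunEval(ctx, suiteID,
    sentinel.RunEvalInput{
        Model: "gpt-4o",
        Scorers: []scorer.Scorer{
            scorer.CognitivePhase(),    // cognitive phase
            scorer.PerceptionFocus(),   // perception focus (assumed name)
            scorer.SkillUsage(),        // skill usage (assumed name)
            scorer.BehaviorTriggers(),  // behavior triggers (assumed name)
            scorer.Empathy(),           // empathy (assumed name)
            scorer.Length(),            // length
            scorer.LLMJudge(llmClient), // LLM-as-judge quality
        },
    })
// result.PassRate reflects all configured scorers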

Baseline Regression Detection

Save evaluation baselines and automatically detect when scores drop. Compare across prompt versions, model changes, and configuration updates with delta reporting.
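
As a concrete example, the baseline calls shown above can gate a CI run: save a baseline, re-run the suite after a change, and fail the job when the delta crosses a threshold. This sketch reuses the engine, ctx, and suiteID from the snippets above; the 5% threshold and the standard-library log.Fatalf are illustrative choices, not library behavior.

ci_gate.go
// Illustrative CI gate built from SaveBaseline and DetectRegression.
bl, _ := engine.SaveBaseline(ctx, suiteID)

// ... ship a new prompt or model, re-run the suite ...

report, _ := engine.DetectRegression(ctx, suiteID, bl.ID)
if report.Delta < -0.05 { // pass rate dropped more than 5%
    log.Fatalf("regression detected: delta %.2f", report.Delta)
}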

16 Plugin Lifecycle Hooks

OnEvalRunStarted, OnCaseCompleted, OnRegressionDetected, and 13 other lifecycle events. Wire in metrics, audit trails, or custom processing logic without modifying engine code.
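
For a sense of the shape, here is a hedged sketch of a metrics plugin implementing three of the hooks named above. The hook names come from this page; the method signatures, event payloads, and the WithPlugins registration option are assumptions about the API, not the actual sentinel interfaces.

plugin.go
// Sketch only: hook names are from the docs above; the signatures and
// the WithPlugins option are assumed and may differ in sentinel.
type metricsPlugin struct {
    log *slog.Logger
}

func (p *metricsPlugin) OnEvalRunStarted(ctx context.Context, runID string) {
    p.log.Info("eval run started", "run", runID) // bump counters, start timers
}

func (p *metricsPlugin) OnCaseCompleted(ctx context.Context, caseID string, passed bool) {
    p.log.Info("case completed", "case", caseID, "passed", passed)
}

func (p *metricsPlugin) OnRegressionDetected(ctx context.Context, suiteID string, delta float64) {
    p.log.Warn("regression detected", "suite", suiteID, "delta", delta) // alert on-call
}

// Registration (assumed option name):
// engine, _ := sentinel.NewEngine(
//     sentinel.WithStore(memory.New()),
//     sentinel.WithPlugins(&metricsPlugin{log: slog.Default()}),
// )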

Developer Experience

Simple API. Powerful evaluation.

Create an evaluation suite and score AI outputs in under 20 lines. Sentinel handles the rest.

Setup & Run
main.go
package main

import (
    "context"
    "log/slog"

    "github.com/xraph/sentinel"
    "github.com/xraph/sentinel/store/memory"
)

func main() {
    ctx := context.Background()

    engine, _ := sentinel.NewEngine(
        sentinel.WithStore(memory.New()),
        sentinel.WithLogger(slog.Default()),
    )

    ctx = sentinel.WithTenant(ctx, "tenant-1")
    ctx = sentinel.WithApp(ctx, "myapp")

    // Create a suite with test cases
    suite, _ := engine.CreateSuite(ctx,
        sentinel.CreateSuiteInput{
            Name: "qa-eval",
        })

    _, _ = engine.CreateCase(ctx, suite.ID,
        sentinel.CreateCaseInput{
            Input:    "What is Go?",
            Expected: "A compiled language.",
            Scenario: "factual",
        })
}

Score & Verify
eval.go
package main

import (
    "context"
    "fmt"

    "github.com/xraph/sentinel"
    "github.com/xraph/sentinel/scorer"
)

func runEval(
    ctx context.Context,
    engine *sentinel.Engine,
    suiteID string,
) {
    ctx = sentinel.WithTenant(ctx, "tenant-1")

    // Run evaluation with multiple scorers
    // (llmClient is your LLM client for LLM-as-judge
    // scoring, constructed elsewhere)
    result, _ := engine.RunEval(ctx, suiteID,
        sentinel.RunEvalInput{
            Model: "gpt-4o",
            Scorers: []scorer.Scorer{
                scorer.Length(),
                scorer.LLMJudge(llmClient),
            },
        })

    fmt.Printf("Pass rate: %.0f%%\n",
        result.PassRate*100)
    // Pass rate: 92%

    // Save baseline for regression detection
    _, _ = engine.SaveBaseline(ctx, suiteID)
}

Start evaluating with Sentinel

Add production-grade AI evaluation to your Go service in minutes. Sentinel handles scoring, baselines, regression detection, and red team testing out of the box.

$ go get github.com/xraph/sentinel