Composable AI evaluation framework for Go

Sentinel your AI

Define evaluation suites, score AI outputs across human-like dimensions, detect regressions against baselines, and red team your models — tenant-scoped, plugin-extensible, and Forge-native.

$ go get github.com/xraph/sentinel
22 Scorers · Multi-Tenant · Forge-Native · Red Team

Features

Everything you need for AI evaluation

Sentinel handles the hard parts — scoring, baselines, regression detection, red teaming, and multi-tenancy — so you can focus on your application.

Human-Like Scoring Pipeline

22 built-in scorers across 7 human-like dimensions. Score AI outputs for cognitive phase, perception focus, skill usage, behavior triggers, empathy, length, and LLM-as-judge quality.

eval.go
result, _ := engine.RunEval(ctx, suiteID,
    sentinel.RunEvalInput{
        Model: "gpt-4o",
        Scorers: []scorer.Scorer{
            scorer.Length(),
            scorer.LLMJudge(llmClient),
            scorer.CognitivePhase(),
        },
    })
// result.PassRate = 0.92

Baseline & Regression Detection

Save evaluation baselines and automatically detect regressions when scores drop. Compare across prompt versions, models, and configurations.

baseline.go
"text-fd-muted-foreground/60 italic">// Save a baseline snapshot
bl, _ := engine.SaveBaseline(ctx, suiteID)
 
"text-fd-muted-foreground/60 italic">// Later: detect regressions
report, _ := engine.DetectRegression(ctx,
suiteID, bl.ID)
"text-fd-muted-foreground/60 italic">// report.Delta = -0.05 (5% drop)

Multi-Tenant Isolation

Every suite, case, and eval run is scoped to a tenant via context. Cross-tenant queries are structurally impossible.

scope.go
ctx = sentinel.WithTenant(ctx, "tenant-1")
ctx = sentinel.WithApp(ctx, "myapp")
 
"text-fd-muted-foreground/60 italic">// All suites, cases, and eval runs are
"text-fd-muted-foreground/60 italic">// automatically scoped to tenant-1

Pluggable Store Backends

Start with in-memory for development, swap to SQLite or PostgreSQL for production. Every subsystem is a Go interface.

main.go
engine, _ := sentinel.NewEngine(
    sentinel.WithStore(postgres.New(pool)),
    sentinel.WithLogger(slog.Default()),
)
// Also: memory.New(), sqlite.New(db)

Red Team Testing

5 built-in attack generators — prompt injection, jailbreak, PII extraction, hallucination probes, and bias detection. Measure model resilience.

redteam.go
report, _ := engine.RunRedTeam(ctx,
    suiteID, redteam.Config{
        Attacks: []redteam.AttackType{
            redteam.PromptInjection,
            redteam.Jailbreak,
            redteam.PIIExtraction,
        },
    })
// report.BypassCount = 2

Scenario Types & Persona Evaluation

8 scenario types — factual, creative, safety, summarization, classification, extraction, conversation, and reasoning. Run persona-aware evaluations with dimension scoring.

case.go
_, _ = engine.CreateCase(ctx, suiteID,
    sentinel.CreateCaseInput{
        Input:    "Summarize this article...",
        Expected: "Key points: ...",
        Scenario: "summarization",
        Tags:     []string{"news", "concise"},
    })
// Scenarios: factual, creative, safety,
// summarization, classification, extraction,
// conversation, reasoning

Evaluation Scoring Pipeline

From test case to confidence score.

Sentinel orchestrates the entire evaluation lifecycle — case invocation, multi-scorer aggregation, baseline comparison, and regression detection.

7 Evaluation Dimensions

Score AI outputs across cognitive phase, perception focus, skill usage, behavior triggers, empathy, length, and LLM-as-judge quality. Each dimension maps to a human evaluation trait.
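
A rough sketch of wiring one scorer per dimension into a run, reusing the engine, ctx, and suiteID from the snippets above. Length, LLMJudge, and CognitivePhase appear elsewhere on this page; the other four constructor names are assumptions and may differ from the actual scorer package.

dimensions.go
// Sketch: one scorer per dimension. Length, LLMJudge, and CognitivePhase
// are shown elsewhere on this page; the other constructor names below
// are assumed and may differ.
result, _ := engine.RunEval(ctx, suiteID,
    sentinel.RunEvalInput{
        Model: "gpt-4o",
        Scorers: []scorer.Scorer{
            scorer.CognitivePhase(),    // cognitive phase
            scorer.PerceptionFocus(),   // perception focus (assumed name)
            scorer.SkillUsage(),        // skill usage (assumed name)
            scorer.BehaviorTriggers(),  // behavior triggers (assumed name)
            scorer.Empathy(),           // empathy (assumed name)
            scorer.Length(),            // length
            scorer.LLMJudge(llmClient), // LLM-as-judge quality
        },
    })
// result.PassRate reflects all configured scorers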

Baseline Regression Detection

Save evaluation baselines and automatically detect when scores drop. Compare across prompt versions, model changes, and configuration updates with delta reporting.
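
As a concrete example, the baseline calls shown above can gate a CI run: save a baseline, re-run the suite after a change, and fail the job when the delta crosses a threshold. This sketch reuses the engine, ctx, and suiteID from the snippets above; the 5% threshold and the standard-library log.Fatalf are illustrative choices, not library behavior.

ci_gate.go
// Illustrative CI gate built from SaveBaseline and DetectRegression.
bl, _ := engine.SaveBaseline(ctx, suiteID)

// ... ship a new prompt or model, re-run the suite ...

report, _ := engine.DetectRegression(ctx, suiteID, bl.ID)
if report.Delta < -0.05 { // pass rate dropped more than 5%
    log.Fatalf("regression detected: delta %.2f", report.Delta)
}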

16 Plugin Lifecycle Hooks

OnEvalRunStarted, OnCaseCompleted, OnRegressionDetected, and 13 other lifecycle events. Wire in metrics, audit trails, or custom processing logic without modifying engine code.
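
For a sense of the shape, here is a hedged sketch of a metrics plugin implementing three of the hooks named above. The hook names come from this page; the method signatures, event payloads, and the WithPlugins registration option are assumptions about the API, not the actual sentinel interfaces.

plugin.go
// Sketch only: hook names are from the docs above; the signatures and
// the WithPlugins option are assumed and may differ in sentinel.
type metricsPlugin struct {
    log *slog.Logger
}

func (p *metricsPlugin) OnEvalRunStarted(ctx context.Context, runID string) {
    p.log.Info("eval run started", "run", runID) // bump counters, start timers
}

func (p *metricsPlugin) OnCaseCompleted(ctx context.Context, caseID string, passed bool) {
    p.log.Info("case completed", "case", caseID, "passed", passed)
}

func (p *metricsPlugin) OnRegressionDetected(ctx context.Context, suiteID string, delta float64) {
    p.log.Warn("regression detected", "suite", suiteID, "delta", delta) // alert on-call
}

// Registration (assumed option name):
// engine, _ := sentinel.NewEngine(
//     sentinel.WithStore(memory.New()),
//     sentinel.WithPlugins(&metricsPlugin{log: slog.Default()}),
// )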

Developer Experience

Simple API. Powerful evaluation.

Create an evaluation suite and score AI outputs in under 20 lines. Sentinel handles the rest.

Setup & Run
main.go
package main

import (
    "context"
    "log/slog"

    "github.com/xraph/sentinel"
    "github.com/xraph/sentinel/store/memory"
)

func main() {
    ctx := context.Background()

    engine, _ := sentinel.NewEngine(
        sentinel.WithStore(memory.New()),
        sentinel.WithLogger(slog.Default()),
    )

    ctx = sentinel.WithTenant(ctx, "tenant-1")
    ctx = sentinel.WithApp(ctx, "myapp")

    // Create a suite with test cases
    suite, _ := engine.CreateSuite(ctx,
        sentinel.CreateSuiteInput{
            Name: "qa-eval",
        })

    _, _ = engine.CreateCase(ctx, suite.ID,
        sentinel.CreateCaseInput{
            Input:    "What is Go?",
            Expected: "A compiled language.",
            Scenario: "factual",
        })
}

Score & Verify
eval.go
package main

import (
    "context"
    "fmt"

    "github.com/xraph/sentinel"
    "github.com/xraph/sentinel/scorer"
)

func runEval(
    ctx context.Context,
    engine *sentinel.Engine,
    suiteID string,
) {
    ctx = sentinel.WithTenant(ctx, "tenant-1")

    // Run evaluation with multiple scorers
    // (llmClient is your LLM client for LLM-as-judge
    // scoring, constructed elsewhere)
    result, _ := engine.RunEval(ctx, suiteID,
        sentinel.RunEvalInput{
            Model: "gpt-4o",
            Scorers: []scorer.Scorer{
                scorer.Length(),
                scorer.LLMJudge(llmClient),
            },
        })

    fmt.Printf("Pass rate: %.0f%%\n",
        result.PassRate*100)
    // Pass rate: 92%

    // Save baseline for regression detection
    _, _ = engine.SaveBaseline(ctx, suiteID)
}

Start evaluating with Sentinel

Add production-grade AI evaluation to your Go service in minutes. Sentinel handles scoring, baselines, regression detection, and red team testing out of the box.

$ go get github.com/xraph/sentinel