Introduction
Composable AI evaluation and testing framework for Go.
Sentinel is a Go library for testing AI agents the way you'd evaluate a human professional. Instead of simple input/output scoring, you evaluate across multiple dimensions — skills, traits, behaviors, cognition, communication, perception, and persona coherence — the same building blocks that make people unique.
Sentinel is a library — not a service. You bring your own LLM provider, database, and HTTP server. Sentinel provides the evaluation orchestration plumbing.
The Human-Like Testing Model
Sentinel evaluates AI agents the way you would evaluate a person:
| Dimension | What it tests | Scorer |
|---|---|---|
| Skill | Can the agent do the job? (tool selection, proficiency) | skill_usage |
| Trait | Who is the agent? (personality consistency) | trait_consistency |
| Behavior | How does it react? (trigger-action patterns) | behavior_trigger |
| Cognition | How does it think? (phase transitions, depth) | cognitive_phase |
| Communication | How does it talk? (tone, formality, verbosity) | communication_style |
| Perception | What does it notice? (attention focus, detail) | perception_focus |
| Persona | The whole person (end-to-end identity coherence) | persona_coherence |
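To make the multi-dimensional model concrete, here is a minimal self-contained sketch of how per-dimension scores might be combined into one overall result. The `DimensionScore` type and `WeightedAverage` function are illustrative assumptions, not Sentinel's actual API:

```go
package main

import "fmt"

// DimensionScore holds one dimension's result, in [0, 1].
// These names are illustrative, not Sentinel's real types.
type DimensionScore struct {
	Dimension string
	Score     float64
	Weight    float64
}

// WeightedAverage combines per-dimension scores into a single
// overall score, weighting dimensions by importance.
func WeightedAverage(scores []DimensionScore) float64 {
	var sum, total float64
	for _, s := range scores {
		sum += s.Score * s.Weight
		total += s.Weight
	}
	if total == 0 {
		return 0
	}
	return sum / total
}

func main() {
	scores := []DimensionScore{
		{"skill_usage", 0.9, 2},
		{"trait_consistency", 0.7, 1},
		{"communication_style", 0.8, 1},
	}
	fmt.Printf("overall: %.2f\n", WeightedAverage(scores))
}
```

The key point is that each scorer in the table produces an independent signal; how you weight them is a policy decision for your suite.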
What it does
- Multi-dimensional scoring — Score across 7 human-like dimensions simultaneously, or use traditional input/output scorers.
- 22 built-in scorers — From exact match and regex to LLM-as-judge, semantic similarity, and all 7 persona-aware scorers.
- Scenario generation — Auto-generate test cases targeting specific evaluation dimensions.
- Baseline & regression detection — Track performance over time and detect regressions.
- Adversarial testing (red team) — Test for prompt injection, jailbreaks, data leakage, hallucination, and off-topic responses.
- Multi-model comparison — Compare performance across different LLMs or agent configurations.
- Prompt versioning — Track system prompt iterations with performance metrics.
- Plugin system — 16 lifecycle hooks for metrics, audit trails, and custom processing.
- Three storage backends — PostgreSQL, SQLite, and in-memory.
- Forge integration — Drop-in `forge.Extension` with DI-injected Engine and auto-registered HTTP routes.
- REST API — 32+ endpoints for managing suites, cases, runs, baselines, red team, prompts, scenarios, and reports.
- go test integration — Assertion functions for CI/CD pipelines.
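As a sketch of what baseline and regression detection boils down to, here is a small self-contained check that could back a CI assertion. `CheckRegression` is a hypothetical helper written for illustration, not a Sentinel function:

```go
package main

import "fmt"

// CheckRegression reports an error when the current score falls below
// the baseline by more than the allowed tolerance. This is an
// illustrative sketch, not Sentinel's actual API.
func CheckRegression(baseline, current, tolerance float64) error {
	if current < baseline-tolerance {
		return fmt.Errorf("regression detected: baseline %.2f, current %.2f (tolerance %.2f)",
			baseline, current, tolerance)
	}
	return nil
}

func main() {
	// A 0.85 -> 0.78 drop exceeds the 0.05 tolerance and fails.
	if err := CheckRegression(0.85, 0.78, 0.05); err != nil {
		fmt.Println(err)
	}
}
```

In a CI pipeline, a check like this is what turns tracked run history into a pass/fail gate.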
Design philosophy
Library, not service. Sentinel is a set of Go packages you import. You control main, the database connection, and the process lifecycle.
Human-like evaluation. You don't just test output — you test skills, personality, thinking patterns, and communication style. Traditional LLM evals still work as a subset.
Interfaces over implementations. Every subsystem defines a Go interface. Swap any storage backend with a single type change.
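A pared-down sketch of this interface-first pattern, assuming a much smaller `Store` interface than Sentinel actually defines:

```go
package main

import "fmt"

// Store is a deliberately tiny stand-in for the kind of interface each
// subsystem defines; Sentinel's real Store interface is larger.
type Store interface {
	SaveResult(id string, score float64) error
	GetResult(id string) (float64, bool)
}

// MemoryStore is an in-memory implementation of Store.
type MemoryStore struct{ results map[string]float64 }

func NewMemoryStore() *MemoryStore {
	return &MemoryStore{results: map[string]float64{}}
}

func (m *MemoryStore) SaveResult(id string, score float64) error {
	m.results[id] = score
	return nil
}

func (m *MemoryStore) GetResult(id string) (float64, bool) {
	s, ok := m.results[id]
	return s, ok
}

func main() {
	// Swapping backends means changing only this one assignment.
	var store Store = NewMemoryStore()
	store.SaveResult("eres_01example", 0.91)
	if s, ok := store.GetResult("eres_01example"); ok {
		fmt.Printf("score: %.2f\n", s)
	}
}
```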
Tenant-scoped by design. sentinel.WithTenant and sentinel.WithApp inject context enforced at every layer.
TypeID everywhere. All entities use type-prefixed, K-sortable, UUIDv7-based identifiers (suite_, tcase_, erun_, eres_, base_, etc.).
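The prefix convention can be illustrated with a small parser. `SplitTypeID` below is a sketch of the naming scheme only, not the actual TypeID library's encoding or parser, and the example ID body is made up:

```go
package main

import (
	"fmt"
	"strings"
)

// SplitTypeID separates a type-prefixed ID into its prefix and body,
// e.g. "erun_01h9zk3xample" -> ("erun", "01h9zk3xample").
// Illustrative only; not the typeid library's real parser.
func SplitTypeID(id string) (prefix, body string, ok bool) {
	i := strings.LastIndex(id, "_")
	if i <= 0 || i == len(id)-1 {
		return "", "", false
	}
	return id[:i], id[i+1:], true
}

func main() {
	prefix, body, ok := SplitTypeID("erun_01h9zk3xample")
	fmt.Println(prefix, body, ok)
}
```

Because the body is UUIDv7-based, IDs created later sort after IDs created earlier, which keeps database indexes and pagination cheap.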
Quick look
package main

import (
	"context"
	"log"

	"github.com/xraph/sentinel/engine"
	"github.com/xraph/sentinel/store/memory"
)

func main() {
	ctx := context.Background()

	// Create an in-memory store for development.
	memStore := memory.New()

	// Build the Sentinel engine.
	eng, err := engine.New(
		engine.WithStore(memStore),
	)
	if err != nil {
		log.Fatal(err)
	}

	_ = eng
	_ = ctx
}