Architecture

Sentinel is organized as a set of focused Go packages. The engine package is the central coordinator. All other packages define interfaces, entities, and subsystem logic that compose around it.

Package diagram

┌──────────────────────────────────────────────────────────────────────┐
│                         engine.Engine                                  │
│  CreateSuite / GetSuite / ListSuites / UpdateSuite / DeleteSuite      │
│  CreateCase / CreateCaseBatch / GetCase / ListCases / ImportCases      │
│  GetRun / ListRuns / ListResults / GetResultStats                     │
│  SaveBaseline / GetBaseline / GetLatestBaseline / ListBaselines       │
│  CreatePromptVersion / ListPromptVersions / SetCurrentPromptVersion   │
├──────────────────────────────────────────────────────────────────────┤
│                     Evaluation pipeline                                │
│  1. Load suite + cases                                                 │
│  2. Resolve target (LLM, Agent, or Function)                          │
│  3. Run each case: send input → capture output + trace                │
│  4. Score output with configured scorers                              │
│  5. Aggregate results, emit plugin hooks                              │
│  6. Compare against baseline → detect regressions                     │
├──────────────────────┬───────────────────────────────────────────────┤
│  plugin.Registry      │  api.API (Forge HTTP handlers)                │
│  OnEvalRunStarted     │  32+ REST endpoints:                          │
│  OnEvalRunCompleted   │  - Suites (5 routes)                          │
│  OnCaseCompleted      │  - Cases (5 routes)                           │
│  OnRegressionDetected │  - Runs (5 routes)                            │
│  OnBaselineSaved      │  - Baselines, RedTeam, Prompts,               │
│  (16 total hooks)     │    Scenarios, Reports                         │
├──────────────────────┴───────────────────────────────────────────────┤
│                         store.Store                                    │
│  (composite: suite.Store + testcase.Store + evalrun.Store +           │
│   baseline.Store + promptversion.Store + Migrate/Ping/Close)          │
├──────────────────────────────────────────────────────────────────────┤
│  store/postgres       │  store/sqlite         │  store/memory          │
│  (PostgreSQL + bun)   │  (SQLite + bun)       │  (in-memory maps)     │
└──────────────────────────────────────────────────────────────────────┘

Engine construction

engine.New accepts functional options. WithStore is required; the others are optional:

eng, err := engine.New(
    engine.WithStore(pgStore),             // required: composite Store
    engine.WithConfig(sentinel.Config{     // optional: override defaults
        DefaultModel:  "gpt-4o",
        PassThreshold: 0.8,
    }),
    engine.WithExtension(metricsExt),      // optional: lifecycle hooks
    engine.WithLogger(slog.Default()),     // optional: structured logger
)

All components are interfaces — swap any with your own implementation.
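
For readers unfamiliar with the functional options pattern used by engine.New, here is a minimal, self-contained sketch of how such a constructor is commonly written in Go. It is not Sentinel's implementation: the Engine fields, the default values, the placeholder Store interface, and the error messages are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Config mirrors the two fields shown in the example above.
type Config struct {
	DefaultModel  string
	PassThreshold float64
}

// Store is a placeholder for the composite store interface.
type Store interface {
	Ping() error
}

// Engine holds whatever the options configure (hypothetical fields).
type Engine struct {
	store Store
	cfg   Config
}

// Option mutates an Engine during construction and may reject bad input.
type Option func(*Engine) error

func WithStore(s Store) Option {
	return func(e *Engine) error {
		if s == nil {
			return errors.New("engine: store must not be nil")
		}
		e.store = s
		return nil
	}
}

func WithConfig(c Config) Option {
	return func(e *Engine) error {
		e.cfg = c
		return nil
	}
}

// New applies defaults first, then each option in order, then validates
// that the one required dependency was supplied.
func New(opts ...Option) (*Engine, error) {
	e := &Engine{
		// Hypothetical defaults; the real defaults are not documented here.
		cfg: Config{DefaultModel: "example-model", PassThreshold: 0.5},
	}
	for _, opt := range opts {
		if err := opt(e); err != nil {
			return nil, err
		}
	}
	if e.store == nil {
		return nil, errors.New("engine: a store is required")
	}
	return e, nil
}

// nopStore is a stub backend so the sketch runs without a database.
type nopStore struct{}

func (nopStore) Ping() error { return nil }

func main() {
	if _, err := New(); err != nil {
		fmt.Println("without store:", err)
	}

	eng, err := New(
		WithStore(nopStore{}),
		WithConfig(Config{DefaultModel: "gpt-4o", PassThreshold: 0.8}),
	)
	if err != nil {
		fmt.Println("unexpected error:", err)
		return
	}
	fmt.Println(eng.cfg.DefaultModel, eng.cfg.PassThreshold)
}
```

The pattern lets new optional settings be added later without breaking existing callers, which is presumably why only the store needs to be passed explicitly.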

Evaluation flow

When an evaluation run is triggered:

1. Load suite and cases: Read the suite configuration and all test cases from the store.
2. Resolve target: Set up the evaluation target: an LLMTarget for raw LLM calls, an AgentTarget for agent invocations (with full run trace capture), or a FuncTarget for wrapping plain functions.
3. Execute cases: For each case, send the input to the target and capture the output. Cases run concurrently up to the configured Concurrency limit.
4. Score results: Each case's output is evaluated by its configured scorers. Persona-aware scorers also analyze the run trace for tool usage, trait consistency, and cognitive phase transitions.
5. Aggregate and record: Results are aggregated into the run record with pass rate, average score, dimension scores, token usage, and cost.
6. Compare against baseline: If a baseline exists, compare the new run against it and detect regressions.

Tenant isolation

sentinel.WithTenant(ctx, id) and sentinel.WithApp(ctx, id) inject identifiers into the context. These are extracted at every layer:

Store — all queries include WHERE app_id = ? filters
Engine — scope is applied before any store operation
API — the Forge request context provides tenant/app identifiers

Cross-tenant access is structurally impossible: even if a caller passes a suite ID from another app, the store layer returns ErrSuiteNotFound.

Plugin system

Extensions implement the plugin.Extension base interface (just Name() string) and then opt in to specific lifecycle hooks by implementing additional interfaces:

type Extension interface {
    Name() string
}

// Opt-in hooks (implement any subset):
type EvalRunStarted interface {
    OnEvalRunStarted(ctx context.Context, suiteID id.SuiteID, runID id.EvalRunID, model string) error
}
type EvalRunCompleted interface { /* ... */ }
type CaseCompleted interface { /* ... */ }
type RegressionDetected interface { /* ... */ }
// ... 16 hooks total

The plugin.Registry checks which hook interfaces each extension implements once, at registration time, and caches the result. Emit calls therefore iterate only over the extensions that implement the relevant hook.

Built-in extensions:

observability.MetricsExtension — counters for all lifecycle events
audithook.Extension — bridges lifecycle events to an audit trail backend

Package index

Package	Import path	Purpose
`sentinel`	`github.com/xraph/sentinel`	Root — Entity, Config, scope helpers, errors
`id`	`.../id`	TypeID-based entity identifiers (7 prefixes)
`engine`	`.../engine`	Central coordinator — all CRUD and eval orchestration
`suite`	`.../suite`	Suite entity and store interface
`testcase`	`.../testcase`	Case entity, ScenarioType, ScorerConfig
`evalrun`	`.../evalrun`	Run, Result, RunTrace entities and store
`baseline`	`.../baseline`	Baseline entity for regression detection
`promptversion`	`.../promptversion`	Prompt version entity for A/B testing
`scorer`	`.../scorer`	Scorer interface, registry, 22 built-in scorers
`scenario`	`.../scenario`	6 scenario generators
`target`	`.../target`	Target interface — LLM, Agent, Func adapters
`redteam`	`.../redteam`	5 adversarial attack generators
`comparison`	`.../comparison`	Multi-model comparison and baseline diff
`dataset`	`.../dataset`	Data loaders (JSON, CSV, JSONL) and generation
`report`	`.../report`	Report generators (terminal, JSON, HTML, CI)
`store`	`.../store`	Composite store interface
`store/postgres`	`.../store/postgres`	PostgreSQL backend (bun ORM)
`store/sqlite`	`.../store/sqlite`	SQLite backend (bun ORM)
`store/memory`	`.../store/memory`	In-memory backend for testing
`plugin`	`.../plugin`	Extension interfaces and Registry
`observability`	`.../observability`	Metrics extension
`audit_hook`	`.../audit_hook`	Audit trail extension
`api`	`.../api`	Forge-native HTTP handlers (32+ routes)
`extension`	`.../extension`	Forge framework extension adapter