Sentinel

Baselines & Regression Detection

Baselines capture the results of a known-good evaluation run. Later runs are compared against the current baseline to track performance over time and detect regressions.

Baseline

type Baseline struct {
    ID              id.BaselineID
    SuiteID         id.SuiteID
    RunID           id.EvalRunID
    Name            string
    Results         []BaselineResult
    PassRate        float64
    AvgScore        float64
    DimensionScores map[string]float64
    IsCurrent       bool
    CreatedAt       time.Time
}

BaselineResult

Per-case baseline data for comparison:

type BaselineResult struct {
    CaseID          id.CaseID
    CaseName        string
    Score           float64
    Status          string
    DimensionScores map[string]float64
}

Creating a baseline

Save a known-good run as the current baseline:

b := &baseline.Baseline{
    SuiteID:   suiteID,
    RunID:     runID,
    Name:      "v2.1-release",
    PassRate:  0.95,
    AvgScore:  0.88,
    IsCurrent: true,
}
if err := eng.SaveBaseline(ctx, b); err != nil {
    log.Fatal(err)
}

The IsCurrent flag marks which baseline is the active reference point. Only one baseline per suite should be current.
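
The source does not say whether the engine enforces the one-current-baseline rule itself. If the caller has to maintain it, the bookkeeping could look like the minimal sketch below. The trimmed Baseline type (string IDs instead of the id.* types) and the PromoteBaseline helper are hypothetical and not part of Sentinel.

```go
package main

import "fmt"

// Baseline is a trimmed stand-in for the Baseline struct shown earlier.
// Plain strings replace the id.* types so the sketch compiles on its own.
type Baseline struct {
    ID        string
    SuiteID   string
    Name      string
    IsCurrent bool
}

// PromoteBaseline (hypothetical helper) marks one baseline as current and
// clears IsCurrent on every other baseline of the same suite. It returns
// false and changes nothing if the target baseline is not found.
func PromoteBaseline(all []*Baseline, suiteID, baselineID string) bool {
    var target *Baseline
    for _, b := range all {
        if b.SuiteID == suiteID && b.ID == baselineID {
            target = b
            break
        }
    }
    if target == nil {
        return false
    }
    for _, b := range all {
        if b.SuiteID == suiteID {
            b.IsCurrent = b == target
        }
    }
    return true
}

func main() {
    baselines := []*Baseline{
        {ID: "b1", SuiteID: "suite-a", Name: "v2.0-release", IsCurrent: true},
        {ID: "b2", SuiteID: "suite-a", Name: "v2.1-release"},
    }
    PromoteBaseline(baselines, "suite-a", "b2")
    for _, b := range baselines {
        fmt.Println(b.Name, b.IsCurrent)
    }
}
```

Baselines that belong to other suites are left untouched, so each suite keeps its own current reference point.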

Regression detection

Compare a new run against the current baseline to detect performance drops:

  • Pass rate regression — Overall pass rate dropped below the baseline
  • Score regression — Average score dropped by more than the configured threshold
  • Per-case regression — Individual cases that passed in the baseline now fail
  • Dimension regression — Specific dimensions (e.g., skill, trait) degraded
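
The four checks above can be sketched as a single comparison function. This is an illustration of the idea, not Sentinel's actual detection logic: the Snapshot and CaseResult types, the DetectRegressions function, the "pass" status value, and the reuse of one threshold for both average and dimension scores are all assumptions.

```go
package main

import (
    "fmt"
    "sort"
)

// CaseResult and Snapshot are simplified stand-ins for BaselineResult and
// Baseline. The "pass" status string is an assumption.
type CaseResult struct {
    CaseID string
    Status string
}

type Snapshot struct {
    PassRate        float64
    AvgScore        float64
    Results         []CaseResult
    DimensionScores map[string]float64
}

// DetectRegressions (hypothetical) returns one finding per regression.
// threshold is the allowed drop for the average and per-dimension scores.
func DetectRegressions(base, run Snapshot, threshold float64) []string {
    var findings []string

    // Pass rate regression: any drop below the baseline counts.
    if run.PassRate < base.PassRate {
        findings = append(findings, "pass rate")
    }
    // Score regression: only drops larger than the threshold count.
    if base.AvgScore-run.AvgScore > threshold {
        findings = append(findings, "avg score")
    }
    // Per-case regression: passed in the baseline, no longer passes.
    runStatus := make(map[string]string, len(run.Results))
    for _, r := range run.Results {
        runStatus[r.CaseID] = r.Status
    }
    for _, b := range base.Results {
        if s, ok := runStatus[b.CaseID]; ok && b.Status == "pass" && s != "pass" {
            findings = append(findings, "case "+b.CaseID)
        }
    }
    // Dimension regression: iterate in sorted order for stable output.
    dims := make([]string, 0, len(base.DimensionScores))
    for d := range base.DimensionScores {
        dims = append(dims, d)
    }
    sort.Strings(dims)
    for _, d := range dims {
        if cur, ok := run.DimensionScores[d]; ok && base.DimensionScores[d]-cur > threshold {
            findings = append(findings, "dimension "+d)
        }
    }
    return findings
}

func main() {
    base := Snapshot{
        PassRate: 0.95, AvgScore: 0.88,
        Results: []CaseResult{
            {CaseID: "c1", Status: "pass"},
            {CaseID: "c2", Status: "pass"},
        },
        DimensionScores: map[string]float64{"skill": 0.92, "trait": 0.88},
    }
    run := Snapshot{
        PassRate: 0.90, AvgScore: 0.86,
        Results: []CaseResult{
            {CaseID: "c1", Status: "pass"},
            {CaseID: "c2", Status: "fail"},
        },
        DimensionScores: map[string]float64{"skill": 0.80, "trait": 0.88},
    }
    for _, f := range DetectRegressions(base, run, 0.05) {
        fmt.Println(f)
    }
}
```

In this example the average score falls by only 0.02, which stays inside the 0.05 threshold, so the run is flagged for pass rate, case c2, and the skill dimension but not for average score.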

When a regression is detected, the OnRegressionDetected plugin hook fires:

type RegressionDetected interface {
    OnRegressionDetected(ctx context.Context, suiteID id.SuiteID, baselineID id.BaselineID, delta float64) error
}

DimensionScores tracking

Baselines track per-dimension scores alongside overall metrics:

b.DimensionScores = map[string]float64{
    "skill":         0.92,
    "trait":         0.88,
    "communication": 0.95,
    "cognition":     0.85,
}

This enables regression detection per dimension: a drop in the skill dimension can be caught even if the overall pass rate holds steady.
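
As a rough illustration of a per-dimension comparison, the sketch below lists the dimensions whose score fell by more than a tolerance. The DimensionDrop type, the DroppedDimensions function, and the 0.05 tolerance are hypothetical; the baseline values are the ones from the map above, and the current-run values are made up for the example.

```go
package main

import (
    "fmt"
    "sort"
)

// DimensionDrop records one dimension that fell relative to the baseline.
type DimensionDrop struct {
    Dimension string
    Baseline  float64
    Current   float64
}

// DroppedDimensions (hypothetical) returns the dimensions whose score fell
// by more than tolerance, sorted by name. Dimensions that are absent from
// the current run are skipped rather than treated as a drop to zero.
func DroppedDimensions(baseline, current map[string]float64, tolerance float64) []DimensionDrop {
    names := make([]string, 0, len(baseline))
    for name := range baseline {
        names = append(names, name)
    }
    sort.Strings(names)

    var drops []DimensionDrop
    for _, name := range names {
        cur, ok := current[name]
        if !ok {
            continue
        }
        if baseline[name]-cur > tolerance {
            drops = append(drops, DimensionDrop{Dimension: name, Baseline: baseline[name], Current: cur})
        }
    }
    return drops
}

func main() {
    baseline := map[string]float64{
        "skill":         0.92,
        "trait":         0.88,
        "communication": 0.95,
        "cognition":     0.85,
    }
    // Illustrative current-run scores: only "skill" moves noticeably.
    current := map[string]float64{
        "skill":         0.85,
        "trait":         0.88,
        "communication": 0.96,
        "cognition":     0.84,
    }
    for _, d := range DroppedDimensions(baseline, current, 0.05) {
        fmt.Printf("%s: %.2f -> %.2f\n", d.Dimension, d.Baseline, d.Current)
    }
}
```

Here only skill is reported, even though the other three dimensions are flat or slightly better and the overall average barely moves.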

Engine methods

eng.SaveBaseline(ctx, baseline)
eng.GetBaseline(ctx, baselineID)
eng.GetLatestBaseline(ctx, suiteID)
eng.ListBaselines(ctx, suiteID)
eng.DeleteBaseline(ctx, baselineID)

API routes

Method  Path                                    Description
POST    /sentinel/baselines/:suiteId            Save a baseline
GET     /sentinel/baselines/:suiteId            Get the latest baseline
GET     /sentinel/baselines/:suiteId/baselines  List all baselines
