Eval Runs & Results

Every evaluation execution is tracked as a Run containing Results. Together they record the configuration a suite was run with, the outcome of each test case, and aggregate metrics such as pass rate, latency, token usage, and cost.

Run

A Run represents a single execution of an evaluation suite against a target:

type Run struct {
    sentinel.Entity
    ID              id.EvalRunID
    SuiteID         id.SuiteID
    Model           string
    SystemPrompt    string
    Temperature     float64
    TotalCases      int
    Passed          int
    Failed          int
    PassRate        float64
    AvgScore        float64
    AvgLatencyMs    int
    TotalTokens     int
    TotalCost       float64
    AppID           string
    TargetTenantID  string
    PersonaRef      string
    State           RunState
    DimensionScores map[string]float64
    CompletedAt     *time.Time
}

Run states

Runs follow a state machine with 4 states:

State	Description
`running`	Evaluation is actively processing cases
`completed`	All cases finished successfully
`failed`	Evaluation terminated with an error
`cancelled`	Evaluation was cancelled
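
The state values above could be modelled in Go as a string type with named constants. This is a minimal sketch: the source shows only the `RunState` type name and the four state strings, so the underlying string type, the constant names, and the `IsTerminal` helper are hypothetical. Treating every state other than `running` as terminal is also an assumption.

```go
package main

import "fmt"

// RunState mirrors the type name used on Run; the underlying string type is assumed.
type RunState string

// Constant names are hypothetical; only the string values come from the table above.
const (
	RunStateRunning   RunState = "running"
	RunStateCompleted RunState = "completed"
	RunStateFailed    RunState = "failed"
	RunStateCancelled RunState = "cancelled"
)

// IsTerminal reports whether a run in this state is assumed to have stopped
// processing cases (every state except "running").
func (s RunState) IsTerminal() bool {
	switch s {
	case RunStateCompleted, RunStateFailed, RunStateCancelled:
		return true
	default:
		return false
	}
}

func main() {
	states := []RunState{RunStateRunning, RunStateCompleted, RunStateFailed, RunStateCancelled}
	for _, s := range states {
		fmt.Printf("%s terminal=%v\n", s, s.IsTerminal())
	}
}
```

A helper like this is convenient when polling a run: a caller can keep waiting while the state is non-terminal and read `CompletedAt` and the aggregate fields afterwards.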

Result

A Result is the outcome of evaluating a single test case within a run:

type Result struct {
    sentinel.Entity
    ID              id.EvalResultID
    RunID           id.EvalRunID
    CaseID          id.CaseID
    CaseName        string
    Status          ResultStatus    // pass, fail, error
    Score           float64
    Output          string
    LatencyMs       int
    TokensUsed      int
    Cost            float64
    ScorerResults   []ScorerResult
    DimensionScores map[string]float64
    RunTrace        *RunTrace
}

Result status

Status	Description
`pass`	Score meets or exceeds the pass threshold
`fail`	Score is below the pass threshold
`error`	Evaluation failed (target error, scorer error, etc.)
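
The rules in the table can be expressed as a small decision function. The sketch below is illustrative only: `statusFor`, `passThreshold`, `errTarget`, and the constant names are hypothetical, and the source does not say where the pass threshold is configured.

```go
package main

import (
	"errors"
	"fmt"
)

// ResultStatus mirrors the type name used on Result; the underlying type is assumed.
type ResultStatus string

// Constant names are hypothetical; the string values come from the table above.
const (
	StatusPass  ResultStatus = "pass"
	StatusFail  ResultStatus = "fail"
	StatusError ResultStatus = "error"
)

// errTarget is a stand-in for any target or scorer error.
var errTarget = errors.New("target timed out")

// statusFor applies the table's rules: an evaluation error wins, otherwise the
// score is compared against the pass threshold (meets or exceeds means pass).
func statusFor(score, passThreshold float64, evalErr error) ResultStatus {
	if evalErr != nil {
		return StatusError
	}
	if score >= passThreshold {
		return StatusPass
	}
	return StatusFail
}

func main() {
	fmt.Println(statusFor(0.8, 0.7, nil))     // above the threshold
	fmt.Println(statusFor(0.7, 0.7, nil))     // exactly at the threshold
	fmt.Println(statusFor(0.5, 0.7, nil))     // below the threshold
	fmt.Println(statusFor(0, 0.7, errTarget)) // evaluation error
}
```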

ScorerResult

Each scorer produces a ScorerResult:

type ScorerResult struct {
    ScorerName string
    Score      float64
    Passed     bool
    Reason     string
    Dimension  string         // e.g., "skill", "trait"
    Details    map[string]any
}
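
To illustrate how the `Dimension` field on each ScorerResult could relate to the `DimensionScores` map on a Result, the sketch below groups scorer results by dimension and averages their scores. Averaging is an assumption: the source does not state how dimension scores are aggregated. The `dimensionAverages` function and the scorer names are hypothetical.

```go
package main

import "fmt"

// ScorerResult is copied from the definition above.
type ScorerResult struct {
	ScorerName string
	Score      float64
	Passed     bool
	Reason     string
	Dimension  string
	Details    map[string]any
}

// dimensionAverages returns the mean score per dimension.
// Scorer results with an empty Dimension are skipped in this sketch.
func dimensionAverages(results []ScorerResult) map[string]float64 {
	sums := make(map[string]float64)
	counts := make(map[string]int)
	for _, r := range results {
		if r.Dimension == "" {
			continue
		}
		sums[r.Dimension] += r.Score
		counts[r.Dimension]++
	}
	avgs := make(map[string]float64, len(sums))
	for dim, sum := range sums {
		avgs[dim] = sum / float64(counts[dim])
	}
	return avgs
}

func main() {
	// Scorer names below are made up for illustration.
	results := []ScorerResult{
		{ScorerName: "exact_match", Score: 1.0, Passed: true, Dimension: "skill"},
		{ScorerName: "tool_usage", Score: 0.5, Passed: false, Dimension: "skill"},
		{ScorerName: "tone", Score: 0.25, Passed: false, Dimension: "trait"},
	}
	for dim, avg := range dimensionAverages(results) {
		fmt.Printf("%s: %.2f\n", dim, avg)
	}
}
```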

RunTrace

For persona-aware evaluations using AgentTarget, the RunTrace captures the agent's execution as a list of steps and a list of tool calls:

type RunTrace struct {
    Steps     []StepTrace
    ToolCalls []ToolTrace
}

type StepTrace struct {
    Index      int
    Type       string
    Output     string
    TokensUsed int
}

type ToolTrace struct {
    ToolName  string
    Arguments string
    Result    string
    Error     string
}
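
A trace is useful for diagnosing why a case failed. The sketch below collects the tool calls that reported an error. It rests on two assumptions not confirmed by the source: an empty `Error` string means the call succeeded, and `RunTrace` may be nil when no trace was recorded (suggested by the pointer type on Result). The `failedToolCalls` helper and the tool names are hypothetical.

```go
package main

import "fmt"

// The three types below are copied from the definitions above.
type StepTrace struct {
	Index      int
	Type       string
	Output     string
	TokensUsed int
}

type ToolTrace struct {
	ToolName  string
	Arguments string
	Result    string
	Error     string
}

type RunTrace struct {
	Steps     []StepTrace
	ToolCalls []ToolTrace
}

// failedToolCalls returns the tool calls whose Error field is non-empty.
// A nil trace yields no failures.
func failedToolCalls(trace *RunTrace) []ToolTrace {
	if trace == nil {
		return nil
	}
	var failed []ToolTrace
	for _, call := range trace.ToolCalls {
		if call.Error != "" {
			failed = append(failed, call)
		}
	}
	return failed
}

func main() {
	// Tool names and payloads are made up for illustration.
	trace := &RunTrace{
		ToolCalls: []ToolTrace{
			{ToolName: "search", Arguments: `{"q":"refund policy"}`, Result: "3 hits"},
			{ToolName: "lookup_order", Arguments: `{"id":"42"}`, Error: "order not found"},
		},
	}
	for _, call := range failedToolCalls(trace) {
		fmt.Printf("%s failed: %s\n", call.ToolName, call.Error)
	}
}
```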

ResultStats

Aggregate statistics for a run:

type ResultStats struct {
    TotalCases      int
    Passed          int
    Failed          int
    Errored         int
    PassRate        float64
    AvgScore        float64
    AvgLatencyMs    int
    TotalTokens     int
    TotalCost       float64
    DimensionScores map[string]float64
}
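
The following sketch shows one way these aggregates could be derived from a run's results. The formulas are assumptions: `PassRate` is taken as passed cases divided by total cases (a fraction between 0 and 1), averages are taken over all cases including errored ones, and `AvgLatencyMs` uses integer division. The trimmed `Result` type, the omission of `DimensionScores`, and the `computeStats` function are simplifications for illustration and may differ from what `eng.GetResultStats` actually does.

```go
package main

import "fmt"

// Result is trimmed to the fields this sketch needs.
type Result struct {
	Status     string // "pass", "fail", or "error"
	Score      float64
	LatencyMs  int
	TokensUsed int
	Cost       float64
}

// ResultStats follows the definition above, without DimensionScores.
type ResultStats struct {
	TotalCases   int
	Passed       int
	Failed       int
	Errored      int
	PassRate     float64
	AvgScore     float64
	AvgLatencyMs int
	TotalTokens  int
	TotalCost    float64
}

// computeStats aggregates per-case results under the assumptions stated above.
func computeStats(results []Result) ResultStats {
	stats := ResultStats{TotalCases: len(results)}
	if len(results) == 0 {
		return stats // avoid dividing by zero for an empty run
	}
	var scoreSum float64
	var latencySum int
	for _, r := range results {
		switch r.Status {
		case "pass":
			stats.Passed++
		case "fail":
			stats.Failed++
		case "error":
			stats.Errored++
		}
		scoreSum += r.Score
		latencySum += r.LatencyMs
		stats.TotalTokens += r.TokensUsed
		stats.TotalCost += r.Cost
	}
	n := len(results)
	stats.PassRate = float64(stats.Passed) / float64(n)
	stats.AvgScore = scoreSum / float64(n)
	stats.AvgLatencyMs = latencySum / n
	return stats
}

func main() {
	stats := computeStats([]Result{
		{Status: "pass", Score: 1.0, LatencyMs: 100, TokensUsed: 10, Cost: 0.25},
		{Status: "fail", Score: 0.5, LatencyMs: 300, TokensUsed: 30, Cost: 0.25},
	})
	fmt.Printf("%+v\n", stats)
}
```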

Engine methods

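eng.GetRun(ctx, runID)            // get a specific run
eng.ListRuns(ctx, filter)         // list runs matching a filter
eng.ListRunsBySuite(ctx, suiteID) // list runs for one suite
eng.ListResults(ctx, runID)       // list results for a run
eng.GetResultStats(ctx, runID)    // get aggregate stats for a run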

API routes

Method	Path	Description
`GET`	`/sentinel/runs`	List runs
`GET`	`/sentinel/runs/:id`	Get a specific run
`GET`	`/sentinel/runs/:id/results`	List results for a run
`GET`	`/sentinel/runs/:id/stats`	Get aggregate stats
`POST`	`/sentinel/suites/:id/run`	Execute an evaluation
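
As a rough illustration of calling the stats route from Go, the sketch below fetches `/sentinel/runs/:id/stats` with the standard `net/http` client. Several details are assumptions not confirmed by the source: that the route returns JSON, the JSON field names, and the run ID format. The `runStats` type, `fetchRunStats`, and `newStubServer` are hypothetical; the stub server exists only so the example runs without a real deployment.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// runStats holds a subset of ResultStats. The JSON field names are assumptions.
type runStats struct {
	TotalCases int     `json:"total_cases"`
	Passed     int     `json:"passed"`
	PassRate   float64 `json:"pass_rate"`
}

// fetchRunStats calls GET {baseURL}/sentinel/runs/{runID}/stats and decodes the body.
func fetchRunStats(client *http.Client, baseURL, runID string) (runStats, error) {
	var stats runStats
	endpoint := baseURL + "/sentinel/runs/" + url.PathEscape(runID) + "/stats"
	resp, err := client.Get(endpoint)
	if err != nil {
		return stats, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return stats, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	err = json.NewDecoder(resp.Body).Decode(&stats)
	return stats, err
}

// newStubServer returns a local stand-in for the real API with one canned response.
func newStubServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/sentinel/runs/run_123/stats", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `{"total_cases": 4, "passed": 3, "pass_rate": 0.75}`)
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := newStubServer()
	defer srv.Close()

	// "run_123" is a made-up run ID.
	stats, err := fetchRunStats(srv.Client(), srv.URL, "run_123")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%d of %d cases passed (rate %.2f)\n", stats.Passed, stats.TotalCases, stats.PassRate)
}
```

Against a real deployment, `baseURL` would point at the host serving the `/sentinel` routes, and the struct tags would need to match the actual response schema.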
