Sentinel

The Human Model

How Sentinel evaluates AI agents as compositions of human-like dimensions.

Sentinel takes a fundamentally different approach to AI evaluation. Instead of testing only input/output accuracy, it evaluates agents across the same dimensions that make humans unique.

Philosophy

People are not defined by a single metric. They are evaluated across:

  • Skills they have learned (can they do the job?)
  • Personality traits that shape how they work (who are they?)
  • Behavioral habits that emerge from experience (how do they react?)
  • Thinking patterns that guide their reasoning (how do they think?)
  • Communication preferences that define their voice (how do they talk?)
  • Perceptual filters that determine what they notice (what do they see?)

Sentinel makes each of these a first-class evaluation dimension.
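
A hypothetical sketch of what such a persona might look like as data (the type and field names below are illustrative assumptions, not Sentinel's actual API):

// Illustrative persona shape mirroring the six dimensions above;
// these names are assumptions, not Sentinel's real types.
type Persona struct {
    Skills        []string // learned capabilities (can they do the job?)
    Traits        []string // personality traits (who are they?)
    Behaviors     []string // condition-triggered habits (how do they react?)
    Cognition     []string // thinking patterns (how do they think?)
    Communication []string // voice preferences (how do they talk?)
    Perception    []string // attention filters (what do they see?)
}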

The seven dimensions

Dimension      Package  Scorer               Scenario           Description
Skill          scorer   skill_usage          skill_challenge    Tool selection, proficiency, correctness
Trait          scorer   trait_consistency    trait_probe        Personality consistency across interactions
Behavior       scorer   behavior_trigger     behavior_trigger   Condition-triggered action patterns
Cognitive      scorer   cognitive_phase      cognitive_stress   Phase transitions, depth, focus
Communication  scorer   communication_style  comms_adaptation   Tone, formality, verbosity, technical level
Perception     scorer   perception_focus     perception_test    Attention filters and detail orientation
Persona        scorer   persona_coherence    persona_coherence  End-to-end identity coherence
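
As a rough illustration of these pairings in code (the map below is assumed for the example, not taken from Sentinel's source), each dimension resolves to a scorer and a scenario type:

// Illustrative lookup from dimension to its scorer and scenario type;
// the variable and its shape are assumptions, not Sentinel's registry.
var dimensionSuite = map[string]struct{ Scorer, Scenario string }{
    "skill":         {"skill_usage", "skill_challenge"},
    "trait":         {"trait_consistency", "trait_probe"},
    "behavior":      {"behavior_trigger", "behavior_trigger"},
    "cognitive":     {"cognitive_phase", "cognitive_stress"},
    "communication": {"communication_style", "comms_adaptation"},
    "perception":    {"perception_focus", "perception_test"},
    "persona":       {"persona_coherence", "persona_coherence"},
}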

How it works

                    ┌──────────────────┐
                    │  Test Scenario   │
                    │                  │
                    │  type: skill     │
                    │  input: "..."    │
                    │  expected: "..." │
                    └────────┬─────────┘
                             │
                    ┌────────▼─────────┐
                    │  Target          │
                    │  (LLM or Agent)  │
                    │                  │
                    │  persona: "..."  │
                    └────────┬─────────┘
                             │
              ┌──────────────┼──────────────┐
              │              │              │
     ┌────────▼──┐   ┌───────▼───┐   ┌──────▼──────┐
     │  Output   │   │ Run Trace │   │ Tool Calls  │
     │  (text)   │   │  (steps)  │   │  (actions)  │
     └────────┬──┘   └───────┬───┘   └──────┬──────┘
              │              │              │
     ┌────────▼──────────────▼──────────────▼────────┐
     │           Multi-Dimensional Scoring           │
     │                                               │
     │  Skill: 0.92  │  Trait: 0.88  │  Comms: 0.95  │
     │  Behavior: 1.0 │  Cognition: 0.85             │
     └───────────────────────────────────────────────┘
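
In code, the flow above might look roughly like this; Scenario, Run, and the Scorer interface are illustrative shapes, not Sentinel's actual types:

// Rough sketch of the pipeline in the diagram; all types here are
// assumptions for illustration, not Sentinel's real interfaces.
type Scenario struct {
    Type     string // e.g. "skill_challenge"
    Input    string
    Expected string
}

type Run struct {
    Output    string   // final text
    Trace     []string // intermediate reasoning steps
    ToolCalls []string // actions the agent took
}

type Scorer interface {
    Dimension() string
    Score(sc Scenario, run Run) float64
}

// evaluate executes the target once and fans the run out to every
// dimension scorer, collecting one score per dimension.
func evaluate(target func(Scenario) Run, sc Scenario, scorers []Scorer) map[string]float64 {
    run := target(sc)
    scores := make(map[string]float64, len(scorers))
    for _, s := range scorers {
        scores[s.Dimension()] = s.Score(sc, run)
    }
    return scores
}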

Scenario types

Each dimension has a corresponding scenario type that generates targeted test cases:

const (
    ScenarioStandard         = "standard"          // Traditional input/output
    ScenarioSkillChallenge   = "skill_challenge"   // Tool selection & proficiency
    ScenarioTraitProbe       = "trait_probe"       // Personality consistency
    ScenarioBehaviorTrigger  = "behavior_trigger"  // Condition-action patterns
    ScenarioCognitiveStress  = "cognitive_stress"  // Thinking strategy transitions
    ScenarioCommsAdaptation  = "comms_adaptation"  // Communication style
    ScenarioPerceptionTest   = "perception_test"   // Attention & focus
    ScenarioPersonaCoherence = "persona_coherence" // End-to-end identity
)
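
For example, a skill-focused test case might be declared like this, reusing the illustrative Scenario struct sketched earlier (the field values are made up):

// Hypothetical skill test case built on the constants above.
sc := Scenario{
    Type:     ScenarioSkillChallenge,
    Input:    "Find the median order value for Q3",
    Expected: "selects the sql tool and returns a single number",
}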

Backward compatibility

Traditional LLM evals (input/output with standard scorers) still work exactly as before. The persona-aware testing is an additional layer — use it when testing AI agents with defined personas, skip it when testing raw LLM outputs.
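
Concretely, a traditional eval is still written the same way; the snippet below is an illustration using the assumed Scenario struct from earlier:

// A plain input/output case: no persona on the target, no dimension
// scorers involved. Illustrative only.
sc := Scenario{
    Type:     ScenarioStandard,
    Input:    "Translate 'bonjour' to English",
    Expected: "hello",
}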

DimensionScores

Both Run and Result entities carry a DimensionScores field that maps dimension names to scores:

DimensionScores map[string]float64
// e.g., {"skill": 0.92, "trait": 0.88, "communication": 0.95}

These scores are aggregated from persona-aware scorers and provide a holistic view of agent performance.
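
One natural way to collapse these into a single headline number is a mean over the dimensions; the function below is a sketch of that idea, not necessarily Sentinel's aggregation:

// overall averages a DimensionScores map into one score.
// A sketch only; Sentinel's actual aggregation may weight dimensions.
func overall(scores map[string]float64) float64 {
    if len(scores) == 0 {
        return 0
    }
    var sum float64
    for _, v := range scores {
        sum += v
    }
    return sum / float64(len(scores))
}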
