Sentinel

The Human Model

How Sentinel evaluates AI agents as compositions of human-like dimensions.

Sentinel takes a fundamentally different approach to AI evaluation. Instead of testing only input/output accuracy, it evaluates agents across the same dimensions that make humans unique.

Philosophy

People are not defined by a single metric. They are evaluated across:

  • Skills they have learned (can they do the job?)
  • Personality traits that shape how they work (who are they?)
  • Behavioral habits that emerge from experience (how do they react?)
  • Thinking patterns that guide their reasoning (how do they think?)
  • Communication preferences that define their voice (how do they talk?)
  • Perceptual filters that determine what they notice (what do they see?)

Sentinel makes each of these a first-class evaluation dimension.
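
A hypothetical sketch of what such a persona might look like as data (the type and field names below are illustrative assumptions, not Sentinel's actual API):

// Illustrative persona shape mirroring the six dimensions above;
// these names are assumptions, not Sentinel's real types.
type Persona struct {
    Skills        []string // learned capabilities (can they do the job?)
    Traits        []string // personality traits (who are they?)
    Behaviors     []string // condition-triggered habits (how do they react?)
    Cognition     []string // thinking patterns (how do they think?)
    Communication []string // voice preferences (how do they talk?)
    Perception    []string // attention filters (what do they see?)
}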

The seven dimensions

Dimension      Package  Scorer               Scenario           Description
Skill          scorer   skill_usage          skill_challenge    Tool selection, proficiency, correctness
Trait          scorer   trait_consistency    trait_probe        Personality consistency across interactions
Behavior       scorer   behavior_trigger     behavior_trigger   Condition-triggered action patterns
Cognitive      scorer   cognitive_phase      cognitive_stress   Phase transitions, depth, focus
Communication  scorer   communication_style  comms_adaptation   Tone, formality, verbosity, technical level
Perception     scorer   perception_focus     perception_test    Attention filters and detail orientation
Persona        scorer   persona_coherence    persona_coherence  End-to-end identity coherence
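
As a rough illustration of these pairings in code (the map below is assumed for the example, not taken from Sentinel's source), each dimension resolves to a scorer and a scenario type:

// Illustrative lookup from dimension to its scorer and scenario type;
// the variable and its shape are assumptions, not Sentinel's registry.
var dimensionSuite = map[string]struct{ Scorer, Scenario string }{
    "skill":         {"skill_usage", "skill_challenge"},
    "trait":         {"trait_consistency", "trait_probe"},
    "behavior":      {"behavior_trigger", "behavior_trigger"},
    "cognitive":     {"cognitive_phase", "cognitive_stress"},
    "communication": {"communication_style", "comms_adaptation"},
    "perception":    {"perception_focus", "perception_test"},
    "persona":       {"persona_coherence", "persona_coherence"},
}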

How it works

                    ┌──────────────────┐
                    │  Test Scenario   │
                    │                  │
                    │  type: skill     │
                    │  input: "..."    │
                    │  expected: "..." │
                    └────────┬─────────┘
                             │
                    ┌────────▼─────────┐
                    │  Target          │
                    │  (LLM or Agent)  │
                    │                  │
                    │  persona: "..."  │
                    └────────┬─────────┘
                             │
              ┌──────────────┼──────────────┐
              │              │              │
     ┌────────▼──┐   ┌───────▼───┐   ┌──────▼──────┐
     │  Output   │   │ Run Trace │   │ Tool Calls  │
     │  (text)   │   │  (steps)  │   │  (actions)  │
     └────────┬──┘   └───────┬───┘   └──────┬──────┘
              │              │              │
     ┌────────▼──────────────▼──────────────▼────────┐
     │           Multi-Dimensional Scoring           │
     │                                               │
     │  Skill: 0.92  │  Trait: 0.88  │  Comms: 0.95  │
     │  Behavior: 1.0 │  Cognition: 0.85             │
     └───────────────────────────────────────────────┘
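
In code, the flow above might look roughly like this; Scenario, Run, and the Scorer interface are illustrative shapes, not Sentinel's actual types:

// Rough sketch of the pipeline in the diagram; all types here are
// assumptions for illustration, not Sentinel's real interfaces.
type Scenario struct {
    Type     string // e.g. "skill_challenge"
    Input    string
    Expected string
}

type Run struct {
    Output    string   // final text
    Trace     []string // intermediate reasoning steps
    ToolCalls []string // actions the agent took
}

type Scorer interface {
    Dimension() string
    Score(sc Scenario, run Run) float64
}

// evaluate executes the target once and fans the run out to every
// dimension scorer, collecting one score per dimension.
func evaluate(target func(Scenario) Run, sc Scenario, scorers []Scorer) map[string]float64 {
    run := target(sc)
    scores := make(map[string]float64, len(scorers))
    for _, s := range scorers {
        scores[s.Dimension()] = s.Score(sc, run)
    }
    return scores
}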

Scenario types

Each dimension has a corresponding scenario type that generates targeted test cases:

const (
    ScenarioStandard         = "standard"          // Traditional input/output
    ScenarioSkillChallenge   = "skill_challenge"   // Tool selection & proficiency
    ScenarioTraitProbe       = "trait_probe"       // Personality consistency
    ScenarioBehaviorTrigger  = "behavior_trigger"  // Condition-action patterns
    ScenarioCognitiveStress  = "cognitive_stress"  // Thinking strategy transitions
    ScenarioCommsAdaptation  = "comms_adaptation"  // Communication style
    ScenarioPerceptionTest   = "perception_test"   // Attention & focus
    ScenarioPersonaCoherence = "persona_coherence" // End-to-end identity
)
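
For example, a skill-focused test case might be declared like this, reusing the illustrative Scenario struct sketched earlier (the field values are made up):

// Hypothetical skill test case built on the constants above.
sc := Scenario{
    Type:     ScenarioSkillChallenge,
    Input:    "Find the median order value for Q3",
    Expected: "selects the sql tool and returns a single number",
}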

Backward compatibility

Traditional LLM evals (input/output with standard scorers) still work exactly as before. The persona-aware testing is an additional layer — use it when testing AI agents with defined personas, skip it when testing raw LLM outputs.
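
Concretely, a traditional eval is still written the same way; the snippet below is an illustration using the assumed Scenario struct from earlier:

// A plain input/output case: no persona on the target, no dimension
// scorers involved. Illustrative only.
sc := Scenario{
    Type:     ScenarioStandard,
    Input:    "Translate 'bonjour' to English",
    Expected: "hello",
}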

DimensionScores

Both Run and Result entities carry a DimensionScores field that maps dimension names to scores:

DimensionScores map[string]float64
// e.g., {"skill": 0.92, "trait": 0.88, "communication": 0.95}

These scores are aggregated from persona-aware scorers and provide a holistic view of agent performance.
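
One natural way to collapse these into a single headline number is a mean over the dimensions; the function below is a sketch of that idea, not necessarily Sentinel's aggregation:

// overall averages a DimensionScores map into one score.
// A sketch only; Sentinel's actual aggregation may weight dimensions.
func overall(scores map[string]float64) float64 {
    if len(scores) == 0 {
        return 0
    }
    var sum float64
    for _, v := range scores {
        sum += v
    }
    return sum / float64(len(scores))
}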
