The Human Model
How Sentinel evaluates AI agents as compositions of human-like dimensions.
Sentinel takes a fundamentally different approach to AI evaluation. Instead of testing only input/output accuracy, you evaluate agents across the same dimensions that make humans unique.
Philosophy
People are not defined by a single metric. They are evaluated across:
- Skills they have learned (can they do the job?)
- Personality traits that shape how they work (who are they?)
- Behavioral habits that emerge from experience (how do they react?)
- Thinking patterns that guide their reasoning (how do they think?)
- Communication preferences that define their voice (how do they talk?)
- Perceptual filters that determine what they notice (what do they see?)
Sentinel makes each of these a first-class evaluation dimension.
The seven dimensions
| Dimension | Package | Scorer | Scenario | Description |
|---|---|---|---|---|
| Skill | scorer | skill_usage | skill_challenge | Tool selection, proficiency, correctness |
| Trait | scorer | trait_consistency | trait_probe | Personality consistency across interactions |
| Behavior | scorer | behavior_trigger | behavior_trigger | Condition-triggered action patterns |
| Cognitive | scorer | cognitive_phase | cognitive_stress | Phase transitions, depth, focus |
| Communication | scorer | communication_style | comms_adaptation | Tone, formality, verbosity, technical level |
| Perception | scorer | perception_focus | perception_test | Attention filters and detail orientation |
| Persona | scorer | persona_coherence | persona_coherence | End-to-end identity coherence |
How it works
┌──────────────────┐
│  Test Scenario   │
│                  │
│  type: skill     │
│  input: "..."    │
│  expected: "..." │
└────────┬─────────┘
         │
┌────────▼─────────┐
│      Target      │
│  (LLM or Agent)  │
│                  │
│  persona: "..."  │
└────────┬─────────┘
         │
   ┌─────┴────────────┬────────────────┐
   │                  │                │
┌──▼────────┐  ┌──────▼────┐  ┌────────▼────┐
│  Output   │  │ Run Trace │  │ Tool Calls  │
│  (text)   │  │  (steps)  │  │  (actions)  │
└──┬────────┘  └──────┬────┘  └────────┬────┘
   │                  │                │
┌──▼──────────────────▼────────────────▼─────┐
│          Multi-Dimensional Scoring         │
│                                            │
│  Skill: 0.92 │ Trait: 0.88 │ Comms: 0.95   │
│  Behavior: 1.0 │ Cognition: 0.85           │
└────────────────────────────────────────────┘
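To make the flow above concrete, here is a minimal Go sketch of the same pipeline. Every type and function name in it (scenario, runTarget, scoreDimensions, and so on) is hypothetical and stands in for Sentinel's real components; only the field names and example values are taken from the diagram.

package main

import "fmt"

// scenario mirrors the test scenario box in the diagram; this struct is a
// hypothetical stand-in, not Sentinel's actual type.
type scenario struct {
	Type     string // e.g. "skill_challenge"
	Input    string
	Expected string
}

// runResult collects the three artifacts the target produces: the text
// output, the run trace (steps), and the tool calls (actions).
type runResult struct {
	Output    string
	Trace     []string
	ToolCalls []string
}

// runTarget stands in for invoking the target (an LLM or agent with a
// persona) on a scenario. A real target would call the model or agent here.
func runTarget(s scenario, persona string) runResult {
	return runResult{
		Output:    "...",
		Trace:     []string{"plan", "act", "verify"},
		ToolCalls: []string{"search"},
	}
}

// scoreDimensions stands in for the multi-dimensional scoring stage; it
// returns placeholder values shaped like the diagram's final box.
func scoreDimensions(r runResult) map[string]float64 {
	return map[string]float64{
		"skill":         0.92,
		"trait":         0.88,
		"communication": 0.95,
		"behavior":      1.0,
		"cognitive":     0.85,
	}
}

func main() {
	s := scenario{Type: "skill_challenge", Input: "...", Expected: "..."}
	res := runTarget(s, "persona: ...")
	for dim, score := range scoreDimensions(res) {
		fmt.Printf("%s: %.2f\n", dim, score)
	}
}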
Scenario types
Each dimension has a corresponding scenario type that generates targeted test cases:
const (
	ScenarioStandard         = "standard"          // Traditional input/output
	ScenarioSkillChallenge   = "skill_challenge"   // Tool selection & proficiency
	ScenarioTraitProbe       = "trait_probe"       // Personality consistency
	ScenarioBehaviorTrigger  = "behavior_trigger"  // Condition-action patterns
	ScenarioCognitiveStress  = "cognitive_stress"  // Thinking strategy transitions
	ScenarioCommsAdaptation  = "comms_adaptation"  // Communication style
	ScenarioPerceptionTest   = "perception_test"   // Attention & focus
	ScenarioPersonaCoherence = "persona_coherence" // End-to-end identity
)
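As a hedged illustration of how these constants might be selected when generating test cases for a given dimension, the helper below maps the dimension names from the table to their scenario types. scenarioTypeFor is hypothetical, not part of Sentinel's API, and assumes the constants above are in scope.

// scenarioTypeFor is an illustrative helper (not Sentinel API) that picks
// the scenario type to generate for a dimension.
func scenarioTypeFor(dimension string) string {
	switch dimension {
	case "skill":
		return ScenarioSkillChallenge
	case "trait":
		return ScenarioTraitProbe
	case "behavior":
		return ScenarioBehaviorTrigger
	case "cognitive":
		return ScenarioCognitiveStress
	case "communication":
		return ScenarioCommsAdaptation
	case "perception":
		return ScenarioPerceptionTest
	case "persona":
		return ScenarioPersonaCoherence
	default:
		return ScenarioStandard // fall back to a traditional input/output eval
	}
}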
Backward compatibility
Traditional LLM evals (input/output with standard scorers) still work exactly as before. The persona-aware testing is an additional layer: use it when testing AI agents with defined personas, skip it when testing raw LLM outputs.
DimensionScores
Both Run and Result entities carry a DimensionScores field that maps dimension names to scores:
DimensionScores map[string]float64
// e.g., {"skill": 0.92, "trait": 0.88, "communication": 0.95}
These scores are aggregated from persona-aware scorers and provide a holistic view of agent performance.
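As a minimal sketch of how these scores might be consumed, assuming only the DimensionScores field shown above (the Result struct here is a hypothetical stand-in and the averaging helper is not part of Sentinel):

package main

import "fmt"

// Result is a hypothetical stand-in for Sentinel's Result entity; only the
// DimensionScores field and its type come from the documentation above.
type Result struct {
	DimensionScores map[string]float64
}

// meanDimensionScore is an illustrative helper that collapses the
// per-dimension scores into a single headline number.
func meanDimensionScore(r Result) float64 {
	if len(r.DimensionScores) == 0 {
		return 0
	}
	var sum float64
	for _, s := range r.DimensionScores {
		sum += s
	}
	return sum / float64(len(r.DimensionScores))
}

func main() {
	r := Result{DimensionScores: map[string]float64{
		"skill":         0.92,
		"trait":         0.88,
		"communication": 0.95,
	}}
	fmt.Printf("overall: %.2f\n", meanDimensionScore(r))
}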