Sentinel

Skill Evaluation

Test what an agent can do: tool selection, proficiency, and correctness.

Skill evaluation tests whether an agent can perform its job by selecting the right tools, using them proficiently, and producing correct results.

What it tests

  • Does the agent select the appropriate tool for the task?
  • Does it use the tool correctly (proper arguments, error handling)?
  • Does it achieve the desired outcome?
  • Does proficiency match the expected level?

Scorer: skill_usage

The skill_usage scorer evaluates tool selection and usage by analyzing the agent's run trace:

testcase.ScorerConfig{
    Name: "skill_usage",
    Config: map[string]any{
        // Tools the agent is expected to call during the run.
        "expected_tools":  []string{"read_file", "search_code"},
        // Lowest acceptable proficiency score.
        "proficiency_min": 0.7,
    },
}

The scorer examines the RunTrace to check which tools were called, in what order, and whether the arguments were well-formed.
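The two trace checks described above (which tools were called, and in what order) can be sketched as follows. This is only an illustration under stated assumptions: `ToolCall`, `coverage`, and `inOrder` are hypothetical names, and Sentinel's actual RunTrace type and scoring formula are not shown on this page and may differ.

```go
package main

import "fmt"

// ToolCall is a hypothetical, simplified record of one tool invocation.
// It stands in for whatever the real RunTrace stores per call.
type ToolCall struct {
	Name string
	Args map[string]any
}

// coverage returns the fraction of expected tools that appear
// at least once in the trace (1.0 when nothing is expected).
func coverage(calls []ToolCall, expected []string) float64 {
	if len(expected) == 0 {
		return 1.0
	}
	seen := make(map[string]bool, len(calls))
	for _, c := range calls {
		seen[c.Name] = true
	}
	hit := 0
	for _, name := range expected {
		if seen[name] {
			hit++
		}
	}
	return float64(hit) / float64(len(expected))
}

// inOrder reports whether the expected tool names occur in the trace
// in the given relative order (other calls may be interleaved).
func inOrder(calls []ToolCall, expected []string) bool {
	i := 0
	for _, c := range calls {
		if i < len(expected) && c.Name == expected[i] {
			i++
		}
	}
	return i == len(expected)
}

func main() {
	trace := []ToolCall{
		{Name: "search_code", Args: map[string]any{"query": "password"}},
		{Name: "read_file", Args: map[string]any{"path": "auth/login.go"}},
	}
	fmt.Println("coverage:", coverage(trace, []string{"read_file", "search_code"}))
	fmt.Println("in order:", inOrder(trace, []string{"search_code", "read_file"}))
}
```

Treating order as a subsequence match, rather than an exact sequence match, is a design choice in this sketch: it tolerates extra tool calls between the expected ones.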

Scenario: skill_challenge

The skill_challenge scenario generator creates test cases that require the agent to select and use specific tools:

testcase.Case{
    ScenarioType: testcase.ScenarioSkillChallenge,
    Input:        "Find all security vulnerabilities in the auth module",
    Context: map[string]any{
        // Tools offered to the agent for this case.
        "available_tools": []string{"read_file", "search_code", "run_tests"},
        // Tools the agent is expected to select.
        "expected_tools":  []string{"search_code", "read_file"},
    },
}
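When writing such cases by hand, it can help to confirm that every entry in expected_tools is also listed in available_tools, since a case that expects an unavailable tool cannot be passed. The sketch below is a hypothetical sanity check, not part of Sentinel; `missingTools` is an illustrative helper, and the Context map is reproduced as a plain `map[string]any`.

```go
package main

import "fmt"

// missingTools returns the expected tools that are not present in available.
// It is a hypothetical helper used only for this illustration.
func missingTools(available, expected []string) []string {
	have := make(map[string]bool, len(available))
	for _, t := range available {
		have[t] = true
	}
	var missing []string
	for _, t := range expected {
		if !have[t] {
			missing = append(missing, t)
		}
	}
	return missing
}

func main() {
	// Same keys and values as the Context map in the case above.
	ctx := map[string]any{
		"available_tools": []string{"read_file", "search_code", "run_tests"},
		"expected_tools":  []string{"search_code", "read_file"},
	}

	// The type assertions yield nil slices if a key is absent or has another type.
	available, _ := ctx["available_tools"].([]string)
	expected, _ := ctx["expected_tools"].([]string)

	if m := missingTools(available, expected); len(m) > 0 {
		fmt.Println("expected tools not available:", m)
		return
	}
	fmt.Println("context is consistent")
}
```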

Dimension score

The skill dimension contributes to the DimensionScores map on both Result and Run:

result.DimensionScores["skill"] // 0.0 to 1.0
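A caller might gate on this score, for example to fail a CI run when skill falls below a threshold. The following is a minimal sketch under assumptions: `Result` here is a hypothetical stand-in with only the DimensionScores field, the value type is assumed to be `float64`, and `skillPasses` is an illustrative helper rather than a Sentinel function.

```go
package main

import "fmt"

// Result is a hypothetical stand-in that carries only the field
// mentioned on this page; the real type likely has more fields.
type Result struct {
	DimensionScores map[string]float64
}

// skillPasses returns the "skill" score and whether it is present
// and at least threshold. A missing dimension counts as a failure.
func skillPasses(r Result, threshold float64) (float64, bool) {
	score, ok := r.DimensionScores["skill"]
	if !ok {
		return 0, false
	}
	return score, score >= threshold
}

func main() {
	r := Result{DimensionScores: map[string]float64{"skill": 0.82}}
	score, ok := skillPasses(r, 0.7)
	fmt.Printf("skill=%.2f pass=%t\n", score, ok)
}
```

Checking for the key explicitly avoids treating an unscored dimension as a score of 0.0, which would otherwise be indistinguishable from a genuine zero.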

Use cases

  • Verify a code review agent uses search_code before read_file
  • Test that a data analyst agent selects the right database query tools
  • Ensure a customer support agent escalates correctly when tools fail
