Skill Evaluation
Test what an agent can do — tool selection, proficiency, and correctness.
Skill evaluation tests whether an agent can perform its job by selecting the right tools, using them proficiently, and producing correct results.
What it tests
- Does the agent select the appropriate tool for the task?
- Does it use the tool correctly (proper arguments, error handling)?
- Does it achieve the desired outcome?
- Does proficiency match the expected level?
Scorer: skill_usage
The skill_usage scorer evaluates tool selection and usage by analyzing the agent's run trace:
    testcase.ScorerConfig{
        Name: "skill_usage",
        Config: map[string]any{
            "expected_tools":  []string{"read_file", "search_code"},
            "proficiency_min": 0.7,
        },
    }

The scorer examines the RunTrace to check which tools were called, in what order, and whether the arguments were well-formed.
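To make the idea of checking "which tools were called, in what order" concrete, here is a minimal sketch of an ordered-coverage check. It is not the skill_usage scorer's actual implementation: the ToolCall type and the orderedCoverage function are hypothetical stand-ins, since the real RunTrace structure is not shown here. The threshold of 0.7 mirrors the proficiency_min value in the config above.

```go
package main

import "fmt"

// ToolCall is a hypothetical, simplified record of one tool invocation.
// The real RunTrace type is not described in this document.
type ToolCall struct {
	Name string
	Args map[string]any
}

// orderedCoverage returns the fraction of expected tools that appear in
// calls in the expected order, using a greedy subsequence match.
// Extra calls between expected tools are ignored.
func orderedCoverage(calls []ToolCall, expected []string) float64 {
	if len(expected) == 0 {
		return 1.0
	}
	matched := 0
	for _, c := range calls {
		if matched < len(expected) && c.Name == expected[matched] {
			matched++
		}
	}
	return float64(matched) / float64(len(expected))
}

func main() {
	// Illustrative trace data, not output from a real run.
	calls := []ToolCall{
		{Name: "read_file", Args: map[string]any{"path": "auth/login.go"}},
		{Name: "search_code", Args: map[string]any{"query": "password"}},
	}
	expected := []string{"read_file", "search_code"}

	score := orderedCoverage(calls, expected)
	fmt.Printf("coverage=%.2f pass=%v\n", score, score >= 0.7)
}
```

A real scorer would presumably also weigh argument validity and error handling; this sketch covers only selection and ordering.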
Scenario: skill_challenge
The skill_challenge scenario generator creates test cases that require the agent to select and use specific tools:
    Case{
        ScenarioType: testcase.ScenarioSkillChallenge,
        Input:        "Find all security vulnerabilities in the auth module",
        Context: map[string]any{
            "available_tools": []string{"read_file", "search_code", "run_tests"},
            "expected_tools":  []string{"search_code", "read_file"},
        },
    }

Dimension score
The skill dimension contributes to the DimensionScores map on both Result and Run:
    result.DimensionScores["skill"] // 0.0 to 1.0

Use cases
- Verify a code review agent uses search_code before read_file
- Test that a data analyst agent selects the right database query tools
- Ensure a customer support agent escalates correctly when tools fail
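For use cases like these, a test harness might gate on the skill dimension score. The sketch below assumes only that DimensionScores behaves like a map from dimension name to a float in the range 0.0 to 1.0, as the lookup shown earlier suggests. The skillPasses helper and the threshold value are hypothetical and are not part of the library.

```go
package main

import "fmt"

// skillPasses reports whether the "skill" dimension is present and meets
// the threshold. scores stands in for a DimensionScores map; its exact
// type in the library is an assumption.
func skillPasses(scores map[string]float64, threshold float64) (bool, error) {
	s, ok := scores["skill"]
	if !ok {
		return false, fmt.Errorf("no skill dimension score recorded")
	}
	if s < 0 || s > 1 {
		return false, fmt.Errorf("skill score %v outside [0, 1]", s)
	}
	return s >= threshold, nil
}

func main() {
	// Illustrative value, not a real evaluation result.
	scores := map[string]float64{"skill": 0.82}

	ok, err := skillPasses(scores, 0.7)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("skill gate passed:", ok)
}
```

Treating a missing key as an error, rather than as a score of 0.0, keeps "the dimension was not scored" distinct from "the agent scored poorly".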