Skill Evaluation

Skill evaluation tests whether an agent can perform its job by selecting the right tools, using them proficiently, and producing correct results.

What it tests

Does the agent select the appropriate tool for the task?
Does it use the tool correctly (proper arguments, error handling)?
Does it achieve the desired outcome?
Does its measured proficiency meet the configured minimum (proficiency_min)?

Scorer: `skill_usage`

The skill_usage scorer evaluates tool selection and usage by analyzing the agent's run trace:

testcase.ScorerConfig{
    Name: "skill_usage",
    Config: map[string]any{
        "expected_tools":  []string{"read_file", "search_code"},
        "proficiency_min": 0.7,
    },
}

The scorer examines the RunTrace to check which tools were called, in what order, and whether the arguments were well-formed.
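
The three checks described above (which tools were called, in what order, and with what arguments) can be sketched in a few lines of Go. This is only an illustration under stated assumptions: `ToolCall`, `usageReport`, and `analyzeTrace` are hypothetical names, not the library's API, and "well-formed" is approximated crudely as "the call carried at least one argument".

```go
package main

import "fmt"

// ToolCall is a hypothetical, simplified record of one tool invocation
// in a run trace. The real RunTrace type may look quite different.
type ToolCall struct {
	Name string
	Args map[string]any
}

// usageReport (hypothetical) summarizes the three checks.
type usageReport struct {
	Coverage   float64 // fraction of expected tools called at least once
	InOrder    bool    // expected tools appear as an ordered subsequence of the trace
	WellFormed float64 // fraction of calls that carried at least one argument
}

// analyzeTrace walks the calls once and fills in the report.
func analyzeTrace(calls []ToolCall, expected []string) usageReport {
	called := make(map[string]bool)
	withArgs := 0
	next := 0 // index of the next expected tool we are waiting for
	for _, c := range calls {
		called[c.Name] = true
		if len(c.Args) > 0 {
			withArgs++
		}
		if next < len(expected) && c.Name == expected[next] {
			next++
		}
	}

	hit := 0
	for _, name := range expected {
		if called[name] {
			hit++
		}
	}

	r := usageReport{InOrder: next == len(expected)}
	if len(expected) > 0 {
		r.Coverage = float64(hit) / float64(len(expected))
	}
	if len(calls) > 0 {
		r.WellFormed = float64(withArgs) / float64(len(calls))
	}
	return r
}

func main() {
	trace := []ToolCall{
		{Name: "search_code", Args: map[string]any{"query": "password"}},
		{Name: "read_file", Args: map[string]any{"path": "auth/login.go"}},
	}
	// Both tools were used, but not in the order listed here.
	r := analyzeTrace(trace, []string{"read_file", "search_code"})
	fmt.Printf("coverage=%.2f inOrder=%v wellFormed=%.2f\n", r.Coverage, r.InOrder, r.WellFormed)
	// prints: coverage=1.00 inOrder=false wellFormed=1.00
}
```

How the actual scorer weighs these signals against proficiency_min is not specified here; the sketch keeps them separate rather than guessing at a formula.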

Scenario: `skill_challenge`

The skill_challenge scenario generator creates test cases that require the agent to select and use specific tools:

testcase.Case{
    ScenarioType: testcase.ScenarioSkillChallenge,
    Input:        "Find all security vulnerabilities in the auth module",
    Context: map[string]any{
        "available_tools": []string{"read_file", "search_code", "run_tests"},
        "expected_tools":  []string{"search_code", "read_file"},
    },
}
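
When writing cases like the one above by hand, a quick sanity check is that every entry in expected_tools also appears in available_tools; otherwise the agent can never satisfy the case. The helper below is a minimal sketch of that check. `missingTools` is a hypothetical name introduced for illustration and is not part of the testcase package.

```go
package main

import "fmt"

// missingTools (hypothetical helper) returns every expected tool that is
// absent from the available list, preserving the order of expected.
func missingTools(available, expected []string) []string {
	have := make(map[string]bool, len(available))
	for _, t := range available {
		have[t] = true
	}
	var missing []string
	for _, t := range expected {
		if !have[t] {
			missing = append(missing, t)
		}
	}
	return missing
}

func main() {
	available := []string{"read_file", "search_code", "run_tests"}
	expected := []string{"search_code", "read_file"}

	if m := missingTools(available, expected); len(m) > 0 {
		fmt.Println("case is unsatisfiable, missing tools:", m)
		return
	}
	fmt.Println("case is consistent")
}
```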

Dimension score

The skill dimension is reported under the "skill" key of the DimensionScores map on both Result and Run:

result.DimensionScores["skill"] // 0.0 to 1.0
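
One detail worth handling when reading this value: a plain Go map lookup returns 0 for a missing key, which is indistinguishable from a genuine score of 0.0. The sketch below uses the two-value lookup to tell the two apart before comparing against a threshold. It assumes DimensionScores is a `map[string]float64` (the element type is not stated above), and `skillMeets` is a hypothetical helper.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoSkillScore signals that no "skill" dimension was recorded at all.
var errNoSkillScore = errors.New(`no "skill" entry in dimension scores`)

// skillMeets (hypothetical helper) reports whether the skill dimension
// reaches min. The map type is an assumption about DimensionScores.
func skillMeets(scores map[string]float64, min float64) (bool, error) {
	score, ok := scores["skill"]
	if !ok {
		return false, errNoSkillScore
	}
	return score >= min, nil
}

func main() {
	scores := map[string]float64{"skill": 0.82}
	ok, err := skillMeets(scores, 0.7)
	fmt.Println(ok, err) // prints: true <nil>
}
```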

Use cases

Verify a code review agent uses search_code before read_file
Test that a data analyst agent selects the right database query tools
Ensure a customer support agent escalates correctly when tools fail
