Focus 1
Capability
Measure whether the agent can complete intended tasks without hidden manual work.
Agent systems need repeatable evaluation across capability, safety, refusal behavior, and tool use.
Focus 1
Measure whether the agent can complete intended tasks without hidden manual work.
Focus 2
Evaluate unsafe action prevention, least-privilege tool access, and policy boundaries.
Focus 3
Test prompt-injection handling across user input, retrieved content, and tool outputs.
Focus 4
Validate schema use, authorization, timeouts, and error handling for every tool.
Focus 5
Check that refusals are neither too broad nor too narrow for risky tasks.
Focus 6
Record prompts, tool versions, seeds where available, and source artifacts.