Agent Evaluation & Benchmarking
Measuring Agent Success
Agent Evaluation
What to Measure?
1. Task Success Rate
Did the agent accomplish the goal?
Metric: Success Rate = (Tasks completed / Total tasks) × 100%
Example:
- Asked the agent to book 10 flights
- It successfully booked 9
- Success Rate = 90%
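The flight-booking example can be reproduced in a few lines. This is a minimal sketch: the function name `success_rate` and the list of booleans are illustrative assumptions, not part of any particular framework.

```python
def success_rate(outcomes):
    """Return the percentage of tasks marked successful.

    `outcomes` is a list of booleans, one per task (hypothetical format).
    """
    if not outcomes:
        raise ValueError("need at least one task outcome")
    return 100.0 * sum(outcomes) / len(outcomes)


# Ten flight-booking attempts: nine succeeded, one failed.
bookings = [True] * 9 + [False]
print(f"Success rate: {success_rate(bookings):.0f}%")  # Success rate: 90%
```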
2. Efficiency
What resources did the agent consume (tokens, API calls, time, cost)?
Metrics:
- Token efficiency (fewer tokens = better)
- API call efficiency (fewer calls = better)
- Time taken (faster = better)
- Cost (lower = better)
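One way to track these four metrics is to record them per run and average them. The sketch below assumes a hypothetical `RunStats` record; the field names and the sample numbers are made up for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunStats:
    """Resource usage for one agent run (hypothetical record layout)."""
    tokens: int
    api_calls: int
    seconds: float
    cost_usd: float


def summarize(runs):
    """Average each efficiency metric across runs; lower is better for all."""
    return {
        "avg_tokens": mean(r.tokens for r in runs),
        "avg_api_calls": mean(r.api_calls for r in runs),
        "avg_seconds": mean(r.seconds for r in runs),
        "avg_cost_usd": mean(r.cost_usd for r in runs),
    }


# Illustrative numbers only, not measurements.
runs = [RunStats(1000, 4, 2.0, 0.02), RunStats(3000, 6, 4.0, 0.04)]
for name, value in summarize(runs).items():
    print(f"{name}: {value}")
```

Comparing these averages between two agents is only meaningful when both achieve a similar success rate, since an agent that gives up early looks cheap.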
3. Quality
Are results correct and useful?
Metrics:
- Accuracy (compared to ground truth)
- User satisfaction (human rating 1-5)
- Hallucination rate (share of claims that are false or unsupported)
- Relevance (how well the output addresses the goal)
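Two of these metrics reduce to simple arithmetic once each claim or rating has been labeled. The sketch below assumes the labeling has already been done (by a human or a checker); the function names are hypothetical.

```python
def hallucination_rate(claims_supported):
    """Percentage of claims that were NOT supported.

    `claims_supported` holds one boolean per claim: True if verified.
    """
    if not claims_supported:
        raise ValueError("need at least one claim")
    unsupported = sum(1 for ok in claims_supported if not ok)
    return 100.0 * unsupported / len(claims_supported)


def mean_rating(ratings):
    """Average user satisfaction on the 1-5 scale."""
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be between 1 and 5")
    return sum(ratings) / len(ratings)


print(hallucination_rate([True, True, False, True]))  # 25.0
print(mean_rating([5, 4, 3]))                         # 4.0
```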
Benchmarks for Agents
HotpotQA
Multi-step question answering
Requires the agent to search, reason, and combine facts
Success = Correct final answer
Example:
Q: "What is the birth year of the director of Inception?"
Steps needed:
1. Search "Inception director" → Christopher Nolan
2. Search "Christopher Nolan birth year" → 1970
A: 1970
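The two-step lookup above can be mimicked with a toy search tool. Everything here is an assumption for illustration: the `KNOWLEDGE` dictionary stands in for a real search backend, and `exact_match` is one simple way to decide whether the final answer is correct.

```python
# Toy knowledge base standing in for a search tool (hypothetical).
KNOWLEDGE = {
    "Inception director": "Christopher Nolan",
    "Christopher Nolan birth year": "1970",
}


def search(query):
    """Look up a query in the toy knowledge base."""
    return KNOWLEDGE[query]


def answer_question():
    """Chain two searches: find the director, then the birth year."""
    director = search("Inception director")
    return search(f"{director} birth year")


def exact_match(prediction, gold):
    """Success check: normalized prediction equals the gold answer."""
    return prediction.strip().lower() == gold.strip().lower()


prediction = answer_question()
print(prediction, exact_match(prediction, "1970"))  # 1970 True
```

The second query depends on the result of the first, which is what makes the question multi-step: neither search alone yields the answer.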
WebShop
E-commerce task: Find and purchase a product within given constraints
Agent must navigate the website, filter options, and buy an item
Success = Purchased item matches the requirements
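A strict version of this success check can be sketched as follows. This is an all-or-nothing check written for illustration; the benchmark's own scoring may differ, and the item layout and the `meets_requirements` name are assumptions.

```python
def meets_requirements(item, required_attrs, max_price):
    """True if the item has every required attribute and fits the budget."""
    has_attrs = required_attrs <= item["attributes"]  # subset test on sets
    within_budget = item["price"] <= max_price
    return has_attrs and within_budget


# Hypothetical purchased item.
item = {
    "name": "Trail running shoes",
    "attributes": {"red", "size 10", "waterproof"},
    "price": 79.99,
}

print(meets_requirements(item, {"red", "size 10"}, 100.0))  # True
print(meets_requirements(item, {"red", "size 10"}, 50.0))   # False
```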
ALFWorld
Virtual household environment (like a text-based game)
Agent must navigate, find objects, and complete tasks
Success = Agent reaches the task's goal state
Evaluation Metrics
Accuracy
Accuracy = (Correct outputs / Total outputs) × 100%
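As a minimal sketch, accuracy can be computed by comparing each output with its ground-truth answer. The function name and sample data are illustrative assumptions.

```python
def accuracy(predictions, ground_truth):
    """Percentage of predictions that equal the corresponding ground truth."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground truth must align")
    if not predictions:
        raise ValueError("need at least one prediction")
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return 100.0 * correct / len(predictions)


predicted = ["1970", "Paris", "blue", "42"]
expected = ["1970", "Paris", "red", "42"]
print(accuracy(predicted, expected))  # 75.0
```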
F1 Score
Balances precision and recall
F1 = 2 × (precision × recall) / (precision + recall)
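The formula uses precision (the share of returned items that are correct) and recall (the share of correct items that were returned). A small sketch from raw counts, with hypothetical numbers:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive counts.

    tp: correct items returned, fp: wrong items returned,
    fn: correct items missed. Returns 0.0 where a ratio is undefined.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Agent returned 10 facts, 8 correct; 16 correct facts existed in total.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.3f}")
# precision=0.80 recall=0.50 f1=0.615
```

F1 sits closer to the lower of the two values, so an agent cannot score well by maximizing only one of them.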
Tool Use Metrics
- Tool call accuracy (% of tool calls that are valid)
- Tool call necessity (% of tool calls that were actually needed)
- Tool diversity (how many distinct tools the agent uses)
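These three metrics can be derived from a log of tool calls. The sketch below assumes each call has already been labeled as valid and as needed; the log format and tool names are hypothetical.

```python
def tool_use_metrics(calls):
    """Summarize a list of labeled tool calls.

    Each call is a dict with keys 'tool' (str), 'valid' (bool),
    and 'needed' (bool). This layout is an assumption for illustration.
    """
    if not calls:
        raise ValueError("need at least one tool call")
    total = len(calls)
    return {
        "call_accuracy": 100.0 * sum(c["valid"] for c in calls) / total,
        "call_necessity": 100.0 * sum(c["needed"] for c in calls) / total,
        "distinct_tools": len({c["tool"] for c in calls}),
    }


calls = [
    {"tool": "search", "valid": True, "needed": True},
    {"tool": "search", "valid": True, "needed": False},
    {"tool": "calculator", "valid": True, "needed": True},
    {"tool": "calculator", "valid": False, "needed": False},
]
print(tool_use_metrics(calls))
# {'call_accuracy': 75.0, 'call_necessity': 50.0, 'distinct_tools': 2}
```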
Human Evaluation
Automated metrics sometimes miss important aspects, so human evaluators rate:
- Correctness (1-5)
- Helpfulness (1-5)
- Safety (1-5)
- Efficiency (1-5)
Combine human and automated evaluation for the most complete assessment.
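One possible way to combine the two is a weighted average of the human ratings and the automated success rate. The weighting scheme, the `human_weight` parameter, and the sample ratings below are assumptions for illustration, not a standard method.

```python
from statistics import mean


def average_ratings(ratings_by_criterion):
    """Average the 1-5 ratings for each criterion."""
    return {name: mean(scores) for name, scores in ratings_by_criterion.items()}


def combined_score(success_rate_pct, avg_ratings, human_weight=0.5):
    """Blend human and automated scores on a 0-1 scale (assumed weighting)."""
    # Map the mean 1-5 rating onto 0-1 so it is comparable to the success rate.
    human = (mean(list(avg_ratings.values())) - 1) / 4
    automated = success_rate_pct / 100
    return human_weight * human + (1 - human_weight) * automated


# Two evaluators rated the same agent (made-up ratings).
ratings = {
    "correctness": [4, 5],
    "helpfulness": [3, 5],
    "safety": [5, 5],
    "efficiency": [3, 3],
}
avgs = average_ratings(ratings)
print(avgs)
print(round(combined_score(90.0, avgs), 3))
```

Reporting the per-criterion averages alongside the blended number is usually safer than the single score alone, since a low safety rating can otherwise be hidden by high scores elsewhere.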