
Agent Evaluation & Benchmarking

Measuring Agent Success

28 min Intermediate

Agent Evaluation

What to Measure?

1. Task Success Rate

Did the agent accomplish the goal?

Metric: Success Rate = (Tasks completed / Total tasks) × 100%

Example:
- The agent was asked to book 10 flights
- It successfully booked 9
- Success Rate = 90%
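
The formula above can be sketched in a few lines. This is a minimal illustration, assuming each task outcome has already been recorded as True or False; the function name and the sample data are hypothetical.

```python
def success_rate(outcomes: list[bool]) -> float:
    """Return the percentage of tasks marked as successful."""
    if not outcomes:
        raise ValueError("need at least one task outcome")
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical outcomes mirroring the example: 10 bookings, 9 succeeded.
bookings = [True] * 9 + [False]
print(f"Success rate: {success_rate(bookings):.0f}%")  # prints "Success rate: 90%"
```

How a task gets marked as "completed" is the hard part in practice; the arithmetic itself is simple.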

2. Efficiency

How much did the agent spend (tokens, API calls, time)?

Metrics:
- Token efficiency (fewer tokens = better)
- API call efficiency (fewer calls = better)
- Time taken (faster = better)
- Cost (lower = better)
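
One way to track these metrics is to log a small record per run and average across runs. The sketch below assumes you already capture token counts, call counts, and wall-clock time; the RunRecord fields and the token price are illustrative assumptions, not values from any real provider.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunRecord:
    """One agent run; all field names are illustrative assumptions."""
    tokens: int
    api_calls: int
    seconds: float


# Assumed price for illustration only; substitute your provider's actual rate.
USD_PER_1K_TOKENS = 0.002


def summarize(runs: list[RunRecord]) -> dict[str, float]:
    """Average each efficiency metric across runs."""
    avg_tokens = mean(r.tokens for r in runs)
    return {
        "avg_tokens": avg_tokens,
        "avg_api_calls": mean(r.api_calls for r in runs),
        "avg_seconds": mean(r.seconds for r in runs),
        "avg_cost_usd": avg_tokens / 1000 * USD_PER_1K_TOKENS,
    }


runs = [RunRecord(1000, 4, 10.0), RunRecord(3000, 6, 20.0)]
summary = summarize(runs)
print(summary)
```

Efficiency numbers are most useful next to the success rate: a cheap run that fails the task is not an efficient one.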

3. Quality

Are results correct and useful?

Metrics:
- Accuracy (compared to ground truth)
- User satisfaction (human rating 1-5)
- Hallucination rate (share of claims that are false or unsupported)
- Relevance (how closely the output addresses the goal)
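
Two of these metrics reduce to simple arithmetic once responses have been annotated. The sketch below assumes a reviewer has counted the claims in each response and flagged the unsupported ones; treating hallucination rate as "unsupported claims / total claims" is one possible definition, and the field names are hypothetical.

```python
from statistics import mean

# Hypothetical annotations: claims made, claims flagged as unsupported,
# and a 1-5 user satisfaction rating for each response.
responses = [
    {"claims": 5, "unsupported": 1, "rating": 4},
    {"claims": 3, "unsupported": 0, "rating": 5},
    {"claims": 2, "unsupported": 1, "rating": 3},
]


def hallucination_rate(items: list[dict]) -> float:
    """Percentage of all claims that were flagged as unsupported."""
    total = sum(r["claims"] for r in items)
    flagged = sum(r["unsupported"] for r in items)
    return 100.0 * flagged / total if total else 0.0


def mean_satisfaction(items: list[dict]) -> float:
    """Average human rating on the 1-5 scale."""
    return mean(r["rating"] for r in items)


print(hallucination_rate(responses))  # prints 20.0
print(mean_satisfaction(responses))   # prints 4
```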

Benchmarks for Agents

HotpotQA

Multi-step question answering
Requires the agent to search, reason, and combine facts
Success = Correct final answer (typically scored by exact match or F1 against the reference answer)

Example:
Q: "What is the birth year of the director of Inception?"
Steps needed:
  1. Search "Inception director" → Christopher Nolan
  2. Search "Christopher Nolan birth year" → 1970
A: 1970
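
The two-step example above can be sketched as a chain of lookups followed by an exact-match check. The FACTS table is a toy stand-in for a real search tool, not the HotpotQA corpus, and the normalization shown (lowercase, strip punctuation and articles) is one common convention, assumed here for illustration.

```python
import re
import string

# Toy lookup table standing in for a search tool (hypothetical).
FACTS = {
    "Inception director": "Christopher Nolan",
    "Christopher Nolan birth year": "1970",
}


def search(query: str) -> str:
    """Pretend search: return the stored fact for a query."""
    return FACTS[query]


def answer_question() -> str:
    """Chain two lookups, feeding the first result into the second."""
    director = search("Inception director")   # step 1
    return search(f"{director} birth year")    # step 2


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    """True if prediction and reference agree after normalization."""
    return normalize(prediction) == normalize(reference)


print(exact_match(answer_question(), "1970"))  # prints True
```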

WebShop

E-commerce task: find and purchase a product within given constraints
The agent must navigate a website, filter options, and buy an item

Success = Item purchased matches requirements
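
A success check of this kind might look like the sketch below. It is a simplified, hypothetical version: the item fields, the attribute set, and the price limit are assumptions for illustration, and the benchmark's own scoring may be more fine-grained than a yes/no check.

```python
def meets_requirements(item: dict, required_attrs: set[str], max_price: float) -> bool:
    """True if the item has every required attribute and is within budget."""
    has_attrs = required_attrs <= set(item["attributes"])
    return has_attrs and item["price"] <= max_price


# Hypothetical purchase made by the agent.
purchased = {"name": "running shoes", "attributes": ["red", "size 10"], "price": 45.0}

print(meets_requirements(purchased, {"red", "size 10"}, max_price=50.0))  # prints True
print(meets_requirements(purchased, {"blue"}, max_price=50.0))            # prints False
```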

ALFWorld

Virtual household environment (like a text-based game)
The agent must navigate, find objects, and complete tasks

Success = The task's goal is reached in the environment

Evaluation Metrics

Accuracy

Accuracy = (Correct outputs / Total outputs) × 100%

F1 Score

Balances precision and recall
F1 = 2 × (precision × recall) / (precision + recall)
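
Precision is the share of predicted positives that are correct, and recall is the share of actual positives that were found. A minimal sketch of the F1 formula, assuming you already have counts of true positives, false positives, and false negatives (the counts below are made up):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from raw counts.

    Each ratio falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up counts: 8 true positives, 2 false positives, 4 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"{p:.3f} {r:.3f} {f1:.3f}")  # prints "0.800 0.667 0.727"
```

F1 is lower than the plain average of the two because the harmonic mean penalizes a gap between precision and recall.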

Tool Use Metrics

- Tool call accuracy (% of tool calls that are valid)
- Tool call necessity (% of tool calls that were actually needed)
- Tool diversity (does the agent use a range of tools?)
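
These three metrics can be computed from a log of tool calls. The sketch below assumes each call has been labeled as valid (well-formed and accepted by the tool) and needed (it contributed to the result); the log format and labels are hypothetical, and diversity is measured here simply as the number of distinct tools used.

```python
# Hypothetical labeled call log for one agent run.
calls = [
    {"tool": "search", "valid": True, "needed": True},
    {"tool": "search", "valid": True, "needed": False},
    {"tool": "calculator", "valid": False, "needed": True},
    {"tool": "browser", "valid": True, "needed": True},
]


def tool_metrics(log: list[dict]) -> dict[str, float]:
    """Summarize validity, necessity, and diversity of tool calls."""
    n = len(log)
    if n == 0:
        raise ValueError("call log is empty")
    return {
        "call_accuracy_pct": 100.0 * sum(c["valid"] for c in log) / n,
        "necessity_pct": 100.0 * sum(c["needed"] for c in log) / n,
        "distinct_tools": len({c["tool"] for c in log}),
    }


metrics = tool_metrics(calls)
print(metrics)
```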

Human Evaluation

Automated metrics sometimes miss important aspects of agent behavior.

Human evaluators rate each of the following:
- Correctness (1-5)
- Helpfulness (1-5)
- Safety (1-5)
- Efficiency (1-5)
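
Ratings from several evaluators are usually averaged per criterion. A minimal sketch, assuming each evaluator scores all four criteria on the 1-5 scale; the scores below are made up.

```python
from statistics import mean

# Made-up scores from three evaluators for a single agent response.
ratings = [
    {"correctness": 5, "helpfulness": 4, "safety": 5, "efficiency": 3},
    {"correctness": 4, "helpfulness": 4, "safety": 5, "efficiency": 2},
    {"correctness": 4, "helpfulness": 5, "safety": 5, "efficiency": 4},
]


def average_ratings(scores: list[dict[str, int]]) -> dict[str, float]:
    """Average each criterion across evaluators."""
    return {criterion: mean(s[criterion] for s in scores) for criterion in scores[0]}


averages = average_ratings(ratings)
for criterion, value in averages.items():
    print(f"{criterion}: {value:.2f}")
```

With only a few evaluators, it also helps to look at how much they disagree, not just at the average.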

Combine human and automated evaluation for the best assessment!
