Testing AI Agents — How to Evaluate Before You Deploy

Opening
Last year I deployed an Agent that worked flawlessly in staging but caused an incident on day three in production — the Agent mixed up "AM" and "PM" in conversations with customers in certain time zones, sending out a batch of incorrect meeting invitations. Root cause: my test cases all used UTC, with zero coverage for time zone conversions. That incident made me completely redesign my Agent testing framework. This article shares the full testing framework I use today.
Problem Background
Agent testing is fundamentally different from traditional software testing:
Non-deterministic output: The same input can produce different outputs. You can't test with assertEqual(output, expected).
Long behavior chains: An Agent might go through "understand question → select tool → call API → parse result → generate reply" — five steps, each of which can fail.
Heavy external dependencies: Agents depend on LLM APIs, databases, and external tools. The behavior of these dependencies is also non-deterministic.
Fuzzy correctness boundaries: What counts as a "correct" answer? If a customer asks "What's your cheapest plan?" and the Agent answers "$29/month" versus "Our starter plan is $29 per month," both are correct — but "about $30" is debatable.
Traditional unit tests and integration tests are still necessary, but they're not enough. Agents need an additional evaluation layer.
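To make the exact-match problem concrete, here is a minimal sketch (the replies are invented examples): instead of comparing output strings, assert on properties of the output — does the reply contain the correct fact?

```python
# Property-based check instead of exact string match. The replies below are
# invented examples, not real API output.

def contains_price(reply: str, expected: str = "$29") -> bool:
    """Does the reply state the correct price, however it's phrased?"""
    return expected in reply

# Two different but equally correct replies to "What's your cheapest plan?"
reply_a = "$29/month"
reply_b = "Our starter plan is $29 per month."

assert reply_a != reply_b        # exact match would flag one of them as wrong
assert contains_price(reply_a)   # the property check passes for both
assert contains_price(reply_b)
```

Property checks like this handle the deterministic slice of correctness; the fuzzier dimensions (tone, completeness) are what the LLM-as-Judge layer below is for.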
Core Framework
The Testing Pyramid (Agent Edition)
```
                      ╱╲
                     ╱  ╲
                    ╱ E2E╲           ← End-to-end scenario tests (few, time-consuming)
                   ╱ Tests╲
                  ╱────────╲
                 ╱  Agent   ╲        ← Agent behavior evaluation (LLM-as-judge)
                ╱ Evaluation ╲
               ╱──────────────╲
              ╱  Integration   ╲     ← Tool call + API integration tests
             ╱      Tests       ╲
            ╱────────────────────╲
           ╱      Unit Tests      ╲  ← Pure function tests (tool functions, parsing logic)
          ╱────────────────────────╲
```
Unit Tests: Test the parts that don't depend on LLMs — tool functions, data parsing, format validation. Fast, cheap, deterministic.
Integration Tests: Test tool call pipelines — can we correctly call APIs, parse return values, handle errors?
Agent Evaluation: Use an LLM to assess the quality of Agent outputs. This is the core of Agent testing, and the most challenging part.
E2E Tests: Full user scenario simulation, from input to final output, covering multi-turn conversations.
Implementation Details
Step 1: Unit Tests — Testing the Deterministic Parts
Extract all non-LLM-dependent logic from the Agent system and unit test it:
```python
import pytest
from agent.tools import parse_order_id, format_price, validate_email

# Testing deterministic functions: traditional asserts are fine
class TestToolFunctions:
    def test_parse_order_id_valid(self):
        assert parse_order_id("ORD-2026-0001") == "ORD-2026-0001"

    def test_parse_order_id_from_text(self):
        text = "My order number is ORD-2026-0001, can you check on it?"
        assert parse_order_id(text) == "ORD-2026-0001"

    def test_parse_order_id_invalid(self):
        assert parse_order_id("text without an order number") is None

    def test_format_price_cny(self):
        assert format_price(2999, "CNY") == "¥2,999.00"

    def test_format_price_usd(self):
        assert format_price(29.99, "USD") == "$29.99"

    def test_validate_email(self):
        assert validate_email("user@example.com") is True
        assert validate_email("not-an-email") is False
```
These tests run fast (millisecond-level), cost no API money, and can run on every commit.
Step 2: Integration Tests — Mock the LLM, Test the Tool Pipeline
```python
import pytest
from unittest.mock import AsyncMock, patch
from agent.core import CustomerServiceAgent

class TestToolIntegration:
    """Test whether the Agent's tool calls are correct"""

    @pytest.fixture
    def agent(self):
        return CustomerServiceAgent(model="claude-sonnet-4-5-20250514")

    @pytest.mark.asyncio
    async def test_order_query_calls_correct_tool(self, agent):
        """Verify that an order query message triggers the correct tool call"""
        # Mock the LLM to return a tool_use response
        mock_response = create_mock_tool_use_response(
            tool_name="get_order_status",
            tool_input={"order_id": "ORD-2026-0001"},
        )
        with patch.object(agent.client.messages, "create", return_value=mock_response):
            # Mock tool execution
            with patch.object(agent, "execute_tool") as mock_tool:
                mock_tool.return_value = '{"status": "shipped", "tracking": "SF1234"}'
                await agent.handle_message(
                    "What's the status of my order ORD-2026-0001?",
                    customer_id="CUST001",
                )
                # Verify the correct tool and arguments were called
                mock_tool.assert_called_once_with(
                    "get_order_status",
                    {"order_id": "ORD-2026-0001"},
                )

    @pytest.mark.asyncio
    async def test_tool_error_handling(self, agent):
        """Verify graceful error handling when a tool call fails"""
        with patch.object(agent, "execute_tool", side_effect=TimeoutError("API timeout")):
            result = await agent.handle_message(
                "Check order ORD-2026-0001",
                customer_id="CUST001",
            )
            # Agent should degrade gracefully, not throw an exception
            assert "later" in result or "retry" in result or "human" in result
```
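The test above relies on a `create_mock_tool_use_response` helper that the article doesn't define. One way to sketch it is with `SimpleNamespace` objects shaped like the Anthropic SDK's response (the field names mirror the real SDK, but this exact helper is my assumption, not the article's code):

```python
from types import SimpleNamespace

def create_mock_tool_use_response(tool_name: str, tool_input: dict):
    """Build an object shaped like an Anthropic Messages API response that
    contains a single tool_use content block. A test-only sketch — attribute
    names follow the SDK's response objects (stop_reason, content, .name,
    .input), but this is not an official helper."""
    tool_block = SimpleNamespace(
        type="tool_use",
        id="toolu_test_001",   # placeholder tool-use id
        name=tool_name,
        input=tool_input,
    )
    return SimpleNamespace(
        stop_reason="tool_use",
        content=[tool_block],
    )
```

Because `patch.object` swaps out the real `messages.create`, the agent code only ever touches these attributes, so a duck-typed stand-in is enough.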
Step 3: LLM-as-Judge — Evaluating Agent Output Quality
This is the core of Agent testing. Use a "judge model" to evaluate the quality of Agent outputs:
```python
import anthropic
import json
from dataclasses import dataclass

@dataclass
class EvalCase:
    """Evaluation test case"""
    input_message: str      # User input
    context: str            # Scenario context
    expected_behavior: str  # Expected behavior (natural language description)
    criteria: list[str]     # Evaluation dimensions

@dataclass
class EvalResult:
    """Evaluation result"""
    score: float            # 0-1
    passed: bool
    reasoning: str
    dimension_scores: dict[str, float]

class LLMJudge:
    """Use an LLM to evaluate Agent output quality"""

    def __init__(self):
        # Async client, since evaluate() is awaited
        self.client = anthropic.AsyncAnthropic()

    async def evaluate(
        self,
        eval_case: EvalCase,
        agent_response: str,
    ) -> EvalResult:
        """Score an Agent's reply"""
        judge_prompt = f"""You are an expert evaluator of AI Agent output quality.

## Evaluation Scenario
User input: {eval_case.input_message}
Scenario context: {eval_case.context}
Expected behavior: {eval_case.expected_behavior}

## Agent's Actual Reply
{agent_response}

## Evaluation Dimensions
{json.dumps(eval_case.criteria, ensure_ascii=False)}

## Evaluation Rules
- Score each dimension (0.0-1.0)
- Focus on factual accuracy (most important), information completeness, tone appropriateness
- If the Agent fabricated information, give that dimension a 0
- If the Agent correctly expressed uncertainty and suggested escalation to a human, it should still score high even without a direct answer

Output JSON:
{{
  "overall_score": 0.0-1.0,
  "dimensions": {{
    "dimension_name": {{"score": 0.0-1.0, "reason": "explanation"}}
  }},
  "critical_issues": ["list of critical issues"],
  "reasoning": "overall evaluation rationale"
}}"""

        response = await self.client.messages.create(
            model="claude-sonnet-4-5-20250514",  # Use Sonnet as the judge model
            max_tokens=1024,
            messages=[{"role": "user", "content": judge_prompt}],
        )
        result = json.loads(response.content[0].text)
        return EvalResult(
            score=result["overall_score"],
            passed=result["overall_score"] >= 0.75,
            reasoning=result["reasoning"],
            dimension_scores={
                k: v["score"] for k, v in result["dimensions"].items()
            },
        )

# Evaluation test suite
EVAL_SUITE = [
    EvalCase(
        input_message="What's your cheapest plan?",
        context="Product has three plans: Starter $29/month, Pro $79/month, Enterprise custom",
        expected_behavior="Accurately report the Starter plan price of $29/month",
        criteria=["Price accuracy", "Information completeness", "Tone appropriateness"],
    ),
    EvalCase(
        input_message="I want a refund",
        context="Refund policy: Full refund within 30 days of purchase, requires human processing",
        expected_behavior="State the refund policy, then escalate to a human agent",
        criteria=["Policy accuracy", "Correct escalation to human", "Tone appropriateness"],
    ),
    EvalCase(
        input_message="Do you support SAML SSO?",
        context="The knowledge base has no information about SAML SSO",
        expected_behavior="Acknowledge uncertainty, suggest escalation to human or checking docs; do not fabricate an answer",
        criteria=["Honesty", "Avoidance of fabrication", "Guidance behavior"],
    ),
]
```
Step 4: A/B Testing Framework
Compare two Agent versions in production:
```python
import hashlib
import time
from dataclasses import dataclass

@dataclass
class ABTestConfig:
    """A/B test configuration"""
    test_name: str
    variant_a: dict        # Agent A config
    variant_b: dict        # Agent B config
    traffic_split: float   # Traffic ratio for A (0.0-1.0)
    min_sample_size: int   # Minimum sample size

@dataclass
class ABTestMetrics:
    """A/B test metrics"""
    variant: str
    response_time_ms: float
    token_cost: float
    customer_satisfaction: float | None = None
    escalation_rate: float = 0.0

class ABTestRunner:
    def __init__(self, config: ABTestConfig):
        self.config = config
        self.metrics: dict[str, list[ABTestMetrics]] = {"A": [], "B": []}

    def assign_variant(self, customer_id: str) -> str:
        """Deterministic grouping based on customer ID (same customer always
        in the same group). Note: Python's built-in hash() is randomized per
        process, so use a stable hash for bucketing."""
        digest = hashlib.md5(customer_id.encode()).hexdigest()
        hash_val = int(digest, 16) % 100
        return "A" if hash_val < self.config.traffic_split * 100 else "B"

    async def run_and_record(
        self, customer_id: str, message: str
    ) -> tuple[str, str]:
        """Run the A/B test and record metrics"""
        variant = self.assign_variant(customer_id)
        agent_config = (
            self.config.variant_a if variant == "A"
            else self.config.variant_b
        )
        start = time.time()
        response = await run_agent(message, customer_id, agent_config)
        elapsed = (time.time() - start) * 1000
        self.metrics[variant].append(ABTestMetrics(
            variant=variant,
            response_time_ms=elapsed,
            token_cost=response["token_cost"],
        ))
        return variant, response["text"]

    def get_summary(self) -> dict:
        """Summarize A/B test results"""
        summary = {}
        for variant in ("A", "B"):
            data = self.metrics[variant]
            if not data:
                continue
            times = sorted(m.response_time_ms for m in data)
            summary[variant] = {
                "sample_size": len(data),
                "avg_response_ms": sum(m.response_time_ms for m in data) / len(data),
                "avg_cost": sum(m.token_cost for m in data) / len(data),
                "p95_response_ms": times[int(len(times) * 0.95)],
            }
        return summary
```
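The summary above only reports averages; deciding whether the difference between A and B is real needs a significance check. This isn't part of the article's framework — it's a stdlib-only sketch using Welch's t statistic with a normal approximation (adequate at the sample sizes an A/B test needs anyway):

```python
import math
from statistics import mean, variance

def welch_t_pvalue(a: list[float], b: list[float]) -> float:
    """Two-sided p-value for a difference in means between two samples
    (e.g. per-variant response times or satisfaction scores). Uses Welch's
    t statistic with a normal approximation — a sketch, not a substitute
    for a proper stats library at small sample sizes."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2))
```

A typical gate: only promote variant B if the p-value is below 0.05 *and* both groups have reached `min_sample_size`.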
Step 5: Regression Tests — Run These Every Time You Change a Prompt
```python
# Regression test suite
import json

class AgentRegressionTest:
    """Tests that must run after every prompt or model change"""

    def __init__(self):
        self.judge = LLMJudge()
        self.test_cases = self._load_test_cases()

    def _load_test_cases(self) -> list[EvalCase]:
        """Load test cases from a JSON file"""
        with open("tests/regression_cases.json") as f:
            cases = json.load(f)
        return [EvalCase(**case) for case in cases]

    async def run_full_suite(self, agent) -> dict:
        """Run the full regression suite"""
        results = []
        for case in self.test_cases:
            # Run the Agent
            response = await agent.handle_message(
                case.input_message,
                customer_id="TEST_USER",
            )
            # LLM evaluation
            eval_result = await self.judge.evaluate(case, response)
            results.append({
                "case": case.input_message[:50],
                "score": eval_result.score,
                "passed": eval_result.passed,
                "reasoning": eval_result.reasoning,
            })
        # Summary
        total = len(results)
        passed = sum(1 for r in results if r["passed"])
        avg_score = sum(r["score"] for r in results) / total
        return {
            "total_cases": total,
            "passed": passed,
            "failed": total - passed,
            "pass_rate": passed / total,
            "avg_score": avg_score,
            "failed_cases": [r for r in results if not r["passed"]],
        }
```
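To make the suite an actual safety net, its output has to gate the pipeline. A minimal sketch of a CI gate over the summary dict (the 0.95 threshold is illustrative — set your own):

```python
def check_regression_gate(summary: dict, min_pass_rate: float = 0.95) -> bool:
    """CI gate over a run_full_suite()-style summary: print failing cases
    and return False if the pass rate drops below the threshold. In a real
    pipeline you'd sys.exit(1) on failure."""
    if summary["pass_rate"] < min_pass_rate:
        for case in summary["failed_cases"]:
            print(f"FAILED ({case['score']:.2f}): {case['case']}")
        return False
    return True
```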
Lessons from the Field
Production Data
My customer service Agent testing stack:
| Test Layer | # Cases | Runtime | Frequency | Cost/Run |
|---|---|---|---|---|
| Unit Tests | 127 | 3 sec | Every commit | $0 |
| Integration Tests | 43 | 25 sec | Every PR | $0 |
| LLM-as-Judge Eval | 85 | 12 min | Every prompt change | $2.50 |
| Full Regression Suite | 85 | 15 min | Weekly | $3.20 |
| A/B Tests | Ongoing | Ongoing | On new versions | ~$50/week |
Pitfalls We Hit
Pitfall 1: LLM-as-Judge bias. LLMs tend to give higher scores to longer, more formal answers. Solution: Explicitly state in the judge prompt that "concise and precise answers are preferred over verbose but accurate ones," and add a length penalty.
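The length penalty can live outside the judge prompt as a post-processing step. A sketch — the target length and penalty rate are illustrative constants, to be tuned against your own eval data:

```python
def apply_length_penalty(
    score: float,
    reply: str,
    target_chars: int = 400,     # illustrative: tune per use case
    penalty_per_100: float = 0.02,
) -> float:
    """Post-hoc correction for the judge's verbosity bias: subtract a small
    penalty for every 100 characters beyond the target length, clamped to
    the [0, 1] score range."""
    excess = max(0, len(reply) - target_chars)
    return max(0.0, score - (excess / 100) * penalty_per_100)
```

Applying the penalty after scoring keeps the judge prompt simpler and makes the correction itself unit-testable.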
Pitfall 2: Test case maintenance overhead. Every time a product feature updates, test cases need updating too. We established a rule: every new feature's PRD must include 3-5 Agent test cases.
Pitfall 3: Regression test result variance. The same test suite can produce LLM-as-Judge scores that fluctuate 5-10% between runs. Solution: Run each case 3 times and take the median to reduce randomness.
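The run-3-take-median mitigation is a small wrapper around the judge. A sketch assuming the `LLMJudge.evaluate()` interface from Step 3 (an awaitable that returns an object with a `.score` attribute):

```python
import asyncio
from statistics import median

async def score_with_median(judge, case, response, runs: int = 3) -> float:
    """Score the same (case, response) pair `runs` times concurrently and
    take the median, damping judge randomness between runs."""
    results = await asyncio.gather(
        *(judge.evaluate(case, response) for _ in range(runs))
    )
    return median(r.score for r in results)
```

The median is preferable to the mean here: a single outlier run can't drag a case across the pass/fail threshold.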
Pitfall 4: A/B test sample size issues. A product with 20K DAU only generates a few hundred support requests per day. Reaching statistical significance takes 2-3 weeks. Workaround: Run synthetic evaluations in parallel (use an LLM to simulate customer test messages).
Conclusion
Three core takeaways:
- Agent testing is layered, not one-size-fits-all — Use traditional unit tests for deterministic parts, integration tests for tool pipelines, LLM-as-Judge for output quality, and A/B tests for production performance. Each layer solves a different problem.
- LLM-as-Judge is the core technique for Agent evaluation, but know its limitations — The judge model has its own biases (favoring longer answers, formal tone). These need explicit correction in the prompt. Critical quality metrics (pricing, policies) still need rule-based checks.
- Regression tests are your safety net — Every time you change a prompt, swap a model, or update the knowledge base, you must run regression tests. Without this safety net, you won't dare change anything, and the Agent will stay frozen at version one.
If you haven't built a testing framework for your Agent yet, start simple: write 10 core scenario LLM-as-Judge evaluation cases, paired with 20 deterministic unit tests. That combination covers the most common issues.
How do you test your Agents? Have you hit similar pitfalls? I'd love to hear about it.