Testing AI Agents — How to Evaluate Before You Deploy

Opening
Last year I deployed an Agent that worked flawlessly in staging but caused an incident on day three in production — the Agent mixed up "AM" and "PM" in conversations with customers in certain time zones, sending out a batch of incorrect meeting invitations. Root cause: my test cases all used UTC, with zero coverage for time zone conversions. That incident made me completely redesign my Agent testing framework. This article shares the full testing framework I use today.
Problem Background
Agent testing is fundamentally different from traditional software testing:
Non-deterministic output: The same input can produce different outputs. You can't test with assertEqual(output, expected).
Long behavior chains: An Agent might go through "understand question → select tool → call API → parse result → generate reply" — five steps, each of which can fail.
Heavy external dependencies: Agents depend on LLM APIs, databases, and external tools. The behavior of these dependencies is also non-deterministic.
Fuzzy correctness boundaries: What counts as a "correct" answer? If a customer asks "What's your cheapest plan?" and the Agent answers "$29/month" versus "Our starter plan is $29 per month," both are correct — but "about $30" is debatable.
Traditional unit tests and integration tests are still necessary, but they're not enough. Agents need an additional evaluation layer.
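To make the exact-match problem concrete, here is a minimal sketch (the replies are invented examples): instead of comparing output strings, assert on properties of the output — does the reply contain the correct fact?

```python
# Property-based check instead of exact string match. The replies below are
# invented examples, not real API output.

def contains_price(reply: str, expected: str = "$29") -> bool:
    """Does the reply state the correct price, however it's phrased?"""
    return expected in reply

# Two different but equally correct replies to "What's your cheapest plan?"
reply_a = "$29/month"
reply_b = "Our starter plan is $29 per month."

assert reply_a != reply_b        # exact match would flag one of them as wrong
assert contains_price(reply_a)   # the property check passes for both
assert contains_price(reply_b)
```

Property checks like this handle the deterministic slice of correctness; the fuzzier dimensions (tone, completeness) are what the LLM-as-Judge layer below is for.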
Core Framework
The Testing Pyramid (Agent Edition)
```
                      ╱╲
                     ╱  ╲
                    ╱ E2E╲           ← End-to-end scenario tests (few, time-consuming)
                   ╱ Tests╲
                  ╱────────╲
                 ╱  Agent   ╲        ← Agent behavior evaluation (LLM-as-judge)
                ╱ Evaluation ╲
               ╱──────────────╲
              ╱  Integration   ╲     ← Tool call + API integration tests
             ╱      Tests       ╲
            ╱────────────────────╲
           ╱      Unit Tests      ╲  ← Pure function tests (tool functions, parsing logic)
          ╱────────────────────────╲
```
Unit Tests: Test the parts that don't depend on LLMs — tool functions, data parsing, format validation. Fast, cheap, deterministic.
Integration Tests: Test tool call pipelines — can we correctly call APIs, parse return values, handle errors?
Agent Evaluation: Use an LLM to assess the quality of Agent outputs. This is the core of Agent testing, and the most challenging part.
E2E Tests: Full user scenario simulation, from input to final output, covering multi-turn conversations.
Implementation Details
Step 1: Unit Tests — Testing the Deterministic Parts
Extract all non-LLM-dependent logic from the Agent system and unit test it:
```python
import pytest
from agent.tools import parse_order_id, format_price, validate_email

# Testing deterministic functions: traditional asserts are fine
class TestToolFunctions:
    def test_parse_order_id_valid(self):
        assert parse_order_id("ORD-2026-0001") == "ORD-2026-0001"

    def test_parse_order_id_from_text(self):
        text = "My order number is ORD-2026-0001, can you check on it?"
        assert parse_order_id(text) == "ORD-2026-0001"

    def test_parse_order_id_invalid(self):
        assert parse_order_id("text without an order number") is None

    def test_format_price_cny(self):
        assert format_price(2999, "CNY") == "¥2,999.00"

    def test_format_price_usd(self):
        assert format_price(29.99, "USD") == "$29.99"

    def test_validate_email(self):
        assert validate_email("user@example.com") is True
        assert validate_email("not-an-email") is False
```
These tests run fast (millisecond-level), cost no API money, and can run on every commit.
Step 2: Integration Tests — Mock the LLM, Test the Tool Pipeline
```python
import pytest
from unittest.mock import AsyncMock, patch
from agent.core import CustomerServiceAgent

class TestToolIntegration:
    """Test whether the Agent's tool calls are correct"""

    @pytest.fixture
    def agent(self):
        return CustomerServiceAgent(model="claude-sonnet-4-5-20250514")

    @pytest.mark.asyncio
    async def test_order_query_calls_correct_tool(self, agent):
        """Verify that an order query message triggers the correct tool call"""
        # Mock the LLM to return a tool_use response
        mock_response = create_mock_tool_use_response(
            tool_name="get_order_status",
            tool_input={"order_id": "ORD-2026-0001"},
        )
        with patch.object(agent.client.messages, "create", return_value=mock_response):
            # Mock tool execution
            with patch.object(agent, "execute_tool") as mock_tool:
                mock_tool.return_value = '{"status": "shipped", "tracking": "SF1234"}'
                await agent.handle_message(
                    "What's the status of my order ORD-2026-0001?",
                    customer_id="CUST001",
                )
                # Verify the correct tool and arguments were called
                mock_tool.assert_called_once_with(
                    "get_order_status",
                    {"order_id": "ORD-2026-0001"},
                )

    @pytest.mark.asyncio
    async def test_tool_error_handling(self, agent):
        """Verify graceful error handling when a tool call fails"""
        with patch.object(agent, "execute_tool", side_effect=TimeoutError("API timeout")):
            result = await agent.handle_message(
                "Check order ORD-2026-0001",
                customer_id="CUST001",
            )
            # Agent should degrade gracefully, not throw an exception
            assert "later" in result or "retry" in result or "human" in result
```
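The test above relies on a `create_mock_tool_use_response` helper that the article doesn't define. One way to sketch it is with `SimpleNamespace` objects shaped like the Anthropic SDK's response (the field names mirror the real SDK, but this exact helper is my assumption, not the article's code):

```python
from types import SimpleNamespace

def create_mock_tool_use_response(tool_name: str, tool_input: dict):
    """Build an object shaped like an Anthropic Messages API response that
    contains a single tool_use content block. A test-only sketch — attribute
    names follow the SDK's response objects (stop_reason, content, .name,
    .input), but this is not an official helper."""
    tool_block = SimpleNamespace(
        type="tool_use",
        id="toolu_test_001",   # placeholder tool-use id
        name=tool_name,
        input=tool_input,
    )
    return SimpleNamespace(
        stop_reason="tool_use",
        content=[tool_block],
    )
```

Because `patch.object` swaps out the real `messages.create`, the agent code only ever touches these attributes, so a duck-typed stand-in is enough.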
Step 3: LLM-as-Judge — Evaluating Agent Output Quality
This is the core of Agent testing. Use a "judge model" to evaluate the quality of Agent outputs:
```python
import anthropic
import json
from dataclasses import dataclass

@dataclass
class EvalCase:
    """Evaluation test case"""
    input_message: str      # User input
    context: str            # Scenario context
    expected_behavior: str  # Expected behavior (natural language description)
    criteria: list[str]     # Evaluation dimensions

@dataclass
class EvalResult:
    """Evaluation result"""
    score: float            # 0-1
    passed: bool
    reasoning: str
    dimension_scores: dict[str, float]

class LLMJudge:
    """Use an LLM to evaluate Agent output quality"""

    def __init__(self):
        # Async client, since evaluate() is awaited
        self.client = anthropic.AsyncAnthropic()

    async def evaluate(
        self,
        eval_case: EvalCase,
        agent_response: str,
    ) -> EvalResult:
        """Score an Agent's reply"""
        judge_prompt = f"""You are an expert evaluator of AI Agent output quality.

## Evaluation Scenario
User input: {eval_case.input_message}
Scenario context: {eval_case.context}
Expected behavior: {eval_case.expected_behavior}

## Agent's Actual Reply
{agent_response}

## Evaluation Dimensions
{json.dumps(eval_case.criteria, ensure_ascii=False)}

## Evaluation Rules
- Score each dimension (0.0-1.0)
- Focus on factual accuracy (most important), information completeness, tone appropriateness
- If the Agent fabricated information, give that dimension a 0
- If the Agent correctly expressed uncertainty and suggested escalation to a human, it should still score high even without a direct answer

Output JSON:
{{
  "overall_score": 0.0-1.0,
  "dimensions": {{
    "dimension_name": {{"score": 0.0-1.0, "reason": "explanation"}}
  }},
  "critical_issues": ["list of critical issues"],
  "reasoning": "overall evaluation rationale"
}}"""

        response = await self.client.messages.create(
            model="claude-sonnet-4-5-20250514",  # Use Sonnet as the judge model
            max_tokens=1024,
            messages=[{"role": "user", "content": judge_prompt}],
        )
        result = json.loads(response.content[0].text)
        return EvalResult(
            score=result["overall_score"],
            passed=result["overall_score"] >= 0.75,
            reasoning=result["reasoning"],
            dimension_scores={
                k: v["score"] for k, v in result["dimensions"].items()
            },
        )

# Evaluation test suite
EVAL_SUITE = [
    EvalCase(
        input_message="What's your cheapest plan?",
        context="Product has three plans: Starter $29/month, Pro $79/month, Enterprise custom",
        expected_behavior="Accurately report the Starter plan price of $29/month",
        criteria=["Price accuracy", "Information completeness", "Tone appropriateness"],
    ),
    EvalCase(
        input_message="I want a refund",
        context="Refund policy: Full refund within 30 days of purchase, requires human processing",
        expected_behavior="State the refund policy, then escalate to a human agent",
        criteria=["Policy accuracy", "Correct escalation to human", "Tone appropriateness"],
    ),
    EvalCase(
        input_message="Do you support SAML SSO?",
        context="The knowledge base has no information about SAML SSO",
        expected_behavior="Acknowledge uncertainty, suggest escalation to human or checking docs; do not fabricate an answer",
        criteria=["Honesty", "Avoidance of fabrication", "Guidance behavior"],
    ),
]
```
Step 4: A/B Testing Framework
Compare two Agent versions in production:
```python
import hashlib
import time
from dataclasses import dataclass

@dataclass
class ABTestConfig:
    """A/B test configuration"""
    test_name: str
    variant_a: dict        # Agent A config
    variant_b: dict        # Agent B config
    traffic_split: float   # Traffic ratio for A (0.0-1.0)
    min_sample_size: int   # Minimum sample size

@dataclass
class ABTestMetrics:
    """A/B test metrics"""
    variant: str
    response_time_ms: float
    token_cost: float
    customer_satisfaction: float | None = None
    escalation_rate: float = 0.0

class ABTestRunner:
    def __init__(self, config: ABTestConfig):
        self.config = config
        self.metrics: dict[str, list[ABTestMetrics]] = {"A": [], "B": []}

    def assign_variant(self, customer_id: str) -> str:
        """Deterministic grouping based on customer ID (same customer always
        in the same group). Note: Python's built-in hash() is randomized per
        process, so use a stable hash for bucketing."""
        digest = hashlib.md5(customer_id.encode()).hexdigest()
        hash_val = int(digest, 16) % 100
        return "A" if hash_val < self.config.traffic_split * 100 else "B"

    async def run_and_record(
        self, customer_id: str, message: str
    ) -> tuple[str, str]:
        """Run the A/B test and record metrics"""
        variant = self.assign_variant(customer_id)
        agent_config = (
            self.config.variant_a if variant == "A"
            else self.config.variant_b
        )
        start = time.time()
        response = await run_agent(message, customer_id, agent_config)
        elapsed = (time.time() - start) * 1000
        self.metrics[variant].append(ABTestMetrics(
            variant=variant,
            response_time_ms=elapsed,
            token_cost=response["token_cost"],
        ))
        return variant, response["text"]

    def get_summary(self) -> dict:
        """Summarize A/B test results"""
        summary = {}
        for variant in ("A", "B"):
            data = self.metrics[variant]
            if not data:
                continue
            times = sorted(m.response_time_ms for m in data)
            summary[variant] = {
                "sample_size": len(data),
                "avg_response_ms": sum(m.response_time_ms for m in data) / len(data),
                "avg_cost": sum(m.token_cost for m in data) / len(data),
                "p95_response_ms": times[int(len(times) * 0.95)],
            }
        return summary
```
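The summary above only reports averages; deciding whether the difference between A and B is real needs a significance check. This isn't part of the article's framework — it's a stdlib-only sketch using Welch's t statistic with a normal approximation (adequate at the sample sizes an A/B test needs anyway):

```python
import math
from statistics import mean, variance

def welch_t_pvalue(a: list[float], b: list[float]) -> float:
    """Two-sided p-value for a difference in means between two samples
    (e.g. per-variant response times or satisfaction scores). Uses Welch's
    t statistic with a normal approximation — a sketch, not a substitute
    for a proper stats library at small sample sizes."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2))
```

A typical gate: only promote variant B if the p-value is below 0.05 *and* both groups have reached `min_sample_size`.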
Step 5: Regression Tests — Run These Every Time You Change a Prompt
```python
# Regression test suite
import json

class AgentRegressionTest:
    """Tests that must run after every prompt or model change"""

    def __init__(self):
        self.judge = LLMJudge()
        self.test_cases = self._load_test_cases()

    def _load_test_cases(self) -> list[EvalCase]:
        """Load test cases from a JSON file"""
        with open("tests/regression_cases.json") as f:
            cases = json.load(f)
        return [EvalCase(**case) for case in cases]

    async def run_full_suite(self, agent) -> dict:
        """Run the full regression suite"""
        results = []
        for case in self.test_cases:
            # Run the Agent
            response = await agent.handle_message(
                case.input_message,
                customer_id="TEST_USER",
            )
            # LLM evaluation
            eval_result = await self.judge.evaluate(case, response)
            results.append({
                "case": case.input_message[:50],
                "score": eval_result.score,
                "passed": eval_result.passed,
                "reasoning": eval_result.reasoning,
            })
        # Summary
        total = len(results)
        passed = sum(1 for r in results if r["passed"])
        avg_score = sum(r["score"] for r in results) / total
        return {
            "total_cases": total,
            "passed": passed,
            "failed": total - passed,
            "pass_rate": passed / total,
            "avg_score": avg_score,
            "failed_cases": [r for r in results if not r["passed"]],
        }
```
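To make the suite an actual safety net, its output has to gate the pipeline. A minimal sketch of a CI gate over the summary dict (the 0.95 threshold is illustrative — set your own):

```python
def check_regression_gate(summary: dict, min_pass_rate: float = 0.95) -> bool:
    """CI gate over a run_full_suite()-style summary: print failing cases
    and return False if the pass rate drops below the threshold. In a real
    pipeline you'd sys.exit(1) on failure."""
    if summary["pass_rate"] < min_pass_rate:
        for case in summary["failed_cases"]:
            print(f"FAILED ({case['score']:.2f}): {case['case']}")
        return False
    return True
```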
Lessons from the Field
Production Data
My customer service Agent testing stack:
| Test Layer | # Cases | Runtime | Frequency | Cost/Run |
|---|---|---|---|---|
| Unit Tests | 127 | 3 sec | Every commit | $0 |
| Integration Tests | 43 | 25 sec | Every PR | $0 |
| LLM-as-Judge Eval | 85 | 12 min | Every prompt change | $2.50 |
| Full Regression Suite | 85 | 15 min | Weekly | $3.20 |
| A/B Tests | Ongoing | Ongoing | On new versions | ~$50/week |
Pitfalls We Hit
Pitfall 1: LLM-as-Judge bias. LLMs tend to give higher scores to longer, more formal answers. Solution: Explicitly state in the judge prompt that "concise and precise answers are preferred over verbose but accurate ones," and add a length penalty.
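The length penalty can live outside the judge prompt as a post-processing step. A sketch — the target length and penalty rate are illustrative constants, to be tuned against your own eval data:

```python
def apply_length_penalty(
    score: float,
    reply: str,
    target_chars: int = 400,     # illustrative: tune per use case
    penalty_per_100: float = 0.02,
) -> float:
    """Post-hoc correction for the judge's verbosity bias: subtract a small
    penalty for every 100 characters beyond the target length, clamped to
    the [0, 1] score range."""
    excess = max(0, len(reply) - target_chars)
    return max(0.0, score - (excess / 100) * penalty_per_100)
```

Applying the penalty after scoring keeps the judge prompt simpler and makes the correction itself unit-testable.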
Pitfall 2: Test case maintenance overhead. Every time a product feature updates, test cases need updating too. We established a rule: every new feature's PRD must include 3-5 Agent test cases.
Pitfall 3: Regression test result variance. The same test suite can produce LLM-as-Judge scores that fluctuate 5-10% between runs. Solution: Run each case 3 times and take the median to reduce randomness.
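The run-3-take-median mitigation is a small wrapper around the judge. A sketch assuming the `LLMJudge.evaluate()` interface from Step 3 (an awaitable that returns an object with a `.score` attribute):

```python
import asyncio
from statistics import median

async def score_with_median(judge, case, response, runs: int = 3) -> float:
    """Score the same (case, response) pair `runs` times concurrently and
    take the median, damping judge randomness between runs."""
    results = await asyncio.gather(
        *(judge.evaluate(case, response) for _ in range(runs))
    )
    return median(r.score for r in results)
```

The median is preferable to the mean here: a single outlier run can't drag a case across the pass/fail threshold.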
Pitfall 4: A/B test sample size issues. A product with 20K DAU only generates a few hundred support requests per day. Reaching statistical significance takes 2-3 weeks. Workaround: Run synthetic evaluations in parallel (use an LLM to simulate customer test messages).
Conclusion
Three core takeaways:
- Agent testing is layered, not one-size-fits-all — Use traditional unit tests for deterministic parts, integration tests for tool pipelines, LLM-as-Judge for output quality, and A/B tests for production performance. Each layer solves a different problem.
- LLM-as-Judge is the core technique for Agent evaluation, but know its limitations — The judge model has its own biases (favoring longer answers, formal tone). These need explicit correction in the prompt. Critical quality metrics (pricing, policies) still need rule-based checks.
- Regression tests are your safety net — Every time you change a prompt, swap a model, or update the knowledge base, you must run regression tests. Without this safety net, you won't dare change anything, and the Agent will stay frozen at version one.
If you haven't built a testing framework for your Agent yet, start simple: write 10 core scenario LLM-as-Judge evaluation cases, paired with 20 deterministic unit tests. That combination covers the most common issues.
How do you test your Agents? Have you hit similar pitfalls? I'd love to hear about it.