
From Prototype to Production — An Enterprise AI Agent Launch Checklist

Opening

I've seen too many Agent projects die on the road from prototype to production. Not because the model was bad or features were missing, but because they lacked the "boring but critical" infrastructure: logging, monitoring, rate limiting, error handling, security auditing. Closing the gap between an Agent that runs in a Jupyter notebook and one that runs reliably in production takes at least 40% of the total effort. This article is the complete checklist I compiled from helping three enterprise clients launch their Agents.

Problem Background

The core differences between prototype and production:

| Dimension | Prototype | Production |
| --- | --- | --- |
| Users | Just you | Hundreds to tens of thousands |
| Error tolerance | Wrong? Just rerun | Wrong? Lose customers |
| Uptime | Run once | 24/7 |
| Data sensitivity | Test data | Real user data |
| Cost control | Doesn't matter | Every cent counts |
| Observability | Print debugging | Full tracing |

Enterprise clients care most about three questions: How will I know if this Agent breaks? Will it leak data? Can I keep monthly costs within budget?

The Launch Checklist

I've organized the checklist into 6 modules, ordered by priority.

Module 1: Observability

This is the most important module. An Agent whose internal state you can't see is no different from a black box.

import logging
import time
import json
from contextlib import contextmanager
from dataclasses import dataclass, field
from uuid import uuid4

# Structured logging configuration
# Note: %(message)s is interpolated unquoted, so messages passed to this
# logger should themselves be valid JSON (as in traced_agent_call below)
logging.basicConfig(
    format='{"timestamp":"%(asctime)s","level":"%(levelname)s","message":%(message)s}',
    level=logging.INFO
)
logger = logging.getLogger("agent")

@dataclass
class AgentTrace:
    """Complete trace record of an Agent execution"""
    trace_id: str = field(default_factory=lambda: str(uuid4()))
    steps: list[dict] = field(default_factory=list)
    total_tokens: int = 0
    total_cost: float = 0.0
    start_time: float = 0.0

    def add_step(self, step_type: str, **kwargs):
        self.steps.append({
            "type": step_type,
            "timestamp": time.time(),
            **kwargs
        })

    def log_llm_call(self, model: str, input_tokens: int, output_tokens: int):
        """Record token consumption for each LLM call"""
        cost = self._calculate_cost(model, input_tokens, output_tokens)
        self.total_tokens += input_tokens + output_tokens
        self.total_cost += cost
        self.add_step(
            "llm_call",
            model=model,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            cost=cost
        )

    def log_tool_call(self, tool_name: str, input_data: dict, output: str, latency_ms: float):
        """Record tool calls"""
        self.add_step(
            "tool_call",
            tool=tool_name,
            input=input_data,
            output_length=len(output),
            latency_ms=latency_ms
        )

    def finalize(self) -> dict:
        """Generate the final trace report"""
        elapsed = time.time() - self.start_time
        return {
            "trace_id": self.trace_id,
            "total_steps": len(self.steps),
            "total_tokens": self.total_tokens,
            "total_cost": self.total_cost,
            "total_time_seconds": elapsed,
            "steps": self.steps
        }

    @staticmethod
    def _calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
        # Claude API pricing as of March 2026
        pricing = {
            "claude-sonnet-4-5-20250514": {"input": 3.0, "output": 15.0},
            "claude-haiku-4-5-20250514": {"input": 1.0, "output": 5.0},
        }
        rates = pricing.get(model, {"input": 3.0, "output": 15.0})
        return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000

# Usage
@contextmanager
def traced_agent_call(user_id: str):
    """Context manager: automatically trace Agent calls"""
    trace = AgentTrace(start_time=time.time())
    try:
        yield trace
    finally:
        report = trace.finalize()
        logger.info(json.dumps({
            "event": "agent_call_complete",
            "user_id": user_id,
            **report
        }))
        # Write to the monitoring system (Prometheus/DataDog/custom);
        # `metrics` is assumed to be your metrics client, wired up elsewhere
        metrics.record_agent_call(report)

Metrics you must monitor:

  • Token consumption and cost per request
  • End-to-end latency (P50, P95, P99)
  • Tool call success rate and latency
  • LLM API error rate
  • Human escalation rate
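For the latency percentiles, a nearest-rank computation over a window of samples is enough for a sketch (the `samples` list below is illustrative; in production a Prometheus histogram or your APM tool does this for you):

```python
import math

def percentile(latencies_ms: list[float], p: float) -> float:
    """Nearest-rank percentile over a window of latency samples."""
    if not latencies_ms:
        raise ValueError("no samples")
    ordered = sorted(latencies_ms)
    # Nearest-rank: ceil(p/100 * N), converted to a 0-based index
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

# Illustrative end-to-end latencies; two slow outliers dominate the tail
samples = [120.0, 250.0, 180.0, 3100.0, 210.0, 190.0, 240.0, 205.0, 2900.0, 230.0]
p50 = percentile(samples, 50)   # typical request
p95 = percentile(samples, 95)   # tail request
```

Note how different P50 and P95 are here: an Agent can feel fine "on average" while a meaningful slice of users waits seconds, which is exactly why the tail percentiles belong on the dashboard.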

Module 2: Error Handling

Error handling for Agents is more complex than traditional applications — an LLM's "error" might not be an exception, but rather incorrect output.

# AgentTrace and logger come from Module 1
import asyncio
import json
import time

import anthropic
from enum import Enum

class BudgetExceededError(Exception):
    """Raised when a single request exceeds its cost ceiling."""

class MaxRetriesExceededError(Exception):
    """Raised when all retry attempts are exhausted."""

class AgentError(Enum):
    LLM_API_ERROR = "llm_api_error"          # API call failure
    LLM_TIMEOUT = "llm_timeout"              # API timeout
    TOOL_ERROR = "tool_error"                # Tool execution failure
    PARSE_ERROR = "parse_error"              # Output parsing failure
    SAFETY_VIOLATION = "safety_violation"      # Safety rule triggered
    BUDGET_EXCEEDED = "budget_exceeded"        # Cost limit exceeded
    LOOP_DETECTED = "loop_detected"           # Loop detected

class ResilientAgent:
    """Agent with comprehensive error handling"""

    def __init__(self, max_retries: int = 3, max_cost_per_request: float = 0.50):
        self.client = anthropic.Anthropic()
        self.max_retries = max_retries
        self.max_cost = max_cost_per_request

    async def handle_request(self, message: str, user_id: str) -> str:
        trace = AgentTrace(start_time=time.time())

        try:
            return await self._execute_with_retry(message, user_id, trace)
        except Exception as e:
            logger.error(json.dumps({
                "event": "agent_error",
                "user_id": user_id,
                "error_type": type(e).__name__,
                "error_message": str(e),
                "trace": trace.finalize()
            }))
            # Graceful degradation: return a preset fallback response
            return self._fallback_response(e)

    async def _execute_with_retry(
        self, message: str, user_id: str, trace: AgentTrace
    ) -> str:
        last_error = None

        for attempt in range(self.max_retries):
            try:
                # Cost check
                if trace.total_cost >= self.max_cost:
                    raise BudgetExceededError(
                        f"Single request cost has reached ${trace.total_cost:.3f}, exceeding limit of ${self.max_cost}"
                    )

                response = await self._call_llm(message, trace)
                return response

            except anthropic.RateLimitError:
                # Exponential backoff
                wait_time = 2 ** attempt
                logger.warning(f"Rate limited, waiting {wait_time}s (attempt {attempt + 1})")
                await asyncio.sleep(wait_time)
                last_error = "rate_limit"

            except anthropic.APITimeoutError:
                logger.warning(f"API timeout (attempt {attempt + 1})")
                last_error = "timeout"

            except anthropic.APIError as e:
                if e.status_code >= 500:
                    # Server error, retry
                    logger.warning(f"Server error {e.status_code} (attempt {attempt + 1})")
                    last_error = f"server_error_{e.status_code}"
                else:
                    # Client error (400, 401, etc.), don't retry
                    raise

        raise MaxRetriesExceededError(f"Still failing after {self.max_retries} retries: {last_error}")

    def _fallback_response(self, error: Exception) -> str:
        """Return different fallback responses based on error type"""
        if isinstance(error, BudgetExceededError):
            return "Your question is quite complex. I've transferred you to a human agent for assistance."
        elif isinstance(error, MaxRetriesExceededError):
            return "The system is temporarily busy. Please try again shortly, or contact our support team."
        else:
            return "Sorry, we encountered an issue. I've logged your question, and our support team will follow up shortly."
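The LOOP_DETECTED error type above is declared but never raised; a minimal detection sketch is to fingerprint recent tool calls and flag a repeat of the same call with identical input (the window and threshold values here are arbitrary assumptions, not tuned numbers):

```python
import json
from collections import deque

class LoopDetector:
    """Flags an Agent that keeps issuing the identical tool call.

    Keeps a sliding window of (tool_name, serialized_input) signatures;
    if the same signature appears `threshold` times within the window,
    we assume the Agent is stuck in a loop and should be stopped.
    """

    def __init__(self, window: int = 10, threshold: int = 3):
        self.recent: deque = deque(maxlen=window)
        self.threshold = threshold

    def record(self, tool_name: str, input_data: dict) -> bool:
        """Record one tool call; return True if a loop is detected."""
        # sort_keys makes the signature stable regardless of dict ordering
        signature = (tool_name, json.dumps(input_data, sort_keys=True))
        self.recent.append(signature)
        return self.recent.count(signature) >= self.threshold
```

When `record` returns True, the Agent loop can raise its LOOP_DETECTED error and fall through to the same fallback path as the budget ceiling.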

Module 3: Rate Limiting and Cost Control

from collections import defaultdict
import time

class CostController:
    """Agent cost controller"""

    def __init__(self):
        # Per-user limits
        self.user_daily_cost: dict[str, float] = defaultdict(float)
        self.user_daily_requests: dict[str, int] = defaultdict(int)
        # Global limits
        self.global_hourly_cost: float = 0.0
        self.last_hourly_reset: float = time.time()
        self.last_daily_reset: float = time.time()

    def check_limits(self, user_id: str) -> tuple[bool, str]:
        """Check whether limits have been exceeded"""
        self._maybe_reset()

        # User-level limits: max $2/day, 50 requests/day
        if self.user_daily_cost[user_id] >= 2.0:
            return False, "You've used up your daily quota"
        if self.user_daily_requests[user_id] >= 50:
            return False, "Request rate too high, please try again later"

        # Global limit: max $100/hour
        if self.global_hourly_cost >= 100.0:
            return False, "System is busy, please try again later"

        return True, ""

    def record_cost(self, user_id: str, cost: float):
        self.user_daily_cost[user_id] += cost
        self.user_daily_requests[user_id] += 1
        self.global_hourly_cost += cost

    def _maybe_reset(self):
        now = time.time()
        # Reset the global counter every hour
        if now - self.last_hourly_reset > 3600:
            self.global_hourly_cost = 0.0
            self.last_hourly_reset = now
        # Reset per-user daily counters every 24 hours
        if now - self.last_daily_reset > 86400:
            self.user_daily_cost.clear()
            self.user_daily_requests.clear()
            self.last_daily_reset = now
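The controller above caps daily volume and hourly spend but not burst rate: a user can fire all 50 daily requests in one second. As a hedged complement, a token-bucket limiter (the parameters here are illustrative, not from the original setup) smooths per-user bursts:

```python
import time

class TokenBucket:
    """Simple token bucket: allows short bursts up to `capacity` requests,
    refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A bucket with `rate=1.0, capacity=5.0` per user would admit a burst of five requests, then throttle to one per second, independently of the daily quota.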

Module 4: Security Hardening

import re

class SecurityLayer:
    """Agent security layer"""

    # PII detection regex patterns
    PII_PATTERNS = {
        "phone": r"1[3-9]\d{9}",
        "id_card": r"\d{17}[\dXx]",
        "email": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
        "credit_card": r"\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}",
    }

    @staticmethod
    def sanitize_output(text: str) -> str:
        """Redact PII from Agent output"""
        for pii_type, pattern in SecurityLayer.PII_PATTERNS.items():
            text = re.sub(pattern, f"[{pii_type}_REDACTED]", text)
        return text

    @staticmethod
    def detect_prompt_injection(user_input: str) -> bool:
        """Detect prompt injection attacks"""
        injection_patterns = [
            r"ignore\s+(previous|above|all)\s+(instructions|prompts)",
            r"system\s*prompt",
            r"IMPORTANT:\s*NEW\s*INSTRUCTIONS",
        ]
        for pattern in injection_patterns:
            if re.search(pattern, user_input, re.IGNORECASE):
                return True
        return False

    @staticmethod
    def validate_tool_output(tool_name: str, output: str) -> str:
        """Validate tool output to prevent data leaks"""
        # Remove potential internal URLs
        output = re.sub(r"https?://internal\.[^\s]+", "[INTERNAL_URL_REDACTED]", output)
        # Remove database connection strings
        output = re.sub(r"(postgresql|mysql|mongodb)://[^\s]+", "[DB_URL_REDACTED]", output)
        return output
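One caveat: the credit_card pattern above also matches arbitrary 16-digit numbers (order IDs, tracking numbers). A standard Luhn checksum, sketched below, can confirm a candidate before redacting; the 12-digit minimum is an assumption to skip obviously short matches:

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum: doubles every second digit from the right
    and checks that the total is divisible by 10."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 12:  # assumption: too short to be a card number
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

Running regex matches through `luhn_valid` before redaction keeps legitimate 16-digit identifiers readable while still masking real card numbers.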

Module 5: Monitoring and Alerts

# Monitoring metrics and alert rules (conceptual code; use Prometheus + Grafana in practice)

ALERT_RULES = {
    "high_error_rate": {
        "condition": "error_rate_5min > 5%",
        "severity": "critical",
        "action": "Notify on-call engineer, auto-switch to fallback mode"
    },
    "cost_spike": {
        "condition": "hourly_cost > 2x historical average",
        "severity": "warning",
        "action": "Notify owner, check for abnormal traffic"
    },
    "high_latency": {
        "condition": "p95_latency > 10s",
        "severity": "warning",
        "action": "Check LLM API status, consider downgrading to a faster model"
    },
    "low_satisfaction": {
        "condition": "csat_daily < 3.5/5",
        "severity": "warning",
        "action": "Pull low-score conversation logs, analyze root causes"
    },
    "api_quota_approaching": {
        "condition": "daily_tokens > 80% of quota",
        "severity": "info",
        "action": "Prepare to switch to Batch API or downgrade model"
    }
}
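As a minimal sketch of how the high_error_rate rule could be evaluated without external tooling (the sliding-window bookkeeping here is a simplified assumption; Prometheus handles this for you in practice):

```python
import time
from collections import deque

class ErrorRateMonitor:
    """Tracks (timestamp, ok) outcomes and computes the error rate
    over the trailing `window_seconds`."""

    def __init__(self, window_seconds: float = 300.0, threshold: float = 0.05):
        self.window = window_seconds
        self.threshold = threshold
        self.outcomes: deque = deque()

    def record(self, ok: bool, now: float = None):
        self.outcomes.append((now if now is not None else time.time(), ok))

    def should_alert(self, now: float = None) -> bool:
        now = now if now is not None else time.time()
        # Drop outcomes that have fallen out of the window
        while self.outcomes and now - self.outcomes[0][0] > self.window:
            self.outcomes.popleft()
        if not self.outcomes:
            return False
        errors = sum(1 for _, ok in self.outcomes if not ok)
        return errors / len(self.outcomes) > self.threshold
```

When `should_alert` fires, the rule's action kicks in: page the on-call engineer and flip the Agent into its fallback mode.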

Module 6: Pre-Launch Checklist

## Pre-Launch Must-Haves

### Infrastructure
- [ ] Logging system configured, structured logs are searchable
- [ ] Monitoring dashboard built
- [ ] Alert rules configured and tested
- [ ] Error handling and fallback responses cover all exception paths

### Security
- [ ] PII redaction logic implemented
- [ ] Prompt injection detection deployed
- [ ] API keys use environment variables or a secret manager, never hardcoded
- [ ] Database accounts used by the Agent are read-only with minimum permissions
- [ ] Tool outputs are validated and sanitized

### Cost
- [ ] User-level and global-level cost limits configured
- [ ] Max tokens per request set
- [ ] Cost monitoring dashboard available
- [ ] Batch API fallback plan ready

### Testing
- [ ] Unit tests 100% passing
- [ ] Integration tests cover all tool call paths
- [ ] LLM-as-Judge evaluation pass rate > 85%
- [ ] Regression test suite can be run automatically
- [ ] Edge case tests (empty input, extra-long input, malicious input)

### Operations
- [ ] Canary release strategy defined (start with 5% traffic)
- [ ] Rollback plan verified
- [ ] On-call rotation and escalation paths documented
- [ ] Knowledge base update process established

Lessons from the Field

Production Data

For the enterprise Agent launches I led, here's the before-and-after comparison:

| Metric | Without Checklist (V1) | With Checklist (V2) |
| --- | --- | --- |
| Incidents in first week | 7 | 0 |
| Avg time to detect failure | 4.5 hours | 3 minutes (alert triggered) |
| Avg time to recover | 2 hours | 15 minutes (rollback plan ready) |
| Monthly cost over budget | Yes (2.3x) | No (within ±10%) |
| Security incidents | 1 (PII leak) | 0 |

Pitfalls We Hit

Pitfall 1: Too many logs is the same as no logs. Initially I logged the full prompt and response for every LLM call — 50GB of logs per day. Searching for issues was impossible. Solution: For normal requests, log only metadata (trace_id, token count, latency, tool calls). Log full content only on errors.

Pitfall 2: Canary releases take an extra week compared to going all-in, but they're worth it. We first routed 5% of traffic to the new Agent, observed for a week, then expanded to 20%, then another week before going to 100%. During the 5% phase we caught a timezone handling bug — had we gone 100% from the start, every user would have been affected.

Pitfall 3: Cost controls must work at the request level, not the monthly level. If an Agent enters a tool call death loop, a single request can burn tens of thousands of tokens (several dollars). Monthly cost limits are useless here — you need a per-request ceiling.

Conclusion

Three core takeaways:

  1. Observability is the #1 priority — You can get by without the best model or the most perfect prompt, but you cannot get by without logging and monitoring. When something goes wrong, you need to know within 5 minutes what happened, how many users were affected, and what the root cause is.

  2. Security and cost control are not "add later" items — PII redaction, prompt injection protection, and cost limits must be in place before launch. The first security incident in production might be the last — you'll lose your customer's trust.

  3. Checklists beat experience — Pilots have pre-flight checklists, surgeons have surgical checklists, and Agent launches need checklists too. Not because you're not smart enough, but because there's too much to check and human attention is limited. Bake this checklist into your release process.

If you're preparing to launch an Agent, I suggest going through this checklist first, flagging the items you haven't done yet, and tackling them by priority. Every red-flagged item must be completed before going live.

What surprises have you encountered during Agent launches? I'd love to hear your stories.