From Prototype to Production — An Enterprise AI Agent Launch Checklist

Opening
I've seen too many Agent projects die on the road from prototype to production. Not because the model was bad or features were missing, but because they lacked the "boring but critical" infrastructure: logging, monitoring, rate limiting, error handling, security auditing. Closing the gap between an Agent that runs in a Jupyter notebook and one that runs reliably in production takes at least 40% of the total effort. This article is the complete checklist I compiled from helping 3 enterprise clients launch their Agents.
Problem Background
The core differences between prototype and production:
| Dimension | Prototype | Production |
|---|---|---|
| Users | Just you | Hundreds to tens of thousands |
| Error tolerance | Wrong? Just rerun | Wrong? Lose customers |
| Uptime | Run once | 24/7 |
| Data sensitivity | Test data | Real user data |
| Cost control | Doesn't matter | Every cent counts |
| Observability | Print debugging | Full tracing |
Enterprise clients care most about three questions: How will I know if this Agent breaks? Will it leak data? Can I keep monthly costs within budget?
The Launch Checklist
I've organized the checklist into 6 modules, ordered by priority.
Module 1: Observability
This is the most important module. An Agent whose internal state you can't see is no different from a black box.
```python
import json
import logging
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from uuid import uuid4

# Structured logging configuration.
# Note: messages are expected to already be JSON (we log via json.dumps).
logging.basicConfig(
    format='{"timestamp":"%(asctime)s","level":"%(levelname)s","message":%(message)s}',
    level=logging.INFO
)
logger = logging.getLogger("agent")

@dataclass
class AgentTrace:
    """Complete trace record of an Agent execution"""
    trace_id: str = field(default_factory=lambda: str(uuid4()))
    steps: list[dict] = field(default_factory=list)
    total_tokens: int = 0
    total_cost: float = 0.0
    start_time: float = 0.0

    def add_step(self, step_type: str, **kwargs):
        self.steps.append({
            "type": step_type,
            "timestamp": time.time(),
            **kwargs
        })

    def log_llm_call(self, model: str, input_tokens: int, output_tokens: int):
        """Record token consumption for each LLM call"""
        cost = self._calculate_cost(model, input_tokens, output_tokens)
        self.total_tokens += input_tokens + output_tokens
        self.total_cost += cost
        self.add_step(
            "llm_call",
            model=model,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            cost=cost
        )

    def log_tool_call(self, tool_name: str, input_data: dict, output: str, latency_ms: float):
        """Record tool calls"""
        self.add_step(
            "tool_call",
            tool=tool_name,
            input=input_data,
            output_length=len(output),
            latency_ms=latency_ms
        )

    def finalize(self) -> dict:
        """Generate the final trace report"""
        elapsed = time.time() - self.start_time
        return {
            "trace_id": self.trace_id,
            "total_steps": len(self.steps),
            "total_tokens": self.total_tokens,
            "total_cost": self.total_cost,
            "total_time_seconds": elapsed,
            "steps": self.steps
        }

    @staticmethod
    def _calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
        # Claude API pricing (USD per million tokens) as of March 2026
        pricing = {
            "claude-sonnet-4-5-20250514": {"input": 3.0, "output": 15.0},
            "claude-haiku-4-5-20250514": {"input": 1.0, "output": 5.0},
        }
        rates = pricing.get(model, {"input": 3.0, "output": 15.0})
        return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000

# Usage
@contextmanager
def traced_agent_call(user_id: str):
    """Context manager: automatically trace Agent calls"""
    trace = AgentTrace(start_time=time.time())
    try:
        yield trace
    finally:
        report = trace.finalize()
        logger.info(json.dumps({
            "event": "agent_call_complete",
            "user_id": user_id,
            **report
        }))
        # Write to your monitoring system (Prometheus/DataDog/custom), e.g.:
        # metrics.record_agent_call(report)
```
Metrics you must monitor:
- Token consumption and cost per request
- End-to-end latency (P50, P95, P99)
- Tool call success rate and latency
- LLM API error rate
- Human escalation rate
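You don't need a metrics platform to get started with the latency percentiles; a nearest-rank computation over a window of recent request latencies is enough. A minimal sketch (the sample values are illustrative):

```python
import math

def percentile(sorted_values: list[float], p: float) -> float:
    """Nearest-rank percentile over an already-sorted list of samples."""
    if not sorted_values:
        return 0.0
    # Rank of the p-th percentile, clamped into valid index range
    k = math.ceil(p / 100 * len(sorted_values)) - 1
    return sorted_values[max(0, min(k, len(sorted_values) - 1))]

latencies_ms = sorted([120, 95, 400, 110, 2300, 130, 105, 98, 150, 170])
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95))  # 120 2300
```

Note how a single slow outlier (2300 ms) dominates P95 while leaving P50 untouched; that is exactly why you track both.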
Module 2: Error Handling
Error handling for Agents is more complex than in traditional applications — an LLM's "error" might not be an exception at all, but rather incorrect output.
```python
import asyncio
import json
import time
from enum import Enum

import anthropic

# AgentTrace and logger come from the observability module above.

class AgentError(Enum):
    LLM_API_ERROR = "llm_api_error"        # API call failure
    LLM_TIMEOUT = "llm_timeout"            # API timeout
    TOOL_ERROR = "tool_error"              # Tool execution failure
    PARSE_ERROR = "parse_error"            # Output parsing failure
    SAFETY_VIOLATION = "safety_violation"  # Safety rule triggered
    BUDGET_EXCEEDED = "budget_exceeded"    # Cost limit exceeded
    LOOP_DETECTED = "loop_detected"        # Loop detected

class BudgetExceededError(Exception):
    pass

class MaxRetriesExceededError(Exception):
    pass

class ResilientAgent:
    """Agent with comprehensive error handling"""

    def __init__(self, max_retries: int = 3, max_cost_per_request: float = 0.50):
        self.client = anthropic.Anthropic()
        self.max_retries = max_retries
        self.max_cost = max_cost_per_request

    async def handle_request(self, message: str, user_id: str) -> str:
        trace = AgentTrace(start_time=time.time())
        try:
            return await self._execute_with_retry(message, user_id, trace)
        except Exception as e:
            logger.error(json.dumps({
                "event": "agent_error",
                "user_id": user_id,
                "error_type": type(e).__name__,
                "error_message": str(e),
                "trace": trace.finalize()
            }))
            # Graceful degradation: return a preset fallback response
            return self._fallback_response(e)

    async def _execute_with_retry(
        self, message: str, user_id: str, trace: AgentTrace
    ) -> str:
        last_error = None
        for attempt in range(self.max_retries):
            try:
                # Cost check
                if trace.total_cost >= self.max_cost:
                    raise BudgetExceededError(
                        f"Single request cost has reached ${trace.total_cost:.3f}, "
                        f"exceeding limit of ${self.max_cost}"
                    )
                return await self._call_llm(message, trace)
            except anthropic.RateLimitError:
                # Exponential backoff
                wait_time = 2 ** attempt
                logger.warning(f"Rate limited, waiting {wait_time}s (attempt {attempt + 1})")
                await asyncio.sleep(wait_time)
                last_error = "rate_limit"
            except anthropic.APITimeoutError:
                logger.warning(f"API timeout (attempt {attempt + 1})")
                last_error = "timeout"
            except anthropic.APIStatusError as e:
                if e.status_code >= 500:
                    # Server error, retry
                    logger.warning(f"Server error {e.status_code} (attempt {attempt + 1})")
                    last_error = f"server_error_{e.status_code}"
                else:
                    # Client error (400, 401, etc.), don't retry
                    raise
        raise MaxRetriesExceededError(f"Still failing after {self.max_retries} retries: {last_error}")

    async def _call_llm(self, message: str, trace: AgentTrace) -> str:
        # Issue the API call and record tokens on the trace (omitted for brevity)
        ...

    def _fallback_response(self, error: Exception) -> str:
        """Return different fallback responses based on error type"""
        if isinstance(error, BudgetExceededError):
            return "Your question is quite complex. I've transferred you to a human agent for assistance."
        elif isinstance(error, MaxRetriesExceededError):
            return "The system is temporarily busy. Please try again shortly, or contact our support team."
        else:
            return "Sorry, we encountered an issue. I've logged your question, and our support team will follow up shortly."
```
Module 3: Rate Limiting and Cost Control
```python
import time
from collections import defaultdict

class CostController:
    """Agent cost controller"""

    def __init__(self):
        # Per-user limits
        self.user_daily_cost: dict[str, float] = defaultdict(float)
        self.user_daily_requests: dict[str, int] = defaultdict(int)
        # Global limits
        self.global_hourly_cost: float = 0.0
        self.last_hourly_reset: float = time.time()
        self.last_daily_reset: float = time.time()

    def check_limits(self, user_id: str) -> tuple[bool, str]:
        """Check whether limits have been exceeded"""
        self._maybe_reset()
        # User-level limit: max $2/day, 50 requests/day
        if self.user_daily_cost[user_id] >= 2.0:
            return False, "You've used up your daily quota"
        if self.user_daily_requests[user_id] >= 50:
            return False, "Request rate too high, please try again later"
        # Global limit: max $100/hour
        if self.global_hourly_cost >= 100.0:
            return False, "System is busy, please try again later"
        return True, ""

    def record_cost(self, user_id: str, cost: float):
        self.user_daily_cost[user_id] += cost
        self.user_daily_requests[user_id] += 1
        self.global_hourly_cost += cost

    def _maybe_reset(self):
        now = time.time()
        # Reset the global counter every hour
        if now - self.last_hourly_reset > 3600:
            self.global_hourly_cost = 0.0
            self.last_hourly_reset = now
        # Reset the per-user counters every 24 hours
        if now - self.last_daily_reset > 86400:
            self.user_daily_cost.clear()
            self.user_daily_requests.clear()
            self.last_daily_reset = now
```
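The CostController caps spend, but request-rate limiting is a separate concern; the usual tool is a token bucket, which allows short bursts while enforcing a steady average rate. A minimal, self-contained sketch (the capacity and refill rate are illustrative values):

```python
import time

class TokenBucket:
    """Allow short bursts while enforcing a steady average request rate."""

    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = capacity  # start full: a fresh user may burst
        self.refill_per_sec = refill_per_sec
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(
            self.capacity,
            self.tokens + (now - self.last_refill) * self.refill_per_sec,
        )
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(capacity=5, refill_per_sec=1.0)  # burst of 5, then ~1 req/s
results = [bucket.allow() for _ in range(7)]
print(results)  # first 5 allowed, the rest denied until the bucket refills
```

In practice you would keep one bucket per user (e.g. in a dict keyed by user_id) next to the cost counters above.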
Module 4: Security Hardening
```python
import re

class SecurityLayer:
    """Agent security layer"""

    # PII detection regex patterns
    # (the phone and ID-card formats here are mainland-China style)
    PII_PATTERNS = {
        "phone": r"1[3-9]\d{9}",
        "id_card": r"\d{17}[\dXx]",
        "email": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
        "credit_card": r"\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}",
    }

    @staticmethod
    def sanitize_output(text: str) -> str:
        """Redact PII from Agent output"""
        for pii_type, pattern in SecurityLayer.PII_PATTERNS.items():
            text = re.sub(pattern, f"[{pii_type}_REDACTED]", text)
        return text

    @staticmethod
    def detect_prompt_injection(user_input: str) -> bool:
        """Detect common prompt-injection phrasings (a heuristic, not a guarantee)"""
        injection_patterns = [
            r"ignore\s+(previous|above|all)\s+(instructions|prompts)",
            r"system\s*prompt",
            r"IMPORTANT:\s*NEW\s*INSTRUCTIONS",
        ]
        return any(re.search(p, user_input, re.IGNORECASE) for p in injection_patterns)

    @staticmethod
    def validate_tool_output(tool_name: str, output: str) -> str:
        """Validate tool output to prevent data leaks"""
        # Remove potential internal URLs
        output = re.sub(r"https?://internal\.[^\s]+", "[INTERNAL_URL_REDACTED]", output)
        # Remove database connection strings
        output = re.sub(r"(postgresql|mysql|mongodb)://[^\s]+", "[DB_URL_REDACTED]", output)
        return output
```
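It's worth unit-testing the redaction patterns against realistic strings before launch. A self-contained check using the same email and credit-card patterns as above:

```python
import re

PII_PATTERNS = {
    "email": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
    "credit_card": r"\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}",
}

def sanitize(text: str) -> str:
    # Apply each pattern in turn, replacing matches with a typed placeholder
    for pii_type, pattern in PII_PATTERNS.items():
        text = re.sub(pattern, f"[{pii_type}_REDACTED]", text)
    return text

msg = "Contact alice@example.com, card 4111-1111-1111-1111."
print(sanitize(msg))
# Contact [email_REDACTED], card [credit_card_REDACTED].
```

Regex-based redaction is a floor, not a ceiling: it will miss obfuscated formats, so treat it as one layer alongside access controls on what data the Agent can see at all.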
Module 5: Monitoring and Alerts
```python
# Monitoring metrics and alert rules (conceptual code; use Prometheus + Grafana in practice)
ALERT_RULES = {
    "high_error_rate": {
        "condition": "error_rate_5min > 5%",
        "severity": "critical",
        "action": "Notify on-call engineer, auto-switch to fallback mode"
    },
    "cost_spike": {
        "condition": "hourly_cost > 2x historical average",
        "severity": "warning",
        "action": "Notify owner, check for abnormal traffic"
    },
    "high_latency": {
        "condition": "p95_latency > 10s",
        "severity": "warning",
        "action": "Check LLM API status, consider downgrading to a faster model"
    },
    "low_satisfaction": {
        "condition": "csat_daily < 3.5/5",
        "severity": "warning",
        "action": "Pull low-score conversation logs, analyze root causes"
    },
    "api_quota_approaching": {
        "condition": "daily_tokens > 80% of quota",
        "severity": "info",
        "action": "Prepare to switch to Batch API or downgrade model"
    }
}
```
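If you aren't on Prometheus yet, the same rules can start life as plain threshold checks run on a schedule. A minimal sketch (the metric names mirror the conditions above; the thresholds are the same illustrative values):

```python
def evaluate_alerts(metrics: dict[str, float]) -> list[str]:
    """Return the names of alert rules whose threshold condition fires."""
    checks = {
        "high_error_rate": metrics.get("error_rate_5min", 0.0) > 0.05,
        "high_latency": metrics.get("p95_latency_s", 0.0) > 10.0,
        # Spike = more than 2x the historical hourly average
        "cost_spike": metrics.get("hourly_cost", 0.0)
                      > 2 * metrics.get("hourly_cost_avg", float("inf")),
        "low_satisfaction": metrics.get("csat_daily", 5.0) < 3.5,
    }
    return [name for name, fired in checks.items() if fired]

fired = evaluate_alerts({
    "error_rate_5min": 0.08,   # 8% errors in the last 5 minutes
    "p95_latency_s": 4.2,
    "hourly_cost": 50.0,
    "hourly_cost_avg": 40.0,
    "csat_daily": 4.1,
})
print(fired)  # ['high_error_rate']
```

Wire the returned rule names into whatever notifier you already have (PagerDuty, Slack webhook); the point is that alerting logic exists on day one, not after the first outage.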
Module 6: Pre-Launch Checklist
## Pre-Launch Must-Haves
### Infrastructure
- [ ] Logging system configured, structured logs are searchable
- [ ] Monitoring dashboard built
- [ ] Alert rules configured and tested
- [ ] Error handling and fallback responses cover all exception paths
### Security
- [ ] PII redaction logic implemented
- [ ] Prompt injection detection deployed
- [ ] API keys use environment variables or a secret manager, never hardcoded
- [ ] Database accounts used by the Agent are read-only with minimum permissions
- [ ] Tool outputs are validated and sanitized
### Cost
- [ ] User-level and global-level cost limits configured
- [ ] Max tokens per request set
- [ ] Cost monitoring dashboard available
- [ ] Batch API fallback plan ready
### Testing
- [ ] Unit tests 100% passing
- [ ] Integration tests cover all tool call paths
- [ ] LLM-as-Judge evaluation pass rate > 85%
- [ ] Regression test suite can be run automatically
- [ ] Edge case tests (empty input, extra-long input, malicious input)
### Operations
- [ ] Canary release strategy defined (start with 5% traffic)
- [ ] Rollback plan verified
- [ ] On-call rotation and escalation paths documented
- [ ] Knowledge base update process established
Lessons from the Field
Production Data
For the enterprise Agent launches I led, here's the before-and-after comparison:
| Metric | Without Checklist (V1) | With Checklist (V2) |
|---|---|---|
| Incidents in first week | 7 | 0 |
| Avg time to detect failure | 4.5 hours | 3 minutes (alert triggered) |
| Avg time to recover | 2 hours | 15 minutes (rollback plan ready) |
| Monthly cost over budget | Yes (2.3x) | No (within ±10%) |
| Security incidents | 1 (PII leak) | 0 |
Pitfalls We Hit
Pitfall 1: Too many logs is the same as no logs. Initially I logged the full prompt and response for every LLM call — 50GB of logs per day. Searching for issues was impossible. Solution: For normal requests, log only metadata (trace_id, token count, latency, tool calls). Log full content only on errors.
Pitfall 2: Canary releases take an extra week compared to going all-in, but they're worth it. We first routed 5% of traffic to the new Agent, observed for a week, then expanded to 20%, then another week before going to 100%. During the 5% phase we caught a timezone handling bug — had we gone 100% from the start, every user would have been affected.
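The canary split needs to be a stable per-user assignment, not random sampling, so the same user always hits the same version across requests. A sketch using a hash bucket (the 5% threshold matches the rollout above):

```python
import hashlib

def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically bucket a user into [0, 100) and compare to the rollout percentage."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# The same user always lands in the same bucket, so their experience is consistent
print(in_canary("user-42", 5) == in_canary("user-42", 5))  # True

# Across many users, roughly `percent` of them fall into the canary group
share = sum(in_canary(f"user-{i}", 5) for i in range(10_000)) / 10_000
print(0.03 < share < 0.07)  # True
```

Expanding from 5% to 20% to 100% is then just changing one number, and no user flips back and forth between versions as you ramp up.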
Pitfall 3: Cost controls must work at the request level, not the monthly level. If an Agent enters a tool call death loop, a single request can burn tens of thousands of tokens (several dollars). Monthly cost limits are useless here — you need a per-request ceiling.
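The same failure mode motivates catching a tool-call death loop before it burns the per-request budget. A simple heuristic flags N identical consecutive tool calls over the trace steps; a sketch (the threshold of 3 is an arbitrary choice, and the step shape matches the `tool_call` records from Module 1):

```python
import json

def detect_tool_loop(steps: list[dict], threshold: int = 3) -> bool:
    """Flag when the last `threshold` tool calls are identical (same tool, same input)."""
    tool_calls = [s for s in steps if s.get("type") == "tool_call"]
    if len(tool_calls) < threshold:
        return False

    def key(call: dict) -> tuple:
        # Serialize the input with sorted keys so dict ordering doesn't matter
        return (call.get("tool"), json.dumps(call.get("input"), sort_keys=True))

    recent = tool_calls[-threshold:]
    return all(key(c) == key(recent[0]) for c in recent)

looping = [{"type": "tool_call", "tool": "search", "input": {"q": "refund"}}] * 3
print(detect_tool_loop(looping))  # True
```

When the check fires, abort the request and return a fallback (the `LOOP_DETECTED` error type from Module 2), instead of letting the per-request cost ceiling absorb the damage.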
Conclusion
Three core takeaways:
1. **Observability is the #1 priority.** You can get by without the best model or the most perfect prompt, but you cannot get by without logging and monitoring. When something goes wrong, you need to know within 5 minutes what happened, how many users were affected, and what the root cause is.
2. **Security and cost control are not "add later" items.** PII redaction, prompt injection protection, and cost limits must be in place before launch. The first security incident in production might be the last: you'll lose your customer's trust.
3. **Checklists beat experience.** Pilots have pre-flight checklists, surgeons have surgical checklists, and Agent launches need checklists too. Not because you're not smart enough, but because there's too much to check and human attention is limited. Bake this checklist into your release process.
If you're preparing to launch an Agent, I suggest going through this checklist first, flagging the items you haven't done yet, and tackling them by priority. Every red-flagged item must be completed before going live.
What surprises have you encountered during Agent launches? I'd love to hear your stories.