AI Agent Cost Estimation: Do the Math Before You Build

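Opening

Last year I built a document assistant Agent for a client. During prototyping it cost about $3 a day in API fees, and the client was happy. After launch, 2,000 users a day were using it, and the bill at the end of the month was $4,200. The client was stunned: he had assumed the monthly cost would be $90 ($3 x 30 days). Where did the gap come from? The prototype handled 30 requests a day and production handled 2,000, roughly 67 times the volume, and on top of that the prompts had grown longer and conversations had become multi-turn. The bill worked out to $140 a day, about 47 times the prototype's daily spend. I have seen this kind of blow-up too many times. Doing the math before you build is the most important engineering discipline in an Agent project.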

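Background

The cost structure of an Agent system differs from that of traditional software. Traditional software's compute costs (servers, databases) are relatively fixed: doubling the user count raises cost by roughly 50-100%. An Agent system's main cost is API calls, which scale linearly with request volume and token consumption. Double the users and the cost doubles; there are no economies of scale.

Worse, Agent systems carry a lot of "hidden token consumption":

The system prompt is sent with every request
Conversation history grows with every turn and is resent each time
Function definitions for tool calls take up tokens
Communication between agents in a Multi-Agent setup consumes extra tokens
RAG retrieval results stuffed into the prompt take up tokens

Together, these hidden costs often account for 40-60% of total tokens.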

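Core Framework: A Four-Step Cost Estimate

Step 1: Pin Down API Pricing as of March 2026

Start by laying out the current price list; every later calculation builds on it.

LLM API pricing (USD per 1M tokens)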

Model	Input	Output	Cached Input
GPT-4.1	$2.00	$8.00	$0.20
GPT-4.1-mini	$0.40	$1.60	$0.04
GPT-4o	$2.50	$10.00	$1.25
Claude Opus 4.6	$5.00	$25.00	$0.50
Claude Sonnet 4.5	$3.00	$15.00	$0.30
Claude Haiku 4.5	$1.00	$5.00	$0.10
Gemini 2.5 Flash	$0.30	$2.50	-
Gemini 2.5 Pro	$1.25	$10.00	-

Embedding pricing (USD per 1M tokens)

Model	Standard	Batch
text-embedding-3-small	$0.02	$0.01
text-embedding-3-large	$0.13	$0.065

Vector store pricing (billed monthly)

Service	Entry tier	Production
Qdrant Cloud	$25/month	$100-$800/month
Pinecone Serverless	Usage-based	$0.33/GB/month
Weaviate Cloud	From $45/month	From $280/month
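
To make the per-1M-token prices concrete, here is a minimal sketch that converts them into a per-request cost. The function name `request_cost` is my own, and the prices are copied from the LLM table above; treat them as a snapshot rather than a live quote.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in USD of one request, given prices quoted per 1M tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Example: 1,380 input and 400 output tokens on GPT-4.1 ($2.00 / $8.00 per 1M).
cost = request_cost(1_380, 400, 2.00, 8.00)
print(f"${cost:.5f} per request")
```

Output tokens are priced at four to five times the input rate for most rows in the table, so a short prompt with a long answer can cost more than a long prompt with a short answer.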

Step 2: Break Down Token Consumption

Token consumption per API request = system prompt + conversation history + RAG context + user input + tool definitions + model output.

from dataclasses import dataclass

@dataclass
class TokenBreakdown:
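    """Token breakdown for a single request"""
    system_prompt: int        # fixed overhead
    conversation_history: int  # grows with the number of turns
    rag_context: int          # RAG retrieval results
    user_input: int           # user message
    tool_definitions: int     # tool definitions
    model_output: int         # model generation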

    @property
    def total_input(self) -> int:
        return (self.system_prompt + self.conversation_history +
                self.rag_context + self.user_input + self.tool_definitions)

    @property
    def total(self) -> int:
        return self.total_input + self.model_output

# Token breakdowns for typical scenarios
SCENARIOS = {
    "simple_qa": TokenBreakdown(
        system_prompt=300,
        conversation_history=0,       # single-turn conversation
        rag_context=0,                # no RAG
        user_input=50,
        tool_definitions=0,
        model_output=200,
    ),  # total: 550 tokens

    "rag_qa": TokenBreakdown(
        system_prompt=500,
        conversation_history=0,
        rag_context=800,              # 3 retrieved results
        user_input=80,
        tool_definitions=0,
        model_output=400,
    ),  # total: 1,780 tokens

    "multi_turn_agent": TokenBreakdown(
        system_prompt=800,
        conversation_history=2000,     # 5 turns of history
        rag_context=600,
        user_input=100,
        tool_definitions=1200,         # 5 tool definitions
        model_output=500,
    ),  # total: 5,200 tokens

    "multi_agent_pipeline": TokenBreakdown(
        system_prompt=1500,            # prompts for 3 Agents
        conversation_history=1000,
        rag_context=1200,
        user_input=100,
        tool_definitions=800,
        model_output=1500,             # output from 3 Agents
    ),  # total: 6,100 tokens
}
}
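
A breakdown like this is most useful when it tells you where to optimize first. The sketch below ranks the components of one scenario by their share of total tokens. It uses a plain dict instead of `TokenBreakdown` so that it runs on its own, and `component_shares` is a helper name I made up for illustration.

```python
def component_shares(breakdown: dict) -> list:
    """Return (component, share of total tokens) pairs, largest first."""
    total = sum(breakdown.values())
    shares = [(name, tokens / total) for name, tokens in breakdown.items()]
    return sorted(shares, key=lambda item: item[1], reverse=True)

# Same numbers as the multi_turn_agent scenario above.
multi_turn_agent = {
    "system_prompt": 800,
    "conversation_history": 2000,
    "rag_context": 600,
    "user_input": 100,
    "tool_definitions": 1200,
    "model_output": 500,
}

shares = component_shares(multi_turn_agent)
for name, share in shares:
    print(f"{name:<22}{share:6.1%}")
```

For this scenario conversation history is the largest slice (about 38%), followed by tool definitions (about 23%), so trimming history and pruning unused tools would be the first things to try.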

Step 3: Calculate Monthly Cost

@dataclass
class CostEstimate:
    """Monthly cost estimate"""
    daily_requests: int
    tokens_per_request: TokenBreakdown
    model: str
    rag_queries_per_request: float = 0   # RAG queries per request
    embedding_tokens_per_query: int = 50  # tokens per embedding call

    def monthly_cost(self) -> dict:
        """Compute an itemized monthly cost"""
        # API price table (USD per token)
        PRICING = {
            "gpt-4.1": {"input": 2.0e-6, "output": 8.0e-6},
            "gpt-4.1-mini": {"input": 0.4e-6, "output": 1.6e-6},
            "claude-sonnet-4-5": {"input": 3.0e-6, "output": 15.0e-6},
            "claude-haiku-4-5": {"input": 1.0e-6, "output": 5.0e-6},
            "gemini-2.5-flash": {"input": 0.3e-6, "output": 2.5e-6},
        }
        pricing = PRICING[self.model]
        monthly_requests = self.daily_requests * 30
        tp = self.tokens_per_request

        # LLM cost
        input_cost = tp.total_input * pricing["input"] * monthly_requests
        output_cost = tp.model_output * pricing["output"] * monthly_requests

        # Embedding cost
        embedding_cost = (
            self.rag_queries_per_request *
            self.embedding_tokens_per_query *
            0.02e-6 *  # text-embedding-3-small
            monthly_requests
        )

        # Vector store cost (fixed; $45 matches the Weaviate Cloud entry tier)
        vector_store_cost = 45.0 if self.rag_queries_per_request > 0 else 0

        total = input_cost + output_cost + embedding_cost + vector_store_cost
        return {
            "llm_input": round(input_cost, 2),
            "llm_output": round(output_cost, 2),
            "embedding": round(embedding_cost, 2),
            "vector_store": vector_store_cost,
            "total": round(total, 2),
            "per_request": round(total / monthly_requests, 5),
        }

# Scenario estimates
estimates = {
    "Simple Q&A (500 requests/day, GPT-4.1-mini)": CostEstimate(
        daily_requests=500,
        tokens_per_request=SCENARIOS["simple_qa"],
        model="gpt-4.1-mini",
    ),
    "RAG Q&A (1000 requests/day, GPT-4.1)": CostEstimate(
        daily_requests=1000,
        tokens_per_request=SCENARIOS["rag_qa"],
        model="gpt-4.1",
        rag_queries_per_request=1,
    ),
    "Multi-turn Agent (500 requests/day, Claude Sonnet)": CostEstimate(
        daily_requests=500,
        tokens_per_request=SCENARIOS["multi_turn_agent"],
        model="claude-sonnet-4-5",
    ),
}

for name, est in estimates.items():
    result = est.monthly_cost()
    print(f"\n{name}")
    print(f"  Monthly cost: ${result['total']:.2f}")
    print(f"  Per request: ${result['per_request']:.5f}")

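Actual output for the three scripted scenarios (the RAG Q&A figure includes the $45 fixed vector store cost; the Multi-Agent pipeline row uses mixed models and is not produced by the script above):

Scenario	Model	Daily requests	Monthly cost	Cost per request
Simple Q&A	GPT-4.1-mini	500	$6.90	$0.00046
RAG Q&A	GPT-4.1	1,000	$223.83	$0.00746
Multi-turn Agent	Claude Sonnet 4.5	500	$324.00	$0.02160
Multi-Agent pipeline	Mixed models	200	$92.40	$0.0154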
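
Model choice moves these numbers more than almost anything else. As a rough sketch, the snippet below prices the same rag_qa scenario (1,380 input and 400 output tokens, 1,000 requests a day) on each model from the Step 1 table. It covers LLM cost only, with no embedding or vector store, and the names `PRICES_PER_M` and `monthly_llm_cost` are mine.

```python
# Prices in USD per 1M tokens as (input, output), copied from the Step 1 table.
PRICES_PER_M = {
    "gpt-4.1": (2.00, 8.00),
    "gpt-4.1-mini": (0.40, 1.60),
    "claude-sonnet-4-5": (3.00, 15.00),
    "claude-haiku-4-5": (1.00, 5.00),
    "gemini-2.5-flash": (0.30, 2.50),
}

def monthly_llm_cost(model, input_tokens, output_tokens, daily_requests, days=30):
    """LLM-only monthly cost in USD for one model."""
    input_price, output_price = PRICES_PER_M[model]
    per_request = (input_tokens * input_price
                   + output_tokens * output_price) / 1_000_000
    return per_request * daily_requests * days

comparison = {
    model: round(monthly_llm_cost(model, 1_380, 400, 1_000), 2)
    for model in PRICES_PER_M
}
# Print cheapest first.
for model, cost in sorted(comparison.items(), key=lambda kv: kv[1]):
    print(f"{model:<20} ${cost:>8.2f}")
```

With these prices the same workload ranges from about $36 a month on GPT-4.1-mini to about $304 on Claude Sonnet 4.5, which is why it pays to test a cheaper model for quality before defaulting to the strongest one.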

Step 4: Add a Buffer and Hidden Costs

Multiply the estimate by a 1.5x buffer, because real-world operation always brings consumption you did not plan for:

def realistic_monthly_budget(estimated_cost: float) -> dict:
    """Realistic budget = estimate x 1.5"""
    return {
        "estimated": estimated_cost,
        "buffer_15x": round(estimated_cost * 1.5, 2),             # estimate with the 1.5x buffer applied
        "hidden_costs": {
            "retry_overhead": round(estimated_cost * 0.08, 2),    # retries add about 8%
            "monitoring_tools": 20.0,                              # logging and monitoring tools
            "dev_testing": round(estimated_cost * 0.2, 2),        # development and testing, about 20%
            "prompt_iteration": round(estimated_cost * 0.1, 2),   # prompt iteration experiments
        },
        "recommended_budget": round(estimated_cost * 1.5, 2),
    }

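Lessons from Practice

My Real Cost Data

Actual monthly costs of three Agent systems I run in production:

System	Daily requests	Model mix	Estimated monthly cost	Actual monthly cost	Deviation
Community management (8 Agents)	220	mini + Sonnet + embedding	$35	$42.80	+22%
Content pipeline (6 Agents)	50	GPT-4.1 + Sonnet	$28	$34.50	+23%
Customer analysis (3 Agents)	150	GPT-4.1 + mini	$18	$22.10	+23%

All three came in 22-23% over estimate. The main causes: retries, prompts growing longer (new instructions added), and occasional long conversations. So my default buffer is now 1.3-1.5x.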
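
If you track your own estimates against actual bills, the buffer can come from data instead of a guess. Here is a small sketch using the three systems from the table above; `overrun_ratio` is a helper name I introduced, and taking the worst overrun as the floor for the buffer is one reasonable rule, not the only one.

```python
# (system, estimated USD, actual USD) from the table above.
SYSTEMS = [
    ("Community management", 35.00, 42.80),
    ("Content pipeline", 28.00, 34.50),
    ("Customer analysis", 18.00, 22.10),
]

def overrun_ratio(estimated: float, actual: float) -> float:
    """Actual spend divided by the estimate; 1.0 means the estimate was exact."""
    return actual / estimated

ratios = [overrun_ratio(est, act) for _, est, act in SYSTEMS]
for (name, _, _), ratio in zip(SYSTEMS, ratios):
    print(f"{name:<22} {ratio:.2f}x ({ratio - 1:+.0%})")

# A buffer has to be at least as large as the worst overrun seen so far.
min_buffer = max(ratios)
print(f"Smallest buffer that would have covered all three: {min_buffer:.2f}x")
```

The worst overrun here is about 1.23x, so a 1.3x buffer leaves a little headroom and 1.5x leaves a lot. With only three data points, the larger value is the safer default.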

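Five Cost Optimization Tips

Tip 1: Model Tiering

Use different models for different tasks instead of sending everything to the most expensive one.

MODEL_ROUTING = {
    "classification": "gpt-4.1-mini",    # classification: $0.40/M input, good enough
    "extraction": "gpt-4.1-mini",        # extraction: $0.40/M input, good enough
    "generation": "gpt-4.1",             # generation: $2.00/M input, quality matters
    "reasoning": "claude-sonnet-4-5",    # reasoning: $3.00/M input, needs a strong model
    "summarization": "gemini-2.5-flash", # summarization: $0.30/M input, the cheapest
}

After I introduced model tiering in my community system, overall cost dropped 45% with almost no loss in quality.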
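
How much tiering saves depends on how your tokens are spread across task types. The sketch below computes a blended input price for a routing table like the one above. The `TOKEN_MIX` shares are hypothetical numbers I picked for illustration, not measurements from the community system, so the saving it prints will not match the 45% figure.

```python
INPUT_PRICE_PER_M = {            # USD per 1M input tokens, from Step 1
    "gpt-4.1-mini": 0.40,
    "gpt-4.1": 2.00,
    "claude-sonnet-4-5": 3.00,
    "gemini-2.5-flash": 0.30,
}
ROUTING = {
    "classification": "gpt-4.1-mini",
    "extraction": "gpt-4.1-mini",
    "generation": "gpt-4.1",
    "reasoning": "claude-sonnet-4-5",
    "summarization": "gemini-2.5-flash",
}
# Hypothetical share of input tokens per task type (sums to 1.0).
TOKEN_MIX = {
    "classification": 0.30,
    "extraction": 0.20,
    "generation": 0.25,
    "reasoning": 0.15,
    "summarization": 0.10,
}

def blended_input_price(routing: dict, mix: dict) -> float:
    """Average USD per 1M input tokens once each task goes to its routed model."""
    return sum(share * INPUT_PRICE_PER_M[routing[task]]
               for task, share in mix.items())

tiered = blended_input_price(ROUTING, TOKEN_MIX)
single = INPUT_PRICE_PER_M["claude-sonnet-4-5"]  # everything on the strongest model
print(f"Tiered: ${tiered:.2f}/M, single model: ${single:.2f}/M, "
      f"saving: {1 - tiered / single:.0%}")
```

Replace `TOKEN_MIX` with shares measured from your own logs before drawing conclusions. The sketch also ignores output tokens, which are priced separately.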

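Tip 2: Prompt Caching

Both Anthropic and OpenAI support prompt caching: when the same system prompt is reused, the portion that hits the cache is billed at a reduced rate. In the pricing table above, that is 1/10 of the input price for the Claude models and the GPT-4.1 family, and 1/2 for GPT-4o.

import anthropic

# OpenAI: cached input applies automatically
# Claude: you must mark the block with cache_control
# long_system_prompt and user_input are assumed to be defined elsewhere
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,                         # required by the Messages API
    system=[{
        "type": "text",
        "text": long_system_prompt,          # this part gets cached
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": user_input}],
)
# Illustrative arithmetic for a 500-token system prompt:
# no cache:  500 x $3/M   = $0.0015
# cache hit: 500 x $0.3/M = $0.00015 (90% cheaper)
# Note: providers only cache prefixes above a minimum length; check the current docs.

My system's cache hit rate is about 78%, and this alone cut input token cost by 30%.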
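
The 78% hit rate and the 30% saving are linked by how much of each request sits in the cached prefix. The sketch below makes that relationship explicit. `cache_saving` is an illustrative helper, the 43% cacheable fraction is an assumed value chosen to show how the two figures can be consistent, and cache-write surcharges are ignored.

```python
def cache_saving(hit_rate: float, full_price: float, cached_price: float,
                 cacheable_fraction: float = 1.0) -> float:
    """Fraction of total input cost saved by prompt caching.

    hit_rate:           share of cacheable tokens actually served from cache
    cacheable_fraction: share of all input tokens that sit in the cached prefix
    """
    discount = 1 - cached_price / full_price
    return cacheable_fraction * hit_rate * discount

# Claude Sonnet 4.5 prices from Step 1 ($3.00 full, $0.30 cached), 78% hit rate.
on_prefix = cache_saving(0.78, 3.00, 0.30)
print(f"Saving on the cached prefix: {on_prefix:.1%}")

# Assumed: the cached prefix makes up about 43% of all input tokens.
overall = cache_saving(0.78, 3.00, 0.30, cacheable_fraction=0.43)
print(f"Saving on total input cost:  {overall:.1%}")
```

Caching pays off most when the stable prefix (system prompt, tool definitions) is a large share of each request. If conversation history or RAG context dominates, the overall saving shrinks accordingly.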

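Tip 3: Control Conversation Length

The cost of a multi-turn conversation grows faster than linearly. Every additional turn resends the entire history, so cumulative input tokens grow roughly with the square of the number of turns.

Turns	Cumulative input tokens	Cost (GPT-4.1)
1	800	$0.0016
3	3,200	$0.0064
5	6,800	$0.0136
10	18,000	$0.0360

A 10-turn conversation costs 22.5 times as much as a single turn. The fix: use the short-term memory compression covered in the previous article to compress the history into a summary, which can cut the input for a 10-turn conversation from 18,000 to 4,500 tokens.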
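
The growth pattern is easy to see in a toy model. The sketch below assumes 800 tokens on the first turn and 250 tokens of new history per completed turn; both are illustrative parameters of mine, so the results land close to the table above without reproducing it exactly. The optional `history_cap` stands in for any trimming or summarizing strategy.

```python
from typing import Optional

def cumulative_input_tokens(turns: int, base: int = 800, per_turn: int = 250,
                            history_cap: Optional[int] = None) -> int:
    """Total input tokens billed over a conversation of `turns` turns.

    base:        tokens sent on the first turn (system prompt, tools, user message)
    per_turn:    tokens each completed turn adds to the history
    history_cap: if set, history is trimmed or summarized to at most this many tokens
    """
    total = 0
    for turn in range(turns):
        history = turn * per_turn          # everything said before this turn
        if history_cap is not None:
            history = min(history, history_cap)
        total += base + history
    return total

for n in (1, 3, 5, 10):
    full = cumulative_input_tokens(n)
    capped = cumulative_input_tokens(n, history_cap=500)
    print(f"{n:>2} turns: {full:>6} tokens uncapped, {capped:>6} with a 500-token cap")
```

Uncapped, the total has a term proportional to turns squared. With a cap, growth becomes linear once the cap is reached, which is the effect a summary-based memory is meant to achieve.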

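Tip 4: Batch API

For tasks that do not need a real-time response, use the Batch API and cut the cost in half.

# Good candidates for Batch:
# - daily report generation
# - bulk content moderation
# - offline data analysis
# - periodic reports

# OpenAI Batch API: 50% discount
# Claude Batch API: 50% discount
# but latency goes from seconds to minutes or hours

The bulk content quality check in my content pipeline runs on the Batch API, which took its monthly cost from $12 down to $6.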
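
Most systems have a mix of real-time and offline work, so the saving applies only to the part that can wait. A minimal sketch, with `monthly_cost_with_batch` as an illustrative helper and the second example using made-up figures:

```python
def monthly_cost_with_batch(realtime_cost: float, batchable_cost: float,
                            batch_discount: float = 0.5) -> float:
    """Monthly cost once the batchable part of the workload moves to a Batch API.

    realtime_cost:  spend that must stay on the synchronous API
    batchable_cost: spend (at full price) that can tolerate delayed results
    """
    return realtime_cost + batchable_cost * (1 - batch_discount)

# The quality-check job from the text: $12 at full price, all of it batchable.
quality_check = monthly_cost_with_batch(realtime_cost=0.0, batchable_cost=12.0)
print(quality_check)

# Hypothetical mixed workload: $100 real-time plus $40 of offline jobs.
mixed = monthly_cost_with_batch(realtime_cost=100.0, batchable_cost=40.0)
print(mixed)
```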

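Tip 5: Embedding Cost Is Almost Negligible

Many people worry about the embedding cost of RAG, but text-embedding-3-small costs only $0.02 per 1M tokens. Fully indexing 100,000 documents costs about $2, which corresponds to roughly 1,000 tokens per document. The real RAG costs are LLM generation and vector store hosting, not embedding.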
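
The arithmetic behind that figure, as a short sketch. The 1,000 tokens per document is the average implied by the $2 total rather than a number stated by the source, and `indexing_cost` is a name I chose.

```python
def indexing_cost(num_items: int, avg_tokens_per_item: int,
                  price_per_m: float = 0.02) -> float:
    """Embedding cost in USD (default price: text-embedding-3-small)."""
    return num_items * avg_tokens_per_item * price_per_m / 1_000_000

# One-off index build: 100,000 documents at an assumed 1,000 tokens each.
full_index = indexing_cost(100_000, 1_000)
print(f"Index build: ${full_index:.2f}")

# Ongoing query embeddings: 30,000 queries a month at 50 tokens each.
monthly_queries = indexing_cost(30_000, 50)
print(f"Monthly query embeddings: ${monthly_queries:.2f}")
```

Even re-indexing the whole corpus every month would cost less than the cheapest vector store tier in Step 1.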

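Budget Template

Here is a budget template you can fill in directly:

Project: _____________
Planned launch date: _____________

1. Request volume estimate
- Average daily requests: ___
- Average tokens per request (input): ___
- Average tokens per request (output): ___
- Average conversation turns: ___

2. Model selection
- Primary model: ___ (price: $___/M input, $___/M output)
- Secondary model: ___ (price: $___/M input, $___/M output)

3. Monthly cost estimate
- LLM API: $___
- Embedding: $___
- Vector Store: $___
- Infrastructure (hosting): $___
- Monitoring tools: $___
- Subtotal: $___
- Buffer (x1.5): $___

4. Growth forecast
- Expected request volume in 3 months: ___x
- Expected monthly cost in 3 months: $___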
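
For the growth forecast, only the usage-driven part of the bill scales with requests. The sketch below separates the two; it assumes fixed costs stay flat, which may not hold if growth pushes you into a higher vector store or hosting tier. The function name and the example figures are hypothetical.

```python
def project_monthly_cost(variable_cost: float, fixed_cost: float,
                         growth_multiplier: float, buffer: float = 1.5) -> dict:
    """Project the budget after request volume grows by `growth_multiplier`.

    variable_cost: current monthly spend that scales with requests (LLM API, embedding)
    fixed_cost:    current monthly spend that does not (vector store, hosting, monitoring)
    """
    subtotal = variable_cost * growth_multiplier + fixed_cost
    return {
        "subtotal": round(subtotal, 2),
        "with_buffer": round(subtotal * buffer, 2),
    }

# Hypothetical project: $180 variable + $65 fixed today, 3x the requests in 3 months.
projection = project_monthly_cost(180.0, 65.0, growth_multiplier=3)
print(projection)
```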

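Summary

Three takeaways:

Work out all four layers of cost before you build: LLM API + Embedding + Vector Store + infrastructure. The easiest things to miss are the token build-up from conversation history and the extra consumption from retries.
Model tiering is the optimization with the best return: use a mini model for classification and extraction, a strong model for generation and reasoning, and Flash for summarization. Overall cost drops 40-50% with almost no effect on quality.
Estimated cost x 1.5 = real budget: retries, testing, prompt iteration, and longer conversations add hidden consumption that typically comes to 20-50% of the estimate. The 1.5x buffer is a rule of thumb.

Before you build an Agent system, run the numbers with the framework in this article. If the monthly cost exceeds what you expected, optimize the architecture first (model tiering, prompt caching, conversation compression) rather than cutting features.

What does your Agent system cost per month? Any money-saving tricks? Come talk about it at the 一人独角兽俱乐部 (One-Person Unicorn Club).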
