RAG vs Fine-tuning vs Prompt Engineering — A Decision Framework

Opening

I ran an A/B test: same customer service scenario, three approaches — pure Prompt Engineering, RAG, and Fine-tuning — each processing 1,000 real tickets over two weeks. The results — Prompt Engineering scored 82% customer satisfaction, RAG scored 89%, Fine-tuning scored 91%. But look at costs: Prompt Engineering cost $38, RAG cost $165, Fine-tuning cost $2,400 (including training). A 9-point satisfaction gap, a 63x cost gap. This isn't about which is better — it's about knowing when to use what.

The Problem

"I have proprietary data and want my LLM to use it — what should I do?" — this is the question I get asked most. Most people's first instinct is Fine-tuning or RAG, but in the majority of cases, a few few-shot examples in the prompt already solve 80% of the problem.

The fundamental difference between the three approaches:

Prompt Engineering: Don't change the model, change the input. Use system prompts, few-shot examples, and instructions to guide model behavior
RAG (Retrieval-Augmented Generation): Don't change the model, change the context. At runtime, retrieve relevant information from external data sources and inject it into the prompt
Fine-tuning: Change the model itself. Train the model's weights on your data so it internalizes specific behavioral patterns

Core Framework: Two-Dimensional Decision Matrix

Plot your requirements on two axes:

X-axis: Knowledge type — what does the model need to know?

Static knowledge (product manuals, company policies, FAQs) → can go in the prompt or be retrieved via RAG
Dynamic knowledge (real-time pricing, inventory, news) → must use RAG, since model training data is outdated

Y-axis: Behavioral pattern — how should the model act?

Generic behavior (summarizing, classifying, translating) → Prompt Engineering is enough
Proprietary behavior (specific tone, specialized terminology, industry conventions) → may need Fine-tuning

            Generic Behavior            Proprietary Behavior
         ┌──────────────┬──────────────┐
  Static │  Prompt      │  Fine-tuning │
  Know-  │  Engineering │  or high-    │
  ledge  │  (cheapest)  │  quality     │
         │              │  Prompt      │
         ├──────────────┼──────────────┤
  Dynamic│  RAG         │  RAG +       │
  Know-  │  (must-have) │  Fine-tuning │
  ledge  │              │  (priciest   │
         │              │  but best)   │
         └──────────────┴──────────────┘

Approach 1: Prompt Engineering

When to use: Small knowledge volume (fits within an 8K token prompt), behavioral patterns can be described clearly through instructions.

from openai import OpenAI
client = OpenAI()

# Approach 1: Few-shot Prompt Engineering
# Encode both knowledge and behavioral patterns in the prompt
SYSTEM_PROMPT = """You are ArkTop AI's customer service agent. Follow these rules:

## Refund Policy
- Unconditional refund within 7 days
- Refunds between 7-30 days require review
- No refunds after 30 days
- Annual plans refunded pro-rata

## Response Style
- Concise and professional, no excessive pleasantries
- Give clear answers, never say "please wait"
- When amounts are involved, always be precise to the cent

## Examples
User: I bought it three days ago and want a refund
Response: Absolutely. We offer unconditional refunds within 7 days — I'll process this right now. The refund will arrive within 3 business days.

User: I've used my annual plan for two months and want a refund
Response: Annual plans are refunded pro-rata based on remaining months. You've used 2 months, so the refund amount is 10/12 of the annual fee. Shall I proceed?"""

def prompt_engineering_approach(user_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content

Cost: Pure API call fees. System prompt is ~500 tokens; with GPT-4.1, per-request prompt cost is ~$0.001.

Pros: Zero infrastructure cost, fastest iteration cycle (just edit the prompt), can go live in 5 minutes Cons: Knowledge capacity limited by context window, instruction-following degrades when prompts exceed ~3,000 tokens

Approach 2: RAG

When to use: Large knowledge volume (hundreds to millions of documents), knowledge updates regularly, source attribution is needed.

from openai import OpenAI
from qdrant_client import QdrantClient

openai_client = OpenAI()
qdrant = QdrantClient(url="http://localhost:6333")

async def rag_approach(user_question: str) -> str:
    """RAG: retrieve relevant documents → feed to LLM for generation"""

    # Step 1: Convert user question to embedding
    embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",  # $0.02/1M tokens
        input=user_question
    ).data[0].embedding

    # Step 2: Retrieve the most relevant documents from the vector store (top 5)
    results = qdrant.query_points(
        collection_name="product_docs",
        query=embedding,
        limit=5,
        score_threshold=0.75  # Similarity threshold — below this, don't return
    )

    # Step 3: Assemble context
    context_docs = "\n---\n".join([
        f"[Source: {r.payload['source']}]\n{r.payload['content']}"
        for r in results.points
    ])

    # Step 4: Generate answer with context
    response = openai_client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": """Answer questions based on the reference documents provided.
Rules:
1. Only answer based on document content — do not fabricate
2. If the documents don't contain the answer, say so clearly
3. Cite source documents"""},
            {"role": "user", "content": f"Reference documents:\n{context_docs}\n\nQuestion: {user_question}"}
        ],
        temperature=0.2,  # Low temperature for RAG to reduce hallucination
    )
    return response.choices[0].message.content

Cost breakdown (monthly average, 10,000 queries):

Component	Monthly Cost
Embedding generation	$2.00
Vector DB (Qdrant Cloud)	$45.00
LLM generation (GPT-4.1)	$120.00
Document processing and updates	$15.00
Total	$182.00

Pros: Near-unlimited knowledge capacity, updating knowledge doesn't require retraining, can cite sources Cons: Retrieval quality directly impacts generation quality (garbage retrieval = garbage generation), requires maintaining a vector store, adds 200-500ms latency

Approach 3: Fine-tuning

When to use: The model needs to internalize specific behavioral patterns (tone, format, reasoning style), and those patterns are difficult to describe through prompts alone.

from openai import OpenAI
client = OpenAI()

# Step 1: Prepare training data (JSONL format)
# Need at least 50-100 high-quality samples
training_data = [
    {
        "messages": [
            {"role": "system", "content": "You are ArkTop AI's customer service agent."},
            {"role": "user", "content": "Why haven't my points arrived yet?"},
            {"role": "assistant", "content": "I checked — your points were credited on March 2, current balance is 2,450 points. If that doesn't match what you expected, tell me what you think it should be and I'll verify."}
        ]
    },
    # ... at least 50 similar samples
]

# Step 2: Upload training file
file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune"
)

# Step 3: Start Fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4.1-mini",  # Choose a cost-effective base model
    hyperparameters={
        "n_epochs": 3,          # Training epochs
        "learning_rate_multiplier": 1.8,
    }
)
# GPT-4.1-mini fine-tuning: $3.00/1M training tokens
# 100 samples ≈ 50K tokens → training cost ≈ $0.15
# But the human labor cost of data preparation far exceeds this

# Step 4: Use the fine-tuned model
def finetuned_approach(user_message: str) -> str:
    response = client.chat.completions.create(
        model=f"ft:gpt-4.1-mini:arktop-ai:customer-service:xxx",
        messages=[
            {"role": "system", "content": "You are ArkTop AI's customer service agent."},
            {"role": "user", "content": user_message},
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content

Cost breakdown:

Item	Cost
Data preparation (human labor, 100 high-quality samples)	$500-$2,000
GPT-4.1-mini Fine-tuning (50K tokens x 3 epochs)	$0.45
Inference cost (same as base model)	Same as Approach 1
Retraining per knowledge update	$500+ (including data prep)

Pros: Fast inference (no retrieval step needed), more consistent model behavior, system prompt can be very short (saves tokens) Cons: High data preparation cost, requires retraining for every knowledge update, risk of catastrophic forgetting

Practical Lessons

Decision Tree

In real projects, I use this decision tree:

1. Can the knowledge fit into an 8K token prompt?
   ├── Yes → Prompt Engineering (try this first)
   │      Good enough?
   │      ├── Yes → Use this, done
   │      └── No → Next step
   │
   └── No → Next step

2. Does the knowledge need frequent updates? (weekly or more)
   ├── Yes → RAG (must-have)
   │      Do behavioral patterns also need customization?
   │      ├── Yes → RAG + Fine-tuning
   │      └── No → Pure RAG
   │
   └── No → Next step

3. Do you have 100+ high-quality training samples?
   ├── Yes → Fine-tuning
   └── No → Use Prompt Engineering for now,
              accumulate data, Fine-tune when ready

Performance Comparison Across All Three (Real Data)

A/B test in a customer service scenario, 1,000 tickets per approach:

Metric	Prompt Engineering	RAG	Fine-tuning
Accuracy	78%	89%	91%
Customer satisfaction	82%	89%	91%
Average latency	1.2s	2.1s	1.0s
Two-week total cost	$38	$165	$2,400
Time to launch	1 day	1 week	3 weeks
Knowledge update speed	Instant	Minutes	Days

Hybrid Approach: Production Best Practice

In 2026, most production systems use a hybrid approach:

async def hybrid_approach(user_question: str) -> str:
    """Hybrid: Prompt Engineering + RAG, Fine-tuning when necessary"""

    # Layer 1: Check if it hits an FAQ (pure prompt can answer)
    faq_match = check_faq_cache(user_question)
    if faq_match and faq_match.confidence > 0.9:
        return faq_match.answer  # Cache hit, zero API cost

    # Layer 2: RAG retrieval
    docs = await retrieve_relevant_docs(user_question)

    # Layer 3: Generate with fine-tuned model (if available)
    model = "ft:gpt-4.1-mini:arktop:cs:xxx" if FINETUNED_AVAILABLE else "gpt-4.1"
    response = await generate_response(model, user_question, docs)
    return response

Pitfalls I've Hit

Pitfall 1: RAG retrieval quality. After implementing RAG the results were poor, and I assumed it was an LLM issue, spending two weeks tweaking prompts. The actual root cause: the embedding search was stuffing irrelevant documents into the context. Solution: raise the similarity threshold from 0.6 to 0.78 — better to return nothing than to return garbage.

Pitfall 2: Fine-tuning data quality. My first Fine-tuning run used 200 auto-generated training samples, and the model picked up an "AI-sounding" response style. Switching to 100 hand-crafted samples produced dramatically better results. Fine-tuning quality depends on data quality, not data quantity.

Pitfall 3: Premature Fine-tuning. One project spent three weeks on Fine-tuning right at launch, only to discover post-launch that business requirements had changed and all the training data was wasted. Lesson: run Prompt Engineering for three months first, accumulate real user data, then consider Fine-tuning.

Takeaways

Three things to remember:

80% of needs can be met with Prompt Engineering — start with the simplest approach and only escalate if results fall short. Before jumping from $38 to $2,400, make sure that quality gap is worth the price
RAG and Fine-tuning solve different problems — RAG addresses "what to know" (knowledge), Fine-tuning addresses "how to act" (behavior). Figure out which one your need actually is
Hybrid is the production standard — FAQ cache + RAG retrieval + Fine-tuned model, layered processing. Simple questions hit the cache (zero cost), complex questions go through RAG (moderate cost), core scenarios use Fine-tuning (high quality)

Next time someone asks "should I use RAG or Fine-tuning?", counter with: "How much knowledge do you need to add? How often does it change? Do you have high-quality training data?" The answer will emerge on its own.

What approach are you using in production? How do you split the hybrid? Come share at the Solo Unicorn Club.

RAG vs Fine-tuning vs Prompt Engineering — A Decision Framework

RAG vs Fine-tuning vs Prompt Engineering — A Decision Framework

Opening

The Problem

Core Framework: Two-Dimensional Decision Matrix

Approach 1: Prompt Engineering

Approach 2: RAG

Approach 3: Fine-tuning

Practical Lessons

Decision Tree

Performance Comparison Across All Three (Real Data)

Hybrid Approach: Production Best Practice

Pitfalls I've Hit

Takeaways

Keep reading.

Single Agent vs Multi-Agent — When to Use Which

Designing Memory Systems for AI Agents

How to Connect Your AI Agent to Company Data — A Step-by-Step Guide