
RAG vs Fine-tuning vs Prompt Engineering — A Decision Framework

Tags: RAG · Fine-tuning · Prompt Engineering · LLM · Decision Framework · Cost Analysis

Opening

I ran an A/B test: same customer service scenario, three approaches — pure Prompt Engineering, RAG, and Fine-tuning — each processing 1,000 real tickets over two weeks. The results — Prompt Engineering scored 82% customer satisfaction, RAG scored 89%, Fine-tuning scored 91%. But look at costs: Prompt Engineering cost $38, RAG cost $165, Fine-tuning cost $2,400 (including training). A 9-point satisfaction gap, a 63x cost gap. This isn't about which is better — it's about knowing when to use what.

The Problem

"I have proprietary data and want my LLM to use it — what should I do?" — this is the question I get asked most. Most people's first instinct is Fine-tuning or RAG, but in the majority of cases a handful of few-shot examples in the prompt already solves 80% of the problem.

The fundamental difference between the three approaches:

  • Prompt Engineering: Don't change the model, change the input. Use system prompts, few-shot examples, and instructions to guide model behavior
  • RAG (Retrieval-Augmented Generation): Don't change the model, change the context. At runtime, retrieve relevant information from external data sources and inject it into the prompt
  • Fine-tuning: Change the model itself. Train the model's weights on your data so it internalizes specific behavioral patterns

Core Framework: Two-Dimensional Decision Matrix

Plot your requirements on two axes:

X-axis: Knowledge type — what does the model need to know?

  • Static knowledge (product manuals, company policies, FAQs) → can go in the prompt or be retrieved via RAG
  • Dynamic knowledge (real-time pricing, inventory, news) → must use RAG, since model training data is outdated

Y-axis: Behavioral pattern — how should the model act?

  • Generic behavior (summarizing, classifying, translating) → Prompt Engineering is enough
  • Proprietary behavior (specific tone, specialized terminology, industry conventions) → may need Fine-tuning
            Generic Behavior            Proprietary Behavior
         ┌──────────────┬──────────────┐
  Static │  Prompt      │  Fine-tuning │
  Know-  │  Engineering │  or high-    │
  ledge  │  (cheapest)  │  quality     │
         │              │  Prompt      │
         ├──────────────┼──────────────┤
  Dynamic│  RAG         │  RAG +       │
  Know-  │  (must-have) │  Fine-tuning │
  ledge  │              │  (priciest   │
         │              │  but best)   │
         └──────────────┴──────────────┘
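The matrix reads as a straight lookup. A minimal sketch of it in code (the function name and the dictionary keys are my own; the recommendation strings come from the matrix):

```python
# Hypothetical lookup encoding the 2x2 decision matrix above
RECOMMENDATION = {
    ("static", "generic"): "Prompt Engineering (cheapest)",
    ("static", "proprietary"): "Fine-tuning or high-quality Prompt",
    ("dynamic", "generic"): "RAG (must-have)",
    ("dynamic", "proprietary"): "RAG + Fine-tuning (priciest but best)",
}

def recommend(knowledge: str, behavior: str) -> str:
    """knowledge: 'static' | 'dynamic'; behavior: 'generic' | 'proprietary'."""
    return RECOMMENDATION[(knowledge, behavior)]
```

For example, `recommend("dynamic", "generic")` lands in the bottom-left quadrant: RAG.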

Approach 1: Prompt Engineering

When to use: Small knowledge volume (fits within an 8K token prompt), behavioral patterns can be described clearly through instructions.

from openai import OpenAI
client = OpenAI()

# Approach 1: Few-shot Prompt Engineering
# Encode both knowledge and behavioral patterns in the prompt
SYSTEM_PROMPT = """You are ArkTop AI's customer service agent. Follow these rules:

## Refund Policy
- Unconditional refund within 7 days
- Refunds between 7-30 days require review
- No refunds after 30 days
- Annual plans refunded pro-rata

## Response Style
- Concise and professional, no excessive pleasantries
- Give clear answers, never say "please wait"
- When amounts are involved, always be precise to the cent

## Examples
User: I bought it three days ago and want a refund
Response: Absolutely. We offer unconditional refunds within 7 days — I'll process this right now. The refund will arrive within 3 business days.

User: I've used my annual plan for two months and want a refund
Response: Annual plans are refunded pro-rata based on remaining months. You've used 2 months, so the refund amount is 10/12 of the annual fee. Shall I proceed?"""

def prompt_engineering_approach(user_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content

Cost: Pure API call fees. System prompt is ~500 tokens; with GPT-4.1, per-request prompt cost is ~$0.001.

Pros: Zero infrastructure cost, fastest iteration cycle (just edit the prompt), can go live in 5 minutes.
Cons: Knowledge capacity is limited by the context window; instruction-following degrades when prompts exceed ~3,000 tokens.
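The per-request arithmetic is worth making explicit. A small estimator, assuming GPT-4.1's list prices of $2/1M input tokens and $8/1M output tokens (check current pricing before relying on these numbers):

```python
def per_request_cost(prompt_tokens: int, completion_tokens: int,
                     input_rate: float = 2.00, output_rate: float = 8.00) -> float:
    """Estimate a single chat completion's cost in USD.

    Rates are USD per 1M tokens (assumed GPT-4.1 list prices).
    """
    return (prompt_tokens * input_rate + completion_tokens * output_rate) / 1_000_000

# A ~500-token system prompt alone costs about $0.001 per request
print(round(per_request_cost(500, 0), 6))
```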

Approach 2: RAG

When to use: Large knowledge volume (hundreds to millions of documents), knowledge updates regularly, source attribution is needed.

from openai import OpenAI
from qdrant_client import QdrantClient

openai_client = OpenAI()
qdrant = QdrantClient(url="http://localhost:6333")

async def rag_approach(user_question: str) -> str:
    """RAG: retrieve relevant documents → feed to LLM for generation"""

    # Step 1: Convert user question to embedding
    embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",  # $0.02/1M tokens
        input=user_question
    ).data[0].embedding

    # Step 2: Retrieve the most relevant documents from the vector store (top 5)
    results = qdrant.query_points(
        collection_name="product_docs",
        query=embedding,
        limit=5,
        score_threshold=0.75  # Similarity threshold — below this, don't return
    )

    # Step 3: Assemble context
    context_docs = "\n---\n".join([
        f"[Source: {r.payload['source']}]\n{r.payload['content']}"
        for r in results.points
    ])

    # Step 4: Generate answer with context
    response = openai_client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": """Answer questions based on the reference documents provided.
Rules:
1. Only answer based on document content — do not fabricate
2. If the documents don't contain the answer, say so clearly
3. Cite source documents"""},
            {"role": "user", "content": f"Reference documents:\n{context_docs}\n\nQuestion: {user_question}"}
        ],
        temperature=0.2,  # Low temperature for RAG to reduce hallucination
    )
    return response.choices[0].message.content

Cost breakdown (monthly average, 10,000 queries):

Component                         Monthly Cost
Embedding generation              $2.00
Vector DB (Qdrant Cloud)          $45.00
LLM generation (GPT-4.1)          $120.00
Document processing and updates   $15.00
Total                             $182.00

Pros: Near-unlimited knowledge capacity, updating knowledge doesn't require retraining, can cite sources.
Cons: Retrieval quality directly impacts generation quality (garbage retrieval = garbage generation), requires maintaining a vector store, adds 200-500ms latency.
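Retrieval quality starts with how documents were split before embedding. A minimal fixed-size chunker with overlap, a common default (the 800/100 sizes are my assumption, not a recommendation from the test above):

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping fixed-size chunks for embedding.

    Overlap keeps sentences that straddle a boundary retrievable
    from at least one chunk.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), step)]
    # Drop a trailing fragment already fully contained in the previous chunk
    if len(chunks) > 1 and len(chunks[-1]) <= overlap:
        chunks.pop()
    return chunks
```

Each chunk would then be embedded and upserted into the `product_docs` collection that the query code above searches.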

Approach 3: Fine-tuning

When to use: The model needs to internalize specific behavioral patterns (tone, format, reasoning style), and those patterns are difficult to describe through prompts alone.

from openai import OpenAI
client = OpenAI()

# Step 1: Prepare training data (JSONL format)
# Need at least 50-100 high-quality samples
training_data = [
    {
        "messages": [
            {"role": "system", "content": "You are ArkTop AI's customer service agent."},
            {"role": "user", "content": "Why haven't my points arrived yet?"},
            {"role": "assistant", "content": "I checked — your points were credited on March 2, current balance is 2,450 points. If that doesn't match what you expected, tell me what you think it should be and I'll verify."}
        ]
    },
    # ... at least 50 similar samples
]

# Step 2: Serialize to JSONL and upload
import json
with open("training_data.jsonl", "w") as f:
    for sample in training_data:
        f.write(json.dumps(sample) + "\n")

file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune"
)

# Step 3: Start Fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4.1-mini",  # Choose a cost-effective base model
    hyperparameters={
        "n_epochs": 3,          # Training epochs
        "learning_rate_multiplier": 1.8,
    }
)
# GPT-4.1-mini fine-tuning: $3.00/1M training tokens
# 100 samples ≈ 50K tokens × 3 epochs → training cost ≈ $0.45
# But the human labor cost of data preparation far exceeds this

# Step 4: Use the fine-tuned model
def finetuned_approach(user_message: str) -> str:
    response = client.chat.completions.create(
        model="ft:gpt-4.1-mini:arktop-ai:customer-service:xxx",
        messages=[
            {"role": "system", "content": "You are ArkTop AI's customer service agent."},
            {"role": "user", "content": user_message},
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content
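The job created in Step 3 runs asynchronously, so Step 4 only works once it has finished. A minimal polling sketch (the helper name is mine; the client is passed in as a parameter so the function stays testable):

```python
import time

def wait_for_finetune(client, job_id: str, poll_seconds: float = 30.0) -> str:
    """Poll a fine-tuning job until it reaches a terminal status.

    Returns the fine-tuned model name (the ft:... identifier used in Step 4).
    """
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status == "succeeded":
            return job.fine_tuned_model
        if job.status in ("failed", "cancelled"):
            raise RuntimeError(f"Fine-tuning job {job_id} {job.status}")
        time.sleep(poll_seconds)
```

Usage: `model_name = wait_for_finetune(client, job.id)`, then pass `model_name` to `chat.completions.create`.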

Cost breakdown:

Item                                                      Cost
Data preparation (human labor, 100 high-quality samples)  $500-$2,000
GPT-4.1-mini fine-tuning (50K tokens × 3 epochs)          $0.45
Inference (same as base model)                            Same as Approach 1
Retraining per knowledge update                           $500+ (including data prep)

Pros: Fast inference (no retrieval step needed), more consistent model behavior, system prompt can be very short (saves tokens).
Cons: High data preparation cost, retraining required for every knowledge update, risk of catastrophic forgetting.

Practical Lessons

Decision Tree

In real projects, I use this decision tree:

1. Can the knowledge fit into an 8K token prompt?
   ├── Yes → Prompt Engineering (try this first)
   │      Good enough?
   │      ├── Yes → Use this, done
   │      └── No → Next step
   │
   └── No → Next step

2. Does the knowledge need frequent updates? (weekly or more)
   ├── Yes → RAG (must-have)
   │      Do behavioral patterns also need customization?
   │      ├── Yes → RAG + Fine-tuning
   │      └── No → Pure RAG
   │
   └── No → Next step

3. Do you have 100+ high-quality training samples?
   ├── Yes → Fine-tuning
   └── No → Use Prompt Engineering for now,
              accumulate data, Fine-tune when ready
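The tree above can be written as a function. A simplified sketch (names and signature are mine; it skips the "try it, then escalate if quality falls short" loop, which only a real evaluation can drive):

```python
def choose_approach(knowledge_tokens: int,
                    updates_weekly: bool,
                    custom_behavior: bool,
                    training_samples: int) -> str:
    """Walk the decision tree; returns a recommended starting approach."""
    # Q1: does the knowledge fit in an 8K token prompt?
    if knowledge_tokens <= 8_000:
        return "Prompt Engineering (try this first)"
    # Q2: does the knowledge update weekly or more often?
    if updates_weekly:
        return "RAG + Fine-tuning" if custom_behavior else "RAG"
    # Q3: are 100+ high-quality training samples available?
    if training_samples >= 100:
        return "Fine-tuning"
    return "Prompt Engineering for now; accumulate data, fine-tune when ready"
```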

Performance Comparison Across All Three (Real Data)

A/B test in a customer service scenario, 1,000 tickets per approach:

Metric                   Prompt Engineering   RAG       Fine-tuning
Accuracy                 78%                  89%       91%
Customer satisfaction    82%                  89%       91%
Average latency          1.2s                 2.1s      1.0s
Two-week total cost      $38                  $165      $2,400
Time to launch           1 day                1 week    3 weeks
Knowledge update speed   Instant              Minutes   Days

Hybrid Approach: Production Best Practice

In 2026, most production systems use a hybrid approach:

async def hybrid_approach(user_question: str) -> str:
    """Hybrid: Prompt Engineering + RAG, Fine-tuning when necessary.

    check_faq_cache, retrieve_relevant_docs, generate_response, and
    FINETUNED_AVAILABLE are application-level helpers defined elsewhere.
    """

    # Layer 1: Check if it hits an FAQ (pure prompt can answer)
    faq_match = check_faq_cache(user_question)
    if faq_match and faq_match.confidence > 0.9:
        return faq_match.answer  # Cache hit, zero API cost

    # Layer 2: RAG retrieval
    docs = await retrieve_relevant_docs(user_question)

    # Layer 3: Generate with fine-tuned model (if available)
    model = "ft:gpt-4.1-mini:arktop:cs:xxx" if FINETUNED_AVAILABLE else "gpt-4.1"
    response = await generate_response(model, user_question, docs)
    return response
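Layer 1's `check_faq_cache` is left abstract above. A minimal exact-match version to make the shape concrete (a real system would match on embedding similarity; `FAQMatch` and the cache contents are hypothetical):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FAQMatch:
    answer: str
    confidence: float

# Hypothetical FAQ table keyed by normalized question text
FAQ_CACHE = {
    "how do i get a refund": FAQMatch(
        answer="Unconditional refund within 7 days; 7-30 days requires review.",
        confidence=1.0,
    ),
}

def check_faq_cache(user_question: str) -> Optional[FAQMatch]:
    """Normalize the question and look it up; None means no cache hit."""
    key = user_question.lower().strip().rstrip("?!.")
    return FAQ_CACHE.get(key)
```

Exact matching keeps the layer at zero API cost; swapping in embedding lookup trades a small embedding fee for far better recall.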

Pitfalls I've Hit

Pitfall 1: RAG retrieval quality. After implementing RAG the results were poor, and I assumed it was an LLM issue, spending two weeks tweaking prompts. The actual root cause: the embedding search was stuffing irrelevant documents into the context. Solution: raise the similarity threshold from 0.6 to 0.78 — better to return nothing than to return garbage.

Pitfall 2: Fine-tuning data quality. My first Fine-tuning run used 200 auto-generated training samples, and the model picked up an "AI-sounding" response style. Switching to 100 hand-crafted samples produced dramatically better results. Fine-tuning quality depends on data quality, not data quantity.

Pitfall 3: Premature Fine-tuning. One project spent three weeks on Fine-tuning right at launch, only to discover post-launch that business requirements had changed and all the training data was wasted. Lesson: run Prompt Engineering for three months first, accumulate real user data, then consider Fine-tuning.

Takeaways

Three things to remember:

  1. 80% of needs can be met with Prompt Engineering — start with the simplest approach and only escalate if results fall short. Before jumping from $38 to $2,400, make sure that quality gap is worth the price
  2. RAG and Fine-tuning solve different problems — RAG addresses "what to know" (knowledge), Fine-tuning addresses "how to act" (behavior). Figure out which one your need actually is
  3. Hybrid is the production standard — FAQ cache + RAG retrieval + Fine-tuned model, layered processing. Simple questions hit the cache (zero cost), complex questions go through RAG (moderate cost), core scenarios use Fine-tuning (high quality)

Next time someone asks "should I use RAG or Fine-tuning?", counter with: "How much knowledge do you need to add? How often does it change? Do you have high-quality training data?" The answer will emerge on its own.

What approach are you using in production? How do you split the hybrid? Come share at the Solo Unicorn Club.