
Designing Memory Systems for AI Agents

Tags: AI Agent, Memory System, Vector Store, RAG, Conversation Management, Long-term Memory

Opening

During its first month live, my community Q&A Agent was asked the same question three times by the same member: "How do I sign up for offline events?" Each time, the Agent explained from scratch as if hearing it for the first time. This wasn't a capability problem — it was a memory problem. After adding a memory layer, the second time around the Agent could simply say, "Same as last time — here's the link. You signed up for the March event previously." Response quality scores jumped from 7.2 to 9.6. Memory is the dividing line between an AI Agent that's merely a "tool" and one that's truly an "assistant."

The Problem

LLMs are inherently stateless — each request is independent, and the model doesn't remember what the previous conversation was about. The mainstream approach to "memory" is stuffing conversation history into the prompt, but this has three hard limitations:

  1. Context windows are finite: Even with a 200K token window, a few dozen turns fill it up
  2. Costs grow linearly: The longer the history, the more tokens each request burns
  3. No long-term accumulation: When the conversation ends, the memory vanishes

A real memory system needs to solve three problems: what to remember, how to store it, and how to retrieve it.

Core Architecture: Three-Layer Memory Model

I divide Agent memory into three layers, modeled after the human memory system:

┌─────────────────────────────────────────┐
│  Short-term Memory (Working Memory)     │
│  Context window for the current session │
│  Storage: In-memory / Message Array     │
│  Retention: Current session only        │
├─────────────────────────────────────────┤
│  Long-term Memory                       │
│  Cross-session knowledge and facts      │
│  Storage: Vector Store (Qdrant/Pinecone)│
│  Retention: Permanent                   │
├─────────────────────────────────────────┤
│  Episodic Memory                        │
│  User profiles, preferences, history    │
│  Storage: Relational DB (PostgreSQL)    │
│  Retention: Permanent, periodically     │
│  updated                                │
└─────────────────────────────────────────┘

Layer 1: Short-term Memory — Conversation Window Management

The most basic form of memory, and the easiest to get wrong. The core question: how do you compress when the conversation gets too long?

from dataclasses import dataclass, field
from openai import OpenAI

client = OpenAI()

@dataclass
class WorkingMemory:
    """Short-term memory: manages the context window for the current conversation"""
    messages: list[dict] = field(default_factory=list)
    max_tokens: int = 8000       # Token ceiling for short-term memory
    summary_threshold: int = 6000 # Trigger compression above this value

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        # Check if compression is needed
        if self._estimate_tokens() > self.summary_threshold:
            self._compress()

    def _compress(self):
        """Compression strategy: keep the last 3 turns + summarize history"""
        # Keep the system prompt (first message) and the last 3 turns (last 6 messages)
        system = self.messages[0] if self.messages[0]["role"] == "system" else None
        recent = self.messages[-6:]
        old = self.messages[1:-6] if system else self.messages[:-6]

        if not old:
            return

        # Use a smaller model to generate a history summary
        summary_response = client.chat.completions.create(
            model="gpt-4.1-mini",  # Use a cheap model for compression
            messages=[
                {"role": "system", "content": "Compress the following conversation into a summary of key information. Retain important facts, decisions, and user preferences. Keep it under 200 words."},
                {"role": "user", "content": str(old)}
            ],
            max_tokens=300,
        )
        summary = summary_response.choices[0].message.content

        # Reassemble the message list
        self.messages = []
        if system:
            self.messages.append(system)
        self.messages.append({
            "role": "system",
            "content": f"[Conversation History Summary] {summary}"
        })
        self.messages.extend(recent)

    def _estimate_tokens(self) -> int:
        """Rough token estimate"""
        total_chars = sum(len(m["content"]) for m in self.messages)
        return int(total_chars * 0.7)  # Rough coefficient for mixed content

Key design decisions:

  • Set the compression threshold at roughly 75% of max_tokens to leave room for new messages
  • Use GPT-4.1-mini for summaries — costs about $0.0003 per compression
  • Keep the last 3 turns verbatim (users remember recent messages most vividly, and compression loses detail)
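The compression policy above can be isolated from the API call. Here's a hypothetical standalone sketch of the same sliding-window-plus-summary logic, with the summarizer injected as a callable (the lambda default stands in for the GPT-4.1-mini call, so the windowing behavior is testable offline):

```python
def compress(messages: list[dict], keep_recent: int = 6,
             summarize=lambda old: f"[{len(old)} earlier messages summarized]") -> list[dict]:
    """Keep system prompt + last keep_recent messages; fold the rest into a summary.

    `summarize` is a stand-in for the cheap-model summarization call.
    """
    has_system = bool(messages) and messages[0]["role"] == "system"
    recent = messages[-keep_recent:]
    old = messages[1:-keep_recent] if has_system else messages[:-keep_recent]
    if not old:
        return messages  # nothing old enough to fold into a summary yet

    result = messages[:1] if has_system else []
    result.append({
        "role": "system",
        "content": f"[Conversation History Summary] {summarize(old)}",
    })
    result.extend(recent)
    return result
```

With one system message and ten turns, this returns eight messages: the system prompt, one summary, and the six most recent messages.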

Layer 2: Long-term Memory — Vector Store

Cross-session knowledge accumulation. Valuable information from each conversation is written to a vector store; when the next conversation begins, relevant memories are retrieved based on the current topic.

from qdrant_client import QdrantClient
from qdrant_client.models import (
    PointStruct, Filter, FieldCondition, Range, DatetimeRange,
)
from openai import OpenAI
from uuid import uuid4
from datetime import datetime, timedelta

class LongTermMemory:
    """Long-term memory: vector store-based cross-session knowledge base"""

    def __init__(self):
        self.qdrant = QdrantClient(url="http://localhost:6333")
        self.openai = OpenAI()
        self.collection = "agent_memory"

    async def remember(self, content: str, metadata: dict):
        """Store a new memory"""
        # Generate embedding
        embedding = self.openai.embeddings.create(
            model="text-embedding-3-small",
            input=content
        ).data[0].embedding

        # Write to vector store
        self.qdrant.upsert(
            collection_name=self.collection,
            points=[PointStruct(
                id=str(uuid4()),
                vector=embedding,
                payload={
                    "content": content,
                    "timestamp": datetime.now().isoformat(),
                    "access_count": 0,    # Times retrieved
                    "importance": metadata.get("importance", 0.5),
                    **metadata
                }
            )]
        )

    async def recall(self, query: str, top_k: int = 5,
                     min_relevance: float = 0.75) -> list[dict]:
        """Retrieve relevant memories"""
        query_vector = self.openai.embeddings.create(
            model="text-embedding-3-small",
            input=query
        ).data[0].embedding

        results = self.qdrant.query_points(
            collection_name=self.collection,
            query=query_vector,
            limit=top_k,
            score_threshold=min_relevance,
        )

        memories = []
        for r in results.points:
            memories.append({
                "content": r.payload["content"],
                "relevance": r.score,
                "timestamp": r.payload["timestamp"],
                "importance": r.payload["importance"],
            })
            # Update access count so frequently-recalled memories survive pruning
            self._update_access_count(r.id, r.payload.get("access_count", 0) + 1)

        return memories

    def _update_access_count(self, point_id, new_count: int):
        """Write the incremented access count back to the point's payload"""
        self.qdrant.set_payload(
            collection_name=self.collection,
            payload={"access_count": new_count},
            points=[point_id],
        )

    async def forget(self, older_than_days: int = 90,
                     min_access_count: int = 0):
        """Forgetting mechanism: prune low-value memories"""
        cutoff = datetime.now() - timedelta(days=older_than_days)
        # Delete memories older than 90 days that have never been retrieved
        self.qdrant.delete(
            collection_name=self.collection,
            points_selector=Filter(
                must=[
                    # DatetimeRange (qdrant_client.models) filters RFC 3339 timestamp
                    # strings; requires a datetime payload index on "timestamp"
                    FieldCondition(key="timestamp", range=DatetimeRange(lt=cutoff)),
                    FieldCondition(key="access_count", range=Range(lte=min_access_count)),
                    FieldCondition(key="importance", range=Range(lt=0.8)),
                ]
            )
        )

Forgetting is essential. A memory system without forgetting will grow indefinitely, and retrieval quality degrades as the data volume increases. My strategy: automatically prune memories that haven't been retrieved in 90 days and have an importance score below 0.8.

Layer 3: Episodic Memory — User Profiles

Structured user information, stored in a traditional database.

from dataclasses import dataclass
from datetime import datetime
from openai import AsyncOpenAI
import json

client = AsyncOpenAI()

@dataclass
class UserProfile:
    """Episodic memory: structured user profile"""
    user_id: str
    name: str
    interests: list[str]           # Topics of interest
    expertise_level: str           # beginner / intermediate / expert
    preferred_language: str        # zh / en
    interaction_count: int         # Total interactions
    last_topics: list[str]         # Recently discussed topics
    preferences: dict              # Personalization preferences
    first_seen: datetime
    last_seen: datetime

class EpisodicMemory:
    """Episodic memory manager"""

    def __init__(self, db_connection):
        self.db = db_connection

    async def update_profile(self, user_id: str,
                             conversation: list[dict]):
        """Extract information from conversations to update user profiles"""
        profile = await self.get_profile(user_id)

        # Use an LLM to extract structured information from the conversation
        extraction = await client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[
                {"role": "system", "content": """Extract user information from the conversation and output JSON:
{
    "new_interests": ["topic1", "topic2"],
    "expertise_indicators": "beginner/intermediate/expert",
    "preferences": {"key": "value"},
    "key_facts": ["important fact 1", "important fact 2"]
}
Only extract explicitly mentioned information — do not speculate."""},
                {"role": "user", "content": json.dumps(conversation, ensure_ascii=False)}
            ],
            response_format={"type": "json_object"},
        )
        updates = json.loads(extraction.choices[0].message.content)

        # Incrementally update the profile
        profile.interests = list(set(profile.interests + updates.get("new_interests", [])))
        if updates.get("expertise_indicators"):
            profile.expertise_level = updates["expertise_indicators"]
        profile.last_topics = updates.get("new_interests", profile.last_topics)
        profile.interaction_count += 1
        profile.last_seen = datetime.now()

        await self.save_profile(profile)

    async def get_context_for_agent(self, user_id: str) -> str:
        """Generate user context for the Agent"""
        profile = await self.get_profile(user_id)
        return f"""User information:
- Name: {profile.name}
- Areas of interest: {', '.join(profile.interests[-5:])}
- Expertise level: {profile.expertise_level}
- Total interactions: {profile.interaction_count}
- Recent topics: {', '.join(profile.last_topics[-3:])}
- Preferences: {json.dumps(profile.preferences, ensure_ascii=False)}"""
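One subtlety: `list(set(...))` in `update_profile` discards insertion order, so `interests[-5:]` in `get_context_for_agent` no longer means "most recent". An order-preserving merge (a hypothetical drop-in, not the article's original code) keeps that slice meaningful:

```python
def merge_interests(existing: list[str], new: list[str]) -> list[str]:
    """Order-preserving union: keeps first-seen order, drops duplicates."""
    seen = set()
    merged = []
    for topic in existing + new:
        if topic not in seen:
            seen.add(topic)
            merged.append(topic)
    return merged
```

With this, newly mentioned topics always land at the end of the list, so a trailing slice reliably picks up the user's latest interests.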

Implementation Details: Integrating All Three Layers

class AgentMemorySystem:
    """Complete system integrating all three memory layers"""

    def __init__(self, db):
        self.working = WorkingMemory()
        self.long_term = LongTermMemory()
        self.episodic = EpisodicMemory(db)

    async def prepare_context(self, user_id: str,
                               user_message: str) -> list[dict]:
        """Prepare complete context for the Agent (fusing all three memory layers)"""

        # 1. Episodic memory: get user profile
        user_context = await self.episodic.get_context_for_agent(user_id)

        # 2. Long-term memory: retrieve relevant knowledge
        memories = await self.long_term.recall(user_message, top_k=3)
        memory_text = "\n".join([
            f"- [{m['timestamp'][:10]}] {m['content']}"
            for m in memories
        ]) if memories else "No relevant historical memories"

        # 3. Short-term memory: current conversation history
        self.working.add("user", user_message)

        # 4. Assemble full context
        system_prompt = f"""You are a community assistant.

{user_context}

Relevant historical memories:
{memory_text}

Answer the user's question based on the above information. If historical memories contain relevant content, reference it."""

        messages = [{"role": "system", "content": system_prompt}]
        # Append conversation history, skipping the original system prompt if present
        history = self.working.messages
        if history and history[0]["role"] == "system":
            history = history[1:]
        messages.extend(history)
        return messages

    async def post_response(self, user_id: str,
                            user_message: str, agent_response: str):
        """Post-response memory updates"""
        # Update short-term memory
        self.working.add("assistant", agent_response)

        # Decide if it's worth storing in long-term memory
        if await self._is_worth_remembering(user_message, agent_response):
            await self.long_term.remember(
                content=f"Q: {user_message}\nA: {agent_response}",
                metadata={"user_id": user_id, "importance": 0.6}
            )

        # Update user profile
        await self.episodic.update_profile(
            user_id,
            [{"role": "user", "content": user_message},
             {"role": "assistant", "content": agent_response}]
        )
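`_is_worth_remembering` is referenced above but not shown. A minimal sketch of one possible implementation, assuming pure string heuristics rather than an LLM classifier (names and thresholds here are illustrative, not from the original system):

```python
def is_worth_remembering(user_message: str, agent_response: str) -> bool:
    """Cheap heuristic filter: skip greetings/acknowledgements, keep substantive Q&A.

    A production version might instead ask a small model to classify the exchange.
    """
    smalltalk = {"thanks", "thank you", "ok", "okay", "hi", "hello", "bye", "got it"}
    msg = user_message.strip().lower().rstrip("!.?")
    if msg in smalltalk:
        return False
    # Very short exchanges rarely carry reusable facts
    if len(user_message) < 10 or len(agent_response) < 40:
        return False
    return True
```

Even a crude filter like this is what cut the vector store from 50,000 records to 8,000 in the pitfalls section below.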

Lessons from the Field

Production Data

Here's a comparison after adding the memory layer to my 8-Agent community system:

| Metric | Without Memory | With Memory | Change |
|---|---|---|---|
| Q&A satisfaction | 72% | 89% | +17% |
| Response quality for repeated questions | 6.8/10 | 9.6/10 | +41% |
| Average response tokens | 380 | 290 | -24% |
| Cost per message | $0.014 | $0.016 | +14% |
| User return rate | 34% | 52% | +53% |

The incremental cost of the memory layer (+14%) is far outweighed by the value it delivers (satisfaction +17%, return rate +53%). And response tokens actually went down, because with memory the Agent no longer needs to explain everything from scratch.

Cost Breakdown

| Component | Monthly Cost |
|---|---|
| Embedding generation (text-embedding-3-small) | $1.80 |
| Qdrant Cloud (long-term memory) | $16.00 |
| PostgreSQL (episodic memory) | $5.00 |
| Short-term memory compression (GPT-4.1-mini) | $2.40 |
| Profile extraction (GPT-4.1-mini) | $3.60 |
| Total | $28.80 |

Pitfalls I Encountered

Pitfall 1: Storing everything. Initially I wrote every line of conversation to long-term memory. Three weeks later, the vector store had 50,000 records and retrieval quality fell off a cliff. Solution: add an _is_worth_remembering filter that only stores content with substantive information. Record count dropped to 8,000 and retrieval accuracy returned to 90%+.

Pitfall 2: Compression loses information. A user said "that price I mentioned earlier," but that message had already been compressed into a summary and the details were gone. Solution: during compression, apply special handling for messages containing numbers, prices, dates, and other key data — preserve the original text.
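One way to implement that exemption (a hedged sketch; the patterns are illustrative, not exhaustive): flag messages containing currency amounts, decimals, or dates, and keep them verbatim instead of folding them into the summary.

```python
import re

# Content that should survive compression verbatim: currency amounts,
# decimal numbers, and common date formats (ISO and slash-separated)
KEY_DATA = re.compile(
    r"[$€£¥]\s?\d"              # currency: $129, € 40
    r"|\d+[.,]\d+"              # decimals: 129.99
    r"|\b\d{4}-\d{2}-\d{2}\b"   # ISO dates: 2025-03-14
    r"|\b\d{1,2}/\d{1,2}(?:/\d{2,4})?\b"  # 3/14 or 3/14/2025
)

def must_keep_verbatim(message: str) -> bool:
    """True if the message carries numeric details a summary would likely lose."""
    return bool(KEY_DATA.search(message))
```

During `_compress`, messages matching this check would be appended to the summary block as raw text rather than paraphrased.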

Pitfall 3: Delayed profile updates. Profile updates were asynchronous, and occasionally the Agent used a stale profile within the same conversation. Solution: make profile updates synchronous (update before generating the reply). This adds about 200ms of latency but ensures consistency.

Takeaways

Three key takeaways:

  1. All three memory layers are essential — Short-term memory ensures conversational coherence, long-term memory accumulates knowledge, and episodic memory understands the user. An Agent with only short-term memory is a goldfish; one with only long-term memory doesn't recognize its users.
  2. Forgetting is just as important as remembering — A memory system without forgetting becomes a junkyard. Regularly prune low-value memories to maintain retrieval quality.
  3. The ROI on a memory layer is exceptional — $28.80/month buys you a +17% satisfaction boost and +53% return rate. If your Agent doesn't have a memory layer yet, this should be your top optimization priority.

If you're building an Agent system, start with short-term memory (simplest — 2 hours to implement), then add long-term memory (half a day), and finally episodic memory (1–2 days). Follow this sequence, and you'll see measurable improvements at every step.

Does your Agent have memory? What storage solution are you using? Come share at the Solo Unicorn Club.