# How to Connect Your AI Agent to Company Data — A Step-by-Step Guide

## Introduction
I've built the data integration layer for AI agents at three different companies, and I've noticed a pattern: 90% of agent projects stall not because the model isn't good enough, but because the data won't connect. Your agent's capability ceiling isn't the LLM's capability ceiling — it's the quality and breadth of the data it can access. This article breaks down all four integration approaches — direct database connections, API integration, file system crawling, and RAG knowledge bases — so you can see exactly how each one works.
## The Problem
A typical enterprise AI agent needs access to these data sources:
- Structured data: Customer tables, order tables, product tables in PostgreSQL / MySQL
- Semi-structured data: Notion documents, Confluence pages, Google Docs
- Unstructured data: PDF contracts, Slack chat logs, email attachments
- Real-time data: Inventory levels, pricing data, user behavior from live APIs
Most tutorials only cover RAG (chunk documents -> vectorize -> retrieve), but in production, RAG is just one of four data integration approaches. Blindly shoving everything into RAG degrades retrieval quality and sends costs through the roof.
## Core Architecture

### Design Principles
- Match the approach to the data type: Use SQL queries for structured data — don't force it into RAG
- Least privilege: The agent should only access the data it needs, via read-only accounts with row-level permissions
- Cache first: Cache frequently queried data locally to reduce latency and cost
### Four Integration Approaches
```
                  +------------------+
                  |     AI Agent     |
                  |   (Claude API)   |
                  +--------+---------+
                           |
        +------------------+------------------+
        |                  |                  |
+-------v------+  +--------v-------+  +-------v-------+
|  SQL Query   |  | API Integration|  | RAG Retrieval |
| (Structured) |  |  (Real-time)   |  | (Unstructured)|
+-------+------+  +--------+-------+  +-------+-------+
        |                  |                  |
+-------v------+  +--------v-------+  +-------v-------+
|  PostgreSQL  |  | External APIs  |  |  Vector DB    |
|    MySQL     |  | (Stripe etc.)  |  |   (Qdrant)    |
+--------------+  +----------------+  +---------------+
```
## Implementation Details

### Approach 1: Direct SQL Connection (Structured Data)
This is the most straightforward approach, and also the easiest to get wrong. The core challenge: how to get an LLM to generate SQL queries that are both safe and correct.
```python
import anthropic
import psycopg2
from typing import Any

# Database connection (using a read-only account)
def get_db_connection():
    return psycopg2.connect(
        host="localhost",
        database="company_db",
        user="agent_readonly",  # Read-only permissions
        password="****"
    )

# Get database schema info (provide context to the agent)
def get_schema_context() -> str:
    conn = get_db_connection()
    cursor = conn.cursor()
    cursor.execute("""
        SELECT table_name, column_name, data_type
        FROM information_schema.columns
        WHERE table_schema = 'public'
        ORDER BY table_name, ordinal_position
    """)
    schema = cursor.fetchall()
    conn.close()

    # Format into agent-readable text
    schema_text = "Database table structure:\n"
    current_table = ""
    for table, column, dtype in schema:
        if table != current_table:
            schema_text += f"\nTable: {table}\n"
            current_table = table
        schema_text += f"  - {column} ({dtype})\n"
    return schema_text

# Agent generates SQL and executes it
def query_with_agent(question: str) -> dict[str, Any]:
    client = anthropic.Anthropic()
    schema = get_schema_context()

    # Step 1: Have the agent generate SQL
    response = client.messages.create(
        model="claude-sonnet-4-5-20250514",
        max_tokens=1024,
        system=f"""You are a SQL generator.
Generate a PostgreSQL query based on the user's question and the database schema.
Output only the SQL statement, no explanations.
You may only generate SELECT statements — INSERT/UPDATE/DELETE are forbidden.

{schema}""",
        messages=[{"role": "user", "content": question}]
    )
    sql = response.content[0].text.strip()

    # Safety check: only allow SELECT
    if not sql.upper().startswith("SELECT"):
        raise ValueError(f"Illegal SQL operation: {sql[:50]}")

    # Step 2: Execute the query
    conn = get_db_connection()
    cursor = conn.cursor()
    cursor.execute(sql)
    results = cursor.fetchall()
    columns = [desc[0] for desc in cursor.description]
    conn.close()
    return {"sql": sql, "columns": columns, "rows": results}
```
Critical security measures:
- The database account only has SELECT permissions
- An additional SQL keyword whitelist check at the code level
- Query timeout configured (to prevent slow full-table scans from generated queries)
- Row limit enforced (LIMIT 100 appended to generated queries)
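Taken together, those checks can be sketched as a single guard function. The keyword list and helper name here are illustrative, and the query timeout itself belongs on the connection (e.g. PostgreSQL's `statement_timeout`), not in this function:

```python
import re

# Illustrative denylist; a production setup would pair this with DB-level permissions
FORBIDDEN = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|TRUNCATE|GRANT|CREATE)\b",
    re.IGNORECASE,
)

def guard_sql(sql: str, max_rows: int = 100) -> str:
    """Reject anything but a single SELECT and enforce a row limit."""
    stmt = sql.strip().rstrip(";")
    if not stmt.upper().startswith("SELECT"):
        raise ValueError("Only SELECT statements are allowed")
    if ";" in stmt:
        raise ValueError("Multiple statements are not allowed")
    if FORBIDDEN.search(stmt):
        raise ValueError("Forbidden SQL keyword detected")
    if not re.search(r"\bLIMIT\s+\d+\b", stmt, re.IGNORECASE):
        stmt += f" LIMIT {max_rows}"  # row limit enforced on generated queries
    return stmt
```

The code-level check is defense in depth, not the primary control: even if the guard misses something, the read-only account means the worst case is a slow query, which the timeout then bounds.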
### Approach 2: API Integration (Real-Time Data)
The core of API integration is Tool Use. The Claude API natively supports tool calling — you define your tools' schemas, and the model decides when to invoke them:
```python
import os
import anthropic
import httpx

client = anthropic.Anthropic()
CRM_API_KEY = os.environ["CRM_API_KEY"]
ORDER_API_KEY = os.environ["ORDER_API_KEY"]

# Define the tools available to the agent
tools = [
    {
        "name": "get_customer_info",
        "description": "Retrieve detailed customer information from the CRM",
        "input_schema": {
            "type": "object",
            "properties": {
                "customer_id": {
                    "type": "string",
                    "description": "Customer ID"
                }
            },
            "required": ["customer_id"]
        }
    },
    {
        "name": "get_order_status",
        "description": "Check order status",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "Order number"
                }
            },
            "required": ["order_id"]
        }
    }
]

# Tool execution functions
def execute_tool(name: str, params: dict) -> str:
    if name == "get_customer_info":
        resp = httpx.get(
            f"https://api.crm.com/v1/customers/{params['customer_id']}",
            headers={"Authorization": f"Bearer {CRM_API_KEY}"}
        )
        return resp.text
    elif name == "get_order_status":
        resp = httpx.get(
            f"https://api.orders.com/v1/orders/{params['order_id']}",
            headers={"Authorization": f"Bearer {ORDER_API_KEY}"}
        )
        return resp.text
    raise ValueError(f"Unknown tool: {name}")

# Agent conversation loop (with tool use)
def agent_chat(user_message: str):
    messages = [{"role": "user", "content": user_message}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-5-20250514",
            max_tokens=4096,
            tools=tools,
            messages=messages
        )
        # If the model decides to call a tool
        if response.stop_reason == "tool_use":
            tool_block = next(
                b for b in response.content if b.type == "tool_use"
            )
            result = execute_tool(tool_block.name, tool_block.input)
            # Feed the tool result back to the model
            messages.append({"role": "assistant", "content": response.content})
            messages.append({
                "role": "user",
                "content": [{
                    "type": "tool_result",
                    "tool_use_id": tool_block.id,
                    "content": result
                }]
            })
        else:
            # Model responds directly — return the result
            return response.content[0].text
```
### Approach 3: RAG Knowledge Base (Unstructured Data)
RAG is the right tool for PDFs, documents, and knowledge bases — unstructured data in general. Best practices in 2026 look quite different from two years ago, with the key improvements being in chunking strategy and hybrid retrieval.
```python
import os
import uuid
import anthropic
import httpx
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

# Initialize Qdrant vector database
qdrant = QdrantClient(url="http://localhost:6333")
claude = anthropic.Anthropic()
VOYAGE_API_KEY = os.environ["VOYAGE_API_KEY"]

# Create the collection once (skip if it already exists)
if not qdrant.collection_exists("company_docs"):
    qdrant.create_collection(
        collection_name="company_docs",
        vectors_config=VectorParams(
            size=1024,  # Embedding dimensions
            distance=Distance.COSINE
        )
    )

# Document chunking (semantic chunking, not fixed-length)
def semantic_chunk(text: str, max_tokens: int = 400) -> list[str]:
    """Chunk by paragraph and semantic boundaries, not fixed character count."""
    paragraphs = text.split("\n\n")
    chunks = []
    current_chunk = ""
    for para in paragraphs:
        # Estimate token count (~1.5 chars/token for Chinese, ~4 chars/token for English)
        estimated_tokens = len(current_chunk) / 1.5
        if estimated_tokens + len(para) / 1.5 > max_tokens:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = para
        else:
            current_chunk += "\n\n" + para
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks

# Generate embeddings (using Voyage AI or OpenAI)
def get_embedding(text: str) -> list[float]:
    """Call the embedding API to get a vector."""
    resp = httpx.post(
        "https://api.voyageai.com/v1/embeddings",
        headers={"Authorization": f"Bearer {VOYAGE_API_KEY}"},
        json={
            "model": "voyage-3-large",
            "input": [text],
            "input_type": "document"
        }
    )
    return resp.json()["data"][0]["embedding"]

# Index a document
def index_document(doc_path: str, doc_text: str):
    chunks = semantic_chunk(doc_text)
    points = []
    for i, chunk in enumerate(chunks):
        embedding = get_embedding(chunk)
        # Qdrant point IDs must be integers or UUIDs, so derive a stable UUID
        point_id = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{doc_path}:{i}"))
        points.append(PointStruct(
            id=point_id,
            vector=embedding,
            payload={
                "text": chunk,
                "source": doc_path,
                "chunk_index": i
            }
        ))
    qdrant.upsert(collection_name="company_docs", points=points)

# RAG retrieval + generation
def rag_query(question: str) -> str:
    # Retrieve relevant documents
    query_embedding = get_embedding(question)
    results = qdrant.search(
        collection_name="company_docs",
        query_vector=query_embedding,
        limit=5  # Return top 5
    )

    # Assemble context
    context = "\n\n---\n\n".join(
        f"Source: {r.payload['source']}\n{r.payload['text']}"
        for r in results
    )

    # Call Claude to generate the answer
    response = claude.messages.create(
        model="claude-sonnet-4-5-20250514",
        max_tokens=2048,
        system="""Answer questions based on the provided document content.
If the documents do not contain relevant information, explicitly state "No relevant information found in the documents."
Cite specific sources in your answer.""",
        messages=[{
            "role": "user",
            "content": f"Document content:\n{context}\n\nQuestion: {question}"
        }]
    )
    return response.content[0].text
```
2026 RAG Best Practices:
- Chunk size: 300-500 tokens is the sweet spot. Too small and you lose context; too large and you introduce noise
- Embedding model selection: Voyage AI's voyage-3-large performs excellently in mixed Chinese-English scenarios, with 1024 dimensions
- Hybrid retrieval: Combining keyword search (BM25) with vector search improves recall by roughly 15%
- Rebuild indexes regularly: After source documents are updated, you must re-chunk and re-embed — otherwise retrieval quality degrades over time
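The hybrid retrieval bullet can be sketched with Reciprocal Rank Fusion, a common way to merge a BM25 ranking with a vector ranking without having to calibrate the two score scales against each other. The function name and inputs here are illustrative:

```python
def rrf_merge(
    keyword_ranked: list[str],
    vector_ranked: list[str],
    k: int = 60,
) -> list[str]:
    """Reciprocal Rank Fusion: merge two rankings of chunk IDs into one."""
    scores: dict[str, float] = {}
    for ranking in (keyword_ranked, vector_ranked):
        for rank, chunk_id in enumerate(ranking):
            # A chunk found by both retrievers accumulates score from each list
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# "b" ranks near the top of both lists, so it wins overall
rrf_merge(["a", "b"], ["b", "c"])  # → ["b", "a", "c"]
```

Because RRF only uses ranks, you can run the BM25 query against PostgreSQL or Elasticsearch and the vector query against Qdrant, then fuse the results in a few lines.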
### Approach 4: File System Crawling
For local files (product manuals as PDFs, contract documents), you can use LlamaIndex or build your own crawler:
```python
from pathlib import Path

# Supported file formats
SUPPORTED_EXTENSIONS = {".pdf", ".md", ".txt", ".docx", ".csv"}

def crawl_directory(root_dir: str) -> list[dict]:
    """Recursively crawl all supported files in a directory."""
    documents = []
    root = Path(root_dir)
    for file_path in root.rglob("*"):
        if file_path.suffix.lower() in SUPPORTED_EXTENSIONS:
            text = extract_text(file_path)  # Call the appropriate parser by format
            documents.append({
                "path": str(file_path),
                "text": text,
                "modified": file_path.stat().st_mtime
            })
    return documents
```
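Since each crawled document carries its mtime, a nightly job only needs to re-index files that actually changed. This helper is a sketch under the assumption that the last indexed mtimes are persisted somewhere (here, a plain dict):

```python
def needs_reindex(
    documents: list[dict],
    indexed_mtimes: dict[str, float],
) -> list[dict]:
    """Return only the documents that are new or modified since the last index run."""
    return [
        doc for doc in documents
        if doc["modified"] > indexed_mtimes.get(doc["path"], 0.0)
    ]

docs = [
    {"path": "a.md", "text": "...", "modified": 200.0},
    {"path": "b.md", "text": "...", "modified": 100.0},
]
# b.md is unchanged since it was indexed, so only a.md comes back
needs_reindex(docs, {"a.md": 150.0, "b.md": 100.0})
```

Feeding only the changed documents into re-chunking and re-embedding keeps the nightly rebuild cheap even as the corpus grows.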
## Real-World Results

### Production Data
On a customer service agent project for an e-commerce company, I used SQL direct connection + API integration + RAG simultaneously:
| Data Source | Approach | Query Latency | Monthly Cost |
|---|---|---|---|
| Order database | SQL direct connection | 120ms | $0 (own infrastructure) |
| Logistics API | Tool Use | 350ms | $50 (API call fees) |
| Product documentation | RAG (Qdrant) | 280ms | $30 (embedding + storage) |
| Customer service scripts | RAG (Qdrant) | 250ms | Included above |
Total RAG knowledge base: 1,200 documents, 42,000 chunks, vector dimension 1024.
### Pitfalls I Hit
Pitfall 1: Don't run structured data through RAG. I initially embedded order data into the vector store. For precise queries like "User A's last 5 orders," RAG's recall was under 60%. Switching to a direct SQL connection brought accuracy to 100%.
Pitfall 2: Choose your embedding model and LLM separately. Don't assume that because you're using Claude for generation, you should also use Claude for embeddings. Dedicated embedding models (like Voyage AI) significantly outperform general-purpose LLMs on retrieval tasks.
Pitfall 3: Forgetting to rebuild the index after document updates. Product prices got updated, but the vector store still had the old data. The agent answered customers with outdated prices, causing complaints. The fix: set up a cron job to re-crawl and re-index every night.
## Takeaways
Three core takeaways:

1. Match your integration approach to your data type — use SQL for structured data, API Tool Use for real-time data, and RAG for unstructured documents. Don't shove everything into RAG.
2. Security first — read-only database accounts, least-privilege API access, sanitized query results. The scope of operations an agent can perform determines the blast radius when things go wrong.
3. RAG is not a silver bullet — it solves the problem of "finding relevant information in a large corpus of documents," but for precise queries, real-time data, and aggregate calculations, SQL and APIs are the better fit.
If you're connecting data to an agent, start with the simplest single data source — say, hooking up one API via Tool Use — validate the end-to-end flow, then expand to multiple data sources.
What data integration challenges have you run into with your agent projects? Let's discuss.