AI/ML · RAG · Architecture

RAG Systems: Building Intelligent Knowledge Retrieval Across 4 Projects

From podcast transcript retrieval to animal advocacy knowledge bases — building production RAG systems with dual vector storage, semantic search, and conversational interfaces.

4 RAG systems built · 2 vector stores (Pinecone + pgvector) · Semantic search · Real-time retrieval · OpenAI embeddings

What is RAG (and why it matters)

RAG solves the core problem with LLMs: they don't know your data. By retrieving relevant context before generating, you get answers grounded in your actual knowledge base — not hallucinated plausibility.

The pattern is simple in concept and surprisingly tricky in production. You embed your documents into a vector space, store those vectors, and at query time you find the closest vectors to your question, pull those chunks as context, and pass them to a language model to synthesize an answer. The model doesn't need to memorize anything — it reads from its retrieved context the same way a human would skim a document before answering a question.
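
The retrieval half of that loop is just nearest-neighbor search over vectors. A minimal sketch with numpy, using toy 3-dimensional vectors in place of real embeddings:

```python
import numpy as np

def cosine_top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    """Indices of the k document vectors closest to the query by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return np.argsort(d @ q)[::-1][:k]  # highest similarity first

# Toy 3-dimensional "embeddings" for four chunks.
docs = np.array([[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.8, 0.2, 0.1], [0.0, 0.1, 0.9]])
print(cosine_top_k(np.array([1.0, 0.0, 0.0]), docs, k=2))  # -> [0 2]
```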

Where it gets hard: chunking strategy, embedding model selection, context window management, and evaluation. Each of these can quietly degrade retrieval quality without obvious error signals. Building four RAG systems across very different domains forced me to understand each of these failure modes concretely, not just theoretically.

OpenAI Embeddings · Pinecone · Supabase pgvector · Python · LangChain · FastAPI

Systems I've Built

Four production RAG systems across animal welfare, media, civic tech, and personal productivity. Each had a different knowledge structure, a different user interaction model, and a different set of constraints.

🐾

AfA Resource Chatbot (Open Paws)

Animal advocacy knowledge base for the Animals, Food, and Agriculture (AfA) working group. OpenAI embeddings over a curated corpus of advocacy resources, policy documents, and research papers. Conversational Q&A interface that cites sources and acknowledges when information is absent. Built to let advocates find resources they didn't know existed without manually searching 200+ documents.
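
The "cite sources, admit absence" behavior is mostly prompt construction over the retrieved chunks. A sketch of that pattern; the wording and field names are illustrative, not the deployed prompt:

```python
# Grounding prompt pattern: answer only from context, cite source titles,
# and say the corpus lacks an answer rather than guess. Wording is illustrative.
SYSTEM_PROMPT = (
    "You are an assistant for animal advocacy resources. Answer ONLY from the "
    "provided context and cite the source title for each claim. If the context "
    "does not contain the answer, say so plainly instead of guessing."
)

def build_prompt(question: str, chunks: list[dict]) -> str:
    context = "\n\n".join(f"[{c['title']}] {c['text']}" for c in chunks)
    return f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {question}"
```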

🎙

Podcast Transcript Retrieval

Semantic search across a large corpus of audio transcripts — enabling search by concept rather than keyword. Multi-source ingestion pipeline: transcripts from multiple podcasters, normalized into a uniform chunk format. Supports queries like "what did he say about AI risk in 2023" and returns the closest relevant transcript passages with timestamps.
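
The uniform chunk format isn't specified above, so this is a hypothetical shape, but it shows the fields multi-source transcript retrieval needs: timestamps for citing passages, a publish date for year-scoped queries like "in 2023":

```python
from dataclasses import dataclass

@dataclass
class TranscriptChunk:
    """Hypothetical normalized chunk for multi-source transcript ingestion."""
    text: str          # the transcript passage itself
    podcast: str       # source show, normalized across ingestion formats
    episode_id: str
    start_sec: float   # timestamp of the passage start, returned with results
    end_sec: float
    published: str     # ISO date; enables year-scoped filtering

chunk = TranscriptChunk(
    text="...so the real question about AI risk is...",
    podcast="example-show", episode_id="ep-142",
    start_sec=1834.2, end_sec=1891.7, published="2023-06-01",
)
```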

📜

Open Permit Legal Framework Search

40 legal frameworks made semantically searchable by permit type and jurisdiction. The RAG layer sits between the user's permit document and Gemini's letter generation — it retrieves the relevant legal framework before the generation step, grounding the output in the correct jurisdiction's statutes rather than Gemini's training data.
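
A sketch of that seam, assuming the google-generativeai client and a hypothetical retrieve callable standing in for the vector search; the model name and prompt wording are mine:

```python
import google.generativeai as genai  # assumes the google-generativeai package

# genai.configure(api_key=...) once at startup.

def draft_letter(permit_text: str, jurisdiction: str, retrieve) -> str:
    """retrieve(query, jurisdiction, k) -> list[str] is the framework search."""
    statutes = retrieve(permit_text, jurisdiction, k=3)   # retrieve FIRST
    context = "\n\n".join(statutes)
    prompt = (
        f"Using ONLY these {jurisdiction} legal frameworks:\n{context}\n\n"
        f"Draft a response letter for this permit document:\n{permit_text}"
    )
    model = genai.GenerativeModel("gemini-1.5-pro")       # model name assumed
    return model.generate_content(prompt).text            # generation grounded above
```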

🧠

Personal Knowledge Graph

Cross-project context retrieval for my own development workflow. Notes, architecture decisions, prior Claude Code sessions, and project-specific constraints embedded and retrievable. The problem this solves: starting a new session on an existing project and needing to re-explain context that I've already documented somewhere. RAG surfaces it instead.

Core Architecture

Every RAG system I've built follows the same fundamental pipeline, with variations at each stage based on the domain's constraints.

RAG Pipeline
1. Documents     →  Raw source material (PDFs, URLs, transcripts, markdown)
2. Chunking      →  Split into overlapping segments (512-1024 tokens typical)
3. Embedding     →  OpenAI text-embedding-3-small or -large → float vectors
4. Vector Store  →  Upsert to Pinecone or Supabase pgvector with metadata
                                    ↓
5. Query         →  User question → embed with same model
6. Retrieve      →  Cosine similarity search → top-k chunks
7. Augment       →  Inject retrieved chunks into LLM prompt as context
8. Generate      →  LLM synthesizes answer grounded in retrieved context
9. Cite          →  Return source metadata alongside the answer

The most consequential decisions happen at stages 2 and 6. Chunking strategy determines what information units are retrievable. Retrieval k-value and reranking strategy determine what the model actually sees. Both are domain-specific — what works for legal documents doesn't work the same way for podcast transcripts.
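
Condensed into code, stages 3 through 8 look roughly like this. A sketch, assuming the openai Python client, with an in-memory numpy array standing in for Pinecone/pgvector and gpt-4o-mini as a stand-in generation model:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

chunks = ["First chunked passage...", "Second chunked passage..."]  # stage 2 output
store = embed(chunks)                                               # stages 3-4

def answer(question: str, k: int = 3) -> str:
    q = embed([question])[0]                                        # stage 5
    sims = (store @ q) / (np.linalg.norm(store, axis=1) * np.linalg.norm(q))
    context = "\n\n".join(chunks[i] for i in np.argsort(sims)[::-1][:k])  # stage 6
    resp = client.chat.completions.create(                          # stages 7-8
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Answer from this context only:\n{context}\n\nQ: {question}"}],
    )
    return resp.choices[0].message.content
```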

Dual Vector Storage Strategy

I use both Pinecone and Supabase pgvector across my RAG projects — not as redundancy, but because they serve genuinely different use cases. Using both was a deliberate architectural decision, not indecision.

Pinecone

Cloud-managed vector database. Fast cold starts with no infrastructure to manage. Production-ready with SLA guarantees. Best choice when vector search is the primary operation and you want to stay out of the infrastructure layer. Used for Open Permit legal framework search and podcast retrieval where performance matters most.
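
The working shape, assuming the current Pinecone Python client (v3+ API); the index name and metadata fields are illustrative:

```python
from pinecone import Pinecone

pc = Pinecone(api_key="...")
index = pc.Index("legal-frameworks")        # index name illustrative

embedding = [0.0] * 1536                    # stand-in: vector from the embed stage
query_embedding = [0.0] * 1536              # stand-in: the embedded user query

# Upsert: id, vector, and metadata travel together.
index.upsert(vectors=[{
    "id": "framework-001",
    "values": embedding,
    "metadata": {"jurisdiction": "CA", "permit_type": "solar"},
}])

# Metadata filters narrow the search without SQL.
results = index.query(vector=query_embedding, top_k=5,
                      filter={"jurisdiction": "CA"}, include_metadata=True)
```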

Supabase pgvector

PostgreSQL extension for vector similarity search. SQL joins with structured data — combine semantic search with exact-match filters in a single query. Self-hostable and cost-effective at medium scale. Used for AfA Resource Chatbot where I need to join embedding results with structured metadata stored in the same Supabase instance.
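
This is the pgvector payoff in one query: semantic ranking and an exact-match join in plain SQL. A sketch, assuming psycopg and hypothetical chunks/documents tables:

```python
import psycopg

query_embedding = [0.0] * 1536  # stand-in for the embedded user query
vec_literal = "[" + ",".join(map(str, query_embedding)) + "]"

sql = """
    SELECT d.title, d.url, c.content,
           c.embedding <=> %s::vector AS distance  -- pgvector cosine distance
    FROM chunks c
    JOIN documents d ON d.id = c.document_id       -- structured-metadata join
    WHERE d.topic = %s                             -- exact-match filter
    ORDER BY distance
    LIMIT 5;
"""

with psycopg.connect("postgresql://...") as conn:  # Supabase connection string
    rows = conn.execute(sql, (vec_literal, "policy")).fetchall()
```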

The real question isn't "which vector store is better" — it's "what other data does this vector search need to interact with?" If your use case requires SQL joins or you already have Supabase, pgvector is often the right call. If you need a managed, standalone service with minimal ops overhead, Pinecone wins.

The Challenges

Four RAG systems across different domains produced four different failure modes. These aren't theoretical — each one bit me in production.

✂️

Chunking Strategy

Chunk size is a fundamental retrieval tradeoff. Too small, and individual chunks lack enough context to be meaningful on their own. Too large, and you include irrelevant text that dilutes the signal and eats context window. Overlapping windows help but increase storage cost. Legal documents and conversational transcripts need entirely different strategies.
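
A minimal sliding-window chunker over tokens, assuming tiktoken's cl100k_base encoding (the one the text-embedding-3 models use); the defaults match the 512-1024 range above:

```python
import tiktoken

def chunk_text(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping token windows for embedding."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks, step = [], size - overlap       # each window starts `overlap` tokens early
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + size]))
        if start + size >= len(tokens):     # final window reached the end
            break
    return chunks
```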

💰

Embedding Model Selection

OpenAI's text-embedding-3-large outperforms -small on most benchmarks, but costs about 6.5x more per token ($0.13 vs. $0.02 per million) and produces 3072-dimension vectors that double storage relative to -small's 1536. For a small knowledge base, the quality gain justifies the cost. For a large corpus with frequent re-indexing, -small is often good enough. I default to -small and only upgrade to -large when retrieval quality evaluation shows meaningful gaps.

📏

Context Window Management

More retrieved chunks is not always better. Stuffing 10 chunks into the prompt increases coverage but degrades synthesis quality — the model has more to process and more noise to ignore. I use k=3-5 by default and add reranking on top-10 results to get the most relevant 3-5. Quality over quantity in the context window.
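
One way to implement that retrieve-wide, keep-narrow step; the cross-encoder choice here is my assumption, not necessarily what these systems use:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # model choice assumed

def rerank(query: str, candidates: list[str], keep: int = 4) -> list[str]:
    """Score top-10 vector hits against the query; keep only the best few."""
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return [c for _, c in ranked[:keep]]

# top10 = vector_search(query, k=10)       # stage 6, cast wide
# context = rerank(query, top10)           # what actually enters the prompt
```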

📊

Evaluation

Measuring RAG quality is genuinely hard. Precision at k gives you retrieval quality. Answer faithfulness (does the answer stick to the context?) and answer relevance (does the answer address the question?) require LLM-as-judge or human evaluation. I use a small golden dataset per project and run eval on every significant change to the chunking or retrieval config.
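
The retrieval half of that eval is mechanical once the golden set exists. A sketch, with a hypothetical search function that returns chunk ids:

```python
def precision_at_k(golden: list[tuple[str, set[str]]], search, k: int = 5) -> float:
    """Mean precision@k over (question, relevant_chunk_ids) golden pairs."""
    total = 0.0
    for question, relevant_ids in golden:
        retrieved = search(question, k=k)          # hypothetical: returns chunk ids
        total += sum(1 for cid in retrieved if cid in relevant_ids) / k
    return total / len(golden)

# Run on every chunking/retrieval config change and compare to baseline:
# score = precision_at_k(golden_set, search=retriever)
# assert score >= BASELINE, "retrieval regression"
```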

Key Lesson

RAG is infrastructure work, not just prompt engineering. The quality of your retrieval determines the quality of your generation. You can't prompt-engineer your way out of bad chunking, a mismatched embedding model, or a context window stuffed with irrelevant passages. Get the retrieval right first — generation is easy once the right context is in front of the model.