The Core Components
A production RAG system has three moving parts that each need their own attention.
- Retriever: turns a user query into a set of candidate documents, usually via vector similarity search. Modern retrievers also support hybrid search (vector plus keyword) to handle exact-match queries like product SKUs or error codes.
- Reranker: takes the top N candidates from the retriever and reorders them using a more expensive cross-encoder model. Cohere Rerank and BGE Reranker are common choices. Skipping this step is one of the most common reasons RAG quality plateaus.
- Generator: the LLM that takes the reranked context plus the user query and produces an answer. GPT-4.1, Claude Sonnet, and open-source models like Llama 3 and Qwen all work depending on your latency, cost, and privacy requirements.
The naive version of RAG stops at retriever plus generator. Adding a reranker typically lifts answer quality by 15 to 30 percent on internal eval sets, which is why it has become standard in serious deployments.
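The three-stage flow can be sketched end to end. Everything below is a toy: keyword-overlap scoring stands in for vector search, and phrase-match scoring stands in for a cross-encoder; the shape of the pipeline is the point.

```python
# Toy retrieve -> rerank -> generate pipeline. The scoring heuristics are
# stand-ins: a real system uses vector similarity for retrieval and a
# cross-encoder (e.g. Cohere Rerank, BGE Reranker) for reranking.

def retrieve(query: str, corpus: list[str], top_n: int = 20) -> list[str]:
    """Cheap, recall-oriented stage: score every doc by keyword overlap."""
    terms = set(query.lower().split())
    return sorted(corpus,
                  key=lambda d: len(terms & set(d.lower().split())),
                  reverse=True)[:top_n]

def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    """Expensive, precision-oriented stage: score each (query, doc) pair jointly."""
    def score(doc: str) -> float:
        overlap = len(set(query.lower().split()) & set(doc.lower().split()))
        return overlap + (1.0 if query.lower() in doc.lower() else 0.0)
    return sorted(candidates, key=score, reverse=True)[:top_k]

def generate(query: str, context: list[str]) -> str:
    """Stand-in for the LLM call: just assemble the prompt it would receive."""
    return "\n".join(context) + f"\n\nQuestion: {query}"

corpus = [
    "Error code E404 means the requested resource was not found.",
    "Refunds are available within 30 days of purchase.",
    "SKU A-1001 is the blue widget, restocked weekly.",
]
query = "what does error code E404 mean"
prompt = generate(query, rerank(query, retrieve(query, corpus)))
```

Swapping any one stage for a real implementation leaves the other two untouched, which is why the three components can be tuned independently.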
Chunking Strategies
Chunking is how you split source documents into pieces that can be embedded and retrieved. Bad chunking is the single biggest cause of bad RAG output, and it is the part most teams underinvest in.
- Fixed size: split by token count with some overlap. Fast to implement, works for uniform content like blog posts and docs. Breaks down on structured content like tables, code, or legal clauses.
- Semantic chunking: split at natural boundaries like paragraphs or sentences, often using an embedding model to detect topic shifts. Better quality, more expensive to run.
- Hierarchical chunking: store multiple levels (small chunks for precise retrieval, larger parent chunks for context). Retrieval hits the small chunks, but the generator sees the parent. This is the pattern most production systems converge on.
The chunks you retrieve are the ceiling on your answer quality. No reranker or prompt engineering can recover from chunks that do not contain the answer.
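Both patterns are small enough to sketch. This splits on words as a stand-in for token counting (a real pipeline would count tokens with a tokenizer matching the embedding model), and layers a minimal hierarchical index on top.

```python
def chunk_fixed(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Fixed-size chunking with overlap, counting words as a proxy for tokens."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def chunk_hierarchical(paragraphs: list[str], child_size: int = 50,
                       child_overlap: int = 10) -> list[dict]:
    """Hierarchical chunking: embed and retrieve the small child chunks,
    but hand the generator the full parent paragraph they came from."""
    index = []
    for parent_id, para in enumerate(paragraphs):
        for child in chunk_fixed(para, size=child_size, overlap=child_overlap):
            index.append({"child": child, "parent_id": parent_id})
    return index
```

At query time, retrieval matches against `child` text, then the generator is given the parent paragraph that `parent_id` points to.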
Embedding Models
Your embedding model determines how well semantic similarity maps to actual relevance for your domain. The market has matured significantly in the last two years.
- OpenAI text-embedding-3: strong general-purpose choice, available in small and large variants with configurable dimensions
- Cohere embed-v3: competitive quality with strong multilingual support and a separate reranker from the same vendor
- BGE (open source): from BAAI, consistently near the top of the MTEB leaderboard, self-hostable
- Jina embeddings: strong long-context support, good for document-heavy use cases
- Voyage AI: often leads domain-specific benchmarks, including code and legal
For most teams, starting with OpenAI or Cohere and swapping to open-source BGE if you need on-prem or cost control is a reasonable path. The key is having an eval set so you can actually measure the difference.
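That measurement can be made concrete: treat the embedding model as a pluggable function and compute recall@k on your own questions. The `toy_embed` below is a letter-frequency stand-in for a real model; swapping in an API client for any of the vendors above is the only change needed to compare them.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def recall_at_k(embed, corpus: dict[str, str],
                eval_set: list[tuple[str, str]], k: int = 5) -> float:
    """Fraction of queries whose known-relevant doc lands in the top k.
    `embed` is any text -> vector callable, so models can be compared directly."""
    vecs = {doc_id: embed(text) for doc_id, text in corpus.items()}
    hits = 0
    for query, relevant_id in eval_set:
        qv = embed(query)
        top = sorted(vecs, key=lambda d: cosine(qv, vecs[d]), reverse=True)[:k]
        hits += relevant_id in top
    return hits / len(eval_set)

# Toy stand-in for a real embedding model: letter-frequency vectors.
def toy_embed(text: str) -> list[float]:
    return [float(text.lower().count(c)) for c in "abcdefghijklmnopqrstuvwxyz"]
```

One number per model on the same eval set turns "which embedding model?" from a debate into a lookup.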
Vector Databases
The vector database stores your embeddings and runs similarity search at query time. The right choice depends on scale, existing infrastructure, and operational tolerance.
- Pinecone: fully managed, low-ops, strong at scale. Default choice when you do not want to run infrastructure. Learn more on the Pinecone skill page.
- Weaviate: open source with hybrid search built in, strong modular architecture, available managed or self-hosted
- pgvector: Postgres extension, lets you keep vectors alongside relational data. Great for teams already on Postgres and operating at moderate scale.
- Chroma: lightweight, developer-friendly, ideal for local development and smaller production workloads. See the ChromaDB skill page for more.
- Qdrant: open source, Rust-based, strong performance and filtering. Growing quickly in production use.
A reasonable pattern: start on Chroma or pgvector during development, move to Pinecone or Qdrant when you hit production scale and need managed operations or tight filtering.
Framework Choice
You do not strictly need a framework, but most teams use one to avoid rebuilding standard patterns.
- LlamaIndex: retrieval-first, with strong abstractions for indexing, chunking, and query pipelines. Cleaner for pure RAG workloads.
- LangChain: agent-integrated, broader scope, more moving parts. Better fit when RAG is one piece of a larger agentic system. See the LangChain skill page for more context.
Both are Python-first, both work with every major model provider, and both are fine choices. The most common mistake is adopting a framework without understanding what it abstracts and then fighting it when you need custom behavior.
Evaluation
You cannot improve what you cannot measure. Eval is the difference between a demo that works in the meeting and a system that works in production.
- RAGAS: open-source framework for RAG-specific metrics including faithfulness, answer relevance, context precision, and context recall
- TruLens: strong for production traces and monitoring, tracks feedback signals across requests
- Custom eval sets: 50 to 200 real questions from your domain with expected answers, graded manually or with an LLM judge
A minimal eval loop: start with 30 real questions and expected answers, run your system, grade the output, fix the biggest issue, repeat. Teams that do this ship RAG that works. Teams that skip it ship demos.
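That loop is small enough to sketch. The grader here is a keyword check standing in for a human or LLM judge; the structure (run, grade, collect failures, fix the worst one) is what matters.

```python
def run_eval(rag_answer, eval_set: list[dict]) -> tuple[float, list[tuple[str, str]]]:
    """rag_answer: question -> answer string.
    eval_set: [{"question": ..., "must_mention": ...}, ...].
    Returns (pass rate, failing (question, answer) pairs to triage)."""
    failures = []
    for case in eval_set:
        answer = rag_answer(case["question"])
        # Keyword grading is a crude stand-in for an LLM judge or human grader.
        if case["must_mention"].lower() not in answer.lower():
            failures.append((case["question"], answer))
    return 1 - len(failures) / len(eval_set), failures
```

Run it before and after every change; if the pass rate drops, the change regressed something your demo queries did not cover.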
Common Pitfalls
Most failed RAG projects share a small set of issues:
- Bad chunking: fixed-size chunks on structured content, no overlap, or chunks that cut across section boundaries
- No reranker: taking the top K from vector search directly into the generator instead of reranking first
- No eval loop: shipping changes without a measurable baseline, which makes regression invisible
- Over-retrieving: stuffing 20 chunks into the prompt in the hope that more context helps. It usually does not.
- Ignoring hybrid search: pure vector search misses exact-match queries like product codes, order numbers, and error strings
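The hybrid-search fix from the last bullet is commonly implemented as Reciprocal Rank Fusion: run vector search and keyword search separately, then merge the two rankings. A standard form, with the conventional k = 60 damping constant:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge ranked doc-id lists (e.g. vector + BM25).
    Each doc scores sum(1 / (k + rank)) over the lists it appears in, so
    agreement between rankings beats a high position in just one."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # semantic matches
keyword_hits = ["doc_c", "doc_a", "doc_d"]  # exact-match hits (SKUs, error codes)
fused = rrf_fuse([vector_hits, keyword_hits])
```

Because RRF works on ranks rather than raw scores, it needs no score calibration between the two retrievers, which is why it is the usual default.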
Hiring for RAG
RAG engineering sits at the intersection of ML and production software. You need both, and the ratio matters.
- RAG engineer: comfortable with embeddings, vector databases, and evaluation loops; also able to ship production code, handle latency budgets, and debug distributed systems. See the RAG engineer role page.
- ML engineer: deeper on embeddings, fine-tuning, and model selection; often owns the embedding pipeline and model evaluation. See the ML engineer role page.
- Data engineer: owns the pipelines that keep source documents flowing into the vector store, handles incremental updates, and manages data freshness
A typical production RAG team has 1 to 2 RAG engineers, 1 ML engineer, and a data engineer shared with the broader platform. Smaller teams combine roles, which works if the engineer you hire genuinely covers both sides. Larger teams add a dedicated evals engineer and an infrastructure specialist for vector database operations.
Key Takeaways
- RAG has three core components: retriever, reranker, and generator. Skipping the reranker is the most common quality limiter.
- Chunking strategy matters more than model choice for most teams. Start with hierarchical chunking.
- Vector database choice depends on scale and ops tolerance: Chroma or pgvector for dev, Pinecone or Qdrant for production.
- Evaluation is not optional. A 30-question eval set beats zero every time.
- RAG engineers need both ML fundamentals and production software skills; a typical team has 1 to 2 RAG engineers plus ML and data support.
Frequently Asked Questions
Do I need a reranker for a small corpus?
If your corpus is under a few hundred documents and queries are simple, you may get away without one. For most production use cases with thousands of documents or nuanced queries, a reranker is worth the latency cost.
Should I fine-tune my embedding model?
Usually not as a first step. Start with a strong off-the-shelf model like text-embedding-3 or BGE, build an eval set, and only fine-tune if you see consistent failures that fine-tuning can address. Most teams never need to.
How much does a production RAG system cost to run?
It depends on scale, but the dominant costs are usually generation (LLM calls) and embedding updates. Vector database costs are often smaller than expected. A mid-sized production RAG can run anywhere from a few hundred to several thousand dollars per month.
What is the difference between RAG and fine-tuning?
RAG retrieves relevant context at query time and feeds it to the model. Fine-tuning adjusts the model weights on your data. RAG is better for factual grounding with changing data; fine-tuning is better for teaching style, format, or narrow task behavior. Most production AI systems use both.
Can I build RAG without a framework?
Yes. A minimal RAG system is under 200 lines of Python: an embedding call, a vector store client, and a generation call. Frameworks help when you add complexity like agents, multi-step retrieval, or evaluation pipelines.
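A sketch of that minimal shape, with `embed` and `llm` injected as callables standing in for real provider SDK calls (the names and signatures here are illustrative, not any particular vendor's API):

```python
import math

def build_index(docs: dict[str, str], embed) -> dict[str, list[float]]:
    """Ingest once: embed every document up front."""
    return {doc_id: embed(text) for doc_id, text in docs.items()}

def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def ask(question: str, docs: dict[str, str], index: dict[str, list[float]],
        embed, llm, k: int = 3) -> str:
    """Query time: embed the question, pull the top-k docs, call the LLM."""
    qv = embed(question)
    top = sorted(index, key=lambda d: _cosine(index[d], qv), reverse=True)[:k]
    context = "\n\n".join(docs[doc_id] for doc_id in top)
    return llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
```

Everything a framework adds (chunking, reranking, caching, tracing) layers on top of this skeleton, which is why it is worth understanding before adopting one.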
Hire RAG Talent with South
South places vetted RAG engineers from Latin America who have shipped production retrieval systems with real eval loops. Start a search and see a shortlist within a week.

