Enterprise RAG Reference Architecture

Context: where enterprise RAG programmes go off the rails

Most enterprise RAG (retrieval-augmented generation) programmes start with the same naive shape: a vector database, a single embedding model, a flat document corpus chunked at 500 tokens, and an LLM prompt that says "Answer the question using only the context below." For a 100-document demo this works. For a 100,000-document production knowledge base it does not.

The failure modes are predictable. Relevance drops because chunk boundaries cut across logical units. The model hallucinates because retrieval surfaces context that LOOKS relevant but doesn't actually answer the question. Latency degrades because the vector index isn't sharded. Cost balloons because embeddings get regenerated on every doc edit. Compliance breaks because the corpus has PII that wasn't filtered at ingestion.

This piece is the reference architecture Dcrayons applies on enterprise RAG engagements in 2026. It covers four areas that public RAG documentation tends to underplay: the vector DB selection logic, the embedding pipeline design, the chunking + reranking discipline, and the eval suite that turns "we tried RAG and it kind of works" into a defensible production system.

Vector DB selection: pgvector vs Pinecone vs Weaviate

The first architectural decision is which vector database to build on. The 2026 landscape:

pgvector (PostgreSQL extension). Right when: the data already lives in Postgres, the corpus is under 10 million vectors, query latency budget is greater than 100 ms, the team prefers one system to operate. Trade-off: hybrid search (vector + keyword) needs careful index design; recall at the top-1000-document scale lags purpose-built vector DBs.

Pinecone. Right when: the corpus exceeds a few million vectors and queries carry strict latency SLAs (less than 50 ms p95), the team wants a fully managed service, multi-region replication is required. Trade-off: cost grows linearly with vector count + query rate; pricing is per-pod rather than per-query at scale.

Weaviate. Right when: hybrid search is core (BM25 + vector together), modular ML modules (rerankers, summarisers) need first-class support, the team is comfortable operating it themselves. Trade-off: more configuration surface; smaller community + integrations vs Pinecone.

Qdrant. Right when: open-source + self-host + simplicity matter, hybrid search is needed but Weaviate feels heavy. Trade-off: smaller ecosystem; some advanced features (multi-tenant isolation, payload filtering) require careful schema design.

Cloud-native (AWS OpenSearch, Azure AI Search, GCP Vertex AI Search). Right when: the customer is committed to a specific cloud and wants vector search inside the same data plane. Trade-off: vendor lock-in; performance and feature parity with specialised vector DBs varies.

The Dcrayons decision rule: default to pgvector if the corpus is under 5 million vectors and the existing stack runs Postgres. Reach for Pinecone when latency + scale outgrow pgvector. Use Weaviate when hybrid search quality matters more than operational simplicity. Cloud-native is a fit only when the customer's existing commitment to that cloud is strong.
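The decision rule above can be written down as a small function, which makes the thresholds explicit and reviewable. This is a minimal sketch, not a definitive implementation: the function name, the parameter names, and the order in which the conditions are checked are assumptions for illustration. The 5 million vector threshold comes from the rule above and the 100 ms latency budget from the pgvector entry.

```python
def choose_vector_db(
    vector_count: int,
    runs_postgres: bool,
    p95_latency_budget_ms: float,
    hybrid_search_is_core: bool = False,
    strong_cloud_commitment: bool = False,
) -> str:
    """Return a starting-point recommendation, not a final verdict.

    The precedence of the checks is an assumption; the article states the
    rules but not which one wins when several apply.
    """
    if strong_cloud_commitment:
        return "cloud-native"
    if hybrid_search_is_core:
        return "weaviate"
    if runs_postgres and vector_count < 5_000_000 and p95_latency_budget_ms > 100:
        return "pgvector"
    # Latency or scale has outgrown the pgvector default.
    return "pinecone"
```

For example, a 2 million vector corpus on an existing Postgres stack with a 200 ms budget resolves to pgvector, while a 20 million vector corpus with a 40 ms budget resolves to Pinecone.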

Embedding pipeline design: ingestion, refresh, governance

Embeddings are the substrate. The pipeline that produces + maintains them is more consequential than the model that consumes them.

Embedding model choice. OpenAI text-embedding-3-large (3072 dimensions) is the default high-quality option in 2026. Voyage AI voyage-3 (1024 dim) is the lighter + cheaper option with strong quality. Cohere embed-english-v3 + embed-multilingual-v3 are the right choice for multilingual corpora (the multilingual model handles Hindi + Arabic + Urdu first-class). For self-hosted: BGE + Nomic embed are the open-source baselines that close most of the quality gap.

Chunking strategy. Fixed-size chunking (500 tokens with 50-token overlap) is the naive starting point. Semantic chunking (split at paragraph or section boundaries) is materially better for editorial content. Recursive chunking (split at the highest-level structure first, then drill down only if the chunk is too long) is the right shape for documents with headings + nested structure (markdown, HTML, Confluence exports).
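Recursive chunking is easier to reason about in code. The sketch below assumes markdown-like input, splits at the coarsest structure first, and only drills down when a piece is still over the limit. The separator list, the function names, and the whitespace-based token count are illustrative assumptions; a real pipeline would count tokens with the embedding model's own tokenizer.

```python
import re

# Separators ordered from coarsest to finest structure (assumed for markdown-like text).
SEPARATORS = [r"\n(?=# )", r"\n(?=## )", r"\n\n", r"(?<=[.!?])\s+"]


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: whitespace-delimited words."""
    return len(text.split())


def recursive_chunk(text: str, max_tokens: int = 500, level: int = 0) -> list[str]:
    """Split at the highest-level structure first; recurse only on oversized pieces."""
    text = text.strip()
    if not text:
        return []
    if count_tokens(text) <= max_tokens:
        return [text]
    if level >= len(SEPARATORS):
        # No structure left to split on: fall back to fixed-size word windows.
        words = text.split()
        return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]
    chunks: list[str] = []
    for part in re.split(SEPARATORS[level], text):
        chunks.extend(recursive_chunk(part, max_tokens, level + 1))
    return chunks
```

The sketch omits overlap between chunks and does not carry the parent heading into child chunks; both are common additions once the basic split is in place.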

For mixed-content corpora (PDF reports + HTML blog posts + Slack threads + Notion docs), the right pipeline routes by source type: PDFs get layout-aware chunking via Unstructured or LlamaParse; HTML gets structure-aware chunking by heading hierarchy; Slack threads get conversation-aware chunking that preserves the speaker context.

Refresh discipline. When a source document changes, the embeddings derived from it must regenerate. The pipeline tracks a content-hash per chunk; ingestion compares the live hash against the index hash and re-embeds only the changed chunks. Without this, the index drifts from reality (the doc was updated but the answer still references the old version) or costs balloon (every edit triggers a full re-embed).
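The hash comparison described above fits in a few lines. This is a minimal sketch assuming chunk IDs are stable across ingestions and that the index stores one hash per chunk; the function names and the choice of SHA-256 are assumptions, not a statement of how any particular vector DB stores this.

```python
import hashlib


def chunk_hash(text: str) -> str:
    """Content hash of a chunk; SHA-256 is an assumed choice."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_refresh(
    live_chunks: dict[str, str],   # chunk_id -> current chunk text
    index_hashes: dict[str, str],  # chunk_id -> hash stored alongside the embedding
) -> tuple[list[str], list[str]]:
    """Return (chunk IDs to re-embed, chunk IDs to delete from the index)."""
    to_embed = [
        cid for cid, text in live_chunks.items()
        if index_hashes.get(cid) != chunk_hash(text)  # new or changed
    ]
    to_delete = [cid for cid in index_hashes if cid not in live_chunks]
    return to_embed, to_delete
```

Unchanged chunks never reach the embedding API, which is what keeps edit-heavy corpora from triggering full re-embeds.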

PII + sensitive content filtering. At ingestion time, every chunk runs through a PII detection step (Presidio + custom rules) before reaching the index. Chunks containing PII get either redacted, tagged with an access-restricted label, or dropped entirely depending on the compliance posture. This is the seam where DPDP + GDPR + sectoral regulations land.
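The three outcomes (redact, tag, drop) can be modelled as a policy applied per chunk. The sketch below deliberately uses two toy regular expressions as a stand-in for the Presidio + custom-rules detector named above, so it runs with the standard library only; the patterns, the `Chunk` class, and the `access_tier` value are illustrative assumptions and are not adequate for real compliance work.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# Toy detectors standing in for a real PII engine (assumption, not production-grade).
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{8,}\d"),
}


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def apply_pii_policy(chunk: Chunk, policy: str) -> Optional[Chunk]:
    """Apply 'redact', 'tag', or 'drop' to a chunk; clean chunks pass through."""
    found = [label for label, pattern in PII_PATTERNS.items() if pattern.search(chunk.text)]
    if not found:
        return chunk
    if policy == "drop":
        return None
    if policy == "tag":
        return Chunk(chunk.text, {**chunk.metadata, "access_tier": "restricted", "pii": found})
    if policy == "redact":
        text = chunk.text
        for label in found:
            text = PII_PATTERNS[label].sub(f"<{label}>", text)
        return Chunk(text, {**chunk.metadata, "pii": found})
    raise ValueError(f"unknown policy: {policy}")
```

Keeping the policy as a parameter means the same ingestion code can serve corpora with different compliance postures.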

Metadata + filter design. Every chunk carries metadata: source document ID, source URL, last-modified timestamp, author, access-tier, language, content-type. Retrieval queries filter on metadata BEFORE running vector similarity (the right pattern for "tenant X gets only tenant-X data") or AFTER vector similarity (right pattern for "lift recent docs"). The architecture decision is locked at index design; changing it later requires re-indexing.
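The difference between the two patterns is easiest to see side by side. The following sketch uses a tiny in-memory list as a stand-in for a vector index; the field names (`tenant`, `last_modified`, `vector`), the function names, and the recency boost value are all assumptions for illustration. A real vector DB would push the pre-filter into the index query rather than filter in application code.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_prefilter(index: list[dict], query_vec: list[float], tenant: str, k: int) -> list[dict]:
    """Filter BEFORE similarity: other tenants' chunks are never scored."""
    allowed = [c for c in index if c["tenant"] == tenant]
    return sorted(allowed, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)[:k]


def search_then_boost_recent(
    index: list[dict], query_vec: list[float], k: int,
    candidates: int = 20, cutoff: str = "2026-01-01", boost: float = 0.1,
) -> list[dict]:
    """Similarity first, then lift recent chunks among the top candidates."""
    top = sorted(index, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)[:candidates]

    def boosted(c: dict) -> float:
        # ISO-8601 date strings compare correctly as plain strings.
        bonus = boost if c["last_modified"] >= cutoff else 0.0
        return cosine(query_vec, c["vector"]) + bonus

    return sorted(top, key=boosted, reverse=True)[:k]
```

Access control belongs in the pre-filter because a post-filter can only hide results after restricted chunks have already been scored, and it can leave fewer than k results.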

Chunking + reranking discipline: the retrieval quality pattern

Vector similarity alone gives mediocre retrieval. Production RAG layers a retrieval cascade.

Stage 1: vector search returns top-K candidates. K is typically 20-100. Vector similarity is fast, but it is not the most discriminating signal.

Stage 2: reranker reorders the candidates. Cross-encoder rerankers (Cohere Rerank, BGE Reranker, Voyage Rerank) score each query-document pair more accurately than the bi-encoder used for vector similarity. They are slower per pair, which is why they run only on the top-K from stage 1, not the whole index.

Stage 3: top-N reranked candidates feed the LLM context. N is typically 5-15. Beyond this the LLM context window fills up + answer quality degrades from too much noise.
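The three stages compose into one function. This is a sketch under stated assumptions: `vector_search` and `rerank_score` are hypothetical callables standing in for the vector DB client and the cross-encoder reranker, and the default `top_k` and `top_n` are simply values inside the ranges given above.

```python
from typing import Callable


def retrieve(
    query: str,
    vector_search: Callable[[str, int], list[str]],  # hypothetical: returns top-K docs
    rerank_score: Callable[[str, str], float],       # hypothetical: scores one (query, doc) pair
    top_k: int = 50,
    top_n: int = 8,
) -> list[str]:
    """Stage 1: cheap recall. Stage 2: precise rerank. Stage 3: keep top-N for the prompt."""
    candidates = vector_search(query, top_k)
    scored = [(rerank_score(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


# Toy stand-ins so the sketch runs end to end; real systems plug in actual clients.
corpus = [
    "reset your password in settings",
    "billing cycle and invoices",
    "password policy for admins",
    "office locations",
]


def toy_vector_search(query: str, k: int) -> list[str]:
    return corpus[:k]


def toy_rerank_score(query: str, doc: str) -> float:
    return float(len(set(query.split()) & set(doc.split())))


top = retrieve("password reset", toy_vector_search, toy_rerank_score, top_k=4, top_n=2)
```

The reranker only ever sees `top_k` documents per query, which is what bounds its latency and cost regardless of corpus size.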

The retrieval cascade typically improves answer quality by 10-30 percent on internal eval suites for the same LLM + the same corpus. The cost trade-off: reranker adds 100-300 ms of latency + a per-pair API call (Cohere Rerank is roughly $1 per 1000 documents reranked).
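To make the cost side concrete, the per-document figure above can be turned into a daily number. The sketch takes the article's rough price as an input rather than as a quoted rate (actual reranker pricing should be checked with the vendor), and the function name and the example K of 50 are assumptions.

```python
def daily_rerank_cost_usd(
    queries_per_day: int,
    top_k: int,
    usd_per_1000_docs: float = 1.0,  # rough figure from the text above, not a quoted price
) -> float:
    """Every query sends top_k documents through the reranker."""
    docs_reranked = queries_per_day * top_k
    return docs_reranked * usd_per_1000_docs / 1000


# 10,000 queries per day at K = 50 means 500,000 documents reranked per day.
cost = daily_rerank_cost_usd(queries_per_day=10_000, top_k=50)
```

Under these assumptions, halving K halves the reranker bill, so K is worth tuning on the eval suite rather than leaving at a generous default.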

Hybrid search (BM25 keyword + vector) is the other big lever. BM25 catches exact-match queries (proper nouns, product codes, SKUs) that pure vector search often misses. The hybrid score is a weighted sum; the weight is tuned per corpus on the eval suite.
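A weighted-sum fusion might look like the sketch below. BM25 scores and cosine similarities live on different scales, so the sketch min-max normalises each list before mixing; that normalisation step, the function names, and the `alpha` parameter are assumptions, since the text above only specifies a weighted sum with a tuned weight.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, every entry gets 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_rank(bm25: dict[str, float], vector: dict[str, float], alpha: float = 0.5) -> list[str]:
    """alpha weights the vector score; (1 - alpha) weights BM25. Returns doc IDs, best first."""
    b, v = min_max(bm25), min_max(vector)
    fused = {
        doc_id: alpha * v.get(doc_id, 0.0) + (1 - alpha) * b.get(doc_id, 0.0)
        for doc_id in set(b) | set(v)
    }
    # Sort by score descending, then by ID so ties are deterministic.
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))
```

With a low `alpha`, a document that matches a product code exactly in BM25 outranks documents that are merely semantically close; raising `alpha` reverses that, which is why the weight is tuned against the eval suite.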

The eval suite: what makes RAG production-grade

An eval suite is the gate between "we tried RAG and it kind of works" and a defensible production system. The Dcrayons eval-suite pattern:

Held-out question set with known correct answers. 50-200 questions sampled from real user queries (anonymised) or written by domain experts. Each has an expected answer + a list of source documents that should appear in the retrieved context. This is the test set.

Per-question metrics. Retrieval recall (did the right source documents appear in the top-N?), answer correctness (LLM-as-judge or human review on a sample), answer faithfulness (did the answer stick to the retrieved context or hallucinate beyond it?), citation accuracy (when the answer claims source X says Y, does X actually say Y?), latency p50/p95.
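Two of these metrics are purely mechanical and can be computed without an LLM judge. The sketch below shows retrieval recall at N and a nearest-rank latency percentile; the function names and the nearest-rank method are assumptions, and correctness, faithfulness, and citation accuracy are left out because they need a judge model or human review.

```python
import math


def recall_at_n(retrieved_ids: list[str], expected_ids: list[str], n: int) -> float:
    """Fraction of expected source documents that appear in the top-N retrieved."""
    expected = set(expected_ids)
    if not expected:
        return 1.0  # nothing was required, so nothing was missed
    return len(expected & set(retrieved_ids[:n])) / len(expected)


def latency_percentile(latencies_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile over a non-empty list, e.g. pct=50 or pct=95."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

Averaging `recall_at_n` over the held-out question set gives the retrieval recall figure that the regression suite compares against the baseline.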

Regression suite. Every change (new embedding model, new chunking strategy, new reranker, new LLM version) runs through the eval suite. Results are compared against the production baseline. Regressions block the change.
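A regression gate reduces to comparing two metric dictionaries. The sketch below assumes aggregate scores per metric, a relative tolerance to absorb run-to-run noise, and a fixed set of lower-is-better metrics; the metric names, the 1 percent default tolerance, and the function name are illustrative assumptions.

```python
LOWER_IS_BETTER = {"latency_p50_ms", "latency_p95_ms"}  # assumed metric names


def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,  # relative slack for run-to-run noise (assumption)
) -> list[str]:
    """Return the metrics on which the candidate is worse than the baseline."""
    regressions = []
    for metric, base in baseline.items():
        new = candidate[metric]
        if metric in LOWER_IS_BETTER:
            worse = new > base * (1 + tolerance)
        else:
            worse = new < base * (1 - tolerance)
        if worse:
            regressions.append(metric)
    return regressions
```

In a CI pipeline, a non-empty result would fail the build, which is what turns "regressions block the change" from a policy into a mechanism.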

Production sample monitoring. A daily sample of real user queries gets logged + scored (LLM-as-judge for the auto-graded metrics, human review on a small sample). Score drift week-over-week triggers investigation: did the corpus shift, did user intent shift, did a model version change behind the scenes?

Without an eval suite, RAG quality drifts silently. With one, every architectural decision has a measurable outcome the team can defend to the CFO + CTO.

Production checklist: the rollout sequence

For an enterprise RAG programme handling 50,000+ documents + 10,000+ queries per day:

Vector DB selected against latency + scale + ops requirements (default pgvector under 5M vectors, Pinecone above)
Embedding model chosen against language + budget + quality eval (default text-embedding-3-large; voyage-3 for cost-sensitive workloads)
Source-type-routed chunking pipeline (PDFs + HTML + Slack + Notion each with appropriate chunking)
Content-hash-based refresh + change detection so embeddings stay in sync with source
PII + sensitive-content filtering at ingestion + metadata tagging for access control
Hybrid search (BM25 + vector) configured + weights tuned per corpus
Reranker layer (Cohere Rerank or equivalent) on top-K candidates
Eval suite of 50-200 held-out questions + automated regression on every change
Production sample monitoring + weekly review cadence
Cost monitoring per query + per-month budget alerts; embedding-regen cost tracked separately
Latency p50/p95 dashboards + alerting on degradation
Compliance + audit posture: PII filter validated quarterly, access-control test cases in the eval suite

References + linked context

Dcrayons glossary: vector-database, fine-tuning, eval-suite, model-context-protocol
Anthropic reference architecture: see /learn?tag=ai-emerging-tech for the Claude on AWS Mumbai + ZDR + audit-trail pattern that pairs with this RAG architecture

If your enterprise RAG programme is hitting a vector-DB scaling wall, an embedding-refresh discipline gap, or a retrieval-quality plateau, this is the architecture we deploy. Reach out via the contact form for a 30-minute review against your current setup.
