Ask HN: How would you architect a RAG system for 10M+ documents today?
Use a strong embedding model and keep it fixed for as long as possible; re-embedding 10M items is expensive. Run the embedding service behind a small API, and track model versions so you know the age of each vector. For retrieval, stick with hybrid search: BM25 plus an ANN index. Merge both result sets and rerank with a cross-encoder. This produces better output than single-mode retrieval and avoids the overhead of graph systems.
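A minimal sketch of the merge-and-rerank step above, using reciprocal rank fusion (one common way to join BM25 and ANN result lists; the comment doesn't specify a merge method). The hit lists, chunk texts, and the term-overlap scorer are stand-ins; in a real system the scorer would be a learned cross-encoder.

```python
# Hybrid retrieval sketch: merge two ranked lists with reciprocal rank
# fusion (RRF), then rerank the merged candidates with a scorer.

def rrf_merge(ranked_lists, k=60):
    """Merge ranked result lists with reciprocal rank fusion."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def rerank(query, doc_ids, texts, scorer, top_n=3):
    """Rescore merged candidates with a cross-encoder-style scorer."""
    scored = sorted(((scorer(query, texts[d]), d) for d in doc_ids),
                    reverse=True)
    return [d for _, d in scored[:top_n]]

# Hypothetical hit lists from the two retrievers.
bm25_hits = ["doc3", "doc1", "doc7"]
ann_hits = ["doc1", "doc9", "doc3"]
texts = {
    "doc1": "postgres tuning",
    "doc3": "hybrid search",
    "doc7": "bm25 internals",
    "doc9": "ann recall",
}

# Toy scorer: count of query terms appearing in the chunk text.
def overlap_scorer(query, text):
    return len(set(query.split()) & set(text.split()))

merged = rrf_merge([bm25_hits, ann_hits])
top = rerank("hybrid search recall", merged, texts, overlap_scorer)
# merged favors docs that appear high in BOTH lists; rerank then reorders
# by query-chunk relevance.
```

RRF is attractive here because it needs no score calibration between the lexical and vector retrievers; only ranks matter.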
LightRAG and GraphRAG help only when you have clear entities and relations across your corpus, and they bring heavy maintenance work. Entity extraction, graph updates, and link revisions add load with small gains unless your domain needs multi-hop lookups. Most teams skip graph methods until they confirm real gaps in answer quality. Hybrid retrieval with reranking delivers stable output at this scale with lower upkeep.
Keep RAG logic inside one service. The chat layer sends a query. The service runs the retrieval steps, builds a context window, and sends a compact prompt to the model. This avoids coupling the LLM tier with indexes or pipelines. Incremental updates remain simple because all maintenance lives in the ingestion workers.
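The flow described above can be sketched as one service entry point. Everything here is illustrative: `retrieve` and `complete` are injected placeholders for the retrieval pipeline and the LLM client, and the character budget is a stand-in for real token counting.

```python
# Thin RAG service sketch: the chat layer sends a query string; the
# service retrieves, packs a bounded context, and prompts the model.

MAX_CONTEXT_CHARS = 2000  # illustrative budget, not a recommendation

def build_context(chunks, budget=MAX_CONTEXT_CHARS):
    """Pack retrieved chunks into a bounded context window, in order."""
    out, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > budget:
            break
        out.append(chunk)
        used += len(chunk)
    return "\n---\n".join(out)

def answer(query, retrieve, complete):
    """One call: retrieve -> build context -> prompt the model."""
    chunks = retrieve(query)
    context = build_context(chunks)
    prompt = (f"Answer using only this context:\n{context}\n\n"
              f"Question: {query}")
    return complete(prompt)

# Stubbed dependencies for a local demo.
def fake_retrieve(query):
    return ["chunk about indexes", "chunk about sharding"]

def fake_complete(prompt):
    return f"(model saw {len(prompt)} chars of prompt)"

reply = answer("when to shard?", fake_retrieve, fake_complete)
```

Because the chat layer only ever sees `answer(query)`, the indexes, rerankers, and ingestion workers behind `retrieve` can change without touching the LLM tier.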
At this scale, the main threats are drift between PostgreSQL and the indexes, slow indexing pipelines, and uneven query latency. Run periodic checks that compare PostgreSQL counts with your search indexes. Log every query with its retrieved chunks and model output. Replay those logs after each change to catch regressions. Use sharding only when a single node reaches memory limits.
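A sketch of the periodic drift check mentioned above. The counts are hypothetical snapshots; in practice `pg_count` would come from a `SELECT count(*)` against PostgreSQL and `index_counts` from each index's stats endpoint.

```python
# Drift check sketch: compare the source-of-truth row count in PostgreSQL
# against the document counts reported by each search index.

def check_drift(pg_count, index_counts, tolerance=0):
    """Return names of indexes whose counts diverge beyond tolerance."""
    return [name for name, n in index_counts.items()
            if abs(n - pg_count) > tolerance]

# Hypothetical snapshot: the vector index is missing 120 documents,
# likely from a stalled ingestion worker.
drifted = check_drift(
    10_000_000,
    {"bm25": 10_000_000, "vectors": 9_999_880},
)
# drifted -> ["vectors"]; alert and re-run ingestion for the gap.
```

A nonzero `tolerance` is useful when ingestion is continuous, so in-flight writes don't page anyone; the same comparison run on a schedule catches silent index corruption before users do.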
The short answer is to keep the system simple. PostgreSQL for storage. A vector index for embeddings. A lexical index for text. A small rerank model. A thin RAG service that binds them together. Graph methods stay out unless your domain depends on structured relations. Regular updates, a clean pipeline, and strong logging will give you a system that behaves well with 10M documents and grows without surprises.