Production RAG: What I Learned From Processing 5M+ Documents
Posted 3 months ago · Active 2 months ago
Source: blog.abdellatif.io
Key topics
RAG
Large Language Models
Natural Language Processing
Information Retrieval
The author shares their experience processing 5M+ documents using a Retrieval-Augmented Generation (RAG) system, and the discussion revolves around the challenges and best practices for building such systems.
Snapshot generated from the HN discussion
Discussion Activity
- Very active discussion
- First comment: 34m after posting
- Peak period: 106 comments (Day 1)
- Avg per period: 19 comments
- Comment distribution: 114 data points (based on 114 loaded comments)
Key moments
- Story posted: Oct 20, 2025 at 11:55 AM EDT (3 months ago)
- First comment: Oct 20, 2025 at 12:30 PM EDT (34m after posting)
- Peak activity: 106 comments in Day 1 (hottest window of the conversation)
- Latest activity: Nov 3, 2025 at 8:59 AM EST (2 months ago)
ID: 45645349 · Type: story · Last synced: 11/22/2025, 11:00:32 PM
Chunking strategy is a big issue. I found acceptable results by shoving large texts to Gemini Flash and having it summarize and extract chunks, instead of whatever text splitter I tried. I use the method published by Anthropic (https://www.anthropic.com/engineering/contextual-retrieval), i.e. include the full summary along with the chunks for each embedding.
I also created a tool to enable the LLM to do vector search on its own.
I do not use LangChain or Python; I use Clojure plus the LLMs' REST APIs.
Not sensitive to latency at all. My users would rather have well researched answers than poor answers.
Also, I use batch-mode APIs for chunking; it is so much cheaper.
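For anyone who hasn't read the Anthropic post, the contextual-retrieval idea boils down to generating a short, document-aware context for each chunk and embedding it together with the chunk. A minimal sketch, assuming an OpenAI-style client rather than the commenter's Clojure + REST setup; the model names and prompt are illustrative:

```python
# Sketch of contextual retrieval: generate a short, document-aware context for
# each chunk and embed "context + chunk" instead of the bare chunk.
# Model names, prompt, and the OpenAI client are illustrative choices here.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def contextualize_chunk(document: str, chunk: str) -> str:
    """Ask a cheap LLM to situate this chunk within the whole document."""
    prompt = (
        "Here is a document:\n<document>\n" + document + "\n</document>\n\n"
        "Here is a chunk from it:\n<chunk>\n" + chunk + "\n</chunk>\n\n"
        "Write one or two sentences situating this chunk within the document, "
        "to improve retrieval of the chunk. Answer with only the context."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # the commenter uses Gemini Flash; any cheap model works
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def embed_with_context(document: str, chunks: list[str]) -> list[list[float]]:
    """Embed 'context + chunk' so each vector carries document-level signal."""
    texts = [contextualize_chunk(document, c) + "\n\n" + c for c in chunks]
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return [item.embedding for item in resp.data]
```

This adds one extra LLM call per chunk, which is exactly why batch-mode APIs help so much with cost.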
I've struggled to find a target market though. Would you mind sharing what your use case is? It would really help give me some direction.
That is, there is nothing here that one could not easily write without a library.
Ingestion + Agentic Search are two areas that we're focused on in the short term.
The only place I see that actually operates on chunks does so by fetching them from Redis, and AFAICT nothing in the repo actually writes to Redis, so I assume the chunker is elsewhere.
https://github.com/agentset-ai/agentset/blob/main/packages/j...
What does query generation mean in this context? It's probably not SQL queries, right?
One of the key features in Claude Code is "Agentic Search" aka using (rip)grep/ls to search a codebase without any of the overhead of RAG.
Sounds like even RAG systems use a similar approach (query generation).
Or am I completely misunderstanding how Claude Code works?
https://jakobs.dev/learnings-ingesting-millions-pages-rag-az...
The big LLM-based rerankers (e.g. Qwen3-reranker) are what you always wanted your cross-encoder to be, and I highly recommend giving them a try. Unfortunately they're also quite computationally expensive.
Your metadata/tabular data often contains basic facts that a human takes for granted, but which aren't repeated in every text chunk - injecting it can help a lot in making the end model seem less clueless.
The point about queries that don't work with simple RAG (like "summarize the most recent twenty documents") is very important to keep in mind. We made our UI very search-oriented and deemphasized the chat, to try to communicate to users that search is what's happening under the hood - the model only sees what you see.
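On the reranker point, this is what a classic cross-encoder reranker looks like with sentence-transformers. Large LLM-based rerankers such as Qwen3-reranker are invoked differently (see their model cards), but they fill the same slot: (query, passage) pairs in, relevance scores out. The checkpoint below is just a common public one, not necessarily what the commenter runs:

```python
# A classic cross-encoder reranker: score (query, passage) pairs jointly and
# keep the top results. LLM-based rerankers (e.g. Qwen3-reranker) are invoked
# differently but fill the same slot in the pipeline.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # common public checkpoint

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```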
To give a real world example, the way Claude Code works versus how Cursor's embedded database works.
If you want something as simple as "suggest similar tweets" or something across millions of things then embeddings still work.
But if you want something like "compare the documents across these three projects" then you would use full text metadata extraction. Keywords, summaries, table of contents, etc to determine data about each document and each chunk.
The difference is that this feature explicitly isn't designed to do a whole lot, which is still the best way to build most LLM-based products: sandwich the LLM between non-LLM stuff.
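As a sketch of what "full text metadata extraction" can look like in practice, assuming an OpenAI-style client: the schema (summary, keywords, table of contents) mirrors the comment above, everything else is illustrative:

```python
# Illustrative metadata extraction: pull a summary, keywords, and a table of
# contents per document, so cross-document questions can be answered from
# metadata rather than from individual chunks.
import json
from openai import OpenAI

client = OpenAI()

def extract_metadata(document_text: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": (
                "Return a JSON object with keys 'summary' (3 sentences), "
                "'keywords' (10 strings), and 'toc' (list of section titles) "
                "for this document:\n\n" + document_text[:30000]
            ),
        }],
    )
    return json.loads(resp.choices[0].message.content)
```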
This, combined with a subsequent reranker, basically eliminated any of our issues on search.
Disclosure: I work at MS and help maintain our most popular open-source RAG template, so I follow the best practices closely: https://github.com/Azure-Samples/azure-search-openai-demo/
So few developers realize that you need more than just vector search, so I still spend many of my talks emphasizing the FULL retrieval stack for RAG. It's also possible to do it on top of other DBs like Postgres, but it takes more effort.
The AI Search team has been working with the SharePoint team to offer more options, so that devs can get the best of both worlds. Might have some stuff ready for Ignite (mid-November).
The capability was there for years, but it was expensive - something like $0.60 per 1,000 items indexed. Then, sometime after Copilot was added, it became free for up to 50 million items, and now it's free for unlimited items - you just can't beat that for price... https://techcommunity.microsoft.com/blog/microsoft365copilot...
That's why I write blog posts like https://blog.pamelafox.org/2024/06/vector-search-is-not-enou...
Moreover, I am curious why you guys use BM25 over SPLADE?
That query generation approach does not extract structured data. I do maintain another RAG template for PostgreSQL that uses function calling to turn the query into a structured query, such that I can construct SQL filters dynamically. Docs here: https://github.com/Azure-Samples/rag-postgres-openai-python/...
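The linked template has its own implementation, but the general shape of turning a query into structured filters via function calling looks roughly like this; the field names below are invented for illustration:

```python
# General shape of "query -> structured filters" via function calling.
# The field names are invented; the linked template defines its own schema.
import json
from openai import OpenAI

client = OpenAI()

filter_tool = {
    "type": "function",
    "function": {
        "name": "search_database",
        "description": "Search documents with an optional structured filter.",
        "parameters": {
            "type": "object",
            "properties": {
                "search_query": {"type": "string"},
                "brand": {"type": "string", "description": "Exact brand to filter on"},
                "max_price": {"type": "number"},
            },
            "required": ["search_query"],
        },
    },
}

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "waterproof jackets under $100"}],
    tools=[filter_tool],
    tool_choice={"type": "function", "function": {"name": "search_database"}},
)
args = json.loads(resp.choices[0].message.tool_calls[0].function.arguments)
# e.g. {"search_query": "waterproof jacket", "max_price": 100}
# -> turn these fields into SQL WHERE clauses or vector-store filters.
```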
I'll ask the search team about SPLADE, not sure.
Of course, agentic retrieval is just better quality-wise for a broader set of scenarios, usual quality-latency trade-off.
We don't do SPLADE today. We've explored it and may get back to it at some point, but we ended up investing more on reranking to boost precision, we've found we have fewer challenges on the recall side.
Shameless plug: plpgsql_bm25: BM25 search implemented in PL/pgSQL (The Unlicense / PUBLIC DOMAIN)
https://github.com/jankovicsandras/plpgsql_bm25
There's an example Postgres_hybrid_search_RRF.ipynb in the repo which shows hybrid search with Reciprocal Rank Fusion ( plpgsql_bm25 + pgvector ).
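Reciprocal Rank Fusion itself is tiny. The notebook runs the fusion against Postgres; for reference, the same idea in plain Python:

```python
# Reciprocal Rank Fusion: score(d) = sum over rankers of 1 / (k + rank_of_d).
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = rrf([bm25_top_ids, vector_top_ids])
```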
One thing I’m always curious about is if you could simplify this and get good/better results using SPLADE. The v3 models look really good and seem to provide a good balance of semantic and lexical retrieval.
What is re-ranking in the context of RAG? Why not just show the code if it’s only 5 lines?
Here's sample code: https://docs.cohere.com/reference/rerank
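Roughly, those few lines look like the following with the Cohere SDK; model names and response shapes change between SDK versions, so treat this as a sketch and check the linked docs:

```python
# Reranking via an API in a few lines. Model name and response shape vary by
# SDK version; check the linked Cohere docs before relying on this.
import cohere

co = cohere.Client("YOUR_API_KEY")  # newer SDKs also offer cohere.ClientV2
results = co.rerank(
    model="rerank-english-v3.0",
    query="What is the capital of the United States?",
    documents=[
        "Carson City is the capital city of Nevada.",
        "Washington, D.C. is the capital of the United States.",
    ],
    top_n=1,
)
# results contains indices into `documents` plus relevance scores.
```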
If you generate embeddings (of the query, and of the candidate documents) and compare them for similarity, you're essentially asking whether the documents "look like the question."
If you get an LLM to evaluate how well each candidate document follows from the query, you're asking whether the documents "look like an answer to the question."
An ideal candidate chunk/document from a cosine-similarity perspective, would be one that perfectly restates what the user said — whether or not that document actually helps the user. Which can be made to work, if you're e.g. indexing a knowledge base where every KB document is SEO-optimized to embed all pertinent questions a user might ask that "should lead" to that KB document. But for such documents, even matching the user's query text against a "dumb" tf-idf index will surface them. LLMs aren't gaining you any ground here. (As is evident by the fact that webpages SEO-optimized in this way could already be easily surfaced by old-school search engines if you typed such a query into them.)
An ideal candidate chunk/document from a re-ranking LLM's perspective, would be one that an instruction-following LLM (with the whole corpus in its context) would spit out as a response, if it were prompted with the user's query. E.g. if the user asks a question that could be answered with data, a document containing that data would rank highly. And that's exactly the kind of documents we'd like "semantic search" to surface.
Embeddings are a lossy compression, so if you feed the chunks to the model together with the prompt, the results are better. But you can't do this for your whole DB; that's why you filter with cosine similarity at the beginning.
Here is that leaderboard https://huggingface.co/spaces/mteb/leaderboard?benchmark_nam...
Voyage-3-large seems like SOTA right now
Once Bedrock KB backed by S3 Vectors is released from Beta it'll eat everybody's lunch.
I'm correcting you less out of pedantry, and more because I find the correct term to be funny.
SOTA for what? Isn't it just a vector store?
We have very different ideas about the meaning of self-hosted.
For example - if a "self hosted" service supports off-site backups is it self hosted or just well designed?
There is a big difference between communicating with external services (your example) vs REQUIRING external services (what parent is complaining about).
If in your example the system can run correctly with just local backups I would consider it self-hosted.
I’ve probably missed a huge wave of programming technology because of this, and I’ve figured out a way to make it work for a consistent paycheck over these past 20 years.
I’m also not a great example, I think I’ve watched 7 whole hours of YouTube videos ever, and those were all for car repair help.
I shy away from tech that needs to be online/connected/whatever.
"Title: ... Author: ... Text: ..." for each chunk, instead of just passing the text.
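Concretely, that is just string formatting before embedding; the exact fields depend on what metadata your corpus actually has:

```python
# Prepend document metadata to each chunk before embedding; the fields
# (title, author, ...) depend on what your corpus actually has.
def format_chunk(title: str, author: str, text: str) -> str:
    return f"Title: {title}\nAuthor: {author}\nText: {text}"
```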
What's this roundtrip? Also, the chronology of the LLM (4.1) doesn't match the rest of the stack (text-embedding-3-large), feels weird.
a) It has worse instruction following and doesn't follow the system prompt; b) it produces very long answers, which resulted in a bad UX; c) it has a 125K context window, so extreme cases resulted in an error.
Again, these were only observed in RAG when you pass lots of chunks; GPT-5 is probably a better model for other tasks.
- Classic RAG: `User -> Search -> LLM -> User`
- Agentic RAG: `User <-> LLM <-> Search`
Essentially, instead of having a fixed loop, you provide the search as a tool to the LLM, which enables three things:
- The LLM can search multiple times
- The LLM can adjust the search query
- The LLM can use multiple tools
The combination of these three things has solved a majority of classic RAG problems. It improves user queries, it can map abbreviations, it can correct bad results on its own, you can also let it list directories and load files directly.
- It depends on your use case how you let the model understand when and when not to use tools - GPT-5 is VERY persistent and often searches more than 10 times in a single run, depending on the results.
We're using pydantic AI where the entire Agent loop is taken care of by the framework. Highly recommend.
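For the curious, the agentic loop itself is not much code even without a framework. A minimal sketch with the OpenAI SDK, where `vector_search` is a stand-in for whatever retrieval backend you have; Pydantic AI and similar frameworks wrap essentially this pattern:

```python
# Minimal agentic-RAG loop with the OpenAI SDK: the model decides when, and how
# often, to call the search tool. `vector_search` is a stand-in for your backend.
import json
from openai import OpenAI

client = OpenAI()

def vector_search(query: str) -> str:
    """Hypothetical retrieval call; replace with your vector/hybrid search."""
    return "top chunks for: " + query

tools = [{
    "type": "function",
    "function": {
        "name": "vector_search",
        "description": "Search the document corpus. Call as often as needed.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "Compare the documents across these three projects."}]
for _ in range(10):  # cap the number of search round-trips
    resp = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    msg = resp.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:
        break  # the model answered instead of searching again
    for call in msg.tool_calls:
        query = json.loads(call.function.arguments)["query"]
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": vector_search(query),
        })

print(msg.content)
```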
* Upload documents via API into a Google Workspace folder
* Use some sort of Google AI search API on those documents in that folder
…placing documents for different customers into different folders.
Or the Azure equivalent whatever that is.
Anyone here successfully transitioned into the legal space? My gut has always been that legal is the space where LLMs can really be useful; the first one is programming.
Could you share more about chunking strategies you used?
Query expansion and non-naive chunking give the biggest bang for the buck, with chunking being the most resource-intensive task if the input data is chunky (pun intended).
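Query expansion in that sense is usually one cheap LLM call before retrieval: generate a few reformulations, search with all of them, and fuse the results (for example with RRF). A sketch, with prompt and model as illustrative placeholders:

```python
# Query expansion: ask a cheap model for reformulations, search with all of
# them, then fuse the results (e.g. with RRF). Prompt and model are illustrative.
from openai import OpenAI

client = OpenAI()

def expand_query(user_query: str, n: int = 4) -> list[str]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Rewrite this search query {n} different ways, expanding "
                "abbreviations and adding likely synonyms. One per line, no "
                f"numbering:\n\n{user_query}"
            ),
        }],
    )
    rewrites = [line.strip() for line in resp.choices[0].message.content.splitlines() if line.strip()]
    return [user_query] + rewrites[:n]
```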