Why Your RAG Costs $2,400/month (and How We Cut It by 73%)
We built a RAG system for enterprise clients and realized most production RAGs are optimization disasters. The literature obsesses over accuracy while completely ignoring unit economics.
The Three Cost Buckets

Vector Database (40-50% of the bill)

Standard RAG pipelines make 3-5 unnecessary DB queries per question. We were making 5 round-trips for what should have been 1.5.
LLM API (30-40%)

Standard RAG pumps 8-15k tokens into the LLM per query. That's 5-10x more than necessary. We found that beyond 3,000 tokens of context, accuracy plateaus; everything past that point is noise and cost.
Infrastructure (15-25%)

Vector databases sitting idle, monitoring overhead, and unnecessary load balancing.
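The vector-database bucket above comes down to round-trips. A minimal sketch of the fix, collapsing several sequential searches into one batched request (the `VectorDB` stub and its `batch_search` method are hypothetical stand-ins; real stores such as Qdrant or Pinecone expose similar batch endpoints):

```python
import asyncio

class VectorDB:
    """Hypothetical client stub that counts network round-trips."""

    def __init__(self):
        self.round_trips = 0

    async def batch_search(self, vectors, top_k=8):
        # One round-trip covers the whole batch of query vectors.
        self.round_trips += 1
        return [[f"hit-{i}-{j}" for j in range(top_k)] for i, _ in enumerate(vectors)]

async def retrieve(db, query_vectors):
    # Before: one search per sub-query (N round-trips).
    # After: a single batched call for all sub-queries.
    return await db.batch_search(query_vectors, top_k=8)

db = VectorDB()
hits = asyncio.run(retrieve(db, [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]))
```

Three sub-queries, one round-trip; the same idea applies whenever a question is decomposed into multiple retrievals.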
What Actually Moved the Needle

Token-Aware Context (35% savings)

Budget-based context assembly that stops adding chunks once the token budget is spent. Before: 12k tokens/query. After: 3.2k tokens. Same accuracy.
```python
def _build_context(self, results, settings):
    max_tokens = settings.get("max_context_tokens", 2000)
    selected = []
    current_tokens = 0
    for result in results:
        tokens = self.llm.count_tokens(result)
        if current_tokens + tokens > max_tokens:
            break  # budget spent: stop adding chunks
        selected.append(result)
        current_tokens += tokens
    return "\n\n".join(selected)
```

Hybrid Reranking (25% savings)

70% semantic + 30% keyword scoring. Better ranking means fewer chunks are needed: we went from top-20 to top-8 retrieval while maintaining quality.
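The 70/30 weighting can be sketched as a simple linear blend. This is a minimal illustration, not the production reranker: it assumes both scores are already normalized to [0, 1], and the `hybrid_score`/`rerank` names are our own:

```python
def hybrid_score(semantic, keyword, alpha=0.7):
    # 70% semantic similarity + 30% keyword (e.g. BM25) score,
    # both assumed pre-normalized to [0, 1].
    return alpha * semantic + (1 - alpha) * keyword

def rerank(chunks, top_k=8):
    # chunks: iterable of (text, semantic_score, keyword_score) tuples.
    ranked = sorted(chunks, key=lambda c: hybrid_score(c[1], c[2]), reverse=True)
    return [c[0] for c in ranked[:top_k]]

top = rerank([("a", 0.9, 0.1), ("b", 0.5, 0.9), ("c", 0.2, 0.2)], top_k=2)
```

Because keyword matches only contribute 30%, a chunk that is lexically similar but semantically off-topic can no longer crowd out the top slots, which is what lets top-20 shrink to top-8.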
Embedding Caching (20% savings)

Workspace-isolated cache with a 7-day TTL. We see a 45-60% hit rate intra-day.
```python
import hashlib
import json

async def set_embedding(self, text, embedding, workspace_id=None):
    # Use a content hash for the key: Python's built-in hash() is
    # salted per process, so it can't serve as a stable cache key.
    text_hash = hashlib.sha256(text.encode()).hexdigest()
    key = f"embedding:ws_{workspace_id}:{text_hash}"
    await redis.setex(key, 604800, json.dumps(embedding))  # 7-day TTL
```

Batch Embedding (15% savings)

Batch API pricing is 30-40% cheaper per token. Process 50 texts per request instead of one at a time.
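The batching itself is just chunking the input before calling the provider. A minimal sketch, where `embed_batch` is a hypothetical wrapper around whatever batch embeddings endpoint you use:

```python
def batched(texts, batch_size=50):
    # Yield consecutive slices of up to batch_size texts.
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

def embed_all(texts, embed_batch):
    # embed_batch: hypothetical callable wrapping the provider's batch
    # embeddings endpoint; one API call per 50 texts instead of per text.
    vectors = []
    for batch in batched(texts):
        vectors.extend(embed_batch(batch))
    return vectors

# Demo with a stub embedder that records call sizes.
call_sizes = []
def fake_embed(batch):
    call_sizes.append(len(batch))
    return [[0.0]] * len(batch)

vectors = embed_all([f"doc {i}" for i in range(120)], fake_embed)
```

120 texts become 3 API calls instead of 120, which is where the per-token batch discount compounds with reduced request overhead.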