I Built an Open-Weights Memory System That Reaches 80.1% on the LoCoMo Benchmark
Results
Benchmark: LoCoMo (10 runs × 10 conversation sets)
Average accuracy: 80.1%
Setup: full isolation across all 10 conversation groups (no cross-contamination, no shared memory between runs)
Architecture (all open weights except answer generation)
1. Dense retrieval
BGE-large-en-v1.5 (1024d)
FAISS IndexFlatIP
Standard BGE instruction prompt: “Represent this sentence for searching relevant passages.”
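A minimal sketch of this stage, with the model and index type named above; the corpus, query, and variable names are placeholders:

```python
# Dense retrieval sketch: BGE-large-en-v1.5 embeddings in a FAISS
# inner-product index. Docs and query here are illustrative only.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

docs = ["Alice adopted a dog in March.", "Bob moved to Berlin last year."]
doc_vecs = model.encode(docs, normalize_embeddings=True)  # unit vectors, so IP == cosine

index = faiss.IndexFlatIP(doc_vecs.shape[1])  # 1024-d for bge-large
index.add(doc_vecs)

# BGE expects this instruction prefix on queries (passages are encoded without it).
instruction = "Represent this sentence for searching relevant passages: "
query_vec = model.encode([instruction + "When did Alice get a pet?"],
                         normalize_embeddings=True)
scores, ids = index.search(query_vec, 2)
print(ids[0], scores[0])
```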
2. Sparse retrieval
BM25 via classic inverted index
Helps with low-embedding-recall queries and keyword-heavy prompts
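A sketch of the sparse side, using the rank_bm25 package as a stand-in for the system's own inverted index:

```python
# Sparse retrieval sketch. Assumption: the real system uses a custom
# inverted index; BM25Okapi from rank_bm25 plays that role here.
from rank_bm25 import BM25Okapi

docs = ["alice adopted a dog in march", "bob moved to berlin last year"]
tokenized = [d.split() for d in docs]
bm25 = BM25Okapi(tokenized)

query = "when did alice adopt a dog".split()
scores = bm25.get_scores(query)  # one BM25 score per document
top = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(top, scores)
```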
3. MCA (Multi-Component Aggregation) ranking
A simple gravitational-style score combining:
keyword coverage
token importance
local frequency signal
MCA acts as a first-pass filter to catch exact-match questions. Threshold: coverage ≥ 0.1 → keep top-30.
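The post doesn't give the exact formula, so the following is only one plausible reading of a "gravitational-style" combination of those three signals; the function names and weighting are illustrative, not the repo's actual code:

```python
# Hedged MCA sketch: multiply the three signals so a zero in any one
# of them kills the score, then apply the coverage threshold.
import math
from collections import Counter

def mca_score(query_tokens, doc_tokens, idf):
    q = set(query_tokens)
    tf = Counter(doc_tokens)
    overlap = q & set(doc_tokens)
    coverage = len(overlap) / max(len(q), 1)              # keyword coverage
    importance = sum(idf.get(t, 0.0) for t in overlap)    # token importance (IDF)
    freq = sum(math.log1p(tf[t]) for t in overlap)        # local frequency signal
    return coverage * (1.0 + importance) * (1.0 + freq), coverage

def mca_filter(query_tokens, docs, idf, min_coverage=0.1, k=30):
    scored = []
    for i, doc in enumerate(docs):
        score, cov = mca_score(query_tokens, doc, idf)
        if cov >= min_coverage:       # coverage >= 0.1, per the post
            scored.append((score, i))
    return [i for _, i in sorted(scored, reverse=True)[:k]]  # keep top-30
```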
4. Union strategy
Instead of aggressively reducing the union, the system feeds 112–135 documents directly to the re-ranker. In practice this improved stability and prevented loss of rare but crucial documents.
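A sketch of that union step, assuming doc-ID lists from the three retrievers (the helper name is mine):

```python
# Union sketch: merge candidate IDs from all three retrievers without
# pruning, then hand the whole set to the cross-encoder.
def candidate_union(dense_ids, bm25_ids, mca_ids):
    seen, union = set(), []
    for i in list(dense_ids) + list(bm25_ids) + list(mca_ids):
        if i not in seen:   # dedupe while preserving first-seen order
            seen.add(i)
            union.append(i)
    return union            # typically 112-135 docs, per the post
```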
5. Cross-Encoder reranking
bge-reranker-v2-m3
Processes the full union (rare for RAG pipelines, but worked best here)
Produces a final top-k used for answer generation
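A sketch of the rerank stage with the model named above, via the sentence-transformers CrossEncoder wrapper (the final k is illustrative; the post doesn't state it):

```python
# Rerank sketch: score every (query, doc) pair in the union with
# BAAI/bge-reranker-v2-m3 and keep a final top-k for generation.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", max_length=512)

def rerank(query, union_docs, k=10):
    pairs = [(query, doc) for doc in union_docs]
    scores = reranker.predict(pairs)  # one relevance score per pair
    ranked = sorted(zip(union_docs, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:k]]
```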
6. Answer generation
GPT-4o-mini, used only for the final synthesis step
No agent chain, no tool calls, no memory-dependent LLM logic
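A sketch of the synthesis call, assuming the official openai client; the prompt wording is illustrative:

```python
# Final synthesis sketch: GPT-4o-mini sees only the reranked context.
# No tools, no agent loop, consistent with the post.
from openai import OpenAI

client = OpenAI()

def answer(query, top_docs):
    context = "\n".join(top_docs)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer strictly from the provided memory snippets."},
            {"role": "user",
             "content": f"Memory:\n{context}\n\nQuestion: {query}"},
        ],
        temperature=0,  # helps keep output deterministic between runs
    )
    return resp.choices[0].message.content
```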
Performance
<3 seconds per query on a single RTX 4090
Deterministic output between runs
Reproducible test harness (10×10 protocol)
Why this worked
Three things seemed to matter most:
MCA-first filter to stabilize early recall
Not discarding the union before re-ranking
Proper dense embedding instruction, which massively affects BGE performance
Notes
LoCoMo remains one of the hardest public memory benchmarks: 5,880 multi-hop, temporal, negation-rich QA pairs derived from human–agent conversations. I'd be interested to compare notes with others working on long-term retrieval, especially multi-stage ranking or cross-encoder-heavy pipelines.
GitHub: https://github.com/vac-architector/VAC-Memory-System