Vector Database That Can Index 1b Vectors in 48m
Posted 4 months ago · Active 4 months ago
vectroid.com · Tech · story · High profile
Sentiment: skeptical/mixed · Debate: 80/100
Key topics
Vector Databases
Database Performance
AI/ML Infrastructure
Vectroid claims to have built a vector database that can index 1 billion vectors in 48 minutes, sparking discussion about the practicality and necessity of such performance, as well as comparisons to existing solutions.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 40m after posting
Peak period: 39 comments in 0-6h
Avg / period: 10.8
Comment distribution: 65 data points (based on 65 loaded comments)
Key moments
1. Story posted: Sep 12, 2025 at 12:56 PM EDT (4 months ago)
2. First comment: Sep 12, 2025 at 1:36 PM EDT (40m after posting)
3. Peak activity: 39 comments in 0-6h (hottest window of the conversation)
4. Latest activity: Sep 15, 2025 at 1:08 PM EDT (4 months ago)
ID: 45224141 · Type: story · Last synced: 11/20/2025, 2:36:48 PM
Nitpick: could be wrong but I don’t think minutes is an SI derived unit.
Maybe not impossible using shared/lossy storage if they were sparsely scattered over a large space?
But anyways - minutes. Thanks.
Edit: Gemini suggested that this sort of (lossy) storage size could be achieved using "Product Quantization" (sub-vectors, clustering, cluster indices), giving an example of 256-dimensional vectors being stored at an average of 6 bits per vector, with ANN being one application that might use this.
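For readers unfamiliar with the technique, here is a minimal Product Quantization sketch in Python (toy sizes chosen for illustration; this is not Vectroid's or Gemini's exact scheme): each vector is split into sub-vectors, each subspace is clustered with k-means, and only the per-subspace centroid indices are stored.

```python
# Minimal Product Quantization sketch: split each vector into M sub-vectors,
# k-means each subspace, store only the uint8 centroid indices.
import numpy as np
from sklearn.cluster import KMeans

DIM, M, K = 256, 8, 256           # 8 sub-vectors of 32 dims, 256 centroids each
SUB = DIM // M                    # codes take 8 bytes/vector vs 1 KiB float32

rng = np.random.default_rng(0)
train = rng.normal(size=(10_000, DIM)).astype(np.float32)

# Train one codebook per subspace.
codebooks = [
    KMeans(n_clusters=K, n_init=1, random_state=0)
    .fit(train[:, m * SUB:(m + 1) * SUB])
    for m in range(M)
]

def encode(x: np.ndarray) -> np.ndarray:
    """Compress a batch of vectors to M uint8 centroid indices each."""
    codes = [cb.predict(x[:, m * SUB:(m + 1) * SUB]) for m, cb in enumerate(codebooks)]
    return np.stack(codes, axis=1).astype(np.uint8)

def decode(codes: np.ndarray) -> np.ndarray:
    """Reconstruct (lossily) by concatenating the chosen centroids."""
    parts = [codebooks[m].cluster_centers_[codes[:, m]] for m in range(M)]
    return np.concatenate(parts, axis=1)

codes = encode(train[:100])       # 100 x 8 bytes instead of 100 x 1024 bytes
approx = decode(codes)            # lossy reconstruction, used for ANN scoring
```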
1. Has a technical system they think could be worth a fortune to large enterprises, containing at least a few insights novel to the industry.
2. Knows that competitors and open source alternatives could copy/implement these in a year or so if the product starts off open source.
3. Has to put food on the table and doesn’t want to give massive corporations extremely valuable software for free.
Open source has its place, but it is IMO one of the ways to give monopolies massive value for free. There are plenty of open source alternatives around for vector DBs. Do we (developers) need to give everything away to the rich? (not)
Secondly, as far as I know, the blocker with approximate nearest neighbor search is often not insertion, but search. And if this search was worth a fortune to me, I'd simply embarrassingly parallelize it on CPUs or on GPUs.
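That parallelization is straightforward to sketch: shard the corpus, scan each shard concurrently, and merge the per-shard top-k. A rough Python illustration (toy data; the shard count and sizes are arbitrary assumptions):

```python
# "Embarrassingly parallel" exact search: scan shards concurrently
# (numpy/BLAS releases the GIL), then merge the per-shard top-k.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

rng = np.random.default_rng(0)
shards = [rng.normal(size=(250_000, 128)).astype(np.float32) for _ in range(4)]
query = rng.normal(size=128).astype(np.float32)
K = 10

def topk_in_shard(args):
    shard_id, shard = args
    scores = shard @ query                 # one BLAS matvec per shard
    idx = np.argpartition(-scores, K)[:K]  # local top-k, unordered
    return [(float(scores[i]), shard_id, int(i)) for i in idx]

with ThreadPoolExecutor(max_workers=4) as pool:
    partial = [hit for hits in pool.map(topk_in_shard, enumerate(shards)) for hit in hits]

# The global top-k is just the best of the per-shard candidates.
best = sorted(partial, key=lambda h: -h[0])[:K]
```

The same structure maps directly onto GPUs or multiple machines: each worker only needs its shard and the query, and the merge step is tiny.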
Vectroid co-founder here. We're huge fans of open source. My co-founder, Talip, made Hazelcast, which is open source.
It might make sense to open source all or part of Vectroid at some point in the future, but at the moment, we feel that would slow us down.
I hate vendor lock-in just as much as the next person. I believe data portability is the ACTUAL counter to vendor lock-in. If we have clean APIs to get your data in, get your data out, and the ability to bulk export your data (which we need to implement soon!), then there's less of a concern, in my opinion.
I also totally understand and respect that some people only want open source software. I'm certainly like that w/ my homelab setup! Except for Plex... Love Plex... Usually.
Today, the differences are going to be performance, price, accuracy, flexibility, and some intangible UI elegance.
Performance: We actually INITIALLY built Vectroid for the use-case of billions of vectors and near single digit millisecond latency. During the process of building and talking to users, we found that there are just not that many use-cases (yet!) that are at that scale and require that latency. We still believe the market will get there, but it's not there today. So we re-focused on building a general purpose vector search platform, but we stayed close to our high performance roots, and we're seeing better query performance than the other serverless, object storage backed vector DBs. We think we can get way faster too.
Price: We optimized the heck out of this thing with object storage, pre-emptible virtual machines, etc. We've driven our cost down, and we're passing this to the user, starting with a free tier of 100GB. Actual pricing beyond that coming soon.
Accuracy: With our initial testing, we see recall greater or equal to competitors out there, all while being faster.
Flexibility: We are going to have a self managed version for users who want to run on their own infra, but admittedly, we don't have that today. Still working on it.
Other Product Elegance: My co-founder, Talip, made Hazelcast, and I've always been impressed by how easy it is to use and how the end-to-end experience is so elegant. As we continue to develop Vectroid, that same level of polish and focus on the UX will be there. As an example, one neat thing we rolled out is direct import of data from Hugging Face. We have lots of other cool ideas.
Apologies for the long-winded answer. Feel free to ping us with any additional questions.
I run a lot of search-related benchmarks (https://github.com/ashvardanian) and I'm curious whether you've compared to other engines on the same hardware setup, tracing recall, NDCG, indexing, and query speeds.
B200 spec:
* 8TB/sec HBM bandwidth
* 10 PetaOPs assuming int8.
* 186GB of VRAM.
If we work with 512-dimensional int8 embeddings, then we need 512GB of VRAM to hold 1B of them, so assuming we have an 8xB200 node (~$500k+), we can easily hold them (125M vectors per GPU).
It takes about 1000 OPs to compute the dot product between two 512-dim vectors, so a full scan needs 1000 * 1B = 1 TeraOP; spread over 8 GPUs, that's 125 GigaOPs per GPU, a fraction of a ms.
The real bottleneck is data movement out of HBM: with 125M vectors per GPU, i.e. 64GB, we can move them in ~8 ms.
Here you go, the most expensive vector search in history, giving you the same performance as a regular CPU-based vector DB for only 1000x the price.
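For what it's worth, the arithmetic above checks out; a quick script reproducing it (using the commenter's spec figures as given, not verified vendor numbers):

```python
# Back-of-envelope check of the thread's 8xB200 brute-force numbers.
HBM_BW   = 8e12      # bytes/sec of HBM bandwidth per GPU (commenter's figure)
INT8_OPS = 10e15     # int8 ops/sec per GPU (commenter's figure)
N, DIM   = 1_000_000_000, 512
GPUS     = 8

bytes_total   = N * DIM                    # int8 -> 512 GB corpus
bytes_per_gpu = bytes_total / GPUS         # 64 GB per GPU

ops_total   = N * 2 * DIM                  # ~1000 ops per dot product
ops_per_gpu = ops_total / GPUS

compute_ms = ops_per_gpu / INT8_OPS * 1e3  # ~0.01 ms: compute is nearly free
memory_ms  = bytes_per_gpu / HBM_BW * 1e3  # ~8 ms: the scan is bandwidth-bound

print(f"{bytes_per_gpu/1e9:.0f} GB/GPU, compute {compute_ms:.4f} ms, memory {memory_ms:.1f} ms")
```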
That doesn't fit in anyone's video RAM.
Each MI325x has 256 GB of HBM, so you would need ~32 of them at 2 bytes per scalar.
For 1024 dimensions, even with 8-bit quantization, you are looking at a terabyte of data. Let's make it binary vectors; it is still 128GB of VRAM.
WAT?
I went looking around last year and couldn’t really find many options, but I might have been looking in the wrong places.
Elasticsearch is at least good at hiding the Lucene zoo under the hood.
They show that with 4096-dimensional vectors, accuracy starts to fail at 250 million documents (fundamental limits of embedding models). For 512-dim, it's just 500k.
Is 1 billion vectors practical?
If you mostly just want to find a particular single vector if possible and don't care so much what the second-best result is, you can get away with much smaller embeddings.
And if you do want to cover all possible pairs, 6500 dimensions or so should be enough. (Their empirical results roughly fit a cubic polynomial.)
Currently, every new solution is either baked into an existing database (Elastic, pgvector, Mongo, etc) or an entirely separate system (Milvus, now Vectroid, etc.)
There is a clear argument in favor of the pgvector approach, since it simply brings new capabilities to 30 years of battle-tested database tech. That’s more compelling than something like Milvus that has to re-invent “the rest of the database.” And Milvus is also a second system that needs to be kept in sync with the source database.
But pgvector is still _just for Postgres_. It’s nice that it’s an extension, but in the same way Milvus has to reinvent the database, pgvector needs to reinvent the vector engine. I can’t load pgvector into DuckDB as an extension.
Is there any effort to make a pure, Unix-style, batteries not included, “vector engine?” A library with best-in-class index building, retrieval, storage… that can be glued into a Postgres extension just as easily as it can be glued into a DuckDB extension?
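To be fair, FAISS already comes close for the index-building and retrieval parts: it is exactly that kind of embeddable library, with no server in the loop. A minimal sketch with made-up data:

```python
# FAISS as an embeddable "vector engine": build and query an index
# as plain library calls. Toy data; parameters are arbitrary.
import faiss
import numpy as np

DIM = 128
rng = np.random.default_rng(0)
corpus = rng.normal(size=(100_000, DIM)).astype(np.float32)
queries = rng.normal(size=(5, DIM)).astype(np.float32)

index = faiss.IndexHNSWFlat(DIM, 32)  # HNSW graph, 32 neighbors per node
index.add(corpus)                     # index construction happens in-process

distances, ids = index.search(queries, 10)  # top-10 per query
print(ids[0])
```

What it doesn't cover is the storage and format story, which is where the Arrow/Parquet analogy below comes in.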
DataFusion nailed this balance between an embedded query engine and a standalone database system. It brings just the right amount of batteries that it’s not a super generic thing that does nothing useful out of the box, but it doesn’t bring so many that it needs to compete with full database systems.
I believe the maintainers refer to it as “the IR of databases” and I’ve always liked that analogy. That’s what I’d like to see for vector engines.
Maybe what we need as a prerequisite is the equivalent of the Arrow/Parquet ecosystem for vectors. DataFusion really leverages those standards for interoperability and performance. This also goes a long way toward the architectural decisions you reference — Arrow and Parquet are a solid, "good enough" choice for in-memory and storage formats that are efficient, flexible, and well-supported. Is there something similar for vector storage?
Open source at https://github.com/spiceai/spiceai
Used in ClickHouse and a few other DBMS.
Disclaimer: I wrote duckdb-vss
It's crazy how people add bloat and complexity to their stuff just because they want to do medium-scale RAG with ca. 2 million embeddings.
Here comes the punchline: you do not need a fancy vector database in this case. I stumbled over https://github.com/sqliteai/sqlite-vector, a SQLite extension, and I wonder why no one else did this before: it simply implements a highly optimized brute-force search over the vectors, so you get sub-100ms queries over millions of vectors with perfect recall. It uses dynamic runtime dispatch to take advantage of whatever SIMD instructions your CPU has. Turns out this might be all you need. No need for a memory-hungry search index (like HNSW) or writing a huge index to disk (like DiskANN).
> For production or managed service use, please contact SQLite Cloud, Inc for a commercial license.
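License questions aside, the underlying approach is easy to approximate outside SQLite with plain numpy, which gets its SIMD through BLAS. A rough sketch with illustrative sizes (the 2M x 384 corpus below needs ~3 GB of RAM):

```python
# Exhaustive scan with perfect recall: one matvec plus a partial sort.
import numpy as np

N, DIM, K = 2_000_000, 384, 10
rng = np.random.default_rng(0)

# Normalize the corpus so dot product == cosine similarity.
corpus = rng.normal(size=(N, DIM)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

def search(query: np.ndarray, k: int = K) -> np.ndarray:
    """Exact top-k by brute force."""
    q = query / np.linalg.norm(query)
    scores = corpus @ q                   # SIMD-accelerated via BLAS
    top = np.argpartition(-scores, k)[:k] # O(N) selection of k candidates
    return top[np.argsort(-scores[top])]  # order just the k winners

hits = search(rng.normal(size=DIM).astype(np.float32))
```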
https://github.com/duckdb/duckdb-vss
Since DuckDB is already columnar, it goes brrrrr with single-digit-millisecond vector similarity lookups.
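Basic usage looks like this (following the duckdb-vss README conventions; the table and rows here are made up):

```python
# duckdb-vss sketch: HNSW index on a fixed-size FLOAT array column.
import duckdb

con = duckdb.connect()  # in-memory database
con.execute("INSTALL vss")
con.execute("LOAD vss")

con.execute("CREATE TABLE items (id INTEGER, vec FLOAT[3])")
con.execute(
    "INSERT INTO items VALUES "
    "(1, [1.0, 2.0, 3.0]), (2, [2.0, 3.0, 4.0]), (3, [9.0, 9.0, 9.0])"
)
con.execute("CREATE INDEX items_hnsw ON items USING HNSW (vec)")

rows = con.execute(
    "SELECT id, array_distance(vec, [1.0, 2.0, 3.0]::FLOAT[3]) AS dist "
    "FROM items ORDER BY dist LIMIT 2"
).fetchall()
print(rows)  # nearest neighbors of [1, 2, 3]
```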