Vector Database That Can Index 1b Vectors in 48m
Posted 4 months ago · Active 4 months ago
vectroid.com · Tech · story · High profile
Sentiment: skeptical/mixed · Debate: 80/100
Key topics
Vector Databases
Database Performance
AI/ML Infrastructure
Vectroid claims to have built a vector database that can index 1 billion vectors in 48 minutes, sparking discussion about the practicality and necessity of such performance, as well as comparisons to existing solutions.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 40m after posting
Peak period: 39 comments in 0-6h
Avg / period: 10.8
Comment distribution: 65 data points (based on 65 loaded comments)
Key moments
1. Story posted: Sep 12, 2025 at 12:56 PM EDT (4 months ago)
2. First comment: Sep 12, 2025 at 1:36 PM EDT (40m after posting)
3. Peak activity: 39 comments in 0-6h (hottest window of the conversation)
4. Latest activity: Sep 15, 2025 at 1:08 PM EDT (4 months ago)
ID: 45224141 · Type: story · Last synced: 11/20/2025, 2:36:48 PM
Nitpick: could be wrong but I don’t think minutes is an SI derived unit.
Maybe not impossible using shared/lossy storage if they were sparsely scattered over a large space?
But anyways - minutes. Thanks.
Edit: Gemini suggested that this sort of (lossy) storage size could be achieved using "Product Quantization" (sub-vectors, clustering, cluster indices), giving an example of 256-dimensional vectors being stored at an average of 6 bits per vector, with ANN being one application that might use this.
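For readers unfamiliar with the technique, here is a minimal Product Quantization sketch in Python (toy sizes chosen for illustration; this is not Vectroid's or Gemini's exact scheme): each vector is split into sub-vectors, each subspace is clustered with k-means, and only the per-subspace centroid indices are stored.

```python
# Minimal Product Quantization sketch: split each vector into M sub-vectors,
# k-means each subspace, store only the uint8 centroid indices.
import numpy as np
from sklearn.cluster import KMeans

DIM, M, K = 256, 8, 256           # 8 sub-vectors of 32 dims, 256 centroids each
SUB = DIM // M                    # codes take 8 bytes/vector vs 1 KiB float32

rng = np.random.default_rng(0)
train = rng.normal(size=(10_000, DIM)).astype(np.float32)

# Train one codebook per subspace.
codebooks = [
    KMeans(n_clusters=K, n_init=1, random_state=0)
    .fit(train[:, m * SUB:(m + 1) * SUB])
    for m in range(M)
]

def encode(x: np.ndarray) -> np.ndarray:
    """Compress a batch of vectors to M uint8 centroid indices each."""
    codes = [cb.predict(x[:, m * SUB:(m + 1) * SUB]) for m, cb in enumerate(codebooks)]
    return np.stack(codes, axis=1).astype(np.uint8)

def decode(codes: np.ndarray) -> np.ndarray:
    """Reconstruct (lossily) by concatenating the chosen centroids."""
    parts = [codebooks[m].cluster_centers_[codes[:, m]] for m in range(M)]
    return np.concatenate(parts, axis=1)

codes = encode(train[:100])       # 100 x 8 bytes instead of 100 x 1024 bytes
approx = decode(codes)            # lossy reconstruction, used for ANN scoring
```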
1. Has a technical system they think could be worth a fortune to large enterprises, containing at least a few insights novel to the industry.
2. Knows that competitors and open source alternatives could copy/implement these in a year or so if the product starts off open source.
3. Has to put food on the table and doesn’t want to give massive corporations extremely valuable software for free.
Open source has its place, but it is IMO one of the ways to give monopolies massive value for free. There are plenty of open source alternatives around for vector DBs. Do we (developers) need to give everything away to the rich? (not)
Secondly, as far as I know, the blocker with approximate nearest neighbor search is often not insertion, but search. And if this search was worth a fortune to me, I'd simply embarrassingly parallelize it on CPUs or on GPUs.
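That parallelization is straightforward to sketch: shard the corpus, scan each shard concurrently, and merge the per-shard top-k. A rough Python illustration (toy data; the shard count and sizes are arbitrary assumptions):

```python
# "Embarrassingly parallel" exact search: scan shards concurrently
# (numpy/BLAS releases the GIL), then merge the per-shard top-k.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

rng = np.random.default_rng(0)
shards = [rng.normal(size=(250_000, 128)).astype(np.float32) for _ in range(4)]
query = rng.normal(size=128).astype(np.float32)
K = 10

def topk_in_shard(args):
    shard_id, shard = args
    scores = shard @ query                 # one BLAS matvec per shard
    idx = np.argpartition(-scores, K)[:K]  # local top-k, unordered
    return [(float(scores[i]), shard_id, int(i)) for i in idx]

with ThreadPoolExecutor(max_workers=4) as pool:
    partial = [hit for hits in pool.map(topk_in_shard, enumerate(shards)) for hit in hits]

# The global top-k is just the best of the per-shard candidates.
best = sorted(partial, key=lambda h: -h[0])[:K]
```

The same structure maps directly onto GPUs or multiple machines: each worker only needs its shard and the query, and the merge step is tiny.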
Vectroid co-founder here. We're huge fans of open source. My co-founder, Talip, made Hazelcast, which is open source.
It might make sense to open source all or part of Vectroid at some point in the future, but at the moment, we feel that would slow us down.
I hate vendor lock-in just as much as the next person. I believe data portability is the ACTUAL counter to vendor lock-in. If we have clean APIs to get your data in, get your data out, and the ability to bulk export your data (which we need to implement soon!), then there's less of a concern, in my opinion.
I also totally understand and respect that some people only want open source software. I'm certainly like that w/ my homelab setup! Except for Plex... Love Plex... Usually.
Today, the differences are going to be performance, price, accuracy, flexibility, and some intangible UI elegance.
Performance: We actually INITIALLY built Vectroid for the use-case of billions of vectors and near single digit millisecond latency. During the process of building and talking to users, we found that there are just not that many use-cases (yet!) that are at that scale and require that latency. We still believe the market will get there, but it's not there today. So we re-focused on building a general purpose vector search platform, but we stayed close to our high performance roots, and we're seeing better query performance than the other serverless, object storage backed vector DBs. We think we can get way faster too.
Price: We optimized the heck out of this thing with object storage, pre-emptible virtual machines, etc. We've driven our cost down, and we're passing this to the user, starting with a free tier of 100GB. Actual pricing beyond that coming soon.
Accuracy: With our initial testing, we see recall greater or equal to competitors out there, all while being faster.
Flexibility: We are going to have a self managed version for users who want to run on their own infra, but admittedly, we don't have that today. Still working on it.
Other Product Elegance: My co-founder, Talip, made Hazelcast, and I've always been impressed by how easy it is to use and how the end-to-end experience is so elegant. As we continue to develop Vectroid, that same level of polish and focus on the UX will be there. As an example, one neat thing we rolled out is direct import of data from Hugging Face. We have lots of other cool ideas.
Apologies for the long-winded answer. Feel free to ping us with any additional questions.
I run a lot of search-related benchmarks (https://github.com/ashvardanian) and I'm curious whether you've compared to other engines on the same hardware setup, tracing recall, NDCG, indexing, and query speeds.
B200 spec:
* 8TB/sec HBM bandwidth
* 10 PetaOPs assuming int8.
* 186GB of VRAM.
If we work with 512-dimensional int8 embeddings, then we need 512GB of VRAM to hold 1B of them, so assuming we have an 8xB200 node (~$500k+), we can easily hold them (125M vectors per GPU).
It takes about 1000 OPs to compute the dot product between two 512-dim vectors, so a full scan needs 1000 * 1B = 1 TeraOP; spread over 8 GPUs, that's 125 GigaOPs per GPU, a fraction of a ms.
The real bottleneck is data movement out of HBM: with 125M vectors per GPU, i.e. 64GB, we can move them in ~8 ms.
Here you go, the most expensive vector search in history, giving you the same performance as a regular CPU-based vector DB for only 1000x the price.
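For what it's worth, the arithmetic above checks out; a quick script reproducing it (using the commenter's spec figures as given, not verified vendor numbers):

```python
# Back-of-envelope check of the thread's 8xB200 brute-force numbers.
HBM_BW   = 8e12      # bytes/sec of HBM bandwidth per GPU (commenter's figure)
INT8_OPS = 10e15     # int8 ops/sec per GPU (commenter's figure)
N, DIM   = 1_000_000_000, 512
GPUS     = 8

bytes_total   = N * DIM                    # int8 -> 512 GB corpus
bytes_per_gpu = bytes_total / GPUS         # 64 GB per GPU

ops_total   = N * 2 * DIM                  # ~1000 ops per dot product
ops_per_gpu = ops_total / GPUS

compute_ms = ops_per_gpu / INT8_OPS * 1e3  # ~0.01 ms: compute is nearly free
memory_ms  = bytes_per_gpu / HBM_BW * 1e3  # ~8 ms: the scan is bandwidth-bound

print(f"{bytes_per_gpu/1e9:.0f} GB/GPU, compute {compute_ms:.4f} ms, memory {memory_ms:.1f} ms")
```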
That doesn't fit in anyone's video RAM.
Each MI325x has 256 GB of HBM, so you would need ~32 of them at 2 bytes per scalar.
For 1024 dimensions, even with 8-bit quantization, you are looking at a terabyte of data. Let's make it binary vectors; it is still 128GB of VRAM.
WAT?
I went looking around last year and couldn’t really find many options, but I might have been looking in the wrong places.
Elasticsearch is at least good at hiding the Lucene zoo under the hood.
They show that with 4096-dimensional vectors, accuracy starts to fail at 250 million documents (fundamental limits of embedding models). For 512-dim, it's just 500k.
Is 1 billion vectors practical?
If you mostly just want to find a particular single vector if possible and don't care so much what the second-best result is, you can get away with much smaller embeddings.
And if you do want to cover all possible pairs, 6500 dimensions or so should be enough. (Their empirical results roughly fit a cubic polynomial.)
Currently, every new solution is either baked into an existing database (Elastic, pgvector, Mongo, etc) or an entirely separate system (Milvus, now Vectroid, etc.)
There is a clear argument in favor of the pgvector approach, since it simply brings new capabilities to 30 years of battle-tested database tech. That’s more compelling than something like Milvus that has to re-invent “the rest of the database.” And Milvus is also a second system that needs to be kept in sync with the source database.
But pgvector is still _just for Postgres_. It’s nice that it’s an extension, but in the same way Milvus has to reinvent the database, pgvector needs to reinvent the vector engine. I can’t load pgvector into DuckDB as an extension.
Is there any effort to make a pure, Unix-style, batteries not included, “vector engine?” A library with best-in-class index building, retrieval, storage… that can be glued into a Postgres extension just as easily as it can be glued into a DuckDB extension?
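To be fair, FAISS already comes close for the index-building and retrieval parts: it is exactly that kind of embeddable library, with no server in the loop. A minimal sketch with made-up data:

```python
# FAISS as an embeddable "vector engine": build and query an index
# as plain library calls. Toy data; parameters are arbitrary.
import faiss
import numpy as np

DIM = 128
rng = np.random.default_rng(0)
corpus = rng.normal(size=(100_000, DIM)).astype(np.float32)
queries = rng.normal(size=(5, DIM)).astype(np.float32)

index = faiss.IndexHNSWFlat(DIM, 32)  # HNSW graph, 32 neighbors per node
index.add(corpus)                     # index construction happens in-process

distances, ids = index.search(queries, 10)  # top-10 per query
print(ids[0])
```

What it doesn't cover is the storage and format story, which is where the Arrow/Parquet analogy below comes in.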
DataFusion nailed this balance between an embedded query engine and a standalone database system. It brings just the right amount of batteries that it’s not a super generic thing that does nothing useful out of the box, but it doesn’t bring so many that it needs to compete with full database systems.
I believe the maintainers refer to it as “the IR of databases” and I’ve always liked that analogy. That’s what I’d like to see for vector engines.
Maybe what we need as a prerequisite is the equivalent of the Arrow/Parquet ecosystem for vectors. DataFusion really leverages those standards for interoperability and performance. This also goes a long way toward the architectural decisions you reference — Arrow and Parquet are a solid, "good enough" choice for in-memory and storage formats that are efficient, flexible, and well-supported. Is there something similar for vector storage?
Open source at https://github.com/spiceai/spiceai
Used in ClickHouse and a few other DBMS.
Disclaimer: I wrote duckdb-vss
It's crazy how people add bloat and complexity to their stuff just because they want to do medium-scale RAG with ca. 2 million embeddings.
Here comes the punchline: you do not need a fancy vector database in this case. I stumbled over https://github.com/sqliteai/sqlite-vector, a SQLite extension, and I wonder why no one else did this before: it simply implements a highly optimized brute-force search over the vectors, so you get sub-100ms queries over millions of vectors with perfect recall. It uses dynamic runtime dispatch to take advantage of whatever SIMD instructions your CPU has. Turns out this might be all you need. No need for a memory-hungry search index (like HNSW) or writing a huge index to disk (like DiskANN).
> For production or managed service use, please contact SQLite Cloud, Inc for a commercial license.
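License questions aside, the underlying approach is easy to approximate outside SQLite with plain numpy, which gets its SIMD through BLAS. A rough sketch with illustrative sizes (the 2M x 384 corpus below needs ~3 GB of RAM):

```python
# Exhaustive scan with perfect recall: one matvec plus a partial sort.
import numpy as np

N, DIM, K = 2_000_000, 384, 10
rng = np.random.default_rng(0)

# Normalize the corpus so dot product == cosine similarity.
corpus = rng.normal(size=(N, DIM)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

def search(query: np.ndarray, k: int = K) -> np.ndarray:
    """Exact top-k by brute force."""
    q = query / np.linalg.norm(query)
    scores = corpus @ q                   # SIMD-accelerated via BLAS
    top = np.argpartition(-scores, k)[:k] # O(N) selection of k candidates
    return top[np.argsort(-scores[top])]  # order just the k winners

hits = search(rng.normal(size=DIM).astype(np.float32))
```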
https://github.com/duckdb/duckdb-vss
Since DuckDB is already columnar, it goes brrrrr with single-digit-millisecond vector similarity lookups.
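Basic usage looks like this (following the duckdb-vss README conventions; the table and rows here are made up):

```python
# duckdb-vss sketch: HNSW index on a fixed-size FLOAT array column.
import duckdb

con = duckdb.connect()  # in-memory database
con.execute("INSTALL vss")
con.execute("LOAD vss")

con.execute("CREATE TABLE items (id INTEGER, vec FLOAT[3])")
con.execute(
    "INSERT INTO items VALUES "
    "(1, [1.0, 2.0, 3.0]), (2, [2.0, 3.0, 4.0]), (3, [9.0, 9.0, 9.0])"
)
con.execute("CREATE INDEX items_hnsw ON items USING HNSW (vec)")

rows = con.execute(
    "SELECT id, array_distance(vec, [1.0, 2.0, 3.0]::FLOAT[3]) AS dist "
    "FROM items ORDER BY dist LIMIT 2"
).fetchall()
print(rows)  # nearest neighbors of [1, 2, 3]
```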