Ultra Efficient Vector Extension for SQLite
Posted 3 months ago · Active 3 months ago
marcobambini.substack.com · Tech story · High profile
Tone: calm, mixed
Debate: 60/100
Key topics: SQLite, Vector Databases, Database Performance
The post discusses a new vector extension for SQLite, which has impressive performance but raises concerns about its non-open-source license and potential use cases.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 3d after posting
Peak period: 40 comments in 72-84h
Avg / period: 9.3
Comment distribution: 56 data points (based on 56 loaded comments)
Key moments
- Story posted: Sep 23, 2025 at 10:33 AM EDT (3 months ago)
- First comment: Sep 26, 2025 at 11:34 AM EDT (3d after posting)
- Peak activity: 40 comments in 72-84h (hottest window of the conversation)
- Latest activity: Sep 29, 2025 at 2:21 AM EDT (3 months ago)
Note that this one isn't open source: https://github.com/sqliteai/sqlite-vector/blob/main/LICENSE....
The announcement says:
> We believe the community could benefit from sqlite-vector, which is why we’ve made it entirely free for open-source projects.
But that's not really how this works. If my open source projects incorporate aspects of this project, they're no longer open source!
I guess I'll either stick with sqlite-vec or give Turso another look. I'm not fond of the idea of a SQLite fork though.
Do you know of anything else I should take a look at? I know you use a lot of this stuff for your open-source AI/ML work. I'd like something I can use on device.
Apache 2 is just an example here - the same would apply for practically any open source license.
The one place I imagine it could still work is if the open-source project, say a SQLite browser, includes it as an optional plugin. So the project itself stays open-source, but the grant allows using the proprietary plugin with it.
If they depend on software that carries limitations, I can no longer make that promise to my own users.
Or does their extra license term mean I can ship my own project which is the thinnest possible wrapper around theirs but makes it fully open source? That seems unlikely.
In this aggregate form, there is little difference between pseudocode snippets in a post like this one and a well-maintained library getting scraped.
The more I think about it, I don’t even really crave credit so much as the feedback loop that tells me whether I’m doing anything useful.
I haven’t solved this contradiction, so I still release under the MIT license.
But not for the lack of trying - people keep trying to redefine it...
The parent only talked about ‘open source’, which has a huge overlap with Free Software, but the two still have different formal definitions (not to mention the completely different ideas behind them). This still left the (admittedly unlikely) possibility of the software in question being Free (as in freedom), so I felt it worth pointing out it wasn’t that, either. A common way to talk about software which is explicitly both Free and open-source at the same time is to call it Free and Open-Source Software.
I mean, can you name any licenses that are one or the other but not both?
And I explicitly don't mean whether one of OSI or FSF approved a license when the other rejected it, because sometimes they make that decision based on nitpicks and not because of differences in principles.
I wrote a big tutorial about embeddings a couple of years ago which still holds up today: https://simonwillison.net/2023/Oct/23/embeddings/
Vector databases are often over-optimized for getting results into papers (where the be-all-end-all measure is recall@latency benchmarking). Indexed search is very rigid - filtered search is a pain, and you're stuck with the metric you used for indexing.
At smaller data scales, you get a lot more flexibility, and shoving things into an indexed search mindlessly will lead to a lot of pain. Providing optimized flexible search at smaller scales is quite valuable, imo.
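For a concrete illustration of that flexibility: with brute-force search, the metric is a per-query choice rather than something baked into an index at build time. A minimal sketch, assuming the sqlite-vec extension (for its vec_distance_* SQL functions) and an invented two-row table:

```python
import sqlite3
import struct

import sqlite_vec  # pip install sqlite-vec

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vec.load(db)
db.enable_load_extension(False)

def f32(vec):
    # sqlite-vec's distance functions take little-endian float32 BLOBs
    return struct.pack(f"<{len(vec)}f", *vec)

db.execute("CREATE TABLE items(id INTEGER PRIMARY KEY, v BLOB)")
db.executemany(
    "INSERT INTO items(v) VALUES (?)",
    [(f32([1.0, 0.0]),), (f32([0.6, 0.8]),)],
)

q = f32([0.9, 0.1])
# Same stored vectors, a different metric per query -- nothing to re-index.
for metric in ("vec_distance_L2", "vec_distance_cosine"):
    print(metric, db.execute(
        f"SELECT id, {metric}(v, ?) AS d FROM items ORDER BY d LIMIT 1", (q,)
    ).fetchone())
```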
A small note: since the project seems to include @maratyszcza's fp16 library (MIT), it might be nice to add a line of attribution: https://github.com/maratyszcza/fp16
And if you ever feel like broadening the benchmarks, it could be interesting to compare with USearch. It has had an SQLite connector for a while and covers a wide set of hardware backends and similarity metrics, though it's not yet as straightforward to use on some OSes: https://unum-cloud.github.io/usearch/sqlite
It's geared towards bioacoustics in particular. It's pretty easy to provide a wrapper for any given model though. Feel free to send a ping if you try it out for speech; will be happy to hear how it goes.
In addition, if things are brute forced, wouldn't a columnar db perform better than a row-based one? e.g., DuckDB?
Not really, you can just call the distance function directly and your vector blob can be in any regular field in any regular table, like what I do. Works great.
More info: https://github.com/asg017/sqlite-vec/issues/196
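A hedged sketch of that pattern: the embedding lives in an ordinary BLOB column next to the rest of the row, and the search is a plain SELECT. Schema and data are invented here, and sqlite-vec is assumed for the vec_distance_cosine() function:

```python
import sqlite3
import struct

import sqlite_vec  # provides the vec_distance_cosine() SQL function

def f32(vec):
    return struct.pack(f"<{len(vec)}f", *vec)  # little-endian float32 BLOB

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vec.load(db)
db.enable_load_extension(False)

# The embedding is just another column in a regular table.
db.execute("""CREATE TABLE docs(
    id INTEGER PRIMARY KEY,
    category TEXT,
    body TEXT,
    embedding BLOB)""")
db.executemany(
    "INSERT INTO docs(category, body, embedding) VALUES (?, ?, ?)",
    [("news", "first", f32([0.1, 0.9, 0.0])),
     ("blog", "second", f32([0.8, 0.1, 0.1]))],
)

# An ordinary WHERE clause prunes candidates before the distance scan,
# something rigid ANN indexes struggle with.
rows = db.execute(
    """SELECT id, body, vec_distance_cosine(embedding, ?) AS dist
       FROM docs
       WHERE category = 'news'
       ORDER BY dist
       LIMIT 5""",
    (f32([0.1, 0.8, 0.1]),),
).fetchall()
print(rows)
```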
But it's still a wonderful example of how far you can get with brute-force vector similarity search that has been optimized like crazy by making use of SIMD. Not having to use a fancy index is such a blessing: think of all the complexities you don't have when not using an index. There are no additional worries about memory and/or disk usage or insert/delete/update costs, and you can make full use of SQL filters.

It's crazy to me what kind of vector DBs people put up with. They use custom query languages and have to hold huge indices in memory or write humongous indices to disk, for something that's not necessarily faster than brute-force search for tables with <5M rows. And let's be honest, who has those gigantic tables with more than 5M rows? Not too many, I'd assume.
/Looks around all innocent ... does 57 billion count? Hate to tell ya, but there are plenty of use cases where large datasets and normal table design will get out of hand. And row overhead will bite!
We ended up writing a specialized encoder/decoder to store the information in bytea fields to reduce it to a measly 3 billion rows (row packing is the better term), but we also lost some of the advantages of having the database B-tree index on some of the date fields.
Thing is, the moment you deal with big data, things start to collapse fast when you're juggling brute force vs. storage vs. index handling.
I can think of a lot more projects that can expand to crazy numbers if you follow basic database normalization.
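A rough sketch of that row-packing idea (the record layout below is invented, not the poster's actual encoder): many small logical rows get serialized into one blob per stored row, so per-row overhead is paid once, at the cost of losing B-tree indexing on the packed fields.

```python
import struct

# Pack many (unix_seconds, float32 value) samples into a single BLOB.
# One stored row now carries N logical rows, so per-row overhead is
# paid once instead of N times -- but the packed timestamps can no
# longer be served by a B-tree index.
RECORD = struct.Struct("<If")  # 4-byte timestamp + 4-byte float = 8 bytes/record

def encode(samples):
    return b"".join(RECORD.pack(ts, val) for ts, val in samples)

def decode(blob):
    return [RECORD.unpack_from(blob, off) for off in range(0, len(blob), RECORD.size)]

packed = encode([(1_700_000_000, 1.5), (1_700_000_060, 2.25)])
assert decode(packed) == [(1_700_000_000, 1.5), (1_700_000_060, 2.25)]
```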
That surely counts as a case where brute-force search will not do :) I'm intrigued though: do you really need to search over all those vectors, or could you filter the candidates down to something <5M? As I wrote, this is one of the nice advantages of no-index brute-force search: you can use good ol' SQL WHERE clauses to limit the number of candidates in many cases, and then the brute-force search is not as expensive. Complex indices like HNSW or DiskANN don't play as nicely with filters.
I know that open source projects rarely protect their trademarks, so maybe they are legally in the clear, but this still feels like it borders on fraud.
Seriously, look at their logo. This seems like a clear attempt to mislead consumers.
https://fossa.com/blog/dual-licensing-models-explained/
The maintainer of libxml2, Nick Wellnhofer, will be moving all his future contributions over to an AGPL fork as corporate users of libxml2 were unwilling to contribute financially.
https://news.ycombinator.com/item?id=45288488
https://gitlab.gnome.org/GNOME/libxml2/-/issues/976#note_253...
I am, however, still looking for a fast CPU-bound embedding model (fastembed is quite slow on small ARM chips).
[1] https://news.ycombinator.com/item?id=44063950