Ultra Efficient Vector Extension for SQLite
Posted 3 months ago · Active 3 months ago
marcobambini.substack.com · Tech story · High profile
Tone: calm, mixed
Debate: 60/100
Key topics: SQLite, Vector Databases, Database Performance
The post discusses a new vector extension for SQLite, which has impressive performance but raises concerns about its non-open-source license and potential use cases.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 3d after posting
Peak period: 40 comments in 72-84h
Avg / period: 9.3
Comment distribution: 56 data points (based on 56 loaded comments)
Key moments
- Story posted: Sep 23, 2025 at 10:33 AM EDT (3 months ago)
- First comment: Sep 26, 2025 at 11:34 AM EDT (3d after posting)
- Peak activity: 40 comments in 72-84h (hottest window of the conversation)
- Latest activity: Sep 29, 2025 at 2:21 AM EDT (3 months ago)
Note that this one isn't open source: https://github.com/sqliteai/sqlite-vector/blob/main/LICENSE....
The announcement says:
> We believe the community could benefit from sqlite-vector, which is why we’ve made it entirely free for open-source projects.
But that's not really how this works. If my open source projects incorporate aspects of this project, they're no longer open source!
I guess I'll either stick with sqlite-vec or give Turso another look. I'm not fond of the idea of a SQLite fork though.
Do you know of anything else I should take a look at? I know you use a lot of this stuff for your open-source AI/ML work. I'd like something I can use on device.
Apache 2 is just an example here - the same would apply for practically any open source license.
The one place I imagine it could still work is if the open-source project, say a SQLite browser, includes it as an optional plugin. So the project itself stays open-source, but the grant allows using the proprietary plugin with it.
If they depend on software that carries limitations, I can no longer make that promise to my own users.
Or does their extra license term mean I can ship my own project which is the thinnest possible wrapper around theirs but makes it fully open source? That seems unlikely.
In this aggregate form, there is little difference between pseudocode snippets in a post like this one and a well-maintained library getting scraped.
The more I think about it, I don’t even really crave credit so much as the feedback loop that tells me whether I’m doing anything useful.
I haven’t solved this contradiction, so I still release under the MIT license.
But not for the lack of trying - people keep trying to redefine it...
The parent only talked about ‘open source’, which has a huge overlap with Free Software, but the two still have different formal definitions (not to mention the completely different ideas behind them). This still left the (admittedly unlikely) possibility of the software in question being Free (as in freedom), so I felt it worth pointing out it wasn’t that, either. A common way to talk about software which is explicitly both Free and open-source at the same time is to call it Free and Open-Source Software.
I mean, can you name any licenses that are one or the other but not both?
And I explicitly don't mean whether one of OSI or FSF approved a license when the other rejected it, because sometimes they make that decision based on nitpicks and not because of differences in principles.
I wrote a big tutorial about embeddings a couple of years ago which still holds up today: https://simonwillison.net/2023/Oct/23/embeddings/
Vector databases are often over-optimized for getting results into papers (where the be-all-end-all measure is recall@latency benchmarking). Indexed search is very rigid - filtered search is a pain, and you're stuck with the metric you used for indexing.
At smaller data scales, you get a lot more flexibility, and shoving things into an indexed search mindlessly will lead to a lot of pain. Providing optimized flexible search at smaller scales is quite valuable, imo.
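For a concrete illustration of that flexibility: with brute-force search, the metric is a per-query choice rather than something baked into an index at build time. A minimal sketch, assuming the sqlite-vec extension (for its vec_distance_* SQL functions) and an invented two-row table:

```python
import sqlite3
import struct

import sqlite_vec  # pip install sqlite-vec

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vec.load(db)
db.enable_load_extension(False)

def f32(vec):
    # sqlite-vec's distance functions take little-endian float32 BLOBs
    return struct.pack(f"<{len(vec)}f", *vec)

db.execute("CREATE TABLE items(id INTEGER PRIMARY KEY, v BLOB)")
db.executemany(
    "INSERT INTO items(v) VALUES (?)",
    [(f32([1.0, 0.0]),), (f32([0.6, 0.8]),)],
)

q = f32([0.9, 0.1])
# Same stored vectors, a different metric per query -- nothing to re-index.
for metric in ("vec_distance_L2", "vec_distance_cosine"):
    print(metric, db.execute(
        f"SELECT id, {metric}(v, ?) AS d FROM items ORDER BY d LIMIT 1", (q,)
    ).fetchone())
```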
A small note: since the project seems to include @maratyszcza's fp16 library (MIT), it might be nice to add a line of attribution: https://github.com/maratyszcza/fp16
And if you ever feel like broadening the benchmarks, it could be interesting to compare with USearch. It has had an SQLite connector for a while and covers a wide set of hardware backends and similarity metrics, though it's not yet as straightforward to use on some OSes: https://unum-cloud.github.io/usearch/sqlite
It's geared towards bioacoustics in particular. It's pretty easy to provide a wrapper for any given model though. Feel free to send a ping if you try it out for speech; will be happy to hear how it goes.
In addition, if things are brute forced, wouldn't a columnar db perform better than a row-based one? e.g., DuckDB?
Not really, you can just call the distance function directly and your vector blob can be in any regular field in any regular table, like what I do. Works great.
More info: https://github.com/asg017/sqlite-vec/issues/196
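A hedged sketch of that pattern: the embedding lives in an ordinary BLOB column next to the rest of the row, and the search is a plain SELECT. Schema and data are invented here, and sqlite-vec is assumed for the vec_distance_cosine() function:

```python
import sqlite3
import struct

import sqlite_vec  # provides the vec_distance_cosine() SQL function

def f32(vec):
    return struct.pack(f"<{len(vec)}f", *vec)  # little-endian float32 BLOB

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vec.load(db)
db.enable_load_extension(False)

# The embedding is just another column in a regular table.
db.execute("""CREATE TABLE docs(
    id INTEGER PRIMARY KEY,
    category TEXT,
    body TEXT,
    embedding BLOB)""")
db.executemany(
    "INSERT INTO docs(category, body, embedding) VALUES (?, ?, ?)",
    [("news", "first", f32([0.1, 0.9, 0.0])),
     ("blog", "second", f32([0.8, 0.1, 0.1]))],
)

# An ordinary WHERE clause prunes candidates before the distance scan,
# something rigid ANN indexes struggle with.
rows = db.execute(
    """SELECT id, body, vec_distance_cosine(embedding, ?) AS dist
       FROM docs
       WHERE category = 'news'
       ORDER BY dist
       LIMIT 5""",
    (f32([0.1, 0.8, 0.1]),),
).fetchall()
print(rows)
```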
But it's still a wonderful example of how far you can get with brute-force vector similarity search that has been optimized like crazy by making use of SIMD. Not having to use a fancy index is such a blessing: think of all the complexities you don't have when not using an index. There are no additional worries about memory and/or disk usage or insert/delete/update costs, and you can make full use of SQL filters.

It's crazy to me what kind of vector DBs people put up with. They use custom query languages and have to hold huge indices in memory or write humongous indices to disk, for something that's not necessarily faster than brute-force search for tables with <5M rows. And let's be honest, who has those gigantic tables with more than 5M rows? Not too many, I'd assume.
/Looks around all innocent ... does 57 billion count? Hate to tell ya, but there are plenty of use cases where large datasets and normal table design will get out of hand. And row overhead will bite!
We ended up writing a specialized encoder/decoder to store the information in bytea fields to reduce it to a measly 3 billion rows (row packing is the better term), but we also lost some of the advantages of having the database B-tree index on some of the date fields.
Thing is, the moment you deal with big data, things start to collapse fast when you're juggling brute force vs. storage vs. index handling.
I can think of a lot more projects that can expand to crazy numbers if you follow basic database normalization.
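A rough sketch of that row-packing idea (the record layout below is invented, not the poster's actual encoder): many small logical rows get serialized into one blob per stored row, so per-row overhead is paid once, at the cost of losing B-tree indexing on the packed fields.

```python
import struct

# Pack many (unix_seconds, float32 value) samples into a single BLOB.
# One stored row now carries N logical rows, so per-row overhead is
# paid once instead of N times -- but the packed timestamps can no
# longer be served by a B-tree index.
RECORD = struct.Struct("<If")  # 4-byte timestamp + 4-byte float = 8 bytes/record

def encode(samples):
    return b"".join(RECORD.pack(ts, val) for ts, val in samples)

def decode(blob):
    return [RECORD.unpack_from(blob, off) for off in range(0, len(blob), RECORD.size)]

packed = encode([(1_700_000_000, 1.5), (1_700_000_060, 2.25)])
assert decode(packed) == [(1_700_000_000, 1.5), (1_700_000_060, 2.25)]
```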
That surely counts as a case where brute-force search will not do :) I'm intrigued though: do you really need to search over all those vectors, or could you filter the candidates down to something <5M? As I wrote, this is one of the nice advantages of no-index brute-force search: you can use good ol' SQL WHERE clauses to limit the number of candidates in many cases, and then the brute-force search is not as expensive. Complex indices like HNSW or DiskANN don't play as nicely with filters.
I know that open source projects rarely protect their trademarks, so maybe they are legally in the clear, but this still feels like it borders on fraud.
Seriously, look at their logo. This seems like a clear attempt to mislead consumers.
https://fossa.com/blog/dual-licensing-models-explained/
The maintainer of libxml2, Nick Wellnhofer, will be moving all his future contributions over to an AGPL fork as corporate users of libxml2 were unwilling to contribute financially.
https://news.ycombinator.com/item?id=45288488
https://gitlab.gnome.org/GNOME/libxml2/-/issues/976#note_253...
I am, however, still looking for a fast CPU-bound embedding model (fastembed is quite slow on small ARM chips).
[1] https://news.ycombinator.com/item?id=44063950