Lite^3
Perhaps I should have posted this URI instead: https://lite3.io/design_and_limitations.html
Lite^3 deserves to be noticed by HN. u/eliasdejong (the author) posted it 23 days ago but it didn't get very far. I'm hoping this time it gets noticed.
"outperforms the fastest JSON libraries (that make use of SIMD) by up to 120x depending on the benchmark. It also outperforms schema-only formats, such as Google Flatbuffers (242x). Lite³ is possibly the fastest schemaless data format in the world."
^ This should be a bar graph at the top of the page that shows both serialized sizes and speeds.
It would also be nice to see a json representation on the left and a color coded string of bytes on the right that shows how the data is packed.
Then the explanation follows.
I'm sorry we missed that Show HN (https://news.ycombinator.com/item?id=45992832)! It belonged in the SCP (https://news.ycombinator.com/item?id=26998308).
The overridden space is never recovered, causing buffer size to grow indefinitely.
Is the garbage at least zeroed? Otherwise it seems like it could "leak" overwritten values when sending whole buffers via memcpy.

Don't get me wrong, I find this type of data structure interesting and useful, but it's misleading to call it "serialization", unless my understanding is wrong.
How does machine architecture play into it? It sounds like int sizes are the same regardless of the machine's word size; the choices made just happen to perform well on common architectures. Or is it about endianness? Do big-endian machines even exist anymore?
I'm very confused by your comment.
Apache Arrow is trying to do something similar, using Flatbuffer to serialize with zero-copy and zero-parse semantics, and an index structure built on top of that.
Would love to see comparisons with Arrow
A closer comparison would be to FlatBuffers, which is used by Arrow IPC; a major difference is that TRON is schemaless.
The documentation page is out of date; the format now resolves collisions through quadratic probing.
That's not a criticism of anything with lite3, lite3 sounds really cool. The json angle is just an odd part to concentrate on.
First of all, hello Hacker News :)
Many of the comments seem to address the design of key hashing. The reason for using hashed keys inside B-tree nodes instead of the string keys directly is threefold:
1) The implementation is simplified.
2) When performing a lookup, it is faster to compare fixed-size elements than to do variable-length string comparisons.
3) The key length is unlimited.
I should say the documentation page is out of date regarding hash collisions. The format now supports probing thanks to a PR merged yesterday. So inserting colliding keys will actually work.
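For anyone who wants a mental model while the docs catch up, here is a minimal sketch of quadratic probing over hashed keys inside a fixed-size node. The slot count, struct layout, and the use of hash 0 as an "empty" sentinel are all illustrative assumptions on my part, not the actual Lite³ wire format:

```c
#include <stddef.h>
#include <stdint.h>

#define NODE_SLOTS 8  /* hypothetical node capacity */

typedef struct {
    uint32_t hash;   /* 0 means the slot is empty (sketch only) */
    int32_t  value;
} slot_t;

/* Quadratic probing: on a collision at index i, try i+1, i+4, i+9, ...
   (mod the slot count). */
static int node_insert(slot_t *slots, uint32_t hash, int32_t value)
{
    for (uint32_t k = 0; k < NODE_SLOTS; k++) {
        uint32_t i = (hash + k * k) % NODE_SLOTS;
        if (slots[i].hash == 0 || slots[i].hash == hash) {
            slots[i].hash  = hash;
            slots[i].value = value;
            return 0;
        }
    }
    return -1; /* node full: a real B-tree would split the node here */
}

static const slot_t *node_find(const slot_t *slots, uint32_t hash)
{
    for (uint32_t k = 0; k < NODE_SLOTS; k++) {
        uint32_t i = (hash + k * k) % NODE_SLOTS;
        if (slots[i].hash == hash)
            return &slots[i];
        if (slots[i].hash == 0)
            return NULL; /* hit an empty slot: key is absent */
    }
    return NULL;
}
```

Two keys whose hashes are congruent mod the slot count now land in different slots instead of one overwriting the other.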
It is true that databases and other formats do store string keys directly in the nodes. However, as a memory format, runtime performance is very important: there is no disk or I/O latency to 'hide behind'.
Right now the hash function used is DJB2. It has the interesting property of somewhat preserving the lexicographical ordering of the key names. So hashes for keys like "item_0001", "item_0002" and "item_0003" are actually more likely to also be placed sequentially inside the B-tree nodes. This can be useful when doing a sequential scan on the semantic key names, otherwise you are doing a lot more random access. Also DJB2 is so simple that it can be calculated entirely by the C preprocessor at compile time, so you are not actually paying the runtime cost of hashing.
We will be doing a lot more testing before DJB2 is finalized in the spec, but might later end up with a 'better' hash function such as XXH32.
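For the curious, DJB2 is only a few lines. A minimal sketch with the classic constants 5381 and 33 (whether Lite³ uses this exact variant is my assumption), which also shows the locality property mentioned above:

```c
#include <stdint.h>

/* Classic DJB2: h = h * 33 + c, seeded with 5381. Sketch only; the
   exact variant and seed Lite3 settles on may differ. */
static uint32_t djb2(const char *s)
{
    uint32_t h = 5381;
    while (*s)
        h = h * 33 + (uint8_t)*s++;
    return h;
}
```

Because the final step is h * 33 + last_char, keys like "item_0001" and "item_0002" hash to consecutive values, which is exactly what keeps them near each other in the B-tree nodes.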
Finally, TRON/Lite³ compared to other binary JSON formats (BSON, MsgPack, CBOR, Amazon Ion) is different in that:
1) none of the formats mentioned provide direct zero-copy indexed access to the data
2) none of the formats mentioned allow for partial mutation of the data without rewriting most of the document
This last point 2) is especially significant. For example, JSONB in Postgres is immutable: when replacing or inserting one specific value inside an object or array, JSONB rewrites the entire document, even if it is several megabytes in size. If you perform frequent updates inside JSONB documents, this causes severe write amplification. This is the case for all current Postgres versions.
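To make the contrast concrete, here is a toy sketch. The document size, field offset, and flat layout are invented for illustration, not JSONB's or Lite³'s actual encoding; the point is only the byte counts:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Pretend one u32 field lives at a fixed offset inside a 1 MiB document. */
enum { DOC_SIZE = 1 << 20, FIELD_OFF = 1000 };

/* Mutable format: patch the field in place. Bytes written: 4. */
static size_t update_in_place(uint8_t *doc, uint32_t v)
{
    memcpy(doc + FIELD_OFF, &v, sizeof v);
    return sizeof v;
}

/* Immutable, JSONB-style format: re-emit the whole document with the
   new value. Bytes written: DOC_SIZE, no matter how small the change. */
static size_t update_by_rewrite(const uint8_t *old, uint8_t *fresh, uint32_t v)
{
    memcpy(fresh, old, DOC_SIZE);
    memcpy(fresh + FIELD_OFF, &v, sizeof v);
    return DOC_SIZE;
}
```

For this toy 1 MiB document, a 4-byte change costs 4 bytes in place versus the full megabyte on rewrite, a ~260,000x difference in bytes written.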
TRON/Lite³ is designed to blur the line between memory and serialization format.
What you should be imagining instead is a document database entirely built around Lite³-encoded documents, using something like rollback journals instead of MVCC.
We're doing something similar at my company, storing zero-serialization immutable [1] docs in a key-value store (read via mmap, with zero copies between disk and point of use) and using a mutable [2] overlay patch format for updates. In our analytics use cases, compact storage is very important, in-place mutability is irrelevant (again because of copy-on-write at the key-value store level), and the key advantage is zero serialization overhead.
What I'm saying is that Lite³ is a very timely and forward-looking format, but merging immutable and mutable formats into one carries tradeoffs that you probably want to discuss, and a discussion of the appropriate use cases is very much worth having.
[1] https://github.com/andreyvit/edb/blob/main/kvo/immutable.go [2] https://github.com/andreyvit/edb/blob/main/kvo/mutable.go
A serialization format does not care about versioning or rollbacks. It is simply trying to organize data such that it can be sent over a network. If updates can be made in-place without requiring re-serialization, then that is always a benefit.
Write amplification is still a fact, however, that I think deserves to be mentioned. To tackle this problem in the context of DBs/MVCC, you would have to use techniques other than in-place mutation, like you mention: overlay/COW. Basically, LMDB-style.
And yes, I think databases are where this technology will eventually have the greatest potential, so that is where I am also looking.
You will be able to access the same data in different languages using APIs specific to your language.
Right now someone has already made a (private) Go port, and a Rust port is also in the works.
I was about to troll you on this point. I'm glad I wasn't too hasty. Looking forward to trying it out once it's there!
It's protobuf/gRPC-based but uses JSON for serialization to make use of SQLite's JSON filtering functionality. However, it cannot be said to be zero-copy. It serializes binary protos into JSON and stores the binary protos directly for fast access, which lets you skip deserialization when pulling out query results.
It's just dishonest.