Kaitai Struct: declarative binary format parsing language
Mood: supportive
Sentiment: positive
Category: tech
Key topics: Binary Parsing, Kaitai Struct, Data Formats
Kaitai Struct is a declarative binary format parsing language that has received positive feedback from the community for its ease of use and versatility, with discussions highlighting its applications and comparisons to other tools.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion (based on 53 loaded comments)
- Story posted: Oct 14, 2025 at 10:51 AM EDT
- First comment: Oct 23, 2025 at 3:43 PM EDT (9d after posting)
- Peak activity: 44 comments in Day 10, the hottest window of the conversation
- Latest activity: Oct 25, 2025 at 2:41 AM EDT
- Avg / period: 26.5
> Kaitai Struct is in a similar space, generating safe parsers for multiple target programming languages from one declarative specification. Again, Wuffs differs in that it is a complete (and performant) end to end implementation, not just for the structured parts of a file format. Repeating a point in the previous paragraph, the difficulty in decoding the GIF format isn't in the regularly-expressible part of the format, it's in the LZW compression. Kaitai's GIF parser returns the compressed LZW data as an opaque blob.
Taking PNG as an example, Kaitai will tell you the image's metadata (including width and height) and that the compressed pixels are in the such-and-such part of the file. But unlike Wuffs, Kaitai doesn't actually decode the compressed pixels.
---
Wuffs' generated C code also doesn't need any capabilities, including the ability to malloc or free. Its example/mzcat program (equivalent to /bin/bzcat or /bin/zcat, for decoding BZIP2 or GZIP) self-imposes a SECCOMP_MODE_STRICT sandbox, which is so restrictive (and secure!) that it prohibits any syscalls other than read, write, _exit and sigreturn.
(I am the Wuffs author.)
Wuffs is intended for files. But, would it be a bad idea to use it to parse network data from untrusted endpoints?
There's also a "wget some JSON and pipe that to what Wuffs calls example/jsonptr" example at https://nigeltao.github.io/blog/2020/jsonptr.html#sandboxing
Kaitai is for describing, encoding and decoding file formats. Wuffs is for decoding images (which includes decoding certain file formats). Kaitai is multi-language, Wuffs compiles to C only. If you wrote a parser for PNGs, your Kaitai implementation could tell you what the resolution was, where the palette information was (if any), what the comments look like and on what byte the compressed pixel chunk started. Your Wuffs implementation would give you back the decoded pixels (OK, and the resolution).
Think of Kaitai as an IDL generator for file formats, perhaps. It lets you parse the file into some sort of language-native struct (say, a series of nested objects) but doesn't try to process it beyond the parse.
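To make the "parse, but don't decode" split concrete, here is a minimal hand-written sketch in Python. It is not Kaitai output and not Wuffs; `scan_png` and `make_chunk` are hypothetical helper names. It walks the PNG chunk layout, reports the width and height, and says where the compressed pixel data sits, but never decompresses it.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def scan_png(data: bytes) -> dict:
    """Walk the PNG chunk structure without decoding any pixels."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    info = {"width": None, "height": None, "chunks": []}
    pos = 8
    while pos + 8 <= len(data):
        # Each chunk: 4-byte big-endian length, 4-byte type, payload, 4-byte CRC.
        length, ctype = struct.unpack_from(">I4s", data, pos)
        payload_start = pos + 8
        if ctype == b"IHDR":
            # IHDR starts with width and height as big-endian u4 values.
            info["width"], info["height"] = struct.unpack_from(">II", data, payload_start)
        info["chunks"].append((ctype.decode("ascii"), payload_start, length))
        pos = payload_start + length + 4  # skip payload and CRC
        if ctype == b"IEND":
            break
    return info


def make_chunk(ctype: bytes, payload: bytes) -> bytes:
    """Build one chunk (used only to assemble a tiny demo file)."""
    crc = zlib.crc32(ctype + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + ctype + payload + struct.pack(">I", crc)


# A 1x1 grayscale image, assembled by hand for the demo.
ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
idat = zlib.compress(b"\x00\x00")  # one filter byte plus one pixel
demo = (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
        + make_chunk(b"IDAT", idat) + make_chunk(b"IEND", b""))

result = scan_png(demo)
print("size:", result["width"], "x", result["height"])
for name, offset, length in result["chunks"]:
    print(f"{name}: payload at byte {offset}, {length} bytes")
```

Everything past this point (inflating IDAT, undoing the scanline filters, producing pixels) is the part the comment attributes to Wuffs rather than to a structure parser.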
I gave a guest lecture in a friend's class last week where we used Kaitai to back out the file format used in "Where in Time is Carmen Sandiego" and it was a total blast. (For me. Not sure that the class agreed? Maybe.) The Web IDE made this super easy -- https://ide.kaitai.io/ .
(On my youtube page I've got recordings of streams where I work with Kaitai to do projects like these, but somehow I am not able to work up the courage to link them here.)
RE, especially of older and more specialized software, involves dealing with odd undocumented binary formats. Which you may have to dissect carefully with a hex editor and a decompiler, so that you can get at the data inside.
Kaitai lets you prototype a parser for formats like that on the go, quick and easy.
For harder cases? You take the binaries that read or write your compressed files, load them in your tool (typically Ghidra nowadays), and track down the code that does it.
Then you either recognize what that code does (by staring at it really really hard), or try to re-implement it by hand while reading up on various popular compression algos in hope that doing this enlightens you.
Third option now: feed the decompiled or reimplemented code to the best LLM you have access to, and ask it. Those things are downright superhuman at pattern matching known algorithms, so use them, no such thing as "cheating" in RE.
The "hard mode" is compression implemented in hardware, with neither a software encoder nor a software decoder available. In which case you better be ready for a lot of "feed data to the magic registers, see results, pray they give you a useful hint" type of blind hardware debugging. Sucks ass.
The "impossible" is when you have just the compressed binaries, with no encoder or decoder or plaintext data available to you at all. Better hope it's something common or simple enough or it's fucking hopeless. Solving that kind of thing is cryptanalysis level of mind fuck and I am neither qualified enough nor insane enough to advise on that.
Another thing. For practical RE? ALWAYS CHECK PRIOR WORK FIRST. You finding an open source reimplementation? Good job, that's what you SHOULD be doing, no irony, that's what you should be doing ALWAYS. Always check whether someone has been there and done that! Always! Check whether someone has worked on this thing, or the older version of it, or another game in the same engine - anything at all. Can save you literal months of banging your head against the wall.
And yes, searching for the reimplementation beforehand would have saved me some hours :D
Whatever reads or writes this data has to be able to compress or decompress it. And with any luck, you'll be able to take the compression magic sauce from there.
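Before reaching for a decompiler, it can be worth checking whether the blob is simply one of the common algorithms. The following is a minimal sketch, assuming only Python's standard-library codecs (zlib, gzip, bz2, xz) and a hypothetical helper named `find_compressed`. It is a quick triage step rather than anything the commenters describe, and it will miss custom schemes and headerless raw-deflate streams.

```python
import bz2
import lzma
import zlib


def _zlib_attempt(blob: bytes, wbits: int):
    """Return the output only if a complete stream was decoded."""
    d = zlib.decompressobj(wbits)
    out = d.decompress(blob)
    return out if d.eof else None


def _stream_attempt(decompressor, blob: bytes):
    """Same idea for the bz2 and lzma decompressor objects."""
    out = decompressor.decompress(blob)
    return out if decompressor.eof else None


CANDIDATES = {
    "zlib": lambda b: _zlib_attempt(b, 15),   # zlib header expected
    "gzip": lambda b: _zlib_attempt(b, 31),   # gzip header expected
    "bz2": lambda b: _stream_attempt(bz2.BZ2Decompressor(), b),
    "xz": lambda b: _stream_attempt(lzma.LZMADecompressor(format=lzma.FORMAT_XZ), b),
}


def find_compressed(blob: bytes, max_offset: int = 64) -> list:
    """Try each codec at each of the first max_offset byte offsets.

    Returns (offset, codec_name, decompressed_size) for every complete stream found.
    """
    hits = []
    for offset in range(min(max_offset, len(blob))):
        for name, attempt in CANDIDATES.items():
            try:
                out = attempt(blob[offset:])
            except (zlib.error, lzma.LZMAError, OSError, EOFError, ValueError):
                continue  # not this codec at this offset
            if out:
                hits.append((offset, name, len(out)))
    return hits


# Demo: an unknown 8-byte header followed by a zlib stream.
payload = b"hello kaitai " * 20
blob = b"HDR\x00\x01\x02\x03\x04" + zlib.compress(payload)
hits = find_compressed(blob)
print(hits)
```

A hit tells you where the header ends and which codec to hand the rest to; no hit means you are back to tracking down the code that reads or writes the file.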
https://github.com/kaitai-io/kaitai_struct_compiler/commits/...
It would be premature to review now because there are some missing features and stuff that has to be cleaned up.
But I am interested in finding someone experienced in Zig to help the maintainer with a sanity check to make sure best practices are being followed. (Would be willing to pay for their time.)
If comptime is used, it would be minimal. This is because code-generation is being done anyway so that can be an explicit alternative to comptime. But we have considered using it in a few places to simplify the code-generation.
KS isn't for general data mangling, it's for "I have this format and I need a de novo parser for it that works under explicit rules" and you're willing to do the work of fully implementing it from the bytes up.
I did NOT have fun trying to use Kaitai to pack the files back together. Not sure if this has improved at all but a year or so ago you had to build dependencies yourself and the process was so cumbersome it ended up being easier to just write imperative code to do it myself.
This seems to say flags is a sort of unsigned integer.
Is there a way to break the flags into big-endian bits, where the first two bits are either 01 or 10 (never 00 or 11), with 01 meaning DATA and 10 meaning POINTER, the next five bits are a counter of segments, and the last bit is 1 if the default is BLACK and 0 if the default is WHITE?
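Kaitai Struct does have bit-sized integer types (written `b1`, `b2`, and so on), which is the usual tool for this kind of field. The same split can be sketched in plain Python to show what is being asked for. The assumptions here are mine: the flags fit in one byte read most-significant-bit first, 01 means DATA and 10 means POINTER, and the colour bit is 1 for BLACK and 0 for WHITE. `parse_flags` and `Flags` are hypothetical names.

```python
from typing import NamedTuple

KINDS = {0b01: "DATA", 0b10: "POINTER"}  # 0b00 and 0b11 are rejected


class Flags(NamedTuple):
    kind: str
    segment_count: int
    default_colour: str


def parse_flags(byte: int) -> Flags:
    """Split one byte into 2 + 5 + 1 bit fields, most significant bit first."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("flags must fit in one byte")
    kind_bits = byte >> 6                   # top two bits
    segment_count = (byte >> 1) & 0b11111   # next five bits
    colour_bit = byte & 1                   # lowest bit
    if kind_bits not in KINDS:
        raise ValueError(f"invalid kind bits {kind_bits:02b}")
    # Assumption: 1 means BLACK and 0 means WHITE.
    colour = "BLACK" if colour_bit else "WHITE"
    return Flags(KINDS[kind_bits], segment_count, colour)


print(parse_flags(0b01_00011_1))
print(parse_flags(0b10_11111_0))
```

Rejecting 00 and 11 up front keeps an invalid kind from silently turning into a default value further down the parser.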
https://github.com/dloss/binary-parsing
Personally I like GNU Poke.
For example:
structures:
struct a { strz b, strz c, int d, str[d] e }
constraints:
len(b) > len(c)
d > 6
d <= 10
e ~ /^ABC\d+/
The serialization branch for Python [1] (I haven't tried the Java one) has generally done the job for me, though I've had to patch a few edge cases.
One feature I've often wished for is access to physical offsets within the file being parsed (e.g. being able to tell that a field foo that you just parsed starts at offset 0x100 from the beginning of the file). As far as I know, you only get relative offsets to the parent structure.
0: https://github.com/anvilsecure/garmin-ciq-app-research/blob/...
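The relative-versus-absolute offset distinction can be sketched with a small cursor class that carries its own base offset into every substream. This is a hypothetical hand-rolled `Reader`, not the Kaitai runtime API, and the file layout below is invented for the demo.

```python
import struct


class Reader:
    """A cursor over a byte slice that remembers where the slice sits in the file."""

    def __init__(self, data: bytes, base: int = 0):
        self.data = data
        self.base = base  # absolute offset of data[0] in the original file
        self.pos = 0      # offset relative to this slice

    def abs_pos(self) -> int:
        return self.base + self.pos

    def read(self, n: int) -> bytes:
        chunk = self.data[self.pos:self.pos + n]
        if len(chunk) != n:
            raise EOFError("read past end of stream")
        self.pos += n
        return chunk

    def u2le(self) -> int:
        return struct.unpack("<H", self.read(2))[0]

    def substream(self, n: int) -> "Reader":
        """Carve out the next n bytes as a child stream, carrying the absolute base along."""
        start = self.abs_pos()
        return Reader(self.read(n), base=start)


# Invented layout: 4-byte magic, u2 body length, then a body holding u2 `skip` and u2 `foo`.
file_bytes = b"DEMO" + struct.pack("<H", 4) + struct.pack("<HH", 0xAAAA, 0x1234) + b"tail"

top = Reader(file_bytes)
magic = top.read(4)
body = top.substream(top.u2le())
body.u2le()  # skip
foo_rel, foo_abs = body.pos, body.abs_pos()
foo = body.u2le()
print(f"foo=0x{foo:04x} relative={foo_rel} absolute={foo_abs}")
```

The only state needed is one integer per substream, so the physical offset of any field is the parent's base plus the position inside the child.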
Seems like everything has to be JSON and text based these days, because binary is more difficult DX.
When reading articles discussing binary formats, I usually see them using box diagrams of packets, description tables or hexdumps.
This neatly describes nested structure, names and "types" - just enough.
I wonder if there’s a hexdump-like viewer in IDEs that can present binary files like this? I can also imagine a simple UI to make the files editable using this.
Also, the newest Kaitai release added (long awaited) serialization support! I haven't had a chance to try it out.
https://kaitai.io/news/2025/09/07/kaitai-struct-v0.11-releas...
DFDL is heavily encroaching on Kaitai Struct's territory.
I used Kaitai in an IoT project for building data ingress parsers and it was great. But not having write support was a bummer.
Their reference parsers for Mach-O and DER work quite nicely in abi3audit[1].
[1]: https://github.com/pypa/abi3audit/tree/main/abi3audit/_vendo...
highly recommended if you like functional languages