Thoughts on the Word Spec in Rust
Source: tritium.legal · Story · Posted 3 months ago · Active 3 months ago
Key topics: Rust, Parsing, Document Formats
The article discusses the author's experience implementing a Word document parser in Rust, and the discussion revolves around parsing complexities, round-tripping, and handling unknown data.
Snapshot generated from the HN discussion
Discussion Activity
First comment: 3d after posting · Peak period: 15 comments in the 66-72h window · Avg per period: 5.8
Based on 23 loaded comments
Key moments
- Story posted: Oct 6, 2025 at 8:28 AM EDT (3 months ago)
- First comment: Oct 9, 2025 at 4:16 AM EDT (3d after posting)
- Peak activity: 15 comments in the 66-72h window (hottest stretch of the conversation)
- Latest activity: Oct 10, 2025 at 3:15 AM EDT (3 months ago)
It's definitely not impossible in the future.
I just don't think there is enough interest right now in contributing to the underlying tech without generalizing it so much that it basically becomes an inferior LibreOffice.
Instead, the business model for Tritium is to give away the niche legal product for free to the community, but charge commercial users who need more granular control over its network activity, etc. This gives smaller start-ups, law offices, and in-house shops a chance to benefit from the niche features, while leaving the advanced features to the more demanding organizations that express an interest in them.
You could make a special exemption for non-profits and public defenders.
Giving it away for free just creates potential for freeloaders.
Great product idea by the way! Hard to believe lawyers have gone without this for so long.
[NOTE: one dragon would be the memory consumption alluded to in the article.]
The reason for asking is that I've had a shower thought of building a custom PDF/doc reader for myself, one that would let me easily take notes and integrate with Anki. I've been doing that in Obsidian with the PDF plugin, but it's too slow. At the same time, I've heard that the PDF spec is not easy to work with, so I'm curious about your experience on that front.
There's actually an example PDF in the bundle if you click "Fetch Example" from the web preview at: https://tritium.legal/preview.
Under the hood, Tritium is using PDFium[1]. That's the same library used by Chrome, for example. The PDF spec is another animal that will be tackled in due course, but most legal users only need to view and comment on PDFs.
Try to find a binding to PDFium from your language of choice and start at that layer. PDFs are complex beasts, and most of that complexity may not be necessary to tackle in the first instance.
[1] https://pdfium.googlesource.com/pdfium/
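For anyone starting at that layer in Rust, here is a minimal sketch. It assumes the third-party pdfium-render crate and a system-installed PDFium library; the crate choice and file path are illustrative, not something the Tritium author has confirmed using.

    // Sketch: lean on PDFium for the heavy lifting instead of parsing
    // the PDF spec yourself. Assumes the `pdfium-render` crate.
    use pdfium_render::prelude::*;

    fn main() -> Result<(), PdfiumError> {
        // Bind to a system-installed libpdfium.
        let pdfium = Pdfium::new(Pdfium::bind_to_system_library()?);

        // "example.pdf" is a placeholder path.
        let document = pdfium.load_pdf_from_file("example.pdf", None)?;

        for (index, page) in document.pages().iter().enumerate() {
            println!("--- page {} ---", index + 1);
            // PDFium handles the text extraction for each page.
            println!("{}", page.text()?.all());
        }
        Ok(())
    }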
Essentially, this system works great if you know the exact hardware and compiler toolchain, and you never expect to upgrade it with things that might break memory layout. Obviously this does not hold for Word: it was written originally in a 32-bit world and now we live in a 64-bit one, MSVC has been upgraded many times, etc. There's also an address-space concern: if you embed your pointers, are you SURE that you're always going to be able to load them at the same place in the address space?
The overhead of deserialization is very small with a properly written file format, it's nowhere near worth the sacrifice in portability. This is not why Word is slow.
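To make the trade-off concrete, below is a small generic sketch (not Word's actual format) of the portable alternative: records store file-relative offsets instead of raw pointers, so the image can be loaded anywhere in the address space and "deserialization" is just bounds-checked slicing.

    // Sketch: a position-independent on-disk record. The header stores a
    // file-relative offset rather than an embedded pointer; the layout
    // here is illustrative only.
    fn write_record(buf: &mut Vec<u8>, payload: &[u8]) {
        // Header: [payload_offset: u32][payload_len: u32], little-endian.
        let header_len: u32 = 8;
        buf.extend_from_slice(&header_len.to_le_bytes()); // offset to payload
        buf.extend_from_slice(&(payload.len() as u32).to_le_bytes());
        buf.extend_from_slice(payload);
    }

    fn read_record(buf: &[u8]) -> Option<&[u8]> {
        // Offsets are relative to the start of the buffer, so it does not
        // matter where the OS maps or copies the file.
        let offset = u32::from_le_bytes(buf.get(0..4)?.try_into().ok()?) as usize;
        let len = u32::from_le_bytes(buf.get(4..8)?.try_into().ok()?) as usize;
        buf.get(offset..offset + len)
    }

    fn main() {
        let mut file_image = Vec::new();
        write_record(&mut file_image, b"hello");
        assert_eq!(read_record(&file_image), Some(&b"hello"[..]));
    }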
And then you have things like cap'n'proto if you want to control your memory layout. [1]
But for "productivity" files, you are essentially right. Portability and simplicity of the format is probably what matters.
[0]: https://www.hytradboi.com/2025/05c72e39-c07e-41bc-ac40-85e83...
[1]: https://capnproto.org/
That's exactly the point!
(For example, if the Rust code detected a version change, it could rewrite the data into a compatible format, etc.)
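A minimal sketch of that idea, with a hypothetical one-byte version header and made-up v1/v2 layouts:

    // Sketch: detect the on-disk version at load time and migrate old
    // data into the current in-memory form. Format and types are
    // hypothetical.
    const CURRENT_VERSION: u8 = 2;

    struct Document {
        body: String,
    }

    fn load(bytes: &[u8]) -> Result<Document, String> {
        let (&version, rest) = bytes.split_first().ok_or("empty file")?;
        match version {
            CURRENT_VERSION => parse_v2(rest),
            1 => Ok(migrate_v1_to_v2(rest)),
            v => Err(format!("unknown version {v}")),
        }
    }

    fn parse_v2(rest: &[u8]) -> Result<Document, String> {
        String::from_utf8(rest.to_vec())
            .map(|body| Document { body })
            .map_err(|e| e.to_string())
    }

    fn migrate_v1_to_v2(rest: &[u8]) -> Document {
        // Pretend v1 stored Latin-1 text; rewrite it as UTF-8 for v2.
        Document { body: rest.iter().map(|&b| b as char).collect() }
    }

    fn main() {
        let v1_file = [1u8, b'h', b'i'];
        assert_eq!(load(&v1_file).unwrap().body, "hi");
    }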
The reason I hated it, though, was that it was very hard to version. I know the Word team had that problem, especially when the mandate came down for older versions to be able to read newer versions. It's hard enough to organize the disk format so old versions can ignore stuff, but now you're putting the same requirements on the in-memory representation. Maybe Word did it better.
It's interesting to see how this has played out 24 years later with "vibe coding" and how Amazon does business.
> Indeed during the recent dotcom mania a bunch of quack business writers suggested that the company of the future would be totally virtual — just a trendy couple sipping Chardonnay in their living room outsourcing everything. What these hyperventilating “visionaries” overlooked is that the market pays for value added. Two yuppies in a living room buying an e-commerce engine from company A and selling merchandise made by company B and warehoused and shipped by company C, with customer service from company D, isn’t honestly adding much value.
https://www.joelonsoftware.com/2001/10/14/in-defense-of-not-...
Justified under the idea that unexpected tags should be uncommon precisely because they are unexpected (if a tag is common, you should have expected it), and so they can be relegated to a less-performant cold path.
It would probably not be the most fun time for a developer depending on docx-rs if an explicit requirement is interacting with and modifying a tag that ends up in the "whatever" bucket, but at least you could make sure that you (de)serialize losslessly.
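A rough sketch of that "whatever bucket" approach; the types here are hypothetical, not the actual docx-rs API:

    // Sketch: known elements get typed variants on the hot path, while
    // unrecognized tags keep their raw XML so the document round-trips
    // losslessly. Types are illustrative, not docx-rs.
    enum BodyElement {
        Paragraph { text: String },
        // Cold path: preserve anything we don't model, byte for byte.
        Unknown { raw_xml: String },
    }

    fn serialize(elements: &[BodyElement]) -> String {
        let mut out = String::new();
        for element in elements {
            match element {
                BodyElement::Paragraph { text } => {
                    out.push_str(&format!("<w:p><w:t>{text}</w:t></w:p>"));
                }
                // Unknown tags are written back exactly as read.
                BodyElement::Unknown { raw_xml } => out.push_str(raw_xml),
            }
        }
        out
    }

    fn main() {
        let doc = vec![
            BodyElement::Paragraph { text: "Hello".into() },
            BodyElement::Unknown { raw_xml: "<w:weird attr=\"1\"/>".into() },
        ];
        // The unexpected tag survives the round trip untouched.
        assert!(serialize(&doc).contains("<w:weird attr=\"1\"/>"));
    }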