Das Problem Mit German Strings
Key topics
The debate around "German Strings" sparked a lively discussion, with some commenters poking fun at the title edit that initially changed "mit" to "MIT". As it turned out, the author confirmed it was a mistake, and the title was corrected. The conversation then shifted to the actual topic, with some participants clarifying that the term "German Strings" refers to a specific string format style, and others pointing out that the concept has been reinvented multiple times. The discussion also touched on the nuances of the German language, with commenters chiming in on its grammatical rules and word order.
Snapshot generated from the HN discussion
Discussion Activity
Moderate engagement, based on 23 loaded comments.
- Story posted: Aug 26, 2025 at 4:59 PM EDT
- First comment: Aug 26, 2025 at 5:19 PM EDT (20m after posting)
- Peak activity: 9 comments in the 0-6h window
- Average per period: 3.8 comments
- Latest activity: Aug 29, 2025 at 4:25 AM EDT
> The concept of inlined strings with prefixes (called “German Strings” by Andy Pavlo, in homage to TUM, where the Umbra paper that describes them originated) has been used in many recent database systems (Velox, Polars, DuckDB, CedarDB, etc.) and was introduced to Arrow as a new StringViewArray[^3] type. Arrow’s original StringArray is very memory efficient but less effective for certain operations. StringViewArray accelerates string-intensive operations via prefix inlining and a more flexible and compact string representation.
So the name seems to mean nothing more than that they were invented at a German university. I spent quite some time thinking it had something to do with German's sometimes-SOV word order.
If you mean subordinate clauses in German: there the rule is rather "the finite verb goes at the end of the subordinate clause".
Umbra: A Disk-Based System with In-Memory Performance
https://db.in.tum.de/~freitag/papers/p29-neumann-cidr20.pdf
Section 3.1 covers string handling.
This article (also linked from tfa) explains German strings in more detail.
https://cedardb.com/blog/german_strings
- represented as two 64-bit words (16 bytes in total)
- fixed 32-bit length field
- short strings (up to 12 bytes, i.e. the 16 bytes minus the 4-byte length) are stored in place
- long strings store a 4-byte prefix in place plus a pointer to the rest
- two bits of the pointer are used as flags to further optimize some use cases
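The layout in the list above can be sketched in a few lines of Python using `struct`. This is only an illustration under stated assumptions: the names `encode`, `prefix_of`, and `may_be_equal` are hypothetical, the `pointer` argument is a stand-in integer rather than a real address, little-endian packing is an arbitrary choice, and the two flag bits in the pointer are left out. Real systems implement this on raw memory, not with Python byte strings.

```python
import struct

INLINE_MAX = 12   # content bytes that fit beside the 4-byte length in 16 bytes
PREFIX_LEN = 4    # bytes of a long string kept in place for quick comparisons


def encode(data: bytes, pointer: int = 0) -> bytes:
    """Pack `data` into a 16-byte German-string-style value.

    `pointer` is a hypothetical placeholder for the address of the
    out-of-line payload; Python has no raw pointers.
    """
    length = len(data)
    if length > 0xFFFFFFFF:
        raise ValueError("length must fit in 32 bits")
    if length <= INLINE_MAX:
        # Short form: 4-byte length + up to 12 content bytes, zero padded.
        return struct.pack("<I12s", length, data)
    # Long form: 4-byte length + 4-byte prefix + 8-byte pointer.
    return struct.pack("<I4sQ", length, data[:PREFIX_LEN], pointer)


def prefix_of(value: bytes) -> bytes:
    """Return up to the first four content bytes without following a pointer."""
    length = struct.unpack_from("<I", value)[0]
    return value[4:4 + min(length, PREFIX_LEN)]


def may_be_equal(a: bytes, b: bytes) -> bool:
    """Cheap pre-check: bytes 0-7 hold length and prefix in both forms.

    False means the strings definitely differ; True means a full
    comparison is still needed.
    """
    return a[:8] == b[:8]
```

The point of the sketch is that both forms share the same first eight bytes (length plus leading content), so many comparisons can be rejected without touching the out-of-line data.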
It would be better just to take the storage requirement on the chin and not add a gratuitous variation in encoding which will bite you on the ass somehow (or someone else).
As much as possible, pick one way of doing one thing. Your stuff already has thousands of things to do. Each time you do something in two or more ways, you add combinations between that and surrounding things being done in two or more ways.
Oh, you don't have to test the combinations because the code is bug free, is that the argument? Which is because of some good organization?
Those things are nicely isolated so 3 + 4 + 2 unit tests, and we are done?
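To make the arithmetic behind that objection concrete, here is a small sketch. The three features and their variant counts are hypothetical, taken only from the "3 + 4 + 2" figure above.

```python
from itertools import product

# Hypothetical: three features with 3, 4 and 2 ways of doing each.
variants = [3, 4, 2]

# Testing each feature in isolation needs one test per variant.
isolated_tests = sum(variants)

# Testing every interaction needs one test per combination of variants.
combinations = len(list(product(*(range(n) for n in variants))))

print(isolated_tests, combinations)  # prints: 9 24
```

The sum only covers the interactions if the features really are independent; otherwise the product is the number of cases that can go wrong.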
In a wider view, that depends. If one is using a general-purpose heap for string storage and a 64-bit instruction set architecture, the heap is often aligning and padding out allocations to such multiples already.
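A rough sketch of that padding effect follows. The 16-byte alignment is an assumption (it is common for general-purpose allocators on 64-bit platforms but not guaranteed), `padded_size` is a hypothetical helper, and per-allocation bookkeeping overhead is ignored.

```python
def padded_size(request: int, alignment: int = 16) -> int:
    """Round an allocation request up to the next multiple of `alignment`.

    The default of 16 is an assumed alignment, not a property of any
    particular allocator.
    """
    return (request + alignment - 1) // alignment * alignment


# Under this assumption, a 13-byte string costs as much heap as a 16-byte one.
for request in (1, 12, 13, 16, 17):
    print(request, padded_size(request))
```

Under that assumption, storing a string as an exactly sized heap allocation saves nothing over a fixed 16-byte representation until the string grows past 16 bytes.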