Das Problem Mit German Strings

Posted 5 months ago, active 4 months ago

asubiotto

79 points

23 comments

polarsignals.com, Tech Discussion, story

Key topics

Programming

Character Encoding

German Language

The debate around "German Strings" sparked a lively discussion, with some commenters poking fun at the title edit that initially changed "mit" to "MIT". As it turned out, the author confirmed it was a mistake, and the title was corrected. The conversation then shifted to the actual topic, with some participants clarifying that the term "German Strings" refers to a specific string format style, and others pointing out that the concept has been reinvented multiple times. The discussion also touched on the nuances of the German language, with commenters chiming in on its grammatical rules and word order.

Snapshot generated from the HN discussion

Discussion Activity

Moderate engagement

First comment: 20m

Peak period: 0-6h

Avg / period: 3.8

Key moments

01 Story posted: Aug 26, 2025 at 4:59 PM EDT
02 First comment: Aug 26, 2025 at 5:19 PM EDT (20m after posting)
03 Peak activity: 9 comments in 0-6h
04 Latest activity: Aug 29, 2025 at 4:25 AM EDT

Discussion (23 comments)

dekhn

5 months ago

1 reply

Did the Hacker News title editor change the "mit" to "MIT"?

asubiottoAuthor

5 months ago

1 reply

Seems like it. Changed it back!

dang

5 months ago

1 reply

Oops, sorry.

Tadpole9181

5 months ago

1 reply

Haha, is that automated or was someone trying to be helpful?

dang

5 months ago

It's automated. And of course it's usually right, but the wrong cases stand out like sore thumbs.

thayne

5 months ago

2 replies

So... why are they called German strings?

mathieuh

5 months ago

3 replies

https://datafusion.apache.org/blog/2024/09/13/string-view-ge...

> The concept of inlined strings with prefixes (called “German Strings” by Andy Pavlo, in homage to TUM, where the Umbra paper that describes them originated) has been used in many recent database systems (Velox, Polars, DuckDB, CedarDB, etc.) and was introduced to Arrow as a new StringViewArray[^3] type. Arrow’s original StringArray is very memory efficient but less effective for certain operations. StringViewArray accelerates string-intensive operations via prefix inlining and a more flexible and compact string representation.

The name seems to mean nothing more than that they were invented at a German university. I spent quite some time thinking it had something to do with German’s sometimes-SOV word order.

aleph_minus_one

5 months ago

2 replies

> I spent quite some time thinking it had something to do with German’s sometimes-SOV word order.

If you refer to subclauses in the German language: here the rule is rather "the finite verb is at the end of the subclause".

kaladin-jasnah

5 months ago

1 reply

I think this is also called V2 word order.

aleph_minus_one

5 months ago

V2 word order (finite verb comes second) is what is used in main clauses.

yorwba

5 months ago

It also applies to infinitives and participles and the verb in nominalized noun-verb compounds. So the rule is closer to "the verb is at the end of its grammatical unit, except for the finite verb in a main clause, which appears in second position." https://en.wikipedia.org/wiki/V2_word_order

andai

5 months ago

1 reply

Here is the paper in question:

Umbra: A Disk-Based System with In-Memory Performance

https://db.in.tum.de/~freitag/papers/p29-neumann-cidr20.pdf

Section 3.1 covers string handling.

This article (also linked from tfa) explains German strings in more detail.

https://cedardb.com/blog/german_strings

chombier

5 months ago

1 reply

My tl;dr after reading the article:

- representation is two 64-bit words (16 bytes)

- fixed 32-bit length field

- short strings (up to 12 bytes) are stored in-place

- long strings store a 4-byte prefix in-place + a pointer to the rest

- two bits of the pointer are used as flags to further optimize some use-cases
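
The layout in this list can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation from the article or from any database: `encode`, `decode`, `equal`, and `INLINE_MAX` are hypothetical names, the 8-byte "pointer" is modeled as an offset into a shared `bytearray`, the two flag bits are left out, and it assumes that up to 12 bytes fit inline (16 bytes minus the 4-byte length).

```python
import struct

INLINE_MAX = 12  # assumption: 16-byte slot minus the 4-byte length field


def encode(data: bytes, heap: bytearray) -> bytes:
    """Return a 16-byte representation; long payloads are appended to `heap`."""
    n = len(data)
    if n > 0xFFFFFFFF:
        raise ValueError("length does not fit in 32 bits")
    if n <= INLINE_MAX:
        # 4-byte length, then the content itself, zero-padded to 12 bytes.
        return struct.pack("<I12s", n, data)
    offset = len(heap)
    heap.extend(data)
    # 4-byte length, 4-byte prefix, 8-byte "pointer" (here: an offset into heap).
    return struct.pack("<I4sQ", n, data[:4], offset)


def decode(rep: bytes, heap: bytearray) -> bytes:
    """Recover the original bytes from a 16-byte representation."""
    (n,) = struct.unpack_from("<I", rep)
    if n <= INLINE_MAX:
        return rep[4:4 + n]
    _prefix, offset = struct.unpack_from("<4sQ", rep, 4)
    return bytes(heap[offset:offset + n])


def equal(a: bytes, b: bytes, heap: bytearray) -> bool:
    """Compare two representations, rejecting early on length or prefix."""
    # The first 8 bytes hold the length plus the first 4 content bytes,
    # so most mismatches are decided without touching the heap.
    if a[:8] != b[:8]:
        return False
    return decode(a, heap) == decode(b, heap)


heap = bytearray()
short = encode(b"hello", heap)               # stored entirely inline
long_ = encode(b"a much longer string", heap)  # prefix inline, payload in heap
```

The point of the inline prefix is visible in `equal`: two strings that differ in length or in their first four bytes are told apart by comparing eight bytes of the fixed-size representation, with no pointer dereference.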

imtringued

4 months ago

Seems like they missed an opportunity to have an 8-byte version for strings that fit in the 4-byte prefix.

jandrewrogers

5 months ago

This general string format style has been invented many times over the decades. Unfortunately, we seem to need to relearn the tradeoffs each time.

on_the_train

5 months ago

They aren't. They're called German style strings. People just like to clickbait and prey on curiosity of techies.

kazinator

5 months ago

1 reply

> Because it is difficult to assume what the best encoding will be for any given workload, database systems should dynamically choose encodings based on storage and workload characteristics.

It would be better just to take the storage requirement on the chin and not add a gratuitous variation in encoding which will somehow bite you (or someone else) on the ass.

As much as possible, pick one way of doing one thing. Your stuff already has thousands of things to do. Each time you do something in two or more ways, you add combinations between that and surrounding things being done in two or more ways.

kccqzy

5 months ago

1 reply

The combinatorial explosion problem is nicely solved by defining good interfaces. C++ gives you iterators and algorithms that work on iterators. Clojure has sequence interfaces and functions that work on all sequence types.

kazinator

5 months ago

1 reply

That just improves the organization of the program; it doesn't get rid of the increased risks of doing the same thing in N ways that could be pinned down to one.

kccqzy

5 months ago

1 reply

Please elaborate. What are the risks of doing the same thing in N ways, other than code organization issues leading to duplicate or messy code?

kazinator

4 months ago

Do this thing in 3 ways, do that one in 4, do another one in 2 and you have 3x4x2 = 24 combinations which are entirely gratuitous compared to the 1 combination that exists if all three things are done one way each.

Oh, you don't have to test the combinations because the code is bug free, is that the argument? Which is because of some good organization?

Those things are nicely isolated so 3 + 4 + 2 unit tests, and we are done?
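
The arithmetic in this comment can be checked with a short sketch. The feature names and variant counts below are hypothetical placeholders for "this thing", "that one", and "another one"; the only point is the gap between testing every combination and testing each variant in isolation.

```python
from itertools import product

# Hypothetical: three independent features with 3, 4, and 2 variants each.
variants = {"thing_a": 3, "thing_b": 4, "thing_c": 2}

# Every way of picking one variant per feature.
combinations = list(product(*(range(n) for n in variants.values())))

combined_cases = len(combinations)       # 3 * 4 * 2
isolated_cases = sum(variants.values())  # 3 + 4 + 2

print(combined_cases, isolated_cases)
```

Whether the isolated count is enough depends on how independent the variants really are, which is exactly the disagreement in this subthread.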

atoav

5 months ago

Well, as long as you know the difference between lowercase ß and uppercase ẞ (introduced in 2008), everything is probably just gonna be fine.

JdeBP

5 months ago

> Because each element requires at least a 16 byte representation, both tiny and repeated short strings use more memory than they otherwise would.

In a wider view, that depends. If one is using a general-purpose heap for string storage and a 64-bit instruction set architecture, the heap is often aligning and padding out allocations to such multiples already.
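
A rough model of that padding, as a sketch only: `padded_allocation` and its parameters (16-byte alignment, an 8-byte header, a 32-byte minimum) are assumptions chosen to resemble a typical 64-bit general-purpose allocator, not measurements from any particular one.

```python
def padded_allocation(n: int, align: int = 16, header: int = 8,
                      minimum: int = 32) -> int:
    """Bytes consumed by a heap allocation of n bytes in this assumed model."""
    raw = n + header
    # Round up to the next multiple of `align`.
    rounded = (raw + align - 1) // align * align
    return max(rounded, minimum)


# Under these assumptions a 5-byte string on the heap costs 32 bytes of heap
# plus an 8-byte pointer to it, versus one 16-byte inline representation.
heap_cost = padded_allocation(5) + 8
inline_cost = 16
print(heap_cost, inline_cost)
```

So for tiny strings the fixed 16 bytes can be the cheaper option once allocator overhead is counted, though the exact numbers vary by allocator.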

ID: 45032261; Type: story; Last synced: 11/20/2025, 3:10:53 PM
