SQL Design Patterns (2010)
Key topics
The debate around SQL's `DISTINCT` operator sparked a lively discussion, with commenters split on its usage and implications. Some view it as a "query smell" or a sign of underlying issues with the query or with the author's understanding of the data; others see it as a convenient syntactic shortcut. A rough consensus emerges that `DISTINCT` is often misused, with one commenter estimating a 70% chance that any given use is papering over a problem that should be solved differently. The discussion highlights the nuances of SQL design patterns and the importance of understanding the underlying data.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 4d after posting
Peak period: 24 comments in the 84-96h window
Avg / period: 16
Based on 32 loaded comments
Key moments
- 01 Story posted: Aug 27, 2025 at 12:59 AM EDT (4 months ago)
- 02 First comment: Aug 30, 2025 at 3:11 PM EDT (4d after posting)
- 03 Peak activity: 24 comments in the 84-96h window (hottest stretch of the conversation)
- 04 Latest activity: Aug 31, 2025 at 11:15 AM EDT (4 months ago)
I wish SQL had a strict mode syntax that forces you to use something like `select one` (like LINQ’s Single()) or `select many` to catch these kinds of bugs.
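Absent such a strict mode in SQL itself, the check can be sketched at the application layer. The `select_one` helper below is made up for illustration, modelled on LINQ's `Single()`, using Python's `sqlite3`:

```python
import sqlite3

def select_one(cur, sql, params=()):
    """Like LINQ's Single(): raise unless the query yields exactly one row."""
    rows = cur.execute(sql, params).fetchmany(2)  # fetch at most 2 to detect excess
    if len(rows) != 1:
        raise ValueError("expected exactly 1 row, got " + ("0" if not rows else "2+"))
    return rows[0]

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
cur.executemany("INSERT INTO users (email) VALUES (?)",
                [("a@example.com",), ("b@example.com",)])

row = select_one(cur, "SELECT * FROM users WHERE email = ?", ("a@example.com",))
# A query that unexpectedly matches several rows now fails loudly:
try:
    select_one(cur, "SELECT * FROM users")
except ValueError as e:
    print(e)
```

The bug class this catches is a lookup that silently starts returning extra rows after a schema or data change.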
SELECT DISTINCT ON is different, and useful.
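Postgres's `SELECT DISTINCT ON (key)` keeps the first row per key under the query's `ORDER BY`. SQLite has no `DISTINCT ON`, so this sketch (with an invented schema) shows the same "first row per group" pattern via a window function, available in SQLite 3.25+:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE readings (sensor TEXT, ts INTEGER, value REAL)")
cur.executemany("INSERT INTO readings VALUES (?, ?, ?)", [
    ("a", 1, 1.0), ("a", 2, 2.0), ("b", 1, 9.0), ("b", 3, 7.0),
])

# Postgres: SELECT DISTINCT ON (sensor) * FROM readings ORDER BY sensor, ts DESC;
# Window-function equivalent: rank rows per sensor, keep only rank 1.
rows = cur.execute("""
    SELECT sensor, ts, value FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY sensor ORDER BY ts DESC) AS rn
        FROM readings
    ) WHERE rn = 1
    ORDER BY sensor
""").fetchall()
print(rows)  # latest reading per sensor
```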
- For a question worth 2 points, if you use the word "DISTINCT" when it wasn't needed, you lose 0.5 points.
- If you don't use "DISTINCT" when it was necessary, you lose all 2 points.
Footnote on page 3.
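A minimal illustration of the misuse being graded, with a made-up schema: a join fans out rows, `DISTINCT` hides the symptom, and `EXISTS` states the actual intent:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1), (11, 1);
""")

# "Customers with orders" via a join: Ada has two orders, so she appears twice.
dup = cur.execute("""
    SELECT c.name FROM customers c JOIN orders o ON o.customer_id = c.id
""").fetchall()

# Slapping DISTINCT on the join would mask the fan-out; EXISTS asks the
# intended question ("does at least one order exist?") with no duplication.
fixed = cur.execute("""
    SELECT name FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
""").fetchall()
print(dup, fixed)
```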
The set of data structures that you use to model and index a dataset is worth understanding, and designing in that space is a skill worth learning.
Assuming you are familiar with trees and hashmaps, you have all the important building blocks. You can imagine a database as a bunch of trees, hashmaps and occasionally other stuff, protected by a lock. First you acquire the lock, then you update some of the data structures, and maybe that requires you to update some of the other data structures (like indexes) for consistency. Then you release the lock.
By default, most data will live in a BTree with an integer primary key, and that integer is taken from a counter that you increment for new inserts. Indexes will be BTrees where the key is stuff you want to query on, and the value is the primary key in the main table.
Using just those data structures you should be able to plan for any query or insert pattern. It helps to figure this out yourself in a programming language for a few practice cases, so you know you can do it. Eventually it will be easy to figure out what tables and indexes you need in your head. In the real world, this stuff is jotted down in design docs, often as SQL or even just bullets.
That's really all you need, and that's where I recommend getting out of the rabbit hole. Query planners are pretty good. You can usually just write SQL and if you did the work to understand what the tables and indexes should be, the planner will figure out how to use them to make the query fast.
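The mental model above can be sketched in a few lines of Python, with dicts standing in for the B-trees (all names here are illustrative):

```python
import threading

# "A bunch of trees and hashmaps protected by a lock": the main table maps
# an auto-incremented primary key to a row, and a secondary index maps an
# indexed column's value back to primary keys.
lock = threading.Lock()
table = {}          # pk -> row (stand-in for the primary-key B-tree)
email_index = {}    # email -> set of pks (stand-in for a secondary index)
next_pk = 1

def insert(row):
    global next_pk
    with lock:                  # acquire, update every structure, release
        pk = next_pk
        next_pk += 1
        table[pk] = row
        email_index.setdefault(row["email"], set()).add(pk)
        return pk

def find_by_email(email):
    with lock:                  # index lookup, then fetch from the main table
        return [table[pk] for pk in sorted(email_index.get(email, ()))]

insert({"email": "a@example.com", "name": "Ada"})
insert({"email": "b@example.com", "name": "Bob"})
print(find_by_email("a@example.com"))
```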
Also, in no way does SQL hide anything - it’s a declarative language, and will produce exactly what you tell it to, provided you understand what it is you asked it to do. The query engine is somewhat of a black box, but that is completely orthogonal.
Not really sure what you’re trying to argue here.
That’s the point. In an imperative language, if you don’t yet understand (or make a typo, or whatever), you can just print/console.log and find out.
I’ve seen junior devs, data analysts, and LLMs spin their wheels trying to figure out why adding a join isn’t producing the output they want. I don’t think they would figure it out using SQL alone if you gave them a month.
Depending on how you write your query and how you structure your data, a query can take 0.005 seconds or 500 seconds.
SQL hiding the execution is an extremely leaky abstraction. To get the performance you need, you have to plan your possible queries in advance together with how to structure the data.
I mean, it doesn't matter if you only have 100 rows per table, but once you're dealing with multiple tables with millions of rows each it's everything.
But you're not prevented from finding out how your query was executed. For example EXPLAIN (MySQL, Postgres) or query analyser for MSSQL.
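With SQLite, for instance, `EXPLAIN QUERY PLAN` shows whether the planner scans the table or seeks through an index (the exact plan text varies by version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, email TEXT)")

# Without an index the planner scans; with one it searches the index.
plan_before = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE email = ?", ("x",)).fetchall()
cur.execute("CREATE INDEX t_email ON t(email)")
plan_after = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE email = ?", ("x",)).fetchall()

print(plan_before[0][-1])  # e.g. 'SCAN t'
print(plan_after[0][-1])   # e.g. 'SEARCH t USING INDEX t_email (email=?)'
```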
Ha ha, no, SQL implementations can conform to the standard in unexpected ways.
`NULL = NULL`
Is that true or false? We didn't know until 2003.
https://en.wikipedia.org/wiki/Null_(SQL)#Criticisms
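A quick check with SQLite shows the three-valued behaviour: `NULL = NULL` evaluates to NULL (unknown), neither true nor false, so a `WHERE` clause built on it filters everything out:

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()

# Under SQL's three-valued logic, NULL = NULL is NULL (unknown);
# IS NULL is the correct test for nullness.
print(cur.execute("SELECT NULL = NULL").fetchone()[0])       # None (i.e. NULL)
print(cur.execute("SELECT NULL IS NULL").fetchone()[0])      # 1 (true)
print(cur.execute("SELECT 1 WHERE NULL = NULL").fetchall())  # [] -- row filtered out
```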
The SQL2003 changes were more about how constraints should treat NULL than anything intrinsic about NULL itself.
The indexing is where the main optimisations take place - hashmap indexes, or clustering indexes for priority queues.
What am I missing?
And with SQL macros becoming a thing, it is now easily possible to store patterns as reusable functions with parameters.
----
[0] Because you want the ordinal of the row in the input table/view, not your output.
[1] You could also use a sub-query, in most cases a good query planner will see the equivalence and do the same thing for either. The CTE option is easier to read and maintain IMO.
[2] In databases, like sports, CTEs can result in headaches!
[3] Or veterans of postgres, where until a few years ago CTEs were an optimisation fence, blocking predicate push-down and making some filtered queries a lot more expensive (though often no more so than the naive Cartesian product method).