SQL Anti-Patterns
Source: datamethods.substack.com · Posted Oct 18, 2025 · Snapshot generated from the HN discussion
Key topics: SQL, Database Design, Anti-Patterns

The article discusses common SQL anti-patterns to avoid, sparking a discussion on the nature of these patterns and their relation to SQL language design and database schema design.
I think most people think an anti-pattern is an aberration in the "solution" section that creates more problems.
So here, the anti-pattern is that people use a term so casually (e.g., DevOps) that no one knows what it's referring to anymore.
(The problem: need a way to refer to concept(s) in a pithy way. The solution: make up or reuse an existing word/phrase to incorporate the concept(s) by reference so that it can, unambiguously, be used as a replacement for the longer description.)
Strange choice of example! I'm not sure I agree that your example is a common problem, and I'm even less sure that the proposed solution to it is generally useful.
it isn't, is the thing.
if you read the book Design Patterns, they spell out what a pattern is.
if you read the book AntiPatterns, he spells out what an anti-pattern is.
people have gotten the wrong idea by learning the phrases from casual usage.
but also, the book AntiPatterns is pretty clear here
Is this code for 'use a lookup table' or am I falling behind on the terminology? The modern term should be 'sum table' or something similar surely.
'Landed table'? Is that the 'fact table', the one that contains the codes that need to be looked-up?
* in whatever order they're used
if your case statement is just a series of straightforward "WHEN x=this THEN that", you're very lucky.
the nasty case statements are the ones where the WHEN expression sometimes uses different pieces of data and/or the ordering of the statements is important.
Why wouldn’t you store this information in a table and query it when you need it? What if you need to support other languages? With a table you can just add more columns for more languages!
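A sketch of that lookup-table approach (the status_label and orders names are hypothetical):

    -- One row per code, one label column per supported language.
    CREATE TABLE status_label (
      status_code CHAR(1) PRIMARY KEY,
      label_en    TEXT NOT NULL,
      label_fr    TEXT NOT NULL   -- add a column per extra language
    );

    -- Join instead of hardcoding a CASE expression in every query:
    SELECT o.id, s.label_en AS status
    FROM orders o
    JOIN status_label s ON s.status_code = o.status_code;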
> query WHERE name = ‘abc’
> create an indexed UPPER(name) column
Should there be an "or" between these 2 points, or am I missing something? Why create an UPPER index column and not use it?
Unfortunately I learned this the hard way!
For example, you define an index on UPPER(name_column), and in your query you can use WHERE UPPER(name_to_search_for) = UPPER(name_column), and it will use the index.
> query WHERE name = ‘ABC’
> create an indexed UPPER(name) column
The point is that the index itself is already on the data with the function applied. So it's not a full scan, the way the original query was.
Of course, in this particular example you just want to use a case-insensitive collation to begin with. But the general concept is valid.
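A minimal sketch of that setup (PostgreSQL syntax; the users table and name column are placeholders):

    -- Index the expression itself:
    CREATE INDEX users_name_upper_idx ON users (UPPER(name));

    -- The predicate matches the indexed expression, so the planner can
    -- use the index instead of a full scan:
    SELECT * FROM users WHERE UPPER(name) = UPPER('abc');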
OTOH, it seems a fairly stable language (family of dialects?), so finding the pitfalls has long-lasting leverage.
Any time I see DISTINCT in a query I immediately become suspicious that the query author has an incomplete understanding of the data model, a lack of comprehension of set theory, or more likely both.
Though fairly recently I learned that even with all the correct joins in place, sometimes adding a DISTINCT within a CTE can dramatically increase performance. I assume there’s some optimizations the query planner can make when it’s been guaranteed record uniqueness.
Distinct is also easily explained to users, who are probably familiar with Excel’s “remove duplicate rows”.
It can also be great for exploring unfamiliar databases. I ask applicants to find stuff in a database they would never see by scrolling, and you’d be surprised how many don’t find it.
>less verbose
Well…
In any case, it depends. OP nicely guarded himself by writing “overusing”, so at that point his pro-tip is just a tautology and we are in agreement: not every use of DISTINCT is an immediate smell.
You could argue that you could fake a semijoin using

    SELECT DISTINCT * FROM t1 JOIN t2 USING (x);

but it doesn't give the same result if t1 has duplicate rows, or if there is more than one t2 matching t1. (You can try to fudge it by replacing * with something else, in which case the problem just moves around, since “duplicate rows” will mean something else.) A dedicated syntax like

    SELECT * FROM t1 SEMIJOIN t2 USING (x);

would sidestep that, although it creates some extra problems for the join optimizer.
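Absent such syntax, the usual way to get real semijoin semantics is an EXISTS subquery, sketched here with the same t1/t2/x names:

    -- Keeps each t1 row with its own multiplicity, and never duplicates
    -- a row no matter how many t2 rows match:
    SELECT *
    FROM t1
    WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.x = t1.x);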
Indeed, along that line, I would say that DISTINCT can be used to convey intent... and doing that in code is important.
- I want to know the zipcodes we have customers in - DISTINCT
- I want to know how many customers we have in each zipcode - aggregates
Can you do the first with the second? Sure, but the first makes it clear what your goal is.
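Side by side, with a hypothetical customer table:

    -- "Which zipcodes do we have customers in?"  (a set: DISTINCT)
    SELECT DISTINCT zipcode FROM customer;

    -- "How many customers in each zipcode?"      (an aggregate: GROUP BY)
    SELECT zipcode, COUNT(*) AS n_customers
    FROM customer
    GROUP BY zipcode;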
SOMEWHAT-DISTINCT with a fuzzy threshold would also be useful.
At scale, repeated low-cardinality columns matter a great deal.
    SELECT zipcode.rural_urban_code, COUNT(*) AS n_customer
    FROM customer
    INNER JOIN zipcode USING(zipcode)
    GROUP BY 1;
> I immediately become suspicious
All I read from that is, when DISTINCT is used, it's worth taking a look to make sure the person in question understands the data/query; and isn't just "fixing" a broken query with it. That doesn't mean it's wrong, but it's a "smell", a "flag" saying pay attention.
There are self-identifying "senior software engineers" that cannot understand what even an XOR is, even after you draw out the entire truth table, all four rows.
It never used to bug me as a junior dev, but once a peer pointed this out it became impossible for me to ignore.
When making a code change which touches a lot of places, it's not always obvious to "zoom out" and read the surrounding context to see if the structure of the code can be updated. The developer may be chewing through a grep list of a few dozen locations that need to be changed.
`if(X&IsFooMask != 0)`
:)
https://hackage.haskell.org/package/base-4.21.0.0/docs/Data-...
There are few other legitimate use cases of the regular `DISTINCT` that I have seen, other than the typical one-off `SELECT DISTINCT(foo) FROM bar`.
I'll test again, really the last time I tested that was two decades ago.
I'm curious, can you demo this?
Do you recall what the database server was?
I also tested this once years later when doing a Python app with sqlite. Similar result, but admittedly that was not a very big table to begin with.
I am meticulous with my database schemas, and periodically review my indexes and covering indexes. I'm no DBA, but I believe that the database is the only real value a codebase has, other than maybe a novel method here and there. So I put care into designing it properly and testing my assumptions.
But if you just have a LIMIT, then no - any RDBMS should stop as soon as it’s reached your requested limit.
"given a BTreeMap<String, Vec<String>>, how do I do .keys() and .len()".
The big justification for its design is to enable compiler optimizations (query planning) but compilers can optimize imperative code very well too, so I wonder if you could get the same benefits with a language that's less declarative.
Oh, you looked at the schema for t and it said x has a PRIMARY KEY or UNIQUE constraint?
Ah well, two minutes after you looked at the schema, Tom removed the UNIQUE constraint. Now you're scratching your head when you get duplicates.
SQL is a bag language, not a set language. The contract with relation t is that if the runtime can find the relation t and attribute x, it will return it. You may end up with rows or not, you may end up with duplicates or not, and the type of x may change between subsequent executions.
So if you want a set you need to say so using DISTINCT. At runtime the query planner will check the schema and if the attribute is UNIQUE or PRIMARY it will not have to do a deduplication.
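A minimal illustration (hypothetical users table):

    -- If users.email has a UNIQUE constraint, the planner can skip the
    -- deduplication step; if someone later drops the constraint, the
    -- query still returns a set instead of silently growing duplicates.
    SELECT DISTINCT email FROM users;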
In fact, IIRC, using DISTINCT (usually bad for performance, btw) is a piece of SQL advice from C.J. Date in https://www.oreilly.com/library/view/sql-and-relational/9781...
Certain languages, formats and tools do this correctly by default. For the others you need a source of truth that you generate from.
Though sure, it's known to negatively affect performance; in some database systems more than in others, I think?
In sqlite, the view definition will be automatically expanded and one of the columns in the output will automatically be distinguished with an alias. Which column name changes is dependent on the order of tables in the join. This can absolutely break code.
In postgres, the view columns are qualified at definition time so nothing changes immediately. But when the view definition gets updated you will get a failure in the DDL.
In any system, a large column can be added to one of the constituent tables and cause a performance problem. The best advice is to avoid these problems and never use "select *" in production code.
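A sketch of the explicit alternative (table and column names are hypothetical):

    -- Spelling out the columns means base-table schema changes can't
    -- silently rename, collide, or bloat the view's output:
    CREATE VIEW order_summary AS
    SELECT o.id, o.created_at, c.name AS customer_name
    FROM orders o
    JOIN customer c ON c.id = o.customer_id;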
This mirrors how adding additional fields to an object type in a programming language usually isn’t considered a breaking change, but changing the type of an existing field is.
> Schema evolution can break your view, which can have downstream effects
Select * is the problem itself in the face of schema evolution and things like name collision.
In a better language, this would be a pipeline. Pipelines are conceptually simple but annoying to debug, compared to putting intermediate results in a variable or file. Are there any debuggers that let you look at intermediate results of pipelines without modifying the code?
If you want to build a pipeline and store each intermediate result, most tooling will make that easy for you. E.g. in dbt, just put each subquery in its separate file, and the processing engine will correctly schedule each subresult after the other. Just make sure you have enough storage available, it's not uncommon for intermediate results to be hundreds of times larger than the end result (e.g. when you perform a full table join in the first CTE, and do target filtering in another).
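For illustration, the same idea as a single CTE pipeline (names hypothetical); each stage can be inspected by pointing the final SELECT at an earlier CTE, or split into its own dbt model:

    WITH joined AS (
      SELECT o.id, o.total, c.region
      FROM orders o
      JOIN customer c ON c.id = o.customer_id
    ),
    filtered AS (
      SELECT * FROM joined WHERE region = 'EU'
    )
    SELECT region, SUM(total) AS revenue
    FROM filtered
    GROUP BY region;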
In some languages, a series of assignments and a large expression will often compile to the same thing, but if written as assignments, it will make it easier to set breakpoints.
F# in the visual studio debugger does a pretty good job of this in recent versions.
Frankly, that sounds like one of those things that totally makes sense in the author’s head, but inconsiderately creates terrible code ergonomics and needless cognitive load for anyone reading it. You know to just ignore those expressions when you’re reading it because you wrote it and know they have no effect, but to a busy code reviewer, it’s annoying functionless clutter making their job more annoying. “Wait, that should do nothing… but does it actually do something hackish and ‘clever’ that they didn’t comment? Let’s think about this for a minute.” Use an editor with proper formatting capability, and don’t use executable expressions for formatting in code that other people look at.
It means you can copy-paste any where condition, because they are all of the form `AND <condition>`; otherwise one condition is of the form `WHERE <condition>`, so adding it to a query that already has a where clause is a bit more awkward.

I use this technique for some analytics queries that all rely on the same base table. It’s not uncommon to start with copying an old query and just adding or removing conditions and grouping/aggregating until I get the right data. Using this format also makes commenting out any condition trivial. I don’t see how you could achieve this result with just IDE formatting.

I've seen it used in dozens of places, in particular places that programmatically generate the AND parts of queries. I wasn't really that confused the first time I saw it and I was never confused any time after that.
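A sketch of the idiom under discussion, assuming the usual `1=1` placeholder:

    SELECT *
    FROM events
    WHERE 1=1
      AND event_type = 'click'
      -- AND user_id = 42          -- any condition can be commented out or pasted in
      AND created_at >= '2025-01-01';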
No, you ask the DB to EXPLAIN itself to you.
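For example, in PostgreSQL (other engines have their own variants of EXPLAIN):

    -- Shows the chosen plan plus actual run times and row counts:
    EXPLAIN ANALYZE
    SELECT * FROM orders WHERE customer_id = 42;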
I wrote a small tutorial (~9000 words in two parts) on how to design complicated queries so that they don't need DISTINCT and are basically correct by construction.
https://kb.databasedesignbook.com/posts/systematic-design-of...
Edit: it’s also actually a book!
Using != or NOT IN (...) is almost always going to be inefficient (but can be OK if other predicates have narrowed down the result set already).
Also, understand how your DB handles nulls. Are nulls and empty strings the same? Does null == null? Not all databases do this the same way.
Also in regards to indexing. The DBs I've used have not indexed nulls, so a "WHERE col IS NULL" is inefficient even though "col" is indexed.
If that is the case and you really need it, have a computed column with a char(1) or bit indicating if "col" is NULL or not, and index that.
If your business rules say that "not applicable" or "no entry" is a value, store a value that indicates that, don't use NULL.
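A sketch of that workaround in SQL Server syntax (orders and col are placeholders; other engines spell computed columns differently):

    -- Persisted computed flag that mirrors the NULL-ness of col:
    ALTER TABLE orders
      ADD col_is_null AS (CASE WHEN col IS NULL THEN 1 ELSE 0 END) PERSISTED;

    CREATE INDEX ix_orders_col_is_null ON orders (col_is_null);

    -- Filter on the indexed flag instead of "col IS NULL":
    SELECT * FROM orders WHERE col_is_null = 1;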
I guess you would handle it in the application and not in the query, right?
If you have a table of customers and some of them don't have addresses, it's standard to leave the address fields NULL. If some of them don't belong to a company, it's standard to leave the company_id field NULL.
This is literally what NULL is for. It's a special value precisely because missing data or a N/A field is so common.
If you're suggesting mandatory additional has_address and has_customer_id fields, I would disagree. You'd be reinventing a database tool that already exists precisely for that purpose.
Kinda. You need null for outer joins, but you could have a relational DBMS that prohibits nullable columns in tables. Christopher Date thought that in properly normalised designs, tables should never use nullable columns. Codd disagreed. [0]
> If you're suggesting mandatory additional has_address and has_customer_id fields, I would disagree. You'd be reinventing a database tool that already exists precisely for that purpose.
The way to do it without using a nullable column is to introduce another table for the 'optional' data, and use a left outer join.
[0] https://en.wikipedia.org/wiki/First_normal_form#Christopher_...
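A minimal sketch of that design (hypothetical tables):

    CREATE TABLE customer (
      id   INT PRIMARY KEY,
      name TEXT NOT NULL
    );

    -- At most one row per customer; absence of a row means "no address".
    CREATE TABLE customer_address (
      customer_id INT PRIMARY KEY REFERENCES customer(id),
      address     TEXT NOT NULL
    );

    -- NULL appears only as a product of the outer join, never in storage:
    SELECT c.name, a.address
    FROM customer c
    LEFT OUTER JOIN customer_address a ON a.customer_id = c.id;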
I mean, you could, but having separate tables for every optional field would be an organizational and usability nightmare. Queries would be longer and slower for no good reason. Not to mention a gigantic waste of space with all those repeated primary keys and their indexes.
And you could have databases that prohibited NULL values, but we mostly don't, because they're so useful.
I think this indicates that declaring and managing state is too onerous in SQL.
The idea that having a separate table for every optional field is too unworkable isn't an issue with SQL. It's a fundamentally overcomplicated idea. It's like a programming language where every variable that could be null had to be put in its own file. It would be terrible design.
I remember working on ERP systems with 40+ column tables, most of which were null. With no clear constraints on which options should or shouldn’t enable or make mandatory other options. This becomes incredibly obvious and natural when you group in additional tables.
My tables are incredibly concise and the cache loves this.
Those are rookie numbers. Add a zero to that number and we're talking.
And for us, for a good portion of the data, a considerable fraction of those fields will have values, and which fields those are varies between customers.
All except some key fields are NULL-able since the user can save and resume their work.
Just to display our main screen would require 100-150 joins using a separate table per optional.
I'm pretty sure the database would not love that.
Having a bunch of grouped optional values in other tables adds a ton of maintenance and query complexity.
The "clear constraints" belong in business logic, whether in triggers at the database level or before queries get executed at the application level.
Now, splitting up tables can produce performance optimizations. I'm not saying to never do it. But it's a tradeoff that increases complexity.
It sure is. Consider a database language that innately supported algebraic data types. Instead of a handful of nullable columns, you would have a single sum-typed value. This completely clarifies the relationships between the nullable columns (can both be null, or only one null?), and the database storage layer would manage how to actually store this. The current situation is a direct consequence of SQL's semantics and how it implements the relational calculus.

https://www.postgresql.org/docs/current/tutorial-inheritance...
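To make the contrast concrete, a sketch; the payment table is hypothetical, and the ADT declaration is invented syntax that no mainstream dialect supports:

    -- Today: the relationship between nullable columns is implicit.
    CREATE TABLE payment (
      id      INT PRIMARY KEY,
      card_no TEXT NULL,   -- set for card payments?
      iban    TEXT NULL    -- set for bank transfers? can both be NULL?
    );

    -- Hypothetical ADT syntax: exactly one variant, enforced by the engine.
    -- CREATE TABLE payment (
    --   id     INT PRIMARY KEY,
    --   method CHOICE ( card(card_no TEXT) | transfer(iban TEXT) )
    -- );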
Agreed that queries would tend be longer as you'd need joins, although views could help, especially for read operations.
Regarding storage-efficiency and query speed, agreed that it could well hurt both, but it's going to depend. If a column holds null in almost all rows, we would expect it to be more space-efficient to use a separate table and a left outer join. Query speed could also improve for queries that don't reference the nullable column, as the 'main' table would be smaller in storage. (I'm assuming a rowstore here.)
Translating status codes into English or some other natural language? That's better done in the application, not the database. Maybe even leave it to the frontend if you have one. As a rule of thumb, any transformation that does not affect which rows are returned can be applied in another layer after those rows have been returned. Just because you know SQL doesn't mean you have to do everything in SQL.
Deeply nested subqueries? You might want to split that up into simpler queries. There's nothing shameful about throwing three stones to kill three birds, as long as you don't fall into the 1+N pattern. Whoever has to maintain your code will thank you for not trying to be too clever.
Also, a series of simple queries often run faster than a single large query, because there's a limit to how well the query planner can optimize an excessively complicated statement. With proper use of transactions, you shouldn't have to worry about the data changing under your feet as you make these queries.
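A sketch of that pattern (hypothetical tables; in PostgreSQL, REPEATABLE READ gives all three statements one consistent snapshot):

    BEGIN ISOLATION LEVEL REPEATABLE READ;
    SELECT id FROM customer WHERE region = 'EU';          -- step 1: find the cohort
    SELECT * FROM orders  WHERE customer_id IN (1, 2, 3); -- step 2: ids from step 1
    SELECT * FROM refunds WHERE order_id IN (10, 11);     -- step 3: ids from step 2
    COMMIT;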
59 more comments available on Hacker News