Log Level 'error' Should Mean That Something Needs to Be Fixed
Key topics
The debate rages on: should log level 'error' strictly mean something's gone awry and needs fixing? Some argue that errors should signal defects that require immediate attention, while others point out that certain non-defect scenarios, like database timeouts or 4xx responses from downstream services, shouldn't be logged as errors. The discussion reveals a nuanced consensus: application logic shouldn't log errors unless it terminates immediately after, and error-level log entries should indicate a defect or unexpected condition that warrants investigation. As commenters hash out the details, it becomes clear that context is king – the same error might be a non-issue in one org, but a critical problem in another.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussionFirst comment
3d
Peak period
140
72-84h
Avg / period
40
Based on 160 loaded comments
Key moments
- 01Story posted
Dec 17, 2025 at 7:08 AM EST
17 days ago
Step 01 - 02First comment
Dec 20, 2025 at 10:57 AM EST
3d after posting
Step 02 - 03Peak activity
140 comments in 72-84h
Hottest window of the conversation
Step 03 - 04Latest activity
Dec 22, 2025 at 9:52 AM EST
12 days ago
Step 04
Generating AI Summary...
Analyzing up to 500 comments to identify key contributors and discussion patterns
Want the full context?
Jump to the original sources
Read the primary article or dive into the live Hacker News thread when you're ready.
* Database timeout (the database is owned by a separate oncall rotation that has alerts when this happens)
* ISE in downstream service (return HTTP 5xx and increment a metric but don’t emit an error log)
* Network error
* Downstream service overloaded
* Invalid request
Basically:
So people writing software are supposed to guess how your organization assigns responsibilities internally? And you're sure that the database timeout always happens because there's something wrong with the database, and never because something is wrong on your end?
Not OP, but this part hits the same for me.
In the case your client app is killing the DB through too many calls (e.g. your cache is not working) you should be able to detect it and react, without waiting for the DB team to come to you after they investigated the whole thing.
But you can't know in advance if the DB connection errors are your fault or not, so logging it to cover the worse case scenario (you're the cause) is sensible.
I feel you're thinking about system wide downtime with everything timing out consistently, which would be detected by the generic database server vitals and basic logs.
But what if the timeouts are sparse and only 10 or 20% more than usual from the DB POV, but it affects half of your registration services' requests ? You need it logged application side so the aggregation layer has any chance of catching it.
On writing to ERROR or not, the hresholds should be whatever your dev and oncall teams decides. Nobody outside of them will care, I feel it's like arguing which drawer the socks should go.
I was in an org where any single error below CRITICAL was ignored by the oncall team , and everything below that only triggered alerts on aggregation or special conditions. Pragmatically, we ended up slicing it as ERROR=goes to the APM, anything below=no aggregation, just available when a human wants to look at it for whatever reason. I'd expect most orgs to come with that kind of split, where the levels are hooked to processes, and not some base meaning.
1. As with exceptions vs. return values in code, the low-level code often doesn't know how the higher-level caller will consider a particular error. A low-level error may or may not be a high-level error; the low-level code can't know that, but the low-level code is the one doing the logging. The low-level logging might even be a third party library. This is particularly tricky when code reuse enters the picture: the same error might be "page the on-call immediately" level for one consumer, but "ignore, this is expected" for another consumer.
2. When I personally see database timeouts at work it's never the database's fault, 99 times out of 100 it's the caller's fault for their crappy query. How is the code supposed to know? I log timeouts as errors so someone looks at it and we get a stack trace leading me to the bad query.
In the general case, the person writing the software has no way of knowing that "the database is owned by a separate oncall rotation". That's about your organization chart.
Admittedly, they'd be justified in assuming that somebody is paying attention to the database. On the other hand, they really can't be sure that the database is going to report anything useful to anybody at all, or whether it's going to report the salient details. The database may not even know that the request was ever made. Maybe the requests are timing out because they never got there. And definitely maybe the requests are timing out because you're sending too many of them.
I mean, no, it doesn't make sense to log a million identical messages, but that's rate limiting. It's still an error if you can't access your database, and for all you know it's an error that your admin will have to fix.
As for metrics, I tend to see those as downstream of logs. You compute the metric by counting the log messages. And a metric can't say "this particular query failed". The ideal "database timeout" message should give the exact operation that timed out.
My rough guess is that 75% of incidents on internal services were only reported by service consumers (humans posting in channels) across everywhere I’ve worked. Of the remaining 25% that were detected by monitoring, the vast majority were detected long after consumers started seeing errors.
All the RCAs and “add more monitoring” sprints in the world can’t add accountability equivalent to “customers start calling you/having tantrums on Twitter within 30sec of a GSO”, in other words.
The corollary is “internal databases/backend services can be more technically important to the proper functioning of your business, but frontends/edge APIs/consumers of those backend services are more observably important by other people. As a result, edge services’ users often provide more valuable telemetry than backend monitoring.”
In practice, drivers don’t obey the speed limit and backend service operators don’t prioritize monitoring. As a result, it’s a good idea to wear seatbelts and treat downstream failures as urgent errors in the logs of consuming services.
Maybe or maybe not. If the connection problem is really due to the remote host then that's not the problem of the sender. But maybe the local network interface is down, maybe there's a local firewall rule blocking it,...
If you know the deployment scenario then you can make reasonable decisions on logging levels but quite often code is generic and can be deployed in multiple configurations so that's hard to do
But if you are the SMTP library and that you unilaterally log that as an error. That is an issue.
At the end of the day, this is just bikeshedding about how to collapse ultra specific alerting levels into a few generic ones. E.g. RFC 5424 defines 8 separate log levels for syslog and, while that's not a ceiling by any means, it's easy to see how there's not really going to be a universally agreed way to collapse even just these 4 categories.
There’s a whole industry of “we’ll manage them for you” which is just enabling dysfunction.
The closest we have is something like Java with exceptions in type signatures, but we would have to ban any kind of exception capture except from final programs, and promote basically any logger call int an exception that you could remotely suppress.
We could philosophize about a world with compilers made out of unobtanium - but in this reality a library author cannot know what conditions are fixable or necessitate a fix or not. And structured logging lacks has way too many deficiencies to make it work from that angle.
- An error is an event that someone should act on. Not necessarily you. But if it's not an event that ever needs the attention of a person then the severity is less than an error.
Examples: Invalid credentials. HTTP 404 - Not Found, HTTP 403 Forbidden, (all of the HTTP 400s, by definition)
It's not my problem as a site owner if one of my users entered the wrong URL or typed their password wrong, but it's somebody's problem.
A warning is something that A) a person would likely want to know and B) wouldn't necessarily need to act on
INFO is for something a person would likely want to know and unlikely needs action
DEBUG is for something likely to be helpful
TRACE is for just about anything that happens
EMERG/CRIT are for significant errors of immediate impact
PANIC the sky is falling, I hope you have good running shoes
Some of these things can be ameliorated with well-behaved UI code, but a lot cannot, and if your primary product is the API, then you're just going to have scads of ERRORs to triage where there's literally nothing you can do.
I'd argue that anything that starts with a 4 is an INFO, and if you really wanted to be through, you could set up an alert on the frequency of these errors to help you identify if there's a broad problem.
In other words, of course you don't alert on errors which are likely somebody else's problem. You put them in the log stream where that makes sense and can be treated accordingly.
Personally, I'd further qualify that. It should be logged as an error if the person who reads the logs would be responsible for fixing it.
Suppose you run a photo gallery web site. If a user uploads a corrupt JPEG, and the server detects that it's corrupt and rejects it, then someone needs to do something, but from the point of view of the person who runs the web site, the web site behaved correctly. It can't control whether people's JPEGs are corrupt. So this shouldn't be categorized as an error in the server logs.
But if you let users upload a batch of JPEG files (say a ZIP file full of them), you might produce a log file for the user to view. And in that log file, it's appropriate to categorize it as an error.
You cannot accurately assign responsibility until you understand the problem.
4xx is for client side errors, 5xx is for server side errors.
For your situation you'd respond with an HTTP 400 "Bad Request" and not an HTTP 500 "Internal Server Error" because the problem was with the request not with the server.
That's exactly why you log it as a warning. People get warned all the time about the dangers of smoking. It's important that people be warned about smoking; these warnings save lives. People should pay attention to warnings, which let them know about worrisome concerns that should be heeded. But guess what? Everyone has a story about someone who smoked until they were 90 and died in a car accident. It is not an error that somebody is smoking. Other systems will make their own bloody decisions and firewalling you off might be one of them. That is normal.
What do you think a warning means?
eg. log level WARN, message "This error is...", but it then trips an error in monitoring and pages out.
Probably breaching multiple rules here around not parsing logs like that, etc. But it's cropped up so many times I get quite annoyed by it.
If your parsing, filtering, and monitoring setup parses strings that happen to correspond to log level names in positions other than that of log levels as having the semantics of log levels, then that's a parsing/filtering error, not a logging error.
I could live with 4
Error - alert me now.
Warning - examine these later,
Info - important context for investigations.
Debug - usually off in prod.
How do you know?
Expecting arbitrary services to be able to deal with absolutely any kind of failure in such a way that users never notice is deeply unrealistic.
In your example, "Error while serving file" would be a bad error message, "Failed to read file 'foo/bar.html'" would be acceptable, and "Failed to read file 'foo/bar.html' due to EIO: Underlying device error (disk failure, I/O bus error)." would be perfect (assuming the http server has access to the underlying error produced by the read operation).
The first error message was "No usable public key found, which is required for the deployment" which doesn't tell me what I have to do to correct the problem. There are other examples and discussion of what they should say in the link.
Edit: Here's another one that I filed: https://github.com/containers/podman/issues/20775
At work, I can think of cases where we error when data mismatches between two systems. It’s almost always the fault of system B but we present the mismatch error neutrally. Experienced developers just know to fix B but we shouldn’t rely on that.
What is possible is to include as much information about what the system was trying to do. If there's an file IO error, include the the full path name. Saying "file not found" without saying which file was not found infuriates me like few other things.
If some required configuration option is not defined, include the name of the configuration option and from where it tried to find said configuration (config files, environment, registry etc). And include the detailed error message from the underlying system if any.
Regular users won't have a clue how to deal with most errors anyway, but by including details at least someone with some system knowledge has a chance of figuring out how to fix or work around the issue.
My 3D printer will try to walk you through basic fixes with pictures on the device's LCD panel, but for some errors it will display a QR code to their wiki which goes into a technical troubleshooting guide with complex instructions and tutorial videos.
https://web.mit.edu/jemorris/humor/500-miles
Or you can't connect because of a path MTU error.
Or because the TTL is set to low?
Your software at the server level has no idea what's going wrong at the network level, all you can send is some kind of network problem message.
https://docs.openstack.org/oslo.log/latest/user/guidelines.h...
FWIW, "ERROR: An error has occurred and an administrator should research the event." (vs WARNING: Indicates that there might be a systemic issue; potential predictive failure notice.)
If I have a daily cron job that is copying files to a remote location (e.g. backups), and the _operation_ fails because for some reason the destination is not writable.
Your suggestion would get me _both_ alerts, as I want; the article's suggestion would not alert me about the operation failing because, after all, it's not something happening in the local system, the local program is well configured, and it's "working as expected" because it doesn't need neither code nor configuration fixing.
So for example, a failure to parse JSON might be an error if you're responsible for generating that serialisation, but might be a warning if you're not.
This scenario may or may not yield in data/state loss, it may also be something that you, yourself can't immediately fix. And if it's temporary, what is the point of creating an issue and prioritizing.
I guess my point is that to any such categorization of errors or warnings there are way too many counter examples to be able to describe them like that.
So I'd usually think that Errors are something that I would heuristically want to quickly react to and investigate (e.g. being paged, while Warnings are something I would periodically check in (e.g. weekly).
That being said, I find tying the level to expected action a more useful way to classify them.
If for debugging, the levels seem relevant in the sense of how quickly we are able to use that information to understand what is going wrong. Out of potential sea of logs we want to see first what were the most likely culprits for something causing something to go wrong. So the higher the log level, the higher likelihood of this event causing something to go wrong.
If for alerting, they should reflect on how bad is this particular thing happening for the business and would help us to set a threshold for when we page or have to react to something.
Because what I'd want to know is how often does it fail, which is a metric not a log.
So expose <third party api failure rate> as a metric not a log.
If feeding logs into datadog or similar is the only way you're collecting metrics, then you aren't treating your observablity with the respect it deserves. Put in real counters so you're not just reacting to what catches your eye in the logs.
If the third party being down has a knock-on effect to your own system functionality / uptime, then it needs to be a warning or error, but you should also put in the backlog a ticket to de-couple your uptime from that third-party, be it retries, queues, or other mitigations ( alternate providers? ).
By implementing a retry you planned for that third party to be down, so it's just business as usual if it suceeds on retry.
How do you define uptime? What if e.g. it's a social login / data linking and that provider is down? You could have multiple logins and your own e-mail and password, but you still might lose users because the provider is down. How do you log that? Or do you only put it as a metric?
You can't always easily replace providers.
It’s not controversial; you just want something different. I want the opposite: I want to know why/how it fails; counting how often it does is secondary. I want a log that says "I sent this payload to this API and I got this error in return", so that later I can debug if my payload was problematic, and/or show it to the third party if they need it.
For service A, a 500 error may be common and you just need to try again, and a descriptive 400 error indicates the original request was actually handled. In these cases I'd log as a warning.
For service B, a 500 error may indicate the whole API is down, in which case I'd log a warning and not try any more requests for 5 minutes.
For service C, a 500 error may be an anomaly and treat it as hard error and log as error.
Also, you can't know how frequently you'll get 500s at the time you're doing integration, so you'll have to go back after some time to revisit log severities. Which doesn't sound optimal.
I do mean revisiting the log seventies as the behavior of the API becomes known. You start off treating every error as a hard error. As you learn the behavior of the API over time, you adjust the logging and error handling accordingly.
In theory some sort of internal API status tracking thing would be better that has some heuristic of is the API up or down and the error rate. It should warn you when the API is down and when it comes back up. Logging could still show an error or a warning for each request but you don’t need to get an email about each one.
Also, you should have event logs you can look to make administrative decisions. That information surely fits into those, you will want to know about it when deciding to switch to another provider or renegotiate something.
- info - when this was expected and system/process is prepared for that (like automatic retry, fallback to local copy, offline mode, event driven with persistent queue etc) - warning - when system/process was able to continue but in degraded manner, maybe leaving decision to retry to user or other part of system, or maybe just relying on someone checking logs for unexpected events, this of course depends if that external system is required for some action or in some way optional - error - when system/process is not able to continue and particular action has been stopped immediately, this includes situation where retry mechanism is not implemented for step required for completion of particular action - fatal - you need to restart something, either manually or by external watchdog, you don’t expect this kind of logs for simple 5xx
eg: 2.0 for "trace" / 1.0 for "debug" / 0.0 for "info" / -1.0 for "warn" / -2.0 for "error that can be handled"
What about conditions like "we absolutely knew this would happen regularly, but it's something that prevents the completion of the entire process which is absolutely critical to the organization"
The notion of an "error" is very context dependent. We usually use it to mean "can not proceed with action that is required for the successful completion of this task"
Crashing your web stack because one route hit an error is a dumb idea.
And no, calling it a warning is also dumb idea. It is an error.
This article is a navel gazing expedition.
They're kind of right but you can turn any warning into an error and vice versa depending on business needs that outweigh the technical categorisation.
God doesn't tell you the future so good luck figuring out which logs you really need.
For me logs should complement metrics, and can in many instances be replaced by tracing if the spans are annotated sufficiently. But making metrics out of logs is both costly and a bit brittle.
On the other hand it could be that the API had changed slightly. Say they for some reason decided to rename the input parameter postcode to postal_code and I didn’t change my code to fix this. This is 100% a programming error that would be classified as critical but I would not want to panic() the entire server process over it. I just want an alert that hey there is a programming error, go fix it.
But what could also happen is that when I try to construct a request for the external API and the OS is out of memory. Then I want to just crash the process and rely on automatic process restarts to bring it back up. BTW logging an error after malloc() returns NULL needs to be done carefully since you cannot allocate more memory for things like a new log string.
For example, when a process implies a conversion according to the contract/convention, but we know that this conversion may be not the expected result and the input may be based on semantic misconceptions. E.g., assemblers and contextually truncated values for operands: while there's no issue with the grammar or syntax or intrinsic semantics, a higher level misconception may be involved (e.g., regarding address modes), resulting in a correct but still non-functional output. So, "In this individual case, there may be or may be not an issue. Please, check. (Not resolvable on our end.)"
(Disclaimer: I know that this is a very much classic computing and that this is now mostly moved to the global TOS, but still, it's the classic example for a warning.)
if we're talking about logs from our own applications that we have written, the program should know because we can write it in a way that it knows.
user-defined config should be verified before it is used. make a ping to port 25 to see if it works before you start using that config for actual operation. if it fails the verification step, that's not an error that needs to be logged.
i'm following the author's example that an SMTP connection error is something you want to investigate and fix. if you have a different system with different assumptions where your response to a mailserver being unreachable is to ignore it, obviously that example doesn't apply for you. i'm not saying, and i don't think the author is saying that SMTP errors should always or never be logged as errors.
So it's natural for error messages to be expected, as you progressively add and then clear up edge cases.
This is a nontrivial problem when using properly modularized code and libraries that perform logging. They can’t tell whether their operational error is also a program-level error, which can depend on usage context, but still want to log the operational error themselves, in order to provide the details that aren’t accessible to higher-level code. This lower-level logging has to choose some status.
Should only “top-level” code ever log an error? That can make it difficult to identify the low-level root causes of some program-level failure. It also can hamper modularization, because it means you can’t repackage one program’s high-level code as a library for use by other programs, without somehow factoring out the logging code again.
Consider a Smalltalk-like system that doesn’t distinguish between programs and libraries regarding invocation mechanism. How would you handle logging in that case?
It’s very reasonable that a logging framework should allow higher levels to adjust how logging at lower levels is recorded. But saying that libraries should only log debug is not. It’s very legitimate for a library to log “this looks like a problem to me”.
Usually a program is being used by the user to accomplish something, and if logging is meaningful than either in a cli context or a server context. In both cases, errors are more often being seen by people/users than by code. Therefore printing them to logs make sense.
While a lib is being used by a program. So it has a better way to communicate problems with the caller (and exceptions, error values, choose the poison of your language). But I almost never want a library to start logging shit because it’s almost guaranteed to not follow the same conventions as I do in my program elsewhere. Return me the error and let me handle.
It’s analogous to how Go has an implicit rule of that a library should never let a panic occur outside the library. Internally, fine. But at the package boundary, you should catch panics and return them as an error. You don’t know if the caller wants the app to die because it an error in your lib!
A library might also be used in multiple place, maybe deeply in a dependency stack, so the execution context (high level stack) matters more than which library got a failure.
So handling failures should stay in the hands of the developer calling the library and this should be a major constraint for API design.
Ultimately, though, it's important to be using a featureful logging framework (all the better if there's a "standard" one for your language or framework), so the end user can enable/disable different levels for different modules (including for your library).
(Even in C, though... errors should be surfaced as return values from functions causing the error, not just logged somewhere. Debug info, sure, have a registerable callback for that.)
I think creating the hooks is very close to just not doing anything here, if no one is going to use the hooks anyway then you might as well not have them.
The application owner should be able to adjust the contexts up or down.
As an example, Rust's logging ecosystem provides nice facilities for fine-grained tamping down of errors by crate (library) or module name. Other languages and logging libraries let you do this as well.
That capability just isn't adopted everywhere.
You don't want to throw because destroying someone's production isn't worth it. You don't want to silent continue in that state because realistically there's no way for application owner to understand what is happening and why.
One test case here is that your library has existed for a decade and was fast, but Java removed a method that let you make it fast, but you can still run slow without that API. Java has a flag that turns it back on a for a stop gap. How do you expect this to work in your model, you expect to have an onUnnecessarilySlow() callback already set up that all of your users have hooked up which is never invoked for a decade, and then once it actually happens you start calling it and expect it to do something at all sane in those systems?
Second example is all of the scenarios where you're some transitively used library for many users, it makes and callback strategy immediately not work if the person who needs to know about the situation and could take action is the application owner rather than the people writing library code which called you. It would require every library to offer these same callbacks and transitively propagate things, which would only work if it was just such a firm idiomatic pattern in some language ecosystem and I don't believe that it is in any language ecosystem.
>but Java removed a method that let you make it fast, but you can still run slow without that API
I’d like to see an example of that, because this is extremely hypothetical scenario. I don’t think any library is so advanced to anticipate such scenarios and write something to log. And of course Java specifically has longer cycle of deprecation and removal. :)
As for your second example, let’s say library A is smart and can detect certain issues. Library B depending on it is at higher abstraction level, so it has enough business context to react on them. I don’t think it’s necessary to propagate the problem and leak implementation details in this scenario.
https://github.com/protocolbuffers/protobuf/issues/20760
Java 8 is still really popular, probably the most popular single version. It's not just servers in context, but also Android where Java 8 is the highest safe target, it's not clear what decade we'll be in when VarHandle would be safe to use there at all.
VarHandle was Java 9 but MemorySegment was Java 17. And the rest of FFM is only in 25 which is fully bleeding edge.
Protobuf may realistically try to move off of unsafe without the performance regressions without adopting MemorySegment to avoid the problem, but it will take significant and careful engineering time.
That said it's always possible to have waterfall of preferred implementations based on what's supported, it's just always an implementation/verification costs.
It’s very rarely realistic that a client would code up meaningful paths for every possible failure mode in a library. These callbacks are usually reserved for expected conditions.
Yes, that’s the point. You log it until you encounter it for the first time, then you know more and can do something meaningful. E.g. let’s say you build an API client and library offers callback for HTTP 429. You don’t expect it to happen, so just log the errors in a generic handler in client code, but then after some business logic change you hit 429 for the first time. If library offers you control over what is going to happen next, you may decide how exactly you will retry and what happens to your state in between the attempts. If library just logs and starts retry cycle, you may get a performance hit that will be harder to fix.
In an ideal world things like logs and alarms (alerting product support staff) should certainly cleanly separate things that are just informative, useful for the developer, and things that require some human intervention.
If you don't do this then it's like "the boy that cried wolf", and people will learn to ignore errors and alarms since you've trained them to understand that usually no action is needed. It's also useful to be able to grep though log files and distinguish failures of different categories, not just grep for specific failures.
Does it ?
Don't most stacks have an additional level of triaging logs to detect anomalies etc ? It can be your New relic/DataDog/Sentry or a self made filtering system, but nowadays I'd assume the base log levels are only a rough estimate of whether an single event has any chance of being problematic.
I'd bet the author also has strong opinions about http error codes, and while I empathize, those ships have long sailed.
A connection timed out, retrying in 30 secs? That's a warning. Gave up connecting after 5 failed attempts? Now that's an error.
I don't care so much if the origin of the error is within the program, or the system, or the network. If I can't get what I'm asking for, it can't be a mere warning.
116 more comments available on Hacker News