Log Level &#x27;error&#x27; Should Mean That Something Needs to Be Fixed

17 days ago

1 reply

If staging or the canary is logging errors, you block/abort the deploy. If it’s logging warnings, that’s normal.

17 days ago

Unless it is logging more warnings because your new code is failing somehow; maybe it stopped parsing the reply correctly from a "is this request rate limited" service so it is only returning 429 to callers never accepting work.

nlawalker

17 days ago

It seems that the easier rule of thumb, then, is that "application logic should never log an error on its own behalf unless it terminates immediately after", and that error-level log entries should only ever be generated from a higher-level context by something else that's monitoring for problems that the application code itself didn't anticipate.

17 days ago

3 replies

Yes. Examples of non-defects that should not be in the ERROR loglevel:

* Database timeout (the database is owned by a separate oncall rotation that has alerts when this happens)

* ISE in downstream service (return HTTP 5xx and increment a metric but don’t emit an error log)

* Network error

* Downstream service overloaded

* Invalid request

Basically:

    logfunc = logger.error if 400 <= status <= 499 and status != 429 else logger.warning

jonathrg

17 days ago

1 reply

4xx is for invalid requests. You wouldn't log a 404 as an error

17 days ago

2 replies

I’m talking about codes you receive from services you call out to.

jonathrg

17 days ago

1 reply

Oh that makes sense.

17 days ago

There are still some special cases, because 404 is used for both “There’s no endpoint with that name” and “There’s no record with the ID you tried to look up.”

17 days ago

What if user sends some sort of auth token or other type of data that you yourself can't validate and third party gives you 4xx for it? You won't know ahead of time whether that token or data is valid, only after making a request to the third party.

Hizonner

17 days ago

1 reply

> Database timeout (the database is owned by a separate oncall rotation that has alerts when this happens)

So people writing software are supposed to guess how your organization assigns responsibilities internally? And you're sure that the database timeout always happens because there's something wrong with the database, and never because something is wrong on your end?

17 days ago

3 replies

No; I’m not understanding your point about guessing. Could you restate?

makeitdouble

17 days ago

1 reply

> the database is owned by a separate oncall rotation

Not OP, but this part hits the same for me.

In the case your client app is killing the DB through too many calls (e.g. your cache is not working) you should be able to detect it and react, without waiting for the DB team to come to you after they investigated the whole thing.

But you can't know in advance if the DB connection errors are your fault or not, so logging it to cover the worse case scenario (you're the cause) is sensible.

17 days ago

1 reply

I agree that you should detect this, just through a metric rather than putting DB timeouts in the ERROR loglevel.

makeitdouble

17 days ago

But what's the base of your metric ?

I feel you're thinking about system wide downtime with everything timing out consistently, which would be detected by the generic database server vitals and basic logs.

But what if the timeouts are sparse and only 10 or 20% more than usual from the DB POV, but it affects half of your registration services' requests ? You need it logged application side so the aggregation layer has any chance of catching it.

On writing to ERROR or not, the hresholds should be whatever your dev and oncall teams decides. Nobody outside of them will care, I feel it's like arguing which drawer the socks should go.

I was in an org where any single error below CRITICAL was ignored by the oncall team , and everything below that only triggered alerts on aggregation or special conditions. Pragmatically, we ended up slicing it as ERROR=goes to the APM, anything below=no aggregation, just available when a human wants to look at it for whatever reason. I'd expect most orgs to come with that kind of split, where the levels are hooked to processes, and not some base meaning.

electroly

17 days ago

1 reply

I think OP is making two separate but related points, a general point and a specific point:

1. As with exceptions vs. return values in code, the low-level code often doesn't know how the higher-level caller will consider a particular error. A low-level error may or may not be a high-level error; the low-level code can't know that, but the low-level code is the one doing the logging. The low-level logging might even be a third party library. This is particularly tricky when code reuse enters the picture: the same error might be "page the on-call immediately" level for one consumer, but "ignore, this is expected" for another consumer.

2. When I personally see database timeouts at work it's never the database's fault, 99 times out of 100 it's the caller's fault for their crappy query. How is the code supposed to know? I log timeouts as errors so someone looks at it and we get a stack trace leading me to the bad query.

17 days ago

Also everywhere I have worked there are transient network glitches from time to time. Timeout can often be caused by these.

Hizonner

17 days ago

> No; I’m not understanding your point about guessing. Could you restate?

In the general case, the person writing the software has no way of knowing that "the database is owned by a separate oncall rotation". That's about your organization chart.

Admittedly, they'd be justified in assuming that somebody is paying attention to the database. On the other hand, they really can't be sure that the database is going to report anything useful to anybody at all, or whether it's going to report the salient details. The database may not even know that the request was ever made. Maybe the requests are timing out because they never got there. And definitely maybe the requests are timing out because you're sending too many of them.

I mean, no, it doesn't make sense to log a million identical messages, but that's rate limiting. It's still an error if you can't access your database, and for all you know it's an error that your admin will have to fix.

As for metrics, I tend to see those as downstream of logs. You compute the metric by counting the log messages. And a metric can't say "this particular query failed". The ideal "database timeout" message should give the exact operation that timed out.

zbentley

17 days ago

1 reply

I wish I lived in a world where that worked. Instead, I live in a world where most downstream service issues (including database failures, network routing misconfigurations, giant cloud provider downtime, and more ordinary internal service downtime) are observed in the error logs of consuming services long before they’re detected by the owners of the downstream service … if they ever are.

My rough guess is that 75% of incidents on internal services were only reported by service consumers (humans posting in channels) across everywhere I’ve worked. Of the remaining 25% that were detected by monitoring, the vast majority were detected long after consumers started seeing errors.

All the RCAs and “add more monitoring” sprints in the world can’t add accountability equivalent to “customers start calling you/having tantrums on Twitter within 30sec of a GSO”, in other words.

The corollary is “internal databases/backend services can be more technically important to the proper functioning of your business, but frontends/edge APIs/consumers of those backend services are more observably important by other people. As a result, edge services’ users often provide more valuable telemetry than backend monitoring.”

17 days ago

1 reply

But everything you’re describing can be done with metrics and alerts; there’s no need to spam the ERROR loglevel.

zbentley

17 days ago

My point is that just because those problems can be solved with better telemetry doesn’t mean that is actually done in practice.

In practice, drivers don’t obey the speed limit and backend service operators don’t prioritize monitoring. As a result, it’s a good idea to wear seatbelts and treat downstream failures as urgent errors in the logs of consuming services.

mfuzzey

17 days ago

3 replies

I think it's difficult to say without knowing how the system is deployed and administered. "If a SMTP mailer trying to send email to somewhere logs 'cannot contact port 25 on <remote host>', that is not an error in the local system"

Maybe or maybe not. If the connection problem is really due to the remote host then that's not the problem of the sender. But maybe the local network interface is down, maybe there's a local firewall rule blocking it,...

If you know the deployment scenario then you can make reasonable decisions on logging levels but quite often code is generic and can be deployed in multiple configurations so that's hard to do

greatgib

17 days ago

2 replies

The point is that if your program itself take note of the error from the library it is ok. You, as the program owner, can decide what to do with it (error log or not).

But if you are the SMTP library and that you unilaterally log that as an error. That is an issue.

zamadatix

17 days ago

1 reply

The counterpoint made above is while what you describe is indeed the way the author likes to see it that doesn't explain why "an error is something which failed that the program was unable to fix automatically" is supposed to be any less valid a way to see it.

At the end of the day, this is just bikeshedding about how to collapse ultra specific alerting levels into a few generic ones. E.g. RFC 5424 defines 8 separate log levels for syslog and, while that's not a ceiling by any means, it's easy to see how there's not really going to be a universally agreed way to collapse even just these 4 categories.

hinkley

17 days ago

Any robust system isn’t going to rely on reading logs to figure out what to do about undelivered email anyway. If you’re doing logistics the failure to send an order confirmation needs to show up in your data model in some manner. Managing your application or business by logs is amateur hour.

There’s a whole industry of “we’ll manage them for you” which is just enabling dysfunction.

dminuoso

17 days ago

This would require a complete new ecosystem and likely new language where any degradation of code flow becomes communicatable in a standardized and fully documented fashion.

The closest we have is something like Java with exceptions in type signatures, but we would have to ban any kind of exception capture except from final programs, and promote basically any logger call int an exception that you could remotely suppress.

We could philosophize about a world with compilers made out of unobtanium - but in this reality a library author cannot know what conditions are fixable or necessitate a fix or not. And structured logging lacks has way too many deficiencies to make it work from that angle.

colechristensen

17 days ago

2 replies

How about this:

- An error is an event that someone should act on. Not necessarily you. But if it's not an event that ever needs the attention of a person then the severity is less than an error.

Examples: Invalid credentials. HTTP 404 - Not Found, HTTP 403 Forbidden, (all of the HTTP 400s, by definition)

It's not my problem as a site owner if one of my users entered the wrong URL or typed their password wrong, but it's somebody's problem.

A warning is something that A) a person would likely want to know and B) wouldn't necessarily need to act on

INFO is for something a person would likely want to know and unlikely needs action

DEBUG is for something likely to be helpful

TRACE is for just about anything that happens

EMERG/CRIT are for significant errors of immediate impact

PANIC the sky is falling, I hope you have good running shoes

DanHulton

17 days ago

2 replies

If you're logging and reporting on ERRORs for 400s, then your error triage log is going to be full of things like a user entering a password with insufficient complexity or trying to sign up with an email address that already exists in your system.

Some of these things can be ameliorated with well-behaved UI code, but a lot cannot, and if your primary product is the API, then you're just going to have scads of ERRORs to triage where there's literally nothing you can do.

I'd argue that anything that starts with a 4 is an INFO, and if you really wanted to be through, you could set up an alert on the frequency of these errors to help you identify if there's a broad problem.

colechristensen

17 days ago

You have HTTP logs tracked, you don't need to report them twice, once in the HTTP log and once on the backend. You're just effectively raising the error to the HTTP server and its logs are where the errors live. You don't alert on single HTTP 4xx errors because nobody does, you only raise on anomalous numbers of HTTP 4xx errors. You do alert on HTTP 5xx errors because as "Internal" http errors those are on you always.

In other words, of course you don't alert on errors which are likely somebody else's problem. You put them in the log stream where that makes sense and can be treated accordingly.

17 days ago

The frequency is important and so is the answer to "could we have done something different ourselves to make the request work". For example in credit card processing, if the remote network declines, then at first it seems like not your problem. But then it turns out for many BINs there are multiple choices for processing and you could add dynamic routing when one back end starts declining more than normal. Not a 5xx and not a fault in your process, but a chance to make your customer experience better.

adrianmonk

17 days ago

2 replies

> An error is an event that someone should act on. Not necessarily you.

Personally, I'd further qualify that. It should be logged as an error if the person who reads the logs would be responsible for fixing it.

Suppose you run a photo gallery web site. If a user uploads a corrupt JPEG, and the server detects that it's corrupt and rejects it, then someone needs to do something, but from the point of view of the person who runs the web site, the web site behaved correctly. It can't control whether people's JPEGs are corrupt. So this shouldn't be categorized as an error in the server logs.

But if you let users upload a batch of JPEG files (say a ZIP file full of them), you might produce a log file for the user to view. And in that log file, it's appropriate to categorize it as an error.

Arrowmaster

17 days ago

1 reply

Counter argument. How do you know the user uploaded a corrupted image and it didn't get corrupted by your internet connection, server hardware, or a bug in your software stack?

You cannot accurately assign responsibility until you understand the problem.

jeremyjh

16 days ago

This is just trolling. The JPEG is corrupt if the library that reads it says it is corrupt. You log it as a warning. If you upgrade the library or change your upstream reverse proxy, and starting getting 1000x the number of warnings, you can still recognize that and take action without personally inspecting each failed upload to be sure you haven't yet stumbled on the one edge case where the JPEG library is out of spec.

colechristensen

17 days ago

That's the difference between an HTTP 4xx and 5xx

4xx is for client side errors, 5xx is for server side errors.

For your situation you'd respond with an HTTP 400 "Bad Request" and not an HTTP 500 "Internal Server Error" because the problem was with the request not with the server.

solatic

17 days ago

> But maybe the local network interface is down, maybe there's a local firewall rule blocking it,...

That's exactly why you log it as a warning. People get warned all the time about the dangers of smoking. It's important that people be warned about smoking; these warnings save lives. People should pay attention to warnings, which let them know about worrisome concerns that should be heeded. But guess what? Everyone has a story about someone who smoked until they were 90 and died in a car accident. It is not an error that somebody is smoking. Other systems will make their own bloody decisions and firewalling you off might be one of them. That is normal.

What do you think a warning means?

alexwasserman

17 days ago

2 replies

I have been particularly irritated in the past where people use a lower log level and include the higher log level string in the message, especially where it's then parsed, filtered, and alerted on my monitoring.

eg. log level WARN, message "This error is...", but it then trips an error in monitoring and pages out.

Probably breaching multiple rules here around not parsing logs like that, etc. But it's cropped up so many times I get quite annoyed by it.

jonathrg

17 days ago

Stuff like that is a good argument for using structured logging, but even if you are just parsing text logs, surely you can make the parser be a bit more specific when retrieving the log level.

dragonwriter

17 days ago

> I have been particularly irritated in the past where people use a lower log level and include the higher log level string in the message, especially where it's then parsed, filtered, and alerted on my monitoring.

If your parsing, filtering, and monitoring setup parses strings that happen to correspond to log level names in positions other than that of log levels as having the semantics of log levels, then that's a parsing/filtering error, not a logging error.

vpribish

17 days ago

3 replies

I just started playing in the Erlang ecosystem and they have EIGHT levels of logging messages. it seems crazily over-specific, but they are the champions of robust systems.

I could live with 4

Error - alert me now.

Warning - examine these later,

Info - important context for investigations.

Debug - usually off in prod.

emmelaich

17 days ago

I need Notice (between Info and Warning), for important events such as start and shutdown, and successfully connecting to the database, and ready to start serving. These otherwise would be in Info; and enabling Info level produces a torrent of uninteresting muck.

regularfry

17 days ago

The eight levels in Erlang are inherited from syslog, rather than something specific to Erlang itself.

groundzeros2015

17 days ago

The first one should be crashing.

dnautics

17 days ago

1 reply

let's say you a bunch of database timeouts in a row. this might mean that nothing needs to be fixed. But also, the "thing that needs to be fixed" might be "the ethernet cable fell out the back of your server".

How do you know?

17 days ago

1 reply

You have an alert on what users actually care about, like the overall success rate. When it goes off, you check the WARNING log and metric dashboard and see that requests are timing out.

ImPostingOnHN

17 days ago

2 replies

That is a lagging indicator. By the time you're alerted, you've already failed by letting users experience an issue.

17 days ago

1 reply

What alternative would you propose? Page the oncall whenever there's a single query timeout?

dnautics

17 days ago

the alternative i propose is have deep understanding of your system before popping off with dumb one size fits all rules that don't make sense.

danaris

17 days ago

1 reply

Well, yes. If the cable falls out of the server (or there's a power outage, or a major DDoS attack, or whatever), your users are going to experience that before you are aware of it. Especially if it's in the middle of the night and you don't have an active night shift.

Expecting arbitrary services to be able to deal with absolutely any kind of failure in such a way that users never notice is deeply unrealistic.

https://web.mit.edu/jemorris/humor/500-miles

17 days ago

It continues to become more realistic with the passing of time.

rwmj

17 days ago

9 replies

And the second rule is make all your error messages actionable. By that I mean it should tell me what action to take to fix the error (even if that action means hard work, tell me what I have to do).

andoando

17 days ago

1 reply

Error: Possible race condition, rewrite codebase

morkalork

17 days ago

I have written out-of-band sanity checks that have caught race conditions. The recommendation is more like "<Thing> that should be locked, isn't. Check what was merged and deployed in the last 24h, someone ducked it up"

chongli

17 days ago

2 replies

Suppose I'm writing an http server and the error is caused by a flaky power supply causing the disk to lose power when the server attempts to read a file that's been requested. How is the http server supposed to diagnose this or any other hardware fault? Furthermore, why should it even be the http server's responsibility to know about hardware issues at all?

Copenjin

17 days ago

Some of these replies make me wonder if you have ever written any code at all, nonsensical example.

uniq7

17 days ago

The error doesn't need to be extremely specific or point to the actual root cause.

In your example, "Error while serving file" would be a bad error message, "Failed to read file 'foo/bar.html'" would be acceptable, and "Failed to read file 'foo/bar.html' due to EIO: Underlying device error (disk failure, I/O bus error)." would be perfect (assuming the http server has access to the underlying error produced by the read operation).

1123581321

17 days ago

2 replies

Can you please explain this? That sounds like identifying bugs but not fixing them but I realize you don’t mean that. One hopes the context information in the error will make it actionable when it occurs, never completely successfully, of course.

rwmj

17 days ago

1 reply

Here's an example of a bug that I filed about non-actionable error messages: https://github.com/karmab/kcli/issues/456

The first error message was "No usable public key found, which is required for the deployment" which doesn't tell me what I have to do to correct the problem. There are other examples and discussion of what they should say in the link.

Edit: Here's another one that I filed: https://github.com/containers/podman/issues/20775

1123581321

17 days ago

That makes sense and good examples; thanks.

At work, I can think of cases where we error when data mismatches between two systems. It’s almost always the fault of system B but we present the mismatch error neutrally. Experienced developers just know to fix B but we shouldn’t rely on that.

Copenjin

17 days ago

1 reply

You can hope that the person reading the context will always able to understand it like you would have. Bad assumption in my experience.

1123581321

17 days ago

Quite true. It can be a bad assumption when I’m the one trying to understand it weeks later. :)

hyperadvanced

17 days ago

This is just plain wrong, I vehemently disagree. What happens if a payment fails on my API, and today that means I need to go through a 20-step process with this pay provider, my database, etc. to correct that. But what’s worse is if this error happens 11,000 times and I run a script to do my 20 step process 11,000 times, but it turns out the error was raised in error. Additionally, because the error was so explicit about how to fix it, I didn’t talk to anyone. And of course, the suggested fix was out of date because docs lag vs. production software. Now I have 11,000 pissed off customers because I was trying to be helpful.

magicalhippo

17 days ago

This can be difficult or just not possible.

What is possible is to include as much information about what the system was trying to do. If there's an file IO error, include the the full path name. Saying "file not found" without saying which file was not found infuriates me like few other things.

If some required configuration option is not defined, include the name of the configuration option and from where it tried to find said configuration (config files, environment, registry etc). And include the detailed error message from the underlying system if any.

Regular users won't have a clue how to deal with most errors anyway, but by including details at least someone with some system knowledge has a chance of figuring out how to fix or work around the issue.

throw3e98

17 days ago

Maybe that makes sense for a single-machine application where you also control the hardware. But for a networked/distributed system, or software that runs on the user's hardware, the action might involve a decision tree, and a log line is a poor way to convey that. We use instrumentation, alerting and runbooks for that instead, with the runbooks linking into a hyperlinked set of articles.

My 3D printer will try to walk you through basic fixes with pictures on the device's LCD panel, but for some errors it will display a QR code to their wiki which goes into a technical troubleshooting guide with complex instructions and tutorial videos.

Copenjin

17 days ago

Exactly. Some applications keep running way after you have long gone.

pixl97

17 days ago

So what error do you put if the server is over 500 miles away?

Or you can't connect because of a path MTU error.

Or because the TTL is set to low?

Your software at the server level has no idea what's going wrong at the network level, all you can send is some kind of network problem message.

https://docs.openstack.org/oslo.log/latest/user/guidelines.h...

17 days ago

Also put the fucking data in the message that led to the decision to emit the logs. I can't remember how many times I have had a three part test trigger a log "blah: called with illegal parameters, shouldn't happen" and the illegal parameters were not logged.

jayofdoom

17 days ago

1 reply

In OpenStack, we explicitly document what our log levels mean; I think this is valuable from both an Operator and Developer perspective. If you're a new developer, without a sense of what log levels are for, it's very prescriptive and helpful. For an operator, it sets expectations.

FWIW, "ERROR: An error has occurred and an administrator should research the event." (vs WARNING: Indicates that there might be a systemic issue; potential predictive failure notice.)

quectophoton

17 days ago

1 reply

Thank you, this (and jillesvangurp's comment) sounds way more reasonable than the article's suggestion.

If I have a daily cron job that is copying files to a remote location (e.g. backups), and the _operation_ fails because for some reason the destination is not writable.

Your suggestion would get me _both_ alerts, as I want; the article's suggestion would not alert me about the operation failing because, after all, it's not something happening in the local system, the local program is well configured, and it's "working as expected" because it doesn't need neither code nor configuration fixing.

__turbobrew__

17 days ago

Agreed, I don’t get the OPs delineation between local and non-local error sources. If your code has a job to do it doesn’t matter if the error was local or non-local, the operator needs to know that the code is not doing its job. In the case of something like you cannot backup files to a remote you can try to contact the humans who own the remote or come up with an alternative backup mechanism.

eterm

17 days ago

6 replies

How I'd personally like to treat them:

  - Critical / Fatal:  Unrecoverable without human intervention, someone needs to get out of bed, now.
  - Error : Recoverable without human intervention, but not without data / state loss. Must be fixed asap. An assumption didn't hold.
  - Warning: Recoverable without intervention. Must have an issue created and prioritised. ( If business as usual, this could be downgrading to INFO. )

The main difference therefore between error and warning is, "We didn't think this could happen" vs "We thought this might happen".

So for example, a failure to parse JSON might be an error if you're responsible for generating that serialisation, but might be a warning if you're not.

17 days ago

5 replies

What if you are integrated to a third party app and it gives you 5xx once? What do you log it as, and let's say after a retry it is fine.

bqmjjx0kac

17 days ago

1 reply

I would log a warning when an attempt fails, and an error when the final attempt fails.

17 days ago

1 reply

You are not the OP, but I think I was trying to point out this example case in relation to their descriptions of Error/Warnings.

This scenario may or may not yield in data/state loss, it may also be something that you, yourself can't immediately fix. And if it's temporary, what is the point of creating an issue and prioritizing.

I guess my point is that to any such categorization of errors or warnings there are way too many counter examples to be able to describe them like that.

So I'd usually think that Errors are something that I would heuristically want to quickly react to and investigate (e.g. being paged, while Warnings are something I would periodically check in (e.g. weekly).

wredcoll

17 days ago

1 reply

Like so many things in this industry the point is establishing a shared meaning for all the humans involved, regardless of how uninvolved people think.

That being said, I find tying the level to expected action a more useful way to classify them.

17 days ago

But what I also see frequently is people trying to do the impossible and idealistic things because they read somewhere that something should mean X, when things are never so clearly cut, so either it is not such a simplistic issue and should be understood as not such a simple issue, or there might be a better more practical definition for it. We should first start from what are we using Logs for. Are we using those for debugging, or so we get alerted or both?

If for debugging, the levels seem relevant in the sense of how quickly we are able to use that information to understand what is going wrong. Out of potential sea of logs we want to see first what were the most likely culprits for something causing something to go wrong. So the higher the log level, the higher likelihood of this event causing something to go wrong.

If for alerting, they should reflect on how bad is this particular thing happening for the business and would help us to set a threshold for when we page or have to react to something.

eterm

17 days ago

3 replies

This might be controversial, but I'd say if it's fine after a retry, then it doesn't need a warning.

Because what I'd want to know is how often does it fail, which is a metric not a log.

So expose <third party api failure rate> as a metric not a log.

If feeding logs into datadog or similar is the only way you're collecting metrics, then you aren't treating your observablity with the respect it deserves. Put in real counters so you're not just reacting to what catches your eye in the logs.

If the third party being down has a knock-on effect to your own system functionality / uptime, then it needs to be a warning or error, but you should also put in the backlog a ticket to de-couple your uptime from that third-party, be it retries, queues, or other mitigations ( alternate providers? ).

By implementing a retry you planned for that third party to be down, so it's just business as usual if it suceeds on retry.

17 days ago

1 reply

> If the third party being down has a knock-on effect to your own system functionality / uptime, then it needs to be a warning or error, but you should also put in the backlog a ticket to de-couple your uptime from that third-party, be it retries, queues, or other mitigations ( alternate providers? ).

How do you define uptime? What if e.g. it's a social login / data linking and that provider is down? You could have multiple logins and your own e-mail and password, but you still might lose users because the provider is down. How do you log that? Or do you only put it as a metric?

You can't always easily replace providers.

17 days ago

You may log that or count failures in some metric, but the correct answer is to have a health check on third party service and an alert on that health check. Logs may help to understand the nature of the incident, but they are not the channel through which you are informed about such problems.

hamandcheese

17 days ago

My main gripe with metrics is that they are not easily discoverable like logs are. Even if you capture a list of all the metrics emitted from an application, they often have zero context and so the semantics are a bit hard to decipher.

hk__2

17 days ago

> This might be controversial, but I'd say if it's fine after a retry, then it doesn't need a warning. > > Because what I'd want to know is how often does it fail, which is a metric not a log.

It’s not controversial; you just want something different. I want the opposite: I want to know why/how it fails; counting how often it does is secondary. I want a log that says "I sent this payload to this API and I got this error in return", so that later I can debug if my payload was problematic, and/or show it to the third party if they need it.

cpburns2009

17 days ago

1 reply

It really depends on the third party service.

For service A, a 500 error may be common and you just need to try again, and a descriptive 400 error indicates the original request was actually handled. In these cases I'd log as a warning.

For service B, a 500 error may indicate the whole API is down, in which case I'd log a warning and not try any more requests for 5 minutes.

For service C, a 500 error may be an anomaly and treat it as hard error and log as error.

srdjanr

17 days ago

2 replies

What's the difference between B and C? API being down seems like an anomaly.

Also, you can't know how frequently you'll get 500s at the time you're doing integration, so you'll have to go back after some time to revisit log severities. Which doesn't sound optimal.

cpburns2009

15 days ago

I forgot to mention that for service B, the API being down is a common, daily occurrence and does not last long. The behavior of services A-C is from my real world experience.

I do mean revisiting the log seventies as the behavior of the API becomes known. You start off treating every error as a hard error. As you learn the behavior of the API over time, you adjust the logging and error handling accordingly.

IgorPartola

16 days ago

Exactly. What’s worse is that if you have something like a web service that calls an external API, when that API goes down your log is going to be littered with errors and possibly even tracebacks which is just noise. If you set up a simple “email me on error” kind of service you will get as many emails as there were user requests.

In theory some sort of internal API status tracking thing would be better that has some heuristic of is the API up or down and the error rate. It should warn you when the API is down and when it comes back up. Logging could still show an error or a warning for each request but you don’t need to get an email about each one.

marcosdumay

17 days ago

Well, the GPs criteria are quite good. But what you should actually do depends on a lot more things than the ones you wrote in your comment. It could be so irrelevant to only deserve a trace log, or so important to get a warning.

Also, you should have event logs you can look to make administrative decisions. That information surely fits into those, you will want to know about it when deciding to switch to another provider or renegotiate something.

kiicia

17 days ago

As always „it depends”

- info - when this was expected and system/process is prepared for that (like automatic retry, fallback to local copy, offline mode, event driven with persistent queue etc) - warning - when system/process was able to continue but in degraded manner, maybe leaving decision to retry to user or other part of system, or maybe just relying on someone checking logs for unexpected events, this of course depends if that external system is required for some action or in some way optional - error - when system/process is not able to continue and particular action has been stopped immediately, this includes situation where retry mechanism is not implemented for step required for completion of particular action - fatal - you need to restart something, either manually or by external watchdog, you don’t expect this kind of logs for simple 5xx

sysguest

17 days ago

1 reply

hmm maybe we need extra representation?

eg: 2.0 for "trace" / 1.0 for "debug" / 0.0 for "info" / -1.0 for "warn" / -2.0 for "error that can be handled"

wredcoll

17 days ago

1 reply

I said this elsewhere, but the point here is what the humans involved are supposed to do with this info. Do I literally get out of bed on an error log or do I grep for them once or twice a month?

17 days ago

You should never get out of bed on an error in the log. Logs are for retrospective analysis, health checks and metrics are for situational awareness, alerts are for waking people up.

RaftPeople

17 days ago

1 reply

> The main difference therefore between error and warning is, "We didn't think this could happen" vs "We thought this might happen".

What about conditions like "we absolutely knew this would happen regularly, but it's something that prevents the completion of the entire process which is absolutely critical to the organization"

The notion of an "error" is very context dependent. We usually use it to mean "can not proceed with action that is required for the successful completion of this task"

wizzwizz4

17 days ago

1 reply

Those conditions would be "Critical", no? The error vs warning distinction doesn't apply.

fhcuvyxu

17 days ago

1 reply

No, many applications need to be fault tolerant.

Crashing your web stack because one route hit an error is a dumb idea.

And no, calling it a warning is also dumb idea. It is an error.

This article is a navel gazing expedition.

They're kind of right but you can turn any warning into an error and vice versa depending on business needs that outweigh the technical categorisation.

wizzwizz4

16 days ago

1 reply

A log entry marked "CRITICAL" does not imply crashing the web stack.

fhcuvyxu

16 days ago

Right. Was thinking of fatal.

arwhatever

17 days ago

2 replies

I like to think of “warning” as something to alert on statistically, e.g. incorrect password attempt rate jumps from 0.4% of login attempts to 99%.

17 days ago

1 reply

This point is important - the value of a log is inextricably tied to its unlikelihood. Which depends on so many things in the context.

bluGill

16 days ago

The value of all logs is tied only to if there is a problem will it help you find and debug it. If you never do statistics that password log is useless. If you never encounter a problem where the log helps debug it was useless.

God doesn't tell you the future so good luck figuring out which logs you really need.

mzi

16 days ago

This sounds more like metrics than a log statement.

For me logs should complement metrics, and can in many instances be replaced by tracing if the spans are annotated sufficiently. But making metrics out of logs is both costly and a bit brittle.

p2detar

16 days ago

1 reply

[delayed]

IgorPartola

16 days ago

I write a lot of backend web code that often talks to external services. So for example the user wants to add a shipping address to their profile but the address verification API responds with a 500. That is an expected error: sometimes it can happen. I want to log it but I do not want a trace back or anything like that.

On the other hand it could be that the API had changed slightly. Say they for some reason decided to rename the input parameter postcode to postal_code and I didn’t change my code to fix this. This is 100% a programming error that would be classified as critical but I would not want to panic() the entire server process over it. I just want an alert that hey there is a programming error, go fix it.

But what could also happen is that when I try to construct a request for the external API and the OS is out of memory. Then I want to just crash the process and rely on automatic process restarts to bring it back up. BTW logging an error after malloc() returns NULL needs to be done carefully since you cannot allocate more memory for things like a new log string.

masswerk

17 days ago

Also, warnings for ambiguous results.

For example, when a process implies a conversion according to the contract/convention, but we know that this conversion may be not the expected result and the input may be based on semantic misconceptions. E.g., assemblers and contextually truncated values for operands: while there's no issue with the grammar or syntax or intrinsic semantics, a higher level misconception may be involved (e.g., regarding address modes), resulting in a correct but still non-functional output. So, "In this individual case, there may be or may be not an issue. Please, check. (Not resolvable on our end.)"

(Disclaimer: I know that this is a very much classic computing and that this is now mostly moved to the global TOS, but still, it's the classic example for a warning.)

AndroTux

17 days ago

3 replies

“cannot contact port 25 on <remote host>” may very well be a configuration error. How should the program know?

notatoad

17 days ago

2 replies

>How should the program know?

if we're talking about logs from our own applications that we have written, the program should know because we can write it in a way that it knows.

user-defined config should be verified before it is used. make a ping to port 25 to see if it works before you start using that config for actual operation. if it fails the verification step, that's not an error that needs to be logged.

1718627440

17 days ago

1 reply

So when the random error on a remote party happens at one time your system ignores it, bu when it happens at another time, it prevents the server from booting? That's a very brittle system.

notatoad

17 days ago

log level error prevents your server from booting? i'm pretty sure that's not how logging works.

tcpkump

17 days ago

1 reply

What about when the mail server endpoint has changed, and for whatever reason, this configuration wasn’t updated? This is a common scenario when dealing with legacy infrastructure in my experience.

notatoad

17 days ago

the whole point of the essay here is that you should make a distinction between errors that you care about and plan to fix, and errors that you don't care about and don't intend to do anything about. and if you don't intend to do anything about it, it shouldn't be logged as error.

i'm following the author's example that an SMTP connection error is something you want to investigate and fix. if you have a different system with different assumptions where your response to a mailserver being unreachable is to ignore it, obviously that example doesn't apply for you. i'm not saying, and i don't think the author is saying that SMTP errors should always or never be logged as errors.

HankB99

17 days ago

Would it make sense to consider anything that prevents a process from completing it's intended function an error? It seems like this message would fall into that category and, as you pointed out, could result from a local fault as well.

kijin

17 days ago

SMTP clients are designed to try again with exponential backoff. If the final attempt fails and your email gets bounced, now that's an error. Until then, it's just a delay, business as usual.

aqme28

17 days ago

1 reply

I agree with this take in a steady state, but the process of building software is just that-- it's a process.

So it's natural for error messages to be expected, as you progressively add and then clear up edge cases.

17 days ago

1 reply

Exactly: When you're building software, it has lots of defects (and, thus, error logging). When it's mature, it should have few defects, and thus few error logs, and each one that remains is a bug that should be fixed.

plorkyeran

17 days ago

1 reply

I don't understand why you seem to think you're disagreeing with the article? If you're producing a lot of error logs because you have bugs that you need to fix then you aren't violating the rule that an error log should mean that something needs to be fixed.

17 days ago

I couldn’t agree more with the article. What made you think I disagreed?

layer8

17 days ago

4 replies

> When implementing logging, it's important to distinguish between an error from the perspective of an individual operation and an error from the perspective of the overall program or system. Individual operations may well experience errors that are not error level log events for the overall program. You could say that an operation error is anything that prevents an operation from completing successfully, while a program level error is something that prevents the program as a whole from working right.

This is a nontrivial problem when using properly modularized code and libraries that perform logging. They can’t tell whether their operational error is also a program-level error, which can depend on usage context, but still want to log the operational error themselves, in order to provide the details that aren’t accessible to higher-level code. This lower-level logging has to choose some status.

Should only “top-level” code ever log an error? That can make it difficult to identify the low-level root causes of some program-level failure. It also can hamper modularization, because it means you can’t repackage one program’s high-level code as a library for use by other programs, without somehow factoring out the logging code again.

17 days ago

4 replies

Libraries should not log on levels above DEBUG, period. If there’s something worthy for reporting on higher levels, pass this information to client code, either as an event, or as an exception or error code.

layer8

17 days ago

3 replies

From a code modularization point of view, there shouldn’t really be much of a difference between programs and libraries. A program is just a library with a different calling convention. I like to structure programs such that their actual functionality could be reused as a library in another program. This is difficult to reconcile with libraries only logging on a debug level.

17 days ago

2 replies

The main difference is that library is not aware of the context of the execution of the code, so cannot decide, whether the problem is expected, recoverable or severe.

layer8

17 days ago

3 replies

The same is true for programs that are being invoked. The program only knows relative to its own purpose, and the same is again true for libraries. I don’t see the difference, other than, as already mentioned, the mechanism of program vs. library invocation.

Consider a Smalltalk-like system that doesn’t distinguish between programs and libraries regarding invocation mechanism. How would you handle logging in that case?

msteffen

17 days ago

Okay, but…most programs are written in Python or Rust or something, where invoking library functions is a lot safer, more ergonomic, more performant, and more common than spawning a subprocess and executing a program in it…

17 days ago

The purpose of a program and of a library is different and intent of the authors of the code is usually clear enough to make the distinction in context. Small composable programs aren’t interesting case here, they shouldn’t be verbose anyway even to justify multiple logging levels (it’s probably just set to on/off using a command line argument).

hrimfaxi

17 days ago

The mechanism of invocation is important. Most programs allow you to set the logging verbosity at invocation. Libraries may provide an interface to do so but their entry points tend to be more numerous.

dpark

17 days ago

And the program doesn’t know if the user is expecting failure, either. The library case is not actually much different.

It’s very reasonable that a logging framework should allow higher levels to adjust how logging at lower levels is recorded. But saying that libraries should only log debug is not. It’s very legitimate for a library to log “this looks like a problem to me”.

schrodinger

17 days ago

I see your point, but disagree on a practical level. Libraries are being used while you’re in “developer” mode, while programs are used in “user” mode (trying awkwardly to differentiate between _being_ a developer and currently developing code around that library.

Usually a program is being used by the user to accomplish something, and if logging is meaningful than either in a cli context or a server context. In both cases, errors are more often being seen by people/users than by code. Therefore printing them to logs make sense.

While a lib is being used by a program. So it has a better way to communicate problems with the caller (and exceptions, error values, choose the poison of your language). But I almost never want a library to start logging shit because it’s almost guaranteed to not follow the same conventions as I do in my program elsewhere. Return me the error and let me handle.

It’s analogous to how Go has an implicit rule of that a library should never let a panic occur outside the library. Internally, fine. But at the package boundary, you should catch panics and return them as an error. You don’t know if the caller wants the app to die because it an error in your lib!

0xbadcafebee

17 days ago

[delayed]

1718627440

17 days ago

1 reply

Why? Whats wrong with logging it and passing the log object to the caller? The caller can still modify the log entry however it pleases?

17 days ago

1 reply

Practicality. It is excessive for client code to calibrate library logging level. It’s ok to do it in logging configuration, but having an entry for every library there is also excessive. It is reasonable to expect that dev/staging may have base level at DEBUG and production will have base level at INFO, so that a library following the convention will not require extra effort to prevent log spam in production. Yes, we have entire logging industry around aggregation of terabytes of logs, with associated costs, but do you really need that? In other words, are we developers too lazy to adapt the sane logging policy, which actually requires minimum effort, and will just burn the company money for nothing?

dolmen

17 days ago

TLDR: I agree.

A library might also be used in multiple place, maybe deeply in a dependency stack, so the execution context (high level stack) matters more than which library got a failure.

So handling failures should stay in the hands of the developer calling the library and this should be a major constraint for API design.

kelnos

17 days ago

1 reply

Eh, as with anything there are always exceptions. I generally agree with WARN and ERROR, though I can imagine a few situations where it might be appropriate for a library to log at those levels. Especially for a warning, like a library might emit "WARN Foo not available; falling back to Bar" on initialization, or something like that. And I think a library is fine logging at INFO (and DEBUG) as much as it wants.

Ultimately, though, it's important to be using a featureful logging framework (all the better if there's a "standard" one for your language or framework), so the end user can enable/disable different levels for different modules (including for your library).

17 days ago

   WARN Foo not available; falling back to Bar

In server contexts this is usually unnecessary noise, if it’s not change detection. Of course, good logging framework will help you to mute irrelevant messages, but as I said in another comment here, it’s a matter of practicality. Library shouldn’t create extra effort for its users to fine-tune logging output, so it must use reasonable defaults.

17 days ago

I have a logging level I call "log lots" where it will log the first time with probability 1, but as it hits more often the same line, it will log with lower and lower probability bottoming out around 1/20000 times. Sort of a "log with probability proportional to the unlikiness of the event". So if I get e.g. sporadic failures to some back end, I will see them all, but if it goes down hard I will see it is still down but also be able to read other log msgs.

hinkley

17 days ago

2 replies

Log4j has the ability to filter log levels by subject matter for twenty years. In Java you end up having to use that a lot for this reason.

TZubiri

17 days ago

1 reply

Oh that library that gives you a write() wrapper in exchange for RCE vulns

17 days ago

Log4j is basically a design pattern. If you don’t like the library, Slf4j/logback are based on the same principles.

PartiallyTyped

17 days ago

Logging in rust also does that, you can set logging levels for individual modules deep within your dependency tree.

0x696C6961

17 days ago

4 replies

Libraries should not log, instead they should allow registering hooks which get called with errors and debug info.

kelnos

17 days ago

1 reply

I think this is useful for libraries in a language like C, where there is no standardized logging framework, so there's no way for the application to control what the library logs. But in a language (Java, Rust, etc.) where there are standard, widely-used logging frameworks that give people fine-grained control over what gets logged, libraries should just use those frameworks.

(Even in C, though... errors should be surfaced as return values from functions causing the error, not just logged somewhere. Debug info, sure, have a registerable callback for that.)

dolmen

17 days ago

Log4J style logging is effectively a hook system. But it is too easy to badly use it with too high level and delegate level fixing to the end user.

17 days ago

It doesn't seem to work this way in practice, not least because most libraries will be transitive deps of the application owner.

I think creating the hooks is very close to just not doing anything here, if no one is going to use the hooks anyway then you might as well not have them.

Blackthorn

17 days ago

Libraries should log in a way that is convenient to the developer rather than a way that is ideologically consistent. Oftentimes, that means logging as we know it.

17 days ago

They can log if platform permits, i.e. when you can set TRACE and DEBUG to no-op, but of course it should be done reasonably. Having hooks is often an overkill compared to this.

Too

17 days ago

3 replies

This is why it’s almost always wrong for library functions to log anything, even on ”errors”. Pass the status up through return values or exceptions. As a library author you have no clue as how an application might use it. Multi threading, retry loops and expected failures will turn what’s a significant event in one context into what’s not even worthy of a debug log in another. No rule without exceptions of course, one valid case could be for example truly slow operations where progress reports are expected. Modern tracing telemetry with sampling can be another solution for the paranoid.

echelon

17 days ago

1 reply

You need a tuple: (context, level)

The application owner should be able to adjust the contexts up or down.

As an example, Rust's logging ecosystem provides nice facilities for fine-grained tamping down of errors by crate (library) or module name. Other languages and logging libraries let you do this as well.

That capability just isn't adopted everywhere.

Izkata

17 days ago

Python's built-in logging is the same if used correctly, where the library gets a logger based on its module name (this part isn't enforced) and the application can add a handler to that logger to route the logs differently if needed.

17 days ago

1 reply

I think an example where libraries could sensibly log error is if you have a condition which is recoverable but may cause a significant slowdown, including a potential DoS issue, and the application owner can remediate.

You don't want to throw because destroying someone's production isn't worth it. You don't want to silent continue in that state because realistically there's no way for application owner to understand what is happening and why.

17 days ago

2 replies

In such scenarios it makes sense to give clients an opportunity to react on such conditions programmatically, so just logging is wrong choice and if there’s a call back to client, client can decide whether to log it and how.

17 days ago

1 reply

It's a nice idea but I've literally never seen it done, so I would be interested if you have examples of major libraries that do this. Abstractly it doesn't really seem to work to me in place of simple logs.

One test case here is that your library has existed for a decade and was fast, but Java removed a method that let you make it fast, but you can still run slow without that API. Java has a flag that turns it back on a for a stop gap. How do you expect this to work in your model, you expect to have an onUnnecessarilySlow() callback already set up that all of your users have hooked up which is never invoked for a decade, and then once it actually happens you start calling it and expect it to do something at all sane in those systems?

Second example is all of the scenarios where you're some transitively used library for many users, it makes and callback strategy immediately not work if the person who needs to know about the situation and could take action is the application owner rather than the people writing library code which called you. It would require every library to offer these same callbacks and transitively propagate things, which would only work if it was just such a firm idiomatic pattern in some language ecosystem and I don't believe that it is in any language ecosystem.

17 days ago

1 reply

> library has existed for a decade

>but Java removed a method that let you make it fast, but you can still run slow without that API

I’d like to see an example of that, because this is extremely hypothetical scenario. I don’t think any library is so advanced to anticipate such scenarios and write something to log. And of course Java specifically has longer cycle of deprecation and removal. :)

As for your second example, let’s say library A is smart and can detect certain issues. Library B depending on it is at higher abstraction level, so it has enough business context to react on them. I don’t think it’s necessary to propagate the problem and leak implementation details in this scenario.

https://github.com/protocolbuffers/protobuf/issues/20760

17 days ago

1 reply

Protobuf is the example I had in mind. It uses sun.misc.Unsafe which is being removed, but has a slow fallback. It logs a warning when it runs and the fallback path is being used and tells the application owner how they could opt back into the fast path if they care to:

vips7L

17 days ago

1 reply

What’s stopping you from using the replacements provided in VarHandle and MemorySegment? Just wanting to support the 10 year old JDK 8?

17 days ago

There's a lot here, to be honest these things always come back to investment cost and ROI compared to everything else that could be worked on.

Java 8 is still really popular, probably the most popular single version. It's not just servers in context, but also Android where Java 8 is the highest safe target, it's not clear what decade we'll be in when VarHandle would be safe to use there at all.

VarHandle was Java 9 but MemorySegment was Java 17. And the rest of FFM is only in 25 which is fully bleeding edge.

Protobuf may realistically try to move off of unsafe without the performance regressions without adopting MemorySegment to avoid the problem, but it will take significant and careful engineering time.

That said it's always possible to have waterfall of preferred implementations based on what's supported, it's just always an implementation/verification costs.

dpark

17 days ago

1 reply

I’ve written code that followed this model, but it almost always just maps to logging anyway, and the rest of the time it’s narrow options presented in the callback. e.g. Retry vs wait vs abort.

It’s very rarely realistic that a client would code up meaningful paths for every possible failure mode in a library. These callbacks are usually reserved for expected conditions.