Logging Sucks
Key topics
The frustrations of logging are in the spotlight, with a thought-provoking site "Logging sucks" sparking a lively debate. Commenters are divided on the site's single-purpose domain, with some calling it an ad, while others see it as a clever marketing move, potentially promoting a service through its "personalized report" form. Amidst the discussion, a consensus emerges around Charity Majors, a pioneer in observability, with many praising her work, including her book, talks, and tool, Honeycomb. As commenters weigh in, the thread feels relevant now, tapping into the ongoing quest for better logging and observability practices.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 27m after posting
- Peak period: 128 comments (0-6h)
- Avg / period: 22.9 comments
Based on 160 loaded comments
Key moments
1. Story posted: Dec 21, 2025 at 1:09 PM EST (12 days ago)
2. First comment: Dec 21, 2025 at 1:36 PM EST (27m after posting)
3. Peak activity: 128 comments in 0-6h, the hottest window of the conversation
4. Latest activity: Dec 23, 2025 at 5:33 PM EST (10 days ago)
Gonna go on a tangent here. Why the single-purpose domain? Especially since the author has a blog. My blog is full of links to single-post domains that no longer exist.
I do not see a product upsell anywhere.
If it's an ad for the author themselves, then it's a very good one.
Also worth pointing out that you can implement this method with a lot of tools these days. Both structured logs and traces lend themselves to capturing wide events. Just make sure to use a tool that supports general query patterns and has rich visualizations (time-series, histograms).
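A minimal sketch of what that can look like in practice: one structured, wide log line per request, using only Python's standard library. The field names and business attributes here are illustrative assumptions, not a prescribed schema.

```python
import json
import logging
import sys
import time

logging.basicConfig(stream=sys.stdout, format="%(message)s", level=logging.INFO)
log = logging.getLogger("wide_events")

def handle_request(request_id: str, user_id: str) -> None:
    # Accumulate context into one event instead of many scattered lines.
    event = {"request_id": request_id, "user.id": user_id}
    start = time.monotonic()
    try:
        # ... the actual work, enriching the event as it goes ...
        event["cart.items"] = 3               # illustrative business context
        event["payment.provider"] = "stripe"  # illustrative
        event["outcome"] = "success"
    except Exception as exc:
        event["outcome"] = "error"
        event["error.type"] = type(exc).__name__
        raise
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
        log.info(json.dumps(event))           # exactly one wide event per request
```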
I concur. In fact, I strongly recommend that anyone who has been working with observability tools or in the industry read her blog, and the backstory that led to Honeycomb. They were the first to recognize the value of this type of observability and have been a huge inspiration for many that came after.
- Software Sprawl, The Golden Path, and Scaling Teams With Agency: https://charity.wtf/2018/12/02/software-sprawl-the-golden-pa... - introduces the idea of the "golden path", where you tell engineers at your company that if they use the approved stack of e.g. PostgreSQL + Django + Redis then the ops team will support that for them, but if they want to go off path and use something like MongoDB they can do that but they'll be on the hook for ops themselves.
- Generative AI is not going to build your engineering team for you: https://stackoverflow.blog/2024/12/31/generative-ai-is-not-g... - why generative AI doesn't mean you should stop hiring junior programmers.
- I test in prod: https://increment.com/testing/i-test-in-production/ - on how modern distributed systems WILL have errors that only show up in production, hence why you need to have great instrumentation in place. "No pull request should ever be accepted unless the engineer can answer the question, “How will I know if this breaks?”"
- Advice for Engineering Managers Who Want to Climb the Ladder: https://charity.wtf/2022/06/13/advice-for-engineering-manage...
- The Engineer/Manager Pendulum: https://charity.wtf/2017/05/11/the-engineer-manager-pendulum... - I LOVE this one, it's about how it's OK to have a career where you swing back and forth between engineering management and being an "IC".
> Let’s start here: hiring engineers is not a process of “picking the best person for the job”. Hiring engineers is about composing teams. The smallest unit of software ownership is not the individual, it’s the team. Only teams can own, build, and maintain a corpus of software. It is inherently a collaborative, cooperative activity.
Right now, we are in a transitioning phase, where parts of a team might reject the notion of using AI, while others might be using it wisely, and still others might be auto-creating PRs without checking the output. These misalignments are a big problem in my view, and it’s hard to know (for anybody involved) during hiring what the stance really is because the latter group is often not honest about it.
With all due respect to her great writing, I think there’s a mix of revisionist history blended with PR claims going on in this thread. The blog has some good reading, but let’s not get ahead of ourselves in rewriting history around this one person/company.
I can only speak for myself. I worked for a company that is somewhere in the observability space (Sentry), and Charity was a person I looked up to my entire time working on Sentry: for how she ran the company, for the design they picked, and for the approaches they took. There might be others that have worked on wide events (after all, Honeycomb is famously inspired by Facebook's Scuba), but she is surely the voice that made the idea popular.
Yep, I'm a fan.
With all due respect to her other work, she most certainly did not coin the term “observability”. Observability has been a topic in multiple fields for a very long time and has had widespread usage in computing for decades.
I’m sure you meant well by your comment, but I doubt this is a claim she even makes for herself.
If a user request is hitting that many things, in my view, that is a deeply broken architecture.
Whether we want it or not, a lot of modern software looks like that. I am also not a particular fan of building software this way, but it's a reality we're facing. In part it's because quite a few services that people used to build in-house are now outsourced to PaaS solutions. Even basic things such as authentication are more and more moving to third parties.
Yes. Most software is bad.
The incentives between managers and technicians are all wrong.
Bad software is more profitable, over the time frames managers care about, than good software.
Fighting complexity is deeply unpopular.
Fighting complexity is literally the job of a computer programmer.
It is a hard job, and made much harder by the (usual) disconnect between management and us.
Things can add up quickly. I wouldn't be surprised if some requests touch a lot of bases.
Here's an example: a user wants to start renting a bike from your public bike sharing service, using the app on their phone.
This could be an app developed by the bike sharing company itself, or a 3rd party app that bundles mobility options like ride sharing and public transport tickets in one place.
You need to authenticate the request and figure out which customer account is making it. Is the account allowed to start a ride? They might be blocked. They might need to confirm the rules first. Is this ride part of a group ride, and is the customer allowed to start multiple rides at once? Let's also take a small deposit by putting a hold on their credit card. Or are they a reliable customer? Then let's not bother them. Or is there a fraud risk? And do we need to trigger special code paths to work around known problems with payment authorization for cards issued by this bank?
Everything good so far? Then let's start the ride.
First, let's lock in the necessary data. Which rental pricing did the customer agree to? Is that actually available to this customer, in this geographical zone, for this bike, at this time, or do we need to abort with an error? Otherwise, let's remember it, so we can calculate the correct rental fee at the end.
We normally charge an unlock fee in addition to the per-minute price. Are we doing that in this case? If yes, does the customer have any free unlock credit that we need to consume or reserve now, so that the app can correctly show unlock costs if the user wants to start another group ride before this one ends?
Ok, let's unlock the bike and turn on the electric motor. We need to make sure it's ready to be used and talk to the IoT box on the bike, taking into account the kind of bike, kind of box and software version. Maybe this is a multistep process, because the particular lock needs manual action by the customer. The IoT box might have to know that we're in a zone where we throttle the max speed more than usual.
Now let's inform some downstream data aggregators that a ride started successfully. BI (business intelligence) will want to know, and the city might also require us to report this to them. The customer was referred by a friend, and this is their first ride, so now the friend gets his referral bonus in the form of app credit.
Did we charge a non-refundable unlock fee? We might want to invoice that already (for whatever reason; otherwise this will happen after the ride). Let's record the revenue, create the invoice data and the PDF, email it, and report it to the country's tax agency, because that's required in the country this ride is starting in.
Or did things go wrong? Is the vehicle broken? Gotta mark it for service to swing by, and let's undo any payment holds. Or did the deposit fail, because the credit card is marked as stolen? Maybe block the customer and see if we have other recent payments using the same card fingerprint that we might want to proactively refund.
That's just off the top of my head, there may be more for a real life case. Some of these may happen synchronously, others may hit a queue or event bus. The point is, they are all tied to a single request.
So, depending on how you cut things, you might need several services that you can deploy and develop independently.
- auth
- core customer management, permissions, ToS agreement
- pricing
- geo zone definitions
- zone rules
- benefit programs
- payments and payment provider integration
- app credits
- fraud handling
- ride management
- vehicle management
- IoT integration
- invoicing
- emails
- BI integration
- city hall integration
- tax authority integration
- and an API gateway that fronts the app request
These do not have to be separate services, but they are separate enough to warrant it. They wouldn't be exactly micro either.
Not every product will be this complicated, but it's also not that out there, I think.
All of this arises from your failure to question this basic assumption though, doesn't it?
Haha, no. "All of this" is a scenario I consider quite realistic in terms of what needs to happen. The question is, how should you split this up, if at all?
Mind that these concerns will be involved in other ways with other requests, serving customers and internal users. There are enough different concerns at different levels of abstraction that you might need different domain experts to develop and maintain them, maybe using different programming languages, depending on who you can get. There will definitely be multiple teams. It may be beneficial to deploy and scale some functions independently; they have different load and availability requirements.
Of course you can slice things differently. Which assumptions have you questioned recently? I think you've been given some material. No need to be rude.
I'm open to questioning assumptions too, though, if you have any specific ones for me.
Of course you can keep everything together, in just very few large parts, or even a monolith. I've not said otherwise.
My point is that "architecture" is orthogonal to the question of "monolith vs separate services"; the difference there is not one of architecture, but of cohesion and flexibility.
If you do things right, even inside a monolith you will have things clearly separated into different concerns, with clean interfaces. There are natural service boundaries in your code. (If there aren't, in a system like this, you and the business are in for a world of pain.)
The idea is that you can put network IO between these service boundaries, to trade off cohesion and speed at these boundaries for flexibility between them, which can make the system easier to work with.
Different parts of your system will have different requirements, in terms of criticality, performance and availability; some need more compute, others do more IO, are busy at different times, talk to different special or less special databases. This means they may have different sweet spots for various trade-offs when developing and running them.
For example, you can (can!) use different languages to implement critical components or less critical ones, which gives you a bigger pool to hire competent developers from; competent as developers, but also in the respective business domain. This can help get your company off the ground.
(Your IoT and bike people are comfortable in Rust. Payments is doing Python, because they're used to waiting, and also they are the people you found who actually know not to use floats for money and all the other secrets.)
You can scale up one part of your system that needs fast compute without also paying for the part that needs a lot of memory, or some parts of your service can run on cheap spot instances, while others benefit from a more stable environment.
You can deploy your BI service without taking down everything when the new initialization code starts crash-looping.
(You recover quickly, but in the meantime a lot of your IoT boxes got lonely and are now trying to reconnect, which triggers a stampede on your monolith. You need to scale up quickly to keep the important functions running, but the invoicing code fetches a WSDL file from a slow government SOAP service, which is now down, and your cache entry's TTL expired, and you don't even need more invoicing right now... The point is, you have a big system, things happen, and fault lines between components are useful.)
It's a trade-off, in the end.
Do you need 15 services? You already have them. They're not even "micro", just each minding their own part of the business domain. But do they all need their own self-contained server? Probably not, but you might be better off with more than just one single monolith.
But I would not automatically raise an eyebrow on finding that somebody separated these whatever-teen services. I don't see that as a grievous error per se, but potentially as the result of valid decisions and trade-offs. The real job is to properly separate these concerns, whether they then live in a monolith or not.
And that's why that request may well touch so many services.
a. managing an external API+schema for each service
b. managing changes to each service, for example, smooth rollout of a change that impacts behavior across two services
c. error handling on the client side
d. error handling on the server side
e. added latency+compute because a step is crossing a network, being serialized/de-serialized on both ends
f. presuming the services use different databases, performance is now completely shot if you have a new business problem that crosses service boundaries. In practice, this will mean doing a "join" by making some API call to one service and then another API call to another service
In your description of the problem, there is nothing that I would want to split out into a separate service.
So what's the point?
I think the missing ingredient is scale: how much are you doing, and maybe also how quickly you got where you are.
The system does a lot. Even once in place, there's enough depth and surface to your business and operational concerns that something is always changing. You're going to need people to build, extend and maintain it. You will have multiple teams specializing in different parts of the system. Your monolith is carved into team territories, which are subdivided into quasi-autonomous regions with well-defined boundaries and interfaces.
Having separate services for different regions buys you flexibility in the chosen implementation language. This makes it easier to hire competent people, especially initially, when you need seasoned domain experts to get things started. It also matters later, when you may find it easier to find people to work on the glue-code parts of the system, where you can be more relaxed about language choice.
Being able to deploy and scale parts of your service separately can also be a benefit. As I said, things are busy, people check in a lot of code. Not having to redeploy and reinitialize the whole world every few minutes, just because some minor thing changed somewhere, is good. Not bringing everything down when something inevitably breaks is also nice. You need some critical parts to always be there, but a lot of your system can be gone for a while, no problem. Don't let those expendables take down your critical stuff. (Yes, failure modes shift; but there's a difference between having a priority-1 outage every day and having one much less frequently. That difference is also measured in developer health.)
About the databases: some of your data is big enough that you don't want to use joins anyway. They have a way of suddenly killing DB performance. Those who absolutely need that scale are on DynamoDB. Some others are still okay with a big Postgres instance, where the large tables are a little bit denormalized. (BI wants to do tons of joins, but they sit on their separate lake of data.) There's a lot of small fry that's locally very connected, and has some passing knowledge of the existence of some big, important business object, but crucially not its insides. If you get a new business concern, hopefully you cut your services and data around natural business domains, or you will need to do more engineering now. Just like in your monolith, you don't want any code to be able to join any two tables, because that would mean things are too messy to reason about the system anymore. Mind your foreign keys! In any case, if you need DynamoDB, you'll be facing similar problems in your monolith.
A nice side effect of separate services is that they resist an intermingling of concerns that must be actively prevented in monoliths. People love reaching into things they shouldn't. But that's a small upside against the many disadvantages.
Another small mitigating factor is that a lot of your services will be IO bound and make network requests anyway to perform their functions, the kind that makes the latency from your internal network hop much less of a trade-off.
It's all a trade-off. Don't spin off a service until you know why, and until you have a pretty good idea where to make a cut that's a good balance of contained complexity vs surface area.
Now, do you really need 15 different services? Probably not. But I could see how they could work together well, each of them taking care of some well-defined part of your business domain. There's enough meat there that I would not call things a mistake without a closer look.
This is by no means the only way to do things. All I wanted was to show that it can be a reasonable way. I hope the reasoning is clearer now.
As for the logging problem: it's not hard to have a standard way to hand around request IDs from your gateway, to be put in structured logs.
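A sketch of one way to do that in Python, assuming the gateway forwards its ID in an `X-Request-ID` header (the header name and field layout are assumptions): a `contextvars` variable plus a logging filter stamps the ID onto every line without threading it through every call.

```python
import logging
from contextvars import ContextVar

request_id: ContextVar[str] = ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Stamp the current request ID onto every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(request_id)s %(message)s"))
handler.addFilter(RequestIdFilter())
logging.basicConfig(handlers=[handler], level=logging.INFO)

def handle(headers: dict) -> None:
    # "X-Request-ID" is an assumed header name; use whatever your gateway sends.
    request_id.set(headers.get("X-Request-ID", "-"))
    logging.info("starting ride")  # automatically tagged with the request ID
```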
Logs are fine. The job of local logs is to record the activity of a local process, and they do that fine. Local logs were never meant to give you a picture of what's going on on some other server. For that context, you need transaction tracing that can stitch the story together across all processes involved.
Usually, looking at the logs in the right place should lead you to the root cause.
When properly seasoned with context, logs give you useful information like who is impacted (not every incident impacts every customer the same way), how component performance changes when inputs change, and so forth. When connected to analytical engines, well-formed logs can help you figure out things like behaviors that lead to abandonment, the impact of malicious behavior, and much more.
Once the logs have entered the ingestion endpoint, they can take the most optimal path for their use case. Metrics can be extracted and sent off to a time-series metric database, while logs can be multiplexed to different destinations, including stored raw in cheap archival storage, or matched to schemas, indexed, stored in purpose-built search engines like OpenSearch, and stored "cooked" in Apache Iceberg+Parquet tables for rapid querying with Spark or other analytical engines.
Have you ever taken, say, VPC flow logs, saved them in Parquet format, and queried them with DuckDB? It's mind blowing.
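For the curious, a rough illustration of that workflow using DuckDB's Python API; the column names (`srcaddr`, `dstaddr`, `bytes`) are assumptions about how the flow logs were exported to Parquet.

```python
import duckdb

con = duckdb.connect()
# Column names below are assumptions about the Parquet export schema.
top_talkers = con.execute("""
    SELECT srcaddr, dstaddr, SUM(bytes) AS total_bytes
    FROM read_parquet('flow_logs/*.parquet')
    GROUP BY srcaddr, dstaddr
    ORDER BY total_bytes DESC
    LIMIT 10
""").fetchall()
for row in top_talkers:
    print(row)
```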
Not if I have anything to say about it.
>Your logs are still acting like it's 2005.
Yeah, because that's just before software development went absolutely insane.
Taking slow requests as an example, a dependency gets slower and now your log volume suddenly goes up 100x. Can your service handle that? Are you causing a cascading outage due to increased log volumes?
Recovery is easier if your service is doing the same or less work in a degraded state. Increasing logging by 20-100x when degraded is not that.
Besides, a little local on-disk buffering goes a long way, and is cheap to boot. It’s an antipattern to flush logs directly over the network.
Auditing has the requirement that records are mostly never lost and, most importantly, cannot be deleted by people on the host. And for the capacity side, again the design question is: "what happens when incoming events exceed our current capacity? Do all the collectors/relays balloon their memory and become much, much slower, effectively unresponsive, or do they immediately close the incoming sockets, lower downstream timeouts, and so on?" Hopefully, the audit traffic is consistent enough that you don't get spikes and can over-provision capacity with confidence.
Why does that make any difference? Keep in mind that at large enough organizations, even though the company is the same, there will often be an internal observability service team (frequently, but not always, as part of a larger platform team). At a highly-functioning org, this team is run very much like an external service provider.
> I've yet to see a Logging as a Service provider not have outages where data was lost or severely delayed.
You should take a look at CloudWatch Logs. I'm unaware of any time it has successfully ingested logs and then dropped them. (Disclaimer: I work for AWS.) Also, I didn't say anything about delays, which we often accept as a tradeoff for durability.
> And for the capacity side, again the design question is "what happens when incoming events exceed our current capacity - all the collectors/relays balloon their memory and become much much slower, effectively unresponsive, or immediately close the incoming sockets, lower downstream timeouts, and so on."
This is why buffering outgoing logs in memory is an anti-pattern, as I noted elsewhere in this discussion. There should always -- always -- be some sort of non-volatile storage buffer in between a sender and remote receiver. Disk is cheap. Use it.
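A toy sketch of that shape in Python: the hot path appends to a local file and returns, while a separate shipper drains the file toward the remote receiver. The path and the `send` callable are placeholders, and a real shipper would persist its offset and handle rotation.

```python
import json
import os
import time

BUFFER = "/var/spool/app/logs.jsonl"  # assumed local non-volatile buffer path

def write_event(event: dict) -> None:
    # Fast local append; the hot path never blocks on the network.
    with open(BUFFER, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
        f.flush()
        os.fsync(f.fileno())

def ship_forever(send) -> None:
    """Drain the buffer toward the remote receiver; `send` is a placeholder
    for whatever ingestion call you use (with its own retries/backoff)."""
    offset = 0
    while True:
        with open(BUFFER, "r", encoding="utf-8") as f:
            f.seek(offset)
            for line in f:
                send(line)
            offset = f.tell()
        time.sleep(1.0)
```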
Are you actually contemplating handling 10 million requests per second per core that are failing?
All of those other costs are, again, trivial with proper design. You can easily handle billions of events per second on the backend with even a modest server. Again, what are we even deploying that has billions of events per second?
It is, in fact, a self-fulfilling prophecy to complain that logging can be a bottleneck if you then choose logging that is 100-1000x slower than it should be. What a concept.
A catastrophic increase in logging could certainly take down your log processing pipeline but it should not create cascading failures that compromise your service.
I just wanted to make sure we weren’t still talking about “causing a cascading outage due to increased log volumes” as was mentioned above, which would indicate a significant architectural issue.
The framing is not, though. Why does it have to sound so dramatic and provocative? It’s insulting to its audience.
It’s no wonder the author had to put it up on his own website instead of on CloudFlare’s (his employer’s) blog. It wouldn’t have met their editorial standards for professionalism.
Sure, I've dealt with plenty of assholes, too, but the grumps are usually just tired of their valid insight being ignored by more foolish, orthogonally incentivized types (read: "playing the game" not "making it work well").
Assholes can sap an organization's strength faster than any productive value their intelligence can provide. I'm not suggesting the author is an asshole, though; there's not enough evidence from this post.
"Bug parts" (as in "acceptable number of bug parts per candy bar") logging should include the precursors of processing metrics. I think what he calls "wide events" I call bug parts logging in order to emphasize that it also may include signals pertaining to which code paths were taken, how many times, and how long it took.
Logging is not metrics is not auditing. In particular processing can continue if logging (temporarily) fails but not if auditing has failed. I prefer the terminology "observables" to "logging" and "evaluatives" to "metrics".
In mature SCADA systems there is the well-worn notion of a "historian". Read up on it.
A fluid level sensor on CANbus sending events 10x a second isn't telling me whether or not I have enough fuel to get to my destination (a significant question); however, that granularity might be helpful for diagnosing a stuck sensor (or bad connection). It would be impossibly fatiguing and hopelessly distracting to try to answer the significant question from this firehose of low-information events. Even a de-noised fuel gauge doesn't directly diagnose my desired evaluative (will I get there or not?).
Does my fuel gauge need to also serve as the debugging interface for the sensor? No, it does not. Likewise, send metrics / evaluatives to the cloud not logging / observables; when something goes sideways the real work is getting off your ass and taking a look. Take the time to think about what that looks like: maybe that's the best takeaway.
I espouse a "grand theory of observability" that, like matter and energy, treats logs, metrics, and audits alike. At the end of the day, they're streams of bits, and so long as no fidelity is lost, they can be converted between each other. Audit trails are certainly carried over logs. Metrics are streams of time-series numeric data; they can be carried over log channels or embedded inside logs (as they often are).
How these signals are stored, transformed, queried, and presented may differ, but at the end of the day, the consumption endpoint and mechanism can be the same regardless of origin. And doing so simplifies both the conceptual framework and design of the processing system, and makes it flexible enough to suit any conceivable set of use cases.
If you have insufficient ingestion rate:
Logs are for events that can be independently sampled and be coherent. You can drop arbitrary logs to stay within ingestion rate.
Traces are for correlated sequences of events where the entire sequence needs to be sampled to be coherent. You can drop arbitrary whole sequences to stay within ingestion rate.
Metrics are pre-aggregated collections of events. You pre-limited your emission rate to fit your ingestion rate at the cost of upfront loss of fidelity.
If you have adequate ingestion rate, then you just emit your events bare and post-process/visualize your events however you want.
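A toy illustration of the distinction drawn above, assuming simple probabilistic sampling: individual log events can be dropped independently, while a trace must be kept or dropped as a whole, so the decision has to be a deterministic function of the trace ID.

```python
import random

LOG_SAMPLE_RATE = 0.10    # keep ~10% of individual log events
TRACE_SAMPLE_RATE = 0.10  # keep ~10% of whole traces

def keep_log(event: dict) -> bool:
    # Logs are independently coherent: any single event may be dropped.
    return random.random() < LOG_SAMPLE_RATE

def keep_trace(trace_id: int) -> bool:
    # Every span in a trace must agree, so sample on the trace ID itself.
    return (trace_id % 100) < TRACE_SAMPLE_RATE * 100
```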
> You can drop arbitrary logs to stay within ingestion rate.
Another way I've heard this framed in production environments ingesting a firehose: you can drop individual logging events because there will always be more.
I would rather fix this problem than every other problem. If I'm seeing backpressure, I'd prefer to buffer locally on disk until the ingestion system can get caught up. If I need to prioritize signal delivery when I see back pressure, I can do that locally as well by separating streams (i.e. priority queueing). It doesn't change the fundamental nature of the system, though.
In practice, the audit isn’t really a log; it’s something more akin to a database record. The point is that you can’t filter your log stream for audit requirements.
Regulators have never dictated where auditable logs must live. Their requirement is that the records in scope be recorded, that they are accurate (which implies tamper proof) and that they are accessible. Provided those requirements are met, where the records can be found is irrelevant. It thus follows that if all logs over the union of centralized storage and endpoint storage meet the above requirements then it will satisfy the regulator.
That’s true. They specify that logs cannot be lost, must be available for x years, must not be modifiable, must be accessible only in y ways, and cannot cross various boundaries/borders (depending on the government in question). Or bad things will happen to you (your company).
In practice, this means that durability of that audit record “a thing happened” cannot be simply “I persisted it to disk on one machine”. You need to know that the record has been made durable (across whatever your durability mechanism is, for example a DB with HA + DR), before progressing to the next step. Depending on the stringency, RPO needs to be zero for audit, which is why I say it is a special case.
I don’t know anything about linux audit, I doubt it has any relevance to regulatory compliance.
As long as the record can be located when it is sought, it does not matter how many copies there are. The regulator will not ask so long as your system is a reasonable one.
Consider that technologies like RAID did not exist once upon a time, and backup copies were latent and expensive. Yet we still considered the storage to be sufficient to meet the applicable regulations. If a fire then happened and burned the place down, and all the records were lost, the business would not be sanctioned so long as they took reasonable precautions.
Here, I’m not suggesting that “the record is on a single disk, that ought to be enough.” I am assuming that in the ordinary course of business, there is a working path to getting additional redundant copies made, but those additional copies are temporarily delayed due to overload. No reasonable regulator is going to tell you this is unacceptable.
But it isn’t! Because there are many hardware failure modes that mean that you aren’t getting your log back.
For the same reason that you need at least acks=1 in Kafka for durability, or synchronous_commit = remote_flush in PostgreSQL, you need to commit your audit log to more than the local disk!
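A sketch of what "more than the local disk" can mean in code, using kafka-python with `acks="all"`; the topic name and broker address are assumptions. The business step only proceeds once the audit record is confirmed replicated.

```python
import json
from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",   # assumed broker address
    acks="all",                       # wait for all in-sync replicas
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def record_audit(event: dict) -> None:
    # Block until the broker confirms replication; only then may the
    # business step that this record describes go ahead.
    future = producer.send("audit-log", event)  # topic name is illustrative
    future.get(timeout=10)                      # raises if not made durable
```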
Log levels could be considered an anti-pattern.
Perhaps use tags then?
”Verbose”, ”debug” and ”silly” are definitely not, as those describe a different thing altogether, and would probably be better instrumented through something like the npm ”debug” package.
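For illustration, a minimal Python sketch of tag-based filtering in the spirit of npm's `debug` package: which tags get emitted is selected at runtime via an environment variable. All names here are made up.

```python
import json
import os
import time

# Comma-separated runtime selection, like DEBUG=... for npm's debug package.
ENABLED = set(os.environ.get("LOG_TAGS", "payments,auth").split(","))

def log(tags: set, message: str, **fields) -> None:
    if not tags & ENABLED:
        return
    print(json.dumps({"ts": time.time(), "tags": sorted(tags), "msg": message, **fields}))

log({"payments", "retry"}, "card authorization retried", attempt=2)
log({"iot"}, "lock handshake started")  # suppressed unless 'iot' is enabled
```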
The best way to equip logs to tell the truth is to have other parts of the system consume them as their source of truth.
Firstly: "what the system does" and "what the logs say" can't be two different things.
Secondly: developers can't put less info into the logs than they should, because their feature simply won't work without it.
Make actions, not assumptions. Instead of using a one-machine storage system, distribute that storage across many machines. Then stop deleting them.
> Dropping a log message here or there is not a fatal error.
I would try to reallocate my effort budget to things that actually need to work.
Drop logging completely, and come back to it once you have a flawless record of everything the system did. Then reconsider whether you need it.
Yes, the system shall not report that "User null was created" if it was actually "User 123 that was created".
String? Not a chance, make a proper type-safe struct. UserCreated { "id": 123}
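In Python terms, that suggestion might look like a frozen dataclass (a sketch; the event name and field come from the comment above):

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class UserCreated:
    id: int  # a type checker (e.g. mypy) flags None or a missing id

def emit(event: UserCreated) -> None:
    print(json.dumps({"event": type(event).__name__, **asdict(event)}))

emit(UserCreated(id=123))  # -> {"event": "UserCreated", "id": 123}
```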
> I don't want to have to think if i change a debug string am i going to break something.
Good point, you should probably have a unit test somewhere.
Seeing logging as debugging is flawed imo. A log is technically just a record of what happened in your database.
A common problem in log aggregation is the question of whether you query for user.id, user_id, userID, buyer.user.id, buyer.id, buyer_user_id, buyer_id, ... Every log aggregation ends up being plagued by this. You need standard field names, or it becomes a horrible mess.
And for centralized aggregation, I like ECS's idea of "related". If you have a buyer and a seller, both with user IDs, you'd have a `related.user.id` with both IDs in there. This makes it very simple to say "hey, give me everything related to request X" or "give me everything involving user Y in this time frame" (as long as this is kept up to date, naturally).
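A small sketch of building such an event with one canonical name per concept plus an ECS-style `related` rollup (field names follow the comment; the function itself is hypothetical):

```python
def build_event(buyer_id: str, seller_id: str, request_id: str) -> dict:
    # One canonical name per concept; never user_id/userID/uid side by side.
    return {
        "request": {"id": request_id},
        "buyer": {"user": {"id": buyer_id}},
        "seller": {"user": {"id": seller_id}},
        # ECS-style rollup: one field answers "everything involving user Y".
        "related": {"user": {"id": [buyer_id, seller_id]}},
    }
```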
Too bad that all of this effort is spent arguing something which can be summarised as "add structured tags to your logs"
Generally speaking my biggest gripe with wide logs (and other "innovative" solutions to logging) is that whatever perceived benefit you argue for doesn't justify the increased complexity and loss of readability.
We're throwing away `grep "uid=user-123" application.log` to get what? The shipping method of the user attached to every log? Doesn't feel like an improvement to me...
P.S. The checkboxes in the wide event builder don't work for me (Brave on Android).
As long as it's actual json, it doesn't matter if it's pretty-printed or not, since `jq` can fold and unfold it at will.
I frequently fold logs into single lines, grep for something, then unfold them again
A few things I've been thinking about recently:
- we have authentication everywhere in our stack, so I've started including the user ID on every log line. This makes getting a holistic view of what a user experienced much easier (a small sketch of one way to do this follows this list)
- logging an error as a separate log line to the request log is a pain. You can filter for the trace, but it makes it hard to surface "show me all the logs for 5xx requests and the error associated" - it's doable, but it's more difficult than filtering on the status code of the request log
- it's not enough to just start including that context, you have to educate your coworkers that it's now present. I've seen people making life hard for themselves because they didn't realize we'd added this context
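On the first point, a minimal sketch of binding the authenticated user once so every subsequent line carries it, using the standard library's `LoggerAdapter` (all names are illustrative):

```python
import logging

logging.basicConfig(format="%(user_id)s %(message)s", level=logging.INFO)

def logger_for(user_id: str) -> logging.LoggerAdapter:
    # Bind the authenticated user once; every line through this adapter
    # then carries the user ID without repeating it at each call site.
    return logging.LoggerAdapter(logging.getLogger("app"), {"user_id": user_id})

log = logger_for("user-123")    # after authentication succeeds
log.info("checkout started")    # -> user-123 checkout started
log.info("payment authorized")  # -> user-123 payment authorized
```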
There are two dimensions to it: UX and security.
Displaying excessive technical information on an end-user interface will complicate support and likely reveal too much about the internal system design, making it vulnerable to external attacks.
The latter is particularly concerning for any design facing the public internet. A frequently recommended approach is exception shielding. It involves logging two messages upon encountering a problem: a nondescript user-facing message (potentially including a reference ID pinpointing the problem in space and time) and a detailed internal message with the problem’s details and context for L3 support / engineering.
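A sketch of exception shielding as described: a shared reference ID ties the nondescript user-facing message to the detailed internal record. The handler and the failure below are stand-ins.

```python
import logging
import uuid

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("api")

def do_work(request: dict) -> dict:
    raise RuntimeError("db connection refused")  # stand-in failure

def handle(request: dict) -> dict:
    try:
        return do_work(request)
    except Exception:
        ref = uuid.uuid4().hex[:12]  # reference ID shared by both messages
        # Detailed internal record for L3 support / engineering:
        log.exception("request failed ref=%s path=%s", ref, request.get("path"))
        # Nondescript user-facing message; internals stay inside:
        return {"status": 500, "error": f"Something went wrong. Reference: {ref}"}
```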
Unfortunately, Apple has taken out the «bandwidth» sampler from «powermetrics», and it is no longer possible to measure the memory bandwidth as easily.
Just ignore them or provide appeasement insofar that it doesn’t mess with your ability to maintain the system.
It’s my long standing wish to be able to link traces/errors automatically to callers when they call the helpdesk. We have all the required information. It’s just that the helpdesk has actually very little use for this level of detail. So they can only attach it to the ticket so that actual application teams don’t have to search for it.
It takes some time and it's a pain in the ass initially, but once I've matured them, work becomes so much easier. It reduces dependence on other people / teams / access as well.
It won’t be long before ad computem comments like this become unacceptable.
Not true. It's likely an effort issue in that situation.
And that kind of effort issue is good to call out, because it compounds the low quality.
However I don't think you should outsource understanding to LLMs, and also think that shifting the effort from the writer to the reader is a poor strategy (and disrespectful to the reader)
Depends on the service, but tracking everything a user does may not be an option in terms of data retention laws
I worked with enterprise message bus loggers in a semiconductor manufacturing context, wherein we had thousands of participants on the message bus. It generated something like 300-400 megabytes per hour. Despite the insane volume, we made this work really well using just grep and other basic CLI tools.
The logs were mere time series of events. Figuring out the detail about specific events (e.g. a list of all the tools a lot visited) required writing queries into the Oracle monster. You could derive history from the event logs if you had enough patience and disk space, but that would have been very silly given the alternative option. We used them predominantly to establish a causal chain between events when the details were still preliminary. Identifying suspects and such. Actually resolving really complicated business usually requires more than a perfectly detailed log file.
400MB of logs an hour is nothing at all, that's why a naive grep can work. You don't even need to rotate your log files frequently in this situation.
And loosely related, I also dislike log interfaces like the ELK stack. They make following a trail of events really hard. Most of the time you do not know what you are looking for, just a vague understanding of why you are looking at the logs. A line logged 3 microseconds earlier may be your eureka moment, one that no search could identify; only intuition and diligently following the logs can.
Instead of sprinkling logs all across the codebase, collect information in a request context and then emit one log when the request ends with all the relevant information.
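One possible shape for that pattern in Python, as a context manager that accumulates fields and prints a single JSON line when the request ends (all field names are illustrative):

```python
import json
import time
from contextlib import contextmanager

@contextmanager
def request_log(**initial):
    """Collect context during the request; emit exactly one line at the end."""
    ctx = dict(initial)
    start = time.monotonic()
    try:
        yield ctx
    finally:
        ctx["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
        print(json.dumps(ctx))

with request_log(endpoint="/rides", method="POST") as ctx:
    ctx["user.id"] = "user-123"  # sprinkle context, not log lines
    ctx["ride.id"] = "ride-456"
    ctx["outcome"] = "started"
```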
I’m so sick of working with other people at this point.
But does it? Or is it bad logging, or excessive logging, or unsearchable logs?
A client of mine uses SnapLogic, which is a middleware / ETL tool that's supposed to run pipelines in batch mode to pass data around between systems. It generates an enormous amount of logs that are so difficult to access, search and read that they might as well not exist.
We're replacing all of that with simple Python scripts that do the same thing and generate normal simple logs with simple errors when something's truly wrong or the data is in the wrong format.
Terse logging is what you want, not an exhaustive (and exhausting) torrent of irrelevant information.
There isn't anything radical about his proposed solutions either. Tail logging is kind of nonsense; most log storage services can be set with a rule where, say, all warning logs and above are retained, but only a sample of info and debug logs.
Nothing in this article is something that most major software companies don't already do.
Point #1 isn't true, auto instrumentation exists and is really good. When I integrate OTel I add my own auto instrumentors wherever possible to automatically add lots of context. Which gets into point #2.
Point #2 also isn't true. It can add business context in a hierarchical manner and ship wide events. You shouldn't have to tell every span all the information again, just where it appears naturally the first time.
Point #3 also also isn't true because OTel libs make it really annoying to just write a log message and very strongly pushes you into a hierarchy of nested context managers.
Like the author's ideal setup is basically using OTel with Honeycomb. You get the querying and everything. And unlike rawdogging wide events all your traces are connected, can span multiple services and do timing for you.
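For reference, attaching business context where it first appears might look like this with the opentelemetry-api package (the span and attribute names are illustrative; without a configured SDK the calls are no-ops):

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout")

def start_ride(user_id: str, bike_id: str) -> None:
    # Auto-instrumentation covers the HTTP/DB spans; we only add business
    # context once, where it naturally appears, and it rides along the trace.
    with tracer.start_as_current_span("start_ride") as span:
        span.set_attribute("user.id", user_id)
        span.set_attribute("bike.id", bike_id)
        # ... unlock the bike, charge the deposit, notify downstream ...
```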
Also if you're going to log wide events, for the sake of the person querying them after you, please don't let your schema be an ad hoc JSON dict of dicts, put some thought into the schema structure (and better have a logging system that enforces the schema).
I do think "logs are broken" is a bit overstated. The real problem is unstructured events + weak conventions + poor correlation.
Brilliant write-up regardless.
64 more comments available on Hacker News