Ask HN: As a developer, am I wrong to think monitoring alerts are mostly noise?
Gradually, with each false positive (or negative), you learn to tweak your alerts and update dashboards to reduce the noise as much as possible.
Also look for a monitoring solution with quorum (multiple probe locations that have to agree), so you don't get bothered by false positives caused by a peering issue between your monitoring location and your app, which you have no control over.
To a first approximation, monitoring tools are built for teams, projects running at scale, and for systems where falling over matters at scale. And monitoring as a "best practice" is good engineering practice only in those contexts.
You don't have that context, and you should probably resist the temptation to boilerplate it in and to count it as moving the project forward. Monitoring doesn't get you customers/users, doesn't solve customer/user problems, and nobody cares whether you monitor or not (except assholes who want to tell you you are morally failing unless you monitor).
Good engineering is doing what makes sense only in terms of the actual problems at hand. Good luck.
A single 80% CPU spike isn't anything to worry about by itself... but if it is prolonged, frequent, and accompanied by a significant impact on p95/p99 latency and response times, it could be a critical warning that you need to either mitigate an issue or upgrade soon.
I would be more inclined to set limits on response latency, or on other metrics that actually impact users, decide what is tolerable, and use those as the critical alert levels. The rest you can put in reports over, say, hourly or half-hourly windows: where the performance hits are, what the top latency values were in addition to p95/p99, etc.
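To make that concrete, here's a minimal sketch of the idea, assuming you can feed it per-request latencies; the window size and the 800ms limit are placeholders, not recommendations:

    # Sketch: alert on sustained p99 latency instead of raw CPU.
    # Window size and thresholds are illustrative only.
    from collections import deque
    from statistics import quantiles

    WINDOW = 600             # last 600 requests (~10 min at 1 req/s)
    P99_CRITICAL_MS = 800    # hypothetical "users are hurting" limit

    latencies = deque(maxlen=WINDOW)

    def record(latency_ms: float) -> None:
        latencies.append(latency_ms)

    def check() -> str | None:
        if len(latencies) < WINDOW:
            return None                     # not enough data yet
        cuts = quantiles(latencies, n=100)  # 99 percentile cut points
        p95, p99 = cuts[94], cuts[98]
        if p99 > P99_CRITICAL_MS:
            return f"CRITICAL: p99={p99:.0f}ms p95={p95:.0f}ms over last {WINDOW} requests"
        return None                         # fine; leave it for the hourly report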
YAGNI
There isn't a simple way, but having some tooling to go from alert -> relevant dashboards -> remediation steps can help cut down on the process. It takes a lot of time investment to make these things work in a way that saves time rather than adding more time spent solving issues. FWIW I think developers need to be deeply involved in this process and basically own it.

Static thresholds would usually just be a warning to look at later; you want more service level indicators. For example, if you have a streaming system you probably want to know if one of your consumers is stuck or behind by a certain amount, and also if there is any measurable data loss. If you have automated pushes, you probably want alerting for a push that is X amount of time stale. For RPC-type systems you would want some recurrent health checks that might warn on CPU etc., but put higher-severity alerting on whether responses are correct and as expected, or not happening at all.
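As a rough sketch of two of those SLIs (the thresholds and the way you obtain offsets and timestamps are assumptions about your system):

    # Sketch of two SLI-style alerts: consumer lag and push staleness.
    import time

    LAG_CRITICAL = 100_000          # messages behind before paging
    STALE_PUSH_SECONDS = 6 * 3600   # a push older than this is suspicious

    def consumer_lag_alert(latest_offset: int, committed_offset: int) -> str | None:
        lag = latest_offset - committed_offset
        if lag > LAG_CRITICAL:
            return f"PAGE: consumer is {lag} messages behind"
        return None

    def stale_push_alert(last_push_epoch: float) -> str | None:
        age = time.time() - last_push_epoch
        if age > STALE_PUSH_SECONDS:
            return f"PAGE: last automated push was {age / 3600:.1f}h ago"
        return None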
As a solo dev it might be easier just to do the troubleshooting process every time, but as a team grows it becomes a huge time sink and troubleshooting production issues is stressful, so the goal is to make it as easy as possible. Especially if downtime == $$.
I don't have good recommendations for tooling because I have mostly used internal tools, but that has generally been my experience.
If you can take a reasonable amount of time and come up with something better for your system, great; do it. I've worked with a lot of systems where noisy alerts and human filtering were the best we could reasonably do, and it was fine. In a system like that, not every alert demands immediate response... a single high CPU page doesn't demand a quick response, and the appropriate response could be 'CPU was high for a short time, I don't need to look at anything else.' Of course, you could be missing an important signal that should have been investigated, too. OTOH, if you get high CPU alerts from many hosts at once, something is up --- but it could just be an external event that causes high usage, and hopefully your system survives those autonomously anyway.
Ideally, monitoring, operations and development feed into each other, so that the system evolves to work best with the human needs it has.
My biggest advice is to leverage alerting levels, and to only send high priority alerts for things visible to users.
For alert levels, I usually have 3. P1 (the highest level) is the only one that will fire a phone call/alarm 24/7/365, and only alerts if some kind of very user-visible issue happens (increase in error rate, unacceptable latency, etc). P2 is a mid-tier and only expected to get a response during business hours. That's where I send things that are maybe an issue or can wait, like storage filling up (but not critically so). P3 alerts get sent to a Slack channel, and exist mostly so if you get a P1 alert you can get a quick view of "things that are odd" like CPU spiking.
For monitoring, I try to only page on user-visible issues. E.g. I don't routinely monitor CPU usage, because it doesn't correlate to user-visible issues very well. Lots of things can cause CPU to spike, and if it's not impacting users then I don't care. Ditto for network usage, disk IO, etc. Presuming your service handles network calls, the two things you really care about are success rate and latency. A drop in success rate should trigger a P1 page, and an increase in latency should trigger a P2 alert if it's higher than you'd like but okay, and a P1 alert at the "this is impacting users" point. You may want to split those out by endpoint as well, because your acceptable latency probably differs by endpoint.
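Roughly like this; all the numbers are placeholders and would really be per-endpoint values tied to what your users actually tolerate:

    # Rough version of the tiering above; thresholds are illustrative only.
    def classify(success_rate: float, p99_latency_ms: float) -> str | None:
        if success_rate < 0.99:
            return "P1"           # user-visible errors: wake someone up
        if p99_latency_ms > 2000:
            return "P1"           # latency bad enough that users notice
        if p99_latency_ms > 750:
            return "P2"           # worse than we'd like; business hours
        return None               # nothing worth alerting on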
If your service can't scale, you might also want to adjust those alerts by traffic levels (i.e. if you know you can't handle 10k QPS and you can't scale past 10k QPS, there's no point in paging someone).
You can also add some automation, especially if the apps are stateless. If api-server-5 is behaving weirdly, kill it and spin up a new api-server-5 (or reboot it if physical). A lot of the common first line of defense options are pretty automatable, and can save you from getting paged if an automated restart will fix it. You probably do want some monitoring and rate limiting over that as well, though. E.g. a P2 alert that api-server-5 has been rebooted 4 times today, because repeated reboots are probably an indication of an underlying issue even if reboots temporarily resolve it.
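A rough sketch of that restart-with-escalation idea; restart_instance and send_alert stand in for whatever your platform API and paging tool actually provide:

    # Sketch: restart a misbehaving instance, but escalate to a P2 alert
    # if it keeps happening within 24 hours.
    import time
    from collections import defaultdict

    REBOOTS_BEFORE_P2 = 4
    reboot_log: dict[str, list[float]] = defaultdict(list)

    def restart_instance(instance: str) -> None:
        print(f"restarting {instance}")              # placeholder

    def send_alert(priority: str, message: str) -> None:
        print(f"[{priority}] {message}")             # placeholder

    def handle_unhealthy(instance: str) -> None:
        now = time.time()
        recent = [t for t in reboot_log[instance] if now - t < 86400]
        recent.append(now)
        reboot_log[instance] = recent
        restart_instance(instance)                   # first line of defense
        if len(recent) >= REBOOTS_BEFORE_P2:
            send_alert("P2", f"{instance} restarted {len(recent)} times in 24h")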
The first is that there are simpler ways that are faster and easier to implement. Just develop a strategy for identifying whether pages are actionable. It depends on your software, but most should support tagging or comments. Make a standard for tagging them as "actioned on" or "not actionable", and write a basic script that iterates over the alerts you've gotten in the past 30 or 90 days and shows the number of times each alert fired and what percentage of the time it was tagged as not actionable. Set up a meeting to run that report once a week or month, and either remove or reconfigure alerts that are frequently tagged as not actionable.
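A bare-bones version of that report might look like this, assuming you can export alerts to a CSV with "alert_name" and "tag" columns (the file layout is made up):

    # Count how often each alert fired and how often it was unactionable.
    import csv
    from collections import Counter

    fired = Counter()
    unactionable = Counter()

    with open("alerts_last_90_days.csv", newline="") as f:
        for row in csv.DictReader(f):
            fired[row["alert_name"]] += 1
            if row["tag"] == "not_actionable":
                unactionable[row["alert_name"]] += 1

    for name, count in fired.most_common():
        pct = 100 * unactionable[name] / count
        print(f"{name}: fired {count}x, {pct:.0f}% not actionable")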
The second is that I don't think AI is great at that kind of number crunching. I'm sure you could get it to work, but if it's not your primary product then that time is sort of wasted. Paying for the tokens is one thing, but messing with RAG for the 85th time trying to get the AI to do the right thing is basically wasted time.
The last is that I don't like per-alert costs, because they create an environment ripe for cost-cutting by making alerting worse. If people have in the back of their head that it costs $0.05 every time an alert fires, the mental bar for "worth creating a low-priority alert" goes up. You don't want that friction to setting up alerts. You may not care about the cost now, but I'd put down money that it becomes a thing at some point. Alerting tends to scale superlinearly with the popularity of the product: you add tiers to the architecture and need more alerts for more integration points, your SLOs tighten so the alerts have to be more finicky, and suddenly you're spending $2,000 a month just on alert routing.
Ideally alerts should only be generated when ($severity_of_potential_bad_state * $probability_of_that_state) is high. In other words, for marginally bad states, you want a high confidence before alerting. For states that are really mega bad, it may be OK to loosen that and alert when you are less confident that it is actually occurring.
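In code, that rule of thumb is just the following; the severity scale and threshold are arbitrary:

    def should_alert(severity: float, probability: float, threshold: float = 5.0) -> bool:
        # severity: 1 (annoying) .. 10 (total outage); probability: 0..1
        return severity * probability >= threshold

    should_alert(9, 0.6)   # True: mega bad, even at modest confidence
    should_alert(3, 0.9)   # False: marginally bad, despite high confidence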
IME CPU% alerts are typically totally spurious in a modern cloud application. In general, to get the most out of your spend, you actually want your instances working close to their limits because the intent is to scale out when your application gets busy. Therefore, you instead want to monitor things that are as close to user experience or business metric as possible. P99 request latency, 5xx rate, etc. are OK, but ideally you go even further into application-specific metrics. For example, Facebook might ask: What's the latency between uploading a cat picture and getting its first like?
It requires building a risk assessment model and sensors/thresholds/alerts around it. That is quite a bit of work, and it is very specific to each case.
The problem you’re starting to run into is that you’re seeing the monitors as useless, which will ultimately lead to ignoring them, so when there is a problem you won’t know it.
What you should be doing is tuning the monitors to make them useful. If your app will see occasional spikes that last 10 minutes, and the monitor checks every 5 minutes, set it to only create an alert after 3 consecutive failures. That creates some tolerance for spikes, but will still alert you if there is a prolonged issue that needs to be addressed due to the inevitable performance issues it will cause.
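If your monitoring tool doesn't support that natively, the "3 consecutive failures" rule is only a few lines; this sketch assumes a check that runs every 5 minutes:

    CONSECUTIVE_FAILURES_TO_ALERT = 3
    failure_streak = 0

    def on_check_result(healthy: bool) -> str | None:
        global failure_streak
        if healthy:
            failure_streak = 0
            return None
        failure_streak += 1
        if failure_streak == CONSECUTIVE_FAILURES_TO_ALERT:
            return "ALERT: check has failed 3 times in a row (~15 minutes)"
        return None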
If there are other alerts that happen often that need action taken, which is repeatable, that’s where EDA (Event Driven Automation) would come in. Write some code to fix what needs to be fixed, and when the alert comes in the code automatically runs to fix it. You then only need to handle it when the EDA code can’t fix the issue. Fix it once in code instead of every time you get an alert.
We had a lot of alerts where restarting a service would fix it, so we had EDA do that. That effectively freed up 3 resources to do other things just for a single application we monitored.
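If you don't have an EDA product, the core of it can start as a small dispatcher like this sketch; the alert shape and the remediation list are placeholders:

    # Map alert types to remediation functions; escalate to a human when
    # there is no automation for the alert or the automation fails.
    import subprocess

    def restart_service(alert: dict) -> bool:
        result = subprocess.run(["systemctl", "restart", alert["service"]])
        return result.returncode == 0

    REMEDIATIONS = {
        "service_down": restart_service,
        # "disk_full": clean_safe_directories, ...
    }

    def handle_alert(alert: dict) -> None:
        fix = REMEDIATIONS.get(alert["type"])
        if fix and fix(alert):
            print(f"auto-remediated {alert['type']} ({alert.get('service')})")
        else:
            print(f"escalating {alert['type']} to a human")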
We have some EDA for disk cleanup, to delete files in some "safe" directories common to the OS, not applications. More often than not, the disk space issues are due to the application team and they really need to clean things up or make their own cleanup job. If you're the application owner you can be much more targeted, where I had to write something that would work for hundreds of different app teams. But of course if you own the app you can fix the excessive logging issues at the source, which is even better. Some vendor apps left a lot of junk out there. We'd clean up BladeLogic temp files (back when we used that) and of course temp directories that people never bothered to clean up.
Another thing we've used it for was to enrich the data in the alert ticket. If the first thing you do when getting a certain alert is to check various logs or run certain commands to get more information, have the EDA do that and put that data right in the ticket so you already have it. One simple example we had was for ping alerts. In many cases a ping alert would clear on its own, so we added some EDA to check the uptime on the server and put that information into the ticket. This way the ops person could quickly see if the server had rebooted. If that reboot was unexpected, the app team should be made aware and verify their app. Without that, a clear alert would be assumed to be some network latency and dismissed as "noise".
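A sketch of that enrichment step, assuming SSH access to the host and some ticket object you can attach notes to (both are placeholders):

    import subprocess

    def enrich_ping_alert(ticket: dict, host: str) -> dict:
        try:
            boot_time = subprocess.run(
                ["ssh", host, "uptime", "-s"],   # prints boot time, e.g. "2024-05-01 03:12:44"
                capture_output=True, text=True, timeout=10, check=True,
            ).stdout.strip()
            ticket["notes"] = f"host reports boot time: {boot_time}"
        except Exception as exc:                 # host may still be unreachable
            ticket["notes"] = f"could not fetch uptime: {exc}"
        return ticket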
Depending on how quickly an EDA band-aid can roll out vs. the real fix, EDA can also buy you time while you implement that fix, so you're not bogged down with operational work. This is especially true if the real fix for a problem will require massive changes that could take months to actually implement.
For a while we had a lot of issues with BTRFS filesystems, and we ended up making some EDA to run the btrfs balance when we started getting in the danger zone, to avoid the server locking up. This was a way to keep things under control and take the pressure off of the ops team while we migrated away from btrfs, which was a multi-year undertaking.
Reporting on your alert tickets should highlight any opportunities you might have, if they exist. If you have an ops team, ask them too, they'll know. But of course, if they can be fixed in the code, or the monitor just needs to be tuned, that's even better. EDA should be a last resort (other than the ticket enrichment use case).
The danger of EDA is that it can push problems to the back burner. If there is a chronic issue and EDA is taking care of it, the root cause is never resolved, because it's no longer the squeaky wheel. If you go down the EDA route, it is a good idea to report on how often it runs, and review it regularly to drive improvements in the app. High numbers of EDA-resolved alerts shouldn't be the goal. Ideally, those EDA metrics should also be driven down as the apps improve, just like you'd want to see from alerts being handled by humans. At the end of the day, they are still undesirable events that reduce the stability of your environment.
There are vendor solutions out there. Ansible now offers EDA as part of Ansible Automation Platform, though I haven’t been hands-on with it yet. That still requires writing Ansible playbooks, not to mention the overhead of AAP.
I don’t remember the name, but I sat in on a demo of an AI powered EDA platform probably 6 years ago (before the LLM craze). Their promise was that it would automatically figure out what to do and do it, and over time it would handle more and more incidents. It sounded a little terrifying. I could see it turning into a chaos monkey, but who knows.
Either way, there are some mature tools out there. What would work best depends on what you need to integrate with, cost, support, and how much code you are or aren’t willing to write.
If you are distracted by a high CPU alert that turns out to just be an expected spike, the alert needs to filter out spikes and only report persistent high CPU situations.
Think of how the body and brain report pain. Sudden new pain is more noticeable than a chronic background level of pain. Maybe CPU alarms should be that: CPU activity which is (a) X percent above the 24 hour average, (b) persistently for a certain duration, is alarm worthy.
So 100% CPU wouldn't be alarm-worthy for a node that is always at 100% CPU, as an expected condition: a very busy system constantly loaded down. But 45% CPU for 30 minutes could be alarm-worthy on a machine that averages 15%.
Kind of thing.
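A sketch of that "new pain vs. chronic pain" rule; all the constants are made up:

    from collections import deque

    BASELINE_SAMPLES = 24 * 60   # ~24h of 1-minute samples
    SUSTAIN_SAMPLES = 30         # must stay elevated ~30 minutes
    DELTA_PCT = 25               # how far above baseline counts as new pain

    history = deque(maxlen=BASELINE_SAMPLES)
    elevated_streak = 0

    def on_cpu_sample(cpu_pct: float) -> str | None:
        global elevated_streak
        baseline = sum(history) / len(history) if history else cpu_pct
        history.append(cpu_pct)
        elevated_streak = elevated_streak + 1 if cpu_pct > baseline + DELTA_PCT else 0
        if elevated_streak == SUSTAIN_SAMPLES:
            return f"ALERT: CPU {cpu_pct:.0f}% vs ~{baseline:.0f}% 24h baseline for 30 min"
        return None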
Monitoring like this will let you tune the application you are running.
Also, the picture might change during bursts of traffic, and if you don't have tools like Prometheus, Datadog, etc., you won't be able to tune accordingly.
The thing is that tuning a production setup is a bit of an art, there are many tradeoffs for what you do (typically, cost vs. benefit), so you need to make those decisions yourself.
If an alert is constantly firing and you are satisfied with how the system is running and with the tradeoffs, you should disable it.
For CPU, check CPU IOWait.
For memory, check the memory swap-in rate.
For disk, check latency or queue depth.
For network, check dropped packets.
All you want to check at an infrastructure layer is whether there is a bottleneck and what that bottleneck is. Whether an application is using 10% or 99% of available memory is moot if the application isn't impacted by it. The above metrics are indicators (but not always proof) that a resource is being bottlenecked and needs investigation.
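For reference, the raw counters behind most of those are easy to read on Linux; this sketch just pulls them from /proc. They are cumulative counters, so in practice you sample twice and alert on the delta, and the interface name is an assumption:

    def cpu_iowait_ticks() -> int:
        with open("/proc/stat") as f:
            fields = f.readline().split()   # "cpu user nice system idle iowait ..."
        return int(fields[5])

    def swap_in_pages() -> int:
        with open("/proc/vmstat") as f:
            for line in f:
                if line.startswith("pswpin "):
                    return int(line.split()[1])
        return 0

    def dropped_rx_packets(iface: str = "eth0") -> int:
        with open("/proc/net/dev") as f:
            for line in f:
                name, _, rest = line.partition(":")
                if name.strip() == iface:
                    return int(rest.split()[3])   # receive "drop" column
        return 0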
Monitor further up the application stack: check error-code rates over time, implement tracing for core user journeys to the extent that you can, and ignore infrastructure-level monitoring until you have no choice.
I would recommend alerting on reliability. If errors for an endpoint go above whatever threshold you judge appropriate, e.g. 1% or 0.1% or 0.01%, for a sustained period, then alarm.
Maybe do the same for latency.
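A sketch of that, approximating "sustained" with a rolling window of recent requests per endpoint; the window size and the 0.1% limit are placeholders:

    from collections import defaultdict, deque

    WINDOW = 1000              # last N requests per endpoint
    ERROR_RATE_LIMIT = 0.001   # 0.1%

    results = defaultdict(lambda: deque(maxlen=WINDOW))

    def record(endpoint: str, ok: bool) -> None:
        results[endpoint].append(ok)

    def check(endpoint: str) -> str | None:
        window = results[endpoint]
        if len(window) < WINDOW:
            return None
        rate = window.count(False) / len(window)
        if rate > ERROR_RATE_LIMIT:
            return f"ALERT: {endpoint} error rate {rate:.2%} over last {WINDOW} requests"
        return None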
For hobby projects though I just point a free tier of one of those down detector things at a few urls. I may make a health check url.
Every false alarm should lead to some decision about how to fix it: e.g. a different alarm, a different threshold, or even just dropping that alarm.