Ask HN: As a developer, am I wrong to think monitoring alerts are mostly noise?
Gradually, with each false positive (or negative), you learn to tweak your alerts and update dashboards to reduce the noise as much as possible.
Also look for a monitoring solution with quorum (multiple probe locations that have to agree), so you don't get bothered by false positives caused by a peering issue between your monitoring location and your app, which you have no control over.
To a first approximation, monitoring tools are built for teams, projects running at scale, and for systems where falling over matters at scale. And monitoring as a "best practice" is good engineering practice only in those contexts.
You don't have that context, and you should probably resist the temptation to boilerplate it in and to count it as moving the project forward. Monitoring doesn't get you customers/users, doesn't solve customer/user problems, and nobody cares whether you monitor or not (except assholes who want to tell you you are morally failing unless you monitor).
Good engineering is doing what makes sense only in terms of the actual problems at hand. Good luck.
A single 80% CPU spike isn't anything to worry about by itself... but if it is prolonged, frequent, and accompanied by a significant impact on p95/p99 latency and response times, it could be a critical warning that you need to either mitigate an issue or upgrade soon.
I would be more inclined to set limits on response latency, or on other metrics that actually impact users, decide what is tolerable, and use those as the critical alert levels. The rest you can put in reports over, say, hourly or half-hourly windows: where the performance hits are, what the top latency values were in addition to p95/p99, etc.
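To make that concrete, here's a minimal sketch of the idea, assuming you can feed it per-request latencies; the window size and the 800ms limit are placeholders, not recommendations:

    # Sketch: alert on sustained p99 latency instead of raw CPU.
    # Window size and thresholds are illustrative only.
    from collections import deque
    from statistics import quantiles

    WINDOW = 600             # last 600 requests (~10 min at 1 req/s)
    P99_CRITICAL_MS = 800    # hypothetical "users are hurting" limit

    latencies = deque(maxlen=WINDOW)

    def record(latency_ms: float) -> None:
        latencies.append(latency_ms)

    def check() -> str | None:
        if len(latencies) < WINDOW:
            return None                     # not enough data yet
        cuts = quantiles(latencies, n=100)  # 99 percentile cut points
        p95, p99 = cuts[94], cuts[98]
        if p99 > P99_CRITICAL_MS:
            return f"CRITICAL: p99={p99:.0f}ms p95={p95:.0f}ms over last {WINDOW} requests"
        return None                         # fine; leave it for the hourly report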
YAGNI
There isn't a simple way, but having some tooling to go from alert -> relevant dashboards -> remediation steps can help cut down on the process. It takes a lot of time investment to make these things work in a way that saves time rather than adding more time spent solving issues. FWIW I think developers need to be deeply involved in this process and basically own it.

Static thresholds would usually just be a warning to look at later; you want more service level indicators. For example, if you have a streaming system you probably want to know if one of your consumers is stuck or behind by a certain amount, and also if there is any measurable data loss. If you have automated pushes, you probably want alerting for a push that is X amount of time stale. For RPC-type systems you would want some recurrent health checks that might warn on CPU etc., but put higher-severity alerting on whether responses are correct and as expected, or not happening at all.
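As a rough sketch of two of those SLIs (the thresholds and the way you obtain offsets and timestamps are assumptions about your system):

    # Sketch of two SLI-style alerts: consumer lag and push staleness.
    import time

    LAG_CRITICAL = 100_000          # messages behind before paging
    STALE_PUSH_SECONDS = 6 * 3600   # a push older than this is suspicious

    def consumer_lag_alert(latest_offset: int, committed_offset: int) -> str | None:
        lag = latest_offset - committed_offset
        if lag > LAG_CRITICAL:
            return f"PAGE: consumer is {lag} messages behind"
        return None

    def stale_push_alert(last_push_epoch: float) -> str | None:
        age = time.time() - last_push_epoch
        if age > STALE_PUSH_SECONDS:
            return f"PAGE: last automated push was {age / 3600:.1f}h ago"
        return None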
As a solo dev it might be easier just to do the troubleshooting process every time, but as a team grows it becomes a huge time sink and troubleshooting production issues is stressful, so the goal is to make it as easy as possible. Especially if downtime == $$.
I don't have good recommendations for tooling because I have mostly used internal tools, but that has generally been my experience.
If you can take a reasonable amount of time and come up with something better for your system, great; do it. I've worked with a lot of systems where noisy alerts and human filtering were the best we could reasonably do, and it was fine. In a system like that, not every alert demands immediate response... a single high CPU page doesn't demand a quick response, and the appropriate response could be 'CPU was high for a short time, I don't need to look at anything else.' Of course, you could be missing an important signal that should have been investigated, too. OTOH, if you get high CPU alerts from many hosts at once, something is up --- but it could just be an external event that causes high usage, and hopefully your system survives those autonomously anyway.
Ideally, monitoring, operations and development feed into each other, so that the system evolves to work best with the human needs it has.
My biggest advice is to leverage alerting levels, and to only send high priority alerts for things visible to users.
For alert levels, I usually have 3. P1 (the highest level) is the only one that will fire a phone call/alarm 24/7/365, and only alerts if some kind of very user-visible issue happens (increase in error rate, unacceptable latency, etc). P2 is a mid-tier and only expected to get a response during business hours. That's where I send things that are maybe an issue or can wait, like storage filling up (but not critically so). P3 alerts get sent to a Slack channel, and exist mostly so if you get a P1 alert you can get a quick view of "things that are odd" like CPU spiking.
For monitoring, I try to only page on user-visible issues. E.g. I don't routinely monitor CPU usage, because it doesn't correlate to user-visible issues very well. Lots of things can cause CPU to spike, and if it's not impacting users then I don't care. Ditto for network usage, disk IO, etc. Presuming your service handles network calls, the two things you really care about are success rate and latency. A drop in success rate should trigger a P1 page, and an increase in latency should trigger a P2 alert if it's higher than you'd like but okay, and a P1 alert at the "this is impacting users" point. You may want to split those out by endpoint as well, because your acceptable latency probably differs by endpoint.
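Roughly like this; all the numbers are placeholders and would really be per-endpoint values tied to what your users actually tolerate:

    # Rough version of the tiering above; thresholds are illustrative only.
    def classify(success_rate: float, p99_latency_ms: float) -> str | None:
        if success_rate < 0.99:
            return "P1"           # user-visible errors: wake someone up
        if p99_latency_ms > 2000:
            return "P1"           # latency bad enough that users notice
        if p99_latency_ms > 750:
            return "P2"           # worse than we'd like; business hours
        return None               # nothing worth alerting on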
If your service can't scale, you might also want to adjust those alerts by traffic levels (i.e. if you know you can't handle 10k QPS and you can't scale past 10k QPS, there's no point in paging someone).
You can also add some automation, especially if the apps are stateless. If api-server-5 is behaving weirdly, kill it and spin up a new api-server-5 (or reboot it if physical). A lot of the common first line of defense options are pretty automatable, and can save you from getting paged if an automated restart will fix it. You probably do want some monitoring and rate limiting over that as well, though. E.g. a P2 alert that api-server-5 has been rebooted 4 times today, because repeated reboots are probably an indication of an underlying issue even if reboots temporarily resolve it.
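A rough sketch of that restart-with-escalation idea; restart_instance and send_alert stand in for whatever your platform API and paging tool actually provide:

    # Sketch: restart a misbehaving instance, but escalate to a P2 alert
    # if it keeps happening within 24 hours.
    import time
    from collections import defaultdict

    REBOOTS_BEFORE_P2 = 4
    reboot_log: dict[str, list[float]] = defaultdict(list)

    def restart_instance(instance: str) -> None:
        print(f"restarting {instance}")              # placeholder

    def send_alert(priority: str, message: str) -> None:
        print(f"[{priority}] {message}")             # placeholder

    def handle_unhealthy(instance: str) -> None:
        now = time.time()
        recent = [t for t in reboot_log[instance] if now - t < 86400]
        recent.append(now)
        reboot_log[instance] = recent
        restart_instance(instance)                   # first line of defense
        if len(recent) >= REBOOTS_BEFORE_P2:
            send_alert("P2", f"{instance} restarted {len(recent)} times in 24h")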
The first is that there are simpler ways that are faster and easier to implement. Just develop a strategy for identifying whether pages are actionable. It depends on your software, but most should support tagging or comments. Make a standard for tagging them as "actioned on" or "not actionable", and write a basic script that iterates over the alerts you've gotten in the past 30 or 90 days and shows the number of times each alert fired and what percentage of the time it was tagged as not actionable. Set up a meeting to run that report once a week or month, and either remove or reconfigure alerts that are frequently tagged as not actionable.
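A bare-bones version of that report might look like this, assuming you can export alerts to a CSV with "alert_name" and "tag" columns (the file layout is made up):

    # Count how often each alert fired and how often it was unactionable.
    import csv
    from collections import Counter

    fired = Counter()
    unactionable = Counter()

    with open("alerts_last_90_days.csv", newline="") as f:
        for row in csv.DictReader(f):
            fired[row["alert_name"]] += 1
            if row["tag"] == "not_actionable":
                unactionable[row["alert_name"]] += 1

    for name, count in fired.most_common():
        pct = 100 * unactionable[name] / count
        print(f"{name}: fired {count}x, {pct:.0f}% not actionable")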
The second is that I don't think AI is great at that kind of number crunching. I'm sure you could get it to work, but if it's not your primary product then that time is sort of wasted. Paying for the tokens is one thing, but messing with RAG for the 85th time trying to get the AI to do the right thing is basically wasted time.
The last is that I don't like per-alert costs, because they create an environment ripe for cost-cutting by making alerting worse. If people have in the back of their head that it costs $0.05 every time an alert fires, the mental bar for "worth creating a low-priority alert" goes up. You don't want that friction to setting up alerts. You may not care about the cost now, but I'd put down money that it becomes a thing at some point. Alerting tends to scale superlinearly with the popularity of the product: you add tiers to the architecture and need more alerts for more integration points, your SLOs tighten so the alerts have to be more finicky, and suddenly you're spending $2,000 a month just on alert routing.
Ideally alerts should only be generated when ($severity_of_potential_bad_state * $probability_of_that_state) is high. In other words, for marginally bad states, you want a high confidence before alerting. For states that are really mega bad, it may be OK to loosen that and alert when you are less confident that it is actually occurring.
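In code, that rule of thumb is just the following; the severity scale and threshold are arbitrary:

    def should_alert(severity: float, probability: float, threshold: float = 5.0) -> bool:
        # severity: 1 (annoying) .. 10 (total outage); probability: 0..1
        return severity * probability >= threshold

    should_alert(9, 0.6)   # True: mega bad, even at modest confidence
    should_alert(3, 0.9)   # False: marginally bad, despite high confidence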
IME CPU% alerts are typically totally spurious in a modern cloud application. In general, to get the most out of your spend, you actually want your instances working close to their limits because the intent is to scale out when your application gets busy. Therefore, you instead want to monitor things that are as close to user experience or business metric as possible. P99 request latency, 5xx rate, etc. are OK, but ideally you go even further into application-specific metrics. For example, Facebook might ask: What's the latency between uploading a cat picture and getting its first like?
It requires building a risk assessment model and sensors/thresholds/alerts around it. That is quite a bit of work, and it is very specific to each case.
The problem you’re starting to run into is that you’re seeing the monitors as useless, which will ultimately lead to ignoring them, so when there is a problem you won’t know it.
What you should be doing is tuning the monitors to make them useful. If your app will see occasional spikes that last 10 minutes, and the monitor checks every 5 minutes, set it to only create an alert after 3 consecutive failures. That creates some tolerance for spikes, but will still alert you if there is a prolonged issue that needs to be addressed due to the inevitable performance issues it will cause.
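If your monitoring tool doesn't support that natively, the "3 consecutive failures" rule is only a few lines; this sketch assumes a check that runs every 5 minutes:

    CONSECUTIVE_FAILURES_TO_ALERT = 3
    failure_streak = 0

    def on_check_result(healthy: bool) -> str | None:
        global failure_streak
        if healthy:
            failure_streak = 0
            return None
        failure_streak += 1
        if failure_streak == CONSECUTIVE_FAILURES_TO_ALERT:
            return "ALERT: check has failed 3 times in a row (~15 minutes)"
        return None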
If there are other alerts that happen often that need action taken, which is repeatable, that’s where EDA (Event Driven Automation) would come in. Write some code to fix what needs to be fixed, and when the alert comes in the code automatically runs to fix it. You then only need to handle it when the EDA code can’t fix the issue. Fix it once in code instead of every time you get an alert.
We had a lot of alerts where restarting a service would fix it, so we had EDA do that. That effectively freed up 3 resources to do other things just for a single application we monitored.
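If you don't have an EDA product, the core of it can start as a small dispatcher like this sketch; the alert shape and the remediation list are placeholders:

    # Map alert types to remediation functions; escalate to a human when
    # there is no automation for the alert or the automation fails.
    import subprocess

    def restart_service(alert: dict) -> bool:
        result = subprocess.run(["systemctl", "restart", alert["service"]])
        return result.returncode == 0

    REMEDIATIONS = {
        "service_down": restart_service,
        # "disk_full": clean_safe_directories, ...
    }

    def handle_alert(alert: dict) -> None:
        fix = REMEDIATIONS.get(alert["type"])
        if fix and fix(alert):
            print(f"auto-remediated {alert['type']} ({alert.get('service')})")
        else:
            print(f"escalating {alert['type']} to a human")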
We have some EDA for disk cleanup, to delete files in some "safe" directories common to the OS, not applications. More often than not, the disk space issues are due to the application team and they really need to clean things up or make their own cleanup job. If you're the application owner you can be much more targeted, where I had to write something that would work for hundreds of different app teams. But of course if you own the app you can fix the excessive logging issues at the source, which is even better. Some vendor apps left a lot of junk out there. We'd clean up BladeLogic temp files (back when we used that) and of course temp directories that people never bothered to clean up.
Another thing we've used it for was to enrich the data in the alert ticket. If the first thing you do when getting a certain alert is to check various logs or run certain commands to get more information, have the EDA do that and put that data right in the ticket so you already have it. One simple example we had was for ping alerts. In many cases a ping alert would clear on its own, so we added some EDA to check the uptime on the server and put that information into the ticket. This way the ops person could quickly see if the server had rebooted. If that reboot was unexpected, the app team should be made aware and verify their app. Without that, a clear alert would be assumed to be some network latency and dismissed as "noise".
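A sketch of that enrichment step, assuming SSH access to the host and some ticket object you can attach notes to (both are placeholders):

    import subprocess

    def enrich_ping_alert(ticket: dict, host: str) -> dict:
        try:
            boot_time = subprocess.run(
                ["ssh", host, "uptime", "-s"],   # prints boot time, e.g. "2024-05-01 03:12:44"
                capture_output=True, text=True, timeout=10, check=True,
            ).stdout.strip()
            ticket["notes"] = f"host reports boot time: {boot_time}"
        except Exception as exc:                 # host may still be unreachable
            ticket["notes"] = f"could not fetch uptime: {exc}"
        return ticket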
Depending on how quickly an EDA band-aid can roll out vs. the real fix, EDA can also buy you time while you implement that fix, so you're not bogged down with operational work. This is especially true if the real fix for a problem will require massive changes that could take months to actually implement.
For a while we had a lot of issues with BTRFS filesystems, and we ended up making some EDA to run the btrfs balance when we started getting in the danger zone, to avoid the server locking up. This was a way to keep things under control and take the pressure off of the ops team while we migrated away from btrfs, which was a multi-year undertaking.
Reporting on your alert tickets should highlight any opportunities you might have, if they exist. If you have an ops team, ask them too, they'll know. But of course, if they can be fixed in the code, or the monitor just needs to be tuned, that's even better. EDA should be a last resort (other than the ticket enrichment use case).
The danger of EDA is that it can push problems to the back burner. If there is a chronic issue and EDA is taking care of it, the root cause is never resolved, because it's no longer the squeaky wheel. If you go down the EDA route, it is a good idea to report on how often it runs, and review it regularly to drive improvements in the app. High numbers of EDA-resolved alerts shouldn't be the goal. Ideally, those EDA metrics should also be driven down as the apps improve, just like you'd want to see from alerts being handled by humans. At the end of the day, they are still undesirable events that reduce the stability of your environment.
There are vendor solutions out there. Ansible now offers EDA as part of Ansible Automation Platform, though I haven’t been hands-on with it yet. That still requires writing Ansible playbooks, not to mention the overhead of AAP.
I don’t remember the name, but I sat in on a demo of an AI powered EDA platform probably 6 years ago (before the LLM craze). Their promise was that it would automatically figure out what to do and do it, and over time it would handle more and more incidents. It sounded a little terrifying. I could see it turning into a chaos monkey, but who knows.
Either way, there are some mature tools out there. What would work best depends on what you need to integrate with, cost, support, and how much code you are or aren’t willing to write.
If you are distracted by a high CPU alert that turns out to just be an expected spike, the alert needs to filter out spikes and only report persistent high CPU situations.
Think of how the body and brain report pain. Sudden new pain is more noticeable than a chronic background level of pain. Maybe CPU alarms should be that: CPU activity which is (a) X percent above the 24 hour average, (b) persistently for a certain duration, is alarm worthy.
So 100% CPU wouldn't be alarm-worthy for a node that is always at 100% CPU, as an expected condition: a very busy system constantly loaded down. But 45% CPU for 30 minutes could be alarm-worthy on a machine that averages 15%.
Kind of thing.
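A sketch of that "new pain vs. chronic pain" rule; all the constants are made up:

    from collections import deque

    BASELINE_SAMPLES = 24 * 60   # ~24h of 1-minute samples
    SUSTAIN_SAMPLES = 30         # must stay elevated ~30 minutes
    DELTA_PCT = 25               # how far above baseline counts as new pain

    history = deque(maxlen=BASELINE_SAMPLES)
    elevated_streak = 0

    def on_cpu_sample(cpu_pct: float) -> str | None:
        global elevated_streak
        baseline = sum(history) / len(history) if history else cpu_pct
        history.append(cpu_pct)
        elevated_streak = elevated_streak + 1 if cpu_pct > baseline + DELTA_PCT else 0
        if elevated_streak == SUSTAIN_SAMPLES:
            return f"ALERT: CPU {cpu_pct:.0f}% vs ~{baseline:.0f}% 24h baseline for 30 min"
        return None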
Monitoring like this will let you tune the application you are running.
Also, the picture might change during bursts of traffic, and if you don't have tools like Prometheus, Datadog, etc., you won't be able to tune accordingly.
The thing is that tuning a production setup is a bit of an art, there are many tradeoffs for what you do (typically, cost vs. benefit), so you need to make those decisions yourself.
If an alert is constantly firing and you are satisfied with how the system is running and with the tradeoffs, you should disable it.
For CPU, check CPU IOWait.
For memory, check the memory swap-in rate.
For disk, check latency or queue depth.
For network, check dropped packets.
All you want to check at an infrastructure layer is whether there is a bottleneck and what that bottleneck is. Whether an application is using 10% or 99% of available memory is moot if the application isn't impacted by it. The above metrics are indicators (but not always proof) that a resource is being bottlenecked and needs investigation.
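For reference, the raw counters behind most of those are easy to read on Linux; this sketch just pulls them from /proc. They are cumulative counters, so in practice you sample twice and alert on the delta, and the interface name is an assumption:

    def cpu_iowait_ticks() -> int:
        with open("/proc/stat") as f:
            fields = f.readline().split()   # "cpu user nice system idle iowait ..."
        return int(fields[5])

    def swap_in_pages() -> int:
        with open("/proc/vmstat") as f:
            for line in f:
                if line.startswith("pswpin "):
                    return int(line.split()[1])
        return 0

    def dropped_rx_packets(iface: str = "eth0") -> int:
        with open("/proc/net/dev") as f:
            for line in f:
                name, _, rest = line.partition(":")
                if name.strip() == iface:
                    return int(rest.split()[3])   # receive "drop" column
        return 0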
Monitor further up the application stack: check error-code rates over time, implement tracing for core user journeys to the extent that you can, and ignore infrastructure-level monitoring until you have no choice.
I would recommend alerting on reliability. If errors for an endpoint go above whatever threshold you judge appropriate, e.g. 1% or 0.1% or 0.01%, for a sustained period, then alarm.
Maybe do the same for latency.
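A sketch of that, approximating "sustained" with a rolling window of recent requests per endpoint; the window size and the 0.1% limit are placeholders:

    from collections import defaultdict, deque

    WINDOW = 1000              # last N requests per endpoint
    ERROR_RATE_LIMIT = 0.001   # 0.1%

    results = defaultdict(lambda: deque(maxlen=WINDOW))

    def record(endpoint: str, ok: bool) -> None:
        results[endpoint].append(ok)

    def check(endpoint: str) -> str | None:
        window = results[endpoint]
        if len(window) < WINDOW:
            return None
        rate = window.count(False) / len(window)
        if rate > ERROR_RATE_LIMIT:
            return f"ALERT: {endpoint} error rate {rate:.2%} over last {WINDOW} requests"
        return None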
For hobby projects though I just point a free tier of one of those down detector things at a few urls. I may make a health check url.
Every false alarm should lead to some decision about how to fix it: e.g. a different alarm, a different threshold, or even just dropping that alarm.