A $1k AWS mistake
Mood: heated
Sentiment: negative
Category: tech
Key topics: AWS, Cloud Costs, Infrastructure Management
The author shares a $1,000 AWS mistake caused by a misconfigured NAT gateway, sparking a heated discussion about cloud costs, AWS pricing, and the need for better cost management features.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 41m
Peak period: Day 1 (149 comments)
Avg / period: 80
Based on 160 loaded comments
Key moments
1. Story posted: Nov 19, 2025 at 5:00 AM EST (4d ago)
2. First comment: Nov 19, 2025 at 5:41 AM EST (41m after posting)
3. Peak activity: 149 comments in Day 1, the hottest window of the conversation
4. Latest activity: Nov 20, 2025 at 3:56 PM EST (3d ago)
- DevOops: And I can see how, in very big accounts, small mistakes on your data source when you're doing data crunching, or wrong routing, can put thousands and thousands of dollars on your bill in less than an hour.
--
0: https://blog.cloudflare.com/aws-egregious-egress/

By default a NGW is limited to 5Gbps: https://docs.aws.amazon.com/vpc/latest/userguide/nat-gateway...
A GB transferred through a NGW is billed at 0.05 USD
So, at continuous max transfer speed, it would take almost 9 hours to reach $1000
Assuming a multi-AZ setup with three AZs, it's still 3 hours if you've messed up so badly that you manage to max out all three NGWs
I get your point but the scale is a bit more nuanced than "thousands and thousands of dollars on your bill in less than an hour"
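A quick back-of-envelope check of those numbers, as a Python sketch that assumes only the 5 Gbps NGW limit and the $0.05/GB rate quoted above:

```python
# Back-of-envelope check of the NAT gateway figures quoted above:
# 5 Gbps per NGW, billed at $0.05 per GB processed.
GBPS_LIMIT = 5          # default NAT gateway bandwidth limit, in Gbps
PRICE_PER_GB = 0.05     # USD billed per GB processed
TARGET_BILL = 1000      # USD

gb_per_second = GBPS_LIMIT / 8           # 0.625 GB/s at sustained line rate
gb_needed = TARGET_BILL / PRICE_PER_GB   # 20,000 GB to reach $1000

hours_single_ngw = gb_needed / gb_per_second / 3600
print(f"one NGW at full speed:    {hours_single_ngw:.1f} h")      # ~8.9 hours
print(f"three NGWs at full speed: {hours_single_ngw / 3:.1f} h")  # ~3.0 hours
```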
The default limitations won't allow this.
Let's say they decide to recalculate or test an algorithm: they do parallel data loading from the bucket(s), and they're pulling from the wrong endpoint or region, and off they go.
And maybe they're sending data back, so they double the transfer price. RDS Egress. EC2 Egress. Better keep good track of your cross-region data!
Crucial for the approval was that we already had cost alerts enabled before it happened and were able to show that they didn't help at all, because they triggered far too late. We also had to explain in detail what measures we implemented to ensure that such a situation doesn't happen again.
Do you just delete when the limit is hit?
> (the cap should be a rate, not a total)
this is _way_ more complicated than there being a single cap.
The cap can be opt-in.
People will opt into this cap, and then still be surprised when their site gets shut down.
It's great for ML research too, as you can just SSH into a pod with VScode and drag in your notebooks and whatnot as if it were your own computer, but with a 5090 available to speed up training.
s/everyone has/a bunch of very small customers have/
Hard no. Had to pay, I think, $100 for premium support to find that out.
A bunch of data went down the "wrong" pipe, but in reality most likely all the data never left their networks.
- raw
- click-ops
Because, when you build your infra from scratch on AWS, you absolutely don't want the service gateways to exist by default. You want full control over everything, and that's how it works now. You don't want AWS to insert routes in your route tables on your behalf. Or worse, to have hidden routes that are used by default.
But I fully understand that some people don't want to be bothered by those technicalities and want something that works and is optimized following the Well-Architected Framework pillars.
IIRC they already provide some CloudFormation Stacks that can do some of this for you, but it's still too technical and obscure.
Currently they probably rely on their partner network to help onboard new customers, but for small customers it doesn't make sense.
Why? My work life is in terraform and cloudformation and I can't think of a reason you wouldn't want to have those by default. I mean I can come up with some crazy excuses, but not any realistic scenario. Have you got any? (I'm assuming here that they'd make the performance impact ~0 for the vpc setup since everyone would depend on it)
If I declare two aws_route resources for my route table, I don't want a third route existing and being invisible.
I agree that there is no logical reason to not want a service gateway, but it doesn't mean that it should be here by default.
The same way you need to provision an Internet Gateway, you should create your services gateways by yourself. TF modules are here to make it easier.
Everything that comes by default won't appear in your TF, so it becomes invisible and the only way to know that it exists is to remember that it's here by default.
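For what it's worth, provisioning the S3 gateway endpoint explicitly is a single API call. Here is a minimal sketch in boto3 rather than the Terraform the commenter describes; the region, VPC ID, and route table ID are placeholders:

```python
# Minimal sketch: explicitly create an S3 gateway endpoint so that S3 traffic
# bypasses the NAT gateway. All identifiers below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

response = ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",             # hypothetical VPC
    ServiceName="com.amazonaws.eu-west-1.s3",  # S3 gateway service in that region
    RouteTableIds=["rtb-0123456789abcdef0"],   # route tables that get the prefix-list route
)
print(response["VpcEndpoint"]["VpcEndpointId"])
```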
How does this actually work? So you upload your data to AWS S3 and then if you wish to get it back, you pay per GB of what you stored there?
You can see why, from a sales perspective: AWS' customers generally charge their customers for data they download - so they are extracting a % off that. And moreover, it makes migrating away from AWS quite expensive in a lot of circumstances.
Please get some training...and stop spreading disinformation. And to think on this thread only my posts are getting downvoted....
"Free data transfer out to internet when moving out of AWS" - https://aws.amazon.com/blogs/aws/free-data-transfer-out-to-i...
In the link you posted, it even says Amazon can't actually tell if you're leaving AWS or not so they're going to charge you the regular rate. You need explicit approval from them to get this 'free' data transfer.
There is nothing in the blog suggesting that this requires approval from some committee or that it is anything more than a simple administrative step. And if AWS were to act differently, you have grounds to point to the official blog post and request that they honor the stated commitment.
People are trying to tell you something with the downvotes. They're right.
Egress bandwidth costs money. Consumer cloud services bake it into a monthly price, and if you’re downloading too much, they throttle you. You can’t download unlimited terabytes from Google Drive. You’ll get a message that reads something like: “Quota exceeded, try again later.” — which also sucks if you happen to need your data from Drive.
AWS is not a consumer service so they make you think about the cost directly.
Sure you can just ram more connections through the lossy links from budget providers or use obscure protocols, but there's a real difference.
Whether it's fairly priced, I suspect not.
Can you share your tuning parameters on each host? If you aren't doing exactly the same thing on AWS as you are on Hetzner you will see different results.
Bypassing the TCP issue, I can see nothing indicating low network quality; a single UDP iperf3 pass maintains line-rate speed without issue.
Edit: My ISP peers with Hetzner, as do many others. If you think it's "lossy" I'm sure someone in network ops would want to know about it. If you're getting random packet loss across two networks you can have someone look into it on both ends.
We are programmed to receive. You can check out any time you like, but you can never leave
And people wonder why Cloudflare is so popular, when a random DDoS can decide to start inflicting costs like that on you.
But “security,” people might say. Well, you can be secure and keep the behavior opt-out, but you should be able to have an interface that is upfront and informs people of the implications.
Though it's important to note that this specific case was a misconfiguration that is easy to make/not understand, in that the data was not intended to leave AWS services (and thus should have been free); but because it went through the NAT gateway, the data did leave the AWS nest and was charged a higher rate per GB, by about an order of magnitude, than just pulling everything straight out of S3/EC2 (generally speaking, YMMV depending on region, requests, total size, whether it's an expedited archival retrieval, etc.).
So this is an atypical case; it doesn't usually cost $1000 to pull 20TB out of AWS. Still, this is an easy mistake to make.
I have never understood why the S3 endpoint isn't deployed by default, except to catch people making this exact mistake.
Cloud cult was successfully promoted by all major players, and people have completely forgotten about the possibilities of traditional hosting.
But when I see a setup form for an AWS service or the never-ending list of AWS offerings, I get stuck almost immediately.
BTW you can of course self-host k8s, or dokku, or whatnot, and have as easy a deployment story as with the cloud. (But not necessarily as easy a maintenance story for the whole thing.)
That's my whole point. Zero maintenance.
For a tinkerer who's focused on the infra, then sure, hosting your own can make sense. But for anyone who's focused on literally anything else, it doesn't make any sense.
Are you really going to trust Claude Code to recover in that situation? Do you think it will? I've had DB primaries fail on managed DBs like AWS RDS and Google Cloud SQL, and recovery is generally automatic within minutes. You don't have to lift a finger.
Same goes for something like a managed k8s cluster, like EKS or GKE. There's a big difference between using a fully-managed service and trying to replicate a fully managed system on your own with the help of an LLM.
Of course it does boil down to what you need. But if you need reliability and don't want to have to deal with admin, managed services can make life much simpler. There's a whole class of problems I simply never have to think about.
This is such an imaginary problem. The examples like this you hear about are inevitably the outliers who didn't pay any attention to this issue until they were forced to.
For most services, it's incredibly easy to constrain your costs anyway. You do have to pay attention to the pricing model of the services you use, though - if a DDOS is going to generate a big cost for you, you probably made a bad choice somewhere.
> You really can't think of _one_ case where self hosting makes any sense?
Only if it's something you're interested in doing, or if you're so big you can hire a team to deal with that. Otherwise, why would you waste time on it?
I kind of assume that goes without saying, but you're right.
The company I'm with does model training on cloud GPUs, but it has funding for that.
> RDS + EKS for a couple hundred a month is an amazing deal for what is essentially zero maintenance application hosting.
Right. That's my point, and aside from GPU, pretty much any normal service or app you need to run can be deployed on that.
... or for a big company. I've worked at companies with thousands of developers, and it's all been 'self hosted'. In DCs, so not rinky dink, but yes, and there's a lot of advantages to doing it this way. If you set it up right, it can be much easier for developers to use than AWS.
"I'd like to spend the next sprint on S3 endpoints by default"
"What will that cost"
"A bunch of unnecessary resources when it's not used"
"Will there be extra revenue?"
"Nah, in fact it'll reduce our revenue from people who meant to use it and forgot before"
"Let's circle back on this in a few years"
I was lucky to have experienced all of the same mistakes for free (ex-Amazon employee). My manager just got an email saying the costs had gone through the roof and asked me to look into it.
Feel bad for anyone that actually needs to cough up money for these dark patterns.
They made a mistake and are sharing it for the whole world to see, in order to help others avoid making it.
It's brave.
Unlike punching down.
You realize they didn’t ask you to read their article right? They didn’t put it on your fridge or in your sandwich.
Policing who writes what honest personal experience on the Internet is not a job that needs doing.
But if you do feel the need to police, don’t critique the writer, but HN for letting interested readers upvote the article here, where it is of course, strictly required reading.
I mean, drill down to the real perpetrators of this important “problem”!
And then writing “I regret it” posts that end up on HN.
Why are people not getting the message to not use AWS?
There are SO MANY other faster, cheaper, less complex, more reliable options, but people continue to use AWS. It makes no sense.
https://www.ionos.com/servers/cloud-vps
$22/month for 18 months with a 3-year term: 12 vCores, 24 GB RAM, 720 GB NVMe
Unlimited 1Gbps traffic
And even EC2 is not just a VPS
If you need a simple VPS, yes, by all means, don't use AWS.
For this usecase AWS is definitely not cheaper nor simpler. Nobody said that. Ever.
Anything AWS does you can run on Linux computers.
It’s naive to think that AWS is some sort of magically special system that transcends other networked computers, out of brand loyalty.
That’s the AWS kool aid that makes otherwise clever people think there’s no way any organization can run their own computer systems - only AWS has the skills for that.
I sometimes feel bad using people's services built with S3 as I know my personal usage is costing them a lot of money despite paying them nothing.
But you are absolutely right, I'm drinking the AWS kool aid like thousands of other otherwise clever people who don't know that AWS is just Linux computers!
And it’s always the same - clouds refuse to provide anything more than alerts (that are delayed) and your only option is prayer and begging for mercy.
Followed by people claiming with absolute certainty that it's literally technically impossible to provide hard-capped accounts to tinkerers, despite there being accounts like that in existence already (some Azure accounts are hard-capped by amount, but of course that's not loudly advertised).
It's easier to waive cost overages than deal with any of that.
AWS is less like your garage door and more like the components to build an industrial-grade blast-furnace - which has access doors as part of its design. You are expected to put the interlocks in.
Without the analogy, the way you do this on AWS is:
1. Set up an SNS topic
2. Set up AWS Budgets notifications to post to it
3. Set up a Lambda that subscribes to the SNS topic
And then in the lambda you can write your own logic which is smart: shut down all instances except for RDS, allow current S3 data to remain there but set the public bucket to now be private, and so on.
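A minimal sketch of such a Lambda, assuming it is subscribed to the SNS topic that AWS Budgets posts to; the instance filter and the bucket name are placeholders, and real logic would be more selective about what it stops:

```python
# Sketch of a budget-alert Lambda: stop running EC2 instances and block
# public access on a known bucket instead of deleting anything.
import boto3

ec2 = boto3.client("ec2")
s3 = boto3.client("s3")

def handler(event, context):
    # AWS Budgets -> SNS -> Lambda: the alert text arrives in the SNS record.
    message = event["Records"][0]["Sns"]["Message"]
    print("Budget alert received:", message)

    # Stop every running EC2 instance (RDS is a separate service and is left alone).
    running = ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    instance_ids = [
        inst["InstanceId"]
        for reservation in running["Reservations"]
        for inst in reservation["Instances"]
    ]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)

    # Make the formerly public bucket private rather than touching its data.
    s3.put_public_access_block(
        Bucket="my-public-bucket",  # hypothetical bucket name
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )
```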
The obvious reason why "stop all spending" is not a good idea is that it would require things like "delete all my S3 data and my RDS snapshots" and so on which perhaps some hobbyist might be happy with but is more likely a footgun for the majority of AWS users.
In the alternative world where the customer's post is "I set up the AWS budget with the stop-all-spending option and it deleted all my data!", you can't really give them back the data. But in this world, you can give them back the money. So this is the safer of the two.
Data transfer can be pulled into the same model by having an alternate internet gateway model where you pay for some amount of unmetered bandwidth instead of per byte transfer, as other providers already do.
Actual spikey traffic that you can't plan for or react to is something I've never heard of, and believe is a marketing myth. If you find yourself actually trying to suddenly add a lot of capacity, you also learn that the elasticity itself is a myth; the provisioning attempt will fail. Or e.g. lambda will hit its scaling rate limit way before a single minimally-sized fargate container would cap out.
If you don't mind the risk, you could also just not set a billing limit.
The actual reason to use clouds is for things like security/compliance controls.
I also expect that in reality, if you regularly try to provision 10,000 cores of capacity at once, you'll likely run into provisioning failures. Trying to cost optimize your business at that level at the risk of not being able to handle your daily needs is insane, and if you needed to take that kind of risk to cut your compute costs by 6x, you should instead go on-prem with full provisioning.
Having your servers idle 85% of the day does not matter if it's cheaper and less risky than doing burst provisioning. The only one benefiting from you trying to play utilization optimization tricks is Amazon, who will happily charge you more than those idle servers would've cost and sell the unused time to someone else.
And why is that a problem? And how different is that from "forgetting" to pay your bill and having your production environment brought down?
AWS will remind you for months before they actually stop it.
1) You hit the cap.
2) AWS sends an alert, but your stuff still runs at no cost to you for 24h.
3) If there's no response, AWS shuts it down forcefully.
4) AWS eats the "cost" because, let's face it, it basically costs them a 1000th of what they bill you for.
5) You get this buffer 3 times a year. After that, they still do the 24h forced shutdown, but you get billed.
Everybody wins.
Since there are in fact two ropes, maybe cloud providers should make it easy for customers to avoid the one they most want to avoid?
When my computer runs out of hard drive it crashes, not goes out on the internet and purchases storage with my credit card.
It is technically impossible. In that no tech can fix the greed of the people taking these decisions.
> No cloud provides wants to give their customers that much rope to hang themselves with.
They are so benevolent to us...
I've used AWS for about 10 years and am by no means an expert, but I've seen all kinds of ugly cracks and discontinuities in design and operation among the services. AWS has felt like a handful of very good ideas, designed, built, and maintained by completely separate teams, littered by a whole ton of "I need my promotion to VP" bad ideas that build on top of the good ones in increasingly hacky ways.
And in any sufficiently large tech organization, there won't be anyone at a level of power who can rattle cages about a problem like this who will also want to be the one to actually do it. No "VP of Such and Such" will spend their political capital stressing how critical it is that they fix the thing that will make a whole bunch of KPIs go in the wrong direction. They're probably spending it on shipping another hacked-together service with Web2.0-- er. IOT-- er. Blockchai-- er. Crypto-- er. AI before promotion season.
It wasn't when the service was first created. What's intentionally malicious is not fixing it for years.
Somehow AI companies got this right from the get-go. Money up front; no money, no tokens.
It's easy to guess why. Unlike hosting infra bs, inference is a hard cost for them. If they don't get paid, they lose (more) money. And sending stuff to collections is expensive and bad press.
That’s not a completely accurate characterization of what’s been happening. AI coding agent startups like Cursor and Windsurf started by attracting developers with free or deeply discounted tokens, then adjusted the pricing as they figure out how to be profitable. This happened with Kiro too[1] and is happening now with Google’s Antigravity. There’s been plenty of ink spilled on HN about this practice.
[1] disclaimer: I work for AWS, opinions are my own
I haven’t seen any of the major model providers have a system where you use as many tokens as you want and then they bill you, like AWS has.
I dunno, Aurora’s pricing structure feels an awful lot like that. “What if we made people pay for storage and I/O? And we made estimating I/O practically impossible?”
Even if you are not currently hitting performance limits of your current engine, Aurora would maintain the same throughput and latency on smaller instance classes. Which is where the potential cost savings come from...
On top of that, with Aurora Serverless, variable and unpredictable workloads could see significant cost savings.
You can get an insane amount of performance out of a well-tuned MySQL or Postgres instance, especially if you’ve designed your schema to exploit your RDBMS’ strengths (e.g. taking advantage of InnoDB’s clustering index to minimize page fetches for N:M relationships).
And if you really need high performance, you use an instance with node-local NVMe storage (and deal with the ephemerality, of course).
0: https://hackmysql.com/are-aurora-performance-claims-true/
Also from the link: when MySQL is properly tuned, the performance gap narrows substantially but is still 1.5x to 3x for the workloads tested in the article, something I would call massive.
Most workloads are somewhere between 90:10 to 98:2 reads:writes, and most tables have at least one (if not more) secondary indices.
You’re of course welcome to disagree, but speaking as a DBRE who has used both MySQL and Postgres, RDS and Aurora in production, I’m telling you that Aurora does not win on performance.
The biggest cool thing Aurora MySQL does, IMO, is maintain the buffer pool on restarts. Not just dump / restore, but actually keeps its residence. They split it out from the main mysqld process so restarting it doesn’t lose the buffer pool.
But all in all, IMO it’s hideously expensive, doesn’t live up to its promises, and has some serious performance problems due to the decision to separate compute and storage by a large physical distance (and its simplification down to redo log only): for example, the lack of a change buffer means that secondary indices have to be written synchronously. Similarly, since AWS isn’t stupid, they have “node-local” (it’s EBS) temporary storage, which is used to build the aforementioned secondary indices, among other things. The size of this is not adjustable, and simply scales with instance size. So if you have massive tables, you can quite easily run out of room in this temporary storage, which at best kills the DDL, and at worst crashes the instance.
Unfortunately, that's not correct. A multi-trillion-dollar company most certainly has not just such a person, but many departments with hundreds of people tasked with precisely that: maximizing revenue by exploiting every dark pattern they can possibly think of.
It would be good to provide a factual basis for such a confident contradiction of the GP. This reads as “no, your opinion is wrong because my opinion is right”.
It's someone in a Patagonia vest trying to avoid getting PIP'd.
I have budgets set up and alerts through a separate alerting service that pings me if my estimates go above what I've set for a month. But it wouldn't fix a short term mistake; I don't need it to.
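For readers who want the same guardrail, here is a minimal boto3 sketch of that kind of budget with a forecast-based alert; the account ID, SNS topic ARN, and the $100 limit are placeholders:

```python
# Sketch: a $100/month cost budget that notifies an SNS topic when the
# forecasted spend crosses 80% of the limit. All identifiers are placeholders.
import boto3

budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="123456789012",
    Budget={
        "BudgetName": "monthly-cost-guardrail",
        "BudgetType": "COST",
        "TimeUnit": "MONTHLY",
        "BudgetLimit": {"Amount": "100", "Unit": "USD"},
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "FORECASTED",  # alert on projected, not just actual, spend
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,                 # percent of the budget limit
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {
                    "SubscriptionType": "SNS",
                    "Address": "arn:aws:sns:us-east-1:123456789012:billing-alerts",
                }
            ],
        }
    ],
)
```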
The lack of business case is the most likely culprit. "You want to put engineering resources into something that only the $100/mo guys are going to use?"
You might be tempted to think "but my big org will use that", but I can guarantee compliance will shut it down -- you will never be permitted to enable a feature that intentionally causes hard downtime when (some external factor) happens.
Conversely the first time someone hits an edge case in billing limits and their site goes down, losing 10k worth of possible customer transactions there's no way to unring that bell.
The second constituency are also, you know, the customers with real cloud budgets. I don't blame AWS for not building a feature that could (a) negatively impact real, paying customers (b) is primarily targeted at people who by definition don't want to pay a lot of money.
But an opt-in "I'd rather you delete data or disable things than send me a 100k bill" toggle with suitable disclaimers would mean people can safely learn.
That way everyone gets what they want. (Well, except the cloud providers, who presumably don't like limits on their open-ended bills.)
But hey, let's say you have different priorities than me. Then why not both? Why not let me set the hard cap? Why does Amazon insist on being able to bill me for more than my business is worth if I make a mistake?
Still sounds kind of ugly.
The key problem is that data loss is really bad pr which cannot be reversed. Overcharge can be reversed. In a twisted way it might even strengthen the public image, I have seen that happen elsewhere.
But over the last few years, people have convinced themselves that the cost of ignorance is low. Companies hand out unlimited self-paced learning portals, tick the “training provided” box, and quietly stop validating whether anyone actually learned anything.
I remember when you had to spend weeks in structured training before you were allowed to touch real systems. But starting around five or six years ago, something changed: Practitioners began deciding for themselves what they felt like learning. They dismantled standard instruction paths and, in doing so, never discovered their own unknown unknowns.
In the end, it created a generation of supposedly “trained” professionals who skipped the fundamentals and now can’t understand why their skills have giant gaps.
The expectation that it just works is mostly a good thing.
Not if it's an Airbus A220 or similar. They made it easy to take off, but it is still a large commercial aircraft... easy to fly... for pilots...
It solves the problem of unexpected requests or data transfer increasing your bill across several services.
https://aws.amazon.com/blogs/networking-and-content-delivery...
Does "data transfer" not mean CDN bandwidth here? Otherwise, that price seems two orders of magnitude less than I would expect
https://news.ycombinator.com/item?id=45975411
I agree that it’s likely very technically difficult to find the right balance between capping costs and not breaking things, but this shows that it’s definitely possible, and hopefully this signals that AWS is interested in doing this in other services too.
101 more comments available on Hacker News