AWS in 2025: Stuff You Think You Know That's Now Wrong
Source: lastweekinaws.com
Key topics: AWS, Cloud Computing, Infrastructure
The article 'AWS in 2025: Stuff you think you know that's now wrong' discusses recent changes in AWS services, sparking a discussion among HN users about the implications and accuracy of these changes.
Snapshot generated from the HN discussion
Key moments
- Story posted: Aug 20, 2025 at 11:30 AM EDT
- First comment: Aug 20, 2025 at 12:06 PM EDT (36 minutes after posting)
- Peak activity: 52 comments in the first 3 hours
- Latest activity: Aug 21, 2025 at 10:41 PM EDT
Based on 160 loaded comments
Hacker News story ID: 44962844
Not strictly true.
If key prefixes don’t matter much any more, then it’s a very recent change that I’ve missed.
S3 will automatically do this over time now, but I think there are/were edge cases still. I definitely hit one and experienced throttling at peak load until we made the change.
But I don’t know what conversations did or did not happen behind the scenes.
Please see the documentation: https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimi...
This 2024 re:Invent session "Optimizing storage performance with Amazon S3 (STG328)" which goes very deep on the subject: https://www.youtube.com/watch?v=2DSVjJTRsz8
And this blog that discusses Iceberg's new base-2 hash file layout which helps optimize request scaling performance of large-scale Iceberg workloads running on S3: https://aws.amazon.com/blogs/storage/how-amazon-ads-uses-ice...
"If you want to partition your data even better, you can introduce some randomness in your key names": https://youtu.be/2DSVjJTRsz8?t=2206
FWIW, the optimal way we were told to partition our data was like this: 010111/some/file.jpg.
Here `010111/` is a random binary string, which pleases both the automatic partitioning (503s => partition) and any manual partitioning you could ask AWS for. "Pleases" in the sense that the cardinality of partitions grows more slowly with each character than with prefixes like `az9trm/`.
We were told that the latter style makes manual partitioning a challenge, because as soon as you reach two characters you've already created 36x36 = 1,296 partitions.
The issue with that: your keys are no longer meaningful if you're relying on S3 "folders" by tenant, for example (customer1/..).
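A minimal sketch of that key scheme, assuming a 6-bit random prefix; the prefix length and helper name are my own choices, not anything prescribed by AWS:

```python
# Prepend a short random base-2 prefix so request load spreads across S3 partitions.
import secrets

def partitioned_key(original_key: str, bits: int = 6) -> str:
    prefix = format(secrets.randbelow(2 ** bits), f"0{bits}b")  # e.g. "010111"
    return f"{prefix}/{original_key}"

# The generated key must be recorded somewhere (e.g. a database), since the random
# prefix means you can no longer derive the full key from the tenant path alone.
print(partitioned_key("customer1/some/file.jpg"))
```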
They spent the effort of branding private VPC endpoints "PrivateLink". Maybe it took some engineering effort on their part, but it should be the default out of the box, and an entirely unremarkable feature.
In fact, I think if you have private subnets, the only way to use S3 etc is Private Link (correct me if I'm wrong).
It's just baffling.
And, as an added benefit, they distinguish between "just pull" and "pull and push", which is nice.
People who probably shouldn't be on AWS - but they usually have to be for unrelated reasons, and they will work to reduce their bill.
This just sounds like a polite way of saying "we're taking peoples' money in exchange for nothing of value, and we can get away with it because they don't know any better".
Hideous.
S3 can use either, and we recommend establishing VPC Gateway endpoints by default whenever you need S3 access.
(Disclaimer: I work for AWS, opinions are my own.)
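For reference, a hedged boto3 sketch of what creating such a gateway endpoint looks like; the region, VPC ID, and route table ID are placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",            # placeholder VPC
    ServiceName="com.amazonaws.us-east-1.s3",
    VpcEndpointType="Gateway",                 # gateway, not interface: no ENI, no hourly charge
    RouteTableIds=["rtb-0123456789abcdef0"],   # placeholder; S3 prefix-list routes land here
)
```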
They should be, of course, at least when the destination is an AWS service in the same region.
[edit: I'm speaking about interface endpoints, but S3 and DynamoDB can use gateway endpoints, which are free to the same region]
The other problem with (interface) VPC endpoints is that they eat up IP addresses. Every service/region permutation needs a separate IP address drawn from your subnets. Immaterial if you're using IPv6, but can be quite limiting if you're using IPv4.
Just some internet services that haven’t upgraded. (But fixed by NAT.)
> S3 can use either, and we recommend establishing VPC Gateway endpoints by default whenever you need S3 access.
Other AWS services, though, don't support gateway endpoints.
~~I get the impression there are several others, too, but that one is of especial interest to me~~ Wowzers, they really are much better now:
If you're saying "other services should offer VPC Endpoints," I am 100% on-board. One should never have to traverse the Internet to contact any AWS control plane.
Would you recommend using VPC Gateway even on a public VPC that has an Internet gateway (note: not a NAT gateway)? Or only on a private VPC or one with a NAT gateway?
Fascinating. What's the advantage of doing that?
Gateway endpoints only work for some things.
https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpo...
(Disclaimer: I work for AWS, opinions are my own.)
This has been a common gotcha for over a decade now: https://www.lastweekinaws.com/blog/the-aws-managed-nat-gatew...
There are many IaC libraries, including the standard CloudFormation VPC template and CDK VPC class, that can create them automatically if you so choose. I suspect the same is also true of commonly-used Terraform templates.
It's a convenience vs. security argument, though the documentation could be better (including via AWS recommended settings if it sees you using S3).
NAT gateways are not purely hands-off: you can attach additional IP addresses to a NAT gateway to help it scale to more instances behind it, which is a fundamental part of how NAT gateways work in network architectures, because of the limit on the number of ports that can be opened through a single IP address. When you use a VPC Gateway Endpoint, that traffic doesn't use up ports or IP addresses attached to a NAT gateway at all.
And what about metering? You pay per GB for traffic passing through the NAT gateway, but presumably not for traffic to an implicit built-in S3 gateway. So do you expect AWS to show you different meters for billed and not-billed traffic, but performance still depends on the sum total of the traffic (S3 and Internet egress) passing through it? How is that not confusing?
It's also beside the point that not all NAT gateways are used for Internet egress; indeed, there are many enterprise networks with nested layers of private networks where NAT gateways help deal with overlapping private IP CIDR ranges. In such cases, having some kind of implicit built-in S3 gateway violates assumptions about how network traffic is controlled and routed, since the assumption is that the traffic stays completely private. So even if it were supported, it would need to be disabled by default (for secure defaults), and you're right back at the equivalent of the situation you have today, where the VPC Gateway Endpoint is a separate resource to be configured.
Not to mention that VPC Gateway Endpoints allow you to define a policy on the gateway describing what may pass through, e.g. permitting read-only traffic through the endpoint but not writes. Not sure how you expect that to work with NAT gateways. This is something that AWS and Azure have very similar implementations for that work really well, whereas GCP only permits configuring such controls at the Organization level (!)
They are just completely different networking tools for completely different purposes. I expect closed-by-default secure defaults. I expect AWS to expose the power of different networking implements to me because these are low-level building blocks. Because they are low-level building blocks, I expect for there to be footguns and for the user to be held responsible for correct configuration.
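To illustrate the endpoint-policy point above, a hedged boto3 sketch that attaches a read-only policy to an existing S3 gateway endpoint; the endpoint ID is a placeholder and the policy is a deliberately simple example, not a production recommendation:

```python
import json
import boto3

read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": "*",
        "Action": ["s3:GetObject", "s3:ListBucket"],  # reads only; writes are not allowed
        "Resource": "*",
    }],
}

ec2 = boto3.client("ec2")
ec2.modify_vpc_endpoint(
    VpcEndpointId="vpce-0123456789abcdef0",  # placeholder endpoint
    PolicyDocument=json.dumps(read_only_policy),
)
```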
And the traffic never even reaches the public internet. There's a mismatch between what the billing is supposedly for and what it's actually applied to.
> do you expect AWS to show you different meters for billed and not-billed traffic, but performance still depends on the sum total of the traffic (S3 and Internet egress) passing through it?
Yes.
> How is that not confusing?
That's how network ports work. They only go so fast, and you can be charged based on destination. I don't see the issue.
> It's also besides the point that not all NAT gateways are used for Internet egress
Okay, if two NAT gateways talk to each other it also should not have egress fees.
> some kind of implicit built-in S3 gateway violates assumptions
So don't do that. Checking if the traffic will leave the datacenter doesn't need such a thing.
On the one hand, this is obviously the right decision. The number of giant data breaches caused by incorrectly configured S3 buckets is enormous.
But... every year or so I find myself wanting to create an S3 bucket with public read access so I can serve files out of it. And every time I need to do that, I find something has changed, my old recipe doesn't work any more, and I have to figure it out again from scratch!
I'm still not sure I know how to do it if I need to again.
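For what it's worth, here is one hedged recipe that works at the time of writing, using boto3; "example-public-site" is a placeholder, and outside us-east-1 create_bucket also needs CreateBucketConfiguration={"LocationConstraint": "<region>"}:

```python
# Create the bucket, loosen the two policy-related Block Public Access settings,
# then attach a public-read bucket policy (ACL-based public access stays blocked).
import json
import boto3

s3 = boto3.client("s3")
bucket = "example-public-site"

s3.create_bucket(Bucket=bucket)

s3.put_public_access_block(
    Bucket=bucket,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": False,      # allow attaching a public bucket policy
        "RestrictPublicBuckets": False,  # honor that policy for anonymous reads
    },
)

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "PublicRead",
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": f"arn:aws:s3:::{bucket}/*",
    }],
}
s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```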
For small scale stuff, S3s storage and egress charges are unlikely to be impactful. But it doesn’t mean they’re cheap relative to the competition.
There are also ways you can reduce S3 costs, but then you're trading the costs paid to AWS for the cost of hiring competent DevOps. Either way, you pay.
I'm not sure if that's changed recently, I've stopped using it.
eksctl just really impressed me with its eks management, specifically managed node groups & cluster add-ons, over terraform.
that uses cloudformation under the hood. so i gave it a try, and it’s awesome. combine with github actions and you have your IAC automation.
nice web interface for others to check stacks status, events for debugging and associated resources that were created.
oh, ever destroy some legacy complex (or not that complex) aws shit in terraform? it’s not going to be smooth. site to site connections, network interfaces, subnets, peering connections, associated resources… oh, my.
so far cloudformation has been good at destroying, but i haven’t tested that with massive legacy infra yet.
but i am happily converted tf>cf.
and will happily use both alongside each other as needed.
It does compile down to Azure Resource Manager's json DSL, so in that way close to Troposphere I guess, only both sides are official and not just some rando project that happens to emit yaml/json
The implementation, of course, is ... very Azure, so I don't mean to praise using it, merely that it's a better idea than rawdogging json
The syntax does look nicer but sadly that’s just a superficial improvement.
1. I need a goddamn CLI to run it (versus giving someone a URL they can load in their tenant and have running resources afterward)
1. the goddamn CLI mandates live cloud credentials, but then straight-up never uses them to check a goddamn thing it intends to do to my cloud control plane
You may say "running 'plan' does" and I can offer 50+ examples clearly demonstrating that it does not catch the most facepalm of bugs
1. related to that, having a state file that believes it knows what exists in the world is just ludicrous and pain made manifest
1. a tool that thinks nuking things is an appropriate fix ... whew. Although I guess in our new LLM world, saying such things makes me the old person who should get onboard the "nothing matters" train
and the language is a dumpster, imho
This isn’t a limitation of TF, it’s an intended consequence of cloud vendor lock in
> 1. I need a goddamn CLI to run it (versus giving someone a URL they can load in their tenant and have running resources afterward)
CloudFormation is the only IaC that supports "running as a URL" and that's only because it's an AWS native solution. And CloudFormation is a hell of a lot more painful to write and slower to iterate on. So you're not any better off for using CF.
What usually happens with TF is you'd build a deploy pipeline. Thus you can test via the CLI then deploy via CI/CD. So you're not limited to just the CLI. But personally, I don't see the CLI as a limitation.
> the goddamn CLI mandates live cloud credentials, but then straight-up never uses them to check a goddamn thing it intends to do to my cloud control plane
All IaC requires live cloud credentials. It would be impossible for them to work without live credentials ;)
Terraform does do a lot of checking. I do agree there is a lot that the plan misses though. That's definitely frustrating. But it's a side effect of cloud vendors having arbitrary conditions that are hard to define and forever changing. You run into the same problem with any tool you'd use to provision. Heck, even manually deploying stuff from the web console sometimes takes a couple of tweaks to get right.
> 1. related to that, having a state file that believes it knows what exists in the world is just ludicrous and pain made manifest
This is a very strange complaint. Having a state file is the bare minimum any IaC NEEDS for it to be considered a viable option. If you don't like IaC tracking state then you're really little better off than managing resources manually.
> a tool that thinks nuking things is an appropriate fix ... whew.
This is grossly unfair. Terraform only destroys resources when:
1. you remove those resources from the source. Which is sensible because you're telling Terraform you no longer want those resources
2. when you make a change that AWS doesn't support doing on live resources. Thus the limitation isn't Terraform, it is AWS
In either scenario, the destroy is explicit in the plan and expected behaviour.
Incorrect, ARM does too, they even have a much nicer icon for one click "Deploy to Azure" <https://learn.microsoft.com/en-us/azure/azure-resource-manag...> and as a concrete example (or whole repo of them): <https://github.com/Azure/azure-quickstart-templates/tree/2db...>
> All IaC requires live cloud credentials. It would be impossible for them to work without live credentials ;)
Did you read the rest of the sentence? I said it's the worst of both worlds: I can't run "plan" without live creds, but then it doesn't use them to check jack shit. Also, to circle back to our CF and Bicep discussion, no, I don't need cloud creds to write code for those stacks - I need only creds to apply them
I don't need a state file for CF or Bicep. Mysterious, huh?
Regarding HCL, I respect their decision to keep the language minimal, and for what it's worth you can go very, very far with language expressions and by using modules to abstract some logic, but I think it's a fair criticism that the language doesn't support custom functions and higher-level abstractions.
Also, they have CDK, which is a framework for writing IaC in Java, TypeScript, Go, Python, etc.
https://dev.to/kelvinskell/getting-started-with-aws-cdk-in-p...
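As a quick illustration of what CDK in Python looks like, a minimal sketch assuming aws-cdk-lib v2 (pip install aws-cdk-lib constructs); the stack and bucket names are arbitrary:

```python
from aws_cdk import App, RemovalPolicy, Stack
from aws_cdk import aws_s3 as s3
from constructs import Construct

class DemoStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # A private, versioned bucket with public access fully blocked.
        s3.Bucket(
            self,
            "DemoBucket",
            versioned=True,
            block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
            removal_policy=RemovalPolicy.RETAIN,
        )

app = App()
DemoStack(app, "DemoStack")
app.synth()  # `cdk deploy` then turns the synthesized template into a CloudFormation stack
```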
I can't confirm it, but I suspect that it was always meant to be a sales tool.
Every AWS announcement blog has a "just copy this JSON blob, and paste it $here to get your own copy of the toy demo we used to demonstrate in this announcement blog" vibe to it.
It's so simple for storing and serving a static website.
Are there good and cheap alternatives?
BTW: Is GitHub Pages still free for custom domains? (I don't know the EULA)
If you know what you're doing, as it sounds like you and I do, then all of this is very easy to get set up (but then aren't most things easy when you already know how? hehe). However we are talking about people who aren't comfortable with vanilla S3, so throwing another service into the mix isn't going to make things easier for them.
- signed URLs, in case you want session-based file downloads (see the sketch after this list)
- default public files, e.g. for a static site.
You can also map a (sub)domain to CloudFront with a CNAME record and serve the files via your own domain.
CloudFront distributions are also CDN-based, so you serve files local to the user's location, increasing the speed of your site.
For low to mid-range traffic, CloudFront with S3 is cheaper, as CloudFront's network cost is lower. But for large amounts of traffic, CloudFront costs can balloon very fast - though in those scenarios, S3 costs are prohibitive too!
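On the signed-URL point, a minimal boto3 sketch of an S3 presigned GET URL; the bucket and key names are placeholders (CloudFront signed URLs are a separate mechanism with their own key pairs):

```python
import boto3

s3 = boto3.client("s3")
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "example-bucket", "Key": "customer1/report.pdf"},
    ExpiresIn=3600,  # the link stops working after one hour
)
print(url)
```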
Yeah, what month?
Even if you have a terrible and permissive bucket policy or ACLs (legacy but still around) configured for the S3 bucket, if you have Block Public Access turned on - it won't matter. It still won't allow public access to the objects within.
If you turn it off but you have a well-scoped and ironclad bucket policy - you're still good! The bucket policy will dictate who, if anyone, has access. Of course, you have to make sure nobody inadvertently modifies that bucket policy over time, or adds an IAM role with access, or modifies the trust policy for an existing IAM role that has access, and so on.
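For concreteness, a minimal boto3 sketch of turning on all four Block Public Access settings for a bucket ("example-bucket" is a placeholder); with these on, a permissive ACL or bucket policy still won't expose objects publicly:

```python
import boto3

s3 = boto3.client("s3")
s3.put_public_access_block(
    Bucket="example-bucket",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```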
That being said, there's certainly a lot more that could go into making a system like that easier for developers. One thing that springs to mind is tooling that can describe what rules are currently in effect that limit (or grant, depending on the model) permissions for something. That would make it clearer when there are overlapping rules affecting the permissions of something, which in turn would make it much clearer why something is still not accessible from a given context despite one of the rules being removed.
After thinking about this sort of thing a lot when designing a system for something sort of similar (at a much smaller scale, but with the intent to define it in a way that could be extended to new types of rules for a given set of resources), I feel pretty strongly that the best way for a system like this to work - from the perspectives of security, ease of implementation, and intuitiveness for users - is to require every rule to be defined explicitly as a permission, rather than representing any of them as restrictions (both in how they're presented to the user and how they're modeled under the hood).
With this model, verifying whether an action is allowed can be implemented by mapping the action to the set of accesses (or mutations, as the case may be) it would perform, and then checking that each of them has a rule present that allows it. This makes it much easier to figure out whether something is allowed or not, and there's plenty of room for quality-of-life features to help users understand the system (e.g. being able to easily show a user what rules pertain to a given resource with essentially the same lookup you'd need when verifying an action on it).
My sense is that this is actually not far from how AWS permissions are implemented under the hood, but they completely fail at the user-facing side of this by making it much harder than it needs to be to discover where to define the rules for something (and, by extension, where to find the rules currently in effect for it).
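A toy sketch of that allow-only model, with invented rule shapes and names (not AWS's actual implementation): every access the request would perform must be covered by an explicit allow, and there are no deny rules to reason about.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    principal: str
    action: str     # e.g. "read", "write"
    resource: str   # e.g. "bucket/customer1/*"

def matches(rule: Rule, principal: str, action: str, resource: str) -> bool:
    return (rule.principal == principal
            and rule.action == action
            and resource.startswith(rule.resource.rstrip("*")))

def is_allowed(rules: list[Rule], principal: str, accesses: list[tuple[str, str]]) -> bool:
    # Every (action, resource) pair needs at least one allowing rule.
    return all(
        any(matches(r, principal, action, resource) for r in rules)
        for action, resource in accesses
    )

rules = [Rule("app-backend", "read", "bucket/customer1/*")]
print(is_allowed(rules, "app-backend", [("read", "bucket/customer1/report.pdf")]))   # True
print(is_allowed(rules, "app-backend", [("write", "bucket/customer1/report.pdf")]))  # False
```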
Once I have that I can also ask it for the custom tweaks I need.
You're braver than me if you're willing to trust the LLM here - fine if you're ready to properly review all the relevant docs once you have code in hand, but there are some very expensive risks otherwise.
I take code from stack overflow all the time and there’s like a 90% chance it can work. What’s the difference here?
If you don't do that will you necessarily notice that you accidentally leaked customer data to the world?
The problem isn't the LLM it's assuming its output is correct just the same as assuming Stack Overflow answers are correct without verifying/understanding them.
If you are comparing with stackoverflow then I guess we are on the same page - most people are fine with taking stuff from stackoverflow and it doesn't count as "brave".
> I'm willing to accept the risk of occasionally making S3 public
This is definitely where we diverge. I'm generally working with stuff that legally cannot be exposed - with hefty compliance fines on the horizon if we fuck up.
I feel like this workflow is still less time, easier and less error prone than digging out the exact right syntax from the AWS docs.
But for low stakes, LLMs work just fine - not everything is going to blow up into a $30,000 bill.
In fact I'll take the complete opposite stance - verifying your design with an LLM will help you _save_ money more often than not. It knows things you don't and has awareness of concepts that you might have not even read about.
I now have the daunting challenge of deploying an Azure Kubernetes cluster with... shudder... Windows Server containers on top. There's a mile-long list of deprecations and missing features that were fixed just "last week" (or whatever). That is just too much work to keep up with for mere humans.
I'm thinking of doing the same kind of customised chatbot but with a scheduled daily script that pulls the latest doco commits, and the Azure blogs, and the open GitHub issue tickets in the relevant projects and dumps all of that directly into the chat context.
I'm going to roll up my sleeves next week and actually do that.
Then, then, I'm going to ask the wizard in the machine how to make this madness work.
Pray for me.
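A hypothetical sketch of that daily script, using the public GitHub REST API; the repo name, output file, and choice of sources are assumptions, and unauthenticated requests are rate-limited:

```python
# Pull recent doc commits and open issues, then dump them into one text file
# that a chatbot session can ingest as context.
import json
import urllib.request

def fetch(url: str):
    req = urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

repo = "MicrosoftDocs/azure-docs"  # placeholder docs repo
commits = fetch(f"https://api.github.com/repos/{repo}/commits?per_page=20")
issues = fetch(f"https://api.github.com/repos/{repo}/issues?state=open&per_page=20")

context_lines = [f"commit: {c['commit']['message'].splitlines()[0]}" for c in commits]
context_lines += [f"issue: {i['title']}" for i in issues]

with open("daily_context.txt", "w") as f:
    f.write("\n".join(context_lines))
```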
TGW is... twice as expensive as vpc peering?
But unlike peering TGW traffic flows through an additional compute layer so it has additional cost.
They actually used to have the upstream docs in GitHub, and that was super nice for giving permalinks but also building the docs locally in a non-pdf-single-file setup. Pour one out, I guess
Wouldn't this always depend on the length of the queue to access the robotic tape library? Once your tape is loaded it should move really quickly:
https://www.ibm.com/docs/en/ts4500-tape-library?topic=perfor...
Your assumption holds if they still use tape. But this paragraph hints at it not being tape anymore. The eternal battle between tape and drive backup takes another turn.
For storage especially, we now build enough redundancy into systems that we don't have to jump on every fault. That reduces the chance of human error when trying to address it, and of pushing the hardware harder during recovery (resilvering, catching up in a distributed consensus system, etc.).
When the entire box gets taken out of the rack due to hitting max faults, then you can piece out the machine and recycle parts that are still good.
You could in theory ship them all off to the back end of nowhere, but it seems that Glacier is in all the places where AWS data centers are, so it's not that. But Glacier being durable storage, with a low expectation of data out versus data in, they could be - and probably are - cutting the aggregate bandwidth to the bone.
How good do your power backups have to be to power a pure Glacier server room? Can you use much cheaper in-rack switches? Can you use old in-rack switches from the m5i era?
Also most of the use cases they mention involve linear reads, which has its own recipe book for optimization. Including caching just enough of each file on fast media to hide the slow lookup time for the rest of the stream.
Little's Law would absolutely kill you in any other context, but here we are linear-write with orders of magnitude fewer reads. You have hardware sitting around waiting for a request. "Orders of magnitude" is the space where interesting solutions can live.
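For a rough sense of what Little's Law (L = λW) implies here, a back-of-envelope calculation with made-up numbers:

```python
# At 10 restore requests/second and a 4-hour retrieval time, the system would hold
# ~144,000 requests in flight on average; with orders of magnitude fewer reads,
# that in-flight population shrinks proportionally.
arrival_rate = 10          # requests per second (assumed)
wait_time = 4 * 60 * 60    # seconds per retrieval (assumed)
in_flight = arrival_rate * wait_time
print(in_flight)           # 144000
```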
Everything you know is wrong.
Weird Al. https://www.youtube.com/watch?v=W8tRDv9fZ_c
Firesign Theatre. https://www.youtube.com/watch?v=dAcHfymgh4Y
119 more comments available on Hacker News