Leaving Serverless Led to Performance Improvement and a Simplified Architecture
Mood: heated
Sentiment: mixed
Category: other
Key topics: The author of the article discusses how abandoning serverless architecture improved their application's performance and simplified their architecture, sparking a debate among commenters about the merits and drawbacks of serverless computing.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 39m after posting
Peak period: 145 comments (Day 1)
Avg / period: 53.3 comments
Based on 160 loaded comments
Key moments
- Story posted: Oct 15, 2025 at 7:20 AM EDT (about 1 month ago)
- First comment: Oct 15, 2025 at 7:59 AM EDT (39m after posting)
- Peak activity: 145 comments in Day 1 (hottest window of the conversation)
- Latest activity: Oct 24, 2025 at 4:10 PM EDT (about 1 month ago)
All major cloud vendors have serverless solutions based on containers, with longer managed lifetimes between requests, and naturally the ability to use properly AOT-compiled languages in the containers.
It reminds me of the companies that start building their application using a NoSQL database and then start building their own implementation of SQL on top of it.
Isn't serverless basically the old model of shared VMs, except with a ton of people?
I'm old school I guess, baremetal for days...
Usually a decision factor between more serverless, or more DevOps salaries.
Why's that? Serverless is just the generic name for CGI-like technologies, and CGI is exactly how classical web applications were typically deployed, until Rails became such a large beast that it was too slow to keep using CGI. Running your application as a server to work around that problem in Rails pushed that approach to become the norm across the industry, at least until serverless became cool again.
Making your application the server is what's more complex, with more moving parts. CGI was so much simpler, albeit with the performance tradeoff.
Perhaps certain implementations make things needlessly complex, but it is not clear why you think serverless must fundamentally be that way.
But no, I'd not put any API services/entrypoints on a lambda, ever. Maybe you could manufacture a scenario where like the API gets hit by one huge spike at a random time once per year, and you need to handle the scale immediately, and so it's much cheaper to do lambda than make EC2 available year-round for the one random event. But even then, you'd have to ensure all the API's dependencies can also scale, in which case if one of those is a different API server, then you may as well just put this API onto that server, and if one of them is a database, then the EC2 instance probably isn't going to be a large percentage of the cost anyway.
There are probably exceptions, but I can't think of a single case where doing this kind of thing in a lambda didn't cause problems at some point, whereas I can't really think of an instance where putting this kind of logic directly into my main app has caused any regrets.
It can make sense if you have very uneven load with a few notable spikes, or if you're all in on managed services, where serverless things act as event collectors for other services ("new file in object store" triggers a function to update some index).
The nice thing about JS workers is that they can start really fast from cold. If you have low or irregular load, but latency is important, Cloudflare Workers or equivalent is a great solution (as the article says towards the end).
If you really need a full-featured container with AOT compiled code, won't that almost certainly have a longer cold startup time? In that scenario, surely you're better off with a dedicated server to minimise latency (assuming you care about latency). But then you lose the ability to scale down to zero, which is the key advantage of serverless.
Serverless with containers is basically managed Kubernetes, where someone else has the headache to keep the whole infrastructure running.
They get to the bottom of the post and drop:
> Fargate handles scaling for us without the serverless constraints
They dropped workers for containers.
Unlikely? They could've just as well deployed their single go binary to a vm from day 1 and it would've been smooth sailing for their use case, while they acquire customers.
The cloudflare workers they chose aren't really suited for latency critical, high throughput APIs they were designing.
Also, if it's just Go, point Ansible (or whatever handles deploys) at a new server and trigger a deploy.
That said, as an example, an m8g.8xlarge gives you 32 vCPU / 128 GiB RAM for about $1000/month in us-east-1 for current on-demand pricing, and that drops to just under $700 if you can do a 1-year RI. I’m guessing this application isn’t super memory-heavy, so you could save even more by switching to the c-family: same vCPU, half the RAM.
Stick two of those behind a load balancer, and you have more compute than a lot of places actually need.
Or, if you have anything resembling PMF, spend $10K or so on a few used servers and put them into some good colo providers. They’ll do hardware replacement for you (for a fee).
Really I think DHH just likes to tell others what he likes.
I’m assuming you’re an employee of the company based on your comments, so please don’t take this poorly - I applaud any and all public efforts to bring back sanity to modern architecture, especially with objective metrics.
And yeah you’re right in hindsight it was a terrible idea to begin with
I thought it could work but didn’t benchmark it enough and didn’t plan enough. It all looked great in early POCs and all of these issues cropped up as we built it
"Serverless was fighting us" vs "We didn't understand serverless tradeoffs" - one is a learning experience, the other is misdirected criticism.
It is your decision to make this a circlejerk of musings about how the company must be run by amateurs. Whatever crusade you're fighting in vividly criticising them is not valuable at all. People need to learn and share so we can all improve, stop distracting from that point.
But here I don't think they (or their defenders) are yet aware of the real lesson here.
There's literally zero information that's valuable here. It's like saying "we used an 18-wheeler as our family car and then we switched over to a regular Camry and solved all our problems." What is the lesson to be learned in that statement?
The real interesting post mortem would be if they go, "god in retrospect what a stupid decision we took; what were we thinking? Why did we not take a step back earlier and think, why are we doing it this way?" If they wrote a blog post that way, that would likely have amazing takeaways.
Not sure what the different takeaways would be though?
I'm genuinely curious, because this is not singling out your team or org; this is a very common occurrence among modern engineering teams, and I've often found myself on the losing end of such arguments. So I am all ears to hear at least one such team telling what goes on in their mind when they make terrible architecture decisions and whether they learned anything philosophical that would prevent a repeat.
I was working on it on and off moving one endpoint at a time but it was very slow until we hired someone who was able to focus on it.
It didn’t feel good at all. We knew the product had massive flaws due to the latency but couldn’t address it quickly. Especially cause we he to build more workarounds as time went on. Workarounds we knew would be made redundant by the reimplementation.
I think we had that "wtf are we doing here" discussion pretty early, but we didn't act on it in the beginning; instead we tried different approaches to make it work within the serverless constraints, because that's what we knew well.
Isn’t this the whole point of serverless edge?
It’s understood to be more complex, with more vendor lockin, and more expensive.
Trade off is that it’s better supported and faster by being on the edge.
Why would anyone bother to learn a proprietary platform for non critical, latency agnostic service?
The whole point of edge is NOT to make latency-critical APIs with heavy state requirements faster. It's to make stateless operations faster. Using it for the former is exactly the mismatch I'm describing.
Their 30ms+ cache reads vs sub-10ms target latency proves this. Edge proximity can't save you when your architecture adds 3x your latency budget per cache hit.
I wonder if there is anything other than good engineering getting in the way of this, and even sub-µs intra-process pull-through caches for busy lambda functions. After all, if my lambda is getting called 1000x per second from the same point of presence, why wouldn't they keep the process in memory?
That's hot start VS cold start.
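For what it's worth, module-scope state in a Lambda does survive warm ("hot start") invocations of the same execution environment, which is roughly the in-process pull-through cache the parent comment is asking about. A minimal sketch in TypeScript, with a hypothetical fetchFromBackingStore standing in for the real lookup and an arbitrary 30-second TTL:

```typescript
// Module-scope cache: lives as long as this execution environment stays warm,
// so repeated invocations can skip the backing store entirely.
const cache = new Map<string, { value: string; expiresAt: number }>();

// Hypothetical placeholder for a real DB/Redis/HTTP lookup.
async function fetchFromBackingStore(key: string): Promise<string> {
  return `value-for-${key}`;
}

export async function handler(event: { key: string }): Promise<string> {
  const hit = cache.get(event.key);
  if (hit && hit.expiresAt > Date.now()) {
    return hit.value; // warm invocation: no I/O at all
  }
  const value = await fetchFromBackingStore(event.key);
  cache.set(event.key, { value, expiresAt: Date.now() + 30_000 });
  return value;
}
```

A cold start, of course, begins with an empty map, which is why this only helps once the function is being called frequently enough to stay warm.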
This may or may not matter to you depending on your application’s needs, but there is a significant performance difference between, say, an m4 family (Haswell / Broadwell) and an m7i family (Sapphire Rapids) - literally a decade of hardware improvements. Memory performance in particular can be a huge hit for latency-sensitive applications.
Edit: found it. Cool! https://rove.dev/
Setting up the required roles and permissions was also a nightmare. The deployment round trip time was also awful.
The 2 good experiences I had with AWS were when we had a super smart devops guy who set up the whole Docker pipeline on top of actual instances, so we could deploy our docker compose straight to a server in under 1 minute (this wasn't a scaled app), and had everything working.
Lambda is also pretty cool, you can just zip everything up and do a deploy from aws cli without much scripting and pretty straightforward IaC.
But the easy solution is just to use AWS’s own Docker registry and copy the images to it. Fargate has allowed you to attach EFS volumes for years.
https://github.com/1Strategy/fargate-cloudformation-example/...
Most cloud pain people experience is from a misunderstanding / abuse of solutions architecture and could have been avoided with a more thoughtful design. It tends to be a people problem, not a tool problem.
However, in my experience cloud vendors sell the snot out of their offerings, and the documentation is closer to marketing than truthful technical documentation. Their products’ genuine performance is a closely guarded proprietary secret, and the only way to find out… e.g. whether Lambdas are fast enough for your use case, or whether AWS RDS cross-region replication is good enough for you… is to run your own performance testing.
I’ve been burned enough times by AWS making it difficult to figure out exactly how performant their services are, and I’ve learned to test everything myself for the workloads I’ll be running.
I know about Anycast but not how to make it operational for dynamic web products (not like CDN static assets). Any tips on this?
DIY Anycast is probably beyond most people’s reach, as you need to deal with BGP directly.
One cool trick is using GeoDNS to route the same domain to a different IP depending on the location of the user, but there are some caveats of course due to caching and TTL.
EDIT: Back to Anycast, there are also some providers who allow you BGP configuration, like those: https://www.virtua.cloud/features/your-ip-space - https://us.ovhcloud.com/network/byoip - https://docs.hetzner.com/robot/colocation/pricing/ ... However you still need to get the IPs by yourself, by dealing with your Regional Registry (RIPE in my case, in Europe)
Say you're in city A where you use transit provider 1 and city B where you use transit provider 2. If a user is in city B and their ISP is only connected to transit provider 1, BGP says deliver your traffic to city A, because then traffic doesn't leave transit provider 1 until it hits your network. So for every transit network you use, you really want to connect to it at all your PoPs, and you probably want to connect to as many transit networks as feasible. If you're already doing multihoming at many sites, it's something to consider; if not, it's probably a whole lot of headache.
GeoDNS as others suggested is a good option. Plenty of providers out there, it's not perfect, but it's alright.
Less so for web browsers, but you can also direct users to specific servers. Sample performance for each /24 and /48 and send users to the best server based on the statistics, use IP location as a fallback source of info. Etc. Not great for simple websites, more useful for things with interaction and to reduce the time it takes for tcp slow start (and similar) to reach the available bandwidth.
Azure/AWS/GCP all have solutions for this, and they do not require you to use their other services. There are probably other DNS providers that can do it as well.
Cloudflare can also do this, but it's probably more expensive than DNS.
I participated in AWS training and certification given by AWS for a company to obtain a government contract and I can 100% say that the PAID TRAINING itself is also 100% marketing and developer evangelism.
AWS will hopefully be reduced to natural language soon enough with AI, and their product team can move on (most likely they moved on a long time ago, and the revolving door at the company meant it was going remain a shittily thought out platform in long term maintenance).
They were a much nicer, if overpriced, load balancing alternative to the Cisco Content Switch we were using, though.
Just use Docker, there are plenty of services where deployment is simply - “hand your container to us and we run it”.
Even the most complicated popular ways to deploy Docker are simpler than deploying to a VM and a lot less error prone.
I think they are shooting themselves in the foot with this approach. If you have to run a Monte Carlo simulation on every one of their services, at your own time and expense, just to understand performance and costs, people will naturally shy away from such black boxes.
I don't think this is true. In fact, it seems that in the industry, many developers don't proceed with caution and go straight into usage, only to find the problems later down the road. This is a result of intense marketing on the part of cloud providers.
That's all it takes for a CTO to demand the next week that "everything should be done with AWS cloud-native stuff if possible".
Or maybe the original implementation team really didn't know what they were doing. But I'd rather give them the benefit of the doubt. Either way, I appreciate them sharing these observations because sharing these kinds of stories is how we collectively get better as a professional community.
This matches my experience. It's very difficult to argue against costly and/or inappropriate technical decisions in environments where the 'Senior Tech Leadership' team are just not that technical but believe they are, and so are influenced by every current industry trend masquerading as either 'scalable', 'modern' or (worst of all) 'best practice'.
I see this a lot in startups that grew big before they had a chance to grow up.
And to add, this rarely indicates anything about the depth and/or breadth of the 'used to' experience.
A lot of the strongest individual contributors I see want to stay in that track and use that experience to make positive and sensible change, while the ones that move into the management tracks don't always have such motivations. There's no gatekeeping intended here, just an observation that the ones that are intrinsically motivated by the detailed technical work naturally build that knowledge base through time spent hands-on in those areas and are best able to make more impactful systemic decisions.
People in senior tech leadership are also not often exposed to the direct results of their decisions (if they even stay in the company long enough to see the outcome of longer-term decisions, which itself is rare).
While it's not impossible to find the folk that do have breadth of experience and depth of knowledge but are comfortable and want to be in higher-level decision making places, it's frustratingly rare. And in a lot of cases, the really good ones that speak truth to power end up in situations where 'Their last day was yesterday, we wish them all the best in their future career endeavours.' It's hardly surprising that it's a game that the most capable technical folks just don't want to play, even if they're the ones that should be playing it.
This all could just be anecdata from a dysfunctional org, of course...
Personally, I appreciate the info and the admission.
I think it's because connections can be reused more often. Cloudflare Workers are really prone to doing a lot of TLS handshakes because they spin up new ones constantly.
Right now we're just using AWS Fargate for the Go servers, so there really isn't much maintenance at all. We'll be moving that into EKS soon though, because we are starting to add more stuff and need k8s anyways.
Unfortunately too many comments here are quick to come to the wrong conclusion, based only on the title. Not a reason to change it though!
It’s totally fair criticism that the title and wording is a bit clickbaity
But that’s ok
Just curious if this workload also saw some of the same improvements (on a quick read it seems like you could have been hitting the routing problem CF mentions)
- Eliminated complex caching workarounds and data pipeline overhead
- Simplified architecture from distributed system to straightforward application
We, as developers/engineers (put whatever title you want), tend to make things complex for no reason sometimes. Not all systems have to follow state-of-the-art best practices. Many times, secure, stable, durable systems outperform these fancy techs and inventions. Don't get me wrong, I love to use all of these technologies and fancy stuff, but sometimes that old, boring, monolithic API running on an EC2 solves 98% of your business problems, so no need to introduce ECS, K8S, Serverless, or whatever.
Anyway, I guess I'm getting old, or I understand the value of a resilient system, and I'm trying to find peace xD.
Adding that much compute to an edge POP is a big lift; even Firecracker gets heavy at scale. And there's security risk in executing arbitrary code, since these POPs don't have near the physical security of a datacenter, small scale makes them more vulnerable to timing attacks, etc.
While it "takes away" some work from you, it adds this work on other points to solve the "artificial induced problems".
Another example I hit was a hard upload limit. We ported an application to a serverless variant, which had an import API for huge customer exports. Shouldn't be a problem, right? Just set up an ingest endpoint and some background workers to process the data.
Though then I learned: I can't upload more than 100 MB at a time through the "API gateway" (basically their proxy to invoke your code), and when asking if I could change it somehow, I was just told to tell our customers to upload smaller file chunks.
While from a "technical" perspective this sounds logical, our customers not gonne start exchanging all their software so we get a "nicer upload strategy".
For me this is comparable to "it works in a vacuum" type of things. It's cool in theory, but as soon as it hits reality you realize quite fast that the time and money you saved by moving from permanently running machines to serverless, you will spend in other ways to solve the serverless specialities.
Have the users upload to s3 directly and then they can either POST you what they uploaded or you can find some other means of correlating the input (eg: files in s3 are prefixed with the request id or something)
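As a rough illustration of that approach, here's a minimal sketch of generating a presigned S3 PUT URL with the AWS SDK for JavaScript v3; the bucket name, key prefix, and expiry are made-up placeholders:

```typescript
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const s3 = new S3Client({ region: "us-east-1" });

// Called by your API when a client wants to upload a file; the client then PUTs
// the bytes straight to S3, sidestepping the API gateway's payload limit.
export async function createUploadUrl(requestId: string, filename: string): Promise<string> {
  const command = new PutObjectCommand({
    Bucket: "customer-imports",               // hypothetical bucket name
    Key: `imports/${requestId}/${filename}`,  // prefix with the request id for correlation
  });
  // URL is valid for 15 minutes; the client uploads with a plain HTTP PUT.
  return getSignedUrl(s3, command, { expiresIn: 900 });
}
```

The request-id prefix is one way to correlate the upload with the original API call, as the comment above suggests.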
I agree this is annoying and maybe I’ve been in AWS ecosystem for too long.
However, having an API that accepts an unbounded amount of data is a good recipe for DoS attacks. I suppose the 100 MB limit is outdated as the internet has gotten faster, but eventually we do need some limit.
In this specific case I'm getting old-school file upload requests from software that was partly written before the 2000s - no one's going to adjust anything any more.
And yes, just accepting giant uploads is far from good in terms of "security" like DoS - but we're talking about somewhere between 100 and 300 MB CSV files (I called them "huge" because in terms of product data, 200-300 MB of text covers quite a lot) - not great, but we try to satisfy our customers' needs.
But yes, like all the other points - everything is solvable somehow - it just needs us to spend more time solving something that technically wasn't a real problem in the first place.
Edit: Another funny example. In a similar process on another provider, I downloaded files in a similar size range from S3 to parse them - which died again and again. After contacting the hoster (because their logs literally just stopped, no error tracing, nothing), they told me that their setup basically only allows 10 MB of local storage - and the default (in this case the AWS S3 adapter for PHP) always downloads the file even if you tell it to "stream". So I built a solution that used HTTP range requests to "fake stream" the file into memory in smaller chunks, so I could process it afterwards without downloading it completely. Just another example of: yes, it's solvable, but annoying.
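For illustration, a minimal sketch of that "fake streaming" idea (the original was in PHP; this is TypeScript with the global fetch against any URL that honours Range requests, and the chunk size is an arbitrary assumption):

```typescript
const CHUNK_SIZE = 8 * 1024 * 1024; // 8 MiB per request; tune to the memory limit

// Yields the file as a sequence of chunks without ever holding the whole
// object in memory or on local disk.
export async function* rangedChunks(url: string): AsyncGenerator<Uint8Array> {
  // Find the total size first; most HTTP servers answer HEAD with Content-Length.
  const head = await fetch(url, { method: "HEAD" });
  const total = Number(head.headers.get("content-length"));

  for (let start = 0; start < total; start += CHUNK_SIZE) {
    const end = Math.min(start + CHUNK_SIZE, total) - 1;
    const res = await fetch(url, { headers: { Range: `bytes=${start}-${end}` } });
    yield new Uint8Array(await res.arrayBuffer()); // hand each chunk to the CSV parser
  }
}
```

The same pattern works with an S3 GetObject call and its Range parameter, if you'd rather stay inside the SDK.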
Then I either batch/schedule the processing or give them an endpoint just to trigger it (/data/import?filename=demo.csv).
It’s actually so common that I just have the “data exchange” conversation and let them decide which fits their needs best. Most of it is available for self service configuration.
Uploader on the client uses presigned url. S3 triggers lambda. Lambda function takes file path and tells background workers about it either via queue, mq, rest, gRPC, or doing the lift in workflow etl functions.
Easy peasy. /s
I read this and was getting ready to angrily start beating my keyboard. The best satire is hard to detect.
So I still don't see how it's notably worse than the idea of using serverless at all.
The sarcasm of correctness yet playing down its complexity is entirely my own. We used to be able to do things easily.
How will your client know if your backend lambda crashed or whatever? All it knows is that the upload to S3 succeeded.
Basically you're turning a synchronous process into an asynchronous one.
It actually is though. I don't need to build a custom upload client, I don't need to manage restart behavior, I get automatic restarts if any of the background workers fail, I have a dead letter queue built in to catch unusual failures, I can tie it all together with a common API that's a first class component of the system.
Working in the cloud forces you to address the hard problems first. If you actually take the time to do this everything else becomes _absurdly_ easy.
I want to write programs. I don't want to manage failures and fix bad data in the DB directly. I personally love the cloud and this separation of concerns.
It also forces you to address all the non-existent problems first, the ones you just wish you had, like all the larger companies that genuinely have to deal with thousands of file uploads per second.
And don't forget all the new infrastructure you added to do the job of just receiving the file in your app server and putting it into the place it was going to go anyway but via separate components that all always seem to end up with individual repositories, separate deployment pipelines, and that can't be effectively tested in isolation without going into their target environment.
And all the additional monitoring you need on each of the individual components that were added, particularly on those helpful background workers to make sure they're actually getting triggered (you won't know they're failing if they never got called in the first place due to misconfiguration).
And you're now likely locked into your upload system being directly coupled to your cloud vendor. Oh wait, you used Minio to provide a backend-agnostic intermediate layer? Great, that's another layer that needs managing.
Is a content delivery network better suited to handling concurrent file uploads from millions of concurrent users than your app server? I'd honestly hope so, that's what it's designed for. Was it necessary? I'd like to see the numbers first.
At the end of the day, every system design decision is a trade off and almost always involves some kind of additional complexity for some benefit. It might be worth the cost, but a lot of these system designs don't need this many moving parts to achieve the same results and this only serves to add complexity without solving a direct problem.
If you're actually that company, good for you and genuinely congratulations on the business success. The problem is that companies that don't currently and may never need that are being sold system designs that, while technically more than capable, are over-designed for the problem they're solving.
You will have these problems. Not as often as the larger companies but to imagine that they simply don't exist is the opposite of sound engineering.
> if they never got called in the first place due to misconfiguration
Centralized logging is built into all these platforms. Debugging these issues is one of the things that becomes absurdly easy.
> likely locked into your upload system
The protocol provided by S3 is available through dozens of vendors.
> Was it necessary?
It only matters if it is of equivalent or lesser cost.
> every system design decision is a trade off
Yet you explicitly ignore these.
> are being sold system designs
No, I just read the documentation, and then built it. That's one of those "trade offs" you're willingly ignoring.
A lot of those failure mode examples seem well suited to client-side retries and appropriate rate limiting. If we're talking file uploads then sure, there absolutely are going to be cases where the benefits of having clients go to the third-party is more beneficial than costly (high variance in allowed upload size would be one to consider), but for simple upload cases I'm not so convinced that high-level client retries aren't something that would work.
> if they never got called in the first place due to misconfiguration
I find it hard to believe that having more components to monitor will ever be simpler than fewer. If we're being specific about vendors, the AWS console is IMHO the absolute worst place to go for a good centralized logging experience, so you almost certainly end up shipping your logs into a better centralized logging system that has more useful monitoring and visualisation features than CloudWatch and has the added benefit of not being the AWS console. The cost here? Financial, time, and complexity/moving parts for moving data from one to the other. Oh and don't forget to keep monitoring on the log shipping component too, that can also fail (and needs updates).
> The protocol provided by S3 is available through dozens of vendors.
It's become a de facto standard for sure, and is helpful for other vendors to re-implement it but at varying levels of compatibility.
> It only matters if it is of equivalent or lessor cost.
This is precisely the point, I'm saying that adding boxes in the system diagram is a guaranteed cost as much as a potential benefit.
> Yet you explicitly ignore these
I repeatedly mentioned things that to me count as complexity that should be considered. Additional moving parts/independent components, the associated monitoring required, repository sprawl, etc.
> No, I just read the documentation, and then built it.
I also just 'read the documentation and built it', but other comments in the thread allude to vendor-specific training pushing for not only vendor-specific solutions (no surprise) but also the use of vendor-specific technology that maybe wasn't necessary for a reliable system. Why use a simple pull-based API with open standards when you can tie everything up in the world of proprietary vendor solutions that have their own common API?
But not all of the S3 API is supported by other vendors - the asynchronous triggers for lambdas and the CloudTrail logs that you write code to parse.
People often don't know how something different might be easier for their case.
Following others, or the best practices, when they might not apply in their case can lead to social-proof architecture a little too often.
GP said this is an app from the 2000s.
For S3 you do need to generate a presigned URL, so you would have to add this logic there somewhere instead of "just having a generic HTTP upload endpoint".
Unless the solution is "don't have the problem in the first place" the cloud limitations are just getting in the way here.
If we want to treat the architectural peculiarities of GP's stack as an indictment of serverless in general, then we could just as well point to the limitations of running LAMP on a single machine as an indictment of servers in general (which obviously would be silly, since LAMP is still useful for some applications, as are bare metal servers).
Upload file to S3 -> trigger an SNS message for fanout if you need it -> SNS -> SQS trigger -> SQS to ETL jobs.
The ETL job can then be hosted using Lambda (easiest) or ECS/Docker/Fargate (still easy and scales on demand) or even a set of EC2 instances that scale based on the items in a queue (don’t do this unless you have a legacy app that can’t be containerized).
If your client only supports SFTP, there is the SFTP Transfer Service on AWS that will allow them to send the file via SFTP and it is automatically copied to an S3 bucket.
Alternatively, there are products that treat S3 as a mountable directory and they can just use whatever copy commands on their end to copy the file to a “folder”
For uploads under 50 MB you could also skip the multipart upload and take a naive approach without taking a significant hit.
https://fullstackdojo.medium.com/s3-upload-with-presigned-ur...
And before you cry “lock in”, S3 API compatible services are a dime a dozen outside of AWS including GCP and even Backblaze B2.
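For the S3 -> SNS -> SQS fanout described above, a minimal AWS CDK (v2, TypeScript) sketch might look like the following; construct names are illustrative and the ETL consumer (Lambda, Fargate, etc.) is left out:

```typescript
import { Stack, StackProps } from "aws-cdk-lib";
import { Construct } from "constructs";
import * as s3 from "aws-cdk-lib/aws-s3";
import * as s3n from "aws-cdk-lib/aws-s3-notifications";
import * as sns from "aws-cdk-lib/aws-sns";
import * as subs from "aws-cdk-lib/aws-sns-subscriptions";
import * as sqs from "aws-cdk-lib/aws-sqs";

export class ImportPipelineStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Bucket the clients upload into (e.g. via presigned URLs or SFTP Transfer).
    const bucket = new s3.Bucket(this, "ImportBucket");

    // SNS topic for fanout, so more than one consumer can react to new files.
    const topic = new sns.Topic(this, "ImportTopic");
    bucket.addEventNotification(s3.EventType.OBJECT_CREATED, new s3n.SnsDestination(topic));

    // Queue that feeds the ETL workers; wire its consumer up separately.
    const etlQueue = new sqs.Queue(this, "EtlQueue");
    topic.addSubscription(new subs.SqsSubscription(etlQueue));
  }
}
```

If you only ever have one consumer, you can drop the SNS topic and notify the queue directly, which removes one moving part.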
The biggest one I regret is "communicating through the file system is 10x dumber than you think it is, even if you think you know how dumb it is." I should have a three page bibliography on that. Mostly people don't challenge you on this, but I had one brilliant moron at my last job who did, and all I could do was stare at him like he had three heads.
Lambda still requires that you need to update the Node runtime every year or two, while with your own containers, you can decide on your own upgrade schedule.
Being in the cloud doesn't mean you need to accept timeouts/limitations. CDK+fargate can easily run an ephemeral container to perform some offline processing.
I guess they never came out of MVP, which could warrant using serverless, but in the end it makes 0 sense to use some slow solution like this for the service they are offering.
Why didn't they go with a self-hosted backend right away?
It's funny how nowadays most devs are too scared to roll their own and just go with the cloud offerings that cost them tech debt and actual money down the road.
We believed their docs/marketing without doing extensive benchmarks, which is on us.
The appeal was also to use the same typescript stack across everything, which was nice to work with
Source: I work somewhere where you easily get 1ms cached relational DB reads from outside the service.
30ms makes me suspect it went cross region.
"Down with serverless! Long live serverless!"