We cut our MongoDB costs by 90% by moving to Hetzner
Mood: supportive
Sentiment: positive
Category: tech
Key topics: MongoDB, Hetzner, Cost Optimization, Cloud Infrastructure
A company reduced their MongoDB costs by 90% by migrating to Hetzner, a more cost-effective infrastructure provider.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 21m after posting
Peak period: 156 comments (Day 1)
Average per period: 40 comments
Based on 160 loaded comments
Key moments
- 01 Story posted: 11/13/2025, 3:17:57 PM (5d ago)
- 02 First comment: 11/13/2025, 3:39:11 PM (21m after posting)
- 03 Peak activity: 156 comments in Day 1 (the hottest window of the conversation)
- 04 Latest activity: 11/18/2025, 7:11:21 AM (1d ago)
Just look at the recent multi-hour Azure outage, during which Microsoft could not even get microsoft.com back up. With that much downtime you could physically move drives between servers multiple times a year and still have less downtime. Servers are very reliable; cloud software is not.
I'm not saying people should use a single server if they can avoid it, but using a single cloud provider is just as bad. "We moved to the cloud, with managed services and redundancy, nothing has gone wrong...today"
Business people are weird about numbers. You should have claimed 70% even if the replicas do nothing and made them work later on. This is highly likely to bite you on the ass.
In technical terms, you need to plan ahead. Legacy mistakes are caused by decisions made in the past, and they will likely be made again if you can't change your strategy or approach to problems. You won't get budget for this AFTER you've successfully made a change. "It's all solved now, we're good." No.
I mean, you're connecting to your primary database potentially on another continent? I imagine your costs will be high, but even worse, your performance will be abysmal.
> When you migrate to a self-hosted solution, you're taking on more responsibility for managing your database. You need to make sure it is secure, backed up, monitored, and can be recreated in case of failure or the need for extra servers arises.
> ...for a small amount of pain you can save a lot of money!
I wouldn't call any of that "a small amount of pain." To save $3,000/month you've now required yourselves to become experts in a domain that may be out of your depth. So whatever cost you saved is now tech debt, plus potentially the cost of hiring someone else to manage your homemade solution for you.
That said, I self-host and applaud other self-hosters. But sometimes it really has to make business sense for your team.
Atlas AWS was actually set up in Ireland. The data transfer costs were coming from extracting data for ML modelling. We don't get charged for extracting data under the new contract.
> experts in a domain that maybe is out of your depth
We're in the bot detection space so we need to be able to run our own infra in order to inspect connections for patterns of abuse. We've built up a fair amount of knowledge because of this and we're lucky enough to have a guy in our team who just understands everything related to computers. He's also pretty good at disseminating information.
Thanks for reading!
Having said that, MongoDB's pricing page promises 99.995% uptime (roughly 26 minutes of downtime per year), which is outstanding and would probably be hard to beat doing it yourself, even after adding redundancy. But maybe you don't need that much uptime for your particular use case.
> maybe you don't need that much uptime for your particular use case.
Correct. Thanks for reading!
Also, we noticed that after migration, the databases that occupied ~600GB of disk in our (very old) on-premise deployment were around 1TB on Atlas. After talking with support for a while we found that they were using Snappy compression at a relatively low compression level and we couldn't change that ourselves. After requesting it through support, we switched to zstd compression, rebuilt all the storage, and a day or two later our storage was under 500GB.
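For anyone self-hosting who wants the same win, here is a minimal sketch of the per-collection override via PyMongo. The connection string and names are hypothetical, and existing data only shrinks once it is rewritten (e.g. via an initial sync or a dump/restore), which matches the "rebuilt all the storage" step above.

```python
from pymongo import MongoClient

# Hypothetical connection string and names, for illustration only.
client = MongoClient("mongodb://localhost:27017")
db = client["analytics"]

# Per-collection override: have WiredTiger use zstd instead of the default snappy.
# (The server-wide default can instead be set in mongod.conf under
#  storage.wiredTiger.collectionConfig.blockCompressor: zstd)
db.create_collection(
    "events",
    storageEngine={"wiredTiger": {"configString": "block_compressor=zstd"}},
)
```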
And backup pricing is super opaque. The docs don't show concrete prices, just ranges. And depending on which cloud you deployed to, snapshots are priced differently, so you can't just multiply your storage by the number of snapshots, and they aren't transparent about the real size of the snapshots.
All the storage stuff is messy and expensive...
Or.. what? That's the important part
In fact, after looking at https://www.mongodb.com/legal/sla/atlas/data-federation#:~:t... it makes me wonder how much the SLA is worth. A 10% service credit, after all the limitations?
Atlas can keep their 10% service credit, I wouldn't care. Save the money and choose a stable provider.
That was an interesting surprise.
This makes me want to use the company's service less because now I know they can't survive an outage in a consistent and resilient way.
We have extra provisions for enterprise clients to provide rock solid SLAs for every use case.
The DB in question is our data store for events, which we use for aggregated features such as traffic analysis and ML. This service lags behind our realtime services, so we can deal with some downtime if necessary.
In addition to direct costs, Atlas also had expensive limitations. For example, we often spin up clone databases from a snapshot that have lower performance and no durability requirements, so a smaller non-replicated server suffices, but Atlas required those to be sized like the replicated, high-performance production cluster.
You could cut your MongoDB costs by 100% by not using it ;)
> without sacrificing performance or reliability.
You're using a single server in a single datacenter. MongoDB Atlas is deployed to VMs on 2-3 AZs. You don't have close to the same reliability. (I'm also curious why their M40 instance costs $1000, when the Pricing Calculator (https://www.mongodb.com/pricing) says M40 is $760/month? Was it the extra storage?)
> We're building Prosopo to be resilient to outages, such as the recent massive AWS outage, so we use many different cloud providers
This means you're going to have multiple outages, AND incur more cross-internet costs. How does going to Hetzner make you more resilient to outages? You have one server in one datacenter. Intelligent, robust design at one provider (like AWS) is way more resilient, and intra-zone transfer is cheaper than going out of the cloud ($0.02/GB vs $0.08/GB). You do not have a centralized, single-point-of-failure design with AWS. They're not dummies; plenty of their services are operated independently per region. But they do expect you to use their infrastructure intelligently to avoid creating a single point of failure. (For example, during the AWS outage my company was in us-east-1 and we never had any issues, because we didn't depend on calling AWS APIs to continue operating. Things already running continue to run.)
I get it; these "we cut bare costs by moving away from the cloud" posts are catnip for HN. But they usually don't make sense. There are only a few circumstances, where you really have to transfer out a lot of traffic or need very large storage, in which cloud pricing is just too much of a premium. The whole point of using the cloud is to use it as a competitive advantage. Giving yourself an extra role (sysadmin) in addition to your day job (developer, data scientist, etc.) and more maintenance tasks (installing, upgrading, patching, troubleshooting, being on-call, etc.), with lower reliability and fewer services, isn't an advantage.
> AND incur more cross-internet costs

Hetzner has no bandwidth/traffic limit on the machine (only a speed limit), so we can go nuts.
I understand your point wrt the cloud, but I spend as much time debugging/building a cloud deployment (Atlas :eyes:) as I do a self-hosted solution. AWS gives you all the tools to build a super reliable data store, but many people just chuck something on us-east-1 and go. There's your single point of failure.
Given we're constructing a many-node decentralised system, self-hosted actually makes more sense for us because we've already had to become familiar enough to create a many-node system for our primary product.
When/if we have a situation where we need high data availability I would strongly consider the cloud, but in the situations where you can deal with a bit of downtime you're massively saving over cloud offerings.
We'll post a 6-month and 1-year follow-up to update the scoreboard above
Even dropping something on a single EC2 node in us-east-1 (or at Google Cloud) is going to be more reliable over time than a single dedicated machine elsewhere. This is because they run with a layer that will e.g. live migrate your running apps in case of hardware failures.
The failure modes of dedicated are quite different than those of the modern hyperscaler clouds.
On the other hand, a Hetzner machine I just rented came with Linux software RAID enabled (md devices in the kernel)
---
I'm not aware of any comparisons, but I'd like to see some
It's not straightforward, and it's not obvious the cloud is more reliable
The cloud introduces many other single points of failure, by virtue of being more complex
e.g. human administration failure, with the Unisuper incident
https://news.ycombinator.com/item?id=40366867
https://arstechnica.com/gadgets/2024/05/google-cloud-acciden... - “Unprecedented” Google Cloud event wipes out customer account and its backups
Of course, dedicated hardware could have a similar type of failure, but I think the simplicity means there is less variety in the errors.
e.g. A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable - Leslie Lamport
I just wish there was a way to underscore this more and more. Complex systems fail in complex ways. Sadly, for many programmers, the thrill or ego boost that comes with solving/managing complex problems lets us believe complex is better than simple.
It's kept me employed though...
In a way, I think it doesn't matter what you use as long as you diversify enough (and have lots of backups). Everything can fail, and often the probability of failure doesn't even matter that much, since any single failure can be one too many.
The typical failure mode of AWS is much better. Half the internet is down, so you just point at that, wait for everything to come back, and your instances keep running. If you have one server, you have to do the troubleshooting and recovery work yourself. And if you self-host, you need to run more than one machine and will still end up with fewer nines of reliability.
A couple pieces of gentle pushback here:
- if you choose a hyperscaler, you should use their (often one-click) geographic redundancy & failover.
- All of the hyperscalers have more than one AZ. Specifically, there's no reason for any AWS customer to locate all/any* of their resources in us-east-1. (I actively recommend against this.)
* - Except for the small number of services only available in us-east-1, obviously.
Let's host it all with 2 companies instead and see how it goes.
Anyway, random things you will encounter: Azure doesn't work because Front Door has issues (again, and again). A web app in Azure just randomly stops working; it's not live-migrated by any means, and restarts don't work. Okay, let's change the SKU, change it back, oop, it's on a different bare-metal cluster and now it works again. Sure, there'll be some setup (read: upsell) that'll prevent such failures from reaching customers, but there is simply no magic to any of this.
I really wish people would stop dreaming up reasons that hyperscalers are somehow magical places where issues don't happen and everything is perfect if you just increase the complexity a little bit more the next time around.
Going into any depth with mongo mostly taught me to just stick with postgres.
In the meantime, I am curious where the time was spent debugging and building Atlas deployments? It certainly isn't the cheapest option, but it has been quite a "1-click" solution for us.
I think it was just luck of the draw that the failure happened in this way and not some other way. Even if APIs falling over while EC2 instances stay up is a slightly more likely failure mode, it means you can't run autoscaling and can't depend on spot instances, which you can lose in an outage and then can't replace.
Yes, this is part of designing for reliability. If you use spot or autoscaling, you can't assume you will have high availability in those components. They're optimizations, like a cache. A cache can disappear, and this can have a destabilizing effect on your architecture if you don't plan for it.
This lack of planning is pretty common, unfortunately. Whether it's in a software component or system architecture, people often use a thing without understanding the implications of it. Then when AWS API calls become unavailable, half the internet falls over... because nobody planned for "what happens when the control plane disappears". (This is actually a critical safety consideration in other systems)
We still take some steps to mitigate control plane issues in what I consider a reasonable AWS setup (attempt to lock ASGs to prevent scale-down) but I place the control plane disappearing on the same level as the entire region going dark, and just run multi-region.
If traffic cost is relevant (which it is for a lot of use cases), Hetzner's price of $1.20/TB ($0.0012 / GB) for internet traffic [1] is an order of magnitude less than what AWS charges between AWS locations in the same metro. If you host only at providers with reasonable bandwidth charges, most likely all of your bandwidth will be billed at less than what AWS charges for inter-zone traffic. That's obscene. As far as I can tell, clouds are balancing their budgets on the back of traffic charges, but nothing else feels under cost either.
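For a sense of scale, here is a tiny back-of-the-envelope sketch in Python using the per-GB rates quoted in this thread; the 50 TB/month workload is a made-up example, not anyone's real bill.

```python
# Back-of-the-envelope only; per-GB rates are the ones quoted in this thread.
RATES_PER_GB = {
    "Hetzner internet egress": 0.0012,  # $1.20/TB
    "AWS inter-AZ transfer": 0.02,
    "AWS internet egress": 0.08,
}

monthly_gb = 50 * 1000  # hypothetical 50 TB/month workload

for name, rate in RATES_PER_GB.items():
    print(f"{name}: ${monthly_gb * rate:,.2f}/month")
# -> roughly $60 at Hetzner vs ~$1,000 inter-AZ or ~$4,000 egress at AWS for the same 50 TB
```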
> For example, during the AWS outage, my company was in us-east-1, and we never had any issues, because we didn't depend on calling AWS APIs to continue operating. Things already running continue to run.
This doesn't always work out. During the GCP outage, my service was running fine, but other similar services were having trouble, so we attracted more usage, which we would have scaled up for, except that the GCP outage prevented that. The cloud makes it very expensive to run scaled beyond current needs and promises that scale-out will be available just in time...
This is a common problem with "bare metal saved us $000/mo" articles. Bare metal is cheaper than cloud by any measure, but the comparisons given tend to be misleadingly exaggerated, as they don't compare like-for-like in terms of redundancy and support; after accounting for those factors the result can be much closer (sometimes close enough that familiarity and personal preference become the more significant factors).
Of course unless you are paying extra for multi-region redundancy things like the recent us-east-1 outage will kill you, and that single point of failure might not really matter if there are several others throughout your systems anyway, as is sometimes the case.
If I'm storing data on a NAS, and I keep backups on a tape, a simple hardware failure that causes zero downtime on S3 might take what, hours to recover? Days?
If my database server dies and I need to boot a new one, how long will that take? If I'm on RDS, maybe five minutes. If it's bare metal and I need to install software and load my data into it, perhaps an hour or more.
Being able to recover from failure isn't a premature optimization. "The site is down and customers are angry" is an inevitability. If you can't handle failure modes in a timely manner, you aren't handling failure modes. That's not an optimization, that's table stakes.
It's not about five nines, it's about four nines or even three nines.
Backups are point in time snapshots of data, often created daily and sometimes stored on tape.
Its primary use case is giving admins the ability to e.g. restore partial data via export and similar. It can theoretically also be used to restore after a full data loss, but that's beyond rare. Almost no company has had that issue.
This is generally not what's used in high availability contexts. Usually, companies have at least one replica DB which is in read only and only needs to be "activated" in case of crashes or other disasters.
With that setup you're already able to hit 5 nines, especially in the context of B2E companies, which usually deduct scheduled downtime via the SLA.
This is "five nines every year except that one year we had two freak hardware failures at the same time and the site was hard down for eighteen hours".
"Almost no company has this problem" well I must be one incredibly unlucky guy, because I've seen incidents of this shape at almost every company I've worked at.
And it isn't just about 9s of uptime; it's all the admin that goes with DR if something more terrible than a network outage does happen, and other infrastructure conveniences. For instance: I sometimes balk at the performance we get out of AzureSQL given what we pay for it, and in my own time you can safely bet I'll use something else on bare metal. But while DayJob is paying the hosting costs, I love the platform handling the backup regime, that I can do copies or point-in-time restores for issue reproduction and such at the click of a button (plus a bit of a wait), that I can spin up a fresh DB and populate it without worrying overly about space issues, etc.
I'm a big fan of managing your own bare metal. I just find a lot of other fans of bare metal to be more than a bit disingenuous when extolling its virtues, including cost-effectiveness.
Unfortunately it's not guaranteed that paying for multi-region replication will save you.
I cut my Mongo DB costs by 100% by piping my data to /dev/null.
Could we have done better with more sensible configs? Was it silly to cluster ES cross-AZ? Maybe. Point is that if you don't police every single detail of your platform at AWS/GCP and the like, their made-up charges will bleed your startup and grease their stock price.
As others have said, unless the scale of the data is the issue, if you're switching because of cost, perhaps you should be going back to your business model instead.
Set up MongoDB (or any database) so that you have geographically distributed nodes with replication plus whatever else, and maintain the same SLA as one of the big hyperscalers. Blog about how long it took to set up, how hard it is to maintain, and how much the ongoing costs are.
My hunch is that a setup on the scale of the median SaaS company is far simpler and more cost-effective than you'd think.
There's no way I couldn't spin up my infra within a full day, even if the current datacenter burned to the ground.
So we have the same reliability.
I've not actually seen an AZ go down in isolation, so while I agree it's technically a less "robust" deployment, in practice it's not that much of a difference.
> these "we cut bare costs by moving away from the cloud" posts are catnip for HN. But they usually don't make sense.
We moved away from Atlas because they couldn't cope with the data growth that we had (4TB is the max per DB). Turns out it's a fuckload cheaper even hosting on Amazon (as in 50%). We haven't moved to Hetzner because that would be more effort than we really want to expend, but it's totally doable, with not that much extra work.
> more maintenance tasks (installing, upgrading, patching, troubleshooting, getting on-call, etc) with lower reliability and fewer services, isn't an advantage.
Depends, right? Firstly, it's not that much of an overhead, and if it saves you significant cash, then it extends your runway.
Counterpoint: I have. Maybe not completely down, but degraded, or out of capacity on an instance type, or some other silly issue that caused an AZ drain. It happens.
Naïve. If the network infrastructure is down, your computer is effectively down too; it just happens that the functionality that went down wasn't something you relied on. By that logic, you could avoid relying on any functions at all by turning the server off.
Came here to say exactly this
It doesn't have to be only one server in one datacenter though.
It's more work, but you can have replicas ready to go at other Hetzner DCs (they offer bare metal at 3 locations in 2 different countries) or at other cheaper providers like OVH. Two or three $160 servers is still cheaper than what they're paying right now.
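A rough sketch of what that could look like as a MongoDB replica set spread across locations, via PyMongo. The hostnames are hypothetical, and each mongod is assumed to already be running with --replSet rs0.

```python
from pymongo import MongoClient

# Hypothetical hostnames for three Hetzner locations (Falkenstein, Nuremberg, Helsinki).
seed = MongoClient("mongodb://db-fsn1.example.com:27017", directConnection=True)

seed.admin.command("replSetInitiate", {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "db-fsn1.example.com:27017", "priority": 2},  # preferred primary
        {"_id": 1, "host": "db-nbg1.example.com:27017", "priority": 1},
        {"_id": 2, "host": "db-hel1.example.com:27017", "priority": 1},
    ],
})

# Applications connect with a multi-host URI and get automatic failover:
app = MongoClient(
    "mongodb://db-fsn1.example.com,db-nbg1.example.com,db-hel1.example.com/?replicaSet=rs0"
)
```

If the primary's datacenter goes dark, the remaining members elect a new primary and the driver follows it automatically.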
Depends on the service and its complexity. More complexity means more outages. In most instances a focus on easy recoverability is more productive than preemptive "reliability". As I have said, depends on your service.
And prices turn premium very fast if you have either a lot of traffic, or low traffic but larger file interchange. And you have more work to do if you use the cloud, because it uses non-standard interfaces. Today a well-maintained server is a few clicks away. Even with managed servers you still have maintenance and configuration. Plus, your provider probably changes the service quite often. I had to keep accommodating Beanstalk while my application was just running on its own, free of maintenance needs.
What often makes fighting said outages harder is that the providers themselves just don't admit to anything being wrong: everything's green on the dashboard, yet 4 out of 5 requests are timing out.
They are known to just cancel accounts and cut access.
I had paid for advertising on a few game curation sites plus YouTubers and streamers. A lovely failure, all thanks to Hetzner. It took 3 days and numerous emails with the most arrogant Germans you've ever met before my account was unlocked.
I switched to OVH and while they’re not without their own faults (reliability is a big one), it’s been a far better experience.
It seems like you have to go to one of the big boys like Hurricane Electric, where you are allowed to use the bandwidth you paid for without someone sticking their fingers in it.
The old one for dedicated servers (Robot) is horribly outdated though.
We need an American “get off American big tech” movement.
Differentiate, people! Reading "we moved from X to Y" does not mean everyone should move from X to Y; it means start considering what Y offers and research the other Y's around you.
Yes. Hetzner is a German company from Gunzenhausen.
As a non-American, I use Hetzner precisely to have my projects not hosted anywhere near the US.
You set up your server. Harden it. Follow all the best practices for your firewall with ufw. Then you run a Docker container. Accidentally, or simply because you don’t know any better, you bind it to 0.0.0.0 by doing 5432:5432. Oops. Docker just walked right past your firewall rules, ignored ufw, and now port 5432 is exposed with default Postgres credentials. Congratulations. Say hello to Kinsing.
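A minimal sketch of the safer binding, here via the Python Docker SDK (image, container name, and credentials are placeholders); the same idea applies to a plain `docker run -p 127.0.0.1:5432:5432 ...`.

```python
import docker

client = docker.from_env()

# Publishing as ("127.0.0.1", 5432) keeps Postgres on the loopback interface.
# A bare "5432:5432"-style mapping binds 0.0.0.0, and Docker's own iptables
# rules sidestep ufw, exposing the port to the internet.
client.containers.run(
    "postgres:16",
    name="pg",
    detach=True,
    ports={"5432/tcp": ("127.0.0.1", 5432)},
    environment={"POSTGRES_PASSWORD": "change-me"},  # placeholder credential
)
```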
And this is just one of many possible scenarios like that. I’m not trying to spread FUD, but this really needs to be stressed much more clearly.
EDIT: as always, thank you HN for downvoting instead of actually addressing the argument.
Clearly defining your boundaries is important for both internal and external vectors of attack.
> when there's even mongo-api compatible Postgres solutions
With their own drawbacks.
Like a file system?
- schema-less: we don't have to think about DDL statements at any point.
- oplog and change streams as built-in change data capture.
- it's dead simple to set up a whole new cluster (replica set).
- IMO you don't need a designated DBA to manage tens of replica sets.
- Query language is rather low-level and that makes performance choices explicit.
But I have to admit that our requirements and architecture play to the strength of mongodb. Our domain model is neatly described in a strongly typed language. And we use a sort of event sourcing.
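As an illustration of the built-in change data capture mentioned in the list above, here is a minimal change-stream sketch in PyMongo. Database and collection names are hypothetical, and change streams require a replica set.

```python
from pymongo import MongoClient

# Change streams require a replica set; names here are hypothetical.
client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
events = client["analytics"]["events"]

# Each change document describes one insert/update/delete, read off the oplog.
with events.watch(full_document="updateLookup") as stream:
    for change in stream:
        print(change["operationType"], change.get("fullDocument"))
```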
The only thing I really don't care for is managing Mongo... as a developer, using it is pretty joyous assuming you can get into the query mindset of how to use it.
Also, if you're considering Mongo, you might also want to consider looking at Cassandra/ScyllaDB or CockroachDB as alternatives that might be a better fit for your needs that are IMO, easier to administer.
Postgres's distributed story is more complicated.
I'm using it on a project, not by choice. It was already chosen when I joined the project, and the more we develop it, the more I feel Postgres would be a better fit, but I don't think we can change it now.
To be quite honest, today's software engineering is sadly mostly about addressing "how complex can we go" rather than "what problem are we trying to solve".
I think it's just more complicated than that. No hostage situation, just good old incentives.
Thing is, Linode was great 10-15 years ago, then enshittification ensued (starting with Akamai buying them).
So what does enshittification for Hetzner look like? I've already got migration scripts pointed at their servers but can't wait for the eventual letdown.
The pain points come when you're also entwined with specific implementations of services from a given provider... Sure, you can shift from PostgreSQL on one hosted provider to another without much pain... but moving from, say, SQS to Azure Queue Storage or Service Bus is a lot more involved. And that is just one example.
That is a large reason to keep your services to those with self-hosted options and/or to self-host from the start... that said, I'm happy to outsource things that are easier to (re)integrate or replace.
I love Hetzner for what they offer but you will run into huge outages pretty soon. At least you need two different network zones on Hetzner and three servers.
It's not hard to set up, but you need to do it.
Risk management is a normal part of business - every business does it. Typically the risk is not brought down all the way to zero, but to an acceptable level. The milk truck may crash and the grocery store will be out of milk that day - they don't send three trucks and use a quorum.
If you want to guarantee above-normal uptime, feel free, but it costs you. Google has servers failing every day just because they have so many, but you are not Google and you most likely won't experience a hardware failure for years. You should have a backup because data loss is permanent, but you might not need redundancy for your online systems. Depending on what your business does.
No need for SREs. Just add 2 more Hetzner servers.
from the "Serverborse": i7-7700 with 64GB ram and 500G disk.
37.5 euros/month
This is ~8 vcpus + 64GB ram + 512G disk.
585 USD/month
It gets a lot worse if you include any non-negligible internet traffic. How many machines does a company need before a team of SREs is worth it? I think that number has actually dropped to 100.
"Run a script to deploy new node and load last backup" can be enough, but then you have to plan on what to tell customers when last few hours of their data is gone
41 more comments available on Hacker News