Go Ahead, Self-Host Postgres
Key topics
The debate around self-hosting Postgres versus relying on managed services is heating up, with veterans sharing their decades-long experiences of rock-solid self-hosting and others chiming in on the reliability and support trade-offs. While some swear by the control and reliability of self-hosting, others point out that the need for 24/7 support is often driven by global customer scope, and that managed services can be a safer bet for critical applications. The discussion reveals a divide between those who've never had issues with self-hosting and those who've been burned by outages, with some noting that even big companies like AWS aren't immune to downtime. As one commenter quipped, when AWS goes down, "you share the link to AWS being down and go back to sleep," highlighting the stark contrast between the stress of dealing with self-hosted outages and the relative ease of relying on a major provider.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 8m after posting
Peak period: 120 comments in 0-6h
Avg / period: 17.8 comments
Based on 160 loaded comments
Key moments
- Story posted: Dec 20, 2025 at 10:43 AM EST (17 days ago)
- First comment: Dec 20, 2025 at 10:51 AM EST (8m after posting)
- Peak activity: 120 comments in 0-6h (hottest window of the conversation)
- Latest activity: Dec 22, 2025 at 2:33 PM EST (15 days ago)
Of all the places I've worked that had the attitude "If this goes down at 3AM, we need to fix it immediately", there was only one where that was actually justifiable from a business perspective. I've worked at plenty of places that had this attitude despite the fact that overnight traffic was minimal and nothing bad actually happened if a few clients had to wait until business hours for a fix.
I wonder if some of the preference for big-name cloud infrastructure comes from the fact that during an outage, employees can just say "AWS (or whatever) is having an outage, there's nothing we can do" vs. being expected to actually fix it
From this angle, the ability to fix problems more quickly when self-hosting could be considered an antifeature by the employee getting woken up at 3am
That includes big and small businesses, SaaS and non-SaaS, high scale (5M+ rps) to tiny scale (100s-10k rps), and all sorts of different markets and user bases. Even at the companies that were not staffed or providing a user service overnight, overnight outages were immediately noticed because on average, more than one external integration/backfill/migration job was running at any time. Sure, “overnight on call” at small places like that was more “reports are hardcoded to email Bob if they hit an exception, and integration customers either know Bob’s phone number or how to ask their operations contact to call Bob”, but those are still environments where off-hours uptime and fast resolution of incidents was expected.
Between me, my colleagues, and friends/peers whose stories I know, that’s an N of high dozens to low hundreds.
What am I missing?
IME the need for 24x7 for B2B apps is largely driven by global customer scope. If you have customers in North America and Asia, now you need 24x7 (and x365, because of little holiday overlap).
That being said, there are a number of B2B apps/industries where global scope is not a thing. For example, many providers who operate in the $4.9 trillion US healthcare market do not have any international users. Similarly the $1.5 trillion (revenue) US real estate market. There are states where one could operate where healthcare spending is over $100B annually. Banks. Securities markets. Lots of things do not have 24x7 business requirements.
All of those places needed their backend systems to be up 24/7. The banks ran reports and cleared funds with nightly batches—hundreds of jobs a night for even small banking networks. The healthcare companies needed to receive claims and process patient updates (e.g. your provider’s EMR is updated if you die or have an emergency visit with another provider you authorized for records sharing—and no, this is not handled by SaaS EMRs in many cases) overnight so that their systems were up to date when they next opened for business. The “regular” businesses that closed for the night still generated reports and frequently had IT staff doing migrations, or senior staff working on something at midnight due the next day (when the head of marketing is burning the midnight oil on that presentation, you don’t want to be the person explaining that she can’t do it because the file server hosting the assets is down after hours).
And again, that’s the norm I’ve heard described from nearly everyone in software/IT that I know: most businesses expect (and are willing to pay for or at least insist on) 24/7 uptime for their computer systems. That seems true across the board: for big/small/open/closed-off-hours/international/single-timezone businesses alike.
But there are also a not-insignificant number of important systems where nobody is on a pager, where there is no call rotation[1]. Computers are much more reliable than they were even 20 years ago. It is an Acceptable Business Choice to not have 24x7 monitoring for some subset of systems.
Until very recently[2], Citibank took their public website/user portal offline for hours a week.
[1] If a system does not have a fully staffed call rotation with escalations, it's not prepared for a real off-hours uptime challenge.
[2] They may still do this, but I don't have a way to verify right now.
Also, in addition to perception/reputation issues, B2B contracts typically include an SLA, and nobody wants to be in breach of contract.
I think the parent you're replying to is wrong, because I've worked at small companies selling into large enterprise, and the expectation is basically 24/7 service availability, regardless of industry.
You wake up. It's not your fault. You're helpless to solve it.
Eventually, AWS has a VP of something dial in to your call to apologize. They’re unprepared and offer no new information. They get handed off to a side call for executive bullshit.
AWS comes back. Your support rep only vaguely knows what’s going on. Your system serves some errors but digs out.
Then you go to sleep.
And Postgres upgrades are not transparent. So you'll have a one-to-two-hour task every 6 to 18 months, with only a small amount of control over when it happens. This is OK for a lot of people, and completely unthinkable for some other people.
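For context, the in-place route most people mean here is pg_upgrade. A minimal sketch of wrapping it, assuming hypothetical Debian-style paths for a 16-to-17 upgrade; both clusters must be stopped, and it should run as the postgres OS user:

```python
import subprocess

# Hypothetical paths; adjust to your install.
OLD_BIN = "/usr/lib/postgresql/16/bin"
NEW_BIN = "/usr/lib/postgresql/17/bin"
OLD_DATA = "/var/lib/postgresql/16/main"
NEW_DATA = "/var/lib/postgresql/17/main"


def run_pg_upgrade(check_only: bool) -> None:
    """Invoke pg_upgrade; --link hard-links data files instead of copying them."""
    cmd = [
        f"{NEW_BIN}/pg_upgrade",
        "-b", OLD_BIN, "-B", NEW_BIN,
        "-d", OLD_DATA, "-D", NEW_DATA,
        "--link",
    ]
    if check_only:
        cmd.append("--check")  # dry run: validates compatibility, changes nothing
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pg_upgrade(check_only=True)   # always dry-run first
    run_pg_upgrade(check_only=False)  # then the real, one-way upgrade
```

The --check dry run is what makes the window predictable; once the real --link run starts, the old cluster should not be restarted, so you still have to schedule a window.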
I was disappointed alloy doesn't support timescaledb as a metrics endpoint. Considering switching to telegraf just because I can store the metrics on Postgres.
SQLite when prototyping, Postgres for production.
If you need to power a lawnmower and all you have is a 500bhp Scania V8, you may as well just do it.
Even better, stage to a production-like environment early, and then deploy day can be as simple as a DNS record change.
Because it's "low effort" to just fire it into sqlite and if I have to do ridiculous things to the schema as I footer around working out exactly what I want the database to do.
I don't want to use nodejs if I can possibly avoid it and you literally could not pay me to even look at Java, there isn't enough money in the world.
For most purposes, it works perfectly fine, but with two main caveats:
1. It is single user, single connection (i.e. no MVCC)
2. It doesn't support all postgres extensions (particularly postGIS), though it does support pgvector
https://github.com/supabase-community/pg-gateway is something that could be used to front pglite for prototyping, I guess, but I haven't used it.
- Backups: the provider will push a full generic disaster-recovery backup of my database to an off-provider location at least daily, without the need for a maintenance window
- Optimization: index maintenance and storage optimization are performed automatically and transparently
- Multi-datacenter failover: my database will remain available even if part(s) of my provider are down, with a minimal data loss window (like, 30 seconds, 5 minutes, 15 minutes, depending on SLA and thus plan expenditure)
- Point-in-time backups are performed at an SLA-defined granularity and with a similar retention window, allowing me to access snapshots via a custom DSN, not affecting production access or performance in any way
- Slow-query analysis: notifying me of relevant performance bottlenecks before they bring down production (see the sketch after this list)
- Storage analysis: my plan allows for #GB of fast storage, #TB of slow storage: let me know when I'm forecast to run out of either in the next 3 billing cycles or so
Because, well, if anyone provides all of that for a monthly fee, the whole "self-hosting" argument goes out of the window quickly, right? And I say that as someone who absolutely adores self-hosting...
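For what it's worth, the slow-query item on that wishlist is one of the easier ones to approximate yourself. A minimal sketch using pg_stat_statements, assuming the extension is installed and a hypothetical read-only DSN:

```python
import psycopg2  # pip install psycopg2-binary

# Hypothetical DSN; requires pg_stat_statements (add it to
# shared_preload_libraries, then CREATE EXTENSION pg_stat_statements).
DSN = "postgresql://monitor@db.internal/postgres"

# Column names are for PG 13+; older versions use mean_time/total_time.
QUERY = """
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
"""

with psycopg2.connect(DSN) as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for query, calls, mean_ms, total_ms in cur.fetchall():
            print(f"{mean_ms:9.1f} ms avg | {calls:8d} calls | {query[:70]}")
```

Run it from cron and alert on the output, and you have a crude version of the "notify me before it brings down production" feature.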
Corollary: rental/SaaS models provide that property in large part because their providers have lots of slack.
And especially having worked in startups, I was expected to do many different things, from fixing infrastructure code one day to writing frontend code the next. If you're in a bigger company, maybe it's understandable to be specialized, but especially if you're at a company with only a few people, you must be willing to do the job, whatever it is.
Yes, I'd say backups and analysis are table stakes for hiring it out, and multi-datacenter failover is a relevant nice-to-have. But the reason to do it yourself is that you literally can't get anything as good as what you can build yourself on somebody else's computer.
The defaults I've used on Amazon and GCP (RDS, Cloud SQL) both do.
In case you want to self host but also have something that takes care of all that extra work for you
(I do use Maria at home for legacy reasons, and have used MySQL and Pg professionally for years.)
Can you give any details on that?
I switched to MariaDB back in the day for my personal projects because (so far as I could tell) it was being updated more regularly, and it was more fully open source. (I don't recall offhand at this point whether MySQL switched to a fully paid model, or just less-open.)
Until then it is nice to have options, even if they do require extra steps.
There’s also pg_auto_failover which is a Postgres extension and a bit less complex than Patroni, but it has its drawbacks.
FYI - it's already supported by cloudnativepg [1]
I was playing with this operator recently and I'm truly impressed - it's a piece of art when it comes to postgres automation; alongside with barman [2] it does everything I need and more
[1] https://cloudnative-pg.io/docs/1.28/connection_pooling [2] https://cloudnative-pg.io/plugin-barman-cloud/
Patroni has been around for a while. The database-as-a-service team where I work uses it under the hood. I used it to build database-as-a-service functionality on the infra platform team I was at prior to that.
It's basically push-button production PG.
Currently scratching my head on what the appropriate upgrade procedure is for a non-k8s/operator spilo/patroni cluster for minimal downtime and risk.
I read these and then wrote my own scripts that were tailored to my environment.
https://pganalyze.com/blog/5mins-postgres-zero-downtime-upgr...
https://www.pgedge.com/blog/always-online-or-bust-zero-downt...
https://knock.app/blog/zero-downtime-postgres-upgrades
Basically:
- Created a new cluster on new machines
- Started logically replicating
- Waited for that to complete, then left it there replicating for a while until I was comfortable with the setup
- We were already using haproxy and pgbouncer
- Then I did a cutover to the new setup
- This was for a database 600gb-1tb in size
- The client application was not doing anything overly fancy, which meant there was very little to change going from 12 to 17
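For readers following along, the "started logically replicating" step boils down to a publication on the old primary and a subscription on the new cluster. A minimal sketch with hypothetical DSNs, assuming wal_level = logical on the old side and the schema already restored on the new one (e.g. via pg_dump --schema-only):

```python
import psycopg2

OLD = "postgresql://postgres@old-primary/app"  # hypothetical DSNs
NEW = "postgresql://postgres@new-cluster/app"

# On the old primary: publish every table.
with psycopg2.connect(OLD) as conn:
    conn.autocommit = True
    conn.cursor().execute("CREATE PUBLICATION upgrade_pub FOR ALL TABLES;")

# On the new cluster: subscribe. CREATE SUBSCRIPTION cannot run inside a
# transaction block, hence autocommit. It does an initial table copy, then
# streams changes until you cut over.
with psycopg2.connect(NEW) as conn:
    conn.autocommit = True
    conn.cursor().execute(
        "CREATE SUBSCRIPTION upgrade_sub "
        "CONNECTION 'host=old-primary dbname=app user=postgres' "
        "PUBLICATION upgrade_pub;"
    )

# Note: sequences are not replicated; sync them manually right before cutover.
```

The cutover itself (pausing writes, syncing sequences, repointing pgbouncer) is the part that needs the real care.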
>> "God Send". Everything just worked. Replication was as reliable as one could imagine. It outlives several hardware incidents without manual intervention. It allowed cluster maintenance (software and hardware upgrades) without application downtime. I really dream PostgreSQL will be as reliable as MongoDB without need of external services.
https://www.postgresql.org/message-id/0e01fb4d-f8ea-4ca9-8c9...
Sure, the PostgreSQL HA story isn't what we all want it to be, but the reliability is exceptional.
Database engineering is very hard. MongoDB has had both poor defaults as well as bugs in the past. It will certainly have durability bugs in the future, just like Postgres and all other serious databases. I'm not sure that Postgres' durability stacks up especially well with modern MongoDB.
[1] https://jepsen.io/analyses/postgresql-12.3
[2] https://archive.fosdem.org/2019/schedule/event/postgresql_fs...
Yet they still call it HA because there's nothing else. Even a planned shutdown of the primary to patch the OS results in downtime, as all connections are terminated. The situation is even worse for major database upgrades: stop the application, upgrade the database, deploy a new release of the app because some features are not compatible between versions, test, re-analyze the tables, reopen the database, and only then can users resume work.
Everything in SQL/RDBMS was designed for a single-node instance, not counting replicas. It's not HA because there can be only one read-write instance at a time. They even claim to be more ACID than MongoDB, but the ACID properties are guaranteed only on a single node.
One exception is Oracle RAC, but PostgreSQL has nothing like that. Some forks, like YugabyteDB, provide real HA with most PostgreSQL features.
About the hype: many applications that run on PostgreSQL accept hours of downtime, planned or unplanned. Those who run larger, more critical applications on PostgreSQL are big companies with many expert DBAs who can handle the complexity of database automation, and who use logical replication for upgrades. But no solution offers both low operational complexity and high availability comparable to MongoDB's.
OTOH, Oracle takes most of my time with endless issues, bugs, unexpected feature modifications, even on OCI!
My theory of why Postgres still gets the hype is that either people don't know about the problem, or it's acceptable on some level. I've worked on a team that maintained an in-house database cluster (though we were using MySQL instead of PostgreSQL), and the HA story was pretty bad. But there were engineers manually recovering lost data and resolving data conflicts, whether from incident recovery or from customer tickets. So I guess that's one way of doing business.
I would expect a little bit more as a cost of the convenience, but in my experience it's generally multiple times the expense. It's wild.
This has kept me away from managed databases in all but my largest projects.
If anything that’s a feature for ease of use and compatibility.
I know there are other issues with Kubernetes but at least its transferable knowledge.
An expert will give you thousands of theoretical reasons why self-hosting the DB is a bad idea.
An "expert" will host it, enjoy the cost savings and deal with the once-a-year occurrence of the theoretical risk (if it ever occurs).
> If you're just starting out in software & want to get something working quickly with vibe coding, it's easier to treat Postgres as just another remote API that you can call from your single deployed app
> If you're a really big company and are reaching the scale where you need trained database engineers to just work on your stack, you might get economies of scale by just outsourcing that work to a cloud company that has guaranteed talent in that area. The second full freight salaries come into play, outsourcing looks a bit cheaper.
This is funny. I'd argue the exact opposite. I would self host only:
* if I were on a tight budget and trading an hour or two of my time for a cost saving of a hundred dollars or so is a good deal; or
* at a company that has reached the scale where employing engineers to manage self-hosted databases is more cost effective than outsourcing.
I have nothing against self-hosting PostgreSQL. Do whatever you prefer. But to me outsourcing this to cloud providers seems entirely reasonable for small and medium-sized businesses. According to the author's article, self hosting costs you between 30 and 120 minutes per month (after setup). It's easy to do the math...
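To make that math concrete under an assumed rate (my assumption, not a figure from the article): at a fully loaded engineering cost of $100/hour, 30-120 minutes per month is roughly $50-$200/month of time, which is the number to weigh against the managed-service premium for your instance size.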
Except now they are stuck trying to maintain and debug Postgres without the same visibility and agency they would have if they hosted it themselves. The situation isn't at all clear.
I use Google Cloud SQL for PostgreSQL and it's been rock solid. No issues; troubleshooting works fine; all extensions we need already installed; can adjust settings where needed.
In the limit I don't think we should need DBAs, but as long as we need to manage indices by hand, think more than 10 seconds about the hot queries, manage replication, tune the vacuumer, track updates, and all the other rot - then actually installing PG on a node of your choice is really the smallest of the problems you face.
This leads the developers to do all kinds of workarounds and reach for more cloud services (and then integrating them and - often poorly - ensuring consistency across them) because the cloud hosted DB is not able to handle the load.
On bare-metal, you can go a very long way with just throwing everything at Postgres and calling it a day.
Running on IaaS also gives you more scalability knobs to tweak: SSD IOPS and bandwidth, multiple drives for logs/partitions, memory-optimized VMs, and there are a lot of low-level settings that aren't accessible in managed SQL. Licensing costs are also horrible with managed SQL Server, where it seems like you pay the Enterprise level, but running it yourself offers lower-cost editions like Standard or Web.
You can take it even further in some contexts if you use sqlite.
I think one of the craziest ideas of the cloud decade was to move storage away from compute. It's even worse with things like AWS lambda or vercel.
Now vercel et al are charging you extra to have your data next to your compute. We're basically back to VMs at 100-1000x the cost.
Every company out there is using the cloud and yet still employs infrastructure engineers to deal with its complexity. The "cloud" reducing staff costs is and was always a lie.
Whether or not you need that equivalence is an orthogonal question.
There's probably a sweet spot where that is true, but because cloud providers offer more complexity (self-inflicted problems) and use PR to encourage you to use them ("best practices" and so on), in all the cloud-hosted shops I've seen in a decade of experience there have always been multiple full-time infra people busy with... something?
There was always something to do, whether to keep up with cloud provider changes/deprecations, implementing the latest "best practice", debugging distributed systems failures or self-inflicted problems and so on. I'm sure career/resume polishing incentives are at play here too - the employee wants the system to require their input otherwise their job is no longer needed.
Maybe in a perfect world you can indeed use cloud-hosted services to reduce/eliminate dedicated staff, but in practice I've never seen anything but solo founders actually achieve that.
It's complexity but it's also providing features. If you didn't use those cloud features, you'd be writing or gluing together and maintaining your own software to accomplish the same tasks, which takes even more staff
> Maybe in a perfect world you can indeed use cloud-hosted services to reduce/eliminate dedicated staff
So let's put it another way: either you're massively reducing/eliminating staff to achieve the same level of functionality, or you're keeping the equivalent staff but massively increasing functionality.
The point is, clouds let you deliver a lot more with a lot less people, no matter which way you cut it. The people spending money on them aren't mostly dumb.
I love self-hosting stuff and even have a bias towards it, but the cost/time tradeoff is more complex than most people think.
Every company beyond a particular size surely? For many small and medium sized companies hiring an infrastructure team makes just as little sense as hiring kitchen staff to make lunch.
Local reproducibility is easier, and performance is often much better
As I pointed out above, you may be better served mixing and matching so you spend your time on the critical aspects but offload those other tasks to someone else.
Of course, I’m not sitting at your computer so I can’t tell you what’s right for you.
Task runner/queue: at least for us, Postgres works for both cases.
We also self-host S3-compatible storage and allow user-uploaded content within strict limits.
Fact is a lot of these companies are on the cloud because their internal IT was a total fail.
Of course, my comment wasn't aimed at those who successfully keep their cloud bill in the low 3-figures, but the majority of companies with a 5-figure bill and multiple "infrastructure" people on payroll futzing around with YAML files. Even half the achieved savings should be enough incentive for those guys to learn something new.
But initial setup is maybe 10% of the story. The day 2 operations of monitoring, backups, scaling, and failover still needs to happen, and it still requires expertise.
If you bring that expertise in house, it costs much more than 10x ($3/day -> $30/day = $10,950/year).
If you get the expertise from experts who are juggling you along with a lot of other clients, you get something like PlanetScale or CrunchyData, which are also significantly more expensive.
> monitoring

Most monitoring solutions support Postgres and don't actually care where your DB is hosted. Of course this only applies if someone was actually looking at the metrics to begin with.
> backups
Plenty of options to choose from depending on your recovery time objective. From scheduled pg_dumps to WAL shipping to disk snapshots and a combination of them at any schedule you desire. Just ship them to your favorite blob storage provider and call it a day.
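As a concrete example of the cheapest tier (scheduled dumps), here is a minimal sketch assuming a hypothetical bucket and DSN and any S3-compatible store; pair it with WAL archiving if your recovery point objective is tighter than the dump interval:

```python
import datetime
import subprocess

import boto3  # works with any S3-compatible blob store

BUCKET = "db-backups"                           # hypothetical bucket
DB_URL = "postgresql://backup@db.internal/app"  # hypothetical DSN


def nightly_dump() -> None:
    """pg_dump in custom format (compressed, pg_restore-able), then ship it off-box."""
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = f"/tmp/app-{stamp}.dump"
    subprocess.run(
        ["pg_dump", "--format=custom", f"--file={path}", DB_URL],
        check=True,
    )
    boto3.client("s3").upload_file(path, BUCKET, f"postgres/{stamp}.dump")


if __name__ == "__main__":
    nightly_dump()  # run from cron or a systemd timer
```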
> scaling
That's the main reason I favor bare-metal infrastructure. Nothing on the cloud (at a price you can afford) can rival the performance of even a mid-range server, so scaling is effectively never an issue; if you're outgrowing that, the conversation we're having is not about getting a bigger DB but about using multiple DBs and sharding at the application layer.
> failover still needs to happen
Yes, get another server and use Patroni/etc. Or just accept the occasional downtime and up to 15 mins of data loss if the machine never comes back up. You'd be surprised how many businesses are perfectly fine with this. Case in point: two major clouds had hour-long downtimes recently and everyone basically forgot about it a week later.
An LLM could set this up for you, it's dead simple.
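To make "get another server and use Patroni/etc." concrete: the promotion step Patroni automates is a single call on the standby. A minimal sketch of the manual version with a hypothetical standby DSN (what Patroni adds on top is fencing the old primary so it can't take writes again):

```python
import psycopg2

# Hypothetical DSN for the streaming replica you want to promote.
STANDBY = "postgresql://postgres@standby.internal/postgres"

with psycopg2.connect(STANDBY) as conn:
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("SELECT pg_is_in_recovery();")
        if cur.fetchone()[0]:
            # pg_promote() (PG 12+) returns true once the standby
            # has finished promoting to read-write primary.
            cur.execute("SELECT pg_promote();")
            print("promoted:", cur.fetchone()[0])
```

The risk in the manual version is split-brain if the old primary comes back up still accepting writes, which is exactly the scenario the Patroni/etcd machinery exists to prevent.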
I also disagree that the ongoing maintenance, observability, and testing of a replicated database would take a few hours to set up and then require zero maintenance and never ping me with alerts.
Looking at all the recent AWS, Azure and Cloudflare outages, I posit that it doesn't.
Anyway, for companies not heavily into tech, lots of this stuff is not that expensive.
For medium sized companies you need "devops engineers". And in all honesty, more than you'd need sysadmins for the same deployment.
For large companies, they split up AWS responsibilities into entire departments of teams (for example, all clouds have made auth so damn difficult that most large companies have not one but multiple departments just dealing with authorization, before you so much as start your first app).
At my last two places it very quickly got to the point where the technical complexity of deployments, managing environments, dealing with large piles of data, etc. meant that we needed to hire someone to deal with it all.
They actually preferred managing VMs and self hosting in many cases (we kept the cloud web hosting for features like deploy previews, but that’s about it) to dealing with proprietary cloud tooling and APIs. Saved a ton of money, too.
On the other hand, the place before that was simple enough to build and deploy using cloud solutions without hiring someone dedicated (up to at least some pretty substantial scale that we didn’t hit).
> The "cloud" reducing staff costs
Both can be true at the same time.
Also:
> Otherwise you're waking up at 3am no matter what.
Do you account for frequency and variety of wakeups here?
Yes. In my career I've dealt with way more failures due to unnecessary distributed systems (that could have been one big bare-metal box) rather than hardware failures.
You can never eliminate wake-ups, but bare-metal systems have far fewer moving parts, which means you eliminate a whole bunch of failure scenarios, so you're only left with actual hardware failure (and hardware is pretty reliable nowadays).
There was, I have to admit, a log message that explained the problem... once I could find the specific log message and understand the 45 steps in the chain that got to that spot.
This doesn’t make sense as an argument. The reason the cloud is more complex is because that complexity is available. Under a certain size, a large number of cloud products simply can’t be managed in-house (and certainly not altogether).
Also your argument is incorrect in my experience.
At a smaller business I worked at, I was able to use these services to achieve uptime and performance that I couldn’t achieve self-hosted, because I had to spend time on the product itself. So yeah, we’d saved on infrastructure engineers.
At larger scales, what your false dichotomy suggests also doesn’t actually happen. Where I work now, our data stores are all self-managed on top of EC2/Azure, where performance and reliability are critical. But we don’t self-host everything. For example, we use SES to send our emails and we use RDS for our app DB, because their performance profiles and uptime guarantees are more than acceptable for the price we pay. That frees up our platform engineers to spend their energy on keeping our uptime on our critical services.
https://blog.notmyhostna.me/posts/what-i-wish-existed-for-se...
234 more comments available on Hacker News