I Cannot SSH Into My Server Anymore (and That's Fine)
Key topics
The article "I Cannot SSH into My Server Anymore (and That's Fine)" prompted a discussion of declarative hosting, container management, and Podman. Several commenters shared their own setups, and some called Quadlets a "game changer" for small-to-medium deployments. A side thread covered Podman pods: some commenters noted that restarting a container within a pod doesn't necessarily restart the whole pod, while others pointed out that, architecturally, a Podman pod is essentially a single container in which each member has a separate rootfs. Other threads debated whether giving up shell access to production machines is a reasonable trade-off.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- Story posted: Jan 7, 2026 at 4:37 AM EST
- First comment: Jan 11, 2026 at 5:59 PM EST
- Peak activity: 62 comments in the 108-120h window
- Latest activity: Jan 12, 2026 at 12:15 AM EST
Based on 62 loaded comments
The predictability and drop in toil is so nice.
https://blog.gripdev.xyz/2024/03/16/in-search-of-a-zero-toil...
I found working with normal `dnf` and normal config files much easier than dealing with Ignition and Butane. Plus, working with your image in CI/CD instead of locally fixed my ZFS instability. Now, when the Fedora kernel updates but ZFS doesn't support that version yet, the build fails in GitHub Actions and the container is never built, so there's no botched update for my NAS to mistakenly pick up.
A few things in the article I think might help the author:
1. Podman 4 and newer (which FCOS should definitely have) uses netavark for networking. A lot of older tutorials and articles were written back when Podman used CNI for its networking and didn't have DNS enabled unless you specifically installed it. I think the default `podman` network is still set up with DNS disabled. Either way, you don't have to use a pod anymore if you don't want to; you can just attach both containers to the same network and it should Just Work.
2. You can run the generator manually with "/usr/lib/systemd/system-generators/podman-system-generator --dry-run" to check Quadlet validity and output. Should be faster than daemon-reload'ing all the time or scanning the logs.
And as a bit of self-promotion: for anyone who wants to use Quadlets like this but doesn't want to rebuild their server whenever they make a change, I'm working on a tool called Materia[0] that can install, remove, template, and update Quadlets and other files from a Git repository.
[0] https://github.com/stryan/materia
Anyone know why this is? Or, for that matter, why Kubernetes seems to work like this too?
I have an application for which the natural solution would be to create a pod and then, as needed, create and destroy containers within the pod. (Why? Because I have some network resources that don’t really virtualize, so they can live in one network namespace. No bridges.)
But despite containerd and Podman and Kubernetes kind-of-sort-of supporting this, they don’t seem to actually want to work this way. Why not?
In Podman, a pod is essentially just a single container; each "container" within a pod is just a separate rootfs. So from that perspective, it makes sense, since you can't really restart half of a container. (But I think that it might be possible to restart individual containers within a pod; but if any container within a pod fails, then I think that the whole pod will automatically restart)
> Why? Because I have some network resources that don’t really virtualize, so they can live in one network namespace.
You can run separate containers in the same network namespace with the "--network" option [0]. You can either start one container with its own automatic netns and then join the other containers to it with "--network=container:<name>", or you can manually create a new netns with "podman network create <name>" and then join all the containers to it with "--network=<name>".
[0]: https://docs.podman.io/en/latest/markdown/podman-run.1.html#...
Oh, right, thanks. I think I did notice that last time I dug into this. But:
> or you can manually create a new netns with "podman network create <name>" and then join all the containers to it with "--network=<name>".
I don’t think this has the desired effect at all. And the docs for podman network connect don’t mention pods at all, which is odd. In general, I have not been very impressed by podman.
Incidentally, apptainer seems to have a more or less first class ability to join an existing netns, and it supports CNI. Maybe I should give it a try.
Pods are specifically not meant to be treated as VMs, but as single application/deployment units.
Among other things, if a container goes down you don’t know if it corrupted shared state (leaving sockets open or whatever). So you don’t know if the pod is healthy after restart. Also reviving it might not necessarily work, if the original startup process relied on some boot order. So to guarantee a return to healthy you need to restart the whole thing.
This is not a thing. A program that opens a socket and crashes does not leak that socket for the lifetime of the network namespace. (Keep in mind that ordinary non-containerized servers usually have exactly one network namespace. If a program crashes, you restart it. Sure, CLOSE_WAIT is a thing, but it’s neither permanent nor usually a big deal.)
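The claim above can be checked with a small experiment. This is a minimal sketch, assuming a Linux-like host with Python 3 available; the names `CHILD` and `port_reusable_after_crash` are made up for illustration. A child process opens a listening socket and is then killed without any chance to clean up, after which the parent binds the very same port.

```python
import socket
import subprocess
import sys

# Source of a throwaway "server" that listens on an OS-chosen port and then hangs.
CHILD = r"""
import socket, time
s = socket.socket()
s.bind(("127.0.0.1", 0))
s.listen(1)
print(s.getsockname()[1], flush=True)
time.sleep(60)  # hold the socket until the parent kills us
"""

def port_reusable_after_crash():
    """Kill a listening process abruptly, then bind its port again."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD], stdout=subprocess.PIPE, text=True
    )
    try:
        port = int(child.stdout.readline())
        child.kill()   # abrupt termination: the child runs no cleanup code
        child.wait()   # once reaped, the kernel has closed its descriptors
    finally:
        if child.poll() is None:
            child.kill()
            child.wait()
        child.stdout.close()
    # No SO_REUSEADDR here: a plain bind succeeds only if the port is free.
    with socket.socket() as probe:
        probe.bind(("127.0.0.1", port))
        probe.listen(1)
    return True

if __name__ == "__main__":
    print("port reusable after crash:", port_reusable_after_crash())
```

The sketch only covers a listening socket with no established connections; connections that were active at crash time can linger briefly in states such as CLOSE_WAIT or TIME_WAIT, as the comment notes.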
> You are normally running several instances of your frontend so that it can crash without impacting the user experience, or so it can get deployed to in a rolling manner, etc.
Err, the classic way to do this is to hand off the listening socket from one server instance to the next. You can’t do this if your orchestration tools insist on tearing down the entire network namespace to update the server. Sure, you can use fancy load balancers or software defined networking or firewall kludges to hand off something that functions like a listening socket, but it kind of feels like we lost the plot somehow. The old techniques work, and they often worked at the appropriate scale for the application. Why are we building new systems that can’t be made to work well without extra layers?
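For readers who haven't seen the handoff technique, here is a minimal sketch in Python. It assumes both "instances" live in one process so the example stays self-contained; in a real deployment the duplicated descriptor would be inherited by, or passed to, a new server process. The function names `open_listener`, `hand_off`, and `demo` are hypothetical.

```python
import socket

def open_listener(host="127.0.0.1", port=0):
    """Create a listening TCP socket; port 0 lets the OS pick a free port."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(16)
    return srv

def hand_off(old_listener):
    """Duplicate the listening descriptor for a successor, then retire the old one.

    Both descriptors refer to the same kernel socket, so connections already
    queued in the backlog survive the old instance closing its copy.
    """
    successor = old_listener.dup()
    old_listener.close()
    return successor

def demo():
    old = open_listener()
    addr = old.getsockname()
    # A client connects while the old instance still owns the socket.
    client = socket.create_connection(addr, timeout=5)
    new = hand_off(old)
    new.settimeout(5)
    conn, _ = new.accept()  # the queued connection is served by the successor
    conn.settimeout(5)
    client.sendall(b"ping")
    data = conn.recv(4)
    for s in (conn, client, new):
        s.close()
    return data

if __name__ == "__main__":
    print(demo())
```

The point of the sketch is that the listening address never goes away during the swap, which is exactly what breaks when the orchestrator tears down the whole network namespace.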
In any event, the feature I want isn’t rocket science. I think Kubernetes would need to add two special kinds of Pods:
1. A joinable Pod that explicitly permits other Pods to join with it (this would be a genuine Pod with some special attributes).
2. A subsidiary Pod that depends on a joinable Pod and joins its network namespace. This would almost be a real pod except that it would have no network namespace of its own and hence no normal managed hostname or addresses.
#2 is a bit weird, but there’s precedent. A hostNetwork: true Pod is already weird in exactly the same way.
Podman was changing pretty fast for a while so it could be an older version thing, though I'd assume FCOS is on Podman 5 by now.
The tool that manages all my tools is the shell. It is where I attach a debugger, it is where I install iotop and use it for the first time. It is where I cat out mysterious /proc and /sys values to discover exotic things about cgroups I only learned about 5 minutes prior in obscure system documentation. Take it away and you are left with a server that is resilient against things you have seen before but lacks the tools to deal with the future.
You’ll never attach a debugger in production. Not going to happen. Shell into what? Your container died when it errored out and was restarted as a fresh state. Any “Sherlock Holmes” work would be met with a clean room. We have 10,000 nodes in the cluster - which one are you going to ssh into to find your container to attach a shell to it to somehow attach a debugger?
- Rollback/investigate the changelog between the current and prior version to see which code paths are relevant
- Use our observability infra that is equivalent to `perf`, but samples ~everything, all the time, again to see which codepaths are relevant
- Potentially try to push additional logging or instrumentation
- Try to better repro in a non-prod/test env where I can do more aggressive forms of investigation (debugger, sanitizer, etc.) but where I'm not running on production data
I certainly can't strace or run raw CLI commands on a host in production.
Have you worked the other way before? Where you have ssh access to machines (lots of them, when you need to do something big) that have all of your secrets, can talk to all of your dbs, and you can just compile + rsync binaries on to them to debug/repro/repair?
To me, being without those capabilities just feels crippling
A lot of the problems I enjoy solving specifically relate to consistently minimizing privilege, not from a security perspective (though there are obvious upsides to this), but from a debugging/clarity perspective. If you have a relatively small and statically verifiable set of (networked) dependencies, and minimize which resources which containers can access, reasoning about the system as a whole becomes a lot easier.
I can think of lots of cases where I've witnessed good outcomes from moving towards more fine-grained resource access, and very few cases where life has gotten better by saying "everyone has access to everything".
If you have a memory leak, wrap the suspect code in more instrumentation. Write unit tests that exercise that suspect code. Load test that suspect code. Fix that suspect code.
I’ll also add that while I build clusters and throw away the ssh keys, there are still ways to gain access to a specific container to view the raw logs and execute commands but like all container environments, it’s ephemeral. There’s spice access.
You would connect to any of the nodes having the problem.
I've worked both ways; IMHO, it's a lot faster to get to understanding in systems where you can inspect and change the system as it runs than in systems where you have to iterate through adding logs and trying to reproduce somewhere else where you can use interactive tools.
My work environment changed from an Erlang system where you can inspect and change almost everything at runtime to a Rust system in containers where I can't change anything and can hardly inspect the system. It's so much harder.
It is, SSH is indeed the tool for that, but that's because until recently we did not have better tools and interfaces.
Once you try newer tools, you don't want to go back.
Here's the example of my fairly recent debug session:
You don't need debugging facilities for many issues. You need observability and tracing. Instead of debugging the issue for tens of minutes at least, I just used an observability tool which showed me the path in 2 minutes.
Take a look at this Netflix presentation, especially on the screenshots of their web interface tool: https://archives.kernel-recipes.org/wp-content/uploads/2025/...
We need to accept that UNIX did not get things right decades ago and be willing to evolve UX and security to a better place.
Once is chance, twice is coincidence, three times makes a pattern.
I recently diagnosed and fixed an issue with Veeam backups that suddenly stopped working part way through the usual window and stopped working from that point on. This particular setup has three sites (prod, my home and DR), and five backup proxies. Anyway, I read logs and Googled somewhat. I rebooted the backup server - no joy, even though it looked like the issue was there. I restarted the proxies and things started working again.
The error was basically: there are no available proxies, even though they were all available (but not working but not giving off "not working" vibes).
I could bother with trying to look for what went wrong but life is too short. This is the first time that pattern has happened to me (I'll note it down mentally and it was logged in our incident log).
So, OK, I'll agree that a reboot should not generally be the first option. Whilst sciencing it or nerding harder is the purist approach, often a cheeky reboot gets the job done. However, do be aware that a Windows box will often decide to install updates if you are not careful 8)
You just temporarily mitigated it.
This is also a good reason to log everything all the time in a human readable way. You can get services up and then triage at your own pace after.
perf is a shell tool. iptables is a shell tool. sshguard is a log reader and ultimately you will use the CLI to take action.
If you are advocating newer tools, look into nft - iptables is sooo last decade 8) I've used the lot: ipfw, ipchains, iptables and nftables. You might also try fail2ban - it is still worthwhile even in the age of the massively distributed botnet, and covers more than just ssh.
I also recommend a VPN and not exposing ssh to the wild.
Finally, 13,000 addresses in an ipset is nothing particularly special these days. I hope sshguard is making a properly optimised ipset table and that you are running appropriate hardware.
My home router is a pfSense jobbie running on a rather elderly APU4 based box and it has over 200,000 IPs in its pfBlocker-NG IP block tables and about 150,000 records in its DNS tables.
Well yes, and to be honest in this case I did that all over SSH: run `perf`, generate flame graph, copy the .svg to the PC over SFTP, open it in the file viewer.
What I really wanted is a web interface which will just show that to me without using the shell.
>look into nft - iptables is sooo last decade
It doesn't matter in this context: iptables is using new netfilter (I'm not using iptables-legacy), and this exact scenario is 100% possible with native netfilter nft.
>Finally, 13,000 addresses in an ipset is nothing particularly special these days
Oh, the other day I had just 70 `iptables -m set --match-set` rules, and did you know how inefficient the source/destination address hashing algorithm for the set match apparently is?! It was debugged with perf as well!
I'm talking about ~4Gbit/s sudden limitation on a 10Gbit link.
>I'm talking about ~4Gbit/s sudden limitation on a 10Gbit link.
I think you need to look into things if 70 IPs in a table are causing issues, such that a 10Gb link ends up at four Gb/s. I presume that if you remove the ipset, that 10Gb/s is restored?
Testing throughput and latency is also quite a challenge - how do you do it?
Yes, we all want that. I've been running monitoring systems for over 30 years and it is quite a tricky thing to get right. .1.3.6.1.4.1.33230 is my company enterprise number, which I registered a while back.
The thing is that even though we are now in 2026, monitoring is still a hard problem. There are, however, lots of tools, way more than we had back in the day. But just like a saw can rip your finger off instead of cutting a piece of wood, well, I'm sure you can fill in the blanks.
Back in the day we had a thing called Ethereal which was OK and nearly got buried. However you needed some impressive hardware to use it. Wireshark is a modern marvel and we all have decent hardware. SNMP is still relevant too.
Although we have stonking hardware these days, you do also have to be aware of the effects of "watching". All those stats have to be gathered and stashed somewhere and be analysed etc. That requires some effort from the system that you are trying to watch. That's why things like snmp and RRD were invented.
Anyway, it is 2026 and IT is still properly hard (as it damn well should be)!
But, anyway, remote command and control of observability really is a thing in the industry, not just at one company.
The dashboards are something that looks cool, but they usually are not really helpful for debugging. What you're looking for is per-request tracing and logging, so you can grab a request ID and trace it (get log messages associated with it) through multiple levels of the stack. Even maybe across different services.
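A minimal sketch of what per-request tracing can look like inside one service, using only the Python standard library. It assumes a request ID is attached to every log line via a context variable; the logger name `demo.tracing` and the functions `build_logger`, `handle_request`, `query_database`, and `lines_for` are hypothetical. Propagating the ID across services (for example in a request header) is left out.

```python
import contextvars
import io
import logging
import uuid

# Holds the ID of the request currently being handled ("-" when there is none).
request_id_var = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Stamp every log record with the current request ID."""
    def filter(self, record):
        record.request_id = request_id_var.get()
        return True

def build_logger(stream):
    """Return a logger whose lines start with the request ID."""
    logger = logging.getLogger("demo.tracing")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(request_id)s %(message)s"))
    handler.addFilter(RequestIdFilter())
    logger.addHandler(handler)
    return logger

def query_database(logger):
    # A lower layer of the stack: it never sees the ID explicitly.
    logger.info("db: fetching user")

def handle_request(logger, request_id=None):
    """Handle one request, tagging all log output with its ID."""
    rid = request_id or uuid.uuid4().hex
    token = request_id_var.set(rid)
    try:
        logger.info("frontend: request received")
        query_database(logger)
        logger.info("frontend: response sent")
    finally:
        request_id_var.reset(token)
    return rid

def lines_for(stream, rid):
    """Grab every log line that belongs to one request."""
    return [l for l in stream.getvalue().splitlines() if l.startswith(rid + " ")]

if __name__ == "__main__":
    buf = io.StringIO()
    log = build_logger(buf)
    rid = handle_request(log)
    print("\n".join(lines_for(buf, rid)))
```

Filtering the log stream by one ID then yields the full path of that request through the layers, which is the "grab a request ID and trace it" workflow described above.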
Debuggers are great, but they are not a good option for production traffic.
There are tools which show what happens per process/thread and inside the kernel. Profiling and tracing.
Check Yandex's Perforator, Google Perfetto. Netflix also has one, forgot the name.
But instead we go with multiple moving parts all configured independently? CoreOS, Terraform and a dependence on Vultr thing. Lol.
Never in a million years would I think it's a good idea to disable SSH access. Like why? Keys and a non-standard port already bring China login attempts to like 0 a year.
Though I must say I am not brave enough and my family uses it, so I prefer to have just one broken service instead of the entire machine.
But it is possible.
I also notice that the word security does not grace your blog posting. That is a sure sign of the DevOps Way 8) You might look into the sysadmin way. It's boring, to be sure: all that fussing over security and the like!
You could look into VPNs for access to your gear. An IPsec, OpenVPN or WireGuard VPN seems to keep most baddies away simply because it is a lot of effort to even engage with one. There are a huge number of ways that a VPN can be configured. Then you have ssh, which can be very securely configured (or not).
You can also use firewalls and I'm sure you do. If you have a static IP at home then simply filter for that. Make use of allow/deny lists - there are loads for firewalls of all sorts.
Dumping remote shell access is not useful.
Worked well for me a few years.
Problems: when you have issues you need to look into the Portainer logs to see why it failed.
That’s one big problem; I'd prefer something like Jenkins to build it instead.
And if you have more groups of Docker Compose files, you just add another sh script to the main infrastructure git repo to do the same thing; on a git change it will spawn new git watchers.
[0]: https://fedoraproject.org/iot/