Ternfs – an Exabyte Scale, Multi-Region Distributed Filesystem

Posted3 months agoActive2 months ago

kirlev

138 points

31 comments

xtxmarkets.comTechstory

calmmixed

Debate

60/100

Distributed FilesystemsExabyte Scale StorageParallel Computing

Key topics

Distributed Filesystems

Exabyte Scale Storage

Parallel Computing

TernFS is a new exabyte-scale, multi-region distributed filesystem announced by XTX Markets, sparking discussion on its features, scalability, and comparison to existing solutions like Lustre and ZFS.

Snapshot generated from the HN discussion

Discussion Activity

Moderate engagement

First comment

26m

Peak period

12-18h

Avg / period

4.4

Comment distribution31 data points

Loading chart...

Based on 31 loaded comments

Key moments

01Story posted
Oct 20, 2025 at 1:36 PM EDT
3 months ago
Step 01
02First comment
Oct 20, 2025 at 2:01 PM EDT
26m after posting
Step 02
03Peak activity
10 comments in 12-18h
Hottest window of the conversation
Step 03
04Latest activity
Oct 23, 2025 at 4:24 AM EDT
2 months ago
Step 04

Generating AI Summary...

Analyzing up to 500 comments to identify key contributors and discussion patterns

Discussion (31 comments)

Showing 31 comments

poppafuze

3 months ago

1 reply

Great default license.

Joel_Mckay

3 months ago

CephFS looks stable, and has diskprediction and Prometheus modules:

https://docs.ceph.com/en/quincy/cephfs/index.html

https://github.com/ceph/ceph

Still not completely decoupled from host roles, but seems to work for some folks. =3

president_zippy

2 months ago

3 replies

Could anybody with applicable experience tell me how this filesystem compares in the real world to Lustre?

If it is decisively better than Lustre, I am happy to make the switch over at my sector in Argonne National Lab where we currently keep about 0.7 PB of image data and eventually intend to hold 3-5 PB once we switch over all 3 of our beamlines to using Dectris X-Ray detectors.

Contrary to what the non-computer scientists insist, we only need about 20Gb/s of throughput in either direction, so robustness and simplicity are the only concerns we have.

mat_epice

2 months ago

1 reply

There are several other systems I would recommend before TernFS for your environment. If you're looking at Lustre versus this in particular, Lustre has been through the wringer, and ANL/DOE has plenty of people who understand it enough to run it well and fix it when it breaks.

However, you are right. Your bandwidth needs don't really require Lustre.

president_zippy

2 months ago

2 replies

Seriously man, I'm asking because I don't know: which filesystems do you recommend instead? I dabbled in CephFS because our data is write-once, but helping computer illiterate research scientists at other universities and national labs retrieve their data is a lot simpler from Lustre because it's just plain-old POSIX filesystem semantics.

I'm not joking, I didn't ask this as a way to namedrop my experience and credentials (common 'round this neck o' the woods), I honestly don't know what all the much more competent organizations are doing and would really like to find out.

huntaub

2 months ago

2 replies

I’d be happy to chat more about your needs and try to help recommend a path forward. Feel free to shoot me an email at the address in my profile.

x______________

2 months ago

1 reply

Is this an ad? Why can't the topic continue here as a reply to op?

Borg3

2 months ago

1 reply

Because its a consulting oportunity.

nhanb

2 months ago

1 reply

I read somewhere that Hacker News should have been named Startup News, and sometimes interactions like the one upthread reminds me of that. I'm not saying it's wrong - if you're good at something don't do it for free and all that - but it's kinda sad that in-depth discussions on public forums are getting harder and harder to find these days.

x______________

2 months ago

1 reply

Normal conversations by topic enthusiasts usually have fun stuff hidden in their profiles and at times lead to fun rabbit holes where you endlessly learn and somehow, forgot that you were initially browsing HN.

Agree about the public discussion part, one of the reasons why I'm here lately.

Also, why can't someone create Startup News: Where every article reply is an opportunity to be sold a service, SN would take a cut of transactions. /s

dspillett

2 months ago

> SN would take a cut of transactions

These are people already trying to divert the discussion off-site for their benefit. Very few would honestly report any resulting transaction for the cut to be taken from.

[yeah, it did see the sarcasm tag, just clarifying to put off would be entrepreneurs so we aren't inundated by show-hn posts from people vibe-coding the idea over the next few days!]

president_zippy

2 months ago

I saw the follow-up responses complaining about you soliciting, but I've got no problem with you offering to solve a problem and being remunerated for it.

However, my lab is a brokedick operation with barely enough cash reserves to pay staff salaries. We sincerely do not have the budget to buy new software, especially after the NIH funding cuts.

mat_epice

2 months ago

It's a serious requirements-gathering exercise. I would look inside your organization for HPC storage experts and ask them to sit down with you for an hour to walk through your users' typical workflows, expectations, and budget. If you need some names, send me an email.

Or just shell out for as much Weka as they can convince you that you need and call it a day.

anon-3988

2 months ago

0.7PB of compressed data?

toast0

2 months ago

If you only need 20 Gb/s, you might be able to meet your needs without an exotic distributed filesystem by just getting a single giant server with a rack full of JBODs:

Something like this [1] gets you 44 disks in 4u. You can probably fit 9 of those and a server with enough HBAs to interface with it in a 42U rack. 9x44x20TB = not quite 8 PB. Adjust for redundancy and/or larger drives. If you go with SAS drives, you can have two servers connected to the drives, with failover. Or you can setup two of these racks in different locations and mirror the data (somehow).

[1] https://www.supermicro.com/en/products/chassis/4U/847/SC847E... (as an illustration, sas jbods aka disk shelves are widely available from server vendors)

anon-3988

2 months ago

1 reply

Isn't this literally what ZFS is designed for? What is ZFS lacking that this is needed.

somat

2 months ago

1 reply

ZFS is not distributed. So probably closer to ceph or lustre. I have to admit, on my first pass through the page it failed to explain why it was better than ceph.

president_zippy

2 months ago

2 replies

Given all the good work ZFS does locally, it does make you wonder what it would take to extend the concepts of ARC caching and RAID redundancy to a distributed system, one where all the nodes are joined together by RDMA rather than ethernet; one where reliability can be taken for granted (short of a rat chewing cables).

It would make for one heck of a FreeBSD development project grant, considering how superb their ZFS and their networking stack are separately.

P.S. Glad someone pointed this out tactfully. A lot of people would have pounced on the chance to mock the poor commenter who just didn't know what he didn't know. The culture associated with software development falsely equates being opinionated with being knowledgeable, so hopefully we get a lot more people reducing the stigma of not knowing and reducing the stigma of saying "I don't know".

nh2

2 months ago

2 replies

Even if you were build a ZFS mega-machine with an Exabyte of storage with RDMA (the latencies of "normal" Ethernet in the datacenters would probably not be good enough), wouldn't you still have the problem that ZFS is fundamentally designed to be managed by and accessed on one machine? All data in and out of it would have to flow through that machine, which would be quite the bottleneck.

mgerdts

2 months ago

If your entire system is connected via RDMA networks (rather common in HPC) I would not worry at all about latency. If you are buying NICs and switches that are capable of 100Gb or better, there’s a reasonable chance they support RoCE.

president_zippy

2 months ago

Because RDMA latency is still a lot lower than disk access latency, it depends more on whether or not the control logic can be generalized to distributed scale with some simple refactoring and a few calls to access remote shared memory, or whether a full-on rewrite is less time-consuming. I don't know, and I don't pretend to know.

All I know is that the semantics of RDMA (absent experience writing code that uses RDMA) deceive me into thinking there's some possibility I could try it and not end up regretting the time spent on a proof of concept.

mgerdts

2 months ago

This is hobby project I’ve been thinking about for quite a while. It’s way larger than a hobby project, though.

I think the key to making it horizontally scalable is to allow each writable dataset to be managed by a single node at a time. Writes would go to blocks reserved for use by a particular node, but at least some of those blocks will be on remote drives via nvmeof or similar. All writes would be treated as sync writes so another node could have lossless takeover via ZIL replay.

Read-only datasets (via property or snapshot, including clone origins) could be read directly from any node. Repair of blocks would be handled by a specific node that is responsible for that dataset.

A primary node would be responsible for managing association between nodes and datasets, including balancing load and handling failover. It would probably be responsible for metadata changes(datasets, properties, nodes, devs, etc., not posix fs metadata) and the coordination required across nodes.

I don’t feel like I have a good handle on how TXG syncs would happen, but I don’t think that is insurmountable.

darkwater

2 months ago

2 replies

To the OP: nice karma trick posting the URL with the anchor to bypass the HN duplicates detector. Dang & co, this is a bug, it should be fixed.

I know because I stumbled on the same page following the links from the blog of the author of another post that made the frontpage yesterday (https://news.ycombinator.com/item?id=45589156), liked the TernFS concept, submitted it and got redirected to https://news.ycombinator.com/item?id=45290245

dspillett

2 months ago

1 reply

> Dang & co, this is a bug, it should be fixed.

Agreed, more or less, this would be easy to work around naively. Though an duplicate detection should not block reposts based on removing the anchor nor should the anchor portion automatically be removed generally. Some sites unnecessarily setup as SPAs use the anchor portion for context so it is needed for direct links to the right article in those¹, and going directly to a specific section in a long page can also be useful.

> nice karma trick posting the URL with the anchor to bypass the HN duplicates detector.

Karma jealously is as unbecoming as karma whoring, so perhaps wind the ol' neck in a little there. Laziness/ineptitude is more common than malice and this could have been accidental via quick copy+paste.

A better way to respond in this situation is a more neutral “Already recently discussed at [link]”, as had been done some hours before your comment: https://news.ycombinator.com/item?id=45646691#45647047

----

[1] Yes, those sites are badly designed, but they are unlikely to change because of our technical preferences and breaking the ability to deep link into them would add an issue for HN while not being noticed by those sites at all.

darkwater

2 months ago

1 reply

> Karma jealously is as unbecoming as karma whoring, so perhaps wind the ol' neck in a little there. Laziness/ineptitude is more common than malice and this could have been accidental via quick copy+paste.

Agreed, and sorry for that (even though my gut and not-so-gut feeling is that it was done on purpose rather than by mistake, but I might be on the wrong side myself on this).

dspillett

2 months ago

1 reply

Of course if it is a pattern (I've not checked OP's post history to check) then accuse away and present evidence, to make it clear that we tut disapprovingly at that sort of thing!

darkwater

2 months ago

It was the first and currently only submission from OP, who seems also a lurker because they wrote just one comment on HN 3 months ago.

I'm not _that_ interested in the HN meta-game, so I will leave it here.

txrx0000

2 months ago

The fact that a repost got so many upvotes means the post has a lot of value, but many missed it previously. Perhaps it would be better to allow reposts to collect karma as usual so they would have a chance to get to the front page, but have the repost's link and comment page redirect to the existing post's link and comment page (unless a certain amount of time has passed, then the repost can be its own post and have its own comments section, which is the system we already have).

I think being able to identify what's worthy of reposting deserves upvotes, too. If a repost truly provided little to no value, then the number of upvotes would reflect that and it would never get to the front page. But in this case, many people and myself included would have never found the post if it weren't for this repost.

Different batches of users are on HN at different times and on different days. Allowing reposts to collect karma would mean that every link's exposure is derived from the entire HN userbase's votes rather than a small subset of the users that happened to be online at the time of the post.

semessier

3 months ago

should post again when having 5% of the features of the other parallel file systems starting with RDMA, whereby it's not clear if this FS does even stripe that is if it is even a parallel file system

cpach

3 months ago

Previously: https://news.ycombinator.com/item?id=45290245

roadbuster

2 months ago

> all reads and writes go through the leader

One of the pain points of scaling Zookeeper is that all writes must go to the leader (reads can be fulfilled by followers). I understand this is "leader of a shard" and not a "global leader," but it still means a skewed write load on a shard has to run through a single leader instance

> given that horizontal scaling of metadata requires no rebalancing

This means a skewed load cannot be addressed via horizontal scaling (provisioning additional shards). To their credit, they acknowledge this later in the (very well-written) article:

> This design decision has downsides: TernFS assumes that the load will be

> spread across the 256 logical shards naturally.

View full discussion on Hacker News

ID: 45646691Type: storyLast synced: 11/20/2025, 4:35:27 PM

Want the full context?

Jump to the original sources

Read the primary article or dive into the live Hacker News thread when you're ready.

Open link View on HN