Mind the Encryptionroot: How to Save Your Data When Zfs Loses Its Mind

3 months ago

4 replies

> Lesson: Test backups continuously so you get immediate feedback when they break.

This is a very old lesson that should have been learned by now :)

But yeah the rest of the points are interesting.

FWIW I rarely use ZFS native encryption. Practically always I use it on top of cryptsetup (which is a frontend for LUKS) on Linux, and GELI on FreeBSD. It's a practice from the time ZFS didn't support encryption and these days I just keep doing what I know.

GauntletWizard

3 months ago

1 reply

I really love ZFS Native encryption, but this is the big problem with it. I use ZFS Raw Sends to store my backups incrementally in a cloud I trust, but not enough to have raw access to my files. ZFS has great attributes there, theoretically - I can send delta updates of my filesystems, and the receiver never has they keys to decrypt them.

I've used this in practice for many years (2020), and aside from encountering exactly this issue (though thankfully I did have a bookmark already in place), it's worked great. I've tested restores from these snapshots fairly regularly (~ quarterly), and only once had an issue related to a migration - I moved the source from one disk to another. This can have some negative effects on encryptionroots, which I was able to solve... But I really, really wish that ZFS tooling had better answers to it, such as being able to explicitly create and break these associations.

3 months ago

3 replies

Yeah I use different methods for that. I considered using zfs send/receive for backups, however there's one big issue with that: every time you need one or two files from the backup you need to restore the whole filesystem. There's no official way to retrieve a single file from a zfs send stream.

For backup purposes I also greatly prefer file by file encryption because one corruption will only break one file and not the whole backup.

What I do now is encrypt with encfs and store on a S3 glacier-style service.

GauntletWizard

3 months ago

I've never had to restore a single file that's older than my local snapshots; The restores where I've needed an old subset have been a substantial enough subset that 4-5xing the data size on restore was not really an issue.

I kinda agree with your point on file-by-file encryption, but ZFS's general integrity features are such that I'm not really worried - Except about this article's specific failure mode, which is pretty easy to deal with/avoid when you know about it, but is a substantial deficiency.

benlivengood

3 months ago

I can only think of one situation where it's hard to retrieve a remote backup of an individual file; where ZFS native encryption is in use and the remote backup system is not trusted to load the key for the dataset.

For myself, I don't trust remote systems to always have keys loaded, but in an emergency I would feel relatively safe temporarily loading the key, mounting the snapshot read-only, and scp-ing the files out, then unmounting and unloading the key (and rebooting for good measure).

There's also a viable slow option; export the raw storage of the backup ZFS pool over the (inter)network to a trusted machine and import the pool read-only locally, load the key, mount the filesystem, and make a copy. Much slower but is practical. I've used s3backer fairly successfully as a backup method for a pool with native encryption; it takes a minute or so to import the pool and can write backup snapshots at a few MB/s, so there shouldn't be any fundamental reason iscsi or similar wouldn't work.

3 months ago

You don't actually have to restore the entire snapshot if you just want a single file! ZFS mounts snapshots read-only in an extra hidden .zfs/snapshot directory that doesn't even show up in ls -a unless you set snapdir=visible on the dataset but you can copy files out of there.

For example, cp /path/to/dataset/mountpoint/.zfs/snapshot/<snapshot_name>/path/to/file ~/path/to/file

binwiederhier

3 months ago

3 replies

ZFS encryption is much more space efficient than dmcrypt+unencrypted ZFS when combined with zstd compression. This is because it can do compress-then-encrypt instead of encrypt-then-(not-really-)compress. It is also much much faster.

Source: I work for a backup company that uses ZFS a lot.

zielmicha

3 months ago

1 reply

Can you explain this in more detail? It doesn't seem true on a first glance.

If you enable compression on ZFS that runs on top of dmcrypt volume, it will naturally happen before encryption (since dmcrypt is the lower layer). It's also unclear how it could be much faster, since dmcrypt generally is bottlenecked on AES-NI computation (https://blog.cloudflare.com/speeding-up-linux-disk-encryptio...), which ZFS has to do too.

binwiederhier

3 months ago

1 reply

Oh my bad. I misread your comment. You are doing ZFS on top of dmcrypt, not dmcrypt images/volumes on top of ZFS.

3 months ago

It was my comment and yes indeed that's what I'm doing. Zfs on top of luks

3 months ago

1 reply

I don't use compression anyway. I don't like the way that the storage pool capacity becomes variable then.

HumanOstrich

3 months ago

1 reply

I don't understand. You don't like that some things compress better than others, saving a variable amount of space?

3 months ago

It also saves on wear

3 months ago

Using any file system that supports compression on top of LUKS does compression before encryption

3 months ago

It's hard to write a completely automated backup test that's also pretty thorough. Yeah it would have caught "completely umountable" but there's a lot of other problems that a basic script has little hope of catching.

I do manual backup checks, and so did the author, but those are going to be limited in number.

tom_alexander

3 months ago

I use native ZFS encryption because it makes it super easy to share encrypted datasets across dual-booted operating systems. AFAIK Linux does not support GELI and FreeBSD does not support LUKS. DragonflyBSD supports LUKS but then no ZFS.

Also, that way I can have Linux and FreeBSD living on the same pool, seamlessly sharing my free space, without losing the ability to use encryption. Doing both LUKS and GELI would requiring partitioning and giving each OS its own pool.

kalaksi

3 months ago

1 reply

I've used zfs and btrfs and while I haven't quite lost data, I have also hit some unnerving pitfalls / sharp edges that have confirmed that I should keep at least one copy using just LUKS + ext4. I like the features but I think the more complicated filesystems bring about other kind of risks.

j45

3 months ago

I like storage to be an appliance and not feel like something intermittently unreliable.

chasing0entropy

3 months ago

3 replies

Why zfs freak out is accepted as "normal" in a dev environment is beyond me. I use storage spaces on a daily basis in production and dev environment and have for nearly 10 years now and with only marginal use of PowerShell I have been able to restore every array I didn't destroy intentionally. This is the bare minimum I expect out of an redundant array of any type regardless of its speed or scalability promises.

nabla9

3 months ago

2 replies

It's not accepted and it's not normal.

This is the case of user changing password setting and and realizing you can't use them with old backups after accidentally destroying one dataset. zfs is intented for servers and sysdmins so it is not as friendly as some may expect, but it did not lose anything that user did not destroy. Author had to use logic to deduct what he did and walk it back.

3 months ago

1 reply

> changing password setting and and realizing you can't use them with old backups

That's unfair to the author. The backups were new, post-password change. And neither old nor new password worked on them. The thing that was old was an otherwise empty container dataset.

unsnap_biceps

3 months ago

4 replies

This is a case where both sides are completely understandable and no one did anything wrong. ZFS didn't lose its mind. It worked as designed and intended. The author didn't know a critical detail about the implementation. It's a series of unfortunate events. The only failure could be lack of better ZFS documentation.

3 months ago

1 reply

What ZFS did is understandable but wrong. Sending an incremental snapshot needs to send updates to the encryption parameters, even if they're inherited from another dataset.

privatelypublic

3 months ago

I'm not sure if anybody is wrong or right. But this should be officially documented, a specific error provided- not "permission denied", and a workflow to fix it that doesn't involve patching the driver.

1oooqooq

3 months ago

1 reply

let's all agree sending incremental data to something that settings were changed, without any error, is a bug

privatelypublic

3 months ago

1 reply

Except, zero knowledge replication is why ZFE native encryption got funded. I believe it was Datto specifically.

3 months ago

A snapshot knows the parent block ID of its data. Also storing the parent block ID of its encryption settings wouldn't leak anything.

3 months ago

Hi, author here! I'm not sure why you're getting downvoted because you're absolutely right.

Depending on if you're being optimistic or pessimistic, either 1) neither I nor ZFS did anything wrong or 2) both ZFS and I did some things wrong. Either way, neither one nor the other is particularly at fault.

While I said "ZFS lost its mind" for title length reasons, it really would be more accurate to say "ZFS appeared to loose its mind", since as I learned, everything makes sense when you consider the encryption root.

My only disagreement is that the only possible improvement is in documentation. I answered this in a reply to OpenZFS developer robn in another thread (https://old.reddit.com/r/zfs/comments/1ntwrjx/mind_the_encry...) which I'll copy here:

> This is great writeup, and I really appreciate you taking the time on it. With my OpenZFS dev hat on, it's often quite difficult to understand exactly how people are using the things we make, especially when they go wrong - what were they expecting, what conceptual errors were involved, and so on. I'm passing it around at the moment and will give it a much slower and more thoughtful read as soon as I can. Thanks!

> While it's fresh on your mind, what would be one simple change that we could make today that would have prevented this is or made it much less likely? Doc change, warning output, etc. I have some ideas, but I don't want to lead the witness :)

First off, thank you for taking the effort to try and understand this from the OpenZFS side. It's really easy to dismiss this kind of thing as user error (which is true) since OpenZFS did actually behave as designed (which is also true), rather than taking it as an opportunity to better understand and improve the user experience.

When I think of the factors that lead me to make the mistake of not sending a snapshot of the encryption root, I think it comes down to a difference in expectations vs reality. When I think of a snapshot, I conceptualize it as fully consistent version of a dataset at a point in time (which as far as I know is still true for unencrypted datasets). Native encryption violates that expectation by 1) storing the wrapping key parameters in the encryption root which may be a different dataset and therefore exists outside of the snapshot and 2) allowing the wrapping key and dataset master keys to get out of sync.

If I send a snapshot from one pool to another, I expect ZFS to send everything necessary to reproduce the same state on the destination pool. As an uneducated user, I'd find it very unintuitive that I also need to send a new snapshot of another empty, unchanged dataset which through some "spooky action at a distance" affects the decryptability other child encrypted datasets just because it's the parent dataset at the root of the encrypted tree. I expect datasets to be isolated from one another. Users shouldn't have to know about wrapping keys and master keys, let alone worry about keeping them in sync across multiple datasets.

While I do think the docs could be improved to emphasize the importance of the encryption root (especially in zfs-send -w, --raw which doesn't even mention it) which would've made debugging the issue a bit easier, I'm not sure how much that would've helped prevent the issue in the first place. The reality we must face is that people don't read the docs unprompted to challenge their fundamental expectations; they work with the mental model they have and only consult the docs when they have a question to answer.

What I do think would've really helped is if zfs-recv could check the wrapping key parameters on the encryption root when it sees a new encrypted master key in the send stream and abort if they don't match. This wouldn't prevent every scenario, (eg. if instead of forgetting to send a snapshot of the encryption root, you forgot to send snapshots of the child encrypted datasets), but it would have prevented this one and would be a step in the right direction.

In the long term, I'd really love for OpenZFS to treat keeping wrapping keys and master keys in sync as an invariant that is always maintained so users don't need to know about encryption roots, wrapping keys, or master keys and can ignore them like any other implementation details. I've only just begun thinking about some potential options in the solution space, but I have a feeling this will not be an easy problem to fully solve. I'd love to hear your ideas as an actual OpenZFS developer.

docsaintly

3 months ago

Would the author (or most people) have read the documentation before doing this action? I doubt it.

aborsy

3 months ago

The user has their encrypted data and two encryption keys. They should be able to decrypt. They don’t care about internal ZFS password settings.

I also confirm that people snapshot their data, which is usually child datasets. If you don’t care about an empty folder, snapshotting and replicating it according to a careful schedule is not expected.

MuteXR

3 months ago

And you can also do so with ZFS. OP has hit a weird issue that normal usage won't ever get.

One that should not exist, of course, but certainly not a normal one.

atmosx

3 months ago

OpenZFS has worked fine for me, in mirror mode, for 15 years without anything resembling data loss.

When I had to replace HDDs, the ops were very smooth. I don't mess with ZFS all that often. I rely to the documentation. I must say that IMO the CLI is a breath of fresh air compared to the other options we had in the past (ext3/4FS, ReiserFS, XFS, etc.). Now BTRFS might be easier to work with, I can't tell.

btw, this bug is well known amongst openZFS users. There are quite a few posts about it.

pessimizer

3 months ago

2 replies

This all seems unbelievably more complicated and prone to failure than just doing luks over mdadm. You could just skip this weird, arcane process by imaging the disks, walking them to where they needed to be, then slapping them into the other machine and mounting them as normal.

I do not understand making RAID and encryption so very hard, and then using some NAS in a box distribution like an admission you don't have the skills to handle it. A lot of people are using ZFS and "native encryption" on Archlinux (not in this case) when they should just be using mdadm and luks on Debian stable. It's like they're overcomplicating things in order to be able to drop trendy brand names around other nerds, then often dramatically denouncing those brand names when everything goes wrong for them.

If you don't have any special needs, and you don't know what you're doing, just do it the simple way. This all just seems horrific. I've got >15 year old mdadm+luks arrays that have none of their original disks, are 5x their original disk size, have survived plenty of failures, and aren't in their original machines. It's not hard, and dealing with them is not constantly evolving.

Reading this gives me childhood anxiety from when I compressed by dad's PC with a BBS pirated copy of Stacker so I would have more space for pirated Sierra games, it errored out before finishing, and everything was inaccessible. I spent from dusk to dawn trying to figure out how to fix it (before the internet, but I was pretty good at DOS) and I still don't know how I managed it. I thought I was doomed. Ran like a dream afterwards and he never found out.

throwaway240403

3 months ago

3 replies

There are very real reasons to use ZFS instead of the oldschool Linux block device sandwich. mdadm+luks+lvm still do not quite provide the same set of features that ZFS alone does even without encryption. Namely in-line compression, and data checksumming, not to mention free snapshots.

ZFS is quite mature, the feature discussed in the article is not. As others have pointed out this could have been avoided by running ZFS on top of luks and would have hardly sacrificed any functionality.

kelnos

3 months ago

2 replies

It's a little weird to denounce the "block device sandwich" and then say that they should have used... a variation of the block device sandwich.

> There are very real reasons to use ZFS

I feel like, for the types of person GP is talking about, they likely don't really need to use ZFS, and luks+md+lvm would be just fine for them.

Like the GP, I have such a setup that's been in operation for 15-20 years now, with none of the original disks, probably 4 or 5 full disk swaps, starting out as a 4x 500GB array, which is now a 5x 8TB array. It's worked perfectly fine, and the only times I've come close to losing data is when I have done something truly stupid (that is, directly and intentionally ignored the advice of many online tutorials)... and even then, I still have all my data.

Honestly the only thing missing that I wish I had was data checksumming, and even then... eh.

pixl97

3 months ago

Run enough disks long enough and you'll find one that starts returning garbage while telling the OS everything is ok.

First time I had it happen was on a hardware raid device and a company lost 2 and a half days worth of data as any backups from when it started had bad data.

The next time I had it happen is using ZFS and we saw a flood of checksum errors and replaced the disk. Even after that SMART thought it was perfectly fine and you could send commands to it, you just got garbage back.

conception

3 months ago

How do you know you’ve lost no data? Do you checksum all your files? Bits gonna rot.

eadmund

3 months ago

> mdadm+luks+lvm still do not quite provide the same set of features that ZFS alone does even without encryption. Namely in-line compression, and data checksumming, not to mention free snapshots.

Sure, but LUKS+ZFS provides all that too, and also encrypts everything (ZFS encryption, surprisingly, does not encrypt metadata).

As this article demonstrates, encryption really is an afterthought with ZFS. Just as ZFS rethought from first principles what storage requires and ended up making some great decisions, someone needs to rethink from first principles what secure storage requires.

3 months ago

> Namely in-line compression, and data checksumming, not to mention free snapshots.

You get these for free with btrfs

yjftsjthsd-h

3 months ago

> I do not understand making RAID and encryption so very hard,

I don't use ZFS-native encryption, so I won't speak to that, but in what way is RAID hard? You just `zpool create` with the topology and devices and it works. In fact,

> If you don't have any special needs, and you don't know what you're doing, just do it the simple way. This all just seems horrific. I've got >15 year old mdadm+luks arrays that have none of their original disks, are 5x their original disk size, have survived plenty of failures, and aren't in their original machines. It's not hard, and dealing with them is not constantly evolving.

I would write almost this exact thing, but with ZFS. It's simple, it's easy, it just keeps going through disk replacements and migrations.

3abiton

3 months ago

2 replies

> I very nearly permanently lost 8.5 TiB of data after performing what should've been a series of simple, routine ZFS operations but resulted in an undecryptable dataset. Time has healed the wound enough that I am no longer filled with anguish just thinking about it, so I will now share my experience in the hope that you may learn from my mistakes.

As a zfs user employing encryption, that read like a horror story. Great read, and thanks for the takeaway.

nextaccountic

3 months ago

1 reply

why would anyone use zfs encryption rather than luks?

3abiton

3 months ago

Luks add overheads. Itcs more flexible, but zfs native envryption should be faster.

3 months ago

Thanks! If you'd like to read another encryption related horror story, this one served as a bit of stylistic inspiration: https://max.levch.in/post/724289457144070144/shamir-secret-s...

Imustaskforhelp

3 months ago

3 replies

I am not sure if this is the correct place but pardon me, I was one trying to remove the luksEncryption key and I searched it on stackoverflow thinking that I am going to figure this out myself...

The first thing on stackoverflow permanently made the data recoverable and it was only under the comment that people mentioned this...

My whole data of projects and what not got lost because of it and that just gave me the lesson of actually reading the whole thing.

I sometimes wonder if using AI would've made any difference or would it have even mattered because I didn't want to use AI and that's why I went to stackoverflow lol... But at the same point, AI makes hallucinations too but it was a good reality check for me as well to always read the whole thing before running commands.

edwcross

3 months ago

1 reply

Did you mean "unrecoverable"? I first read your comment as "ok, the solution is trivially easy so the article is unnecessary", but the rest of your comment implies the opposite.

Imustaskforhelp

3 months ago

My data did get unrecoverable after running the command that was shown first when i didn't scroll or read about that command more and I just ran it and it just made it unrecoverable.

So yes it got unrecoverable.

And then I just deleted that drive by flashing nix-os in that and trying that for sometimes, so maybe there is good in every bad and I definitely learnt something to always be cautious about what commands you run

jeroenhd

3 months ago

1 reply

> I sometimes wonder if using AI would've made any difference or would it have even mattered because I didn't want to use AI and that's why I went to stackoverflow lol

AI is trained on stackoverflow and much, much worse support forums. At least SO has the comments below bad advice to warn others, AI will just say "Oops, you're entirely right, I made a mistake and now your data is permanently gone".

Imustaskforhelp

3 months ago

Oh yes, I forgot to tell the aftermath,The funny thing is that I actually went to AI after making it unrecoverable and it says that don't worry it can be fixed and gave me commands which gave me hope but did nothing and it never admitted to be honest as those comments

In the end I just asked it to flash it clean so that I can atleast use my HDD which was now in the state of a limbo and it couldn't even do that.

I was just wondering about in my comment if it would have originally given me a different command or not but there are a lot more chances that it would and gaslight me than give me the right command lol

https://openzfs.github.io/openzfs-docs/man/master/8/zpool-im...

3 months ago

AI has told me to do things that would have made my system not bootable. You want a human in the loop for these types of things.

ysleepy

3 months ago

1 reply

Would it not have been easier to just mount the destroyed old pool or recovering the dataset from the history ring buffer?

zpool import -D

I haven't tried this, but I gather from the blog post that it would have been much simpler as it didn't require any of the encryption stuff.

3 months ago

1 reply

There wasn't a destroyed pool, it's the harder version of trying to rewind time on the filesystem. It's worth trying once the disks are fully backed up, but it's fussy enough that I can understand why they made it plan B.