@jwildeboer This appears to have a lot more to do with not testing your recovery procedure, not having backups, and using non-ECC RAM in servers than it does with ZFS. There is no filesystem that can protect you from bad processes. I don't know about the full scan on import thing; I've been using ZFS for over a decade and have never had that happen. It seems like, on top of not bothering to test their recovery procedures, their ops folks also barely know ZFS.

@freakazoid I didn't see that they were not using ECC. I only saw a note that "server RAM" might be a cause. But I am always very careful not to jump to conclusions from the outside.

@jwildeboer Except for the conclusion that it's somehow ZFS's fault ;-)

@freakazoid That's your interpretation of my words. I didn't say or imply that. I found the story interesting and shareworthy for these reasons:

- One server can cause a lot of problems, even when ZFS seems to be set up in a way that should guarantee high resilience.
- Finding the root cause can be quite difficult.
- An unexpected lack of features in older versions severely slowed down the recovery process.

It's an insightful postmortem of a ZFS failure mode. Hence I shared it.

@jwildeboer It's also a good cautionary tale for those who think "Oh it's triple-replicated so we don't need backups." You always need backups. And you need to test your backups. Just like you need to test every other recovery procedure.


@freakazoid Yes, it's a cautionary tale, highlighting a lot of points to review in any DR/failure process. And that's why I shared it. Not many companies are as transparent in sharing such info.
