A good read. Thanks for the hint.

"Keep your software as up-to-date as possible"

#zfs #Ghandi

@jwildeboer What is the warning in this story? It's a good post-mortem of a random failure and why it was slow to recover.
But recover they did.
I've had harder crashes with other filesystems, or even DB recoveries.

@jwildeboer Due to the old version they were using, and because they used the safest way not to lose data. Yeah, 'slow to recover' also happens with MySQL xD

@jwildeboer This appears to have a lot more to do with not testing your recovery procedure, not having backups, and using non-ECC RAM in servers than it does with ZFS. There is no filesystem that can protect you from bad processes. I don't know about the full scan for import thing; I've been using ZFS for over a decade and have never had that happen. It seems like on top of not bothering to test their recovery procedures their ops folks also barely know ZFS.

@freakazoid I didn't see that they were not using ECC. I only saw a note that "server RAM" might be a cause. But I am always very careful not to jump to conclusions from the outside.

@jwildeboer Except for the conclusion that it's somehow ZFS's fault ;-)

@freakazoid That's your interpretation of my words. I didn't say or imply that. I found the story interesting and shareworthy for these reasons:

- One server can cause a lot of problems, even when ZFS seems to be set up in a way that should guarantee high resilience.
- Finding the root cause can be quite difficult.
- Features missing from older versions came as a surprise and severely slowed down the recovery process.

It's an insightful postmortem of a ZFS failure mode. Hence I shared it.

@jwildeboer I guess we should also take into consideration the fact that, IIRC, the boxes involved were running Nexenta's old implementation of ZFS.
I've already seen this happen on old illumos boxes.


@jwildeboer It's also a good cautionary tale for those who think "Oh it's triple-replicated so we don't need backups." You always need backups. And you need to test your backups. Just like you need to test every other recovery procedure.

@freakazoid Yes, it's a cautionary tale, highlighting a lot of points to review in any DR/failure process. And that's why I shared it. Not many companies are as transparent in sharing such info.

@jwildeboer good find and nice writeup from Gandi! We should make restore exercises mandatory I guess... maybe find some cyber reasons for it :-) just "restore" sounds boring

