A warning story for the #ZFS people: the #Gandi storage failure postmortem https://news.gandi.net/en/2020/01/postmortem-of-the-failure-of-one-hosting-storage-unit-at-lu-bi1-on-january-8-2020/
@jwildeboer What is the warning in this story? It's a good post-mortem of a random failure and of why recovery was slow...
but recover they did.
I've had harder crashes with other filesystems, or even DB recoveries.
@Maescool The "slow to recover" is quite a warning IMHO.
@jwildeboer That was due to the old version they were using, and they took the safest route so as not to lose data. Yeah, 'slow to recover' also happens with MySQL xD
@jwildeboer This appears to have a lot more to do with not testing your recovery procedure, not having backups, and using non-ECC RAM in servers than it does with ZFS. There is no filesystem that can protect you from bad processes. I don't know about the full scan for import thing; I've been using ZFS for over a decade and have never had that happen. It seems like, on top of not bothering to test their recovery procedures, their ops folks also barely know ZFS.
@freakazoid I didn't see that they were not using ECC. I only saw a note that "server RAM" might be a cause. But I am always very careful not to jump to conclusions from the outside.
@jwildeboer Except for the conclusion that it's somehow ZFS's fault ;-)
@freakazoid That's your interpretation of my words. I didn't say or imply that. I found the story interesting and shareworthy for these reasons:
- One server can cause a lot of problems, even when ZFS seems to be set up in a way that should guarantee high resilience.
- Finding the root cause can be quite difficult
- An unexpected lack of features in the older version, which severely slowed down the recovery process.
It's an insightful postmortem of a ZFS failure mode. Hence I shared it.
@jwildeboer It's also a good cautionary tale for those who think "Oh it's triple-replicated so we don't need backups." You always need backups. And you need to test your backups. Just like you need to test every other recovery procedure.
@jwildeboer Good find and nice writeup from Gandi! We should make restore exercises mandatory, I guess... Maybe find some cyber reasons for it :-) Just "restore" sounds boring.