
@codeberg My coronamuc.de page, which is hosted by your pages server, is also down, right at the time of Oktoberfest, with many people trying to view our data. PLEASE fix this.

@codeberg Ok. Seems everything is back. That was a full outage of at least 20 minutes :(

@jwildeboer sorry for the inconvenience, Jan. By chance, are there any ceph developers around who could help debug a ceph/libceph kernel driver bug?

@codeberg @jwildeboer Hmm... this doesn't look like a ceph bug, but an ext4 bug instead.

You seem to be running a Debian kernel based on upstream 5.10.136. Could you update to a more recent one? It looks like Debian has already released newer stable kernels, and I see a bunch of ext4-related fixes in (upstream) 5.10.137.
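For anyone who wants to check their own machine, here is a minimal sketch of that version comparison. It assumes the kernel's version string contains the upstream x.y.z number, as Debian's usually does. The helper names and the sample string are made up for illustration, and 5.10.137 is simply the release mentioned above, not a confirmed fix.

```python
import platform
import re


def upstream_version(text):
    """Return the first x.y.z triple found in text as a tuple of ints, or None."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(part) for part in match.groups()) if match else None


def predates(running, target):
    """True if the running kernel version is older than the target version."""
    return running is not None and running < target


if __name__ == "__main__":
    # On Debian the upstream version tends to show up in `uname -v`
    # (platform.version()); fall back to `uname -r` if it does not.
    running = upstream_version(platform.version()) or upstream_version(platform.release())
    target = (5, 10, 137)  # release mentioned in the thread, used as an example
    print("running:", running, "older than target:", predates(running, target))
```

Tuples of ints compare element by element, so 5.10.99 correctly sorts before 5.10.136, which a plain string comparison would get wrong.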

@codeberg let's get in touch when you decide you need better VMs.

@antondollmaier We're looking to migrate this to new hardware and LXC anyway. But this also requires some work ..

@codeberg ah, chicken fencing problem...
Been there as well, not just once.
Good luck! (Seriously! Get big and show GitHub who's the better platform :) )

@gwenn @codeberg @jwildeboer sadly no, I don't have a #ceph running yet - maybe in half a year with @tercean (he has a little experience).
(I am not sure whether @mortzu has run it in the past.)

@codeberg @jwildeboer One thing I just realized is that this seems to be happening while the system is trying to get free memory:

Ceph has some work queued, but there's no memory available. The shrinker kicks in and ext4 is selected to free some memory. And that's where things go south.

So, a possible workaround is to increase system memory. If upgrading the kernel doesn't fix it, I'd suggest reporting a bug to the ext4 mailing list, linux-ext4@vger.kernel.org. No need to register, it's an open list.
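To get a rough idea of whether a box is running short on memory, something like the sketch below may help. It reads the `MemTotal` and `MemAvailable` fields from `/proc/meminfo` on Linux; the function names are hypothetical, and what counts as "too low" is left to whoever runs it.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer value."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])  # sizes are reported in kB
    return values


def available_fraction(values):
    """Fraction of total memory the kernel estimates is available without swapping."""
    return values["MemAvailable"] / values["MemTotal"]


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as handle:
        info = parse_meminfo(handle.read())
    print(f"available: {available_fraction(info):.1%} of {info['MemTotal']} kB")
```

A single reading only shows the current state, so it would need to be sampled over time to catch the moments when the shrinker is likely to run.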

@henrix @jwildeboer yes, and the first lines report a permissions violation at the hypervisor/KVM level

@codeberg @henrix Good to see the community coming together to help understand and hopefully fix this one :) Totally worth the short frustration. Keep it going, wonderful people!

@codeberg @jwildeboer I think @ingo deals with ceph a lot? Not sure if he's got time for something like that.

@tuxflo @codeberg @jwildeboer I'm not a ceph dev, but maybe ask on the ceph mailing list

@codeberg @jwildeboer I guess also the obvious question is which kernel and ceph version are you running and what are the system specs?

@mongrelion @jwildeboer vanilla Debian stable. For details, let's please discuss offline to avoid too much noise in here

@codeberg @jwildeboer so from what I understand this is (likely) because of memory pressure and then some issue (bug/failed mitigation) in ext4.
I am no expert on this, but out of interest I have followed the development of XFS for a while. They seem very conscious of failure cases, to the point that they successfully countered a bug claim by pointing out that the kernel subsystem was defying its own memory allocation, rather than there being a bug in XFS. That whole filesystem seems to be very carefully and deliberately designed. It has also been the default filesystem for a number of distros.

If you have to evaluate configurations, please consider XFS as a file system too.

social.wildeboer.net

Mastodon instance for people with Wildeboer as their last name