Hey, @codeberg - down again? This is getting a bit out of control :(

@codeberg My page, which is hosted by your pages server, is also down, right during Oktoberfest, when many people are trying to view our data. PLEASE fix this.

@codeberg Ok. Seems everything is back. That was a full outage of at least 20 minutes :(

@jwildeboer sorry for the inconvenience, Jan. By chance, are any ceph developers around who could help debug a ceph/libceph kernel driver bug?

@codeberg @jwildeboer Hmm... this doesn't look like a ceph bug, but an ext4 bug instead.

You seem to be running a Debian kernel based on upstream 5.10.136. Could you update to a more recent one? It looks like Debian has already released newer stable kernels, and I see a bunch of ext4-related fixes in (upstream) 5.10.137.
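
A rough way to check whether a box is still on something older than 5.10.137 is sketched below. It's only an illustration: the helper names `upstream_version` and `older_than` are made up here, and it assumes the upstream version appears somewhere in the string as three dotted numbers. On Debian that is usually the `uname -v` string rather than `uname -r`, so the script prints both.

```python
import platform
import re

# Upstream release mentioned above as containing ext4-related fixes.
FIXED_IN = (5, 10, 137)


def upstream_version(text):
    """Return the first x.y.z version found in text as a tuple of ints, or None."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def older_than(text, target=FIXED_IN):
    """True if the version found in text is older than target, None if no version is found."""
    version = upstream_version(text)
    if version is None:
        return None
    return version < target


if __name__ == "__main__":
    # platform.version() is the `uname -v` string, platform.release() is `uname -r`.
    for label, text in (("version", platform.version()), ("release", platform.release())):
        print(label, repr(text), "older than 5.10.137:", older_than(text))
```

A `None` result just means no x.y.z version was found in that string, not that the kernel is fine.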

@codeberg let's get in touch when you decide you need better VMs.

@antondollmaier We're looking to migrate this to new hardware and LXC anyway. But this also requires some work ..

@codeberg ah, chicken fencing problem...
Been there as well, not just once.
Good luck! (Seriously! Get big and show GitHub who's the better platform :) )

@gwenn @codeberg @jwildeboer sadly no, I don't have #ceph running yet - maybe in half a year with @tercean (he has a little experience with it).
(I am not sure whether @mortzu has run it in the past.)

@codeberg @jwildeboer One thing I just realized is that this seems to be happening while the system is trying to get free memory:

Ceph has some work queued, but there's no memory available. The shrinker kicks in and ext4 is selected to free some memory. And that's where things go south.

So, a possible workaround is to increase system memory. If upgrading the kernel doesn't fix it, I'd suggest reporting a bug to the ext4 mailing list. No need to register, it's an open list.
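
To get a feel for how close the machine is to that situation, something like the sketch below could be run on the host. It is a minimal illustration, not a diagnosis: `parse_meminfo` and `available_fraction` are hypothetical helper names, and it only reads the standard `MemTotal` and `MemAvailable` fields from `/proc/meminfo`. A low value hints at memory pressure but does not prove that reclaim is what triggers the bug.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> number (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only well-formed "Name: <number> [unit]" lines.
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_fraction(fields):
    """Fraction of total memory the kernel estimates is available, or None if unknown."""
    total = fields.get("MemTotal")
    available = fields.get("MemAvailable")
    if not total or available is None:
        return None
    return available / total


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as handle:  # Linux only
            info = parse_meminfo(handle.read())
    except OSError:
        info = {}
    print("available fraction:", available_fraction(info))
```

Sampling this every few seconds around the time of an outage would show whether available memory really drops before things go wrong.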

@henrix @jwildeboer yes, and the first lines report a permissions violation at the hypervisor/KVM level

@codeberg @henrix Good to see the community coming together to help understand and hopefully fix this one :) Totally worth the short frustration. Keep it going, wonderful people!

@codeberg @jwildeboer I think @ingo deals with ceph a lot? Not sure if he's got time for something like that.

@tuxflo @codeberg @jwildeboer I'm not a ceph dev, but maybe ask on the ceph mailing list

@codeberg @jwildeboer I guess also the obvious question is which kernel and ceph version are you running and what are the system specs?

@mongrelion @jwildeboer vanilla Debian stable. For details, let's please discuss offline to avoid too much noise in here

@codeberg @jwildeboer so from what I understand, this is (likely) because of memory pressure and then some issue (bug/failed mitigation) in ext4.
I am no expert on this, but out of interest I have followed the development of XFS for a while. They seem very conscious of failure cases, to the point that they successfully countered a claim by pointing out that the problem was the kernel subsystem defying its own memory allocation, not a bug in XFS. That whole filesystem seems to be very carefully and deliberately designed. It has also been the default filesystem for a number of distros.

If you have to evaluate configurations, please consider XFS as a file system too.
