Hi
We got a regression report in Debian after the update from 6.1.133 to
6.1.135. Melvin is reporting that discard/trim through a RAID10 array
stalls indefinitely. The full report is inlined below and originates
from https://bugs.debian.org/1104460 .
Hi Moritz,
On Mon, May 05, 2025 at 01:47:15PM +0200, Moritz Mühlenhoff wrote:
On Wed, Apr 30, 2025 at 05:55:20PM +0200, Salvatore Bonaccorso wrote:
Hi
We got a regression report in Debian after the update from 6.1.133 to
6.1.135. Melvin is reporting that discard/trim through a RAID10 array
stalls indefinitely. The full report is inlined below and originates
from https://bugs.debian.org/1104460 .
JFTR, we ran into the same problem with a few Wikimedia servers running 6.1.135 and RAID 10: The servers started to lock up once fstrim.service
got started. Full oops messages are available at https://phabricator.wikimedia.org/P75746
Thanks for these additional data points. Assuming you won't be able to
test the other stable series where the commit d05af90d6218
("md/raid10: fix missing discard IO accounting") went in, might you at
least be able to test the 6.1.y branch with the commit reverted again
and manually trigger the issue?
If needed I can provide a test Debian package of 6.1.135 (or 6.1.137)
with the patch reverted.
On 2025-05-05 18:02:37, Salvatore Bonaccorso wrote:
So one additional data point, as several Debian users were reporting
back being affected: one user did upgrade to 6.12.25 (where the
commit was backported as well) and is not able to reproduce the issue there.
That would be me.
I can reproduce the issue as outlined by Moritz above fairly reliably in
6.1.135 (Debian package 6.1.0-34-amd64). The reproducer is simple, on a
RAID-10 host:
1. reboot
2. systemctl start fstrim.service
We're tracking the issue internally in:
https://gitlab.torproject.org/tpo/tpa/team/-/issues/42146
I've managed to work around the issue by upgrading to the Debian package
from testing/unstable (6.12.25), as Salvatore indicated above. There,
fstrim doesn't cause any crash and completes successfully. In stable, it
just hangs there forever. The kernel doesn't completely panic and the
machine is otherwise somewhat still functional: my existing SSH
connection keeps working, for example, but new ones fail. And an `apt install` of another kernel hangs forever.
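
Because the hang described above leaves the calling shell stuck, it may help to run the reproducer under a deadline instead of in the foreground. The following is only a minimal sketch and is not from the thread: the helper names `has_raid10` and `run_with_deadline` are hypothetical, and the polling interval and output wording are arbitrary choices. It polls for a status file rather than using `timeout`, on the assumption that a process blocked inside the kernel may not react to signals.

```shell
#!/bin/sh
# Hypothetical helpers (not from the thread) for running the reproducer
# without blocking the calling shell.

# Return 0 if an mdstat-format file (default: /proc/mdstat) lists an
# array whose level is raid10. Only "mdX : ..." lines are considered, so
# the "Personalities" line alone does not count as a match.
has_raid10() {
    grep -Eq '^md[^ ]* *: .* raid10 ' "${1:-/proc/mdstat}"
}

# Run a command in the background and wait at most $1 seconds for it.
# Prints "completed (exit N)" and returns 0 if it finished in time,
# otherwise prints "still running after Ns" and returns 1.
run_with_deadline() {
    deadline=$1
    shift
    tmpdir=$(mktemp -d) || return 2
    status_file=$tmpdir/status
    # The subshell records the exit status once the command returns.
    ( "$@"; echo "$?" > "$status_file" ) >/dev/null 2>&1 &
    waited=0
    while [ ! -s "$status_file" ] && [ "$waited" -lt "$deadline" ]; do
        sleep 1
        waited=$((waited + 1))
    done
    if [ -s "$status_file" ]; then
        echo "completed (exit $(cat "$status_file"))"
        rm -rf "$tmpdir"
        return 0
    fi
    # The temporary directory is left in place on purpose: the command
    # may still finish later and write its status there.
    echo "still running after ${deadline}s"
    return 1
}
```

On a test machine one could then run, as root, something like `has_raid10 && run_with_deadline 300 systemctl start fstrim.service`. The 300-second deadline is an arbitrary assumption, and since the report says the host becomes partly unusable, this is only sensible on a machine that can be rebooted.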
On Mon, May 05, 2025 at 04:00:31PM +0200, Salvatore Bonaccorso wrote:
Hi Moritz,
Thanks for these additional data points. Assuming you won't be able to
test the other stable series where the commit d05af90d6218
("md/raid10: fix missing discard IO accounting") went in, might you at
least be able to test the 6.1.y branch with the commit reverted again
and manually trigger the issue?
If needed I can provide a test Debian package of 6.1.135 (or 6.1.137)
with the patch reverted.
So one additional data point, as several Debian users were reporting
back being affected: one user did upgrade to 6.12.25 (where the
commit was backported as well) and is not able to reproduce the issue
there.
This suggests we might be missing some prerequisites in the 6.1.y series?
The user is now trying 6.1.135 with the patch reverted as well.
Hi Antoine,
On Mon, May 05, 2025 at 02:50:32PM -0400, Antoine Beaupré wrote:
On 2025-05-05 18:02:37, Salvatore Bonaccorso wrote:
So one additional data point, as several Debian users were reporting
back being affected: one user did upgrade to 6.12.25 (where the
commit was backported as well) and is not able to reproduce the issue
there.
That would be me.
I can reproduce the issue as outlined by Moritz above fairly reliably in
6.1.135 (Debian package 6.1.0-34-amd64). The reproducer is simple, on a
RAID-10 host:
1. reboot
2. systemctl start fstrim.service
We're tracking the issue internally in:
https://gitlab.torproject.org/tpo/tpa/team/-/issues/42146
I've managed to work around the issue by upgrading to the Debian package
from testing/unstable (6.12.25), as Salvatore indicated above. There,
fstrim doesn't cause any crash and completes successfully. In stable, it
just hangs there forever. The kernel doesn't completely panic and the
machine is otherwise somewhat still functional: my existing SSH
connection keeps working, for example, but new ones fail. And an `apt
install` of another kernel hangs forever.
So likely at least in 6.1.y there are missing pre-requisites causing
the behaviour.
If you can test 6.1.135-1 with the commit
4a05f7ae33716d996c5ce56478a36a3ede1d76f2 reverted, you can fetch
built packages at:
https://people.debian.org/~carnil/tmp/linux/1104460/
Melvin, the same change also went into other stable series (6.6.88,
6.12.25, 6.14.4). Can you test e.g. 6.12.25-1 from Debian unstable as
well, to see whether the regression is present there too?