• Bug#1104460: [regression 6.1.y] discard/TRIM through RAID10 blocking (w

    From Moritz Mühlenhoff@1:229/2 to All on Mon May 5 14:00:01 2025
    XPost: linux.debian.bugs.dist
    From: jmm@inutil.org

    On Wed, Apr 30, 2025 at 05:55:20PM +0200, Salvatore Bonaccorso wrote:
    Hi

    We got a regression report in Debian after the update from 6.1.133 to
    6.1.135. Melvin is reporting that discard/trim through a RAID10 array
    stalls indefinitely. The full report is inlined below and originates
    from https://bugs.debian.org/1104460 .

    JFTR, we ran into the same problem with a few Wikimedia servers running
    6.1.135 and RAID 10: The servers started to lock up once fstrim.service
    got started. Full oops messages are available at https://phabricator.wikimedia.org/P75746

    Cheers,
    Moritz

    --- SoupGate-Win32 v1.05
    * Origin: you cannot sedate... all the things you hate (1:229/2)
  • From Salvatore Bonaccorso@1:229/2 to Salvatore Bonaccorso on Mon May 5 18:10:01 2025
    XPost: linux.debian.bugs.dist
    From: carnil@debian.org

    On Mon, May 05, 2025 at 04:00:31PM +0200, Salvatore Bonaccorso wrote:
    Hi Moritz,

    Thanks for these additional data points. Assuming you won't be able to
    test the other stable series where the commit d05af90d6218
    ("md/raid10: fix missing discard IO accounting") went in, might you at
    least be able to test the 6.1.y branch with the commit reverted again
    and manually trigger the issue?

    If needed I can provide a test Debian package of 6.1.135 (or 6.1.137)
    with the patch reverted.

    So one additional data point, as several Debian users were reporting
    back being affected: one user did upgrade to 6.12.25 (where the
    commit was backported as well) and is not able to reproduce the issue
    there.

    This indicates we might be missing some prerequisites in the 6.1.y series?

    The user is now trying 6.1.135 with the patch reverted as well.

    Regards,
    Salvatore

  • From Moritz Mühlenhoff@1:229/2 to Salvatore Bonaccorso on Mon May 5 19:50:01 2025
    XPost: linux.debian.bugs.dist
    From: jmm@inutil.org

    On Mon, May 05, 2025 at 04:00:31PM +0200, Salvatore Bonaccorso wrote:
    Thanks for these additional data points. Assuming you won't be able to
    test the other stable series where the commit d05af90d6218
    ("md/raid10: fix missing discard IO accounting") went in, might you at
    least be able to test the 6.1.y branch with the commit reverted again
    and manually trigger the issue?

    If needed I can provide a test Debian package of 6.1.135 (or 6.1.137)
    with the patch reverted.

    I should be able to test that, yes!

    Cheers,
    Moritz

  • From Salvatore Bonaccorso@1:229/2 to All on Mon May 5 22:40:01 2025
    XPost: linux.debian.bugs.dist
    From: carnil@debian.org

    Hi Antoine,

    On Mon, May 05, 2025 at 02:50:32PM -0400, Antoine Beaupré wrote:
    On 2025-05-05 18:02:37, Salvatore Bonaccorso wrote:
    So one additional data point, as several Debian users were reporting
    back being affected: one user did upgrade to 6.12.25 (where the
    commit was backported as well) and is not able to reproduce the issue there.

    That would be me.

    I can reproduce the issue as outlined by Moritz above fairly reliably in
    6.1.135 (Debian package 6.1.0-34-amd64). The reproducer is simple, on a
    RAID-10 host:

    1. reboot
    2. systemctl start fstrim.service

    We're tracking the issue internally in:

    https://gitlab.torproject.org/tpo/tpa/team/-/issues/42146

    I've managed to work around the issue by upgrading to the Debian package
    from testing/unstable (6.12.25), as Salvatore indicated above. There,
    fstrim doesn't cause any crash and completes successfully. In stable, it
    just hangs there forever. The kernel doesn't completely panic and the
    machine is otherwise somewhat still functional: my existing SSH
    connection keeps working, for example, but new ones fail. And an `apt
    install` of another kernel hangs forever.

    So likely at least in 6.1.y there are missing pre-requisites causing
    the behaviour.

    If you can test 6.1.135-1 with the commit 4a05f7ae33716d996c5ce56478a36a3ede1d76f2 reverted then you can fetch
    built packages at:

    https://people.debian.org/~carnil/tmp/linux/1104460/

    Regards,
    Salvatore

  • From Antoine Beaupré@1:229/2 to Salvatore Bonaccorso on Mon May 5 21:00:01 2025
    XPost: linux.debian.bugs.dist
    From: anarcat@debian.org

    On 2025-05-05 18:02:37, Salvatore Bonaccorso wrote:
    So one additional data point, as several Debian users were reporting
    back being affected: one user did upgrade to 6.12.25 (where the
    commit was backported as well) and is not able to reproduce the issue
    there.

    That would be me.

    I can reproduce the issue as outlined by Moritz above fairly reliably in
    6.1.135 (Debian package 6.1.0-34-amd64). The reproducer is simple, on a
    RAID-10 host:

    1. reboot
    2. systemctl start fstrim.service
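
    Before triggering the reproducer above, it can help to confirm that the
    host really has an md RAID10 array and which Debian kernel package is
    running. The following is a minimal sketch, not part of the original
    report: the helper names `has_raid10` and `debian_pkg_version` are
    hypothetical, and it assumes that /proc/mdstat lists arrays in the form
    `mdN : active raid10 ...` and that `uname -v` embeds the package version
    after the word `Debian`.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not from the thread).

# has_raid10 [FILE]: succeed if FILE (default /proc/mdstat) lists an
# active md array whose personality is raid10.
has_raid10() {
    grep -Eq '^md[^ ]* : active .*raid10' "${1:-/proc/mdstat}"
}

# debian_pkg_version STRING: print the Debian package version found in
# `uname -v` style text, e.g. "... Debian 6.1.135-1 (2025-04-25)".
# Prints nothing if the text does not contain "Debian <version> ".
debian_pkg_version() {
    printf '%s\n' "$1" | sed -n 's/.*Debian \([^ ]*\) .*/\1/p'
}
```

    A possible use is `has_raid10 && debian_pkg_version "$(uname -v)"`. Since
    new SSH connections reportedly fail once the hang starts, it seems wise to
    keep an existing session open before running step 2.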

    We're tracking the issue internally in:

    https://gitlab.torproject.org/tpo/tpa/team/-/issues/42146

    I've managed to work around the issue by upgrading to the Debian package
    from testing/unstable (6.12.25), as Salvatore indicated above. There,
    fstrim doesn't cause any crash and completes successfully. In stable, it
    just hangs there forever. The kernel doesn't completely panic and the
    machine is otherwise somewhat still functional: my existing SSH
    connection keeps working, for example, but new ones fail. And an `apt
    install` of another kernel hangs forever.
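
    One way to tell this kind of hang apart from a full panic is to look for
    the kernel's hung-task warnings, which read `INFO: task NAME:PID blocked
    for more than N seconds.` A small sketch, assuming those warnings are
    enabled on the host and reach the kernel log; `count_hung_tasks` is a
    hypothetical helper name, and the thread itself does not say which
    messages were logged on this machine.

```shell
#!/bin/sh
# count_hung_tasks [FILE]: print how many hung-task warnings appear in
# FILE, or in standard input when no file is given.
# Note: grep -c prints 0 but exits non-zero when nothing matches.
count_hung_tasks() {
    grep -c 'blocked for more than [0-9][0-9]* seconds' "${1:--}"
}
```

    For example, `dmesg | count_hung_tasks` on the still-open session, or
    `journalctl -k -b -1 | count_hung_tasks` after a reboot if the journal is
    persistent.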

    This indicates we might be missing some prerequisites in the 6.1.y series?

    The user is now trying 6.1.135 with the patch reverted as well.

    I am embarrassed to say I couldn't figure out how to build a Debian
    package of the Linux kernel at the moment. I would be happy to test a
    built package, that said. I got stuck on various snags: the
    `debian/bin/test-patches` script seems to require a flavor (worked around
    with `-f amd64`) and in the end the build failed with:

    [...]

    ld -r -m elf_x86_64 -z noexecstack --no-warn-rwx-segments --build-id=sha1 -T scripts/module.lds -o virt/lib/irqbypass.ko virt/lib/irqbypass.o virt/lib/irqbypass.mod.o; true
    debian/bin/buildcheck.py debian/build/build_amd64_none_amd64 amd64 none amd64
    Can't read ABI reference. ABI not checked!
    make[2]: Leaving directory '/home/anarcat/dist/linux-6.1.135'
    /usr/bin/make -f debian/rules.real build_kbuild ABINAME='6.1.0-0.a.test' ARCH='amd64' DESTDIR='/home/anarcat/dist/linux-6.1.135/debian/linux-kbuild-6.1' DH_OPTIONS='-plinux-kbuild-6.1' KERNEL_ARCH='x86' PACKAGE_NAME='linux-kbuild-6.1' SOURCEVERSION='6.1.135-1a~test' SOURCE_BASENAME='linux' SOURCE_SUFFIX='' UPSTREAMVERSION='6.1' VERSION='6.1'
    make[2]: Entering directory '/home/anarcat/dist/linux-6.1.135'
    mkdir -p debian/build/build-tools/headers-tools
    /usr/bin/make ARCH=x86 O=debian/build/build-tools/headers-tools \
    INSTALL_HDR_PATH=/home/anarcat/dist/linux-6.1.135/debian/build/build-tools \
    headers_install
    make[3]: Entering directory '/home/anarcat/dist/linux-6.1.135'
    ***
    *** Configuration file ".config" not found!
    ***
    *** Please run some configurator (e.g. "make oldconfig" or
    *** "make menuconfig" or "make xconfig").
    ***
    /home/anarcat/dist/linux-6.1.135/Makefile:792: include/config/auto.conf.cmd: No such file or directory
    make[4]: *** [/home/anarcat/dist/linux-6.1.135/Makefile:801: .config] Error 1
    make[3]: *** [Makefile:250: __sub-make] Error 2
    make[3]: Leaving directory '/home/anarcat/dist/linux-6.1.135'
    make[2]: *** [debian/rules.real:530: debian/stamps/build-tools-headers] Error 2
    make[2]: Leaving directory '/home/anarcat/dist/linux-6.1.135'
    make[1]: *** [debian/rules.gen:1471: build-arch_amd64_real_kbuild] Error 2
    make[1]: Leaving directory '/home/anarcat/dist/linux-6.1.135'
    make: *** [debian/rules:40: build-arch] Error 2
    dpkg-buildpackage: error: debian/rules binary subprocess returned exit status 2

    It's been a while since I compiled linux, amazingly... It might be
    because I'm trying to compile the Debian 12 kernel on Debian 13. Here
    are the steps I took:

    curl -o 4a05f7ae33716d996c5ce56478a36a3ede1d76f2.patch 'https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/patch/?id=4a05f7ae33716d996c5ce56478a36a3ede1d76f2'
    # (reverse the patch)

    [continued in next message]
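
    The remaining steps of this message are not in the archive. As a hedged
    illustration of the "(reverse the patch)" step only, one common approach
    is to apply the downloaded patch with `patch -R`. The sketch below assumes
    GNU patch, a patch made relative to the top of the source tree (hence
    `-p1`), and an absolute path to the patch file. `revert_patch` is a
    hypothetical helper, not something from the original message.

```shell
#!/bin/sh
# revert_patch SRCDIR PATCHFILE: undo PATCHFILE inside SRCDIR.
# PATCHFILE must be an absolute path because the function changes directory.
# A dry run comes first, so a patch that does not reverse cleanly leaves
# the tree untouched and the function returns non-zero.
revert_patch() {
    (
        cd "$1" &&
        patch -R -p1 --dry-run < "$2" >/dev/null &&
        patch -R -p1 < "$2" >/dev/null
    )
}
```

    Inside a git checkout, `git apply -R` on the same file would be an
    alternative; either way the result is a tree without the change, which
    still has to be built into a package.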

  • From Antoine Beaupré@1:229/2 to Salvatore Bonaccorso on Mon May 5 23:10:01 2025
    XPost: linux.debian.bugs.dist
    From: anarcat@debian.org

    On 2025-05-05 22:36:07, Salvatore Bonaccorso wrote:
    Hi Antoine,

    So likely at least in 6.1.y there are missing pre-requisites causing
    the behaviour.

    If you can test 6.1.135-1 with the commit 4a05f7ae33716d996c5ce56478a36a3ede1d76f2 reverted then you can fetch
    built packages at:

    https://people.debian.org/~carnil/tmp/linux/1104460/

    I can confirm this kernel does not crash when running fstrim.service,
    which seems to confirm the bisect.

    A.

    --
    Drowning people
    Sometimes die
    Fighting their rescuers.
    - Octavia Butler

  • From Melvin Vermeeren@1:229/2 to Bonaccorso on Tue May 6 17:16:15 2025
    XPost: linux.debian.bugs.dist
    From: vermeeren@vermwa.re

    Hi Salvatore,

    I have been unexpectedly busy the past week and only caught up on all the
    mails just now. Many thanks to everyone involved, and for the additional
    information from several people; I am happy to see it.

    On Wednesday, 30 April 2025 17:55:20 Central European Summer Time Salvatore Bonaccorso wrote:
    Melvin, the same change went as well in other stable series, 6.6.88,
    6.12.25, 6.14.4, can you test e.g. 6.12.25-1 in Debian as well from
    unstable to see if the regression is there as well?

    Specifically for this, I did just now test this with Debian testing's 6.12.25-1, albeit on amd64 instead of ppc64le, with an identical storage
    layout and can confirm the issue does *not* exist there.

    This confirms what others have already discovered by now; I agree with the
    findings and have nothing to add specifically.

    Thanks again to all,

    --
    Melvin Vermeeren
    Systems engineer
