• Linux kernel stability fixes for older SPARCs

    From John Paul Adrian Glaubitz@21:1/5 to All on Tue Sep 3 10:20:02 2024
    Hi Rene,

    according to these posts [1][2] by Iggi, you figured out the stability problem with newer kernels on older SPARC machines. There has been a regression on older
    SPARCs since around kernel 4.19.x which I haven't gotten around to bisecting yet.

    If you've found and fixed the bug in question, it would be great if you could share
    your fix with the community and maybe whip up a kernel patch to fix the bug upstream.

    Newer SPARCs are not affected by this bug, although there are other issues.

    Thanks,
    Adrian

    [1] https://x.com/Iggi76123640/status/1828396228444389600
    [2] https://x.com/Iggi76123640/status/1828673611080589641

    --
    .''`. John Paul Adrian Glaubitz
    : :' : Debian Developer
    `. `' Physicist
    `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Paul Adrian Glaubitz@21:1/5 to All on Tue Sep 3 11:30:01 2024
    Hello Rene,

    On Tue, 2024-09-03 at 11:09 +0200, René Rebe wrote:
    according to these posts [1][2] by Iggi, you figured out the stability problem

    No, we are just sometimes lucky that it runs stable for that long. I was only made aware
    recently that sun4u was not 100% stable. Until some years ago my fastest UltraSPARC was only
    a 360 MHz Ultra 5, and I was only recently donated a Sun Blade 1000. I see some MM corruption
    that I want to hunt down next.

    Hmm, ok. I was under the impression that you made some changes that made the kernel
    on Iggi's machine stable. Currently, the kernel crashes randomly on older SPARCs
    such as reported by Iggi:

    https://x.com/Iggi76123640/status/1827658841581896152

    with newer kernels on older SPARC machines. There has been a regression on older
    SPARCs since around kernel 4.19.x which I haven't gotten around to bisecting yet.

    Happy to bisect. I guess you mean the random memory corruption I see, or something else?

    Not sure what the underlying issue is, but the kernel just crashes completely.

    If you have issues to bisect, just let us know, for any arch. Given T2’s cross-compile
    support and the fact that I have most hardware in my museum now, I can usually bisect issues
    within a day or two.

    I don't have issues with bisecting, I'm just rather time-constrained at the moment, so
    I'm always happy when someone else can step in and help. Would be great to get this issue
    fixed upstream.

    If you've found and fixed the bug in question, it would be great if you could share
    your fix with the community and maybe whip up a kernel patch to fix the bug upstream.


    Of course - all patches are always nicely sorted in our public and nicely readable
    SVN tree in any case.

    https://t2linux.com

    Is there a web view available? I'm not really a big fan of SVN, to be honest.

    Newer SPARCs are not affected by this bug, although there are other issues.

    You mean sun4v? I found a cheap T4-1 some months ago, and T2/Linux appears
    to run stable on that. Any list of issues w/ sun4v I should be aware of?

    Linux runs mostly stable on sun4v, but there are filesystem corruption issues when you
    run Linux inside an LDOM on Solaris 11.3 and 11.4 even with the latest SRU of Solaris.

    These happen rarely, but they do occur and they are quite annoying as they mandate rebooting
    the LDOM as the root filesystem is mounted read-only and the filesystems have errors afterwards.

    It seems to be a bug in the LDOM vdisk driver (drivers/block/sunvdc.c).

    Adrian

  • From John Paul Adrian Glaubitz@21:1/5 to Gregor Riepl on Tue Sep 3 19:50:01 2024
    Hi Gregor,

    On Tue, 2024-09-03 at 19:19 +0200, Gregor Riepl wrote:
    If you have issues to bisect, just let us know, for any arch. Given T2’s cross-compile
    support and the fact that I have most hardware in my museum now, I can usually bisect issues
    within a day or two.

    I don't have issues with bisecting, I'm just rather time-constrained at the moment, so
    I'm always happy when someone else can step in and help. Would be great to get this issue
    fixed upstream.

    My Ultra 10 and Fire V215 are desperately waiting for a more stable kernel.
    I actually wanted to offer help with bisecting, but held back due to a lack of time and of
    a suitable build system (compiling kernels is so time-consuming).

    Any help is welcome ;-).

    I may have some time to do test runs next week.
    Could you give me some quick starters for setting up a kernel cross build env on
    an amd64 machine, or maybe access to a Sun box I could use?

    It's actually pretty simple these days as the kernel.org mirrors provide binary distributions of freestanding toolchains for all major supported architectures of the Linux kernel.

    To set up on any x86_64 machine, do the following:

    # wget https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/x86_64/14.2.0/x86_64-gcc-14.2.0-nolibc-sparc64-linux.tar.gz
    # tar xf x86_64-gcc-14.2.0-nolibc-sparc64-linux.tar.gz
    # export PATH=$PATH:$PWD/gcc-14.2.0-nolibc/sparc64-linux/bin/
    # git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
    # cd linux
    # export ARCH=sparc
    # export CROSS_COMPILE=sparc64-linux-
    # make sparc64_defconfig
    # make -j<number of parallel jobs>

    The cross-compiled kernel will be available as "vmlinux".
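
The steps above can be collected into a small wrapper script. This is only a sketch: it assumes the toolchain tarball has already been unpacked in the current directory and that the script is started from the top of a kernel tree. The names TOOLCHAIN_BIN, JOBS and build_sparc64_kernel are made up for illustration.

```shell
#!/bin/sh
# Sketch of a cross-build wrapper; TOOLCHAIN_BIN and JOBS are illustrative names.
# Assumes the kernel.org crosstool tarball was unpacked in the current directory.
TOOLCHAIN_BIN="${TOOLCHAIN_BIN:-$PWD/gcc-14.2.0-nolibc/sparc64-linux/bin}"
PATH="$PATH:$TOOLCHAIN_BIN"
export PATH

# Same settings as in the manual steps.
ARCH=sparc
CROSS_COMPILE=sparc64-linux-
export ARCH CROSS_COMPILE

# Use one job per online CPU; fall back to 1 if nproc is unavailable.
JOBS="$(nproc 2>/dev/null || echo 1)"

build_sparc64_kernel() {
    make sparc64_defconfig && make -j"$JOBS"
}

# Only start a build when run from the top of a kernel source tree.
if [ -f Makefile ] && [ -d arch/sparc ]; then
    build_sparc64_kernel
fi
```

Run outside a kernel tree, the script only sets up the environment and does not build anything.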

    Adrian

  • From John Paul Adrian Glaubitz@21:1/5 to Gregor Riepl on Wed Sep 4 07:10:01 2024
    Hi Gregor,

    On Wed, 2024-09-04 at 01:22 +0200, Gregor Riepl wrote:

    It's actually pretty simple these days as the kernel.org mirrors provide binary
    distributions of freestanding toolchains for all major supported architectures
    of the Linux kernel.

    To set up on any x86_64 machine, do the following:

    # wget https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/x86_64/14.2.0/x86_64-gcc-14.2.0-nolibc-sparc64-linux.tar.gz
    # tar xf x86_64-gcc-14.2.0-nolibc-sparc64-linux.tar.gz
    # export PATH=$PATH:$PWD/gcc-14.2.0-nolibc/sparc64-linux/bin/
    # git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
    # cd linux
    # export ARCH=sparc
    # export CROSS_COMPILE=sparc64-linux-
    # make sparc64_defconfig
    # make -j<number of parallel jobs>

    The cross-compiled kernel will be available as "vmlinux".

    Very good, thanks!

    Thanks for looking into it!

    I'm especially interested in finding a proper reproducer which would make
    the bisecting process much easier. So far, the crashes seem to be rather
    random although they mainly occur with newer kernels.

    FWIW, I found a very handy patch yesterday which could help with debugging these crashes
    once it's been merged into the upstream kernel [1]. It dumps the back of the stack after
    a stack corruption has occurred, which should in theory help find what part of the kernel
    is responsible for the stack corruption.

    It looks like this particular crash we have been seeing on the older SPARCs
    was always due to stack corruption which could mean that it's related to
    a driver or arch-specific code that is used on the older SPARCs but not on
    the newer machines.

    Adrian

    [1] https://lore.kernel.org/lkml/20231219032254.96685-1-feng.tang@intel.com/

  • From John Paul Adrian Glaubitz@21:1/5 to Gregor Riepl on Wed Sep 18 23:30:01 2024
    On Wed, 2024-09-18 at 18:17 +0200, Gregor Riepl wrote:
    My first attempt at bisecting ran into lots of compilation issues with the default config of each version and gcc 14.
    All the 4.x and 5.x kernels fail with the following errors (at least, some versions have more):

    arch/sparc/kernel/mdesc.c: In function 'mdesc_node_by_name':
    arch/sparc/kernel/mdesc.c:646:22: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
    646 | if (!strcmp(names + ep[ret].name_offset, name))
    | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    arch/sparc/kernel/mdesc.c:78:33: note: at offset [32, 8589934606] into source object 'mdesc' of size 16
    78 | struct mdesc_hdr mdesc;
    | ^~~~~
    ...
    In function 'kernel_lds_init',
    inlined from 'report_memory' at arch/sparc/mm/init_64.c:3112:2:
    arch/sparc/mm/init_64.c:3102:31: error: array subscript -1 is outside array bounds of 'char[]' [-Werror=array-bounds=]
    3102 | data_resource.end = compute_kern_paddr(_edata - 1);
    | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    ./include/asm-generic/sections.h: In function 'report_memory':
    ./include/asm-generic/sections.h:36:32: note: at offset -1 into object '_edata' of size [0, 9223372036854775807]
    36 | extern char _data[], _sdata[], _edata[];
    | ^~~~~~
    ...

    Yeah, a lot of these warnings, which are treated as errors if CONFIG_WERROR is set, have actually been fixed in the kernel since.

    Next issue: The default kernel config lacks some essential drivers to make my system bootable. For my Fire V215,
    at least CONFIG_FUSIONMPT and CONFIG_CGROUPS are needed, plus a few other things. systemd requires cgroups v2
    support these days.

    The default configs for 32-bit and 64-bit SPARC could probably see an update here.

    I started off with a default config in the first bisect step (corresponding with 5.14), added the required options,
    and then did a make oldconfig in each subsequent step, answering all questions with the default.
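
The per-step procedure described above could be scripted roughly as follows. This is a sketch, not the procedure actually used in the thread: the option names are copied from the message and should be checked against the Kconfig files of the tree being built, and config_args and REQUIRED_OPTS are hypothetical names. It relies on the kernel's scripts/config helper and on "make olddefconfig", which takes the default answer for every new option without prompting.

```shell
#!/bin/sh
# Sketch: re-apply a fixed set of options at every bisect step.
# The option names are copied from the message above and are assumptions;
# check them against the Kconfig files of the tree being built.
REQUIRED_OPTS="FUSIONMPT CGROUPS"

# Turn a list of option names into arguments for scripts/config.
config_args() {
    args=""
    for opt in "$@"; do
        args="$args --enable $opt"
    done
    printf '%s\n' "${args# }"
}

# Only touch the configuration when run from a configured kernel tree.
if [ -x scripts/config ] && [ -f .config ]; then
    # Word splitting of the unquoted expansions is intentional here.
    scripts/config $(config_args $REQUIRED_OPTS)
    # olddefconfig accepts the default for every new question without prompting.
    make olddefconfig
fi
```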

    "make localmodconfig" is probably easier in this case.

    Building with make bindeb-pkg produces an almost usable kernel package. For some reason, grub-ieee1275 requires an
    unpacked kernel, so the installed vmlinuz needed to be gunzipped afterwards.

    That's not an arbitrary reason, but simply a requirement for GRUB on SPARC due to size limitations. It's documented
    in the GRUB manual.
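
A minimal sketch of the unpacking step, assuming the installed image is gzip-compressed as described above. The helper name unpack_kernel is made up, and the source and destination paths are left to the caller.

```shell
#!/bin/sh
# Sketch: write an uncompressed copy of a gzip-compressed kernel image.
# unpack_kernel is a hypothetical helper name; the paths are up to the caller.
unpack_kernel() {
    src=$1
    dst=$2
    # gzip -dc decompresses to stdout and leaves the original file untouched.
    gzip -dc "$src" > "$dst"
}
```

It could be called as, for example, "unpack_kernel /boot/vmlinuz-6.10.0 /boot/vmlinux-6.10.0" (both paths are assumptions), after which the GRUB entry has to point at the unpacked file.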

    Now for the actual testing... triggering a panic/oops reliably was difficult. The Debian 6.10 kernel usually crashes
    relatively quickly on disk I/O, and enabling swap accelerates the effect. bonnie++ should therefore make for a good
    stress test.

    I haven't found a good reproducer yet, either unfortunately.
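
As a rough aid for such stress runs, a kernel log can be scanned for crash markers. The sketch below only looks for a few strings, one of which ("Kernel illegal instruction") appears in a trace quoted later in this thread. The helper name log_has_crash and the choice of patterns are assumptions, not an exhaustive list.

```shell
#!/bin/sh
# Sketch: report whether a kernel log read from stdin contains a crash marker.
# log_has_crash is a hypothetical helper; the patterns are a small, assumed set.
log_has_crash() {
    grep -Eq 'Kernel illegal instruction|Kernel panic|Oops'
}
```

On the test machine this could be combined with a loop such as "while ! dmesg | log_has_crash; do bonnie++ -d /mnt/scratch -u root; done", where /mnt/scratch is an assumed scratch directory.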

    I don't have the exact commit IDs of each bisection step, but it was (roughly) 5.14-rc6, 6.6-rc7, 6.8-rc3, 6.9, 6.10.

    There were a few odd non-critical issues, such as this I/O error with 5.14 (but nothing in dmesg):

    $ /usr/sbin/bonnie++
    Writing a byte at a time...done
    Writing intelligently...done
    Rewriting...Can't write block.: Unknown error 2560
    Bonnie: drastic I/O error (re write(2)): Unknown error 2560

    Just use "git bisect skip" in this case to skip unrelated regressions.
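
To keep the bookkeeping consistent across steps, the observation made at each step can be mapped onto the matching "git bisect" subcommand. This is a sketch only: the helper name bisect_verdict and the three outcome words are made up for illustration.

```shell
#!/bin/sh
# Sketch: map the observation made at one bisect step to a git bisect subcommand.
# bisect_verdict and the outcome words "crash" and "stable" are hypothetical.
bisect_verdict() {
    case "$1" in
        crash)  echo bad ;;    # the crash being hunted was reproduced
        stable) echo good ;;   # the stress test passed without the crash
        *)      echo skip ;;   # unrelated breakage: leave this commit out
    esac
}
```

A step would then be recorded with, for example, git bisect "$(bisect_verdict crash)".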

    6.2 produces this warning at boot:

    [ +21.090317] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [ +1.422401] rcu: 0-...!: (1 GPs behind) idle=a29c/0/0x1 softirq=18/19 fqs=44
    [ +0.093960] (detected by 0, t=2246 jiffies, g=-1175, q=989 ncpus=2)
    [ +0.083646] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.2.0-rc7+ #18
    [ +0.083641] TSTATE: 0000004411001605 TPC: 000000000042beac TNPC: 000000000042beb0 Y: 00000000 Not tainted
    [ +0.129479] TPC: <arch_cpu_idle+0x8c/0xa0>
    [ +0.053848] g0: 00000000004209d0 g1: 00000000015282c0 g2: 00000000015105c8 g3: 0000000000000001
    [ +0.114585] g4: fff0000000390ba0 g5: fff000027e2f0000 g6: fff0000000398000 g7: 00000000173aa294
    [ +0.114582] o0: fff0000000390ba0 o1: 0000000000000001 o2: 000000000130ae78 o3: 00000000015105c8
    [ +0.114580] o4: 00000000015280c0 o5: 000000000130b580 sp: fff000000039b3d1 ret_pc: 000000000042bea0
    [ +0.119164] RPC: <arch_cpu_idle+0x80/0xa0>
    [ +0.053850] l0: 0000000001407f20 l1: 0000000000022c05 l2: 0000000000000000 l3: 000000000130b538
    [ +0.114585] l4: 000000000130b400 l5: 0000000000000040 l6: 0000000000000000 l7: 0000000001408140
    [ +0.114581] i0: 00000000173aa299 i1: fff000027f814990 i2: 0000000000000001 i3: 0000000000000001
    [ +0.114580] i4: fff000027f814990 i5: 0000000001524990 i6: fff000000039b481 i7: 0000000000b22f68
    [ +0.114582] I7: <default_idle_call+0x48/0x100>
    [ +0.058433] Call Trace:
    [ +0.032082] [<0000000000b22f68>] default_idle_call+0x48/0x100
    [ +0.075624] [<00000000004adc28>] do_idle+0x108/0x180
    [ +0.065311] [<00000000004adf34>] cpu_startup_entry+0x14/0x40
    [ +0.074477] [<000000000043ede4>] smp_callin+0xe4/0x120
    [ +0.067603] [<0000000001318614>] 0x1318614
    [ +0.053853] [<0000000040000000>] 0x40000000

    FWIW, it could be an idea to run the RCU torture test as a check for bisecting.

    See: https://docs.kernel.org/RCU/torture.html

    From what I know, there are a number of hidden bugs in the RCU implementation on some architectures.
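
A sketch of how a torture run could be evaluated, going by the documentation linked above: the test is started by loading the rcutorture module and, when the module is removed again, the kernel log contains an "End of test" line. The exact log format matched below is an assumption taken from that documentation, and rcutorture_passed is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: check a kernel log read from stdin for a successful rcutorture run.
# The matched line format is an assumption based on the kernel documentation.
rcutorture_passed() {
    grep -q 'torture:.*End of test: SUCCESS'
}
```

On a kernel built with CONFIG_RCU_TORTURE_TEST=m this could be used as root like so: "modprobe rcutorture; sleep 600; rmmod rcutorture; dmesg | rcutorture_passed". The ten-minute duration is arbitrary.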

    It also failed to shut down properly:

    [ 1634.268777] systemd-journald[181]: Failed to send WATCHDOG=1 notification message: Connection refused
    [ 1754.268963] systemd-journald[181]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected

    The shutdown got stuck after that. I did not see this with any other kernels.

    From 6.2 onward, the tg3 network driver produces this warning at shutdown (but it proceeds from there without issue):

    [ 1594.751376] ------------[ cut here ]------------
    [ 1594.812280] WARNING: CPU: 0 PID: 3914 at kernel/irq/msi.c:196 msi_domain_free_descs+0xdc/0x100
    [ 1594.925813] Modules linked in: binfmt_misc flash sg fuse autofs4 dm_mod mptsas sr_mod scsi_transport_sas mptscsih ehci_pci mptbase tg3 cdrom ehci_hcd libphy
    [ 1595.110450] CPU: 0 PID: 3914 Comm: ip Not tainted 6.2.0-rc7+ #18
    [ 1595.189586] Call Trace:
    [ 1595.221667] [<0000000000465da8>] __warn+0xe8/0x120
    [ 1595.284686] [<0000000000b11088>] warn_slowpath_fmt+0x30/0x70
    [ 1595.359165] [<00000000004cdbfc>] msi_domain_free_descs+0xdc/0x100
    [ 1595.439371] [<00000000004ce878>] msi_domain_free_msi_descs_range+0x18/0x40
    [ 1595.529891] [<0000000000819984>] pci_free_msi_irqs+0x4/0x20
    [ 1595.603222] [<0000000000817e94>] pci_disable_msi+0x54/0x80
    [ 1595.675408] [<00000000100b0464>] tg3_ints_fini+0x64/0xe0 [tg3]
    [ 1595.752282] [<00000000100c880c>] tg3_stop+0x22c/0x2c0 [tg3]
    [ 1595.825614] [<00000000100c88c0>] tg3_close+0x20/0xa0 [tg3]
    [ 1595.897799] [<000000000096c8e8>] __dev_close_many+0x88/0x100
    [ 1595.972278] [<0000000000976c64>] __dev_change_flags+0xa4/0x1e0
    [ 1596.049047] [<0000000000976db8>] dev_change_flags+0x18/0x60
    [ 1596.122378] [<00000000009872a0>] do_setlink+0x2e0/0x1140
    [ 1596.192273] [<000000000098d138>] __rtnl_newlink+0x3f8/0x7e0
    [ 1596.265605] [<000000000098d550>] rtnl_newlink+0x30/0x60
    [ 1596.334353] [<0000000000986a7c>] rtnetlink_rcv_msg+0x27c/0x360
    [ 1596.411144] ---[ end trace 0000000000000000 ]---

    On 6.6, I got this warning at boot:

    [ +21.089612] rcu: INFO: rcu_sched self-detected stall on CPU
    [ +0.000007] rcu: 1-....: (281 ticks this GP) idle=36cc/1/0x4000000000000002 softirq=28/28 fqs=1050
    [ +0.000012] rcu: (t=2101 jiffies g=-1175 q=1029 ncpus=2)
    [ +0.000007] CPU: 1 PID: 1 Comm: swapper/1 Not tainted 6.6.0-rc7+ #19
    [ +0.000008] TSTATE: 0000004411001602 TPC: 00000000004c23f0 TNPC: 00000000004c23f4 Y: 00001f91 Not tainted
    [ +0.000005] TPC: <console_flush_all+0x1d0/0x4a0>
    [ +0.000018] g0: 00000000004c23f0 g1: 000000000154bca0 g2: 0000000000000000 g3: 00000000016e1400
    [ +0.000004] g4: fff0001004510000 g5: fff000103d2b6000 g6: fff0001004658000 g7: 000000000000000e
    [ +0.000004] o0: 00000000016e17f8 o1: 0000000000000000 o2: 0000000000000000 o3: 000000000000004d
    [ +0.000004] o4: 00000000016e0bd8 o5: 0000000001753250 sp: fff000100465a9c1 ret_pc: 00000000004c23e4
    [ +0.000004] RPC: <console_flush_all+0x1c4/0x4a0>
    [ +0.000007] l0: 000000000133b078 l1: 0000000000000000 l2: 0000000000000000 l3: 0000000000000000
    [ +0.000004] l4: 0000000001435400 l5: 0000000000000000 l6: 00000000016e0bd8 l7: 00000000014b0840
    [ +0.000004] i0: 0000000000000000 i1: fff000100465b368 i2: fff000100465b367 i3: 00000000016e1400
    [ +0.000004] i4: 00000000016e0bd8 i5: 00000000016e17f8 i6: fff000100465aab1 i7: 00000000004c2730
    [ +0.000004] I7: <console_unlock+0x70/0xe0>
    [ +0.000008] Call Trace:
    [ +0.000003] [<00000000004c2730>] console_unlock+0x70/0xe0
    [ +0.000007] [<00000000004c3c8c>] vprintk_emit+0x1cc/0x220
    [ +0.000009] [<0000000000b32aa4>] _printk+0x24/0x34
    [ +0.000014] [<00000000008851e8>] serial_core_register_port+0x468/0x6c0
    [ +0.000007] [<0000000000888998>] su_probe+0x178/0x3c0
    [ +0.000009] [<0000000000898fe8>] platform_probe+0x28/0x80
    [ +0.000006] [<0000000000896bf8>] really_probe+0xb8/0x2e0
    [ +0.000011] [<0000000000896f04>] driver_probe_device+0x24/0xe0
    [ +0.000007] [<0000000000897104>] __driver_attach+0x64/0x120
    [ +0.000007] [<0000000000894c10>] bus_for_each_dev+0x50/0xa0
    [ +0.000007] [<0000000000895d3c>] bus_add_driver+0x17c/0x1e0
    [ +0.000006] [<00000000008979d4>] driver_register+0x74/0x120
    [ +0.000008] [<000000000151ab90>] sunsu_init+0x170/0x1d4
    [ +0.000009] [<0000000000427bf4>] do_one_initcall+0x34/0x220
    [ +0.000008] [<00000000014f8fb4>] kernel_init_freeable+0x210/0x274
    [ +0.000012] [<0000000000b3c1bc>] kernel_init+0x18/0x13c

    On 6.6, I also found these messages in the kernel log (but apparently no negative consequences):

    [ +0.371437] Kernel unaligned access at TPC[4ea294] load_module+0xff4/0x1c60
    [ +0.091825] Kernel unaligned access at TPC[4ea294] load_module+0xff4/0x1c60
    [ +0.091734] Kernel unaligned access at TPC[4ea294] load_module+0xff4/0x1c60
    [ +0.091763] Kernel unaligned access at TPC[4ea294] load_module+0xff4/0x1c60
    [ +0.091757] Kernel unaligned access at TPC[4ea294] load_module+0xff4/0x1c60
    [ +0.252176] log_unaligned: 4200 callbacks suppressed
    [ +0.055120] Kernel unaligned access at TPC[4e75e0] cmp_name+0x0/0x20
    [ +0.000023] Kernel unaligned access at TPC[4e75e0] cmp_name+0x0/0x20
    [ +0.000009] Kernel unaligned access at TPC[4e75e0] cmp_name+0x0/0x20
    [ +0.000009] Kernel unaligned access at TPC[4e75e0] cmp_name+0x0/0x20


    Conclusion:

    It looks very much like it isn't specifically a kernel bug at all, but either something
    wrong with the Debian kernel config, or with newer gcc versions.

    I still think it's a kernel bug.

    I will test some other gcc versions next.

    Unfortunately, I couldn't test the config from the Debian linux-image-6.10.7-sparc64-smp package.
    Trying to build a kernel with this config produced a 700MB package, and the resulting initrd was
    too large to fit into my boot partition. Is there something special about how Debian builds kernel packages?

    It's probably due to CONFIG_DEBUG_INFO. Try turning that off. Debian builds the kernel with debug
    symbols enabled and then runs the strip command afterwards. This way both a debug and a standard
    kernel package can be provided from the same build.
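
Whether a given config enables debug info can be checked before starting a long package build. The sketch below only inspects the config file. The helper name has_debug_info is made up for illustration.

```shell
#!/bin/sh
# Sketch: report whether a kernel config file enables debug info.
# has_debug_info is a hypothetical helper name.
has_debug_info() {
    # Matches CONFIG_DEBUG_INFO=y only, not CONFIG_DEBUG_INFO_* sub-options.
    grep -q '^CONFIG_DEBUG_INFO=y' "$1"
}
```

If "has_debug_info .config" succeeds, the option can be switched off with scripts/config followed by "make olddefconfig". On recent trees the setting is, as far as I know, driven by a choice between DEBUG_INFO_NONE and the DWARF variants, so it may be DEBUG_INFO_NONE that has to be enabled rather than DEBUG_INFO disabled. Treat that detail as an assumption and verify it in lib/Kconfig.debug.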

    Adrian

  • From John Paul Adrian Glaubitz@21:1/5 to Gregor Riepl on Mon Sep 23 08:20:01 2024
    Hi Gregor,

    On Mon, 2024-09-23 at 00:20 +0200, Gregor Riepl wrote:

    It's probably due to CONFIG_DEBUG_INFO. Try turning that off. Debian builds the kernel with debug
    symbols enabled and then runs the strip command afterwards. This way both a debug and a standard
    kernel package can be provided from the same build.

    Ah thanks, that did the trick.

    I built a 6.10.0 kernel using the Debian 6.10.7-sparc64-smp config, with module signing turned off.
    This kernel crashed instantly at boot, just after checking the rootfs. The fsck output was intermingled with the kernel log, but it did complete with a "done."

    Begin: Will now check root file system ... fsck from util-linux 2.38.1
    [ 68.420534] \|/ ____ \|/
    [ 68.420534] "@'/ .. \`@"
    [ 68.420534] /_| \__/ |_\
    [ 68.420534] \__U_/
    [ 68.630552] mount(192): Kernel illegal instruction [#1]
    [ 68.715911] CPU: 0 PID: 192 Comm: mount Tainted: G E 6.10.0 #28
    [ 68.828841] TSTATE: 0000000011001605 TPC: 0000000010320158 TNPC: 000000001032015c Y: 00000000 Tainted: G E
    [ 68.994452] TPC: <ext4_find_extent+0x3f8/0x580 [ext4]>
    [ 69.078729] g0: 0000000000000001 g1: 0000000000010000 g2: 0000000000000000 g3: 0000000000000000
    [ 69.209968] g4: fff000100210ac00 g5: fff000103e442000 g6: fff0000000598000 g7: 0000000000000000
    [ 69.341210] o0: 0000000000000200 o1: fff00000029cc3e8 o2: 0000000000000001 o3: 00000000103cc1b0
    [ 69.472457] o4: 0000000000001678 o5: 000000000000b000 sp: fff000000059a8d1 ret_pc: 0000000000ea309c
    [ 69.608287] RPC: <__cond_resched+0x1c/0x60>
    [ 69.679947] l0: fff0000000d06416 l1: fff00000029cc128 l2: 0000000100010000 l3: 0000000000ffffff
    [ 69.811203] l4: 0000000000000000 l5: 0000000000000005 l6: 0000000000010001 l7: 0000000000000002
    [ 69.942448] i0: 0000000000010001 i1: 0000000000000000 i2: 0000000000000000 i3: 0000000000000000
    [ 70.070570] i4: 0000000000000000 i5: 0000000000000001 i6: fff000000059a9a1 i7: 0000000010325748
    [ 70.185151] I7: <ext4_ext_map_blocks+0x68/0x2060 [ext4]>
    [ 70.255147] Call Trace:
    [ 70.287221] [<0000000010325748>] ext4_ext_map_blocks+0x68/0x2060 [ext4]
    [ 70.374417] [<000000001033ee38>] ext4_map_blocks+0x98/0x6c0 [ext4]
    [ 70.455872] [<000000001033fd34>] ext4_iomap_begin+0x254/0x2e0 [ext4]
    [ 70.539621] [<00000000007f494c>] iomap_iter+0x14c/0x420
    [ 70.608470] [<00000000007fa5f0>] iomap_bmap+0x70/0xe0
    [ 70.674929] [<000000001033bd3c>] ext4_bmap+0x9c/0xe0 [ext4]
    [ 70.748366] [<0000000000789404>] bmap+0x24/0x40
    [ 70.808051] [<00000000102d7e54>] jbd2_journal_init_inode+0x14/0x120 [jbd2]
    [ 70.898675] [<0000000010385c2c>] ext4_load_and_init_journal+0xec/0xd20 [ext4]
    [ 70.992740] [<000000001038bd78>] ext4_fill_super+0x2638/0x2aa0 [ext4]
    [ 71.077631] [<000000000076ae5c>] get_tree_bdev+0xfc/0x1c0
    [ 71.148774] [<0000000010377c34>] ext4_get_tree+0x14/0x40 [ext4]
    [ 71.226799] [<000000000076b4c0>] vfs_get_tree+0x20/0x120
    [ 71.296792] [<0000000000796f0c>] path_mount+0x40c/0xa60
    [ 71.365539] [<0000000000797a94>] sys_mount+0xf4/0x1c0
    [ 71.431997] [<0000000000406274>] linux_sparc_syscall+0x34/0x44

    I'll try a bisect with that config now, perhaps I can find something.

    Great! Now you finally have something to work with. Crossing my fingers that you'll find something.

    Adrian

  • From Ignacio Soriano Hernandez@21:1/5 to Gregor Riepl on Mon Sep 23 10:00:01 2024
    Good to hear Gregor.

    We are at a very stable working configuration with Radeon support on T2
    SDE.


    I think combining the efforts will be beneficial for the whole
    Linux/SPARC64 community.

    Cheers

    Iggi

    Gregor Riepl <onitake@gmail.com> wrote on Wed, 18 Sept 2024 at 20:17:

    Small update:

    I managed to build a 6.10 kernel with gcc 14 now.
    I'll do some more stress testing with it, but looks very stable so far.

    My kernel config is attached.
    It's much smaller than the Debian kernel config config-6.10.7-sparc64-smp, and now I'm quite
    convinced that it's really some options that cause the stability issues.
    Finding them is a big challenge, though...

  • From John Paul Adrian Glaubitz@21:1/5 to Ignacio Soriano Hernandez on Mon Sep 23 11:10:02 2024
    On Mon, 2024-09-23 at 09:56 +0200, Ignacio Soriano Hernandez wrote:
    We are at a very stable working configuration with Radeon support on T2 SDE.

    From what I have heard, Rene did not apply any local SPARC-specific patches to the kernel, so the fact that your machine runs stable with T2 SDE is more likely
    a result of disabled kernel features.

    Adrian

  • From John Paul Adrian Glaubitz@21:1/5 to Gregor Riepl on Fri Nov 1 13:50:01 2024
    Hi Gregor,

    On Mon, 2024-09-23 at 00:20 +0200, Gregor Riepl wrote:

    It's probably due to CONFIG_DEBUG_INFO. Try turning that off. Debian builds the kernel with debug
    symbols enabled and then runs the strip command afterwards. This way both a debug and a standard
    kernel package can be provided from the same build.

    Ah thanks, that did the trick.

    I built a 6.10.0 kernel using the Debian 6.10.7-sparc64-smp config, with module signing turned off.
    This kernel crashed instantly at boot, just after checking the rootfs. The fsck output was intermingled with the kernel log, but it did complete with a "done."

    Begin: Will now check root file system ... fsck from util-linux 2.38.1
    [ 68.420534] \|/ ____ \|/
    [ 68.420534] "@'/ .. \`@"
    [ 68.420534] /_| \__/ |_\
    [ 68.420534] \__U_/
    [ 68.630552] mount(192): Kernel illegal instruction [#1]
    [ 68.715911] CPU: 0 PID: 192 Comm: mount Tainted: G E 6.10.0 #28
    [ 68.828841] TSTATE: 0000000011001605 TPC: 0000000010320158 TNPC: 000000001032015c Y: 00000000 Tainted: G E
    [ 68.994452] TPC: <ext4_find_extent+0x3f8/0x580 [ext4]>
    [ 69.078729] g0: 0000000000000001 g1: 0000000000010000 g2: 0000000000000000 g3: 0000000000000000
    [ 69.209968] g4: fff000100210ac00 g5: fff000103e442000 g6: fff0000000598000 g7: 0000000000000000
    [ 69.341210] o0: 0000000000000200 o1: fff00000029cc3e8 o2: 0000000000000001 o3: 00000000103cc1b0
    [ 69.472457] o4: 0000000000001678 o5: 000000000000b000 sp: fff000000059a8d1 ret_pc: 0000000000ea309c
    [ 69.608287] RPC: <__cond_resched+0x1c/0x60>
    [ 69.679947] l0: fff0000000d06416 l1: fff00000029cc128 l2: 0000000100010000 l3: 0000000000ffffff
    [ 69.811203] l4: 0000000000000000 l5: 0000000000000005 l6: 0000000000010001 l7: 0000000000000002
    [ 69.942448] i0: 0000000000010001 i1: 0000000000000000 i2: 0000000000000000 i3: 0000000000000000
    [ 70.070570] i4: 0000000000000000 i5: 0000000000000001 i6: fff000000059a9a1 i7: 0000000010325748
    [ 70.185151] I7: <ext4_ext_map_blocks+0x68/0x2060 [ext4]>
    [ 70.255147] Call Trace:
    [ 70.287221] [<0000000010325748>] ext4_ext_map_blocks+0x68/0x2060 [ext4]
    [ 70.374417] [<000000001033ee38>] ext4_map_blocks+0x98/0x6c0 [ext4]
    [ 70.455872] [<000000001033fd34>] ext4_iomap_begin+0x254/0x2e0 [ext4]
    [ 70.539621] [<00000000007f494c>] iomap_iter+0x14c/0x420
    [ 70.608470] [<00000000007fa5f0>] iomap_bmap+0x70/0xe0
    [ 70.674929] [<000000001033bd3c>] ext4_bmap+0x9c/0xe0 [ext4]
    [ 70.748366] [<0000000000789404>] bmap+0x24/0x40
    [ 70.808051] [<00000000102d7e54>] jbd2_journal_init_inode+0x14/0x120 [jbd2]
    [ 70.898675] [<0000000010385c2c>] ext4_load_and_init_journal+0xec/0xd20 [ext4]
    [ 70.992740] [<000000001038bd78>] ext4_fill_super+0x2638/0x2aa0 [ext4]
    [ 71.077631] [<000000000076ae5c>] get_tree_bdev+0xfc/0x1c0
    [ 71.148774] [<0000000010377c34>] ext4_get_tree+0x14/0x40 [ext4]
    [ 71.226799] [<000000000076b4c0>] vfs_get_tree+0x20/0x120
    [ 71.296792] [<0000000000796f0c>] path_mount+0x40c/0xa60
    [ 71.365539] [<0000000000797a94>] sys_mount+0xf4/0x1c0
    [ 71.431997] [<0000000000406274>] linux_sparc_syscall+0x34/0x44

    I'll try a bisect with that config now, perhaps I can find something.

    Could you try whether reverting the following commit fixes it?

    commit 223b5e57d0d50b0c07b933350dbcde92018d3080
    Author: Mike Rapoport (IBM) <rppt@kernel.org>
    Date: Sun May 5 19:06:20 2024 +0300

    mm/execmem, arch: convert remaining overrides of module_alloc to execmem

    Alternatively, please try the following change:

    diff --git a/mm/execmem.c b/mm/execmem.c
    index 0c4b36bc6d10..8232f9767c8c 100644
    --- a/mm/execmem.c
    +++ b/mm/execmem.c
    @@ -17,7 +17,11 @@ static struct execmem_info default_execmem_info __ro_after_init;
     static void *__execmem_alloc(struct execmem_range *range, size_t size)
     {
     	bool kasan = range->flags & EXECMEM_KASAN_SHADOW;
    +#ifndef __sparc__
     	unsigned long vm_flags = VM_FLUSH_RESET_PERMS;
    +#else
    +	unsigned long vm_flags = 0;
    +#endif
     	gfp_t gfp_flags = GFP_KERNEL | __GFP_NOWARN;
     	unsigned long start = range->start;
     	unsigned long end = range->end;
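
For the first suggestion, the revert can be tried on a scratch branch so the bisect state and the main checkout stay untouched. This is a sketch only: the branch name and the helper try_revert are made up, and the revert may well need manual conflict resolution depending on the tree it is applied to.

```shell
#!/bin/sh
# Sketch: try the suggested revert on a scratch branch.
# BRANCH and try_revert are illustrative names, not part of the thread.
COMMIT=223b5e57d0d50b0c07b933350dbcde92018d3080
BRANCH=test-revert-execmem

try_revert() {
    # Refuse to run outside a git work tree that contains the commit.
    git cat-file -e "$COMMIT^{commit}" 2>/dev/null || return 1
    git checkout -b "$BRANCH" && git revert --no-edit "$COMMIT"
}
```

Calling try_revert from the top of a kernel checkout that contains the commit creates the branch and applies the revert. Anywhere else it returns 1 without changing anything.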

    Thanks,
    Adrian
