[1] https://x.com/Iggi76123640/status/1828396228444389600
[2] https://x.com/Iggi76123640/status/1828673611080589641
according to these posts [1][2] by Iggi, you figured out the stability problem
No, we are just sometimes lucky that it runs stable for that long. I was only made aware
recently that sun4u was not 100%. Until some years ago my fastest UltraSPARC was only a 360MHz Ultra5, and I was only recently donated a Sun Blade 1000. I see some MM corruption that I wanted to hunt next.
https://x.com/Iggi76123640/status/1827658841581896152
with newer kernels on older SPARC machines. There has been a regression on older
SPARCs since around kernel 4.19.x which I haven't gotten around to bisecting yet.
Happy to bisect. I guess you mean the random memory corruption I see, or something else?
If you have issues to bisect, just let us know, for any arch. Given T2’s cross-compile
support, and since I have most hardware in my museum now, I can usually bisect issues
within a day or two.
If you've found and fixed the bug in question, it would be great if you could share
your fix with the community and maybe whip up a kernel patch to fix the bug upstream.
Of course - all patches are always nicely sorted in our public and nicely readable
SVN tree in any case.
https://t2linux.com
Newer SPARCs are not affected by this bug, although there are other issues.
You mean sun4v? I found a cheap T4-1 some months ago, and T2/Linux appears
to run stable on that. Is there a list of sun4v issues I should be aware of?
If you have issues to bisect, just let us know, for any arch. Given T2’s cross-compile
support, and since I have most hardware in my museum now, I can usually bisect issues
within a day or two.
I don't have issues with bisecting, I'm just rather time-constrained at the moment, so
I'm always happy when someone else can step in and help. Would be great to get this issue
fixed upstream.
My Ultra 10 and Fire V215 are desperately waiting for a more stable kernel.
I actually wanted to offer help with bisecting, but held back due to a lack of time and of a suitable build system (compiling kernels is so time-consuming).
I may have some time to do test runs next week.
Could you give me some quick starters for setting up a kernel cross build env on
an amd64 machine, or maybe access to a Sun box I could use?
It's actually pretty simple these days as the kernel.org mirrors provide binary
distributions of freestanding toolchains for all major supported architectures
of the Linux kernel.
To set up on any x86_64 machine, do the following:
# wget https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/x86_64/14.2.0/x86_64-gcc-14.2.0-nolibc-sparc64-linux.tar.gz
# tar xf x86_64-gcc-14.2.0-nolibc-sparc64-linux.tar.gz
# export PATH=$PATH:$PWD/gcc-14.2.0-nolibc/sparc64-linux/bin/
# git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
# cd linux
# export ARCH=sparc
# export CROSS_COMPILE=sparc64-linux-
# make sparc64_defconfig
# make -j<number of parallel jobs>
The cross-compiled kernel will be available as "vmlinux".
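For bisecting with this toolchain, each step comes down to rebuilding and then boot-testing the result on the SPARC box. Below is a minimal sketch of the build half, assuming ARCH and CROSS_COMPILE are exported as above. `bisect_build` and `BUILD_CMD` are hypothetical names, not part of the kernel tree. A failed build returns 125, which is the exit code `git bisect run` treats as "skip".

```shell
# Hypothetical helper for one bisect step; run it from the top of the kernel tree.
# BUILD_CMD may be overridden, e.g. to build a package instead of a bare vmlinux.
if [ -z "${BUILD_CMD:-}" ]; then
    BUILD_CMD='make olddefconfig && make -j"$(nproc)"'
fi

bisect_build() {
    # Keep the full build output in build.log and print only a one-line verdict.
    if sh -c "$BUILD_CMD" >build.log 2>&1; then
        echo "build ok: boot-test, then 'git bisect good' or 'git bisect bad'"
    else
        echo "build failed: 'git bisect skip' (see build.log)"
        return 125
    fi
}
```

After `git bisect start` and marking one known-good and one known-bad version, `bisect_build` would be called once per step. The good/bad verdict still has to come from booting the kernel on real hardware.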
Very good, thanks!
[1] https://lore.kernel.org/lkml/20231219032254.96685-1-feng.tang@intel.com/
My first attempt at bisecting ran into lots of compilation issues with the default config of each version and gcc 14.
All the 4.x and 5.x kernels fail with at least the following errors (some versions have more):
arch/sparc/kernel/mdesc.c: In function 'mdesc_node_by_name':
arch/sparc/kernel/mdesc.c:646:22: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
646 | if (!strcmp(names + ep[ret].name_offset, name))
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
arch/sparc/kernel/mdesc.c:78:33: note: at offset [32, 8589934606] into source object 'mdesc' of size 16
78 | struct mdesc_hdr mdesc;
| ^~~~~
...
In function 'kernel_lds_init',
inlined from 'report_memory' at arch/sparc/mm/init_64.c:3112:2:
arch/sparc/mm/init_64.c:3102:31: error: array subscript -1 is outside array bounds of 'char[]' [-Werror=array-bounds=]
3102 | data_resource.end = compute_kern_paddr(_edata - 1);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
./include/asm-generic/sections.h: In function 'report_memory':
./include/asm-generic/sections.h:36:32: note: at offset -1 into object '_edata' of size [0, 9223372036854775807]
36 | extern char _data[], _sdata[], _edata[];
| ^~~~~~
...
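One possible workaround, not tested on these trees: both diagnostics are only fatal because of -Werror, and GCC gives a specific -Wno-error=<name> priority over a plain -Werror regardless of position on the command line. The kbuild variable KCFLAGS passes extra options to the compiler, so the two warnings could be demoted as sketched below. `no_error_flags` is a hypothetical helper, and whether this is enough for every old version is an assumption, since some versions fail with additional errors.

```shell
# Build a KCFLAGS value that demotes the named warnings from errors back to warnings.
no_error_flags() {
    flags=""
    for w in "$@"; do
        flags="$flags -Wno-error=$w"
    done
    # Strip the leading space left over from the first iteration.
    echo "${flags# }"
}

# Intended use inside the kernel tree (left commented out here):
# make KCFLAGS="$(no_error_flags stringop-overread array-bounds)" -j"$(nproc)"
```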
Next issue: The default kernel config lacks some essential drivers to make my system bootable. For my Fire V215,
at least CONFIG_FUSIONMPT and CONFIG_CGROUPS are needed, plus a few other things. systemd requires cgroups v2
support these days.
I started off with a default config in the first bisect step (corresponding with 5.14), added the required options,
and then did a make oldconfig in each subsequent step, answering all questions with the default.
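Because make oldconfig can drop an option when its dependencies change between versions, it may be worth checking at each step that the required options survived. A small sketch follows; `missing_opts` is a hypothetical helper, and the option list in the example is just the two options named above.

```shell
# Print every listed option that is not set to y or m in the given config file.
missing_opts() {
    cfg=$1
    shift
    for opt in "$@"; do
        grep -Eq "^CONFIG_${opt}=[ym]\$" "$cfg" || echo "CONFIG_${opt}"
    done
}

# Example, run from the kernel tree after make oldconfig:
# missing_opts .config FUSIONMPT CGROUPS
```

Empty output means all listed options are still enabled. Any printed name needs to be re-enabled before building that step.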
Building with make bindeb-pkg produces an almost usable kernel package. For some reason, grub-ieee1275 requires an
unpacked kernel, so the installed vmlinuz needed to be gunzipped afterwards.
Now for the actual testing... triggering a panic/oops reliably was difficult. The Debian 6.10 kernel usually crashes
relatively quickly on disk I/O, and enabling swap accelerates the effect. bonnie++ should therefore make for a good
stress test.
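Since triggering the crash reliably is difficult, the stress test may need to run several times in a row. Below is a sketch of such a loop. `stress_loop` and `STRESS_CMD` are hypothetical names, and the default command is simply the bonnie++ invocation used in this thread.

```shell
# Hypothetical wrapper: repeat a stress command and stop at the first failure.
if [ -z "${STRESS_CMD:-}" ]; then
    STRESS_CMD=/usr/sbin/bonnie++
fi

stress_loop() {
    runs=$1
    i=1
    while [ "$i" -le "$runs" ]; do
        # Each run gets its own log so a failing pass can be inspected later.
        if ! sh -c "$STRESS_CMD" >"stress-$i.log" 2>&1; then
            echo "run $i failed, see stress-$i.log"
            return 1
        fi
        i=$((i + 1))
    done
    echo "all $runs runs passed"
}
```

A hard crash will of course take the loop down with the machine. The per-run logs mainly help with the softer failures, such as the I/O errors bonnie++ reports itself.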
I don't have the exact commit IDs of each bisection step, but it was (roughly) 5.14-rc6, 6.6-rc7, 6.8-rc3, 6.9, 6.10.
There were a few odd non-critical issues, such as this I/O error with 5.14 (but nothing in dmesg):
$ /usr/sbin/bonnie++
Writing a byte at a time...done
Writing intelligently...done
Rewriting...Can't write block.: Unknown error 2560
Bonnie: drastic I/O error (re write(2)): Unknown error 2560
6.2 produces this warning at boot:
[ +21.090317] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ +1.422401] rcu: 0-...!: (1 GPs behind) idle=a29c/0/0x1 softirq=18/19 fqs=44
[ +0.093960] (detected by 0, t=2246 jiffies, g=-1175, q=989 ncpus=2)
[ +0.083646] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.2.0-rc7+ #18
[ +0.083641] TSTATE: 0000004411001605 TPC: 000000000042beac TNPC: 000000000042beb0 Y: 00000000 Not tainted
[ +0.129479] TPC: <arch_cpu_idle+0x8c/0xa0>
[ +0.053848] g0: 00000000004209d0 g1: 00000000015282c0 g2: 00000000015105c8 g3: 0000000000000001
[ +0.114585] g4: fff0000000390ba0 g5: fff000027e2f0000 g6: fff0000000398000 g7: 00000000173aa294
[ +0.114582] o0: fff0000000390ba0 o1: 0000000000000001 o2: 000000000130ae78 o3: 00000000015105c8
[ +0.114580] o4: 00000000015280c0 o5: 000000000130b580 sp: fff000000039b3d1 ret_pc: 000000000042bea0
[ +0.119164] RPC: <arch_cpu_idle+0x80/0xa0>
[ +0.053850] l0: 0000000001407f20 l1: 0000000000022c05 l2: 0000000000000000 l3: 000000000130b538
[ +0.114585] l4: 000000000130b400 l5: 0000000000000040 l6: 0000000000000000 l7: 0000000001408140
[ +0.114581] i0: 00000000173aa299 i1: fff000027f814990 i2: 0000000000000001 i3: 0000000000000001
[ +0.114580] i4: fff000027f814990 i5: 0000000001524990 i6: fff000000039b481 i7: 0000000000b22f68
[ +0.114582] I7: <default_idle_call+0x48/0x100>
[ +0.058433] Call Trace:
[ +0.032082] [<0000000000b22f68>] default_idle_call+0x48/0x100
[ +0.075624] [<00000000004adc28>] do_idle+0x108/0x180
[ +0.065311] [<00000000004adf34>] cpu_startup_entry+0x14/0x40
[ +0.074477] [<000000000043ede4>] smp_callin+0xe4/0x120
[ +0.067603] [<0000000001318614>] 0x1318614
[ +0.053853] [<0000000040000000>] 0x40000000
From what I know, there are a number of hidden bugs in the RCU implementation on some architectures.
It also failed to shut down properly:
[ 1634.268777] systemd-journald[181]: Failed to send WATCHDOG=1 notification message: Connection refused
[ 1754.268963] systemd-journald[181]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
The shutdown got stuck after that. I did not see this with any other kernels.
From 6.2 onward, the tg3 network driver produces this warning at shutdown (but it proceeds from there without issue):
[ 1594.751376] ------------[ cut here ]------------
[ 1594.812280] WARNING: CPU: 0 PID: 3914 at kernel/irq/msi.c:196 msi_domain_free_descs+0xdc/0x100
[ 1594.925813] Modules linked in: binfmt_misc flash sg fuse autofs4 dm_mod mptsas sr_mod scsi_transport_sas mptscsih ehci_pci mptbase tg3 cdrom ehci_hcd libphy
[ 1595.110450] CPU: 0 PID: 3914 Comm: ip Not tainted 6.2.0-rc7+ #18
[ 1595.189586] Call Trace:
[ 1595.221667] [<0000000000465da8>] __warn+0xe8/0x120
[ 1595.284686] [<0000000000b11088>] warn_slowpath_fmt+0x30/0x70
[ 1595.359165] [<00000000004cdbfc>] msi_domain_free_descs+0xdc/0x100
[ 1595.439371] [<00000000004ce878>] msi_domain_free_msi_descs_range+0x18/0x40
[ 1595.529891] [<0000000000819984>] pci_free_msi_irqs+0x4/0x20
[ 1595.603222] [<0000000000817e94>] pci_disable_msi+0x54/0x80
[ 1595.675408] [<00000000100b0464>] tg3_ints_fini+0x64/0xe0 [tg3]
[ 1595.752282] [<00000000100c880c>] tg3_stop+0x22c/0x2c0 [tg3]
[ 1595.825614] [<00000000100c88c0>] tg3_close+0x20/0xa0 [tg3]
[ 1595.897799] [<000000000096c8e8>] __dev_close_many+0x88/0x100
[ 1595.972278] [<0000000000976c64>] __dev_change_flags+0xa4/0x1e0
[ 1596.049047] [<0000000000976db8>] dev_change_flags+0x18/0x60
[ 1596.122378] [<00000000009872a0>] do_setlink+0x2e0/0x1140
[ 1596.192273] [<000000000098d138>] __rtnl_newlink+0x3f8/0x7e0
[ 1596.265605] [<000000000098d550>] rtnl_newlink+0x30/0x60
[ 1596.334353] [<0000000000986a7c>] rtnetlink_rcv_msg+0x27c/0x360
[ 1596.411144] ---[ end trace 0000000000000000 ]---
On 6.6, I got this warning at boot:
[ +21.089612] rcu: INFO: rcu_sched self-detected stall on CPU
[ +0.000007] rcu: 1-....: (281 ticks this GP) idle=36cc/1/0x4000000000000002 softirq=28/28 fqs=1050
[ +0.000012] rcu: (t=2101 jiffies g=-1175 q=1029 ncpus=2)
[ +0.000007] CPU: 1 PID: 1 Comm: swapper/1 Not tainted 6.6.0-rc7+ #19
[ +0.000008] TSTATE: 0000004411001602 TPC: 00000000004c23f0 TNPC: 00000000004c23f4 Y: 00001f91 Not tainted
[ +0.000005] TPC: <console_flush_all+0x1d0/0x4a0>
[ +0.000018] g0: 00000000004c23f0 g1: 000000000154bca0 g2: 0000000000000000 g3: 00000000016e1400
[ +0.000004] g4: fff0001004510000 g5: fff000103d2b6000 g6: fff0001004658000 g7: 000000000000000e
[ +0.000004] o0: 00000000016e17f8 o1: 0000000000000000 o2: 0000000000000000 o3: 000000000000004d
[ +0.000004] o4: 00000000016e0bd8 o5: 0000000001753250 sp: fff000100465a9c1 ret_pc: 00000000004c23e4
[ +0.000004] RPC: <console_flush_all+0x1c4/0x4a0>
[ +0.000007] l0: 000000000133b078 l1: 0000000000000000 l2: 0000000000000000 l3: 0000000000000000
[ +0.000004] l4: 0000000001435400 l5: 0000000000000000 l6: 00000000016e0bd8 l7: 00000000014b0840
[ +0.000004] i0: 0000000000000000 i1: fff000100465b368 i2: fff000100465b367 i3: 00000000016e1400
[ +0.000004] i4: 00000000016e0bd8 i5: 00000000016e17f8 i6: fff000100465aab1 i7: 00000000004c2730
[ +0.000004] I7: <console_unlock+0x70/0xe0>
[ +0.000008] Call Trace:
[ +0.000003] [<00000000004c2730>] console_unlock+0x70/0xe0
[ +0.000007] [<00000000004c3c8c>] vprintk_emit+0x1cc/0x220
[ +0.000009] [<0000000000b32aa4>] _printk+0x24/0x34
[ +0.000014] [<00000000008851e8>] serial_core_register_port+0x468/0x6c0
[ +0.000007] [<0000000000888998>] su_probe+0x178/0x3c0
[ +0.000009] [<0000000000898fe8>] platform_probe+0x28/0x80
[ +0.000006] [<0000000000896bf8>] really_probe+0xb8/0x2e0
[ +0.000011] [<0000000000896f04>] driver_probe_device+0x24/0xe0
[ +0.000007] [<0000000000897104>] __driver_attach+0x64/0x120
[ +0.000007] [<0000000000894c10>] bus_for_each_dev+0x50/0xa0
[ +0.000007] [<0000000000895d3c>] bus_add_driver+0x17c/0x1e0
[ +0.000006] [<00000000008979d4>] driver_register+0x74/0x120
[ +0.000008] [<000000000151ab90>] sunsu_init+0x170/0x1d4
[ +0.000009] [<0000000000427bf4>] do_one_initcall+0x34/0x220
[ +0.000008] [<00000000014f8fb4>] kernel_init_freeable+0x210/0x274
[ +0.000012] [<0000000000b3c1bc>] kernel_init+0x18/0x13c
On 6.6, I also found these messages in the kernel log (but apparently no negative consequences):
[ +0.371437] Kernel unaligned access at TPC[4ea294] load_module+0xff4/0x1c60
[ +0.091825] Kernel unaligned access at TPC[4ea294] load_module+0xff4/0x1c60
[ +0.091734] Kernel unaligned access at TPC[4ea294] load_module+0xff4/0x1c60
[ +0.091763] Kernel unaligned access at TPC[4ea294] load_module+0xff4/0x1c60
[ +0.091757] Kernel unaligned access at TPC[4ea294] load_module+0xff4/0x1c60
[ +0.252176] log_unaligned: 4200 callbacks suppressed
[ +0.055120] Kernel unaligned access at TPC[4e75e0] cmp_name+0x0/0x20
[ +0.000023] Kernel unaligned access at TPC[4e75e0] cmp_name+0x0/0x20
[ +0.000009] Kernel unaligned access at TPC[4e75e0] cmp_name+0x0/0x20
[ +0.000009] Kernel unaligned access at TPC[4e75e0] cmp_name+0x0/0x20
Conclusion:
It looks very much like it isn't specifically a kernel bug at all, but either something
wrong with the Debian kernel config, or with newer gcc versions.
I will test some other gcc versions next.
Unfortunately, I couldn't test the config from the Debian linux-image-6.10.7-sparc64-smp package.
Trying to build a kernel with this config produced a 700MB package, and the resulting initrd was
too large to fit into my boot partition. Is there something special about how Debian builds kernel packages?
It's probably due to CONFIG_DEBUG_INFO. Try turning that off. Debian builds the kernel with debug
symbols enabled and then runs the strip command afterwards. This way both a debug and a standard
kernel package can be provided from the same build.
Ah thanks, that did the trick.
I built a 6.10.0 kernel using the Debian 6.10.7-sparc64-smp config, with module signing turned off.
This kernel crashed instantly at boot, just after checking the rootfs. The fsck output was intermingled with the kernel log, but it did complete with a "done."
Begin: Will now check root file system ... fsck from util-linux 2.38.1
[ 68.420534] \|/ ____ \|/
[ 68.420534] "@'/ .. \`@"
[ 68.420534] /_| \__/ |_\
[ 68.420534] \__U_/
[ 68.630552] mount(192): Kernel illegal instruction [#1]
[ 68.715911] CPU: 0 PID: 192 Comm: mount Tainted: G E 6.10.0 #28
[ 68.828841] TSTATE: 0000000011001605 TPC: 0000000010320158 TNPC: 000000001032015c Y: 00000000 Tainted: G E
[ 68.994452] TPC: <ext4_find_extent+0x3f8/0x580 [ext4]>
[ 69.078729] g0: 0000000000000001 g1: 0000000000010000 g2: 0000000000000000 g3: 0000000000000000
[ 69.209968] g4: fff000100210ac00 g5: fff000103e442000 g6: fff0000000598000 g7: 0000000000000000
[ 69.341210] o0: 0000000000000200 o1: fff00000029cc3e8 o2: 0000000000000001 o3: 00000000103cc1b0
[ 69.472457] o4: 0000000000001678 o5: 000000000000b000 sp: fff000000059a8d1 ret_pc: 0000000000ea309c
[ 69.608287] RPC: <__cond_resched+0x1c/0x60>
[ 69.679947] l0: fff0000000d06416 l1: fff00000029cc128 l2: 0000000100010000 l3: 0000000000ffffff
[ 69.811203] l4: 0000000000000000 l5: 0000000000000005 l6: 0000000000010001 l7: 0000000000000002
[ 69.942448] i0: 0000000000010001 i1: 0000000000000000 i2: 0000000000000000 i3: 0000000000000000
[ 70.070570] i4: 0000000000000000 i5: 0000000000000001 i6: fff000000059a9a1 i7: 0000000010325748
[ 70.185151] I7: <ext4_ext_map_blocks+0x68/0x2060 [ext4]>
[ 70.255147] Call Trace:
[ 70.287221] [<0000000010325748>] ext4_ext_map_blocks+0x68/0x2060 [ext4]
[ 70.374417] [<000000001033ee38>] ext4_map_blocks+0x98/0x6c0 [ext4]
[ 70.455872] [<000000001033fd34>] ext4_iomap_begin+0x254/0x2e0 [ext4]
[ 70.539621] [<00000000007f494c>] iomap_iter+0x14c/0x420
[ 70.608470] [<00000000007fa5f0>] iomap_bmap+0x70/0xe0
[ 70.674929] [<000000001033bd3c>] ext4_bmap+0x9c/0xe0 [ext4]
[ 70.748366] [<0000000000789404>] bmap+0x24/0x40
[ 70.808051] [<00000000102d7e54>] jbd2_journal_init_inode+0x14/0x120 [jbd2]
[ 70.898675] [<0000000010385c2c>] ext4_load_and_init_journal+0xec/0xd20 [ext4]
[ 70.992740] [<000000001038bd78>] ext4_fill_super+0x2638/0x2aa0 [ext4]
[ 71.077631] [<000000000076ae5c>] get_tree_bdev+0xfc/0x1c0
[ 71.148774] [<0000000010377c34>] ext4_get_tree+0x14/0x40 [ext4]
[ 71.226799] [<000000000076b4c0>] vfs_get_tree+0x20/0x120
[ 71.296792] [<0000000000796f0c>] path_mount+0x40c/0xa60
[ 71.365539] [<0000000000797a94>] sys_mount+0xf4/0x1c0
[ 71.431997] [<0000000000406274>] linux_sparc_syscall+0x34/0x44
I'll try a bisect with that config now, perhaps I can find something.
Small update:
I managed to build a 6.10 kernel with gcc 14 now.
I'll do some more stress testing with it, but it looks very stable so far.
My kernel config is attached.
It's much smaller than the Debian kernel config config-6.10.7-sparc64-smp, and now I'm quite convinced that it's really some config options that cause the stability issues.
Finding them is a big challenge, though...
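One way to narrow the search is to list the options that are enabled in the Debian config but not in the small stable one. The kernel tree ships scripts/diffconfig for a full comparison. The sketch below covers only the "enabled in A but not in B" case in plain shell, and `only_enabled_in` is a hypothetical helper name.

```shell
# List options set to y or m in config A ($1) that are not set to y or m in config B ($2).
only_enabled_in() {
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 1
    grep -E '^CONFIG_[A-Za-z0-9_]+=[ym]$' "$1" | cut -d= -f1 | sort -u >"$tmp_a"
    grep -E '^CONFIG_[A-Za-z0-9_]+=[ym]$' "$2" | cut -d= -f1 | sort -u >"$tmp_b"
    # comm -23 keeps the lines that appear only in the first sorted list.
    comm -23 "$tmp_a" "$tmp_b"
    rm -f "$tmp_a" "$tmp_b"
}

# Example: candidates that the Debian config enables and the stable config does not.
# only_enabled_in config-6.10.7-sparc64-smp .config
```

The resulting list could then be halved repeatedly: enable half of the candidates in the stable config, rebuild, stress-test, and continue with whichever half reproduces the crash.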
We are at a very stable working configuration with Radeon support on T2 SDE.
From what I have heard, Rene did not apply any local SPARC-specific patches to the kernel, so the fact that your machine runs stable with T2 SDE is more likely a result of disabled kernel features.