• Bug#1100120: libopenmpi-dev: mpi4py spawn tests get OPAL ERROR: Unreach

    From Drew Parsons@21:1/5 to All on Tue Mar 11 15:00:01 2025
    Package: libopenmpi-dev
    Version: 5.0.7-1
    Severity: serious
    Justification: FTBFS (dependencies)

    mpi4py build-time tests are failing, and the failures point to
    problems in openmpi. That's with mpi4py 4.0.3-1.
    debci tests from its last build are still passing for now.

    I'm assuming the bug is in openmpi, not mpi4py itself, since mpi4py is
    passing tests with mpich (32 bit arches).

    The first problem comes from PMIx:
      An error occurred in PMIx Event Notification
    The error is reproducible,
    cf. https://tests.reproducible-builds.org/debian/rb-pkg/unstable/amd64/mpi4py.html
    https://tests.reproducible-builds.org/debian/rbuild/unstable/amd64/mpi4py_4.0.3-1.rbuild.log.gz
    It is triggered by test_util_pkl5, and also test_util_pool,
    test_util_sync and test_win.
    It is associated with a kernel general protection fault from prte.
    That bug is reported in Bug#1098576, currently assigned to pmix though
    I suspect it might be an openmpi issue. https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1098576


    Here I'm reporting a second problem: spawn is failing,
    for instance:

    ERROR: testNoArgs (test_spawn.TestSpawnSingleWorldMany.testNoArgs)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
    File "/home/drew/projects/python/build/mpi4py/test/test_spawn.py", line 175, in testNoArgs
    child = self.COMM.Spawn(
    script, None, self.MAXPROCS,
    info=self.INFO, root=self.ROOT,
    )
    File "src/mpi4py/MPI.src/Comm.pyx", line 2544, in mpi4py.MPI.Intracomm.Spawn
    with nogil: CHKERR( MPI_Comm_spawn(
    mpi4py.MPI.Exception: MPI_ERR_UNKNOWN: unknown error

    ----------------------------------------------------------------------
    Ran 1857 tests in 84.632s

    FAILED (errors=40, skipped=162)
    [sandy:272668] OPAL ERROR: Unreachable in file ../../../ompi/runtime/ompi_mpi_finalize.c at line 286


    I've marked this bug severity serious because of the message at the
    end concerning the OPAL error in ompi_mpi_finalize.c (as well as the
    MPI_ERR_UNKNOWN errors in the spawn tests). If the OPAL message is a
    red herring then please downgrade severity if appropriate.
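
    For reference, the kind of call that fails in the traceback above can
    be sketched outside the test suite roughly as follows. This is a
    minimal sketch, not the mpi4py test itself: the names CHILD_CODE and
    spawn_once are hypothetical, it spawns from COMM_SELF with default
    info and root, and it assumes mpi4py is importable by the Python
    interpreter being run.

```python
import sys

# Hypothetical child program: attach to the parent, synchronise, disconnect.
CHILD_CODE = (
    "from mpi4py import MPI\n"
    "parent = MPI.Comm.Get_parent()\n"
    "parent.Barrier()\n"
    "parent.Disconnect()\n"
)


def spawn_once(maxprocs=1):
    """Spawn `maxprocs` children running CHILD_CODE and wait for them."""
    # Imported lazily so that merely loading this file does not start MPI.
    from mpi4py import MPI

    child = MPI.COMM_SELF.Spawn(
        sys.executable, args=["-c", CHILD_CODE], maxprocs=maxprocs
    )
    child.Barrier()      # pairs with parent.Barrier() in the child
    child.Disconnect()
```

    If the problem is the one shown in the traceback, calling spawn_once()
    would be expected to raise the same mpi4py.MPI.Exception from Spawn.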



    We could just skip the failing tests in mpi4py (in fact I will for now),
    but the underlying problem should be fixed in any case.

    With mpi4py, I will upload 4.0.3-2 skipping the pmix failures, in
    order to get a reproducible record of the spawn failure. After that I
    will upload a release of mpi4py to skip the spawn tests, until the
    issue is fixed in openmpi (or pmix).



    -- System Information:
    Debian Release: trixie/sid
    APT prefers unstable-debug
    APT policy: (500, 'unstable-debug'), (500, 'unstable'), (1, 'experimental')
    Architecture: amd64 (x86_64)
    Foreign Architectures: i386

    Kernel: Linux 6.12.17-amd64 (SMP w/8 CPU threads; PREEMPT)
    Kernel taint flags: TAINT_PROPRIETARY_MODULE, TAINT_OOT_MODULE
    Locale: LANG=en_AU.UTF-8, LC_CTYPE=en_AU.UTF-8 (charmap=UTF-8), LANGUAGE=en_AU:en
    Shell: /bin/sh linked to /usr/bin/dash
    Init: systemd (via /run/systemd/system)
    LSM: AppArmor: enabled

    Versions of packages libopenmpi-dev depends on:
    ii gfortran [gfortran-mod-15] 4:14.2.0-1
    ii gfortran-11 [gfortran-mod-15] 11.5.0-2
    ii gfortran-12 [gfortran-mod-15] 12.4.0-4
    ii gfortran-13 [gfortran-mod-15] 13.3.0-12
    ii gfortran-14 [gfortran-mod-15] 14.2.0-17
    ii libevent-dev 2.1.12-stable-10+b1
    ii libhwloc-dev 2.12.0-1
    ii libibverbs-dev 56.0-2
    ii libjs-jquery 3.6.1+dfsg+~3.5.14-1
    ii libjs-jquery-ui 1.13.2+dfsg-1
    ii libopenmpi40 5.0.7-1
    ii libpmix-dev 5.0.6-5
    ii openmpi-bin 5.0.7-1
    ii openmpi-common 5.0.7-1
    ii zlib1g-dev 1:1.3.dfsg+really1.3.1-1+b1

    Versions of packages libopenmpi-dev recommends:
    ii libcoarrays-openmpi-dev 2.10.2+ds-4

    Versions of packages libopenmpi-dev suggests:
    pn openmpi-doc <none>

    -- no debconf information

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Drew Parsons@21:1/5 to All on Tue Mar 11 20:40:01 2025
    Package: libopenmpi-dev
    Followup-For: Bug #1100120

    We can see the spawn errors in the mpi4py 4.0.3-2 build logs,
    e.g. https://buildd.debian.org/status/fetch.php?pkg=mpi4py&arch=amd64&ver=4.0.3-2&stamp=1741705221&raw=0

    Skipping the test_spawn tests, the remaining tests pass,
    but I get the following error locally (in a pdebuild pbuilder chroot).

    It looks like a different variation of the OPAL ERROR I reported in
    this bug. But this backtrace refers to libucs.so.0. Could it be a bug in
    ucx, which has just been upgraded to 1.18.1?

    The end of the build log is:

    ...
    test_starmap (test_util_pool.TestThreadPool.test_starmap) ... ok
    testConstructor (test_win.TestWinNull.testConstructor) ... ok
    testGetName (test_win.TestWinNull.testGetName) ... ok

    ----------------------------------------------------------------------
    Ran 1813 tests in 58.573s

    OK (skipped=162)
    [sandy:25412:0:25412] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
    ==== backtrace (tid: 25412) ====
    0 /lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2bc) [0x7fd827e8064c]
    1 /lib/x86_64-linux-gnu/libucs.so.0(+0x3182f) [0x7fd827e8082f]
    2 /lib/x86_64-linux-gnu/libucs.so.0(+0x319fa) [0x7fd827e809fa]
    3 /lib/x86_64-linux-gnu/libc.so.6(+0x3fdb0) [0x7fd82aa90db0]
    4 /lib/x86_64-linux-gnu/libopen-pal.so.80(opal_net_get_hostname+0x12) [0x7fd827dd90a2]
    5 /lib/x86_64-linux-gnu/libopen-pal.so.80(+0xfc518) [0x7fd827e0e518]
    6 /lib/x86_64-linux-gnu/libopen-pal.so.80(+0xfc8a5) [0x7fd827e0e8a5]
    7 /lib/x86_64-linux-gnu/libopen-pal.so.80(mca_btl_tcp_proc_create+0x464) [0x7fd827df8924]
    8 /lib/x86_64-linux-gnu/libopen-pal.so.80(mca_btl_tcp_add_procs+0x6f) [0x7fd827df09cf]
    9 /lib/x86_64-linux-gnu/libmpi.so.40(+0xf3c0c) [0x7fd8282f3c0c]
    10 /lib/x86_64-linux-gnu/libmpi.so.40(mca_pml_ob1_isend+0x84d) [0x7fd8284841ad]
    11 /lib/x86_64-linux-gnu/libmpi.so.40(ompi_dpm_dyn_finalize+0x1b5) [0x7fd828280575]
    12 /lib/x86_64-linux-gnu/libmpi.so.40(+0x64c27) [0x7fd828264c27]
    13 /lib/x86_64-linux-gnu/libopen-pal.so.80(opal_finalize_cleanup_domain+0x52) [0x7fd827d47ec2]
    14 /lib/x86_64-linux-gnu/libopen-pal.so.80(opal_finalize+0x37) [0x7fd827d3a6b7]
    15 /lib/x86_64-linux-gnu/libmpi.so.40(ompi_rte_finalize+0x14b) [0x7fd8282980eb]
    16 /lib/x86_64-linux-gnu/libmpi.so.40(+0x9adfc) [0x7fd82829adfc]
    17 /lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_instance_finalize+0xbd) [0x7fd82829c3cd]
    18 /lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_finalize+0x21f) [0x7fd82829447f]
    19 /usr/bin/python3.13() [0x45b262]
    20 /usr/bin/python3.13(Py_Exit+0x2f) [0x69abef]
    21 /usr/bin/python3.13() [0x68858b]
    22 /usr/bin/python3.13() [0x6883fb]
    23 /usr/bin/python3.13() [0x67ed22]
    24 /usr/bin/python3.13() [0x67eb1e]
    25 /usr/bin/python3.13(Py_RunMain+0x3c1) [0x67d961]
    26 /usr/bin/python3.13(Py_BytesMain+0x2b) [0x63a72b]
    27 /lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x7fd82aa7aca8]
    28 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7fd82aa7ad65]
    29 /usr/bin/python3.13(_start+0x21) [0x639ae1]
    =================================

    python3.13:25412 terminated with signal 11 at PC=7fd827dd90a2 SP=7ffe028bd5d0. Backtrace:
    /lib/x86_64-linux-gnu/libopen-pal.so.80(opal_net_get_hostname+0x12) [0x7fd827dd90a2]
    /lib/x86_64-linux-gnu/libopen-pal.so.80(+0xfc518) [0x7fd827e0e518]
    /lib/x86_64-linux-gnu/libopen-pal.so.80(+0xfc8a5) [0x7fd827e0e8a5]
    /lib/x86_64-linux-gnu/libopen-pal.so.80(mca_btl_tcp_proc_create+0x464) [0x7fd827df8924]
    /lib/x86_64-linux-gnu/libopen-pal.so.80(mca_btl_tcp_add_procs+0x6f) [0x7fd827df09cf]
    /lib/x86_64-linux-gnu/libmpi.so.40(+0xf3c0c) [0x7fd8282f3c0c]
    /lib/x86_64-linux-gnu/libmpi.so.40(mca_pml_ob1_isend+0x84d) [0x7fd8284841ad]
    /lib/x86_64-linux-gnu/libmpi.so.40(ompi_dpm_dyn_finalize+0x1b5) [0x7fd828280575]
    /lib/x86_64-linux-gnu/libmpi.so.40(+0x64c27) [0x7fd828264c27]
    /lib/x86_64-linux-gnu/libopen-pal.so.80(opal_finalize_cleanup_domain+0x52) [0x7fd827d47ec2]
    /lib/x86_64-linux-gnu/libopen-pal.so.80(opal_finalize+0x37) [0x7fd827d3a6b7]
    /lib/x86_64-linux-gnu/libmpi.so.40(ompi_rte_finalize+0x14b) [0x7fd8282980eb]
    /lib/x86_64-linux-gnu/libmpi.so.40(+0x9adfc) [0x7fd82829adfc]
    /lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_instance_finalize+0xbd) [0x7fd82829c3cd]
    /lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_finalize+0x21f) [0x7fd82829447f]
    /usr/bin/python3.13() [0x45b262]
    /usr/bin/python3.13(Py_Exit+0x2f) [0x69abef]
    /usr/bin/python3.13() [0x68858b]
    /usr/bin/python3.13() [0x6883fb]
    /usr/bin/python3.13() [0x67ed22]
    /usr/bin/python3.13() [0x67eb1e]
    /usr/bin/python3.13(Py_RunMain+0x3c1) [0x67d961]
    /usr/bin/python3.13(Py_BytesMain+0x2b) [0x63a72b]
    /lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x7fd82aa7aca8]
    /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7fd82aa7ad65]
    /usr/bin/python3.13(_start+0x21) [0x639ae1]
    make[1]: *** [debian/rules:269: override_dh_auto_test] Error 1

  • From Debian Bug Tracking System@21:1/5 to All on Wed Mar 12 03:40:01 2025
    Processing control commands:

    retitle -1 libopenmpi-dev: mpi4py spawn tests fail with MPI_ERR_UNKNOWN
    Bug #1100120 [libopenmpi-dev] libopenmpi-dev: mpi4py spawn tests get OPAL ERROR: Unreachable in file ../../../ompi/runtime/ompi_mpi_finalize.c at line 286
    Changed Bug title to 'libopenmpi-dev: mpi4py spawn tests fail with MPI_ERR_UNKNOWN' from 'libopenmpi-dev: mpi4py spawn tests get OPAL ERROR: Unreachable in file ../../../ompi/runtime/ompi_mpi_finalize.c at line 286'.
    severity -1 normal
    Bug #1100120 [libopenmpi-dev] libopenmpi-dev: mpi4py spawn tests fail with MPI_ERR_UNKNOWN
    Severity set to 'normal' from 'serious'

    --
    1100120: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1100120
    Debian Bug Tracking System
    Contact owner@bugs.debian.org with problems

  • From Drew Parsons@21:1/5 to All on Wed Mar 12 03:40:01 2025
    Package: libopenmpi-dev
    Followup-For: Bug #1100120
    Control: retitle -1 libopenmpi-dev: mpi4py spawn tests fail with MPI_ERR_UNKNOWN
    Control: severity -1 normal

    The OPAL ERROR or libucs segfault at the end of the tests can be
    avoided by setting OMPI_MCA_btl_tcp_if_include=lo
    (that's with OMPI_ not PRTE_ prefix, see Bug#1087784 for hypre)

    Is it expected that we should need to be setting
    OMPI_MCA_btl_tcp_if_include=lo in debian/tests (and debian/rules)?
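
    As an illustration of the workaround, the variable can be set in the
    environment of whatever process launches the tests. The following is
    a minimal sketch only: the helper names loopback_env and run_tests
    are hypothetical, and the argv passed to run_tests is whatever test
    command the package actually uses.

```python
import os
import subprocess


def loopback_env(base=None):
    """Return a copy of the environment restricting the TCP BTL to `lo`."""
    env = dict(os.environ if base is None else base)
    # OMPI_ prefix, not PRTE_, as noted above.
    env["OMPI_MCA_btl_tcp_if_include"] = "lo"
    return env


def run_tests(argv):
    """Run a test command (hypothetical argv) with the loopback-only setting."""
    return subprocess.run(argv, env=loopback_env(), check=False).returncode
```

    The helper copies the environment rather than modifying os.environ,
    so the setting only affects the child process that runs the tests.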

    But that still leaves the spawn test failures with MPI_ERR_UNKNOWN
    (and the PMIx errors). Updating the bug title accordingly.

    Looking into the history of spawn tests, spawn is known to give
    problems. Some problems with mpich were dealt with previously:
    https://github.com/pmodels/mpich/issues/7073
    https://github.com/mpi4py/mpi4py/issues/541

    Indeed mpi4py has assumed that openmpi spawn tests fail, and has an
    explicit skip checking the openmpi version in test_spawn.py

    This skip condition was recently changed. Previously spawn tests were
    skipped for all openmpi 5.0.x versions with
    @unittest.skipMPI('openmpi(>=5.0.0,<5.1.0)', skip_spawn())
    Recently that was relaxed (https://github.com/mpi4py/mpi4py/pull/601)
    to
    @unittest.skipMPI('openmpi(>=5.0.0,<5.0.7)', skip_spawn())

    And debian's openmpi has just updated to 5.0.7. So mpi4py is now
    running the spawn tests where previously it was skipping them,
    resulting in the error reported here.
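
    The effect of the changed bound can be checked with plain version
    arithmetic. This is only a sketch of the half-open range comparison,
    not mpi4py's skipMPI implementation; parse_version and
    in_half_open_range are hypothetical helpers.

```python
def parse_version(text):
    """Turn a dotted version string such as '5.0.7' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def in_half_open_range(version, lower, upper):
    """True when lower <= version < upper (upper bound excluded)."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


for v in ("5.0.6", "5.0.7"):
    old = in_half_open_range(v, "5.0.0", "5.1.0")   # previous skip range
    new = in_half_open_range(v, "5.0.0", "5.0.7")   # relaxed skip range
    print(v, "skipped before:", old, "skipped now:", new)
```

    With the relaxed range, 5.0.7 falls outside the skip condition because
    the upper bound is exclusive, whereas 5.0.6 is still skipped.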

    I'm not sure if mpi4py upstream meant the condition to say "<=5.0.7"
    rather than "<5.0.7", or was wrong about spawn now working properly in
    openmpi 5.0.7. But we can see that it is not working, so mpi4py will
    want to continue skipping spawn tests for now.

    In summary, this is a known issue in openmpi, so I'm downgrading the
    severity to normal.
