• Bug#1102612: mpich 4.3 not initialising multiple processes

    From Drew Parsons@21:1/5 to All on Fri Apr 11 02:00:01 2025
    Package: mpich
    Version: 4.3.0-5
    Severity: serious
    Justification: debci

    I apologise for another serious bug, but mpich 4.3 is doing weird
    things that we don't want in trixie. I see the problem in mpich test
    errors in armci-mpi (https://buildd.debian.org/status/fetch.php?pkg=armci-mpi&arch=amd64&ver=0.4-5&stamp=1744327219&raw=0 )
    but can reproduce in a trivial test.

    The problem is that mpich is not initialising multiple processes.
    Instead it is simply launching multiple single processes (each with MPI_Comm_size = 1).

    You can see the problem in the armci-mpi test errors, e.g.

    FAIL: benchmarks/ping-pong
    ==========================

    [1744327153.607644] [sbuild:19884:0] sock.c:513 UCX WARN unable to read somaxconn value from /proc/sys/net/core/somaxconn file
    [0] ARMCI Error: This benchmark should be run on at least two processes Abort(1) on node 0 (rank 0 in comm 496): application called MPI_Abort(comm=0x84000000, 1) - process 0
    [0] ARMCI Error: This benchmark should be run on at least two processes [1744327153.612861] [sbuild:19883:0] sock.c:513 UCX WARN unable to read somaxconn value from /proc/sys/net/core/somaxconn file
    Abort(1) on node 0 (rank 0 in comm 496): application called MPI_Abort(comm=0x84000000, 1) - process 0
    FAIL benchmarks/ping-pong (exit status: 1)

    The error message "at least two processes" is issued by armci-mpi's ping-pong.c,
    when it detects MPI_Comm_size = 1. But the test is launched with
    mpiexec.mpich -np 2 (that's why the error is repeated twice).

    I can reproduce the issue with a trivial test:
    ```
    $ cat mpich_test.c
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int me, nproc;

        MPI_Init(&argc, &argv);

        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        printf("mpi test rank %d of %d\n", me, nproc);
        MPI_Finalize();

        return 0;
    }
    $ mpicc.mpich -o mpich_test mpich_test.c
    $ mpiexec.mpich -n 4 ./mpich_test
    mpi test rank 0 of 1
    mpi test rank 0 of 1
    mpi test rank 0 of 1
    mpi test rank 0 of 1
    ```

    It should instead report something like this (in any order):
    mpi test rank 3 of 4
    mpi test rank 1 of 4
    mpi test rank 0 of 4
    mpi test rank 2 of 4
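
    The difference between the two outputs can be checked mechanically.
    The following is only a minimal sketch, assuming the exact output
    format of the trivial test above; the helper name all_singletons is
    made up for illustration and is not part of mpich or armci-mpi.

    ```c
    #include <stdio.h>

    /* Hypothetical helper: returns 1 when all `n` launched processes report
     * themselves as "rank 0 of 1", i.e. each one initialised as an
     * independent singleton instead of joining a shared MPI_COMM_WORLD. */
    static int all_singletons(const char *const lines[], int n)
    {
        if (n < 2)
            return 0;               /* a single process of size 1 is legitimate */
        for (int i = 0; i < n; i++) {
            int rank, size;
            if (sscanf(lines[i], "mpi test rank %d of %d", &rank, &size) != 2)
                return 0;           /* unrecognised line: make no claim */
            if (rank != 0 || size != 1)
                return 0;           /* at least one process saw a real world */
        }
        return 1;
    }

    int main(void)
    {
        /* Output observed with mpich 4.3.0-5 in this report. */
        const char *const broken[] = {
            "mpi test rank 0 of 1", "mpi test rank 0 of 1",
            "mpi test rank 0 of 1", "mpi test rank 0 of 1",
        };
        /* Output expected from a working launch with -n 4. */
        const char *const healthy[] = {
            "mpi test rank 3 of 4", "mpi test rank 1 of 4",
            "mpi test rank 0 of 4", "mpi test rank 2 of 4",
        };

        printf("broken run:  %s\n",
               all_singletons(broken, 4) ? "singleton launch" : "ok");
        printf("healthy run: %s\n",
               all_singletons(healthy, 4) ? "singleton launch" : "ok");
        return 0;
    }
    ```

    The check deliberately says nothing when it meets a line it does not
    recognise, so stray warnings in the output (such as the UCX ones in
    the armci-mpi log) do not produce a false alarm.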


    There is even more weirdness, however. The first time I compiled and
    ran this trivial test, it did report having 4 processes, but that
    correct output was accompanied by pmix warnings:
    Query for unrecognized attribute: pmix.qry.node
    Query for unrecognized attribute: pmix.qry.peers

    After recompiling the same way, it no longer gave the correct output,
    and the pmix warnings disappeared as well.

    Can you reproduce this problem?


    -- System Information:
    Debian Release: trixie/sid
    APT prefers unstable-debug
    APT policy: (500, 'unstable-debug'), (500, 'unstable'), (1, 'experimental')
    Architecture: amd64 (x86_64)
    Foreign Architectures: i386

    Kernel: Linux 6.12.21-amd64 (SMP w/8 CPU threads; PREEMPT)
    Kernel taint flags: TAINT_PROPRIETARY_MODULE, TAINT_OOT_MODULE
    Locale: LANG=en_AU.UTF-8, LC_CTYPE=en_AU.UTF-8 (charmap=UTF-8), LANGUAGE=en_AU:en
    Shell: /bin/sh linked to /usr/bin/dash
    Init: systemd (via /run/systemd/system)
    LSM: AppArmor: enabled

    Versions of packages mpich depends on:
    ii hwloc 2.12.0-1
    ii libc6 2.41-6
    ii libhwloc15 2.12.0-1
    ii libmpich12 4.3.0-5
    ii libslurm42t64 24.11.3-2
    ii perl 5.40.1-2

    Versions of packages mpich recommends:
    ii libmpich-dev 4.3.0-5

    Versions of packages mpich suggests:
    ii mpich-doc 4.3.0-5

    -- no debconf information

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Drew Parsons@21:1/5 to All on Fri Apr 11 17:20:01 2025
    Package: mpich
    Followup-For: Bug #1102612

    There is evidence in the ga (libglobalarrays-dev) build logs that
    libucx0 might be the problem, or a problem:

    https://buildd.debian.org/status/fetch.php?pkg=ga&arch=amd64&ver=5.9.1-1&stamp=1744381093&raw=0

    e.g. FAIL: global/testing/pgtest
    copy is OK

    Checking scatter/gather (might be slow)...
    [sbuild:90178:0:90178] ucp_request.c:212 Assertion `ucs_async_check_owner_thread(&(worker)->async)' failed
    ==== backtrace (tid: 90178) ====
    0 /lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2bc) [0x7fc32aa2564c]
    1 /lib/x86_64-linux-gnu/libucs.so.0(ucs_fatal_error_message+0xb6) [0x7fc32aa231f6]
    2 /lib/x86_64-linux-gnu/libucs.so.0(ucs_fatal_error_format+0x11a) [0x7fc32aa2331a]
    3 /lib/x86_64-linux-gnu/libucp.so.0(ucp_request_release+0x1a7) [0x7fc32aab7487]
    4 /lib/x86_64-linux-gnu/libmpich.so.12(+0x349457) [0x7fc32b3ff457]
    5 /lib/x86_64-linux-gnu/libmpich.so.12(+0x349685) [0x7fc32b3ff685]
    6 /lib/x86_64-linux-gnu/libucp.so.0(ucp_am_rndv_process_rts+0x17f) [0x7fc32aa9fb9f]
    7 /lib/x86_64-linux-gnu/libucp.so.0(ucp_rndv_rts_handler+0xbd) [0x7fc32ab20b2d]
    8 /lib/x86_64-linux-gnu/libuct.so.0(+0x21fa9) [0x7fc32a93cfa9]
    9 /lib/x86_64-linux-gnu/libuct.so.0(uct_self_ep_am_bcopy+0x7e) [0x7fc32a93d6ce]
    10 /lib/x86_64-linux-gnu/libucp.so.0(ucp_am_rndv_proto_progress+0x468) [0x7fc32aa8b1a8]
    11 /lib/x86_64-linux-gnu/libucp.so.0(ucp_am_send_nbx+0x9aa) [0x7fc32aa9b01a]
    12 /lib/x86_64-linux-gnu/libmpich.so.12(+0x17d593) [0x7fc32b233593]
    13 /lib/x86_64-linux-gnu/libmpich.so.12(+0x17f378) [0x7fc32b235378]
    14 /lib/x86_64-linux-gnu/libmpich.so.12(MPI_Get_accumulate+0x2fc) [0x7fc32b235e3c]
    15 ./global/testing/pgtest.x(+0xcd40f) [0x55c644f5140f]
    16 ./global/testing/pgtest.x(+0xd3693) [0x55c644f57693]
    17 ./global/testing/pgtest.x(+0xd38ba) [0x55c644f578ba]
    18 ./global/testing/pgtest.x(+0xd3aaf) [0x55c644f57aaf]
    19 ./global/testing/pgtest.x(+0x929db) [0x55c644f169db]
    20 ./global/testing/pgtest.x(+0xfa8e) [0x55c644e93a8e]
    21 ./global/testing/pgtest.x(+0x111e5) [0x55c644e951e5]
    22 ./global/testing/pgtest.x(+0x117c3) [0x55c644e957c3]
    23 ./global/testing/pgtest.x(+0x4d05) [0x55c644e88d05]
    24 /lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x7fc32aebcca8]
    25 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7fc32aebcd65]
    26 ./global/testing/pgtest.x(+0x4d31) [0x55c644e88d31]
    =================================

    Program received signal SIGABRT: Process abort signal.

  • From Drew Parsons@21:1/5 to All on Mon Apr 14 17:30:01 2025
    Package: mpich
    Followup-For: Bug #1102612

    Is this upstream bug related to the problem? https://github.com/pmodels/mpich/issues/7064#issuecomment-2301026290

    It suggests the pmix build configuration might be an issue.
    The timing of that bug doesn't match well, though: it was filed in
    August 2024, before we had mpich 4.3.

    On the other hand, pmix support was added to the debian mpich package
    in 4.3.0-2. And Ubuntu 24.04 (the version in the upstream bug report)
    has mpich 4.2.0-5build3, which did have pmix support activated.

    Debian deactivated pmix support in mpich 4.2.0-5.1 (it had been
    enabled in 4.2.0-1); evidently Ubuntu didn't apply the update to
    deactivate it.

    There is more discussion from Ubuntu at https://bugs.launchpad.net/ubuntu/+source/mpich/+bug/2072338
    though no actual fix for it, just a reference to the pmix deactivation
    in 4.2.0-5.1.

    Sounds like pmix configuration is still an issue for mpich.
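
    One way to see which process-manager interface the launcher is
    actually offering is to look at the environment each launched process
    receives. The following is only a rough sketch: it assumes the
    conventional variable names (PMI_RANK and PMI_SIZE from a PMI-1 style
    launcher such as Hydra, PMIX_RANK and PMIX_NAMESPACE from a PMIx
    server), and the enum and function names are invented for
    illustration. It does not confirm that a launcher/library mismatch is
    the cause of this bug.

    ```c
    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical classification of what the launcher advertises. */
    enum iface { IFACE_NONE = 0, IFACE_PMI = 1, IFACE_PMIX = 2 };

    /* Decide from two environment values (NULL means "unset") which
     * interface the launcher appears to provide. PMIx is reported first
     * when both are present. */
    static enum iface launcher_interface(const char *pmi_rank,
                                         const char *pmix_rank)
    {
        if (pmix_rank != NULL)
            return IFACE_PMIX;
        if (pmi_rank != NULL)
            return IFACE_PMI;
        return IFACE_NONE;
    }

    int main(void)
    {
        /* Assumed variable names; adjust if the launcher uses others. */
        const char *names[] = {
            "PMI_RANK", "PMI_SIZE", "PMIX_RANK", "PMIX_NAMESPACE",
        };
        const char *labels[] = { "none", "PMI", "PMIx" };

        for (size_t i = 0; i < sizeof names / sizeof names[0]; i++) {
            const char *value = getenv(names[i]);
            printf("%s=%s\n", names[i], value ? value : "(unset)");
        }

        enum iface found =
            launcher_interface(getenv("PMI_RANK"), getenv("PMIX_RANK"));
        printf("launcher interface: %s\n", labels[found]);
        return 0;
    }
    ```

    Compiled with a plain C compiler (no MPI needed) and started through
    mpiexec.mpich -n 2, it would show what each process is handed. If
    the launcher advertises one interface while libmpich was built to
    expect another, that could plausibly explain every process reporting
    MPI_Comm_size = 1, but that remains a guess.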

  • From Debian Bug Tracking System@21:1/5 to Drew Parsons on Wed Apr 16 13:20:01 2025
    This is a multi-part message in MIME format...

    Your message dated Wed, 16 Apr 2025 12:05:13 +0100
    with message-id <d91d2a47-69d2-4c35-8924-dbcf9ba098ef@mckinstry.ie>
    and subject line Re: Bug#1102068: libfabric: FTBFS on 32-bit arches: ofi_cma.h: error: passing argument 2 of 'ofi_consume_iov' from incompatible pointer type
    has caused the Debian Bug report #1102612,
    regarding mpich 4.3 not initialising multiple processes
    to be marked as done.

    This means that you claim that the problem has been dealt with.
    If this is not the case it is now your responsibility to reopen the
    Bug report if necessary, and/or fix the problem forthwith.

    (NB: If you are a system administrator and have no idea what this
    message is talking about, this may indicate a serious mail system
    misconfiguration somewhere. Please contact owner@bugs.debian.org
    immediately.)


    --
    1102612: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1102612
    Debian Bug Tracking System
    Contact owner@bugs.debian.org with problems

  • From Debian Bug Tracking System@21:1/5 to All on Sun Apr 20 16:50:01 2025
    Processing control commands:

    found -1 4.3.0-5
    Bug #1102612 {Done: Alastair McKinstry <alastair.mckinstry@mckinstry.ie>} [mpich] mpich 4.3 not initialising multiple processes
    Did not alter found versions and reopened.
    fixed -1 4.3.0-6
    Bug #1102612 [mpich] mpich 4.3 not initialising multiple processes
    Marked as fixed in versions mpich/4.3.0-6.

    --
    1102612: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1102612
    Debian Bug Tracking System
    Contact owner@bugs.debian.org with problems

  • From Drew Parsons@21:1/5 to All on Sun Apr 20 16:50:01 2025
    Package: mpich
    Version: 4.3.0-5
    Followup-For: Bug #1102612
    Control: found -1 4.3.0-5
    Control: fixed -1 4.3.0-6

    Updating tags (the fixed version is in experimental, the version in
    unstable is not yet updated).
