• Bug#1106105: dask.distributed flaky autopkgtest on ppc64el

    From Rebecca N. Palmer@21:1/5 to All on Tue Jul 15 14:50:01 2025
    dask.distributed has both an explicit autopkgtest that is marked as
    flaky and a pybuild-autopkgtest that isn't; they run mostly the same
    tests. (The git log says this is to have both a needs-internet and a
    non-needs-internet test, but the set of tests also differs because they
    have different Depends. In particular, the test_serialize_scipy_sparse
    failures
    (https://github.com/dask/distributed/commit/94222c0fc49c3ad14353611ecdc2c699b97bf8d4)
    are *not* part of this bug, because that test is only run by the
    marked-as-flaky autopkgtest.) Both of them *already* try 5 times (see
    debian/run-tests).

    The debci logs seem to have two kinds of pybuild-autopkgtest failure:

    - Scheduler.computations contains multiple entries (the exact number
    varies) where only one is expected. Can occur in
    tests/test_client.py::test_computation_object_code_client_submit_list_comp,
    tests/test_client.py::test_computation_object_code_client_submit_dict_comp,
    and/or tests/test_computations.py::test_computations_futures (an
    illustrative sketch of this assertion pattern follows this list).
    e.g. https://ci.debian.net/packages/d/dask.distributed/testing/ppc64el/60904054/

    - Timeout in tests/test_tls_functional.py::test_retire_workers.
    e.g. https://ci.debian.net/packages/d/dask.distributed/testing/ppc64el/60794087/
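
    To make the first failure mode concrete: the affected tests assert that
    exactly one computation is recorded on the scheduler after a single
    round of work. The following is an illustrative reconstruction of that
    assertion pattern, not the actual upstream test code; the test name is
    made up, and it assumes distributed's gen_cluster test helper keeps its
    upstream signature.

    from distributed.utils_test import gen_cluster


    @gen_cluster(client=True)
    async def test_single_computation_recorded(c, s, a, b):
        # Run one piece of work to completion.
        result = await c.submit(lambda x: x + 1, 1)
        assert result == 2

        # Scheduler.computations is expected to hold exactly one entry
        # here; on ppc64el it sometimes contains several, which is the
        # "multiple entries where only one is expected" failure above.
        computations = list(s.computations)
        assert len(computations) == 1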

    This package already has a mechanism for excluding tests on some
    architectures (debian/get-test-exclusions), so we could skip these tests
    on ppc64el, but I don't yet know if we want to.
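
    For illustration only, a per-architecture skip along those lines could
    look roughly like the hypothetical conftest.py hook below. This is not
    the contents of debian/get-test-exclusions; it only sketches the same
    effect, using the tests seen failing in the debci logs above.

    import platform

    import pytest

    # Tests observed failing on ppc64el in the debci logs above.
    PPC64EL_FLAKY = (
        "test_client.py::test_computation_object_code_client_submit_list_comp",
        "test_client.py::test_computation_object_code_client_submit_dict_comp",
        "test_computations.py::test_computations_futures",
        "test_tls_functional.py::test_retire_workers",
    )


    def pytest_collection_modifyitems(config, items):
        # Debian's ppc64el architecture reports itself as "ppc64le" here.
        if platform.machine() != "ppc64le":
            return
        skip = pytest.mark.skip(reason="flaky on ppc64el (Debian #1106105)")
        for item in items:
            if any(item.nodeid.endswith(name) for name in PPC64EL_FLAKY):
                item.add_marker(skip)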

  • From Diane Trout@21:1/5 to Rebecca N. Palmer on Tue Jul 15 19:40:01 2025
    On Tue, 2025-07-15 at 12:54 +0100, Rebecca N. Palmer wrote:
    dask.distributed has both an explicit autopkgtest that is marked as
    flaky and a pybuild-autopkgtest that isn't; they run mostly the same
    tests. (The git log says this is to have both a needs-internet and a
    non-needs-internet test, but the set of tests also differs because they
    have different Depends. In particular, the test_serialize_scipy_sparse
    failures
    (https://github.com/dask/distributed/commit/94222c0fc49c3ad14353611ecdc2c699b97bf8d4)
    are *not* part of this bug, because that test is only run by the
    marked-as-flaky autopkgtest.) Both of them *already* try 5 times (see
    debian/run-tests).

    Yes.

    I found a fix for the scipy test here:
    https://github.com/dask/distributed/pull/8977
    and pushed it into the dask.distributed repository; it makes the
    float32 errors with scipy go away.

    and/or tests/test_computations.py::test_computations_futures.
    e.g. https://ci.debian.net/packages/d/dask.distributed/testing/ppc64el/60904054/

    - Timeout in tests/test_tls_functional.py::test_retire_workers.
    e.g. https://ci.debian.net/packages/d/dask.distributed/testing/ppc64el/60794087/


    The other issue looks like a race condition starting a distributed
    worker cluster using TLS. Sometimes initialization takes a little too
    long and the test connects before the cluster is ready.

    For example, running one test with this loop in an sbuild chroot on an
    x86_64 laptop:

    for a in $(seq 30); do
        runuser -u sbuild -- \
            python3 -m pytest -k test_nanny \
            distributed/tests/test_tls_functional.py \
            --pdb --capture=no
    done

    I'd get 1-3 failures out of 30 runs; I observed either a timeout error
    or a TLS protocol error like this:

    2025-07-15 16:42:14,661 - distributed.comm.tcp - WARNING - Listener on
    'tls://127.0.0.1:41241': TLS handshake failed with remote
    'tls://127.0.0.1:46086': [SSL: UNEXPECTED_EOF_WHILE_READING] EOF
    occurred in violation of protocol (_ssl.c:1029)

    I could get test_nanny to never throw an error in the loop by adding
    await asyncio.sleep(0.1) into the function before it started trying to
    use the cluster, but I'm not sure that's necessary given all the pytest
    and autopkgtest configuration to rerun flaky tests.
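
    A minimal sketch of that workaround, assuming upstream's gen_tls_cluster
    helper from distributed.utils_test and an illustrative test body rather
    than the real test_nanny:

    import asyncio

    from distributed import Nanny
    from distributed.utils_test import gen_tls_cluster


    @gen_tls_cluster(client=True, Worker=Nanny)
    async def test_nanny(c, s, a, b):
        # Hypothetical change: give the TLS listeners a moment to finish
        # coming up before the test starts using the cluster.
        await asyncio.sleep(0.1)

        # Illustrative body; the real upstream test exercises more of the
        # nanny/worker lifecycle.
        result = await c.submit(lambda x: x + 1, 10)
        assert result == 11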

    Should I push a new dask.distributed with the scipy fix and see if the
    current flaky test handling is sufficient?

    Diane

  • From Rebecca N. Palmer@21:1/5 to Diane Trout on Tue Jul 15 23:20:01 2025
    On 15/07/2025 18:13, Diane Trout wrote:

    , but I'm not sure that's necessary given all the pytest
    and autopkgtest configuration to rerun flaky tests.

    Should I push a new dask.distributed with the scipy fix and see if the current flaky test handling is sufficient?

    No, because this already shows it isn't.

    There are 2+ issues here:
    - (this: Debian#1106105, RC) several tests randomly fail on ppc64el
    - (upstream#8977, no Debian number, a bug but non-RC) test_serialize_scipy_sparse always fails everywhere

    and two existing levels of flaky handling:
    - Both autopkgtests try 5 times. This ignores some random failures, but
    is not sufficient to reliably pass on ppc64el.
    - The explicit autopkgtest, but not the pybuild autopkgtest, is marked
    flaky at the debian/tests/control level, which means 'run but ignore'.
    Because only the explicit autopkgtest has a scipy dependency, the
    test_serialize_scipy_sparse failure is already being ignored.

    I agree that this isn't a _good_ way to handle the scipy issue and that
    we should apply upstream's fix at *some* point, but it's too far into
    the freeze to do that now.

    I could get test_nanny to never throw an error in the loop by adding
    await asyncio.sleep(0.1) into the function before it started trying to
    use the cluster

    That's plausibly a better idea, but I haven't tried it.

  • From Diane Trout@21:1/5 to Rebecca N. Palmer on Wed Jul 16 00:50:01 2025
    On Tue, 2025-07-15 at 22:08 +0100, Rebecca N. Palmer wrote:

    I could get test_nanny to never throw an error in the loop by adding
    await asyncio.sleep(0.1) into the function before it started trying to
    use the cluster

    That's plausibly a better idea, but I haven't tried it.

    Unfortunately it doesn't work on the ppc64el porterbox
    platti.debian.org.

    I still get a fair number of timeout test failures when running
    test_nanny in a loop.

    I did manage to capture the log from a little earlier, around where it
    starts to look different in a failing case.

    When it's about to fail I get something like the following. The first
    error is worker-handle-scheduler-connection-broken; then it waits for
    the nanny to shut down, and after that times out it starts to spew
    stack traces and complain about 0-byte TLS responses.

    2025-07-15 22:25:26,920 - distributed.core - INFO - Connection to tls://127.0.0.1:51942 has been closed.
    2025-07-15 22:25:26,920 - distributed.scheduler - INFO - Remove worker addr: tls://127.0.0.1:46743 name: 0 (stimulus_id='handle-worker-cleanup-1752618326.9205813')
    2025-07-15 22:25:26,921 - distributed.core - INFO - Starting established connection to tls://127.0.0.1:42399
    2025-07-15 22:25:26,922 - distributed.core - INFO - Connection to tls://127.0.0.1:42399 has been closed.
    2025-07-15 22:25:26,922 - distributed.worker - INFO - Stopping worker at tls://127.0.0.1:46743. Reason: worker-handle-scheduler-connection-broken
    2025-07-15 22:25:26,969 - distributed.nanny - INFO - Closing Nanny gracefully at 'tls://127.0.0.1:35207'. Reason: worker-handle-scheduler-connection-broken
    2025-07-15 22:25:26,970 - distributed.worker - INFO - Removing Worker plugin shuffle
    2025-07-15 22:25:26,971 - distributed.nanny - INFO - Worker closed
    2025-07-15 22:25:28,974 - distributed.nanny - ERROR - Worker process died unexpectedly
    2025-07-15 22:25:29,076 - distributed.nanny - INFO - Closing Nanny at 'tls://127.0.0.1:35207'. Reason: nanny-close-gracefully
    2025-07-15 22:25:29,077 - distributed.nanny - INFO - Nanny at 'tls://127.0.0.1:35207' closed.
