• Bug#1106105: dask.distributed flaky autopkgtest on ppc64el

    From Rebecca N. Palmer@21:1/5 to All on Tue Jul 15 14:50:01 2025
    dask.distributed has both an explicit autopkgtest that is marked as
    flaky and a pybuild-autopkgtest that isn't; they run mostly the same
    tests. (The git log says this is to have both a needs-internet and a
    non-needs-internet test, but the set of tests also differs because they
    have different Depends. In particular, the test_serialize_scipy_sparse
    failures
    (https://github.com/dask/distributed/commit/94222c0fc49c3ad14353611ecdc2c699b97bf8d4)
    are *not* part of this bug, because that test is only run by the
    marked-as-flaky autopkgtest.) Both of them *already* try 5 times (see
    debian/run-tests).

    The debci logs seem to have two kinds of pybuild-autopkgtest failure:

    - Scheduler.computations contains multiple entries (the exact number
    varies) where only one is expected. Can occur in
    tests/test_client.py::test_computation_object_code_client_submit_list_comp,
    tests/test_client.py::test_computation_object_code_client_submit_dict_comp,
    and/or tests/test_computations.py::test_computations_futures (an
    illustrative sketch of this assertion pattern follows this list).
    e.g. https://ci.debian.net/packages/d/dask.distributed/testing/ppc64el/60904054/

    - Timeout in tests/test_tls_functional.py::test_retire_workers.
    e.g. https://ci.debian.net/packages/d/dask.distributed/testing/ppc64el/60794087/
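
    To make the first failure mode concrete: the affected tests assert that
    exactly one computation is recorded on the scheduler after a single
    round of work. The following is an illustrative reconstruction of that
    assertion pattern, not the actual upstream test code; the test name is
    made up, and it assumes distributed's gen_cluster test helper keeps its
    upstream signature.

    from distributed.utils_test import gen_cluster


    @gen_cluster(client=True)
    async def test_single_computation_recorded(c, s, a, b):
        # Run one piece of work to completion.
        result = await c.submit(lambda x: x + 1, 1)
        assert result == 2

        # Scheduler.computations is expected to hold exactly one entry
        # here; on ppc64el it sometimes contains several, which is the
        # "multiple entries where only one is expected" failure above.
        computations = list(s.computations)
        assert len(computations) == 1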

    This package already has a mechanism for excluding tests on some
    architectures (debian/get-test-exclusions), so we could skip these tests
    on ppc64el, but I don't yet know if we want to.
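
    For illustration only, a per-architecture skip along those lines could
    look roughly like the hypothetical conftest.py hook below. This is not
    the contents of debian/get-test-exclusions; it only sketches the same
    effect, using the tests seen failing in the debci logs above.

    import platform

    import pytest

    # Tests observed failing on ppc64el in the debci logs above.
    PPC64EL_FLAKY = (
        "test_client.py::test_computation_object_code_client_submit_list_comp",
        "test_client.py::test_computation_object_code_client_submit_dict_comp",
        "test_computations.py::test_computations_futures",
        "test_tls_functional.py::test_retire_workers",
    )


    def pytest_collection_modifyitems(config, items):
        # Debian's ppc64el architecture reports itself as "ppc64le" here.
        if platform.machine() != "ppc64le":
            return
        skip = pytest.mark.skip(reason="flaky on ppc64el (Debian #1106105)")
        for item in items:
            if any(item.nodeid.endswith(name) for name in PPC64EL_FLAKY):
                item.add_marker(skip)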

  • From Diane Trout@21:1/5 to Rebecca N. Palmer on Tue Jul 15 19:40:01 2025
    On Tue, 2025-07-15 at 12:54 +0100, Rebecca N. Palmer wrote:
    dask.distributed has both an explicit autopkgtest that is marked as
    flaky and a pybuild-autopkgtest that isn't; they run mostly the same
    tests. (The git log says this is to have both a needs-internet and a
    non-needs-internet test, but the set of tests also differs because they
    have different Depends. In particular, the test_serialize_scipy_sparse
    failures
    (https://github.com/dask/distributed/commit/94222c0fc49c3ad14353611ecdc2c699b97bf8d4)
    are *not* part of this bug, because that test is only run by the
    marked-as-flaky autopkgtest.) Both of them *already* try 5 times (see
    debian/run-tests).

    Yes.

    I found a fix for the scipy test here:
    https://github.com/dask/distributed/pull/8977
    and pushed it into the dask.distributed repository; it makes the
    float32 errors with scipy go away.

    and/or tests/test_computations.py::test_computations_futures.
    e.g. https://ci.debian.net/packages/d/dask.distributed/testing/ppc64el/60904054/

    - Timeout in tests/test_tls_functional.py::test_retire_workers.
    e.g. https://ci.debian.net/packages/d/dask.distributed/testing/ppc64el/60794087/


    The other issue looks like a race condition starting a distributed
    worker cluster using TLS. Sometimes initialization takes a little too
    long and the test connects before the cluster is ready.

    For example, running one test with this loop in an sbuild chroot on an
    x86_64 laptop:

    for a in $(seq 30); do
        runuser -u sbuild -- \
            python3 -m pytest -k test_nanny \
            distributed/tests/test_tls_functional.py \
            --pdb --capture=no
    done

    I'd get 1-3 failures out of 30 runs; I observed either a timeout error
    or a TLS protocol error like this:

    2025-07-15 16:42:14,661 - distributed.comm.tcp - WARNING - Listener on
    'tls://127.0.0.1:41241': TLS handshake failed with remote
    'tls://127.0.0.1:46086': [SSL: UNEXPECTED_EOF_WHILE_READING] EOF
    occurred in violation of protocol (_ssl.c:1029)

    I could get test_nanny to never throw an error in the loop by adding
    await asyncio.sleep(0.1) into the function before it started trying to
    use the cluster, but I'm not sure that's necessary given all the pytest
    and autopkgtest configuration to rerun flaky tests.
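
    A minimal sketch of that workaround, assuming upstream's gen_tls_cluster
    helper from distributed.utils_test and an illustrative test body rather
    than the real test_nanny:

    import asyncio

    from distributed import Nanny
    from distributed.utils_test import gen_tls_cluster


    @gen_tls_cluster(client=True, Worker=Nanny)
    async def test_nanny(c, s, a, b):
        # Hypothetical change: give the TLS listeners a moment to finish
        # coming up before the test starts using the cluster.
        await asyncio.sleep(0.1)

        # Illustrative body; the real upstream test exercises more of the
        # nanny/worker lifecycle.
        result = await c.submit(lambda x: x + 1, 10)
        assert result == 11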

    Should I push a new dask.distributed with the scipy fix and see if the
    current flaky test handling is sufficient?

    Diane

  • From Rebecca N. Palmer@21:1/5 to Diane Trout on Tue Jul 15 23:20:01 2025
    On 15/07/2025 18:13, Diane Trout wrote:

    , but I'm not sure that's necessary given all the pytest
    and autopkgtest configuration to rerun flaky tests.

    Should I push a new dask.distributed with the scipy fix and see if the current flaky test handling is sufficient?

    No, because this already shows it isn't.

    There are 2+ issues here:
    - (this: Debian#1106105, RC) several tests randomly fail on ppc64el
    - (upstream#8977, no Debian number, a bug but non-RC) test_serialize_scipy_sparse always fails everywhere

    and two existing levels of flaky handling:
    - Both autopkgtests try 5 times. This ignores some random failures, but
    is not sufficient to reliably pass on ppc64el.
    - The explicit autopkgtest, but not the pybuild autopkgtest, is marked
    flaky at the debian/tests/control level, which means 'run but ignore'.
    Because only the explicit autopkgtest has a scipy dependency, the
    test_serialize_scipy_sparse failure is already being ignored.

    I agree that this isn't a _good_ way to handle the scipy issue and that
    we should apply upstream's fix at *some* point, but it's too far into
    the freeze to do that now.

    I could get test_nanny to never throw an error in the loop by adding
    await asyncio.sleep(0.1) into the function before it started trying to
    use the cluster

    That's plausibly a better idea, but I haven't tried it.

  • From Diane Trout@21:1/5 to Rebecca N. Palmer on Wed Jul 16 00:50:01 2025
    On Tue, 2025-07-15 at 22:08 +0100, Rebecca N. Palmer wrote:

    I could get test_nanny to never throw an error in the loop by adding
    await asyncio.sleep(0.1) into the function before it started trying to
    use the cluster

    That's plausibly a better idea, but I haven't tried it.

    Unfortunately it doesn't work on the ppc64el porterbox
    platti.debian.org.

    I still get a fair number of timeout test failures when running
    test_nanny in a loop.

    I did manage to capture the log from a little earlier, around where it
    starts to look different in a failing case.

    When it's about to fail I get something like the following. The first
    error is worker-handle-scheduler-connection-broken; then it waits for
    the nanny to shut down, and after that times out it starts to spew
    stack traces and complain about 0-byte TLS responses.

    2025-07-15 22:25:26,920 - distributed.core - INFO - Connection to tls://127.0.0.1:51942 has been closed.
    2025-07-15 22:25:26,920 - distributed.scheduler - INFO - Remove worker addr: tls://127.0.0.1:46743 name: 0 (stimulus_id='handle-worker-cleanup-1752618326.9205813')
    2025-07-15 22:25:26,921 - distributed.core - INFO - Starting established connection to tls://127.0.0.1:42399
    2025-07-15 22:25:26,922 - distributed.core - INFO - Connection to tls://127.0.0.1:42399 has been closed.
    2025-07-15 22:25:26,922 - distributed.worker - INFO - Stopping worker at tls://127.0.0.1:46743. Reason: worker-handle-scheduler-connection-broken
    2025-07-15 22:25:26,969 - distributed.nanny - INFO - Closing Nanny gracefully at 'tls://127.0.0.1:35207'. Reason: worker-handle-scheduler-connection-broken
    2025-07-15 22:25:26,970 - distributed.worker - INFO - Removing Worker plugin shuffle
    2025-07-15 22:25:26,971 - distributed.nanny - INFO - Worker closed
    2025-07-15 22:25:28,974 - distributed.nanny - ERROR - Worker process died unexpectedly
    2025-07-15 22:25:29,076 - distributed.nanny - INFO - Closing Nanny at 'tls://127.0.0.1:35207'. Reason: nanny-close-gracefully
    2025-07-15 22:25:29,077 - distributed.nanny - INFO - Nanny at 'tls://127.0.0.1:35207' closed.
