• Bug#1102062: zookeeper: FTBFS: expected: <1> but was: <0>

    From tony mancill@21:1/5 to Santiago Vila on Sun Jun 22 23:00:01 2025
    On Thu, Jun 19, 2025 at 06:07:01PM +0200, Santiago Vila wrote:
    > severity 1102062 serious
    > thanks
    >
    > Hi. This package FTBFS around 50% of the time here.
    >
    > It may not fail for everybody, but we promised our users that
    > they can rebuild the packages from source. A failure rate
    > of 50% exceeds the common thresholds used by the RT.
    >
    > [ I can still offer a test VM to reproduce if required, but if I
    > was the maintainer, I would just disable the flaky test ].
    >
    > ( Note: The stable version also FTBFS randomly )

    Hi Santiago,

    Thank you for performing these test builds.

    I'm not able to coerce the test or the build to fail on 3 different
    systems in 10 overall attempts. This makes me suspect something related
    to the build environment, possibly either the networking stack or
    CPU. The logs you provide show a fairly simple test that creates a
    ZK client timing out (taking over 60 seconds) because it never
    registers with JMX. Are sockets scarce in your build environment?
    Cores?

    [ERROR] Tests run: 2, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 122.365 s <<< FAILURE! - in org.apache.zookeeper.test.SaslAuthRequiredMultiClientTest

    The tests are getting stuck here during some of your builds: https://salsa.debian.org/java-team/zookeeper/-/blob/master/zookeeper-server/src/test/java/org/apache/zookeeper/test/JMXEnv.java#L100-129

    The test completes quickly in all of my trials:

    [INFO] Running org.apache.zookeeper.test.SaslAuthRequiredMultiClientTest
    [INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.205 s - in org.apache.zookeeper.test.SaslAuthRequiredMultiClientTest
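
    To picture the failure mode described above: a wait for JMX registration
    can be written as a poll-with-deadline loop against the platform MBean
    server. The following is a minimal sketch only and is not the code in
    JMXEnv.java. The class name JmxWaitSketch, the method waitForBean, and
    the org.example.hypothetical domain are made up for illustration; only
    standard javax.management calls are used.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.TimeUnit;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class JmxWaitSketch {

    /**
     * Polls the platform MBean server until at least one bean matches the
     * given name or pattern, or until the timeout expires.
     * Returns true if a match was seen, false on timeout.
     * (Hypothetical helper, not part of ZooKeeper.)
     */
    static boolean waitForBean(ObjectName pattern, long timeoutMs, long pollMs)
            throws InterruptedException {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (true) {
            // queryNames returns the set of registered names matching the pattern.
            if (!server.queryNames(pattern, null).isEmpty()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // the bean never showed up in time
            }
            Thread.sleep(pollMs);
        }
    }

    public static void main(String[] args) throws Exception {
        // A bean that every JVM registers on the platform MBean server.
        ObjectName present = new ObjectName("java.lang:type=Runtime");
        System.out.println("runtime bean found: " + waitForBean(present, 1000, 100));

        // A made-up pattern that nothing registers, so the wait runs to the deadline.
        ObjectName missing = new ObjectName("org.example.hypothetical:type=NeverRegistered,*");
        System.out.println("missing bean found: " + waitForBean(missing, 300, 100));
    }
}
```

    When the expected bean never registers, a loop of this shape always runs
    for the full timeout, which is consistent with a two-test class taking
    about 122 seconds in the failing builds.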

    Please do provide either a VM or more information about the build
    environment where this fails and I will work to get to the bottom of it.
    As far as I can see, there's nothing flaky about the test code.

    If there are specific system requirements for a successful build, we
    will document those. And if we can patch the test to build
    successfully in the environment where it fails, we will do that too.

    However, I don't see the utility in increasing the severity of the bug
    to RC and thereby excluding this package and its multiple reverse
    dependencies from trixie. I think the bug severity should be important.

    Thanks,
    tony

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Santiago Vila@21:1/5 to All on Sun Jun 22 23:40:01 2025
    Note for the record: I've just given Tony access to an AWS machine
    of type m7a.large, which has 2 CPUs and 8 GB of RAM (I know this
    is enough because I monitor /proc/meminfo during the builds).

    In this type of machine the failure rate is around 30%
    so several tries might be necessary to get build failures.

    Thanks.
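
    As a rough guide to how many tries "several" means, here is a small
    sketch of the arithmetic. It assumes that builds are independent and
    that the failure rate is constant, which may not hold for a problem
    that depends on the environment. The class and method names are made
    up for illustration.

```java
public class FlakyBuildOdds {

    /** Probability that every one of `builds` independent builds passes. */
    static double probAllPass(double failureRate, int builds) {
        return Math.pow(1.0 - failureRate, builds);
    }

    /**
     * Smallest number of builds n such that the chance of seeing at least
     * one failure, 1 - (1 - failureRate)^n, reaches `confidence`.
     * Both arguments must lie strictly between 0 and 1.
     */
    static int buildsNeeded(double failureRate, double confidence) {
        if (failureRate <= 0 || failureRate >= 1 || confidence <= 0 || confidence >= 1) {
            throw new IllegalArgumentException("arguments must be in (0, 1)");
        }
        return (int) Math.ceil(Math.log(1.0 - confidence) / Math.log(1.0 - failureRate));
    }

    public static void main(String[] args) {
        // Failure rate reported for the m7a.large machine.
        System.out.println("builds for 95% chance of a failure at 30%: "
                + buildsNeeded(0.30, 0.95));
        // Chance of ten clean builds in a row if the rate really were 50%.
        System.out.println("chance of 10 passes in a row at 50%: "
                + probAllPass(0.50, 10));
    }
}
```

    Under those assumptions, a 30% failure rate needs 9 builds for a 95%
    chance of seeing at least one failure, and ten passes in a row at a 50%
    failure rate would happen with a probability of about 0.1%.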

  • From Paul Gevers@21:1/5 to All on Fri Jul 4 23:20:01 2025
    To: sanvila@debian.org (Santiago Vila)

    Control: tags -1 trixie-ignore

    Hi,

    On Fri, 4 Jul 2025 21:21:19 +0200 Santiago Vila <sanvila@debian.org> wrote:
    > On Fri, Jul 04, 2025 at 06:12:34PM +0000, tony mancill wrote:
    >
    > > Because the failures occur for multiple different tests, I don't think
    > > we should attempt to disable tests 1-by-1, I expect that to become a
    > > game of whack-a-mole.  As you suggested, we should engage with upstream
    > > regarding the Heisentests.  I will work on that.
    >
    > Thanks a lot!

    Please try to find the cause of this, but indeed, we'll ignore this for
    trixie if a solution isn't found soon.

    Paul

  • From tony mancill@21:1/5 to Santiago Vila on Fri Jul 11 07:50:02 2025
    On Fri, Jul 04, 2025 at 09:21:19PM +0200, Santiago Vila wrote:
    > On Fri, Jul 04, 2025 at 06:12:34PM +0000, tony mancill wrote:
    >
    > > Because the failures occur for multiple different tests, I don't think
    > > we should attempt to disable tests 1-by-1, I expect that to become a
    > > game of whack-a-mole. As you suggested, we should engage with upstream
    > > regarding the Heisentests. I will work on that.
    >
    > Thanks a lot!
    >
    > > For the trixie release, we can either request that the bug be ignored
    > > by the Release Managers or I can upload a packaging change to skip
    > > tests during the build by default and then request a freeze exception.
    > >
    > > If anyone has a strong preference, please speak up.
    >
    > Requesting that the bug be ignored seems ok to me, as long as we can
    > apply the fix in trixie once we have such a fix, even if that
    > happens after the release of Debian 13 (i.e. stable-proposed-updates
    > where stable=trixie).

    As a justification for disabling the tests by default, here are failure
    counts by test for a sequence of 50 or so builds. Approximately a 50%
    build failure rate and 10 distinct failing tests (so far):

    9 [ERROR] RequestThrottlerTest.testRequestThrottler:235 expected: <5> but was: <4>
    8 [ERROR] SaslAuthRequiredMultiClientTest.testClientOpWithInvalidSASLPasswordAuthAfterSuccessLogin:76->ClientBase.createClient:185->ClientBase.createClient:190->ClientBase.createClient:205->ClientBase.createClient:224 expected [0x1004bb8efa30001]
        expected: <1> but was: <0>
    5 [ERROR] SnapshotAndRestoreCommandTest.testSnapshotAndRestoreCommand_streaming:168->validateSnapshotMetrics:398 expected: <true> but was: <false>
    3 [ERROR] QuorumPeerMainTest.testLeaderOutOfView:884 expected: <LOOKING> but was: <FOLLOWING>
    3 [ERROR] SessionTest.testSessionStateNoDupStateReporting:294 expected: <[SyncConnected, Disconnected, Expired]> but was: <[SyncConnected, Disconnected]>
    1 [ERROR] WatcherCleanerTest.testDeadWatcherMetrics:168->ZKTestCase.waitForMetric:144->ZKTestCase.waitForMetric:150->ZKTestCase.waitFor:140 metric "max_dead_watchers_cleaner_latency" failed to match after 30 seconds
    1 [ERROR] RestoreQuorumTest.testRestoreAfterQuorumLost:93 expected: <20> but was: <19>
    1 [ERROR] ReconfigRollingRestartCompatibilityTest.testRollingRestartWithHostAddedAndRemoved:317 waiting for server 3 being up ==> expected: <true> but was: <false>
    1 [ERROR] DIFFSyncTest.testLeaderShutdown_AckProposalBeforeAckNewLeader:191 expected: <200000001> but was: <100000003>
    1 [ERROR] CnxManagerTest.testWorkerThreads:511 Mon Jul 07 06:25:51 UTC 2025 Incorrect number of Worker threads for sid=0 expected 4 found 2 ==> expected: <null> but was: <Mon Jul 07 06:25:51 UTC 2025 Incorrect number of Worker threads for sid=0
        expected 4 found 2>
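
    A tally like the one above can be produced from collected build logs.
    The following is a minimal sketch, assuming each failure appears as a
    log line of the form "[ERROR] Class.method:line message". The class
    name FailureTally and the regular expression are illustrative
    assumptions and are not taken from the actual tooling used for this bug.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FailureTally {

    // Assumed line shape: "[ERROR] SomeTest.someMethod:123 ...".
    private static final Pattern FAILURE =
            Pattern.compile("^\\[ERROR\\]\\s+([A-Za-z0-9_$]+\\.[A-Za-z0-9_$]+):\\d+");

    /** Counts failures per "Class.method", most frequent first, ties by name. */
    static Map<String, Long> tally(List<String> logLines) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : logLines) {
            Matcher m = FAILURE.matcher(line.trim());
            if (m.find()) {
                counts.merge(m.group(1), 1L, Long::sum);
            }
        }
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        // LinkedHashMap keeps the sorted order when iterating.
        Map<String, Long> sorted = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : entries) {
            sorted.put(e.getKey(), e.getValue());
        }
        return sorted;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "[ERROR] RequestThrottlerTest.testRequestThrottler:235 expected: <5> but was: <4>",
                "[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0",
                "[ERROR] RestoreQuorumTest.testRestoreAfterQuorumLost:93 expected: <20> but was: <19>",
                "[ERROR] RequestThrottlerTest.testRequestThrottler:235 expected: <5> but was: <4>");
        tally(lines).forEach((test, count) -> System.out.println(count + " " + test));
    }
}
```

    Lines that do not match the assumed shape, such as the "[INFO]" summary
    line in the example, are ignored.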


    So, I have prepared an upload for unstable that skips running the test
    suite by default but permits the builder to enable it by setting an
    environment variable. This is roughly the opposite of
    DEB_BUILD_OPTIONS=nocheck, although I used a different, non-standard
    variable to avoid confusion.

    Any concerns with an upload now? I see that Paul has already tagged the
    bug with trixie-ignore. I am not expecting this to migrate before the
    release (unless the Release Managers feel otherwise).

    I will continue to work on addressing the root cause.

    Thank you,
    tony
