• Bug#1106397: debvm: flaky autopkgtest: exp3 not open

    From Helmut Grohne@21:1/5 to Paul Gevers on Mon May 26 07:10:01 2025
    Hi Paul,

    On Sat, May 24, 2025 at 12:03:27PM +0200, Paul Gevers wrote:
    I looked at the results of the autopkgtest of your package because it was showing up in the excuses for dpkg. I noticed that it regularly fails on amd64. Maybe you need to wait longer before deciding that starting the VM failed?

    I'm not sure extending the timeout cuts it. For the record, the timeout is currently 5 minutes from start to login prompt. Shouldn't that work?

    https://ci.debian.net/packages/d/debvm/testing/amd64/60842802/

    The actual expect invocation started at 354s, so the time from expect invocation to test termination is 16 seconds, far below the 300-second timeout.


    369s send: spawn id exp3 not open
    369s while executing
    369s "send "echo 6coF0JBW\$((2+3))\r""
    369s (file "./tests/shell_interaction.expect" line 7)
    369s + cleanup
    369s + rm -f ssh_id ssh_id.pub test.img
    370s autopkgtest [04:17:32]: test debefivm-root: -----------------------]

    The error message hints at: https://sources.debian.org/src/expect/5.45.4-4/exp_command.c/?hl=209#L209

    This suggests that qemu might have died. Also the location suggests that
    we did see a root shell prompt already.

    In another test, the error is different:

    https://ci.debian.net/packages/d/debvm/testing/amd64/60944789/#L9985 in debefivm-root
    900s /usr/bin/debvm-waitssh: timeout reached trying to contact 127.0.0.1:2222 after waiting 540 seconds.
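
The timeout behaviour in that log line can be illustrated with a small polling loop. This is only a sketch of the general technique, not the actual debvm-waitssh implementation; the function name `wait_until` and the one-second polling interval are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper, not taken from debvm: retry a command once per
# second until it succeeds or the timeout (in seconds) has expired.
wait_until() {
    timeout_s=$1
    shift
    deadline=$(( $(date +%s) + timeout_s ))
    until "$@"; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            echo "timeout reached after waiting $timeout_s seconds" >&2
            return 1
        fi
        sleep 1
    done
    return 0
}
```

Something like `wait_until 540 nc -z 127.0.0.1 2222` would poll the forwarded port, although a successful TCP connect alone does not prove that sshd inside the guest is answering. If qemu hangs, no timeout value makes such a loop succeed.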

    One thing I notice here is that most amd64 tests run with 64 CPUs. debvm
    assigns the guest as many CPUs as the host has so that you can compile
    inside at full capacity. With lots of CPUs and TCG emulation, this
    incurs quite a bit of slowness.

    How about limiting concurrency to at most four virtual CPUs? That should
    at least make the amd64 test noticeably faster.
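
A minimal sketch of such a cap, assuming the guest CPU count ends up in qemu's `-smp` option. The helper name `cap_cpus` is hypothetical, and how debvm itself implements the limit is not shown in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print the smaller of the host CPU count ($1)
# and a cap ($2, defaulting to 4).
cap_cpus() {
    host_cpus=$1
    max_cpus=${2:-4}
    if [ "$host_cpus" -gt "$max_cpus" ]; then
        echo "$max_cpus"
    else
        echo "$host_cpus"
    fi
}

# nproc (GNU coreutils) reports the CPUs available to this process.
guest_cpus=$(cap_cpus "$(nproc)")
echo "would start the guest with: -smp $guest_cpus"
```

On a 64-CPU worker this yields `-smp 4`, while a 2-CPU host keeps both of its CPUs.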

    How do you suggest I validate a prospective change fixing the autopkgtest?
    The tests (and more) run with few problems both in salsa-ci and locally,
    so the best I can do here is throw changes at the archive and hope
    that they fix it. Would you agree to me doing an upload that limits
    concurrency and closes the flakiness bug, with the idea that you'd reopen
    it if that doesn't fix it?

    Helmut

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Helmut Grohne@21:1/5 to Helmut Grohne on Wed May 28 11:00:01 2025
    Hi Paul,

    On Sun, May 25, 2025 at 09:15:17PM +0200, Helmut Grohne wrote:
    One thing I notice here is that most amd64 tests run with 64 CPUs. debvm
    assigns the guest as many CPUs as the host has so that you can compile
    inside at full capacity. With lots of CPUs and TCG emulation, this
    incurs quite a bit of slowness.

    How about limiting concurrency to at most four virtual CPUs? That should
    at least make the amd64 test noticeably faster.

    I uploaded 0.4.2 limiting concurrency to 4 CPUs at most. I scheduled 10
    runs on amd64 (https://ci.debian.net/user/helmutg/jobs). Two of them
    failed in debefivm and not in debvm. It's still flaky, but maybe less so
    (or I got lucky). Given the output, I'm quite sure that qemu actually
    hangs and that increasing any timeout does not buy us anything. To me,
    this feels like qemu and/or Linux being randomly buggy.

    I note that I have never reproduced the specific failure mode outside
    ci.d.n.

    Any suggestions for how to move forward from here?

    Helmut

  • From Helmut Grohne@21:1/5 to Paul Gevers on Thu May 29 22:40:01 2025
    Control: clone -1 -2
    Control: reassign -2 qemu-system-x86
    Control: retitle -2 qemu-system-x86_64: system/physmem.c:2680: iotlb_to_section: Assertion `section_index < d->map.sections_nb' failed.
    Control: severity -2 important
    Control: block -1 by -2
    Control: affects -1 + debvm

    On Wed, May 28, 2025 at 08:57:54PM +0200, Paul Gevers wrote:
    If you think it could help you debug, I (or terceiro) can give you access to a testbed on the ci.d.n infra (while I'm on-line).

    Paul got me a testbed and I learned the following things:

    The crazy error message from expect means that it got -EIO while trying
    to read(2) data from the ptmx slave file descriptor. It then closes the
    fd and gives up.

    Causing load on the host CPU makes the issue more likely to reproduce.
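
The thread does not say how the host load was generated. One simple way to produce it, sketched here with hypothetical helper names `start_load` and `stop_load`, is to run busy loops in the background while the test executes.

```shell
#!/bin/sh
# Hypothetical helpers: the load generator used on the testbed is not
# described in the thread, this is just one way to keep CPUs busy.
LOAD_PIDS=""

# Start $1 background busy loops and remember their process IDs.
start_load() {
    n=$1
    i=0
    while [ "$i" -lt "$n" ]; do
        ( while :; do :; done ) &
        LOAD_PIDS="$LOAD_PIDS $!"
        i=$((i + 1))
    done
}

# Terminate and reap all busy loops started by start_load.
stop_load() {
    for pid in $LOAD_PIDS; do
        kill "$pid" 2>/dev/null || true
        wait "$pid" 2>/dev/null || true
    done
    LOAD_PIDS=""
}
```

A reproduction attempt could call `start_load "$(nproc)"`, run the failing test, and then call `stop_load`. Call both functions directly in the same shell rather than inside a command substitution, since they keep their state in `LOAD_PIDS`.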

    In the non-expect variant for debvm (non-efi), I saw the following
    error:

    qemu-system-x86_64: system/physmem.c:2680: iotlb_to_section: Assertion `section_index < d->map.sections_nb' failed.

    A matching core file is available at https://people.debian.org/~elbrus/tmp/qemu-debvm_core.zst

    Helmut
