• [gentoo-user] Too many simultaneous install jobs

    From Peter Humphrey@21:1/5 to All on Thu Jun 20 14:50:01 2024
    Hello list,

    While building a new KDE system (see my post a few minutes ago), I'm finding the system stalling because it can't handle all its install jobs. I have this set:

    $ grep '\-j' /etc/portage/make.conf
    EMERGE_DEFAULT_OPTS="--jobs --load-average=30 [...]"
    MAKEOPTS="-j16 -l16"

    The CPU has 24 threads and 64GB RAM, and lots of swap space, and those values have worked well for some time. Now, though, I'm going to have to limit the --jobs or the --load-average.

    On interrupting one such hang, I found that 32 install jobs had been waiting
    to run; is this limit hard coded? I also saw "too many jobs" or something, and "could not read job counter".

    Is it now bug-report time?

    --
    Regards,
    Peter.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jack@21:1/5 to Peter Humphrey on Thu Jun 20 15:30:01 2024
    On 6/20/24 8:46 AM, Peter Humphrey wrote:
    Hello list,

    While building a new KDE system (see my post a few minutes ago), I'm finding the system stalling because it can't handle all its install jobs. I have this set:

    $ grep '\-j' /etc/portage/make.conf
    EMERGE_DEFAULT_OPTS="--jobs --load-average=30 [...]"
    I don't know how much it would matter, but are you missing a number
    after --jobs?
    MAKEOPTS="-j16 -l16"

    The CPU has 24 threads and 64GB RAM, and lots of swap space, and those values have worked well for some time. Now, though, I'm going to have to limit the --jobs or the --load-average.

    On interrupting one such hang, I found that 32 install jobs had been waiting to run; is this limit hard coded? I also saw "too many jobs" or something, and
    "could not read job counter".

    Is it now bug-report time?


  • From Peter Humphrey@21:1/5 to All on Thu Jun 20 15:40:02 2024
    On Thursday, 20 June 2024 14:27:18 BST Jack wrote:
    On 6/20/24 8:46 AM, Peter Humphrey wrote:
    Hello list,

    While building a new KDE system (see my post a few minutes ago), I'm finding the system stalling because it can't handle all its install jobs.
    I have this set:

    $ grep '\-j' /etc/portage/make.conf
    EMERGE_DEFAULT_OPTS="--jobs --load-average=30 [...]"

    I don't know how much it would matter, but are you missing a number
    after --jobs?

    MAKEOPTS="-j16 -l16"

    The CPU has 24 threads and 64GB RAM, and lots of swap space, and those values have worked well for some time. Now, though, I'm going to have to limit the --jobs or the --load-average.

    On interrupting one such hang, I found that 32 install jobs had been waiting to run; is this limit hard coded? I also saw "too many jobs" or something, and "could not read job counter".

    Is it now bug-report time?

    No, that's intended; it's what I meant about limiting it. In fact I have now
    set --jobs=24; let's see how that goes.
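
For reference, a make.conf fragment with the job count made explicit might look like the following. It is only a sketch of the settings mentioned in this thread: the options elided as [...] in the original are left out, and suitable numbers depend on the machine.

```shell
# /etc/portage/make.conf (fragment; values taken from this thread)
# --jobs=24 caps how many packages emerge builds at once;
# a bare --jobs leaves the package count unlimited.
EMERGE_DEFAULT_OPTS="--jobs=24 --load-average=30"
# Per-package limits passed to make: at most 16 jobs, and no new
# jobs started while the load average is 16 or more.
MAKEOPTS="-j16 -l16"
```

Note that the two limits multiply: up to 24 packages, each allowed up to 16 make jobs, with the load-average settings acting as the brake.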

    --
    Regards,
    Peter.

  • From Michael@21:1/5 to All on Thu Jun 20 14:40:12 2024
    On Thursday, 20 June 2024 14:27:18 BST Jack wrote:
    On 6/20/24 8:46 AM, Peter Humphrey wrote:
    Hello list,

    While building a new KDE system (see my post a few minutes ago), I'm finding the system stalling because it can't handle all its install jobs.
    I have this set:

    $ grep '\-j' /etc/portage/make.conf
    EMERGE_DEFAULT_OPTS="--jobs --load-average=30 [...]"

    I don't know how much it would matter, but are you missing a number
    after --jobs?

    Without a number of jobs specified in make.conf emerge will not limit the number of packages it tries to build, except it will not start new jobs while there are at least --load-average=30 running already.

    MAKEOPTS="-j16 -l16"

    The CPU has 24 threads and 64GB RAM, and lots of swap space, and those values have worked well for some time. Now, though, I'm going to have to limit the --jobs or the --load-average.

    On interrupting one such hang, I found that 32 install jobs had been waiting to run; is this limit hard coded?

    I take it the --load-average is what it says, an average, so it will jump
    above the specified number if you have not limited the --jobs number.


    I also saw "too many jobs" or
    something, and "could not read job counter".

    Is it now bug-report time?

    You could set up a swapfile, to avoid OOM situations, while you're tweaking
    the --jobs & --load-average.
  • From Jack@21:1/5 to Peter Humphrey on Thu Jun 20 18:10:01 2024
    On 6/20/24 11:29 AM, Peter Humphrey wrote:
    On Thursday, 20 June 2024 14:40:12 BST Michael wrote:
    On Thursday, 20 June 2024 14:27:18 BST Jack wrote:
    On 6/20/24 8:46 AM, Peter Humphrey wrote:
    While building a new KDE system (see my post a few minutes ago), I'm
    finding the system stalling because it can't handle all its install
    jobs. I have this set:

    $ grep '\-j' /etc/portage/make.conf
    EMERGE_DEFAULT_OPTS="--jobs --load-average=30 [...]"
    I don't know how much it would matter, but are you missing a number
    after --jobs?
    Without a number of jobs specified in make.conf emerge will not limit the
    number of packages it tries to build, except it will not start new jobs
    while there are at least --load-average=30 running already.

    MAKEOPTS="-j16 -l16"
    We went through all this at great length not long ago (months, perhaps: a certain A. McK had returned to the list for a while). /usr/bin/make will stop spawning make jobs once either (a) the number it's running reaches -j16 or (b) the load average of those reaches -l16. Portage sending more tasks to /usr/bin/make simply fills the latter's input queue.

    Again, I don't know if it matters in this case, but my understanding is that MAKEOPTS only affects jobs using make. I don't know if there are equivalent controls for ninja or other build systems. Might that be relevant here? If you run top, limit it to running jobs and show the full command, does that give any hints?

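
One way to get a similar view without top is to filter ps output by process state. The helper below is a hypothetical sketch (the name only_running is not from any tool); it assumes the procps ps found on most Linux systems, where a trailing = in an -o field suppresses the column header.

```shell
# only_running: read "STAT PID ARGS..." lines on stdin and print
# those whose state starts with R (running or runnable).
only_running() {
    awk '$1 ~ /^R/'
}

# Typical use on Linux with procps ps:
#   ps -eo stat=,pid=,args= | only_running
```

With procps top, starting it as top -c -i is usually a rough equivalent: -c toggles the full command line and -i toggles the hiding of idle tasks, though both depend on top's remembered settings.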
    The CPU has 24 threads and 64GB RAM, and lots of swap space, and those values have worked well for some time. Now, though, I'm going to have to limit the --jobs or the --load-average.

    On interrupting one such hang, I found that 32 install jobs had been
    waiting to run; is this limit hard coded?
    It's certainly a suspicious number.

    I take it the --load-average is what it says, an average, so it will jump
    above the specified number if you have not limited the --jobs number.
    See above re. input queue.

    I also saw "too many jobs" or something, and "could not read job
    counter".

    Is it now bug-report time?
    You could set up a swap file, to avoid OOM situations, while you're tweaking the --jobs & --load-average.
    The existing 64GiB swap partition is rarely touched, if ever. I've never seen an OOM error. I haven't touched jobs or loads for many months until today, nor
    have I seen a failure to read a job counter.

    Anyway, it still rankles that I can't use more than half the machine's power because of limits in portage. This can't be the only 64GiB machine in gentoo-land, surely.


  • From Peter Humphrey@21:1/5 to All on Thu Jun 20 17:30:02 2024
    On Thursday, 20 June 2024 14:40:12 BST Michael wrote:
    On Thursday, 20 June 2024 14:27:18 BST Jack wrote:
    On 6/20/24 8:46 AM, Peter Humphrey wrote:
    While building a new KDE system (see my post a few minutes ago), I'm
    finding the system stalling because it can't handle all its install
    jobs. I have this set:

    $ grep '\-j' /etc/portage/make.conf
    EMERGE_DEFAULT_OPTS="--jobs --load-average=30 [...]"

    I don't know how much it would matter, but are you missing a number
    after --jobs?

    Without a number of jobs specified in make.conf emerge will not limit the number of packages it tries to build, except it will not start new jobs
    while there are at least --load-average=30 running already.

    MAKEOPTS="-j16 -l16"

    We went through all this at great length not long ago (months, perhaps: a certain A. McK had returned to the list for a while). /usr/bin/make will stop spawning make jobs once either (a) the number it's running reaches -j16 or (b) the load average of those reaches -l16. Portage sending more tasks to /usr/bin/make simply fills the latter's input queue.
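
As an illustration of the load-average test described above, here is a minimal sketch of such a gate. It is not make's actual implementation; the function name load_allows_new_job is made up for this example. GNU make's -l compares against the system load average, so on Linux the second argument could be taken from the first field of /proc/loadavg.

```shell
# load_allows_new_job LIMIT LOAD
#   exits 0 if LOAD is below LIMIT (a new job may start), 1 otherwise.
load_allows_new_job() {
    # awk does the floating-point comparison that plain sh cannot
    awk -v limit="$1" -v load="$2" 'BEGIN { exit !(load < limit) }'
}

# Example with fixed numbers and a limit of 16, as in MAKEOPTS="-j16 -l16"
if load_allows_new_job 16 12.5; then
    echo "12.5 is below 16: a new job could start"
fi
if ! load_allows_new_job 16 31.2; then
    echo "31.2 is at or above 16: new jobs are held back"
fi
```

To try it against the live system on Linux: load_allows_new_job 16 "$(cut -d' ' -f1 /proc/loadavg)".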

    The CPU has 24 threads and 64GB RAM, and lots of swap space, and those values have worked well for some time. Now, though, I'm going to have to limit the --jobs or the --load-average.

    On interrupting one such hang, I found that 32 install jobs had been waiting to run; is this limit hard coded?

    It's certainly a suspicious number.

    I take it the --load-average is what it says, an average, so it will jump above the specified number if you have not limited the --jobs number.

    See above re. input queue.

    I also saw "too many jobs" or something, and "could not read job counter".

    Is it now bug-report time?

    You could set up a swap file, to avoid OOM situations, while you're tweaking the --jobs & --load-average.

    The existing 64GiB swap partition is rarely touched, if ever. I've never seen an OOM error. I haven't touched jobs or loads for many months until today, nor have I seen a failure to read a job counter.

    Anyway, it still rankles that I can't use more than half the machine's power because of limits in portage. This can't be the only 64GiB machine in gentoo-land, surely.

    --
    Regards,
    Peter.

  • From Wols Lists@21:1/5 to Peter Humphrey on Thu Jun 20 20:40:02 2024
    On 20/06/2024 16:29, Peter Humphrey wrote:
    Anyway, it still rankles that I can't use more than half the machine's power because of limits in portage. This can't be the only 64GiB machine in gentoo-land, surely.

    Well, I think my machine has 4x32GiB slots, and two are full, so that
    makes 64GiB here. They weren't even expensive - about £50 / DIMM iirc.

    Cheers,
    Wol

  • From Michael@21:1/5 to All on Thu Jun 20 20:04:53 2024
    On Thursday, 20 June 2024 16:29:11 BST Peter Humphrey wrote:
    On Thursday, 20 June 2024 14:40:12 BST Michael wrote:
    On Thursday, 20 June 2024 14:27:18 BST Jack wrote:
    On 6/20/24 8:46 AM, Peter Humphrey wrote:
    While building a new KDE system (see my post a few minutes ago), I'm finding the system stalling because it can't handle all its install jobs. I have this set:

    $ grep '\-j' /etc/portage/make.conf
    EMERGE_DEFAULT_OPTS="--jobs --load-average=30 [...]"

    I don't know how much it would matter, but are you missing a number after --jobs?

    Without a number of jobs specified in make.conf emerge will not limit the number of packages it tries to build, except it will not start new jobs while there are at least --load-average=30 running already.

    MAKEOPTS="-j16 -l16"

    We went through all this at great length not long ago (months, perhaps: a certain A. McK had returned to the list for a while). /usr/bin/make will
    stop spawning make jobs once either (a) the number it's running reaches
    -j16 or (b) the load average of those reaches -l16. Portage sending more tasks to /usr/bin/make simply fills the latter's input queue.

    Quite. Make will queue up anything above ~16 jobs, but emerge runs more than just make jobs. More and more emerge processes will kick off, up to ~30.
    Each emerge process will eventually launch make jobs, only for these to join a pile-up in an ever more congested make queue, unable to proceed further. At some point memory allocation and reallocation of queues appears to have become gnarly. Perhaps something in portage's python code leads to a race condition? I don't know whether the queuing of all these parent-child instructions, combined with their parallelism, can create an unchecked race condition; perhaps you reached some memory allocation limit, or indeed hit a bug in the code. These are just loose suppositions of mine, not evidenced by detailed debugging, let alone knowledge of python.


    The CPU has 24 threads and 64GB RAM, and lots of swap space, and those values have worked well for some time. Now, though, I'm going to have to
    limit the --jobs or the --load-average.

    On interrupting one such hang, I found that 32 install jobs had been waiting to run; is this limit hard coded?

    It's certainly a suspicious number.

    Apologies if I'm being dense here - why is it a suspicious number? I see a --load-average of ~30 emerge instigated 'make install' jobs being queued up, while some previous 16 x make jobs are currently being processed.


    I take it the --load-average is what it says, an average, so it will jump above the specified number if you have not limited the --jobs number.

    See above re. input queue.

    I also saw "too many jobs" or something, and "could not read job counter".

    Is it now bug-report time?

    You could set up a swap file, to avoid OOM situations, while you're tweaking the --jobs & --load-average.

    The existing 64GiB swap partition is rarely touched, if ever. I've never
    seen an OOM error. I haven't touched jobs or loads for many months until today, nor have I seen a failure to read a job counter.

    I don't know if counters are stored in memory, with running/completed/failed counts, or on disk. I can't think either DDR4, or an NVMe, would clog up
    their I/O channels, but you clearly witnessed a failure. Could this be a hardware glitch? You'll soon know if it shows up as a repeatable problem.


    Anyway, it still rankles that I can't use more than half the machine's power because of limits in portage. This can't be the only 64GiB machine in
    gentoo-land, surely.

    I use 64G with no swap and MAKEOPTS="-j25 -l24.8"

    I haven't as yet come across a failure like yours, but I rarely try to run
    more than one emerge process at a time on this system. It's fast enough for
    my limited needs without having to increase the number of emerges at a time.

    On another PC which I often use as a binhost with 32G RAM, when I start two separate emerge processes manually with MAKEOPTS="-j10 -l9.8" I see swap being used a bit sometimes, especially when the PC user is hammering the browser
    for hours with many tabs open. Anyway, the MAKEOPTS directives control resource usage without hiccups.
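
Both MAKEOPTS values quoted above set the load limit 0.2 below the job count. A small sketch that reproduces that pattern for any job count follows; the helper name makeopts_for is hypothetical, and the 0.2 offset is simply read off the two examples rather than being any official recommendation.

```shell
# makeopts_for JOBS: print a MAKEOPTS value whose load limit sits
# 0.2 below the job count, matching the two examples in this post.
makeopts_for() {
    awk -v j="$1" 'BEGIN { printf "-j%d -l%.1f\n", j, j - 0.2 }'
}

makeopts_for 25   # prints: -j25 -l24.8
makeopts_for 10   # prints: -j10 -l9.8
```

Whether the job count itself should be derived from the number of CPU threads (for example from nproc) is an assumption the post does not state.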

    Does it make much of a difference in time saved running parallel emerges to require the addition of EMERGE_DEFAULT_OPTS?


  • From Peter Humphrey@21:1/5 to All on Fri Jun 21 13:40:01 2024
    On Thursday, 20 June 2024 20:06:17 BST Eli Schwartz wrote:
    On 6/20/24 8:46 AM, Peter Humphrey wrote:
    On interrupting one such hang, I found that 32 install jobs had been waiting to run; is this limit hard coded? I also saw "too many jobs" or something, and "could not read job counter".

    Is it now bug-report time?

    It's not clear to me what "stalling" means here. Did portage stop doing
    any work as verified by ps or htop? Did it just spend a long time
    showing no progress?

    All activity had stopped: top showed no portage processes at all. I didn't check ps.

    I do know what the 32 install jobs "waiting to run" is though. Or at
    least I'm pretty sure I know what it means.

    Recent portage has this change: https://gitweb.gentoo.org/proj/portage.git/commit/?id=825db01b91a37dcd9890ee5bf9f462ea524ac5cc

    "Add merge-wait FEATURES setting enabled by default"

    From the changelog:

    portage-3.0.62 (2024-02-22)
    --------------

    * FEATURES: Add FEATURES="merge-wait", enabled by default, to control
    whether we do parallel merges of images to the live filesystem (bug
    #663324).

    If enabled, we serialize these merges.

    For now, this makes FEATURES="parallel-install" a no-op, but in
    future, it will be improved to allow parallel merges, just not while
    any packages are compiling.
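
To see whether a given setting such as merge-wait is present in a FEATURES list, a whole-word match is enough. The sketch below uses a hypothetical helper, has_feature, and a hard-coded example list; on a Gentoo system the live value could be obtained with portageq envvar FEATURES.

```shell
# has_feature NAME LIST: exit 0 if NAME appears as a whole word in the
# space-separated LIST, 1 otherwise.
has_feature() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Hard-coded example list, for illustration only
features="merge-wait parallel-fetch sandbox"
if has_feature merge-wait "$features"; then
    echo "merge-wait is enabled"
fi
```

By the usual Portage convention a feature is switched off by prefixing it with a minus sign in make.conf, e.g. FEATURES="-merge-wait"; nothing in this thread confirms that doing so avoids the stall.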

    That isn't what I saw, though. Parallel make jobs ran happily, but none of
    them were installed. Portage then stopped and waited until they had been. Nothing was happening at all (other than background OS tasks: kworker etc.),
    as confirmed by top. I waited a few minutes, then CTRL-C'd it. Instantly, the whole batch of 32 install tasks was released, appearing on the terminal as
    fast as it could display them; they ran in parallel until they'd all finished. I could then restart the emerging of @plasma, which is my set of plasma packages to be built on top of xorg; it causes >600 package emerges.

    prh@wstn ~ $ ls /etc/portage/sets
    apps base core plasma utils xorg
    prh@wstn ~ $ wc -l /etc/portage/sets/*
    20 /etc/portage/sets/apps
    32 /etc/portage/sets/base
    10 /etc/portage/sets/core
    34 /etc/portage/sets/plasma
    9 /etc/portage/sets/utils
    13 /etc/portage/sets/xorg
    118 total

    The same thing happened twice more before the whole @plasma set was finished.


    In https://bugs.gentoo.org/934382 portage is adding additional options:

    --jobs-merge-wait-threshold=X will cause portage to stop starting new
    jobs when X number of packages are in pending-merge state, and portage
    will copy the installed package files from the image to the root
    filesystem. Otherwise, portage will get there eventually but it might
    take a bit longer.

    I wonder if that's the culprit. I could test it by starting another system build, if I can disable that threshold and if you need me to - but remember
    the "too many jobs" and "could not read job counter".

    --
    Regards,
    Peter.
