• continuing GPU woes

    From Eben King@21:1/5 to All on Mon Jan 27 19:40:02 2025
    Hi. I have one of these:
    eben@cerberus:~$ nvidia-detect
    Detected NVIDIA GPUs:
    01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM204 [GeForce GTX 970] [10de:13c2] (rev a1)

    Checking card: NVIDIA Corporation GM204 [GeForce GTX 970] (rev a1)
    Your card is supported by all driver versions.
    Your card is also supported by the Tesla 470 drivers series.
    It is recommended to install the
    nvidia-driver
    package.

    on here
    eben@cerberus:~$ cat /etc/debian_version
    12.9

    Now, the fan doesn't usually come on on this card, so the only way to keep
    it at a reasonable temperature is to have the side cover off and disable a
    monitor. Sometimes even that's not enough and I have to switch to the
    console until it cools down. So, definitely suboptimal. If replacing it
    were the only solution and I could afford to do it, I would. But anyhow.

    The installed nvidia-* packages are:
    eben@cerberus:~$ apt list --installed nvidia\*
    Listing... Done
    nvidia-alternative/stable,now 535.216.01-1~deb12u1 amd64 [installed,automatic]
    nvidia-detect/stable,now 535.216.01-1~deb12u1 amd64 [installed]
    nvidia-driver-bin/stable,now 535.216.01-1~deb12u1 amd64 [installed,automatic]
    nvidia-driver-libs/stable,now 535.216.01-1~deb12u1 amd64 [installed]
    nvidia-driver/stable,now 535.216.01-1~deb12u1 amd64 [installed]
    nvidia-egl-common/stable,now 535.216.01-1~deb12u1 amd64 [installed]
    nvidia-egl-icd/stable,now 535.216.01-1~deb12u1 amd64 [installed,automatic]
    nvidia-installer-cleanup/stable,now 20220217+3~deb12u1 amd64 [installed,automatic]
    nvidia-kernel-common/stable,now 20220217+3~deb12u1 amd64 [installed,automatic]
    nvidia-kernel-dkms/stable,now 535.216.01-1~deb12u1 amd64 [installed,automatic]
    nvidia-kernel-support/stable,now 535.216.01-1~deb12u1 amd64 [installed,automatic]
    nvidia-legacy-check/stable,now 535.216.01-1~deb12u1 amd64 [installed,automatic]
    nvidia-modprobe/stable,now 535.161.07-1~deb12u1 amd64 [installed,automatic]
    nvidia-persistenced/stable,now 535.171.04-1~deb12u1 amd64 [installed,automatic]
    nvidia-settings/stable,now 535.171.04-1~deb12u1 amd64 [installed,automatic]
    nvidia-smi/stable,now 535.216.01-1~deb12u1 amd64 [installed,automatic]
    nvidia-support/stable,now 20220217+3~deb12u1 amd64 [installed,automatic]
    nvidia-suspend-common/stable,now 535.216.01-1~deb12u1 amd64 [installed,automatic]
    nvidia-vdpau-driver/stable,now 535.216.01-1~deb12u1 amd64 [installed,automatic]
    nvidia-vulkan-common/stable,now 535.216.01-1~deb12u1 amd64 [installed,automatic]
    nvidia-vulkan-icd/stable,now 535.216.01-1~deb12u1 amd64 [installed,automatic]

    Is there anything on that list that looks suspicious, or anything *not* on
    that list that would explain my problem?

    The times I have seen it act normally (i.e., run the fan at 10-20% to hold
    the temperature in the mid-60s, and allow me to use all monitors) are after
    I've been away from the computer long enough for the monitors to go to
    sleep. My hypothesis is that something the OS does in that sleep-wake cycle
    makes the nvidia driver shape up and fly right. Is it just setting DPMS to
    some higher value (4 or 5) after some time and setting it back to 1 when I
    return? I've done that using ddccontrol and the driver's behavior doesn't
    change. Maybe there's a minimum time for the monitors to be asleep, or the
    GPU has to cool to below a certain temperature?
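
One way to test the sleep-wake hypothesis without waiting for the idle timer is to force the X server's DPMS state and compare readings before and after. This is only a sketch. It assumes an X session where xset is available and that nvidia-smi accepts the temperature.gpu and fan.speed query fields. The function names snapshot and dpms_cycle and the 600-second default are made up for illustration. Forcing DPMS through the X server may not take the same path as setting the power mode over DDC with ddccontrol, so the result could differ.

```shell
#!/bin/sh
# snapshot LABEL: hypothetical helper. Prints "LABEL: <temp>, <fan>" using
# nvidia-smi's CSV query mode (temperature in degrees C, fan speed in percent).
snapshot() {
    printf '%s: %s\n' "$1" \
        "$(nvidia-smi --format=csv,noheader,nounits --query-gpu=temperature.gpu,fan.speed)"
}

# dpms_cycle [SECONDS]: hypothetical helper. Blanks the monitors through the
# X server for SECONDS (default 600, an arbitrary guess), wakes them again,
# and prints a reading on either side so the two can be compared.
dpms_cycle() {
    off_seconds=${1:-600}
    snapshot before
    xset dpms force off
    sleep "$off_seconds"
    xset dpms force on
    snapshot after
}
```

Running dpms_cycle 60, then dpms_cycle 600, and so on would show whether there is a minimum sleep time. Keyboard or mouse input during the wait will probably wake the monitors early.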

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anssi Saari@21:1/5 to All on Tue Jan 28 10:00:01 2025
    Eben King <eben@gmx.us> writes:

    I don't know if there's more history to this issue but a couple of
    things come to mind.

    > Checking card: NVIDIA Corporation GM204 [GeForce GTX 970] (rev a1)
    > Your card is supported by all driver versions.
    > Your card is also supported by the Tesla 470 drivers series.

    What about this as an alternative? nvidia-tesla-470-driver instead of nvidia-driver?

    And what about controlling the fan(s) manually? See for example https://askubuntu.com/questions/42494/how-can-i-change-the-nvidia-gpu-fan-speed

    Note the possible caveats though.

  • From eben@gmx.us@21:1/5 to Anssi Saari on Tue Jan 28 17:30:01 2025
    On 1/28/25 03:50, Anssi Saari wrote:
    > Eben King <eben@gmx.us> writes:
    >
    > I don't know if there's more history to this issue but a couple of
    > things come to mind.
    >
    >> Checking card: NVIDIA Corporation GM204 [GeForce GTX 970] (rev a1)
    >> Your card is supported by all driver versions.
    >> Your card is also supported by the Tesla 470 drivers series.
    >
    > What about this as an alternative? nvidia-tesla-470-driver instead of
    > nvidia-driver?
    >
    > And what about controlling the fan(s) manually? See for example
    > https://askubuntu.com/questions/42494/how-can-i-change-the-nvidia-gpu-fan-speed

    Good guess, but that was weird. The GPU was working today
    for reasons unknown, but I figured "what could be the harm in setting the
    fan to a higher value?". So I tried.

    eben@cerberus:~$ sudo nvidia-xconfig --cool-bits=4
    [sudo] password for eben:
    sudo: nvidia-xconfig: command not found

    Apparently nvidia-xconfig is not in root's $PATH. Just for kicks I went
    into nvidia-settings (as me), checked "Enable GPU Fan Setting", went to
    40-some% and hit Apply. The fan immediately went wonky, returning random
    values while presumably not actually running, because the GPU temp slowly
    rose.
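
A "command not found" under sudo can also mean the program is not installed at all, rather than merely missing from $PATH. A minimal sketch for telling the two apart follows. The helper name find_cmd and the list of sbin directories are assumptions for illustration, not anything the driver packages provide.

```shell
#!/bin/sh
# find_cmd NAME: hypothetical helper. Prints the first executable called NAME
# found via $PATH or in the usual admin directories; returns 1 if none exists.
find_cmd() {
    name=$1
    # Look in the current PATH first.
    if path=$(command -v "$name" 2>/dev/null); then
        printf '%s\n' "$path"
        return 0
    fi
    # Then try directories that an ordinary user's PATH may leave out.
    for dir in /usr/local/sbin /usr/sbin /sbin; do
        if [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    return 1
}
```

If find_cmd nvidia-xconfig prints nothing, the tool is most likely absent. On Debian it appears to ship in a separate nvidia-xconfig package, which is not in the installed list above; apt list nvidia-xconfig should confirm whether that holds for this release.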

    Well, the command's for a different OS so YMMV. But one of the comments
    said "This will generate a completely new xorg.conf and also adds
    Option "Coolbits" "4" to Section "Screen"." Since I currently have no
    xorg.conf, I wondered why manual fan speed setting was enabled. I used to
    have an xorg.conf to which I'd added that, but things happened. A bit of
    poking around showed it got renamed to

    -rw-r--r-- 1 root root 1226 Aug 31 16:47 /etc/X11/xorg.conf.nvidia

    so I did

    lrwxrwxrwx 1 root root 16 Jan 28 10:13 /etc/X11/xorg.conf -> xorg.conf.nvidia

    and restarted X. No dice, fan's still returning crazy values.
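
Having the symlink in place does not by itself show which Coolbits value the file carries. A small sketch for reading it back follows. The helper name coolbits_value is hypothetical, and the pattern assumes the usual Option "Coolbits" "N" spelling on a single line.

```shell
#!/bin/sh
# coolbits_value FILE: hypothetical helper. Prints the numeric value of every
# uncommented Option "Coolbits" line in an xorg.conf-style file, one per line.
# Prints nothing if the option is absent.
coolbits_value() {
    sed -n \
        -e '/^[[:space:]]*#/d' \
        -e 's/.*[Oo]ption[[:space:]]*"[Cc]ool[Bb]its"[[:space:]]*"\([0-9]*\)".*/\1/p' \
        "$1"
}
```

For example, coolbits_value /etc/X11/xorg.conf should print 4 if the old setting survived the rename. Since manual fan control was already enabled with no xorg.conf at all, it may also be worth running grep -ri coolbits over /etc/X11 and /usr/share/X11/xorg.conf.d, in case a snippet there sets it. Those locations are a guess.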
    FTR, this is what I mean by "crazy values":

    eben@cerberus:~$ ./monitor_fan
    2025-01-28 11:23:57 71%
    2025-01-28 11:23:58 26%
    2025-01-28 11:23:59 0%
    2025-01-28 11:24:00 58%
    2025-01-28 11:24:01 57%
    2025-01-28 11:24:02 103%
    2025-01-28 11:24:03 0%
    2025-01-28 11:24:04 74%
    2025-01-28 11:24:05 18%
    2025-01-28 11:24:06 0%
    2025-01-28 11:24:07 30%
    2025-01-28 11:24:08 75%
    2025-01-28 11:24:09 0%
    2025-01-28 11:24:11 43%
    2025-01-28 11:24:12 34%
    2025-01-28 11:24:13 49%
    2025-01-28 11:24:14 7%

    eben@cerberus:~$ cat monitor_fan
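    #!/bin/sh
    # Print a timestamped line whenever the reported fan speed changes.
    prevSpeed=-1
    nvidia-smi --format=csv,noheader,nounits --query-gpu=fan.speed --loop=1 \
    | while read -r currentSpeed ; do
        # Compare as strings so a non-numeric reading does not break the test.
        if [ "$currentSpeed" != "$prevSpeed" ] ; then
            echo "$(date '+%F %T') ${currentSpeed}%"
            prevSpeed=$currentSpeed
        fi
    done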
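
To make the "crazy values" easier to spot in a long log, the same stream can be passed through a filter that labels each reading. This is a rough sketch: the function name flag_readings is hypothetical, and the default jump threshold of 30 percentage points is an arbitrary choice, not an NVIDIA figure.

```shell
#!/bin/sh
# flag_readings [MAX_JUMP]: hypothetical filter. Reads one fan-speed percentage
# per line on stdin and prints it with a verdict: ok, jump (changed by more
# than MAX_JUMP since the previous numeric reading), out-of-range (above 100),
# or non-numeric.
flag_readings() {
    max_jump=${1:-30}
    prev=
    while read -r speed; do
        case $speed in
            ''|*[!0-9]*)
                # Anything that is not a plain non-negative integer.
                printf '%s non-numeric\n' "$speed"
                continue ;;
        esac
        verdict=ok
        if [ "$speed" -gt 100 ]; then
            verdict=out-of-range
        elif [ -n "$prev" ]; then
            delta=$((speed - prev))
            if [ "$delta" -lt 0 ]; then delta=$((0 - delta)); fi
            if [ "$delta" -gt "$max_jump" ]; then verdict=jump; fi
        fi
        printf '%s%% %s\n' "$speed" "$verdict"
        prev=$speed
    done
}
```

It would be used as nvidia-smi --format=csv,noheader,nounits --query-gpu=fan.speed --loop=1 | flag_readings. A fan that is really spinning up and down should produce few jump lines, while a misreported sensor like the one above would produce them almost every second.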

    > Note the possible caveats though.

    Yeah. Worst case, restore /usr from backup.
