• Bug#1104269: linux-image-6.12.22-amd64: amdgpu flip_done and commit wai

    From Salvatore Bonaccorso@21:1/5 to Abdurahman Elmawi on Mon Apr 28 06:30:01 2025
    Control: tags -1 + moreinfo
    Control: severity -1 important

    Hi,

    On Sun, Apr 27, 2025 at 09:23:09PM -0500, Abdurahman Elmawi wrote:
    Package: src:linux
    Version: 6.12.22-1
    Severity: critical
    Tags: upstream
    Justification: breaks unrelated software

    Dear Maintainer,

    While running Debian Trixie (Testing) with kernel 6.12.22-1, I encountered a serious amdgpu driver issue resulting in a full system lockup.

    What led up to the situation:
    - System is running Debian Trixie, with KDE Plasma desktop environment (installed via tasksel).
    - I had a large number of windows open and was heavily multitasking.
    - Suddenly, the system completely froze.
    - Attempting to switch to a TTY (Ctrl+Alt+F4) took a very long time before any
    text appeared.
    - I eventually killed the KDE session and re-logged in, after which the system
    returned to normal.

    What exactly did you do (or not do) that was effective (or ineffective):
    - Switching to TTY eventually worked (after long delay).
    - Killing the KDE session recovered the system without a reboot.

    What was the outcome of this action:
    - I lost all unsaved data.
    - After re-logging in, system worked normally again.
    - The issue has not reoccurred so far.

    What outcome did you expect instead:
    - The system should not lock up during normal multitasking.
    - GPU/display updates should not cause hard stalls or flip timeouts.

    Relevant logs from dmesg:
    ```
    amdgpu 0000:0b:00.0: [drm] *ERROR* [CRTC:85:crtc-0] flip_done timed out
    amdgpu 0000:0b:00.0: [drm] *ERROR* [CRTC:85:crtc-0] commit wait timed out
    amdgpu 0000:0b:00.0: [drm] *ERROR* [PLANE:82:plane-7] commit wait timed out
    WARNING: CPU: 3 PID: 994 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:8646 amdgpu_dm_atomic_commit_tail
    ```
    Tracebacks point to amdgpu_dm_atomic_commit_tail.

    Hardware details:
    - Motherboard: ASUS TUF GAMING X570-PLUS (Wi-Fi)
    - GPU: AMD Radeon RX 6600
    - Kernel: 6.12.22-amd64
    - Firmware: Up-to-date firmware-amd-graphics package installed.

    Notes:
    - This problem may be related to other recent AMDGPU instability reports on 6.12.x, but this is distinct: no PCIe AER errors were present, and no VRAM leaks were observed prior to the crash.
    - The bug appears in normal operation, not just after suspend/resume.

    We recently uploaded 6.12.25-1 to unstable, which contains further
    amdgpu-related fixes (and this kernel is targeted at trixie). Could
    you please update to 6.12.25 and see if you still encounter the
    amdgpu instability?

    The difficulty here will likely be if you cannot reproduce the issue.
    The behaviour you described, and the long switching time to a tty,
    might as well indicate other components in userspace hogging memory
    and the system swapping, maybe finally invoking the OOM killer.

    If you still have the system running after this recovery, having the
    full kernel log attached would be helpful. At least from the excerpt
    you posted it seems that the system came online after a suspend. But
    having the logs from around when the lockup happened would be (maybe)
    helpful.
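
    A minimal sketch of how a saved kernel log could be triaged before
    attaching it, to see whether the amdgpu timeouts sit next to OOM-killer
    or suspend/resume messages. The function names (`classify`, `report`)
    and the file name `kernel.log` are hypothetical, and the patterns are
    assumptions based on the messages quoted above and on the usual wording
    of kernel OOM and suspend messages; they are not an exhaustive list.

```python
import re

# Illustrative patterns only (assumptions, not a complete catalogue):
# - amdgpu_timeout: the DRM errors quoted in this report
# - oom: typical kernel OOM-killer wording
# - suspend: typical "PM: suspend entry/exit" markers
PATTERNS = {
    "amdgpu_timeout": re.compile(
        r"\[drm\] \*ERROR\* .*(flip_done|commit wait) timed out"
    ),
    "oom": re.compile(r"Out of memory:|invoked oom-killer"),
    "suspend": re.compile(r"PM: suspend (entry|exit)"),
}


def classify(lines):
    """Return {category: [matching lines]} for an iterable of log lines."""
    hits = {name: [] for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name].append(line.rstrip("\n"))
    return hits


def report(path):
    """Print a per-category summary of the log file at `path`."""
    with open(path, errors="replace") as fh:
        hits = classify(fh)
    for name, matched in hits.items():
        print(f"{name}: {len(matched)} line(s)")
        for entry in matched:
            print("  " + entry)
```

    Assuming systemd-journald is in use and the system has not been
    rebooted since the lockup, the log could be saved with
    `journalctl -k -b > kernel.log` and then inspected with
    `report("kernel.log")`. The full, unfiltered log is still what should
    be attached to the bug.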

    Regards,
    Salvatore

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)