• Bug#1107521: ath12k_pci errors and loss of connectivity in 6.12.y branc

    From Robin Murphy@1:229/2 to Baochen Qiang on Fri Jun 27 12:40:01 2025
    XPost: linux.debian.bugs.dist
    From: robin.murphy@arm.com

    +Vasant

    On 2025-06-27 6:39 am, Baochen Qiang wrote:
    [+ IOMMU list]

    On 6/27/2025 12:21 AM, Matt Mower wrote:
    Dear maintainer,

    I have been experiencing lost network connection with the ath12k_pci driver
    in the linux-6.12.y kernel branch. Often, when the issue occurs, the
    network does not recover until I reboot the computer. A full report of the
    errors I encounter, the symptoms that arise, and several dmesg attachments
    are in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1107521 . I have
    attached a dmesg from 6.12.34 for convenience. The short summary is:

    1. I started noticing log lines like the following soon after boot when I
    updated from 6.12.22 to 6.12.27. After these events occur, the network goes
    down and often does not come back up.
    ath12k_pci 0000:c2:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
    domain=0x0010 address=0xfea00000 flags=0x0020]
    2. I was able to reproduce this issue very rarely in 6.12.12 and 6.12.22.
    The issue always occurs soon after boot in 6.12.27, 6.12.30, 6.12.33, and
    6.12.34.
    3. I have not reproduced the issue in 6.15.2 or 6.15.3.
    4. In some cases, when shutting down the computer, a kernel bug caused my
    computer to hang. I haven't determined whether this is related to the issue
    above or an independent issue. Search the bug report
    for PXL_20250611_140820085.jpg to see a picture of the kernel bug on my
    laptop screen.
    5. I have tested two firmware versions:
    a. fw_version 0x1108811c fw_build_timestamp 2025-05-17 00:21 fw_build_id
    QC_IMAGE_VERSION_STRING=WLAN.HMT.1.1.c5-00284.1-QCAHMTSWPL_V1.0_V2.0_SILICONZ-3
    b. fw_version 0x100301e1 fw_build_timestamp 2023-12-06 04:05 fw_build_id
    QC_IMAGE_VERSION_STRING=WLAN.HMT.1.0.c5-00481-QCAHMTSWPL_V1.0_V2.0_SILICONZ-3
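
For anyone collecting many of these events, the IO_PAGE_FAULT line quoted in item 1 can be split into its fields mechanically. The following is only a minimal sketch: the regular expression is written against the one line quoted in this report, and the names `FAULT_RE` and `parse_fault` are made up for illustration. Other IOMMU drivers, or other kernel versions, may print faults in a different shape.

```python
import re

# Pattern written against the single AMD-Vi line quoted in the report.
# Assumption: other fault messages may not follow this exact layout.
FAULT_RE = re.compile(
    r"(?P<driver>\S+) "
    r"(?P<device>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"AMD-Vi: Event logged \[IO_PAGE_FAULT\s+"
    r"domain=0x(?P<domain>[0-9a-f]+)\s+"
    r"address=0x(?P<address>[0-9a-f]+)\s+"
    r"flags=0x(?P<flags>[0-9a-f]+)\]",
    re.IGNORECASE,
)


def parse_fault(text):
    """Return the fields of an IO_PAGE_FAULT line as a dict, or None."""
    m = FAULT_RE.search(text)
    if m is None:
        return None
    return {
        "driver": m["driver"],
        "device": m["device"],
        "domain": int(m["domain"], 16),
        "address": int(m["address"], 16),
        "flags": int(m["flags"], 16),
    }


# The line from the report (wrapped across two lines in the mail).
line = (
    "ath12k_pci 0000:c2:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT\n"
    "domain=0x0010 address=0xfea00000 flags=0x0020]"
)
fault = parse_fault(line)
print(fault)
```

Converting the hex fields to integers makes it easy to compare the fault address against addresses printed elsewhere with different zero padding.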

    Thanks,
    Matt


    I had a quick test with 6.12.27 kernel on both my Intel desktop and AMD RD but didn't hit
    the issue. And I am using WLAN.HMT.1.1.c5-00284.1-QCAHMTSWPL_V1.0_V2.0_SILICONZ-3.

    As mentioned in the Debian bug report, since reverting ath12k patches does not fix this
    issue, maybe it comes from the IOMMU subsystem?

    Faults are usually still indicative of the client driver/subsystem doing
    something not quite right - racily performing dma_unmap before the
    device has actually finished making accesses; mapping the wrong size
    such that the device accesses off the end of the mapping (this can often
    run into another valid mapping so not necessarily fault); mapping the
    wrong DMA direction such that the device then tries to write to a
    read-only page. However I suppose it's not impossible that some fix to
    amd-iommu in that period might have changed its behaviour in a way that
    exacerbates things - Vasant, does this strike a chord with anything
    you're aware of?

    A couple more things I'd try on the ath12k side: firstly, boot with
    "iommu.strict=1" and see if that makes the faults any more
    frequent/reproducible; if a fault is fairly easily reproducible, then
    use the DMA API and/or IOMMU API tracepoints to compare the fault
    address to prior DMA mapping activity - that can usually reveal the
    nature of the bug enough to then know what to go looking for.
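
The comparison of a fault address against prior mapping activity can be sketched as a small classifier. This is an illustration under stated assumptions, not a tool from this thread: `Event`, `classify_fault`, the `writable` field and the `slack` window are hypothetical names, the events are assumed to have already been extracted from a trace in time order up to the fault, and the sample events are invented. It only distinguishes the three failure modes listed above (early unmap, wrong size, wrong direction).

```python
from dataclasses import dataclass


@dataclass
class Event:
    op: str                # "map" or "unmap"
    iova: int              # start of the DMA address range
    size: int              # length in bytes
    writable: bool = True  # False if the device may only read the range


def classify_fault(events, fault_addr, is_write, slack=4096):
    """Classify a fault against map/unmap events recorded before it."""
    live = {}      # iova -> Event for ranges still mapped at fault time
    released = []  # ranges that were mapped and later unmapped
    for ev in events:
        if ev.op == "map":
            live[ev.iova] = ev
        elif ev.op == "unmap":
            old = live.pop(ev.iova, None)
            if old is not None:
                released.append(old)

    # Fault inside a range that is still mapped.
    for m in live.values():
        if m.iova <= fault_addr < m.iova + m.size:
            if is_write and not m.writable:
                return "wrong-direction"  # device wrote to a read-only range
            return "live"                 # mapped; look at permissions/flags

    # Fault inside a range that had already been unmapped.
    for m in reversed(released):
        if m.iova <= fault_addr < m.iova + m.size:
            return "use-after-unmap"

    # Fault shortly after the end of a live range.
    for m in live.values():
        end = m.iova + m.size
        if end <= fault_addr < end + slack:
            return "overrun"

    return "unknown"


# Invented history, for illustration only.
history = [
    Event("map", 0x1000, 0x800),
    Event("map", 0x4000, 0x1000, writable=False),
    Event("map", 0x8000, 0x1000),
    Event("unmap", 0x8000, 0x1000),
]
for addr in (0x8010, 0x4010, 0x1800, 0x100000):
    print(hex(addr), classify_fault(history, addr, is_write=True))
```

A result of "unknown" would suggest the address never came from the DMA API at all, which is a different kind of bug than the three cases above.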

    I wouldn't put much significance in whatever happens *after* the fault -
    presumably the driver is assuming the blocked DMA write has completed,
    so then goes on to read some incomplete descriptor as if it were valid,
    and thus may fall over in all manner of entertaining ways on bogus data.

    Thanks,
    Robin.

    --- SoupGate-Win32 v1.05
    * Origin: you cannot sedate... all the things you hate (1:229/2)
  • From Matt Mower@1:229/2 to All on Wed Jul 2 07:30:01 2025
    XPost: linux.debian.bugs.dist
    From: mowerm@gmail.com

    A couple more things I'd try on the ath12k side: firstly, boot with
    "iommu.strict=1" and see if that makes the faults any more
    frequent/reproducible;

    The issue is easy enough to reproduce in 6.12.27 onward and I may be
    mistaken about the rarity in 6.12.22; I reproduced it relatively
    quickly in .22 today, so if this was the primary purpose for setting
    iommu.strict=1, then testing with or without strict works. FWIW, I did
    test iommu.strict=1 with 6.15.3 and still have not reproduced this
    issue there.

    if a fault is fairly easily reproducible, then
    use the DMA API and/or IOMMU API tracepoints to compare the fault
    address to prior DMA mapping activity - that can usually reveal the
    nature of the bug enough to then know what to go looking for.

    This is unfamiliar territory for me, so I hope the following is at
    least close to what you requested. If not, happy to provide more test
    results based on a set of instructions. Here's what I did:

    1. Set CONFIG_DMA_API_DEBUG=y
    2. Set kernel command line to: iommu.strict=1 log_buf_len=100M dma_debug_driver=ath12k_pci trace_event=dma:*,iommu:*
    3. Booted and waited for page fault, then cat'd
    /sys/kernel/tracing/trace to a file.

    Additionally, though I'm pretty sure this is irrelevant now, I added
    logging after each dma_map_single() in the ath12k driver to print the
    function name and resultant address to the kernel log.

    Comparing the addresses of several io_page_fault lines in the trace
    and in the kernel log, they line up. So, I'm hopeful this is on the
    right track.
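
Lining up addresses between the kernel log and the saved trace can also be done with a short script. The sketch below makes no claim about the real tracepoint output format: it picks up any hex-looking token of six or more digits, with or without a "0x" prefix, and compares by value so that zero padding does not matter. The names `hex_values` and `lines_mentioning` are hypothetical, and the three "placeholder-event" lines are made-up stand-ins, not real ftrace output. Long decimal numbers can be misread as hex, so hits should be checked by eye.

```python
import re

# Any run of 6+ hex digits, optionally prefixed with 0x.
# Assumption: addresses of interest have at least 6 hex digits.
HEX_RE = re.compile(r"\b(?:0x)?([0-9a-fA-F]{6,})\b")


def hex_values(line):
    """Return the set of integer values of hex-looking tokens in a line."""
    return {int(tok, 16) for tok in HEX_RE.findall(line)}


def lines_mentioning(address, lines):
    """Return the lines that contain the given address in any padding."""
    return [ln for ln in lines if address in hex_values(ln)]


# Fault address taken from the log line quoted in the original report.
fault_address = 0xfea00000

# Made-up placeholder lines standing in for a saved trace file.
trace_lines = [
    "placeholder-event-a iova=0x00000000fea00000 size=4096",
    "placeholder-event-b iova=0x00000000fe900000 size=4096",
    "placeholder-event-c dma_addr=fea00000 size=2048",
]

hits = lines_mentioning(fault_address, trace_lines)
for ln in hits:
    print(ln)
```

To run it against a real capture, replace `trace_lines` with the lines read from the file that /sys/kernel/tracing/trace was saved to.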

    DMA/IOMMU trace: https://cmphys.com/ath12k/iommu_dma_trace-20250701.log
    Kernel log with additional logging: https://cmphys.com/ath12k/dmesg-6.12.35-20250701.log
    Diff showing extra logging added to v6.12.35: https://cmphys.com/ath12k/ath12k-extra-logging-6.12.35-20250701.diff

    Thanks,
    Matt

  • From Matt Mower@1:229/2 to All on Wed Jul 2 17:00:02 2025
    XPost: linux.debian.bugs.dist
    From: mowerm@gmail.com

    Matt, could you help enable verbose ath12k log to verify my guess?

    Here are kernel logs with ath12k debugging enabled:
    1. WLAN.HMT.1.0.c5-00481-QCAHMTSWPL_V1.0_V2.0_SILICONZ-3
    https://cmphys.com/ath12k/dmesg-6.12.35-ath12kdebug-fw0x100301e1-20250702.log
    2. WLAN.HMT.1.1.c5-00284.1-QCAHMTSWPL_V1.0_V2.0_SILICONZ-3
    https://cmphys.com/ath12k/dmesg-6.12.35-ath12kdebug-fw0x1108811c-20250702.log

    I captured these after setting CONFIG_ATH12K_DEBUG=y and running
    "echo 0xffffffff > /sys/module/ath12k/parameters/debug_mask" during boot
    (using @reboot in crontab).
