• Re: Uninterruptible sleep apache process while accessing nfs on debian

    From Carlos Llamas@21:1/5 to All on Mon Oct 7 12:40:01 2024
    Hello,

    We are facing this problem too, with 3 backend VMs and an NFS VM.

    The servers are Google Cloud Compute Engine instances with plenty of CPU
    and RAM, and we see no correlation with high CPU or RAM usage on any
    machine when this happens. When the problem occurs, apache2 processes on
    a backend VM (NFS client machine) wait in state D for a long time (I was
    only able to trace one file, and it took 90 seconds until the file was
    unlocked). In a normal situation this doesn't happen and requests flow
    between the backend VMs totally fine.

    But when the problem happens, rebooting the client machine doesn't help:
    as soon as the load balancer sends it traffic, apache gets stuck again.
    We took the machine out of the load balancer to try to isolate it and
    look at the logs, but again, as soon as we added it back to the load
    balancer it got stuck. Only rebooting all the backend VMs and the NFS
    server VM works.

    Because we only face this problem on production machines, we need to
    restart them quickly and have no time to gather further information to
    troubleshoot the issue, so we don't have much idea of what is going on.

    We have the same software versions across all machines and they get
    updated every morning at 6 am. We have been facing this problem for some
    time: around the time the original question was posted, we migrated our
    machines from a very old Debian version to Bookworm via fresh
    installations, and after that we started to see this problem occur.

    Kernel: Linux web04 6.1.0-26-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.112-1 (2024-09-30) x86_64 GNU/Linux
    nfs-kernel-server: 1:2.6.2-4
    apache2: 2.4.62-1~deb12u1
    php 8.3 from deb.sury.org: 8.3.12-1+0~20240927.43+debian12~1.gbpad3b8c

    So I would like to know whether you could help troubleshoot this. If you
    can give me steps to test the problem, or tell me what information to
    gather for a bug report, I will happily do it.

    Thank you.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Mike Kupfer@21:1/5 to Carlos Llamas on Tue Oct 8 00:10:02 2024
    Carlos Llamas wrote:

    When the problem occurs, apache2 processes on a backend VM (NFS client
    machine) wait in state D for a long time (I was only able to trace one
    file, and it took 90 seconds until the file was unlocked).

    It sounds like the process is trying to unlock a file, and the system
    call hangs for 90 seconds. Do I understand you correctly?

    What version of the NFS protocol are you using for the mounts?

    We have the same software versions across all machines and they get
    updated every morning at 6 am.

    Which machines get updated? Just the NFS clients, or the NFS server,
    too? Does the update involve a reboot?

    mike

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Mike Kupfer@21:1/5 to Carlos Llamas on Tue Oct 8 06:00:01 2024
    Carlos Llamas wrote:

    I made a PHP script which locks a specific file in an NFS directory,
    measuring every step involved. Then I exported the results to CSV and
    graphed them in ELK because I am more familiar with it.

    Thanks for the additional details. It sounds like you have a
    reproducible test case, which will help a lot for tracking down the
    problem.
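
    For what it's worth, a minimal sketch of that kind of per-step timing
    script (assuming a flock()-based test against a file on the NFS mount;
    the paths and CSV output here are placeholders, not your actual script)
    might look like:

        <?php
        // Time each step of locking a file on the NFS mount and append
        // the results as CSV lines: timestamp, step, seconds.
        $target = '/mnt/nfs/locktest.dat';     // placeholder path on the share
        $csv    = '/var/tmp/lock-timing.csv';  // local CSV output

        function timed(string $step, callable $fn, string $csv): mixed {
            $t0 = microtime(true);
            $result = $fn();
            $elapsed = microtime(true) - $t0;
            file_put_contents($csv,
                sprintf("%s,%s,%.6f\n", date('c'), $step, $elapsed),
                FILE_APPEND);
            return $result;
        }

        $fh = timed('fopen',  fn() => fopen($target, 'c+'), $csv);
        timed('flock',  fn() => flock($fh, LOCK_EX), $csv);   // exclusive lock
        timed('write',  fn() => fwrite($fh, date('c') . "\n"), $csv);
        timed('unlock', fn() => flock($fh, LOCK_UN), $csv);   // release lock
        timed('fclose', fn() => fclose($fh), $csv);

    Whichever step blows up to ~90 seconds when the problem occurs would
    help narrow things down.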

    The bar chart that you included just shows the total time. Do your
    results show that some operations (e.g., lock/unlock) are particularly
    slow on the web05 (green) system? Or is it more that all the operations
    are slower when compared to the web04 (blue) system?

    I see that Debian includes sosreport, which I *think* reports log
    messages (among other things). I'd try to get log messages from both
    web05 and the NFS server, and look for warning messages that might be
    relevant.

    mike

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)