• Re: Uninterruptible sleep apache process while accessing nfs on debian

    From Carlos Llamas@21:1/5 to All on Mon Oct 7 12:40:01 2024
    Hello,

    We are facing this problem too, with 3 backend VMs and an NFS VM.

    The servers are Google Cloud Compute Engine instances with plenty of CPU
    and RAM, and we see no correlation with high CPU or RAM usage on any
    machine when this happens. When the problem occurs, apache2 processes on
    a backend VM (NFS client machine) wait in state D for a long time (I was
    only able to trace one file, and it took 90 seconds until the file was
    unlocked). In a normal situation this doesn't happen and requests flow
    between the backend VMs totally fine.

    But when the problem happens, rebooting the client machine doesn't help:
    as soon as the load balancer sends it traffic, apache gets stuck again.
    We took the machine out of the load balancer to try to isolate it and
    look at the logs, but again, as soon as we added it back to the load
    balancer it got stuck. Only rebooting all the backend VMs and the NFS
    server VM works.

    Because we only face this problem on production machines, we need to
    restart them quickly and have no time to gather further information to
    troubleshoot the issue, so we don't have much idea of what is going on.

    We have the same software versions across all machines and they get
    updated every morning at 6 am. We have been facing this problem for some
    time: around the time the original question was posted, we migrated our
    machines from a very old Debian version to Bookworm via fresh
    installations, and after that we started to see this problem occur.

    Kernel: Linux web04 6.1.0-26-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.112-1 (2024-09-30) x86_64 GNU/Linux
    nfs-kernel-server: 1:2.6.2-4
    apache2: 2.4.62-1~deb12u1
    php 8.3 from deb.sury.org: 8.3.12-1+0~20240927.43+debian12~1.gbpad3b8c

    So I would like to know whether you could help troubleshoot this. If you
    can give me steps to test the problem, or tell me what information to
    gather for a bug report, I will happily do it.

    Thank you.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Mike Kupfer@21:1/5 to Carlos Llamas on Tue Oct 8 00:10:02 2024
    Carlos Llamas wrote:

    When the problem occurs, apache2 processes on a backend VM (NFS client
    machine) wait in state D for a long time (I was only able to trace one
    file, and it took 90 seconds until the file was unlocked).

    It sounds like the process is trying to unlock a file, and the system
    call hangs for 90 seconds. Do I understand you correctly?

    What version of the NFS protocol are you using for the mounts?

    We have the same software versions across all machines and they get
    updated every morning at 6 am.

    Which machines get updated? Just the NFS clients, or the NFS server,
    too? Does the update involve a reboot?

    mike

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Mike Kupfer@21:1/5 to Carlos Llamas on Tue Oct 8 06:00:01 2024
    Carlos Llamas wrote:

    I made a PHP script which locks a specific file in an NFS directory,
    measuring every step involved. Then I exported the results to CSV and
    graphed them in ELK because I am more familiar with it.

    Thanks for the additional details. It sounds like you have a
    reproducible test case, which will help a lot for tracking down the
    problem.
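
    For what it's worth, a minimal sketch of that kind of per-step timing
    script (assuming a flock()-based test against a file on the NFS mount;
    the paths and CSV output here are placeholders, not your actual script)
    might look like:

        <?php
        // Time each step of locking a file on the NFS mount and append
        // the results as CSV lines: timestamp, step, seconds.
        $target = '/mnt/nfs/locktest.dat';     // placeholder path on the share
        $csv    = '/var/tmp/lock-timing.csv';  // local CSV output

        function timed(string $step, callable $fn, string $csv): mixed {
            $t0 = microtime(true);
            $result = $fn();
            $elapsed = microtime(true) - $t0;
            file_put_contents($csv,
                sprintf("%s,%s,%.6f\n", date('c'), $step, $elapsed),
                FILE_APPEND);
            return $result;
        }

        $fh = timed('fopen',  fn() => fopen($target, 'c+'), $csv);
        timed('flock',  fn() => flock($fh, LOCK_EX), $csv);   // exclusive lock
        timed('write',  fn() => fwrite($fh, date('c') . "\n"), $csv);
        timed('unlock', fn() => flock($fh, LOCK_UN), $csv);   // release lock
        timed('fclose', fn() => fclose($fh), $csv);

    Whichever step blows up to ~90 seconds when the problem occurs would
    help narrow things down.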

    The bar chart that you included just shows the total time. Do your
    results show that some operations (e.g., lock/unlock) are particularly
    slow on the web05 (green) system? Or is it more that all the operations
    are slower when compared to the web04 (blue) system?

    I see that Debian includes sosreport, which I *think* reports log
    messages (among other things). I'd try to get log messages from both
    web05 and the NFS server, and look for warning messages that might be
    relevant.

    mike

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)