Hello,
We are facing this problem too, with 3 backend VMs and an NFS VM.
All of them are Google Cloud Compute Engine instances with plenty of CPU
and RAM. We don't see any correlation with high CPU or RAM usage on any
machine when this happens. During an incident, apache2 processes on a
backend VM (an NFS client machine) wait in state D for a long time (I was
only able to trace one file, and it took 90s until the file was unlocked).
In a normal situation this doesn't happen, and requests flow between
backend VMs totally fine.
But when the problem happens, rebooting the client machine doesn't help:
as soon as the load balancer sends it traffic, apache gets stuck again. We
took the machine out of the load balancer to try to isolate it and look at
the logs, but as soon as we added it back, it got stuck again. Only
rebooting all backend VMs and the NFS server VM works.
Because we only face this problem on production machines, we need to
restart them quickly and have no time to gather further information for
troubleshooting, so we don't have much idea of what is going on.
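To capture at least something before the restart, a minimal sketch like the following could be run on an affected NFS client. It assumes Python 3 is installed and a Linux /proc filesystem; the function names are made up for illustration, and /proc/<pid>/stack is normally only readable by root and only present on kernels built with stack tracing, so that field may come back empty.

```python
#!/usr/bin/env python3
"""List processes in uninterruptible sleep (state D) and where they wait."""
import os


def parse_stat(text):
    """Parse the content of /proc/<pid>/stat into (pid, comm, state).

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the LAST ')' in the line.
    """
    head, _, tail = text.rpartition(")")
    pid_str, _, comm = head.partition("(")
    return int(pid_str), comm, tail.split()[0]


def read_file(path):
    """Return the stripped file content, or '' if it cannot be read."""
    try:
        with open(path, "r") as fh:
            return fh.read().strip()
    except OSError:
        # Process exited in the meantime, or we lack permission.
        return ""


def blocked_processes(proc_root="/proc"):
    """Return (pid, comm, wchan, stack) for every process in state D."""
    result = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        stat = read_file(os.path.join(proc_root, entry, "stat"))
        if not stat:
            continue
        pid, comm, state = parse_stat(stat)
        if state == "D":
            wchan = read_file(os.path.join(proc_root, entry, "wchan"))
            stack = read_file(os.path.join(proc_root, entry, "stack"))
            result.append((pid, comm, wchan, stack))
    return result


if __name__ == "__main__":
    for pid, comm, wchan, stack in blocked_processes():
        print(f"{pid} {comm} wchan={wchan or '?'}")
        if stack:
            print(stack)
```

Redirecting the output to a file on local disk (not on the NFS mount) right before rebooting would give a record of which kernel functions the stuck apache2 processes were waiting in.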
We run the same software versions on all machines, and they are updated
every morning at 6 am. We have been facing this problem for some time.
Around the time the original question was posted, we migrated our machines
from a very old Debian version to Bookworm with a fresh installation;
after that we started to see this problem occur.
Kernel: Linux web04 6.1.0-26-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.112-1 (2024-09-30) x86_64 GNU/Linux
nfs-kernel-server: 1:2.6.2-4
apache2: 2.4.62-1~deb12u1
php 8.3 from deb.sury.org: 8.3.12-1+0~20240927.43+debian12~1.gbpad3b8c
So I would like to know if you could help troubleshoot it. If you can give me steps to test the problem, or tell me what information to collect for a bug report, I would happily do it.
Thank you.