
I do not understand the reason for the very high "load average" reported by top.

This is RHEL 7. The problem is repeatable. We have one remote NAS, and when a user-space process starts writing very large files to it (e.g. 15 GB), then very often (but not always) we get this:

top - 19:04:38 up 43 days, 11:39,  3 users,  load average: 54,92, 53,82, 47,17
Tasks: 302 total,   1 running, 301 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0,3 us,  0,3 sy,  0,0 ni, 99,3 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu1  :  0,3 us,  0,3 sy,  0,0 ni, 99,3 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
KiB Mem :  1881412 total,   145084 free,   762644 used,   973684 buff/cache
KiB Swap:  4194300 total,  4127484 free,    66816 used,   840376 avail Mem

The load average is over 50, yet both processors are idle and top reports no other activity (even I/O wait is 0). This is not just a short spike: it can last for 10 minutes or more. We added "ionice -n 7" in front of the copy command and the problem seems to appear a little less often, but it still persists.

The question is really this: what else can we observe that would help us explain and solve the problem? What could the cause be?

EDIT: In fact, the high load is on server A, but the problematic copy to the NAS is being executed on server B. Both servers have the same NAS mounted.

EDIT 2: On server B (where the copy is performed), the NAS is mounted as:

//10.105.10.123/abc on /abc type cifs (rw,relatime,vers=3.0,cache=strict,username=nas,domain=myorg.com,uid=1346600026,forceuid,gid=1346600027,forcegid,addr=10.105.10.123,file_mode=0775,dir_mode=0775,soft,persistenthandles,nounix,mapposix,rsize=1048576,wsize=1048576,echo_interval=60,actimeo=1)

On server A (where the high load is observed when it tries to use the NAS):

//10.105.10.123/abc/ABC on /ABC type cifs (rw,relatime,vers=3.0,cache=strict,username=nas,domain=myorg.com,uid=1346600026,forceuid,gid=1346600027,forcegid,addr=10.105.10.123,file_mode=0775,dir_mode=0775,soft,persistenthandles,nounix,mapposix,rsize=1048576,wsize=1048576,echo_interval=60,actimeo=1)
meolic
  • Try `htop`? `top` tends to exclude some system processes. How is the NAS mounted? For example, I found a bug with NFSv4 that does exactly this in certain situations: https://access.redhat.com/solutions/2142081. But generally, if the NAS is getting maxed out and delays responding to requests, then your other clients are going to have a bad time. – Cpt.Whale Mar 14 '23 at 19:59
  • Yeah, for a CIFS mount, definitely try checking for high iowait (and some other troubleshooting here): https://tanelpoder.com/posts/high-system-load-low-cpu-utilization-on-linux/. If that's the cause, then there's not much you can do on server A about it other than limiting how often it checks the NAS for stuff (loose caching can help). You may need to further limit the copy speed on server B, or increase the NAS performance. – Cpt.Whale Mar 15 '23 at 15:32
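
One thing that can be observed on server A while the load is high: on Linux the load average counts tasks in uninterruptible sleep (state D) in addition to runnable ones, so a load of 50 with idle CPUs usually means many tasks are blocked, often on I/O such as a stalled network mount. The following is only a rough sketch, assuming Python 3 and a Linux /proc filesystem; the function names `parse_stat` and `blocked_tasks` are made up for this illustration. It lists every thread currently in state D together with the kernel function it is waiting in, if the kernel exposes it.

```python
import os

def parse_stat(text):
    """Return (pid, comm, state) from the contents of a /proc/.../stat file."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST ')' and the FIRST '('.
    head, _, tail = text.rpartition(")")
    pid, _, comm = head.partition("(")
    return int(pid), comm, tail.split()[0]

def blocked_tasks(proc="/proc"):
    """List (tid, comm, wchan) for every thread in uninterruptible sleep (D)."""
    found = []
    for pid in os.listdir(proc):
        if not pid.isdigit():
            continue
        task_dir = os.path.join(proc, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited while we were scanning
        for tid in tids:
            base = os.path.join(task_dir, tid)
            try:
                with open(os.path.join(base, "stat")) as f:
                    _, comm, state = parse_stat(f.read())
                if state != "D":
                    continue
                # wchan may be "0" or unreadable without sufficient privileges
                with open(os.path.join(base, "wchan")) as f:
                    wchan = f.read().strip()
            except OSError:
                continue  # thread vanished or file not readable
            found.append((int(tid), comm, wchan))
    return found

if __name__ == "__main__":
    for tid, comm, wchan in blocked_tasks():
        print(f"{tid:>8}  {comm:<20}  {wchan}")
```

If most of the listed threads belong to processes touching /ABC, or are waiting in CIFS-related kernel functions, that would point at the mount rather than at CPU or local disk. Run it as root, several times while the load is high, since D-state tasks come and go.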

0 Answers