
We have a cluster of RHEL 7.9 machines that we use as Kafka client producers.

Each machine has the following spec (Dell physical machines):

48 CORES
128G memory

On most machines we see a very low %idle (from the sar command), with values around 2-6.

Sometimes we also see the machines hang for a few seconds. As for the CPU load average, the values are between 40 and 60, which seems to be OK.

So the one point we are worried about is how to know whether an idle of 2-6 is still normal or something we can't allow.

Can we set a threshold that raises an alert when idle is low? And if so, how do we choose the threshold value?

For example, can we define a threshold of 10% or 20%, or some other value?

vmstat 1 3
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
60  0      0 65249020   3364 1082656    0    0     2     1  115   43 86  3 11  0  0
46  0      0 65240956   3364 1082656    0    0     0     0 167113 10096 92  3  5  0  0
53  0      0 65248888   3364 1082656    0    0     0     0 208360 9795 92  4  4  0  0



sar 5 5

09:46:10 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
09:46:15 AM     all     91.93      0.00      4.03      0.00      0.00      4.04
09:46:20 AM     all     91.90      0.00      3.48      0.00      0.00      4.62
09:46:25 AM     all     91.76      0.00      3.21      0.00      0.00      5.04
09:46:30 AM     all     91.69      0.00      2.84      0.00      0.00      5.47
09:46:35 AM     all     92.17      0.00      4.50      0.00      0.00      3.34
Average:        all     91.89      0.00      3.61      0.00      0.00      4.50


top -bn2 | grep '%Cpu' | tail -1 | grep -P '(....|...) id,'|awk '{print "CPU Usage: " 100-$8 "%"}'
CPU Usage: 96.2%
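
For scripting, a similar figure can be derived without parsing top output, by differencing two aggregate `cpu` lines read from /proc/stat a short interval apart. The sketch below is one possible approach and was not part of the original measurements; the function names are hypothetical, and the two sample lines are made-up values chosen only to illustrate the arithmetic.

```python
def parse_cpu_line(line):
    """Return (idle_ticks, total_ticks) from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Field order: user nice system idle iowait irq softirq steal.
    # Guest time is already counted inside user/nice, so it is not added again.
    values = [int(v) for v in fields[1:9]]
    return values[3], sum(values)


def cpu_usage_percent(before, after):
    """Busy percentage between two samples, i.e. 100 minus %idle."""
    idle0, total0 = parse_cpu_line(before)
    idle1, total1 = parse_cpu_line(after)
    total_delta = total1 - total0
    if total_delta <= 0:
        raise ValueError("samples must be taken at different times")
    return 100.0 * (1.0 - (idle1 - idle0) / total_delta)


def read_proc_stat_cpu_line(path="/proc/stat"):
    """Read the aggregate line on a live Linux system (not called in this sketch)."""
    with open(path) as handle:
        return handle.readline()


# Made-up samples: 1000 ticks elapsed, 40 of them idle.
BEFORE = "cpu  1000 0 100 900 0 0 0 0 0 0"
AFTER = "cpu  1920 0 140 940 0 0 0 0 0 0"
print(f"CPU Usage: {cpu_usage_percent(BEFORE, AFTER):.1f}%")
```

On a live machine, the two samples would come from `read_proc_stat_cpu_line()` with a sleep in between.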
    

sar -P ALL 1 1

Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:        all     91.94      0.00      4.75      0.00      0.00      3.31
Average:          0     12.24      0.00     51.02      0.00      0.00     36.73
Average:          1     17.35      0.00     41.84      0.00      0.00     40.82
Average:          2    100.00      0.00      0.00      0.00      0.00      0.00
Average:          3     98.02      0.00      1.98      0.00      0.00      0.00
Average:          4     99.00      0.00      1.00      0.00      0.00      0.00
Average:          5     98.00      0.00      2.00      0.00      0.00      0.00
Average:          6     98.02      0.00      1.98      0.00      0.00      0.00
Average:          7     98.00      0.00      2.00      0.00      0.00      0.00
Average:          8     98.00      0.00      2.00      0.00      0.00      0.00
Average:          9     99.00      0.00      1.00      0.00      0.00      0.00
Average:         10     99.00      0.00      1.00      0.00      0.00      0.00
Average:         11     98.02      0.00      1.98      0.00      0.00      0.00
Average:         12     98.00      0.00      2.00      0.00      0.00      0.00
Average:         13     98.00      0.00      2.00      0.00      0.00      0.00
Average:         14     98.99      0.00      1.01      0.00      0.00      0.00
Average:         15     99.00      0.00      1.00      0.00      0.00      0.00
Average:         16     98.99      0.00      1.01      0.00      0.00      0.00
Average:         17     99.00      0.00      1.00      0.00      0.00      0.00
Average:         18     98.00      0.00      2.00      0.00      0.00      0.00
Average:         19     99.00      0.00      1.00      0.00      0.00      0.00
Average:         20     99.00      0.00      1.00      0.00      0.00      0.00
Average:         21     97.06      0.00      2.94      0.00      0.00      0.00
Average:         22     98.00      0.00      2.00      0.00      0.00      0.00
Average:         23     98.02      0.00      1.98      0.00      0.00      0.00
Average:         24     20.20      0.00     41.41      0.00      0.00     38.38
Average:         25     31.31      0.00     23.23      0.00      0.00     45.45
Average:         26     99.01      0.00      0.99      0.00      0.00      0.00
Average:         27     98.02      0.00      1.98      0.00      0.00      0.00
Average:         28     98.02      0.00      1.98      0.00      0.00      0.00
Average:         29     98.02      0.00      1.98      0.00      0.00      0.00
Average:         30     98.99      0.00      1.01      0.00      0.00      0.00
Average:         31     98.02      0.00      1.98      0.00      0.00      0.00
Average:         32     98.99      0.00      1.01      0.00      0.00      0.00
Average:         33     98.02      0.00      1.98      0.00      0.00      0.00
Average:         34     98.02      0.00      1.98      0.00      0.00      0.00
Average:         35     99.00      0.00      1.00      0.00      0.00      0.00
Average:         36     97.06      0.00      2.94      0.00      0.00      0.00
Average:         37     98.00      0.00      2.00      0.00      0.00      0.00
Average:         38     97.06      0.00      2.94      0.00      0.00      0.00
Average:         39     98.99      0.00      1.01      0.00      0.00      0.00
Average:         40     98.00      0.00      2.00      0.00      0.00      0.00
Average:         41     98.00      0.00      2.00      0.00      0.00      0.00
Average:         42     98.99      0.00      1.01      0.00      0.00      0.00
Average:         43     98.00      0.00      2.00      0.00      0.00      0.00
Average:         44     98.00      0.00      2.00      0.00      0.00      0.00
Average:         45     98.99      0.00      1.01      0.00      0.00      0.00
Average:         46     98.00      0.00      2.00      0.00      0.00      0.00
Average:         47     98.00      0.00      2.00      0.00      0.00      0.00



 uptime
 09:53:23 up  2:07,  4 users,  load average: 49.94, 49.17, 49.17


 free -g
              total        used        free      shared  buff/cache   available
Mem:            125          61          62           0           1          62
Swap:            15           0          15



iostat
Linux 4.18.0-305.el8.x86_64 (dragon12)      06/29/2023      _x86_64_        (48 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          86.30    0.00    3.27    0.00    0.00   10.42

Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               2.59        91.45        60.94     754657     502839
dm-0              1.95        83.37        54.04     687962     445952
dm-1              0.01         0.27         0.00       2220          0
dm-2              0.57         2.76         6.65      22810      54839



 sar -B 2 5
Linux 4.18.0-305.el8.x86_64 (dragon12)      06/29/2023      _x86_64_        (48 CPU)

10:05:30 AM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
10:05:32 AM      0.00      1.50  23815.00      0.00  43641.00      0.00      0.00      0.00      0.00
10:05:34 AM      0.00      0.00  27231.50      0.00  45495.00      0.00      0.00      0.00      0.00
10:05:36 AM      0.00      0.00  28570.50      0.00  47603.50      0.00      0.00      0.00      0.00
10:05:38 AM      0.00      0.00  27766.50      0.00  48434.50      0.00      0.00      0.00      0.00
10:05:40 AM      0.00     14.00  28007.00      0.00  48733.50      0.00      0.00      0.00      0.00
Average:         0.00      3.10  27078.10      0.00  46781.50      0.00      0.00      0
King David
  • You have a lot of processes in the run queue (first column). You also have a lot of interrupts; 160k looks like a lot to me (column 11). – Romeo Ninov Jun 29 '23 at 10:04
  • Yes, but how do I define the idle threshold? Let's say I want to create a smoke test that raises an alert when idle is low; how should I set it? 10%? 5%? – King David Jun 29 '23 at 10:13
  • This depends on the software you use. The standard consideration is to set it to 30% idle. If the machine has higher CPU usage than that, you should consider scaling up, or implementing a cluster (scaling out). – Romeo Ninov Jun 29 '23 at 10:20
  • No. With user/kernel usage on this scale, this is the average for all CPUs, so it is not just a few CPUs that are loaded but all of them. – Romeo Ninov Jun 29 '23 at 10:24
  • Also, from sar -P ALL 1 1: how can it be that all cores show 0 idle except for 4 that differ? – King David Jun 29 '23 at 10:24
  • Do not forget that some CPUs execute OS processes/kernel work. – Romeo Ninov Jun 29 '23 at 10:25
  • So per your summary, do you recommend increasing the cores from 48 to 96, for example (because we can't reduce the number of PIDs that are running)? – King David Jun 29 '23 at 10:31

1 Answer


Your problem is that this machine is on the edge of overloading. From the sar and vmstat output you can see that the run queue usually holds more runnable processes than there are CPUs. Also, almost every CPU is loaded above 80%, which is a clear indication that you need to scale.
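
To make the run-queue comparison concrete, here is a minimal sketch that parses the vmstat sample from the question and counts the samples with more runnable processes than CPUs. The function names are hypothetical, and the CPU count of 48 is hard-coded from the question; on a live system `os.cpu_count()` would be the more likely source.

```python
# vmstat output copied from the question.
VMSTAT_SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
60  0      0 65249020   3364 1082656    0    0     2     1  115   43 86  3 11  0  0
46  0      0 65240956   3364 1082656    0    0     0     0 167113 10096 92  3  5  0  0
53  0      0 65248888   3364 1082656    0    0     0     0 208360 9795 92  4  4  0  0
"""


def run_queue_lengths(vmstat_text):
    """Return the 'r' column (runnable processes) of every data row."""
    lengths = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        # Data rows start with a number; the two header rows do not.
        if fields and fields[0].isdigit():
            lengths.append(int(fields[0]))
    return lengths


def overloaded_samples(vmstat_text, cpu_count):
    """Return the run-queue lengths that exceed the number of CPUs."""
    return [r for r in run_queue_lengths(vmstat_text) if r > cpu_count]


queue = run_queue_lengths(VMSTAT_SAMPLE)
over = overloaded_samples(VMSTAT_SAMPLE, cpu_count=48)
print(f"run queue: {queue}, samples above 48 CPUs: {len(over)} of {len(queue)}")
```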

You have two ways to scale:

  • scale up - add more CPU (and memory) to this machine
  • scale out - add another machine and create a Kafka cluster to spread the load across it.

My personal recommendation would be to add a machine, which makes future expansion smoother.
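
On the threshold question from the comments, one possible shape for such a check is to alert only when %idle stays below the limit for several consecutive samples, so that a single short spike does not fire. This is a minimal sketch, assuming sar-style rows like those in the question. The 30% limit comes from the comment thread; the function names and the choice of three consecutive samples are assumptions to tune per workload.

```python
# sar output copied from the question (header, interval rows, Average row).
SAR_SAMPLE = """\
09:46:10 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
09:46:15 AM     all     91.93      0.00      4.03      0.00      0.00      4.04
09:46:20 AM     all     91.90      0.00      3.48      0.00      0.00      4.62
09:46:25 AM     all     91.76      0.00      3.21      0.00      0.00      5.04
09:46:30 AM     all     91.69      0.00      2.84      0.00      0.00      5.47
09:46:35 AM     all     92.17      0.00      4.50      0.00      0.00      3.34
Average:        all     91.89      0.00      3.61      0.00      0.00      4.50
"""


def idle_values(sar_text):
    """Return %idle (last column) of each per-interval 'all' row."""
    values = []
    for line in sar_text.splitlines():
        fields = line.split()
        # Skip the header (no 'all' field) and the trailing Average row.
        if "all" in fields and not line.startswith("Average"):
            values.append(float(fields[-1]))
    return values


def should_alert(idles, threshold=30.0, consecutive=3):
    """True if %idle is below threshold for `consecutive` samples in a row."""
    streak = 0
    for idle in idles:
        streak = streak + 1 if idle < threshold else 0
        if streak >= consecutive:
            return True
    return False


idles = idle_values(SAR_SAMPLE)
print(f"idle samples: {idles}, alert: {should_alert(idles)}")
```

With the sample from the question every interval is far below 30% idle, so this check would fire; a lower limit such as 10% would still fire here, which is why the limit alone matters less than whether the low-idle state is sustained.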

Romeo Ninov
  • When you say "future extends", do you mean adding CPU and more memory? Why memory? If you look at the free -g output, you can see that we do not have a memory problem. – King David Jun 29 '23 at 11:03
  • @KingDavid, IMHO you have some memory problem. I see that cache and buffers have very low values... – Romeo Ninov Jun 29 '23 at 11:24
  • But available is 62G out of a total of 125G, so I do not understand, even though cache is very low. – King David Jun 29 '23 at 11:40
  • This is exactly what I mean. Usually the OS wants to use memory for software, and as much as possible of the rest for cache and buffers (to speed up disk access). Having 64GB free and 3MB of buffers is very odd. – Romeo Ninov Jun 29 '23 at 11:51
  • From your understanding of my case so far, is your recommendation to increase the CPUs to 96? – King David Jun 29 '23 at 12:02
  • @KingDavid, please check my answer, scale out option. :) – Romeo Ninov Jun 29 '23 at 12:10
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/146940/discussion-between-king-david-and-romeo-ninov). – King David Jun 29 '23 at 12:18
  • BTW, we are not talking about Kafka itself; I am talking about a Kafka client that is used as a producer, so no data is actually on disk (besides the OS). Maybe this is the reason the machine does not use cache memory? – King David Jun 29 '23 at 12:36