
I'm running a hundred Ubuntu 16.04 LTS servers with basically identical hardware distributed worldwide. (I'm working on upgrading them to 20.04 LTS but certain unfortunate design decisions on the part of Ubuntu are still blocking this.) Each of these servers is running a KVM VM with Windows 10 Enterprise. Three of them show the following problem:

Without any apparent cause, monitoring shows the server's Linux load average jumping to above 2. `top` shows the CPU load of the `qemu-system-x86` process running the Windows VM solidly at 200%, matching the 2 cores assigned to the VM. The Windows desktop, accessed through VNC, is extremely sluggish. Windows Task Manager shows a process called "System interrupts" consuming 100% CPU.

Rebooting the Windows VM does not fix the situation. It persists for several hours or even days and then goes back to normal on its own, again without any apparent cause or reason.

Researching reasons for high CPU usage by "System interrupts" in Windows turns up a general consensus that this is a hardware issue. The hardware running Windows in this case is virtual, namely the KVM hypervisor. The physical hardware of the hosts did not change before or after the high load episodes, nor does it differ significantly between the servers that show these episodes and those that don't. The Linux host system does not show any signs of malfunction except the excessive load from the Windows guest. Inspection of the Linux logs on the affected systems has turned up nothing unusual. The Windows event logs show the obvious heaps of secondary errors during the high load episodes, such as services not responding, but nothing indicating a possible cause.

Where would I begin to look for possible causes of that behaviour?

For the sake of completeness, this is my KVM invocation:

kvm \
        -daemonize \
        -name "$vmname64-$(hostname)" \
        -drive file="/srv/kvm/${vmname64}.qcow2",if=virtio \
        -net nic,model=virtio,macaddr=$macaddr64 -net tap \
        -vga std \
        -rtc base=localtime \
        -usb -usbdevice tablet \
        -nodefaults \
        -runas srvadmin \
        -chroot /home/srvadmin \
        -k de \
        -smp 2 \
        -m 4096 \
        -vnc :1,password \
        -monitor mon:telnet:127.0.0.1:4445,server,nowait
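
As an aside, the monitor configured on the last line can be reached with telnet to check the VM's state during an episode; a minimal session could look like the following (standard QEMU monitor commands, shown purely for illustration):

telnet 127.0.0.1 4445
(qemu) info status
(qemu) info cpus

`info status` reports whether the VM is running or paused, and `info cpus` lists the vCPU threads; both are read-only queries.
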
  • "unfortunate design decisions on the part of Ubuntu" is a bit harsh, I don't remember anyone saying "oh yeah, let us break Windows" :-) To make a random assumption: maybe the new virtualization stack supports features that are now enabled by default, and those are having issues on your hardware. Formerly they might simply not have been used, so the effect was not visible. – Christian Ehrhardt Apr 15 '21 at 07:40
  • If you see "System interrupts" in the guest, those must come from somewhere - as a first stab at the problem I'd recommend using `perf kvm` to trace which exits and interrupts those are (a minimal session is sketched after the comments). – Christian Ehrhardt Apr 15 '21 at 07:41
  • In regard to Windows guests I've often seen people experiment with enlightenments https://github.com/qemu/qemu/blob/master/docs/hyperv.txt so that might be worth a try (depends a bit on what you've seen with `perf kvm`). – Christian Ehrhardt Apr 15 '21 at 07:41
  • The unfortunate (by which I really mean: unfortunate for us) design decisions are unrelated to this problem. They just prevent my migration to 20.04 LTS because they broke our automated installation process (which is absolutely essential in our scenario), and so far we haven't found a way to make it work again. I just mentioned that point in order to ward off the standard recommendation: "Upgrade to a current version before trying anything else." – Tilman Apr 15 '21 at 13:48
  • I found the `perf` command in the `linux-tools-generic` package but so far couldn't get any useful output from it. `perf kvm top` just displays a black screen with a blue title line showing an ever-increasing "Event count". Is there a tutorial for `perf kvm` somewhere? – Tilman Apr 16 '21 at 09:19
  • Hiho, while slightly aged, even https://lwn.net/Articles/510923/ still mostly applies. Also the quick finds https://www.ibm.com/docs/en/linux-on-systems?topic=troubleshooting-performance-metrics https://www.linux-kvm.org/page/Perf_events and even the man page http://manpages.ubuntu.com/manpages/focal/man1/perf-kvm.1.html look good to me to get things started. – Christian Ehrhardt Apr 19 '21 at 05:35
  • Looking around a bit for similar issues I found https://askubuntu.com/questions/1033985/kvm-high-host-cpu-load-after-upgrading-vm-to-windows-10-1803/1047397 and https://bugzilla.redhat.com/show_bug.cgi?id=1610461 The fix for the latter should be in 20.04 already, but many people seem to have needed to tweak the timer config for Windows guests. So the hv_* flags I already hinted at, together with the timer config in general, might really be worth a look (see the second sketch after the comments). – Christian Ehrhardt Apr 19 '21 at 05:36
  • I'm afraid I do not know nearly enough about the internal architecture of KVM to be able to use `perf kvm`. I do not even know what a vmexit event is. – Tilman Apr 20 '21 at 07:09
  • OK, skipping perf kvm is fine then; did the links maybe help you spot workarounds that apply to your case as well? There are a few examples in those links that set up timers in a way that is better for some Windows guest versions. – Christian Ehrhardt Apr 21 '21 at 13:33
  • I didn't find a case resembling mine (100% CPU load specifically from System interrupts) but I'll test the timer / hv tweaks as soon as the problem reoccurs. (Un)fortunately it hasn't in the past week. – Tilman Apr 23 '21 at 18:48
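
For reference, a minimal `perf kvm` session along the lines suggested in the comments could look like this. It is only a sketch: `<qemu_pid>` stands for the PID of the affected `qemu-system-x86` process, and the exact options are not from the discussion above.

# record guest entry/exit events for the affected VM; stop with Ctrl-C after ~30 s
sudo perf kvm stat record -p <qemu_pid>
# summarize the recorded exits by reason (EXTERNAL_INTERRUPT, HLT, MSR_WRITE, ...)
sudo perf kvm stat report
# or watch a continuously updating summary instead
sudo perf kvm stat live -p <qemu_pid>

If one exit reason dominates during a high-load episode, that narrows down what the guest's "System interrupts" activity corresponds to on the host side.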
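
For the enlightenment/timer angle, one possible variant of the invocation from the question with Hyper-V enlightenments and timer tweaks enabled is sketched below. The flag names come from QEMU's docs/hyperv.txt, but the particular combination (`-cpu host` with `hv_relaxed`, `hv_vapic`, `hv_spinlocks`, `hv_time`, plus `driftfix=slew` and `-no-hpet`) is an untested assumption, not a confirmed fix.

# untested sketch: original invocation plus Hyper-V enlightenments and timer tweaks
kvm \
        -daemonize \
        -name "$vmname64-$(hostname)" \
        -drive file="/srv/kvm/${vmname64}.qcow2",if=virtio \
        -net nic,model=virtio,macaddr=$macaddr64 -net tap \
        -vga std \
        -cpu host,hv_relaxed,hv_vapic,hv_spinlocks=0x1fff,hv_time \
        -rtc base=localtime,driftfix=slew \
        -no-hpet \
        -usb -usbdevice tablet \
        -nodefaults \
        -runas srvadmin \
        -chroot /home/srvadmin \
        -k de \
        -smp 2 \
        -m 4096 \
        -vnc :1,password \
        -monitor mon:telnet:127.0.0.1:4445,server,nowait

`hv_time` exposes the Hyper-V reference time source to the guest, which reduces expensive clock-related exits, while `driftfix=slew` and `-no-hpet` are common timer-side tweaks for Windows guests.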

0 Answers