Summary:

I recently migrated a Kubernetes cluster from bare metal to a Proxmox VM on the same physical host. Since the migration, the VM experiences intermittent high CPU spikes and network latency, causing significant performance issues.

Workload & Topology:

I have three identical servers (see Configuration below), each running Proxmox with two VMs: a worker node that uses the majority of the resources available on the physical host, and a system node that is relatively light. Only the worker nodes show the problematic behavior. Each worker VM has ten disks passed through to form two ZFS pools, which OpenEBS then uses to provide persistent storage to Kubernetes.

A noteworthy workload on the Kubernetes cluster is an ECK-managed Elasticsearch cluster, which tends to have the highest CPU utilization of anything in the cluster. Each Kubernetes host runs one Elasticsearch master and two data nodes.

Proxmox resources are not overallocated, and CPU consumption generally sits at 10-20% for the VMs, 20-40% overall for the host.

Symptoms:

  • Intermittent High CPU Spikes: When running htop on the Proxmox host, I see the kvm process associated with the troublesome VM spike up to ~6,000% (near capacity, I believe) during the freezes.
  • Network Latency: Continuous local network pings to the worker nodes show that pings stop altogether for 3 to 10 seconds. Following these periods, replies come in quick succession, showing high round-trip times. Round-trip times might look like: 0.5ms, 0.4ms, 0.6ms, 3000ms, 2500ms, 1200ms, 300ms, 0.4ms, 0.5ms, ....
  • Various Kubernetes Issues: inter-pod timeouts, nodes going NotReady arbitrarily, and dozens of Prometheus alerts all point to fundamental issues with the cluster itself.
  • Diagnostic Observations:
    • Both atop and iotop freeze during the ping freeze, suggesting an issue at the CPU level (and making it very difficult to troubleshoot specifics).
    • I wrote a script that loops indefinitely, printing the time to a file and then sleeping for a second. Run with chrt -f 99 ./crit_test.sh, it also stops updating during the ping freezes (a sketch of the script follows this list).
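
For reference, the watchdog script is nothing more than a timestamp loop along these lines (a sketch; the log path and timestamp format are arbitrary choices):

    #!/bin/bash
    # crit_test.sh: append a timestamp once per second.
    # Run as "chrt -f 99 ./crit_test.sh" so it runs as a SCHED_FIFO
    # priority-99 real-time task; gaps in the log then mark periods
    # where even a top-priority process was not being scheduled.
    while true; do
        date '+%F %T.%N' >> /tmp/crit_test.log
        sleep 1
    done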

Additional Context:

  • This issue did not occur when the Kubernetes cluster was running on bare metal.
  • A similar issue occurred with an OPNsense VM on a different Proxmox host when Zenarmor was enabled (which managed its own Elasticsearch internally). The issue resolved when the Zenarmor process was stopped.
  • When I say I "migrated" from bare metal to Proxmox, I mean I cloned the physical disk onto a virtual one and called it a day when it booted successfully. Since I never actually installed the OS onto the virtualized hardware, I've wondered whether a driver or configuration setting needs to be updated.

Configuration:

Host:

  • OS: Proxmox VE 8.2.2 (6.5.13-5-pve)
  • Hardware: Dell R730xd
  • CPUs: 72 x Intel(R) Xeon(R) CPU E5-2696 v3 @ 2.30GHz (2 Sockets)
  • RAM: 316 GB
  • Boot Mode: Legacy BIOS
  • Networking setup: OVS Bridges

K8s Worker VM:

  • OS: Ubuntu 22.04.4 (5.15.0-107-generic) (minimal install)
  • CPUs: 64 (2 sockets, 32 cores) [host] [numa=1]
  • RAM: 245 GiB
  • Kubernetes Version: v1.28.6 (kubeadm deployed with Kubespray)
  • Container Runtime Version: containerd://1.7.13
  • Configuration:
    agent: 1
    boot: order=scsi0
    cores: 32
    cpu: host
    machine: q35
    memory: 250880
    meta: creation-qemu=8.1.2,ctime=1712944757
    name: home-k8s-2
    net0: virtio=BC:...:6E,bridge=vmbr1,firewall=1,queues=63
    net1: virtio=BC:...:E5,bridge=vmbr2,firewall=1,queues=63,tag=4
    numa: 1
    onboot: 1
    ostype: l26
    scsi0: data_zfs:vm-802-disk-0,discard=on,iothread=1,size=256G,ssd=1
    scsi1: /dev/disk/by-id/ata-..._90Y7T077,backup=0,discard=on,iothread=1,size=937692504K,ssd=1
    scsi10: /dev/disk/by-id/scsi-...57c,backup=0,discard=on,iothread=1,size=1172112984K,ssd=1
    scsi11: data_zfs:vm-802-disk-1,backup=0,discard=on,iothread=1,size=100G,ssd=1
    scsi2: /dev/disk/by-id/ata-..._90Y7O01K,backup=0,discard=on,iothread=1,size=937692504K,ssd=1
    scsi3: /dev/disk/by-id/ata-..._90Y7I2MA,backup=0,discard=on,iothread=1,size=937692504K,ssd=1
    scsi4: /dev/disk/by-id/ata-..._90Y7I0C4,backup=0,discard=on,iothread=1,size=937692504K,ssd=1
    scsi5: /dev/disk/by-id/ata-..._0Y7T0JM,backup=0,discard=on,iothread=1,size=937692504K,ssd=1
    scsi6: /dev/disk/by-id/scsi-...b9c,backup=0,discard=on,iothread=1,size=1172112984K
    scsi7: /dev/disk/by-id/scsi-...944,backup=0,discard=on,iothread=1,size=1172112984K
    scsi8: /dev/disk/by-id/scsi-...f60,backup=0,discard=on,iothread=1,size=1172112984K
    scsi9: /dev/disk/by-id/scsi-...5fc,backup=0,discard=on,iothread=1,size=1172112984K
    scsihw: virtio-scsi-single
    smbios1: uuid=cc8c0d42-...-08645523ceac
    sockets: 2
    startup: order=50
    vmgenid: 0e318124-...-18b887322e4

Questions:

  1. Proxmox Configuration: Is there a potential misconfiguration in my Proxmox setup that's causing this issue?
  2. Elasticsearch Impact: Could the Elasticsearch cluster running on Kubernetes be causing these intermittent spikes? Are there known issues with JVM GC or other Elasticsearch components causing such behavior in a VM?
  3. NUMA Configuration: Given the dual-socket, NUMA-supporting hardware, could incorrect NUMA configurations be causing these issues?

Comments:

  • Off topic, but do you notice the overhead of this setup? (Oct 19, 2024 at 17:55)
  • @maxisam Sorry for the very late response! But nope, it's been all golden for me. The advantages of virtualization have far outweighed the disadvantages for my setup. (Jan 16 at 22:27)

1 Answer

Got it! I believe this is the correct solution, but the most I can really say is that it makes sense, and seems to have solved my issue.

From the Proxmox documentation for KVM VMs:

VMs can, depending on their configuration, use additional threads, such as for networking or IO operations but also live migration. Thus a VM can show up to use more CPU time than just its virtual CPUs could use. To ensure that a VM never uses more CPU time than vCPUs assigned, set the cpulimit to the same value as the total core count.

I had thought that setting cores imposed a limit of sorts, but my hypothesis, based on this, is that when my K8s cluster had a lot of IO going on, the VM's extra threads were using CPU time beyond the cores I had set. Without adequate headroom, this caused contention on the Proxmox host and instability in the VM.

I'm guessing this is a bit of an edge case, likely the result of the way Kubernetes distributes work, the amount of IO incurred by ZFS, and perhaps some other factors, but I went through and set cpulimit to equal cores on the affected VMs, and they've all been much more stable since.
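
For anyone looking for the exact knob, the change boils down to something like this from the Proxmox CLI (a sketch using VM ID 802 from the config above; the value 64 is that VM's total vCPU count, 2 sockets x 32 cores, matching the documentation's "total core count" wording, so adjust it to your own VM and your own reading):

    # Cap total CPU time for the whole kvm process (vCPUs plus the extra
    # IO/network worker threads) at the equivalent of 64 cores.
    qm set 802 --cpulimit 64

    # Confirm the option landed in the VM config.
    qm config 802 | grep cpulimit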

It's possible I'm wrong on some of the specifics, but as I say, I made this change, and haven't had any issues in the several months since.
