I have a Zen 3 based server on the WRX80 platform (24c/48t). I'm using it as a KVM/libvirt hypervisor with cores statically pinned to VMs (nohz, isolcpus). I recently noticed issues with thermal tripping. At first I thought it might be related to excessive heat generated by the GPUs during AI workloads, but today the CPU reached 94.2°C during full software video encoding (so only the CPU was under load). 94.8°C is the critical temperature for this CPU, and I believe it's the point at which the machine performs an emergency shutdown, which it has done a few times recently.
When I ran synthetic benchmarks on the bare-metal system (not in VMs) to verify the machine's cooling performance, it reached 92°C and then started to drop clocks and throttle, holding at 91-92°C. So it looks to me as if throttling doesn't work properly with VM-induced load.
Hence my question: how is throttling handled? Is it a purely CPU-controlled feature that the OS has no say in, or is it something that Linux should handle, and which could misbehave due to CPU isolation and some inability of Linux to force throttling on VM-claimed cores?
Also, I'm aware that ideally a CPU shouldn't throttle at all, but please correct me if I'm wrong: I believe modern CPUs shouldn't just THERMTRIP during normal operation as long as cooling is at least somewhat reasonable. In this case it's not an instantaneous temperature spike that is impossible to counter, but a gradual rise in CPU temperature over 15 minutes up to the point where it trips, without noticeable throttling attempts.
The load is not balanced: the VM is pinned to 2 out of 4 CCDs.
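To check whether clocks actually drop as the temperature climbs, I could log both side by side on the host while the VM is under load. Below is a minimal sketch of that, not a definitive tool. It assumes the k10temp hwmon driver is loaded and that a cpufreq driver exposes scaling_cur_freq in sysfs; the function names and the 5-second interval are my own choices for illustration.

```python
import glob
import os
import time


def read_int(path):
    """Read a single integer from a sysfs-style file."""
    with open(path) as f:
        return int(f.read().strip())


def k10temp_readings(hwmon_root="/sys/class/hwmon"):
    """Return {label: degrees C} for every k10temp sensor found (assumed driver)."""
    readings = {}
    for dev in sorted(glob.glob(os.path.join(hwmon_root, "hwmon*"))):
        try:
            with open(os.path.join(dev, "name")) as f:
                if f.read().strip() != "k10temp":
                    continue
        except OSError:
            continue
        for inp in sorted(glob.glob(os.path.join(dev, "temp*_input"))):
            label_path = inp[: -len("_input")] + "_label"
            try:
                with open(label_path) as f:
                    label = f.read().strip()
            except OSError:
                label = os.path.basename(inp)  # fall back to the file name
            try:
                readings[label] = read_int(inp) / 1000.0  # millidegrees -> C
            except (OSError, ValueError):
                pass
    return readings


def core_freqs_mhz(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu number: current frequency in MHz} as reported by cpufreq."""
    freqs = {}
    pattern = os.path.join(cpu_root, "cpu[0-9]*", "cpufreq", "scaling_cur_freq")
    for path in glob.glob(pattern):
        cpu_dir = os.path.basename(os.path.dirname(os.path.dirname(path)))
        try:
            freqs[int(cpu_dir[3:])] = read_int(path) / 1000.0  # kHz -> MHz
        except (OSError, ValueError):
            pass
    return freqs


if __name__ == "__main__":
    # Log temperatures next to min/max core clocks until interrupted.
    while True:
        temps = k10temp_readings()
        freqs = core_freqs_mhz()
        temp_str = " ".join(f"{k}={v:.1f}C" for k, v in temps.items())
        if freqs:
            freq_str = f"min={min(freqs.values()):.0f}MHz max={max(freqs.values()):.0f}MHz"
        else:
            freq_str = "no cpufreq data"
        print(time.strftime("%H:%M:%S"), temp_str, freq_str, flush=True)
        time.sleep(5)
```

If the temperature keeps rising toward the critical value while the reported clocks on the pinned cores stay flat, that would support the suspicion that throttling is not kicking in under VM load.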