Description
Elasticsearch Version
master
Installed Plugins
No response
Java Version
bundled
OS Version
5.15.0-1036-azure
Problem Description
When Elasticsearch runs inside a cgroup v2, the node stats value returned by "https://elasticsearch:9200/_nodes/stats?filter_path=nodes.*.os.cgroup.cpuacct.usage_nanos" is actually in microseconds, whereas cgroup v1 correctly reports it in nanoseconds:
cgroupCpuAcctUsageNanos = cpuStatsMap.get("usage_usec");
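For illustration, here is a minimal standalone sketch of what the unit handling could look like when reading the cgroup v2 cpu.stat file. The class and parsing code are hypothetical (this is not the actual OsProbe implementation); the point is that the usage_usec value needs to be multiplied by 1,000 before it is exposed as usage_nanos:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the actual OsProbe code.
public class CgroupV2CpuStat {

    /** Parses a cgroup v2 cpu.stat file into a key -> value map. */
    static Map<String, Long> parseCpuStat(Path cpuStatFile) throws IOException {
        Map<String, Long> cpuStatsMap = new HashMap<>();
        for (String line : Files.readAllLines(cpuStatFile)) {
            String[] parts = line.split("\\s+");
            if (parts.length == 2) {
                cpuStatsMap.put(parts[0], Long.parseLong(parts[1]));
            }
        }
        return cpuStatsMap;
    }

    public static void main(String[] args) throws IOException {
        Map<String, Long> cpuStatsMap = parseCpuStat(Path.of("/sys/fs/cgroup/cpu.stat"));

        // cgroup v2 reports usage_usec in MICROseconds ...
        long usageUsec = cpuStatsMap.get("usage_usec");

        // ... so exposing it as "usage_nanos" requires a conversion,
        // otherwise the reported value is 1000x too small.
        long cgroupCpuAcctUsageNanos = usageUsec * 1000L;

        System.out.println("usage_usec  = " + usageUsec);
        System.out.println("usage_nanos = " + cgroupCpuAcctUsageNanos);
    }
}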
We collect these stats in Rally's node-stats telemetry device, and it became clear that the formula we use to derive CPU usage from the available time is off by a factor of 1000 (i.e. the difference between nanoseconds and microseconds) for any container running inside a cgroup v2.
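To make the factor of 1000 concrete, here is an illustrative sketch of the kind of derivation a node-stats consumer does (the exact formula Rally uses may differ): CPU utilization is the delta in usage_nanos divided by the elapsed wall-clock time, so a microsecond value mislabelled as nanoseconds shrinks the result by 1000x.

// Illustrative only; the exact formula Rally uses may differ.
public class CpuUsageSkew {
    public static void main(String[] args) {
        // Two node-stats samples taken 10 seconds apart.
        long elapsedNanos = 10_000_000_000L;          // 10 s wall-clock between samples

        // Suppose the container really burned 5 s of CPU in that window.
        long trueDeltaNanos = 5_000_000_000L;         // what cgroup v1 reports
        long mislabeledDelta = trueDeltaNanos / 1000; // cgroup v2: usage_usec exposed as "usage_nanos"

        double correctPct = 100.0 * trueDeltaNanos / elapsedNanos;
        double skewedPct = 100.0 * mislabeledDelta / elapsedNanos;

        System.out.printf("correct CPU usage: %.2f%%%n", correctPct); // 50.00%
        System.out.printf("skewed CPU usage:  %.3f%%%n", skewedPct);  // 0.050%
    }
}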
The outputs below show the difference between cgroup v1 running on Google Kubernetes Engine (GKE) and cgroup v2 running on Azure Kubernetes Service (AKS):


GKE output (cgroup v1):

$ uname -a
Linux es-es-search-7b66d98c5b-fs28n 5.15.89+ #1 SMP Sat Mar 18 09:27:02 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

$ mount -l | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (ro,nosuid,nodev,noexec,relatime,xattr,name=systemd)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (ro,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/pids type cgroup (ro,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/blkio type cgroup (ro,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/freezer type cgroup (ro,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/cpuset type cgroup (ro,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (ro,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/perf_event type cgroup (ro,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/rdma type cgroup (ro,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (ro,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/devices type cgroup (ro,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/memory type cgroup (ro,nosuid,nodev,noexec,relatime,memory)

# nanoseconds
$ cat /sys/fs/cgroup/cpu,cpuacct/cpuacct.usage
63277158346752

AKS output (cgroup v2):

$ uname -a
Linux es-es-index-6f49648d8-jhm9s 5.15.0-1036-azure #43-Ubuntu SMP Wed Mar 29 16:11:05 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

$ mount -l | grep cgroup
cgroup on /sys/fs/cgroup type cgroup2 (ro,nosuid,nodev,noexec,relatime)

# microseconds
$ cat /sys/fs/cgroup/cpu.stat
usage_usec 104036485036
user_usec 98419994704
system_usec 5616490332
nr_periods 164357
nr_throttled 143842
throttled_usec 9516539086
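As a side note, here is a hedged sketch of how one could probe either cgroup version from inside a container and normalize the raw counter to nanoseconds. This is not how Elasticsearch itself detects the cgroup version; it simply checks for the unified hierarchy's cgroup.controllers file:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Convenience probe only; assumes the standard cgroup mount at /sys/fs/cgroup.
public class CgroupUsageProbe {
    public static void main(String[] args) throws IOException {
        Path root = Path.of("/sys/fs/cgroup");

        if (Files.exists(root.resolve("cgroup.controllers"))) {
            // cgroup v2 (unified hierarchy): cpu.stat reports usage_usec in microseconds.
            for (String line : Files.readAllLines(root.resolve("cpu.stat"))) {
                if (line.startsWith("usage_usec")) {
                    long usec = Long.parseLong(line.split("\\s+")[1]);
                    System.out.println("cgroup v2, usage = " + usec * 1000L + " ns");
                }
            }
        } else {
            // cgroup v1: cpuacct.usage is already in nanoseconds.
            // The exact mount point may differ (e.g. cpuacct/cpuacct.usage on some setups).
            Path usageFile = root.resolve("cpu,cpuacct/cpuacct.usage");
            long nanos = Long.parseLong(Files.readString(usageFile).trim());
            System.out.println("cgroup v1, usage = " + nanos + " ns");
        }
    }
}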
Steps to Reproduce

I encountered the bug when running a cluster inside an Azure Kubernetes Service (AKS) cluster, but that is not exactly practical for reproductions.
We can repro this using ECK and Minikube with the Docker driver on macOS.
Note that for Linux users Minikube automatically detects whether cgroup v1 or v2 is in use on your workstation (i.e. where you invoke minikube start from), whereas for Docker Desktop users on macOS (which actually creates a Linux VM in the background) we need to adjust the cgroup version by modifying the engine's settings (more on this below).
Testing with cgroup v2:
# minikube > 1.23 defaults to cgroup v2
$ minikube version
minikube version: v1.26.1
commit: 62e108c3dfdec8029a890ad6d8ef96b6461426dc

# start minikube
$ minikube start

# create eck operator
$ minikube kubectl -- create -f https://download.elastic.co/downloads/eck/2.7.0/crds.yaml
$ minikube kubectl -- apply -f https://download.elastic.co/downloads/eck/2.7.0/operator.yaml
$ minikube kubectl -- -n elastic-system logs -f statefulset.apps/elastic-operator

# deploy elasticsearch
$ cat <<EOF | minikube kubectl -- apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 8.7.1
  nodeSets:
  - name: default
    count: 1
    config:
      node.store.allow_mmap: false
EOF

# check that es is using cgroup v2
$ minikube kubectl -- -n elastic-system exec -it local-es-default-0 -- /bin/sh

# inside es pod/container
sh-5.0$ mount -l | grep cgroup
cgroup on /sys/fs/cgroup type cgroup2 (ro,nosuid,nodev,noexec,relatime)

# get pass
$ PASSWORD=$(minikube kubectl -- get secret quickstart-es-elastic-user -o go-template='{{.data.elastic | base64decode}}')

# make service avail
$ minikube kubectl -- port-forward service/quickstart-es-http 9200

# check output
$ curl -s -u "elastic:$PASSWORD" -k "https://localhost:9200/_nodes/stats?filter_path=nodes.*.os.cgroup.cpuacct.usage_nanos" | jq .
{
  "nodes": {
    "WA7ANuASRiGF7xgO3dYp_w": {
      "os": {
        "cgroup": {
          "cpuacct": {
            "usage_nanos": 107115513
          }
        }
      }
    }
  }
}
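Reading that result: 107115513 taken at face value as nanoseconds would be only about 0.1 s of accumulated CPU time, which is implausibly low for a JVM that has just bootstrapped Elasticsearch; read as microseconds it is roughly 107 s, i.e. the raw usage_usec value that the cgroup v2 code path copies through unchanged.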
Testing with cgroup v1:

I'm on macOS Monterey 12.6 using the docker driver for minikube, which actually runs a Linux VM behind the scenes. In order to force it to use cgroup v1 I had to configure "deprecatedCgroupv1": true in $HOME/Library/Group\ Containers/group.com.docker/settings.json and then restart Docker Desktop before following these steps:

$ minikube kubectl -- -n elastic-system exec -it local-es-default-0 -- /bin/sh

# inside es pod/container
sh-5.0$ mount -l | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755)
cgroup on /sys/fs/cgroup/cpuset type cgroup (ro,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu type cgroup (ro,nosuid,nodev,noexec,relatime,cpu)
cgroup on /sys/fs/cgroup/cpuacct type cgroup (ro,nosuid,nodev,noexec,relatime,cpuacct)
cgroup on /sys/fs/cgroup/blkio type cgroup (ro,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/memory type cgroup (ro,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (ro,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (ro,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls type cgroup (ro,nosuid,nodev,noexec,relatime,net_cls)
cgroup on /sys/fs/cgroup/perf_event type cgroup (ro,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/net_prio type cgroup (ro,nosuid,nodev,noexec,relatime,net_prio)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (ro,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/pids type cgroup (ro,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/rdma type cgroup (ro,nosuid,nodev,noexec,relatime,rdma)
systemd on /sys/fs/cgroup/systemd type cgroup (ro,nosuid,nodev,noexec,relatime,name=systemd)

# get pass
$ PASSWORD=$(minikube kubectl -- get secret quickstart-es-elastic-user -o go-template='{{.data.elastic | base64decode}}')

# make service avail
$ minikube kubectl -- port-forward service/quickstart-es-http 9200

# check output
$ curl -s -u "elastic:$PASSWORD" -k "https://localhost:9200/_nodes/stats?filter_path=nodes.*.os.cgroup.cpuacct.usage_nanos" | jq .
{
  "nodes": {
    "WA7ANuASRiGF7xgO3dYp_w": {
      "os": {
        "cgroup": {
          "cpuacct": {
            "usage_nanos": 35975828297
          }
        }
      }
    }
  }
}

Logs (if relevant)
No response