
os.cgroup.cpuacct.usage_nanos is actually microseconds when Elasticsearch is run inside cgroup v2 #96089

@b-deam

Description


Elasticsearch Version

master

Installed Plugins

No response

Java Version

bundled

OS Version

5.15.0-1036-azure

Problem Description

When Elasticsearch is run inside a cgroup v2, the value reported in the node stats output for "https://elasticsearch:9200/_nodes/stats?filter_path=nodes.*.os.cgroup.cpuacct.usage_nanos" is actually in microseconds; cgroup v1 correctly reports this value in nanoseconds:

cgroupCpuAcctUsageNanos = cpuStatsMap.get("usage_usec");
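If the field is to keep reporting nanoseconds, the cgroup v2 value would need a ×1000 scaling at this point. A minimal sketch of that conversion, using hypothetical names (this is not the actual Elasticsearch code):

```java
import java.util.Map;

// Sketch only, hypothetical helper -- not the actual Elasticsearch code.
// cgroup v2's cpu.stat reports usage_usec in microseconds, so a field named
// usage_nanos should scale it by 1000 before reporting.
public class CgroupUsage {
    static long usageNanosFromCgroupV2(Map<String, Long> cpuStatsMap) {
        return cpuStatsMap.get("usage_usec") * 1000L; // microseconds -> nanoseconds
    }

    public static void main(String[] args) {
        // usage_usec value from the AKS output shown below: 104036485036 µs
        System.out.println(usageNanosFromCgroupV2(Map.of("usage_usec", 104_036_485_036L)));
        // prints 104036485036000
    }
}
```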

We collect these stats in Rally's node-stats telemetry device, and it became clear that the formula we use to derive CPU usage from the available time is off by a factor of 1000 (i.e. the difference between nanoseconds and microseconds) for any container running inside a cgroup v2.
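The effect on a derived utilization figure can be sketched as follows (hypothetical names and a simplified ratio; Rally's actual formula lives in its node-stats telemetry device):

```java
// Sketch only: utilization as CPU time consumed over wall-clock time elapsed.
public class CpuUtilization {
    // Both arguments are expected in nanoseconds.
    static double utilization(long usageDeltaNanos, long elapsedNanos) {
        return (double) usageDeltaNanos / elapsedNanos;
    }

    public static void main(String[] args) {
        long elapsedNanos = 10_000_000_000L; // 10 s sampling window

        // 5 s of CPU time reported correctly in nanoseconds -> 50% utilization
        System.out.println(utilization(5_000_000_000L, elapsedNanos)); // prints 0.5

        // On cgroup v2 the same 5 s arrives as 5_000_000 (microseconds mislabeled
        // as nanoseconds), so the derived utilization is 1000x too small:
        System.out.println(utilization(5_000_000L, elapsedNanos)); // prints 5.0E-4
    }
}
```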

The below screenshots show the difference between cgroup v1 running on Google Kubernetes Engine (GKE), and cgroup v2 running on Azure Kubernetes Service (AKS):

GKE output (cgroup v1):

$ uname -a
Linux es-es-search-7b66d98c5b-fs28n 5.15.89+ #1 SMP Sat Mar 18 09:27:02 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

$ mount -l | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (ro,nosuid,nodev,noexec,relatime,xattr,name=systemd)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (ro,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/pids type cgroup (ro,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/blkio type cgroup (ro,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/freezer type cgroup (ro,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/cpuset type cgroup (ro,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (ro,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/perf_event type cgroup (ro,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/rdma type cgroup (ro,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (ro,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/devices type cgroup (ro,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/memory type cgroup (ro,nosuid,nodev,noexec,relatime,memory)

# nanoseconds
$ cat /sys/fs/cgroup/cpu,cpuacct/cpuacct.usage
63277158346752

AKS output (cgroup v2):

$ uname -a
Linux es-es-index-6f49648d8-jhm9s 5.15.0-1036-azure #43-Ubuntu SMP Wed Mar 29 16:11:05 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

$ mount -l | grep cgroup
cgroup on /sys/fs/cgroup type cgroup2 (ro,nosuid,nodev,noexec,relatime)

# microseconds
$ cat /sys/fs/cgroup/cpu.stat
usage_usec 104036485036
user_usec 98419994704
system_usec 5616490332
nr_periods 164357
nr_throttled 143842
throttled_usec 9516539086
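As a plain-arithmetic sanity check (not tied to any Elasticsearch code), the two raw values above describe CPU time in different units, and only after scaling the cgroup v2 value by 1000 are they comparable:

```java
// Unit sanity check on the raw values quoted above.
public class UnitCheck {
    public static void main(String[] args) {
        long gkeCpuacctUsage = 63_277_158_346_752L; // cgroup v1 cpuacct.usage, nanoseconds
        long aksUsageUsec = 104_036_485_036L;       // cgroup v2 usage_usec, microseconds

        // cgroup v1 value is already nanoseconds: ~63277 s of CPU time
        System.out.println(gkeCpuacctUsage / 1_000_000_000L); // prints 63277

        // cgroup v2 value must be scaled by 1000 first: ~104036 s of CPU time
        System.out.println(aksUsageUsec * 1_000L / 1_000_000_000L); // prints 104036
    }
}
```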

Steps to Reproduce

I encountered the bug when running an Elasticsearch cluster inside an Azure Kubernetes Service (AKS) cluster, but that's not exactly practical for reproductions.

We can repro this using ECK and Minikube with the Docker driver on macOS.

Note that for Linux users Minikube automatically detects whether cgroup v1 or v2 is in use on your workstation (i.e. where you invoke minikube start from), whereas for Docker Desktop users on macOS (which actually creates a Linux VM in the background) we need to adjust the cgroup version by modifying the engine's settings (more on this below).

Testing with cgroup v2:

# minikube > 1.23 defaults to cgroup v2
$ minikube version
minikube version: v1.26.1
commit: 62e108c3dfdec8029a890ad6d8ef96b6461426dc

# start minikube
$ minikube start

# create eck operator
$ minikube kubectl -- create -f https://download.elastic.co/downloads/eck/2.7.0/crds.yaml
$ minikube kubectl -- apply -f https://download.elastic.co/downloads/eck/2.7.0/operator.yaml
$ minikube kubectl -- -n elastic-system logs -f statefulset.apps/elastic-operator

# deploy elasticsearch
$ cat <<EOF | minikube kubectl -- apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 8.7.1
  nodeSets:
  - name: default
    count: 1
    config:
      node.store.allow_mmap: false
EOF

# check that es is using cgroup v2
$ minikube kubectl -- -n elastic-system exec -it local-es-default-0 -- /bin/sh
# inside es pod/container
sh-5.0$ mount -l | grep cgroup
cgroup on /sys/fs/cgroup type cgroup2 (ro,nosuid,nodev,noexec,relatime)

# get pass
$ PASSWORD=$(minikube kubectl -- get secret quickstart-es-elastic-user -o go-template='{{.data.elastic | base64decode}}')

# make service avail
$ minikube kubectl -- port-forward service/quickstart-es-http 9200

# check output
$ curl -s -u "elastic:$PASSWORD" -k "https://localhost:9200/_nodes/stats?filter_path=nodes.*.os.cgroup.cpuacct.usage_nanos" | jq .
{
  "nodes": {
    "WA7ANuASRiGF7xgO3dYp_w": {
      "os": {
        "cgroup": {
          "cpuacct": {
            "usage_nanos": 107115513
          }
        }
      }
    }
  }
}

Testing with cgroup v1:

I'm on macOS Monterey 12.6 using the docker driver for minikube, which is actually a Linux VM running behind the scenes. In order to force it to use cgroup v1 I had to set "deprecatedCgroupv1": true in $HOME/Library/Group\ Containers/group.com.docker/settings.json and then restart Docker Desktop before following these steps:

$ minikube kubectl -- -n elastic-system exec -it local-es-default-0 -- /bin/sh
# inside es pod/container
sh-5.0$ mount -l | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755)
cgroup on /sys/fs/cgroup/cpuset type cgroup (ro,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu type cgroup (ro,nosuid,nodev,noexec,relatime,cpu)
cgroup on /sys/fs/cgroup/cpuacct type cgroup (ro,nosuid,nodev,noexec,relatime,cpuacct)
cgroup on /sys/fs/cgroup/blkio type cgroup (ro,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/memory type cgroup (ro,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (ro,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (ro,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls type cgroup (ro,nosuid,nodev,noexec,relatime,net_cls)
cgroup on /sys/fs/cgroup/perf_event type cgroup (ro,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/net_prio type cgroup (ro,nosuid,nodev,noexec,relatime,net_prio)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (ro,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/pids type cgroup (ro,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/rdma type cgroup (ro,nosuid,nodev,noexec,relatime,rdma)
systemd on /sys/fs/cgroup/systemd type cgroup (ro,nosuid,nodev,noexec,relatime,name=systemd)

# get pass
$ PASSWORD=$(minikube kubectl -- get secret quickstart-es-elastic-user -o go-template='{{.data.elastic | base64decode}}')

# make service avail
$ minikube kubectl -- port-forward service/quickstart-es-http 9200

# check output
$ curl -s -u "elastic:$PASSWORD" -k "https://localhost:9200/_nodes/stats?filter_path=nodes.*.os.cgroup.cpuacct.usage_nanos" | jq .
{
  "nodes": {
    "WA7ANuASRiGF7xgO3dYp_w": {
      "os": {
        "cgroup": {
          "cpuacct": {
            "usage_nanos": 35975828297
          }
        }
      }
    }
  }
}

Logs (if relevant)

No response
