Describe the bug
The rest_client_request_latency_seconds histogram metric exposed on the Prometheus metrics endpoint appears to have unbounded cardinality. At minimum, it includes a url label containing the URIs of all API versions registered in a given Kubernetes API server, including their query parameters. In my clusters, this amounts to 742 time series per controller instance.
An example of the metric, with labels, as exposed by the metrics endpoint to Prometheus:
```
...
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/batch/v1?timeout=32s",verb="GET",le="0.001"} 0
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/batch/v1?timeout=32s",verb="GET",le="0.002"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/batch/v1?timeout=32s",verb="GET",le="0.004"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/batch/v1?timeout=32s",verb="GET",le="0.008"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/batch/v1?timeout=32s",verb="GET",le="0.016"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/batch/v1?timeout=32s",verb="GET",le="0.032"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/batch/v1?timeout=32s",verb="GET",le="0.064"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/batch/v1?timeout=32s",verb="GET",le="0.128"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/batch/v1?timeout=32s",verb="GET",le="0.256"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/batch/v1?timeout=32s",verb="GET",le="0.512"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/batch/v1?timeout=32s",verb="GET",le="+Inf"} 1
rest_client_request_latency_seconds_sum{url="https://172.20.0.1:443/apis/batch/v1?timeout=32s",verb="GET"} 0.001278107
rest_client_request_latency_seconds_count{url="https://172.20.0.1:443/apis/batch/v1?timeout=32s",verb="GET"} 1
...
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s",verb="GET",le="0.001"} 0
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s",verb="GET",le="0.002"} 0
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s",verb="GET",le="0.004"} 0
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s",verb="GET",le="0.008"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s",verb="GET",le="0.016"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s",verb="GET",le="0.032"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s",verb="GET",le="0.064"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s",verb="GET",le="0.128"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s",verb="GET",le="0.256"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s",verb="GET",le="0.512"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s",verb="GET",le="+Inf"} 1
rest_client_request_latency_seconds_sum{url="https://172.20.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s",verb="GET"} 0.005832849
rest_client_request_latency_seconds_count{url="https://172.20.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s",verb="GET"} 1
...
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/networking.k8s.io/v1/ingressclasses?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.001"} 0
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/networking.k8s.io/v1/ingressclasses?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.002"} 0
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/networking.k8s.io/v1/ingressclasses?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.004"} 0
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/networking.k8s.io/v1/ingressclasses?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.008"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/networking.k8s.io/v1/ingressclasses?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.016"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/networking.k8s.io/v1/ingressclasses?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.032"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/networking.k8s.io/v1/ingressclasses?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.064"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/networking.k8s.io/v1/ingressclasses?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.128"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/networking.k8s.io/v1/ingressclasses?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.256"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/networking.k8s.io/v1/ingressclasses?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.512"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/networking.k8s.io/v1/ingressclasses?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="+Inf"} 1
rest_client_request_latency_seconds_sum{url="https://172.20.0.1:443/apis/networking.k8s.io/v1/ingressclasses?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET"} 0.004987509
rest_client_request_latency_seconds_count{url="https://172.20.0.1:443/apis/networking.k8s.io/v1/ingressclasses?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET"} 1
...
```
See the following on why this is an issue:
- https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels
- https://prometheus.io/docs/practices/naming/
Steps to reproduce
Deploy the aws-load-balancer-controller using the Helm Chart with the ServiceMonitor enabled (the serviceMonitor.enabled=true Chart value). Get metrics from the exposed Prometheus endpoint (Chart default, :8080/metrics).
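For concreteness, a minimal sketch of the chart values these steps assume; only the serviceMonitor.enabled value comes from this report:

```yaml
# values.yaml (sketch): enable the ServiceMonitor shipped with the chart so
# Prometheus scrapes the controller's metrics endpoint (:8080/metrics by default).
serviceMonitor:
  enabled: true
```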
Expected outcome
The rest_client_request_latency_seconds metric is either not present in the exposed metrics at all, or is filterable via the ServiceMonitor deployed by the Helm Chart.
Environment
- AWS Load Balancer Controller v2.4.4
- Kubernetes version v1.21
- EKS v1.21.14-eks-6d3986b
Additional context
This can be solved on multiple levels:
- If this is intended behavior, either
  - introduce a CLI flag to disable this metric, or
  - allow filtering out (dropping) the metric in the ServiceMonitor deployed through the Helm Chart. Note that dropping individual series by name requires the spec.endpoints.metricRelabelings field rather than spec.endpoints.relabelings, since target relabelings run before the scrape; see the sketch after this list.
- If this is not intended, remove the metric from the exposed metrics endpoint.
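A hypothetical sketch of such a drop rule, assuming the Prometheus Operator's ServiceMonitor CRD. Only the metric name comes from this report; the object name, selector labels, and port name are illustrative placeholders:

```yaml
# Sketch: drop the offending series at scrape time. metricRelabelings maps to
# Prometheus metric_relabel_configs, which run against scraped samples and can
# therefore match on __name__.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: aws-load-balancer-controller   # placeholder
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: aws-load-balancer-controller   # placeholder
  endpoints:
    - port: metrics   # placeholder port name
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: rest_client_request_latency_seconds_(bucket|sum|count)
          action: drop
```

Note that dropping at scrape time only keeps the series out of Prometheus; the controller still pays the memory cost of tracking them, which is why a flag to disable the metric at the source would be preferable.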