Prometheus endpoint is exposing high cardinality unbounded metrics #2823

Description

@fkrestan

Describe the bug

It seems that the rest_client_request_latency_seconds histogram metric exposed on the Prometheus metrics endpoint has unbounded cardinality. At a minimum, it carries a url label containing the URIs of all API versions registered in a given Kubernetes API server, including their query parameters. In my clusters, this amounts to 742 time series per controller instance.

An example of the metrics, including labels, as exposed by the metrics endpoint to Prometheus:

...
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/batch/v1?timeout=32s",verb="GET",le="0.001"} 0
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/batch/v1?timeout=32s",verb="GET",le="0.002"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/batch/v1?timeout=32s",verb="GET",le="0.004"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/batch/v1?timeout=32s",verb="GET",le="0.008"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/batch/v1?timeout=32s",verb="GET",le="0.016"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/batch/v1?timeout=32s",verb="GET",le="0.032"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/batch/v1?timeout=32s",verb="GET",le="0.064"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/batch/v1?timeout=32s",verb="GET",le="0.128"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/batch/v1?timeout=32s",verb="GET",le="0.256"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/batch/v1?timeout=32s",verb="GET",le="0.512"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/batch/v1?timeout=32s",verb="GET",le="+Inf"} 1
rest_client_request_latency_seconds_sum{url="https://172.20.0.1:443/apis/batch/v1?timeout=32s",verb="GET"} 0.001278107
rest_client_request_latency_seconds_count{url="https://172.20.0.1:443/apis/batch/v1?timeout=32s",verb="GET"} 1
...
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s",verb="GET",le="0.001"} 0
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s",verb="GET",le="0.002"} 0
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s",verb="GET",le="0.004"} 0
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s",verb="GET",le="0.008"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s",verb="GET",le="0.016"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s",verb="GET",le="0.032"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s",verb="GET",le="0.064"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s",verb="GET",le="0.128"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s",verb="GET",le="0.256"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s",verb="GET",le="0.512"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s",verb="GET",le="+Inf"} 1
rest_client_request_latency_seconds_sum{url="https://172.20.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s",verb="GET"} 0.005832849
rest_client_request_latency_seconds_count{url="https://172.20.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s",verb="GET"} 1
...
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/networking.k8s.io/v1/ingressclasses?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.001"} 0
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/networking.k8s.io/v1/ingressclasses?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.002"} 0
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/networking.k8s.io/v1/ingressclasses?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.004"} 0
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/networking.k8s.io/v1/ingressclasses?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.008"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/networking.k8s.io/v1/ingressclasses?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.016"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/networking.k8s.io/v1/ingressclasses?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.032"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/networking.k8s.io/v1/ingressclasses?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.064"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/networking.k8s.io/v1/ingressclasses?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.128"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/networking.k8s.io/v1/ingressclasses?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.256"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/networking.k8s.io/v1/ingressclasses?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="0.512"} 1
rest_client_request_latency_seconds_bucket{url="https://172.20.0.1:443/apis/networking.k8s.io/v1/ingressclasses?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET",le="+Inf"} 1
rest_client_request_latency_seconds_sum{url="https://172.20.0.1:443/apis/networking.k8s.io/v1/ingressclasses?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET"} 0.004987509
rest_client_request_latency_seconds_count{url="https://172.20.0.1:443/apis/networking.k8s.io/v1/ingressclasses?limit=%7Bvalue%7D&resourceVersion=%7Bvalue%7D",verb="GET"} 1
...

See the Prometheus best-practices documentation on labels for why unbounded label cardinality is an issue.
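As a possible interim workaround, the series can be dropped at scrape time on the Prometheus side. A minimal sketch for a plain (non-Operator) scrape_config; the job name and target below are placeholders:

  scrape_configs:
    - job_name: aws-load-balancer-controller              # placeholder job name
      static_configs:
        - targets: ["aws-load-balancer-controller:8080"]  # placeholder target address
      metric_relabel_configs:
        - source_labels: [__name__]
          regex: rest_client_request_latency_seconds.*    # drops the _bucket, _sum and _count series
          action: drop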

Steps to reproduce

Deploy the aws-load-balancer-controller using the Helm Chart with the ServiceMonitor enabled (serviceMonitor.enabled=true Chart value). Get metrics from the exposed Prometheus endpoint (Chart default, :8080/metrics).
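For reference, a minimal Helm values sketch for this setup (the serviceMonitor.enabled value is the one named above; the metrics endpoint defaults to :8080/metrics):

  # values.yaml (sketch)
  serviceMonitor:
    enabled: true   # renders the chart's ServiceMonitor for the :8080/metrics endpoint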

Expected outcome

The rest_client_request_latency_seconds metric is either not present in the exposed metrics at all, or can be filtered out via the ServiceMonitor deployed by the Helm Chart.

Environment

  • AWS Load Balancer Controller v2.4.4
  • Kubernetes version v1.21
  • EKS v1.21.14-eks-6d3986b

Additional Context:

This can be solved on multiple levels:

  • If this is intended behavior, either
    • introduce a CLI flag to disable this metric, or
    • allow filtering out (dropping) the metric in the ServiceMonitor deployed through the Helm Chart using the spec.endpoints.relabelings field (see the sketch after this list).
  • If this is not intended, remove the metric from the exposed metrics endpoint.
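For illustration, a sketch of the second option, assuming the chart's ServiceMonitor template exposed relabeling configuration. The object name, port name and selector labels below are illustrative; also note that dropping scraped series is typically done with spec.endpoints[].metricRelabelings (which the Prometheus Operator turns into metric_relabel_configs) rather than relabelings:

  apiVersion: monitoring.coreos.com/v1
  kind: ServiceMonitor
  metadata:
    name: aws-load-balancer-controller                         # illustrative name
  spec:
    selector:
      matchLabels:
        app.kubernetes.io/name: aws-load-balancer-controller   # illustrative selector
    endpoints:
      - port: metrics                                           # illustrative port name
        metricRelabelings:
          - sourceLabels: [__name__]
            regex: rest_client_request_latency_seconds.*
            action: drop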
