Kartik Dudeja

OpenTelemetry in Action on Kubernetes: Part 9 - Cluster-Level Observability with OpenTelemetry Agent + Gateway

Welcome to the grand finale of our observability series! So far, we’ve added visibility into our application through logs, metrics, and traces — all flowing beautifully into Grafana via the OpenTelemetry Collector.

But there’s still one big puzzle piece left: the Kubernetes cluster itself.

In this final part, we’ll:

  • Collect host and node-level metrics using hostmetrics
  • Deploy a centralized Collector in Deployment mode (gateway)
  • Introduce ServiceAccount for permissions
  • Collect cluster-state metrics (nodes, pods, deployments, and more) using the k8s_cluster receiver
  • Use the debug exporter to troubleshoot data pipelines
  • And finally, conclude the series with a high-level recap



Why Cluster-Level Observability Matters

While we've focused on application telemetry so far, it's just one piece of the puzzle. For full visibility, we must also observe the Kubernetes cluster itself — the infrastructure running our apps.

Cluster observability helps us:

  • Monitor node health and resource usage
  • Track control plane performance (API server, scheduler, etc.)
  • Understand pod scheduling and evictions
  • Improve scaling decisions
  • Troubleshoot infrastructure-level issues
  • Strengthen security and governance

In short, without visibility into the cluster, you're flying blind. This part of the series ensures you're watching not just the app, but the platform beneath it.

Add hostmetrics Receiver in the Agent

We’ll start by updating our otel-collector-agent (running as a DaemonSet) to use the hostmetrics receiver. This receiver collects system-level metrics from each node, such as CPU, memory, disk, filesystem, network, and load.

Config – otel-collector-agent-configmap.yaml

receivers:
  # otlp receiver carried over from the earlier parts of the series
  # (shown with the standard gRPC/HTTP endpoints); the metrics pipeline below still references it
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  hostmetrics:
    collection_interval: 1m
    scrapers:
      cpu: {}
      memory: {}
      disk: {}
      load: {}
      filesystem: {}
      network: {}
      system: {}

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 15
  batch:
    send_batch_size: 1000
    timeout: 5s

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    enable_open_metrics: true
    resource_to_telemetry_conversion:
      enabled: true

service:
  pipelines:
    # collect metrics from the otlp and hostmetrics receivers and expose them in a Prometheus-compatible format
    metrics:
      receivers: [otlp, hostmetrics]
      processors: [memory_limiter, batch]
      exporters: [prometheus]

Each hostmetrics receiver runs inside the agent pod on every node, giving us node-specific insights.
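A quick way to sanity-check the agent is to port-forward its Prometheus endpoint and look for system_* series. This is a rough sketch: it assumes the DaemonSet is named otel-collector-agent and exposes port 8889 as in the config above, and exact metric names can vary slightly between collector versions.

# Port-forward one agent pod's Prometheus exporter port (8889, as configured above;
# assumes the DaemonSet is named otel-collector-agent)
kubectl -n observability port-forward daemonset/otel-collector-agent 8889:8889

# In another terminal: confirm host metrics are being exposed
curl -s http://localhost:8889/metrics | grep '^system_' | head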

Deploy the OpenTelemetry Gateway

1. Why Deployment Mode?

  • Deployment Mode is used for centralized collection, aggregation, and export of telemetry data.
  • Unlike the DaemonSet agent, which runs on each node, a Deployment collector can scrape and process cluster-wide metrics.

2. Create a ServiceAccount, ClusterRole, and ClusterRoleBinding

To use the k8s_cluster receiver, the collector must have permission to access Kubernetes objects like nodes, pods, namespaces, etc.

What is a ServiceAccount in Kubernetes?

A ServiceAccount in Kubernetes is an identity used by pods to authenticate and interact securely with the Kubernetes API. While every pod gets a default ServiceAccount, you often need to create custom ones with specific RBAC (Role-Based Access Control) permissions for security and least privilege.

In our case, the OpenTelemetry Collector needs to read cluster state—like nodes, pods, and namespaces—to collect metrics using the k8s_cluster receiver. So, we create a dedicated ServiceAccount and bind it to a ClusterRole with read-only access to those resources. This ensures our collector can operate properly without over-privileging it.

# otel-collector-gateway-serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: otel-collector-gateway-sa
  namespace: observability
  labels:
    app: otel-collector-gateway
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector-gateway-role
  labels:
    app: otel-collector-gateway
rules:
  - apiGroups:
      - ""
    resources:
      - events
      - namespaces
      - namespaces/status
      - nodes
      - nodes/spec
      - pods
      - pods/status
      - replicationcontrollers
      - replicationcontrollers/status
      - resourcequotas
      - services
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - apps
    resources:
      - daemonsets
      - deployments
      - replicasets
      - statefulsets
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - extensions
    resources:
      - daemonsets
      - deployments
      - replicasets
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - batch
    resources:
      - jobs
      - cronjobs
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - autoscaling
    resources:
      - horizontalpodautoscalers
    verbs:
      - get
      - list
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector-gateway-binding
  labels:
    app: otel-collector-gateway
subjects:
  - kind: ServiceAccount
    name: otel-collector-gateway-sa
    namespace: observability
roleRef:
  kind: ClusterRole
  name: otel-collector-gateway-role
  apiGroup: rbac.authorization.k8s.io

Apply it:

kubectl -n observability apply -f otel-collector-gateway-serviceaccount.yaml 
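Before wiring up the collector, you can verify that the binding grants what the k8s_cluster receiver needs by impersonating the ServiceAccount with kubectl auth can-i (the ServiceAccount name matches the manifest above):

# Both commands should print "yes" once the ClusterRoleBinding is in place
kubectl auth can-i list pods --as=system:serviceaccount:observability:otel-collector-gateway-sa
kubectl auth can-i watch nodes --as=system:serviceaccount:observability:otel-collector-gateway-sa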

3. OpenTelemetry Collector Config with k8s_cluster Receiver

Create the config file as a ConfigMap.

# otel-collector-gateway-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-gateway-config
  namespace: observability
  labels:
    app: otel-collector-gateway
data:
  otel-collector-config.yaml: |
    receivers:
      k8s_cluster:
        auth_type: "serviceAccount"
        collection_interval: 30s

    processors:
      memory_limiter:
        check_interval: 1s
        limit_percentage: 80
        spike_limit_percentage: 15
      batch:
        send_batch_size: 1000
        timeout: 5s

    exporters:
      debug:
        verbosity: detailed
      prometheus:
        endpoint: "0.0.0.0:8889"
        enable_open_metrics: true
        resource_to_telemetry_conversion:
          enabled: true

    service:
      pipelines:
        metrics:
          receivers: [k8s_cluster]
          processors: [memory_limiter, batch]
          exporters: [prometheus]

Apply it:

kubectl -n observability apply -f otel-collector-gateway-configmap.yaml 

4. Deploy the OpenTelemetry Collector

# otel-collector-gateway-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector-gateway
  namespace: observability
  labels:
    app: otel-collector-gateway
spec:
  replicas: 1
  revisionHistoryLimit: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%        # Allow 25% more pods than desired during update
      maxUnavailable: 25%  # Allow 25% of desired pods to be unavailable during update
  selector:
    matchLabels:
      app: otel-collector-gateway
  template:
    metadata:
      labels:
        app: otel-collector-gateway
    spec:
      serviceAccountName: otel-collector-gateway-sa
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:latest
          args: ["--config=/conf/otel-collector-config.yaml"]
          volumeMounts:
            - name: config-volume
              mountPath: /conf
          resources:
            requests:
              cpu: 10m
              memory: 32Mi
            limits:
              cpu: 50m
              memory: 128Mi
      volumes:
        - name: config-volume
          configMap:
            name: otel-collector-gateway-config

Apply it:

kubectl -n observability apply -f otel-collector-gateway-deployment.yaml 
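A quick check that the rollout completed before moving on (the deployment name is the one defined above):

# Wait for the gateway deployment to become available
kubectl -n observability rollout status deployment/otel-collector-gateway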

5. Expose Collector to Prometheus

# otel-collector-gateway-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: otel-collector-gateway
  namespace: observability
  labels:
    app: otel-collector-gateway
spec:
  selector:
    app: otel-collector-gateway
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
      protocol: TCP
    - name: otlp-http
      port: 4318
      targetPort: 4318
      protocol: TCP
    - name: prometheus
      port: 8889
      targetPort: 8889
      protocol: TCP
  type: ClusterIP

Apply:

kubectl -n observability apply -f otel-collector-gateway-service.yaml 
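You may have noticed the Service also exposes the OTLP ports (4317 and 4318). The gateway config in this post only runs the k8s_cluster receiver, but if you later want the node agents to forward their telemetry through the gateway instead of exporting directly, the wiring would roughly look like the sketch below. This is an optional pattern, not one of the manifests applied in this series; the Service DNS name comes from the manifest above.

# On the agent (DaemonSet) config: export OTLP to the gateway Service
exporters:
  otlp:
    endpoint: otel-collector-gateway.observability.svc.cluster.local:4317
    tls:
      insecure: true # assumes plain gRPC inside the cluster

# On the gateway (Deployment) config: accept OTLP and reference it in the pipelines
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318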

Then add this to your Prometheus scrape_configs:

- job_name: 'otel-collector-gateway'
  static_configs:
    - targets: ['otel-collector-gateway.observability.svc.cluster.local:8889']
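Once Prometheus picks up the new scrape config, a simple query in the Prometheus UI (or Grafana Explore) confirms the target is healthy; the job label matches the job_name above:

# Returns 1 for each healthy scrape of the gateway
up{job="otel-collector-gateway"}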

Test and Verify

Check deployment status:

kubectl -n observability get all -l app=otel-collector-gateway 

(Screenshot: kubectl get all output for the otel-collector-gateway resources)
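You can also peek at the metrics the gateway exposes before Prometheus ever scrapes them. A rough check, assuming the Service and port defined earlier; exact metric names depend on the collector version, but the k8s_cluster metrics typically start with k8s_:

# Port-forward the gateway's Prometheus exporter port
kubectl -n observability port-forward svc/otel-collector-gateway 8889:8889

# In another terminal: cluster-state metrics such as pod phase and node conditions
curl -s http://localhost:8889/metrics | grep '^k8s_' | head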

Special Mention: Debug Exporter - Your Observability Wingman

The debug exporter in OpenTelemetry Collector is a lightweight and incredibly helpful tool for developers and DevOps engineers when building or troubleshooting telemetry pipelines.

Instead of exporting telemetry data (like logs, metrics, and traces) to a backend system like Prometheus or Jaeger, the debug exporter simply prints the data to the Collector's stdout. This means:

  • You can see exactly what telemetry data is being received and processed—live in the logs.
  • It helps validate instrumentation quickly, without setting up full observability backends.
  • It's especially useful when you're testing new receivers, processors, or pipelines, and want a quick look at the output.

When to Use

  • Local testing or dev environments.
  • Debugging broken data flow—if Prometheus or Jaeger isn’t showing what you expect.
  • Learning how OpenTelemetry transforms and routes telemetry data.

Example Configuration Snippet

exporters:
  debug:
    verbosity: detailed # outputs full content of each signal

Then, reference it in your pipeline like this:

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger, debug]

This ensures traces are sent to Jaeger and also printed to the console, which is great for double-checking what's actually flowing through the pipeline.
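Since the debug exporter writes to stdout, reading its output is just a matter of tailing the collector's logs. For the gateway Deployment created earlier, that looks like:

# Follow the collector's stdout to watch the debug exporter output live
kubectl -n observability logs deployment/otel-collector-gateway --follow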


Conclusion: You Now Have Full Observability!

Over the past 9 parts, you’ve:

  • Containerized a real ML application
  • Instrumented it with OpenTelemetry
  • Collected traces, logs, and metrics
  • Deployed observability tools in Kubernetes
  • Visualized everything in Grafana
  • Monitored the entire Kubernetes cluster with Agent + Gateway mode

You’ve essentially built a production-grade observability platform from scratch — without cloud vendor lock-in.


Missed the previous article?

Check out Part 8: Visualize Everything, Building a Unified Observability Dashboard with Grafana to see how we got here.


{ "author" : "Kartik Dudeja", "email" : "kartikdudeja21@gmail.com", "linkedin" : "https://linkedin.com/in/kartik-dudeja", "github" : "https://github.com/Kartikdudeja" } 
