Closed
Add OpenTelemetry-based observability to Newt
Reference: fosrl/gerbil#25
Summary / Goal
Instrument Newt with OpenTelemetry Metrics (OTel) following CNCF / industry standards so that:
- Metrics are emitted using the OpenTelemetry Go SDK (vendor-neutral API).
- Metrics are backend-agnostic and exportable to Prometheus‑compatible backends and any OTLP‑supporting system via the OpenTelemetry Collector.
- Semantic conventions, SI units (_seconds, _bytes), and low‑cardinality labels are enforced.
- Focus is metrics first; design should allow adding traces and logs later.
- Provide an out‑of‑the‑box /metrics endpoint for Prometheus scraping (Prometheus exporter) and example OTel Collector config for production pipelines.
Why OpenTelemetry (OTel)
- OTel is the CNCF standard for multi‑signal observability (metrics, traces, logs).
- Instrument once, export anywhere (Prometheus, Grafana Mimir, Thanos/Cortex, cloud vendors).
- OTel Collector enables enrichment, normalization, batching, and flexible export pipelines (OTLP, remote_write).
Requirements & Constraints
- Use the OpenTelemetry Go SDK (modules) and follow OTel semantic conventions for relevant signals (HTTP, RPC, network).
- Provide a /metrics endpoint in Prometheus exposition format via the OTel Prometheus exporter.
- All durations in seconds and sizes in bytes. Metric names should carry units where applicable (_seconds, _bytes) and counters use _total.
- Labels must be low-cardinality and stable (e.g., site_id, tunnel_id, transport).
- Exporters configurable at runtime through environment variables (no code change required to switch).
- Provide an example OTel Collector config demonstrating attribute promotion and remote_write.
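The naming rules above could be enforced mechanically, for example in a unit test over the metric catalog. The following is a minimal sketch, not part of any OTel package: `CheckMetricName`, the `newt_` prefix pattern, and the rule that histograms must end in `_seconds` or `_bytes` are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// metricNameRe accepts lowercase snake_case names with the newt_ prefix.
// The prefix requirement is an assumption based on the proposed catalog.
var metricNameRe = regexp.MustCompile(`^newt_[a-z0-9]+(_[a-z0-9]+)*$`)

// CheckMetricName is a hypothetical lint helper.
// kind is one of "counter", "gauge" or "histogram".
func CheckMetricName(name, kind string) error {
	if !metricNameRe.MatchString(name) {
		return fmt.Errorf("%s: want lowercase snake_case with newt_ prefix", name)
	}
	if kind == "counter" && !strings.HasSuffix(name, "_total") {
		return fmt.Errorf("%s: counters must end in _total", name)
	}
	// Assumed rule: every histogram measures a duration or a size.
	if kind == "histogram" &&
		!strings.HasSuffix(name, "_seconds") && !strings.HasSuffix(name, "_bytes") {
		return fmt.Errorf("%s: histograms must carry a _seconds or _bytes suffix", name)
	}
	return nil
}

func main() {
	checks := [][2]string{
		{"newt_tunnel_bytes_total", "counter"},
		{"newt_tunnel_latency_ms", "histogram"}, // rejected: wrong unit suffix
	}
	for _, c := range checks {
		fmt.Println(c[0], "->", CheckMetricName(c[0], c[1]))
	}
}
```

Running such a check in CI would catch naming drift before a metric reaches a dashboard.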
Recommended Newt Metrics
| Category | Metric Name | Type | Labels | Units / Notes |
|---|---|---|---|---|
| Site / Registration | newt_site_registrations_total | Counter | site_id, region, result | count |
| | newt_site_online | Gauge | site_id, transport | bool (0/1) |
| | newt_site_last_heartbeat_seconds | Gauge | site_id | seconds since last heartbeat |
| Tunnel / Sessions | newt_tunnel_sessions_total | Gauge | site_id, tunnel_id, transport | active sessions |
| | newt_tunnel_bytes_total | Counter | site_id, tunnel_id, direction | bytes (in/out) |
| | newt_tunnel_latency_seconds | Histogram | site_id, tunnel_id, transport | seconds |
| | newt_tunnel_reconnects_total | Counter | site_id, tunnel_id, reason | count |
| Connection / NAT | newt_connection_attempts_total | Counter | site_id, transport, result | count |
| | newt_connection_errors_total | Counter | site_id, transport, error_type | count |
| | newt_nat_mapping_active | Gauge | site_id, mapping_type | bool/count |
| Peer / Health | newt_peer_heartbeat_latency_seconds | Histogram | site_id, peer_id | seconds |
| | newt_peer_last_handshake_seconds | Gauge | site_id, peer_id | seconds |
| Operational / Ops | newt_config_reloads_total | Counter | result | count |
| | newt_restart_count_total | Counter | | count |
| Runtime | newt_go_goroutines | Gauge | | count |
| | newt_go_mem_alloc_bytes | Gauge | | bytes |
Implementation Plan
- Dependencies (example packages)
  - Add OpenTelemetry Go modules to go.mod:
    - go.opentelemetry.io/otel
    - go.opentelemetry.io/otel/sdk/metric
    - go.opentelemetry.io/otel/exporters/prometheus
    - go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc (or OTLP HTTP variant)
    - Optional contrib instrumentation:
      - go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp
      - go.opentelemetry.io/contrib/instrumentation/runtime
    - ...
- Central metrics package
  - Create internal/metrics/ that:
    - Initializes OTel MeterProvider.
    - Registers Prometheus exporter (when enabled) and exposes a handler on /metrics (or mounts to existing server route).
    - Optionally registers OTLP exporter when enabled via env vars.
    - Pre-registers all Newt metric instruments with names, descriptions and label keys.
    - Exposes a singleton metrics API with helper functions:
      - Inc(name string, labels ...attribute.KeyValue)
      - Observe(name string, value float64, labels ...attribute.KeyValue)
      - SetGauge(name string, value float64, labels ...attribute.KeyValue)
    - Implements Shutdown(ctx) to flush and stop providers/exporters.
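To make the shape of the helper API concrete, here is a standard-library-only sketch of a pre-registering facade. It is a stand-in, not the proposed implementation: `KeyValue` substitutes for `attribute.KeyValue`, the in-memory map substitutes for real OTel instruments, and `Registry`, `Register`, `Value` and the `bool` return values are hypothetical additions for illustration. `Observe` is omitted for brevity.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// KeyValue stands in for attribute.KeyValue from the OTel API (assumption).
type KeyValue struct{ Key, Value string }

// instrument holds the allowed label keys and one value per label set.
type instrument struct {
	labelKeys map[string]bool
	series    map[string]float64
}

// Registry is a hypothetical in-memory substitute for a MeterProvider.
type Registry struct {
	mu          sync.Mutex
	instruments map[string]*instrument
}

func NewRegistry() *Registry {
	return &Registry{instruments: make(map[string]*instrument)}
}

// Register declares a metric up front together with its allowed label keys.
func (r *Registry) Register(name string, labelKeys ...string) {
	inst := &instrument{labelKeys: make(map[string]bool), series: make(map[string]float64)}
	for _, k := range labelKeys {
		inst.labelKeys[k] = true
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	r.instruments[name] = inst
}

// seriesKey builds an order-independent key for a label set.
func seriesKey(labels []KeyValue) string {
	parts := make([]string, 0, len(labels))
	for _, kv := range labels {
		parts = append(parts, kv.Key+"="+kv.Value)
	}
	sort.Strings(parts)
	return strings.Join(parts, ",")
}

// update applies fn to one series; it reports false for an unregistered
// metric name or label key instead of silently creating a new series.
func (r *Registry) update(name string, labels []KeyValue, fn func(float64) float64) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	inst, ok := r.instruments[name]
	if !ok {
		return false
	}
	for _, kv := range labels {
		if !inst.labelKeys[kv.Key] {
			return false
		}
	}
	key := seriesKey(labels)
	inst.series[key] = fn(inst.series[key])
	return true
}

func (r *Registry) Inc(name string, labels ...KeyValue) bool {
	return r.update(name, labels, func(v float64) float64 { return v + 1 })
}

func (r *Registry) SetGauge(name string, value float64, labels ...KeyValue) bool {
	return r.update(name, labels, func(float64) float64 { return value })
}

// Value reads back one series; it exists only to make the sketch observable.
func (r *Registry) Value(name string, labels ...KeyValue) float64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	if inst, ok := r.instruments[name]; ok {
		return inst.series[seriesKey(labels)]
	}
	return 0
}

func main() {
	reg := NewRegistry()
	reg.Register("newt_site_registrations_total", "site_id", "region", "result")
	ok := reg.Inc("newt_site_registrations_total",
		KeyValue{"site_id", "s1"}, KeyValue{"result", "success"})
	fmt.Println("recorded:", ok)
	fmt.Println("unknown label accepted:",
		reg.Inc("newt_site_registrations_total", KeyValue{"user_ip", "10.0.0.1"}))
}
```

Rejecting unregistered names and label keys at the facade is one way to keep cardinality bounded; a real implementation would presumably delegate each call to an OTel counter, gauge or histogram instead of a map.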
- Instrumentation approach
  - Site registration & heartbeats:
    - Increment registration counters and set site_online / site_last_heartbeat.
  - Tunnels & sessions:
    - Update session counts, bytes in/out, latency histograms, reconnect counters.
  - Connection & NAT logic:
    - Record connection attempts, successes/failures, NAT mapping states.
  - Peer health & handshakes:
    - Observe heartbeat latency and last handshake timestamps.
  - Operational flows:
    - Config reloads and restarts.
  - Runtime metrics:
    - Register basic Go runtime metrics (goroutines, mem) via contrib or runtime package and export them.
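When recording connection failures, the error_type label needs a small, fixed set of values; putting raw error strings into a label would make cardinality unbounded. A possible mapping is sketched below. The function name `ErrorType` and the specific categories are assumptions, not something the proposal defines.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net"
	"os"
)

// ErrorType maps an arbitrary error to one of a few stable label values.
// The category names are hypothetical.
func ErrorType(err error) string {
	var netErr net.Error
	switch {
	case err == nil:
		return "none"
	case errors.Is(err, context.Canceled):
		return "canceled"
	case errors.Is(err, context.DeadlineExceeded), errors.Is(err, os.ErrDeadlineExceeded):
		return "timeout"
	case errors.As(err, &netErr) && netErr.Timeout():
		return "timeout"
	case errors.Is(err, io.EOF), errors.Is(err, io.ErrUnexpectedEOF):
		return "closed"
	default:
		return "other"
	}
}

// samples are illustrative errors; the address is from a documentation range.
var samples = []error{
	fmt.Errorf("dial 203.0.113.7:51820: %w", context.DeadlineExceeded),
	fmt.Errorf("read handshake: %w", io.EOF),
	errors.New("unexpected peer state"),
}

func main() {
	for _, err := range samples {
		fmt.Printf("%-50q -> %s\n", err.Error(), ErrorType(err))
	}
}
```

The returned string would then be passed as the error_type attribute of newt_connection_errors_total, while the full error text goes to logs.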
- Histograms & buckets
  - Duration buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30]
  - Byte-size buckets: [512, 1024, 4096, 16384, 65536, 262144, 1048576]
  - Always use seconds for durations and bytes for sizes.
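To illustrate how the duration boundaries above translate into Prometheus-style cumulative buckets, the sketch below assigns observations to buckets with inclusive upper bounds and then accumulates the counts. `BucketIndex` and `Cumulative` are hypothetical helpers written for this example; in practice the SDK and exporter do this work.

```go
package main

import (
	"fmt"
	"sort"
)

// durationBuckets are the proposed upper bounds in seconds.
var durationBuckets = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30}

// BucketIndex returns the index of the first bound >= v (inclusive upper
// bound), or len(bounds) when v falls into the overflow (+Inf) bucket.
func BucketIndex(bounds []float64, v float64) int {
	return sort.SearchFloat64s(bounds, v)
}

// Cumulative returns one count per bound plus a final +Inf entry, where each
// entry counts all observations less than or equal to that bound.
func Cumulative(bounds, observations []float64) []uint64 {
	counts := make([]uint64, len(bounds)+1)
	for _, v := range observations {
		counts[BucketIndex(bounds, v)]++
	}
	for i := 1; i < len(counts); i++ {
		counts[i] += counts[i-1]
	}
	return counts
}

func main() {
	latencies := []float64{0.003, 0.03, 0.05, 0.8, 45} // seconds
	counts := Cumulative(durationBuckets, latencies)
	for i, bound := range durationBuckets {
		fmt.Printf("le=%g\t%d\n", bound, counts[i])
	}
	fmt.Printf("le=+Inf\t%d\n", counts[len(durationBuckets)])
}
```

Note that the 45 s observation lands only in the +Inf bucket, which suggests checking whether 30 s is a high enough top boundary for reconnect-heavy tunnels.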
- Exporter configuration (runtime)
  - Environment variables (suggested defaults):
    - NEWT_METRICS_PROMETHEUS_ENABLED=true
    - NEWT_METRICS_OTLP_ENABLED=false
    - OTEL_EXPORTER_OTLP_ENDPOINT (when OTLP enabled)
    - OTEL_EXPORTER_OTLP_PROTOCOL (http/protobuf or grpc)
    - OTEL_SERVICE_NAME=newt
    - OTEL_RESOURCE_ATTRIBUTES (e.g., service.instance.id=...)
    - OTEL_METRIC_EXPORT_INTERVAL (ms)
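The environment variables above could be read into a small config struct at startup. This is a sketch under assumptions: `Config`, `LoadConfig` and `envBool` are hypothetical names, and the 60 s fallback for the export interval is assumed here to mirror the OTel specification default. The OTel SDK and OTLP exporters generally read the OTEL_* variables on their own, so a real implementation may only need to parse the two NEWT_* switches.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// Config is a hypothetical container for the exporter settings.
type Config struct {
	PrometheusEnabled bool
	OTLPEnabled       bool
	OTLPEndpoint      string
	OTLPProtocol      string
	ServiceName       string
	ExportInterval    time.Duration
}

// envBool parses a boolean variable and falls back to def when the variable
// is unset or not a valid boolean.
func envBool(getenv func(string) string, key string, def bool) bool {
	if v, err := strconv.ParseBool(getenv(key)); err == nil {
		return v
	}
	return def
}

// LoadConfig takes the lookup function as a parameter so tests can inject
// a fake environment instead of mutating the process environment.
func LoadConfig(getenv func(string) string) Config {
	cfg := Config{
		PrometheusEnabled: envBool(getenv, "NEWT_METRICS_PROMETHEUS_ENABLED", true),
		OTLPEnabled:       envBool(getenv, "NEWT_METRICS_OTLP_ENABLED", false),
		OTLPEndpoint:      getenv("OTEL_EXPORTER_OTLP_ENDPOINT"),
		OTLPProtocol:      getenv("OTEL_EXPORTER_OTLP_PROTOCOL"),
		ServiceName:       getenv("OTEL_SERVICE_NAME"),
		ExportInterval:    60 * time.Second, // assumed fallback
	}
	if cfg.ServiceName == "" {
		cfg.ServiceName = "newt"
	}
	// OTEL_METRIC_EXPORT_INTERVAL is expressed in milliseconds.
	if ms, err := strconv.Atoi(getenv("OTEL_METRIC_EXPORT_INTERVAL")); err == nil && ms > 0 {
		cfg.ExportInterval = time.Duration(ms) * time.Millisecond
	}
	return cfg
}

func main() {
	cfg := LoadConfig(os.Getenv)
	fmt.Printf("%+v\n", cfg)
}
```

With this shape, switching from a direct Prometheus scrape to an OTLP pipeline is a deployment change only, for example setting NEWT_METRICS_OTLP_ENABLED=true together with OTEL_EXPORTER_OTLP_ENDPOINT.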
- Local testing
  - Provide docker-compose.metrics.yml with:
    - Newt (local build)
    - OpenTelemetry Collector (example config)
    - Prometheus (scraping /metrics or scraping Collector)
    - Grafana (optional)
  - Validate direct Prometheus scrape and OTLP → Collector → remote_write flows.
- Collector example
  - Include examples/collector.yaml demonstrating:
    - OTLP receiver
    - Transform processor to promote resource attributes (e.g., wg_interface, peer, site_id)
    - Prometheus remote_write exporter (generic endpoint)
  - Notes on:
    - Metric name normalization for Prometheus
    - out_of_order_time_window if sending OTLP to Prometheus
- Documentation
  - observability.md:
    - Metric catalog (name, type, labels, units, description)
    - How to enable/disable Prometheus exporter and OTLP exporter via env vars
    - How to run Docker Compose test stack
    - How to add a new metric (naming, labels, buckets)
- Testing & validation
  - Manual test: start compose, generate traffic, curl /metrics, verify metric names, units, labels and histogram buckets.
  - Include sample /metrics output in the PR.
  - ...
Acceptance Criteria
- /metrics endpoint exposes OTel metrics in Prometheus format with correct naming and units.
- Newt metrics cover site registration/heartbeats, tunnel sessions/throughput/latency, connections/NAT, peer health, certificates and operational events.
- Exporter backends can be swapped via environment variables without code changes.
- Example OTel Collector config provided and tested in local compose flow.
- docs/observability.md added with metric catalog and run instructions.
🔗 References & Best Practices
- Traefik - Metrics (observability) -- Traefik metrics configuration and exporter options.
- OpenTelemetry - Go: Getting Started / Instrumentation Guide -- How to instrument Go applications with OpenTelemetry.
- OpenTelemetry - Go: Exporters -- Exporter options for Go (OTLP, Prometheus, etc.).
Guides & integrations
- Prometheus - OpenTelemetry guide -- Guidance for integrating Prometheus with OpenTelemetry.
- Prometheus blog - Commitment to OpenTelemetry (Mar 2024) -- Prometheus project notes and recommended OTLP ingestion patterns.
Practical walkthroughs & blog posts
- OpenTelemetry blog - Prometheus + OpenTelemetry (2024) -- Practical notes on combining Prometheus and OpenTelemetry.
- Grafana Blog - A practical guide to data collection with OpenTelemetry and Prometheus (Jul 2023) -- Hands-on examples and best practices for OTEL + Prometheus.
- BetterStack - OpenTelemetry for Go -- Practical guide for instrumenting Go apps with OpenTelemetry.
- BetterStack - OpenTelemetry metrics vs Prometheus metrics -- Comparison and guidance on when to use OTEL vs Prometheus metrics.