Aleksandr Murzin

Building Enterprise-Level Monitoring: From Prometheus to Grafana Dashboards

Cover Image: Main Grafana dashboard with HTTP and performance metrics

Introduction

Once your web application hits production, the most critical question becomes: how is it performing right now? Logs tell you what happened, but you want to spot problems before users start complaining.

In this article, I'll share how I built a complete monitoring system for Peakline — a FastAPI application for Strava data analysis that processes thousands of requests daily from athletes worldwide.

What's Inside:

  • Metrics architecture (HTTP, API, business metrics)
  • Prometheus + Grafana setup from scratch
  • 50+ production-ready metrics
  • Advanced PromQL queries
  • Reactive dashboards
  • Best practices and pitfalls

Architecture: Three Monitoring Levels

Modern monitoring isn't just "set up Grafana and look at graphs." It's a well-thought-out architecture with several layers:

┌─────────────────────────────────────────────────┐
│ FastAPI Application                             │
│ ├── HTTP Middleware (auto-collect metrics)      │
│ ├── Business Logic (business metrics)           │
│ └── /metrics endpoint (Prometheus format)       │
└──────────────────┬──────────────────────────────┘
                   │ scrape every 5s
┌──────────────────▼──────────────────────────────┐
│ Prometheus                                      │
│ ├── Time Series Database (TSDB)                 │
│ ├── Storage retention: 200h                     │
│ └── PromQL Engine                               │
└──────────────────┬──────────────────────────────┘
                   │ query data
┌──────────────────▼──────────────────────────────┐
│ Grafana                                         │
│ ├── Dashboards                                  │
│ ├── Alerting                                    │
│ └── Visualization                               │
└─────────────────────────────────────────────────┘

Why This Stack?

Prometheus — the de facto standard for metrics: a pull model, the powerful PromQL query language, and excellent Kubernetes integration.

Grafana — the best visualization tool. Beautiful dashboards, alerting, templating, rich UI.

FastAPI — an async Python framework that integrates cleanly with the prometheus_client library.

Basic Infrastructure Setup

Docker Compose: 5-Minute Quick Start

First, let's spin up Prometheus and Grafana in Docker:

# docker-compose.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=200h'  # 8+ days of history
      - '--web.enable-lifecycle'
    networks:
      - monitoring
    extra_hosts:
      - "host.docker.internal:host-gateway"  # Access host machine

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}  # Use .env!
      - GF_SERVER_ROOT_URL=/grafana  # For nginx reverse proxy
    volumes:
      - grafana_data:/var/lib/grafana
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning
    depends_on:
      - prometheus
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:

networks:
  monitoring:
    driver: bridge

Key Points:

  • storage.tsdb.retention.time=200h — keep metrics for 8+ days (for weekly analysis)
  • extra_hosts: host.docker.internal — allows Prometheus to reach the app on the host
  • Volumes for data persistence
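The GF_SECURITY_ADMIN_PASSWORD variable is read from a .env file next to docker-compose.yml; the value below is just a placeholder:

# .env
GRAFANA_PASSWORD=change-me-in-production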

Prometheus Configuration

# monitoring/prometheus.yml
global:
  scrape_interval: 15s      # How often to collect metrics
  evaluation_interval: 15s  # How often to check alerts

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'webapp'
    static_configs:
      - targets: ['host.docker.internal:8000']  # Your app port
    scrape_interval: 5s  # More frequent for web apps
    metrics_path: /metrics

Important: scrape_interval: 5s for web apps is a balance between data freshness and system load. In production, typically 15-30s.

Grafana Datasource Provisioning

To avoid manual Prometheus setup in Grafana, use provisioning:

# monitoring/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true

Now Grafana automatically connects to Prometheus on startup.

docker-compose up -d 
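Dashboards can be provisioned the same way, so they live in Git instead of only in the Grafana UI. A minimal sketch (the dashboards path is an assumption and needs a matching volume mount for your JSON files):

# monitoring/grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1

providers:
  - name: 'Peakline dashboards'
    orgId: 1
    type: file
    disableDeletion: false
    options:
      path: /var/lib/grafana/dashboards  # Mount your dashboard JSON files here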

Level 1: HTTP Metrics

The most basic but critically important layer — monitoring HTTP requests. Middleware automatically collects metrics for all HTTP requests.

Metrics Initialization

# webapp/main.py
from prometheus_client import Counter, Histogram, CollectorRegistry, generate_latest, CONTENT_TYPE_LATEST
from fastapi import FastAPI, Request
from fastapi.responses import PlainTextResponse
import time

app = FastAPI(title="Peakline", version="2.0.0")

# Create separate registry for metrics isolation
registry = CollectorRegistry()

# Counter: monotonically increasing value (request count)
http_requests_total = Counter(
    'http_requests_total',
    'Total number of HTTP requests',
    ['method', 'endpoint', 'status_code'],  # Labels for grouping
    registry=registry
)

# Histogram: distribution of values (execution time)
http_request_duration_seconds = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['method', 'endpoint'],
    registry=registry
)

# API call counters
api_calls_total = Counter(
    'api_calls_total',
    'Total number of API calls by type',
    ['api_type'],
    registry=registry
)

# Separate error counters
http_errors_4xx_total = Counter(
    'http_errors_4xx_total',
    'Total number of 4xx HTTP errors',
    ['endpoint', 'status_code'],
    registry=registry
)

http_errors_5xx_total = Counter(
    'http_errors_5xx_total',
    'Total number of 5xx HTTP errors',
    ['endpoint', 'status_code'],
    registry=registry
)

Middleware for Automatic Collection

The magic happens in middleware — it wraps every request:

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start_time = time.time()

    # Execute request
    response = await call_next(request)
    duration = time.time() - start_time

    # Path normalization: /api/activities/12345 → /api/activities/{id}
    path = request.url.path
    if path.startswith('/api/'):
        parts = path.split('/')
        if len(parts) > 3 and parts[3].isdigit():
            parts[3] = '{id}'
            path = '/'.join(parts)

    # Record metrics
    http_requests_total.labels(
        method=request.method,
        endpoint=path,
        status_code=str(response.status_code)
    ).inc()

    http_request_duration_seconds.labels(
        method=request.method,
        endpoint=path
    ).observe(duration)

    # Track API calls
    if path.startswith('/api/'):
        api_type = path.split('/')[2] if len(path.split('/')) > 2 else 'unknown'
        api_calls_total.labels(api_type=api_type).inc()

    # Track errors separately
    status_code = response.status_code
    if 400 <= status_code < 500:
        http_errors_4xx_total.labels(endpoint=path, status_code=str(status_code)).inc()
    elif status_code >= 500:
        http_errors_5xx_total.labels(endpoint=path, status_code=str(status_code)).inc()

    return response

Key Techniques:

  1. Path normalization — critically important! Without this, you'll get thousands of unique metrics for /api/activities/1, /api/activities/2, etc.

  2. Labels — allow filtering and grouping metrics in PromQL

  3. Separate error counters — simplifies alert writing

Metrics Endpoint

@app.get("/metrics") async def metrics(): """Prometheus metrics endpoint""" return PlainTextResponse( generate_latest(registry), media_type=CONTENT_TYPE_LATEST ) 
Enter fullscreen mode Exit fullscreen mode

Now Prometheus can collect metrics from http://localhost:8000/metrics.
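A quick sanity check from the host, assuming the app listens on port 8000 locally:

curl -s http://localhost:8000/metrics | grep http_requests_total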

What We Get in Prometheus

# Metrics format in /metrics endpoint:
http_requests_total{method="GET",endpoint="/api/activities",status_code="200"} 1543
http_requests_total{method="POST",endpoint="/api/activities",status_code="201"} 89
http_request_duration_seconds_bucket{method="GET",endpoint="/api/activities",le="0.1"} 1234

Level 2: External API Metrics

Web applications often integrate with external APIs (Stripe, AWS, etc.). It's important to track not only your own requests but also dependencies.

External API Metrics

# External API metrics
external_api_calls_total = Counter(
    'external_api_calls_total',
    'Total number of external API calls by endpoint type',
    ['endpoint_type'],
    registry=registry
)

external_api_errors_total = Counter(
    'external_api_errors_total',
    'Total number of external API errors by endpoint type',
    ['endpoint_type'],
    registry=registry
)

external_api_latency_seconds = Histogram(
    'external_api_latency_seconds',
    'External API call latency in seconds',
    ['endpoint_type'],
    registry=registry
)

API Call Tracking Helper

Instead of duplicating code everywhere you call the API, create a universal wrapper:

async def track_external_api_call(endpoint_type: str, api_call_func, *args, **kwargs):
    """
    Universal wrapper for tracking API calls

    Usage:
        result = await track_external_api_call(
            'athlete_activities',
            client.get_athlete_activities,
            athlete_id=123
        )
    """
    start_time = time.time()

    try:
        # Increment call counter
        external_api_calls_total.labels(endpoint_type=endpoint_type).inc()

        # Execute API call
        result = await api_call_func(*args, **kwargs)

        # Record latency
        duration = time.time() - start_time
        external_api_latency_seconds.labels(endpoint_type=endpoint_type).observe(duration)

        # Check for API errors (status >= 400)
        if isinstance(result, Exception) or (hasattr(result, 'status') and result.status >= 400):
            external_api_errors_total.labels(endpoint_type=endpoint_type).inc()

        return result
    except Exception as e:
        # Record latency and error
        duration = time.time() - start_time
        external_api_latency_seconds.labels(endpoint_type=endpoint_type).observe(duration)
        external_api_errors_total.labels(endpoint_type=endpoint_type).inc()
        raise e

Usage in Code

@app.get("/api/activities") async def get_activities(athlete_id: int): # Instead of direct API call:  # activities = await external_client.get_athlete_activities(athlete_id)  # Use wrapper with tracking:  activities = await track_external_api_call( 'athlete_activities', external_client.get_athlete_activities, athlete_id=athlete_id ) return activities 
Enter fullscreen mode Exit fullscreen mode

Now we can see:

  • How many calls to each external API endpoint
  • How many returned errors
  • Latency for each call type
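If you prefer not to pass the client function around explicitly, the same tracking logic can live in a decorator. This is a sketch, not part of the Peakline codebase, and it only counts errors raised as exceptions:

import functools
import time

def track_external_api(endpoint_type: str):
    """Decorator variant of track_external_api_call."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            external_api_calls_total.labels(endpoint_type=endpoint_type).inc()
            start_time = time.time()
            try:
                return await func(*args, **kwargs)
            except Exception:
                external_api_errors_total.labels(endpoint_type=endpoint_type).inc()
                raise
            finally:
                duration = time.time() - start_time
                external_api_latency_seconds.labels(endpoint_type=endpoint_type).observe(duration)
        return wrapper
    return decorator

# Usage (hypothetical client method):
# @track_external_api('athlete_activities')
# async def get_athlete_activities(athlete_id: int): ...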

Level 3: Business Metrics

This is the most valuable part of monitoring — metrics that reflect actual application usage.

Business Metrics Types

# === Authentication ===
user_logins_total = Counter(
    'user_logins_total',
    'Total number of user logins',
    registry=registry
)

user_registrations_total = Counter(
    'user_registrations_total',
    'Total number of new user registrations',
    registry=registry
)

user_deletions_total = Counter(
    'user_deletions_total',
    'Total number of user deletions',
    registry=registry
)

# === File Operations ===
fit_downloads_total = Counter(
    'fit_downloads_total',
    'Total number of FIT file downloads',
    registry=registry
)

gpx_downloads_total = Counter(
    'gpx_downloads_total',
    'Total number of GPX file downloads',
    registry=registry
)

gpx_uploads_total = Counter(
    'gpx_uploads_total',
    'Total number of GPX file uploads',
    registry=registry
)

# === User Actions ===
settings_updates_total = Counter(
    'settings_updates_total',
    'Total number of user settings updates',
    registry=registry
)

feature_requests_total = Counter(
    'feature_requests_total',
    'Total number of feature requests',
    registry=registry
)

feature_votes_total = Counter(
    'feature_votes_total',
    'Total number of votes for features',
    registry=registry
)

# === Reports ===
manual_reports_total = Counter(
    'manual_reports_total',
    'Total number of manually created reports',
    registry=registry
)

auto_reports_total = Counter(
    'auto_reports_total',
    'Total number of automatically created reports',
    registry=registry
)

failed_reports_total = Counter(
    'failed_reports_total',
    'Total number of failed report creation attempts',
    registry=registry
)

Incrementing in Code

@app.post("/api/auth/login") async def login(credentials: LoginCredentials): user = await authenticate_user(credentials) if user: # Increment successful login counter  user_logins_total.inc() return {"token": generate_token(user)} return {"error": "Invalid credentials"} @app.post("/api/activities/report") async def create_report(activity_id: int, is_auto: bool = False): try: report = await generate_activity_report(activity_id) # Different counters for manual and automatic reports  if is_auto: auto_reports_total.inc() else: manual_reports_total.inc() return report except Exception as e: failed_reports_total.inc() raise e 
Enter fullscreen mode Exit fullscreen mode

Level 4: Performance and Caching

Cache Metrics

Caching is a critical part of performance, so you need to track the hit rate:

cache_hits_total = Counter(
    'cache_hits_total',
    'Total number of cache hits',
    ['cache_type'],
    registry=registry
)

cache_misses_total = Counter(
    'cache_misses_total',
    'Total number of cache misses',
    ['cache_type'],
    registry=registry
)

# In caching code:
async def get_from_cache(key: str, cache_type: str = 'generic'):
    value = await cache.get(key)
    if value is not None:
        cache_hits_total.labels(cache_type=cache_type).inc()
        return value
    else:
        cache_misses_total.labels(cache_type=cache_type).inc()
        return None

Background Task Metrics

If you have background tasks (Celery, APScheduler), track them:

background_task_duration_seconds = Histogram(
    'background_task_duration_seconds',
    'Background task execution time',
    ['task_type'],
    registry=registry
)

async def run_background_task(task_type: str, task_func, *args, **kwargs):
    start_time = time.time()
    try:
        result = await task_func(*args, **kwargs)
        return result
    finally:
        duration = time.time() - start_time
        background_task_duration_seconds.labels(task_type=task_type).observe(duration)
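Calling it follows the same pattern as the external API wrapper; the task name and coroutine below are hypothetical:

async def refresh_weather(activity_id: int):
    # Wrap the actual work so its duration lands in the histogram
    # (fetch_weather_for_activity is a hypothetical coroutine)
    return await run_background_task(
        'weather_sync', fetch_weather_for_activity, activity_id=activity_id
    )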

PromQL: Metrics Query Language

Prometheus uses its own query language — PromQL. It isn't SQL, but it's very powerful.

Basic Queries

# 1. Just get the metric (instant vector)
http_requests_total

# 2. Filter by labels
http_requests_total{method="GET"}
http_requests_total{status_code="200"}
http_requests_total{method="GET", endpoint="/api/activities"}

# 3. Regular expressions in labels
http_requests_total{status_code=~"5.."}   # All 5xx errors
http_requests_total{endpoint=~"/api/.*"}  # All API endpoints

# 4. Time interval (range vector)
http_requests_total[5m]  # Data for the last 5 minutes

Rate and irate: Rate of Change

A Counter only ever grows, but what we usually want is its rate of change — RPS (requests per second):

# rate - average rate over the interval
rate(http_requests_total[5m])

# irate - instantaneous rate (between the last two points)
irate(http_requests_total[5m])

When to use what:

  • rate() — for alerts and trend graphs (smooths spikes)
  • irate() — for detailed analysis (shows peaks)

Aggregation with sum, avg, max

# Total app RPS
sum(rate(http_requests_total[5m]))

# RPS by method
sum(rate(http_requests_total[5m])) by (method)

# RPS by endpoint, sorted
sort_desc(sum(rate(http_requests_total[5m])) by (endpoint))

# Average latency
avg(rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m]))

Histogram and Percentiles

For Histogram metrics (latency, duration) use histogram_quantile:

# P50 (median) latency
histogram_quantile(0.5, rate(http_request_duration_seconds_bucket[5m]))

# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# P99 latency (99% of requests are faster than this)
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# P95 per endpoint (keep the le label in the aggregation)
histogram_quantile(0.95, sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m])))

Complex Queries

1. Success Rate (percentage of successful requests)

(
  sum(rate(http_requests_total{status_code=~"2.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) * 100

2. Error Rate (percentage of errors)

(
  sum(rate(http_requests_total{status_code=~"4..|5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) * 100

3. Cache Hit Rate

(
  sum(rate(cache_hits_total[5m]))
  /
  (sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m])))
) * 100

4. Top-5 Slowest Endpoints

topk(5,
  histogram_quantile(0.95,
    sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m]))
  )
)

5. API Health Score (0-100)

(
  (
    sum(rate(external_api_calls_total[5m]))
    -
    sum(rate(external_api_errors_total[5m]))
  )
  /
  sum(rate(external_api_calls_total[5m]))
) * 100

Grafana Dashboards: Visualization

Now the fun part — turning raw metrics into beautiful and informative dashboards.

HTTP and Performance

Dashboard 1: HTTP & Performance

Panel 1: Request Rate

sum(rate(http_requests_total[5m])) 
  • Type: Time series
  • Color: Blue gradient
  • Unit: requests/sec
  • Legend: Total RPS

Panel 2: Success Rate

(
  sum(rate(http_requests_total{status_code=~"2.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) * 100
  • Type: Stat
  • Color: Green if > 95%, yellow if > 90%, red if < 90%
  • Unit: percent (0-100)
  • Value: Current (last)

Panel 3: Response Time (P50, P95, P99)

# P50
histogram_quantile(0.5, rate(http_request_duration_seconds_bucket[5m]))

# P95
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# P99
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
  • Type: Time series
  • Unit: seconds (s)
  • Legend: P50, P95, P99

Panel 4: Errors by Type

sum(rate(http_requests_total{status_code=~"4.."}[5m])) by (status_code)
sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (status_code)
  • Type: Bar chart
  • Colors: Yellow (4xx), Red (5xx)

Panel 5: Request Rate by Endpoint

sort_desc(sum(rate(http_requests_total[5m])) by (endpoint)) 
  • Type: Bar chart
  • Limit: Top 10

Dashboard 2: Business Metrics

This dashboard shows real product usage — what users do and how often.

Authentication and Users

Panel 1: User Activity (24h)

# Logins
increase(user_logins_total[24h])

# Registrations
increase(user_registrations_total[24h])

# Deletions
increase(user_deletions_total[24h])
  • Type: Stat
  • Layout: Horizontal

Panel 2: Downloads by Type

# One query per download counter (rate() drops the metric name,
# so grouping a rate by __name__ would collapse everything into one series)
sum(rate(fit_downloads_total[5m]))
sum(rate(gpx_downloads_total[5m]))
  • Type: Pie chart
  • Legend: Right side

Panel 3: Feature Usage Timeline

rate(gpx_fixer_usage_total[5m])
rate(search_usage_total[5m])
rate(manual_reports_total[5m])
  • Type: Time series
  • Stacking: Normal

Dashboard 3: External API

It's critical to monitor dependencies on external services — they can become bottlenecks.

External API

Panel 1: API Health Score

(
  sum(rate(external_api_calls_total[5m]))
  -
  sum(rate(external_api_errors_total[5m]))
)
/
sum(rate(external_api_calls_total[5m]))
* 100
  • Type: Gauge
  • Min: 0, Max: 100
  • Thresholds: 95 (green), 90 (yellow), 0 (red)

Panel 2: API Latency by Endpoint

histogram_quantile(0.95,
  sum by (le, endpoint_type) (rate(external_api_latency_seconds_bucket[5m]))
)
  • Type: Bar chart
  • Sort: Descending

Panel 3: Error Rate by Endpoint

sum(rate(external_api_errors_total[5m])) by (endpoint_type) 
  • Type: Bar chart
  • Color: Red

Variables: Dynamic Dashboards

Grafana supports variables for interactive dashboards:

Creating a Variable

  1. Dashboard Settings → Variables → Add variable
  2. Name: endpoint
  3. Type: Query
  4. Query:
label_values(http_requests_total, endpoint) 

Using in Panels

# Filter by selected endpoint
sum(rate(http_requests_total{endpoint="$endpoint"}[5m]))

# Multi-select
sum(rate(http_requests_total{endpoint=~"$endpoint"}[5m])) by (endpoint)

Useful Variables

# Time interval
Variable: interval
Type: Interval
Values: 1m,5m,10m,30m,1h

# HTTP method
Variable: method
Query: label_values(http_requests_total, method)

# Status code
Variable: status_code
Query: label_values(http_requests_total, status_code)
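The interval variable can then be dropped straight into the range selector of any panel query (standard Grafana templating; the built-in $__interval variable works here too):

sum(rate(http_requests_total[$interval]))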

Alerting: System Reactivity

Monitoring without alerts is like a car without brakes. Let's set up smart alerts.

Grafana Alerting

Alert 1: High Error Rate

(
  sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) * 100 > 1
  • Condition: > 1 (more than 1% errors)
  • For: 5m (for 5 minutes)
  • Severity: Critical
  • Notification: Slack, Email, Telegram

Alert 2: High Latency

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2 
  • Condition: P95 > 2 seconds
  • For: 10m
  • Severity: Warning

Alert 3: External API Down

sum(rate(external_api_errors_total[5m])) / sum(rate(external_api_calls_total[5m])) > 0.5 
  • Condition: More than 50% API errors
  • For: 2m
  • Severity: Critical

Alert 4: No Data

absent_over_time(http_requests_total[10m]) 
  • Condition: No metrics for 10 minutes
  • Severity: Critical
  • Means: the app crashed or Prometheus can't scrape metrics
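These alerts live in Grafana Alerting; if you also want them version-controlled next to prometheus.yml, the first two can be expressed as Prometheus alerting rules. A sketch (the file path and thresholds are assumptions, and an Alertmanager is needed to route notifications from Prometheus-side rules):

# monitoring/alerts.yml (reference it from prometheus.yml via:
#   rule_files:
#     - /etc/prometheus/alerts.yml)
groups:
  - name: webapp-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) * 100 > 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "5xx error rate is above 1% for 5 minutes"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency is above 2s for 10 minutes"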

Best Practices: Battle-Tested Experience

1. Labels: Don't Overdo It

Bad:

# Too detailed labels = cardinality explosion
http_requests_total.labels(
    method=request.method,
    endpoint=request.url.path,    # Every unique URL!
    user_id=str(user.id),         # Thousands of users!
    timestamp=str(time.time())    # Infinite values!
).inc()

Good:

# Normalized endpoints + limited label set
http_requests_total.labels(
    method=request.method,
    endpoint=normalize_path(request.url.path),  # /api/users/{id}
    status_code=str(response.status_code)
).inc()

Rule: High-cardinality data (user_id, timestamps, unique IDs) should NOT be labels.
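The normalize_path helper above is only referenced, not shown; here is a minimal sketch of what it might look like (the regex patterns are assumptions about your URL scheme):

import re

UUID_RE = r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"

def normalize_path(path: str) -> str:
    """Collapse high-cardinality URL segments into placeholders."""
    # /api/users/550e8400-e29b-... -> /api/users/{uuid}
    path = re.sub(f"/{UUID_RE}", "/{uuid}", path)
    # /api/activities/12345 -> /api/activities/{id}
    path = re.sub(r"/\d+", "/{id}", path)
    return path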

2. Naming Convention

Follow Prometheus naming conventions:

# Good names:
http_requests_total            # <namespace>_<name>_<unit>
external_api_latency_seconds   # Unit in the name
cache_hits_total               # Clearly a Counter

# Bad names:
RequestCount                   # Don't use CamelCase
api-latency                    # Don't use dashes
request_time                   # Unit not specified

3. Rate() Interval

The rate() interval should be at least 4x the scrape_interval:

# If scrape_interval = 15s
rate(http_requests_total[1m])    # 4x = 60s ✅
rate(http_requests_total[30s])   # 2x = poor accuracy ❌

4. Histogram Buckets

Proper buckets are critical for accurate percentiles:

# Default buckets (bad for latency):
Histogram('latency_seconds', 'Latency')
# [.005, .01, .025, .05, .1, ...]

# Custom buckets for web latency:
Histogram(
    'http_request_duration_seconds',
    'Request latency',
    buckets=[.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)

Principle: Buckets should cover the typical range of values.

5. Metrics Cost

Every metric costs memory. Let's calculate:

Memory = Series count × (~3KB per series)
Series = Metric × Label combinations

Example:

# 1 metric × 5 methods × 20 endpoints × 15 status codes = 1,500 series
http_requests_total{method, endpoint, status_code}

# 1,500 × 3KB = ~4.5MB for one metric!

Tip: Regularly check cardinality:

# Top metrics by cardinality
topk(10, count by (__name__)({__name__=~".+"}))

Production Checklist

Before launching in production, check:

  • [ ] Retention policy configured (storage.tsdb.retention.time)
  • [ ] Disk space monitored (Prometheus can take a lot of space)
  • [ ] Backups configured for Grafana dashboards
  • [ ] Alerts tested (create artificial error)
  • [ ] Notification channels work (send test alert)
  • [ ] Access control configured (don't leave Grafana with admin/admin!)
  • [ ] HTTPS configured for Grafana (via nginx reverse proxy)
  • [ ] Cardinality checked (topk(10, count by (__name__)({__name__=~".+"})))
  • [ ] Documentation created (what metric is responsible for what)
  • [ ] On-call process defined (who gets alerts and what to do)

Real Case: Finding a Problem

Imagine users complaining about slow performance. Here's how monitoring helps find and fix the problem in minutes.

Step 1: Open Grafana → HTTP Performance Dashboard

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) 

We see: P95 latency jumped from 0.2s to 3s.

Step 2: Check latency by endpoint

topk(5,
  histogram_quantile(0.95,
    sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m]))
  )
)

Found: /api/activities — 5 seconds!

Step 3: Check external APIs

histogram_quantile(0.95,
  sum by (le, endpoint_type) (rate(external_api_latency_seconds_bucket[5m]))
)

External API athlete_activities — 4.8 seconds. There's the problem!

Step 4: Check error rate

rate(external_api_errors_total{endpoint_type="athlete_activities"}[5m]) 

No errors, just slow. So the problem isn't on our side — the external service is lagging.

Solution (a code sketch follows the list):

  • Add aggressive caching for external API (TTL 5 minutes)
  • Set up alert for latency > 2s
  • Add timeout to requests
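Here's a minimal sketch of what that fix might look like, combining the get_from_cache helper and track_external_api_call from earlier (the cache.set signature and the 2-second timeout are assumptions):

import asyncio

CACHE_TTL = 300  # 5 minutes

async def get_activities_with_cache(athlete_id: int):
    cache_key = f"activities:{athlete_id}"

    # 1. Try the cache first (records cache hits/misses via get_from_cache)
    cached = await get_from_cache(cache_key, cache_type='external_api')
    if cached is not None:
        return cached

    # 2. Fall back to the external API, with a hard timeout
    activities = await asyncio.wait_for(
        track_external_api_call(
            'athlete_activities',
            external_client.get_athlete_activities,
            athlete_id=athlete_id
        ),
        timeout=2.0
    )

    # 3. Store the result (assumed cache API with TTL support)
    await cache.set(cache_key, activities, ttl=CACHE_TTL)
    return activities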

Step 5: After deploy, verify

# Cache hit rate
(cache_hits_total / (cache_hits_total + cache_misses_total)) * 100

Hit rate 85% → latency dropped to 0.3s. Victory! 🎉

What's Next?

You've built a production-ready monitoring system. But this is just the beginning:

Next Steps:

  1. Distributed Tracing — add Jaeger/Tempo for request tracing
  2. Logging — integrate Loki for centralized logs
  3. Custom Dashboards — create dashboards for business (not just DevOps)
  4. SLO/SLI — define Service Level Objectives
  5. Anomaly Detection — use machine learning for anomaly detection
  6. Cost Monitoring — add cost metrics (AWS CloudWatch, etc.)


Conclusion

A monitoring system isn't "set it and forget it." It's a living organism that needs to evolve with your application. But the basic architecture we've built scales from startup to enterprise.

Key Takeaways:

  1. Three metric levels: HTTP (infrastructure) → API (dependencies) → Business (product)
  2. Middleware automates basic metrics collection
  3. PromQL is powerful — learn gradually
  4. Labels matter — but don't overdo cardinality
  5. Alerts are critical — monitoring without alerts is useless
  6. Document — in six months you'll forget what foo_bar_total means

Monitoring is a culture, not a tool. Start simple, iterate, improve. And your application will run stably, while you sleep peacefully 😴


About Peakline

This monitoring system was built for Peakline — a web application for Strava activity analysis. Peakline provides athletes with:

  • Detailed segment analysis with interactive maps
  • Historical weather data for every activity
  • Advanced FIT file generation for virtual races
  • Automatic GPX track error correction
  • Route planner

All these features require reliable monitoring to ensure quality user experience.


Questions? Leave them in the comments!

P.S. If you found this helpful — share with colleagues who might benefit!


About the Author

Solo developer building Peakline — tools for athletes. An athlete and enthusiast myself, I believe in automation, observability, and quality code. I'm continuing to develop the project and share my experience with the community in 2025.


Connect

  • 🌐 Peakline Website
  • 💬 Share your monitoring setup in comments
  • 📧 Questions? Drop a comment below!

Tags: #prometheus #grafana #monitoring #python #fastapi #devops #observability #sre #metrics #production
