Hey friends, and welcome to the next Adventure of Blink! If you've been following along, we've done a ton of cool stuff this season:
- We learned about Docker
- We configured a MongoDB (in a Docker container with persistent storage)
- We made a Flask API for the database (also in a Docker container)
- We've explored Test-Driven Development practices with PyTest
- We made our tests run every time we commit to our repository using GitHub Actions
- We created a graphical interface for our program using Tkinter
- We scanned our project for security vulnerabilities using Snyk
...suffice it to say, we've been really busy. But we're not through yet! Today we're covering another oft-overlooked topic:
Observability!
Observability is a core component of the DevOps mindset... because it's a place where Dev and Ops can easily interact. Ops is usually on the receiving end of support tickets and user complaints... but it's hard to diagnose something like "the application is slow!" without firm evidence of what happened. But if your developers aren't considering metrics and observability behavior when they're coding, you're not going to have those metrics for Ops to confirm a user's complaint.
TL/DR: if you'd rather watch than read, this episode is also on YouTube.
When to apply metrics
The answer to this is... as early as possible! Build metrics into your code while you're writing it, and become accustomed to using them throughout the development process.
Why are they important?
Using metrics in the development process ensures that you understand from the beginning how the application behaves. You'll want to consider things like load testing as you complete your work, ensuring that you see how your code behaves when there are lots of users running it at once.
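If you want to see your metrics move under load, you don't need a heavyweight tool to get started. Here's a minimal sketch (not part of the actual project - the URL and request count are assumptions, so adjust them to your setup) that fires concurrent requests at the API using Python's standard library plus the requests package:

```python
# A quick-and-dirty load generator -- a minimal sketch, assuming the
# Flask API from this series is reachable on localhost:5001.
# requests isn't in the standard library, so `pip install requests` first.
import concurrent.futures
import requests

URL = "http://localhost:5001/getall"  # hypothetical -- point at your own endpoint

def hit(_):
    try:
        return requests.get(URL, timeout=5).status_code
    except requests.RequestException as e:
        return type(e).__name__

# Fire 200 requests across 20 worker threads and tally the outcomes
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(hit, range(200)))

print({outcome: results.count(outcome) for outcome in set(results)})
```

Run something like this while watching your dashboards and you'll see immediately whether latency degrades as concurrency climbs.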
How metrics are created
We're going to introduce two products to our application environment: Prometheus and Grafana.
Prometheus is a collection mechanism for metrics that we establish in our code. It runs in a Docker container as part of our environment and periodically scrapes a metrics endpoint that our code exposes... yes, that means we have some code changes to make, but it should be pretty easy work. Grafana, in turn, connects to Prometheus and gives us dashboards to visualize what we've collected.
What metrics are important to us?
In our Hangman game, there's not a lot of processing going on. As a result, adding metrics for the performance of the application itself? Probably not all that useful.
A place where metrics would be useful would be around the API calls. If anything's going to malfunction, it's going to be the data extraction code... after all, that's the place where multiple containers get involved and where data has to flow from one system to another seamlessly. So we'll add our metrics instrumentation to the API code.
Setting up the tools
First, let's add the Prometheus and Grafana containers to our docker-compose.yml file:
```yaml
  prometheus:
    image: prom/prometheus:latest
    volumes:
      # This prometheus.yml file we will create shortly 😉
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  mongo-exporter:
    image: bitnami/mongodb-exporter:latest
    environment:
      # Note the name of our mongo container here
      MONGODB_URI: "mongodb://mongo:27017"
    depends_on:
      - mongo

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
```
Next, let's build the prometheus.yml file that establishes the configuration for Prometheus:
```yaml
global:
  scrape_interval: 15s

scrape_configs:
  # We'll have to create a /metrics endpoint in the API...
  - job_name: 'flask-api'
    metrics_path: '/metrics'    # Endpoint from Flask app
    static_configs:
      # This target is our api container and port
      - targets: ['hangman-api:5001']

  - job_name: 'mongo'
    metrics_path: '/metrics'    # Endpoint for the MongoDB Exporter
    static_configs:
      - targets: ['mongo-exporter:9216']   # Default port for MongoDB Exporter
```
That leads us to make our code changes in the API:
```python
from flask import Flask, jsonify, request
# Adding in prometheus_client to help us build the metrics additions
from prometheus_client import generate_latest, Counter, Histogram
from pymongo import MongoClient
from pymongo.errors import PyMongoError
from bson.objectid import ObjectId
from datetime import datetime
import os

app = Flask(__name__)

# MongoDB connection
mongo_uri = os.getenv("MONGO_URI_API")
db_name = os.getenv("DB_NAME")
collection_name = os.getenv("COLLECTION_NAME")

# When testing locally, we bypass the .env and load the variables manually
# mongo_uri = "mongodb://blink:theadventuresofblink@localhost:27017/hangman?authSource=admin"
# db_name = "hangman"
# collection_name = "phrases"

client = MongoClient(mongo_uri)
db = client[db_name]
collection = db[collection_name]

# Here's where we set up the metrics objects we're going to need:
REQUEST_COUNT = Counter('flask_app_requests_total', 'Total number of requests to the app')
REQUEST_LATENCY = Histogram('flask_app_request_latency_seconds', 'Latency of requests to the app')

# This route is used by prometheus to extract the metrics.
# generate_latest() is a library method that knows how to gather
# all prometheus_client objects and render them for the scraper
# to pick up.
@app.route('/metrics')
def metrics():
    return generate_latest()

@app.route('/getall', methods=['GET'])
def get_all_items():
    # This is an example of how to instrument a method.
    # Notice we increment the request count, and then
    # the request latency is measured by putting the entire
    # method's code inside a with statement that captures
    # its timing
    REQUEST_COUNT.inc()
    with REQUEST_LATENCY.time():
        try:
            # Find all records in the collection
            words = list(collection.find({}, {"_id": 0}))  # Exclude _id field from the response
            return jsonify(words), 200
        except Exception as e:
            return jsonify({"error": str(e)}), 500
```
For brevity's sake I didn't include the rest of the API code... but each route needs to be instrumented individually. You can add more metrics if you'd like to observe different behaviors separately.
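For example, if you wanted to break traffic out per route rather than lumping everything into one global counter, a labeled Counter is one way to do it. This is a sketch, not code from the project - the /health route is hypothetical, and app/jsonify come from the API file above:

```python
# Sketch: one counter, one time series per route via a label
from prometheus_client import Counter

REQUESTS_BY_ROUTE = Counter(
    'flask_app_requests_by_route_total',
    'Requests broken out by route',
    ['route']  # each distinct label value becomes its own series
)

@app.route('/health')  # hypothetical route for illustration
def health():
    REQUESTS_BY_ROUTE.labels(route='/health').inc()
    return jsonify({"status": "ok"}), 200
```

Labels keep your metric namespace tidy, but use them sparingly: every distinct label value creates a new time series for Prometheus to store.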
Another note: make sure you add prometheus_client to the API's requirements.txt!
```
blinker==1.8.2
click==8.1.7
dnspython==2.7.0
Flask==3.0.3
itsdangerous==2.2.0
Jinja2==3.1.4
MarkupSafe==3.0.2
prometheus_client==0.21.0
pymongo==4.10.1
Werkzeug==3.1.1
```
Validating that it all works
Now that we've finished setup, we can start up our application:
```bash
# Windows/Unix
docker-compose up --build

# Mac
docker compose up --build
```
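Before heading to the dashboards, it's worth a quick sanity check that the API is actually exposing metrics. Assuming the API's port 5001 is published to the host (matching the target in our prometheus.yml), you should get back plain-text Prometheus exposition output:

```bash
# Look for our flask_app_* metrics among the output
curl http://localhost:5001/metrics
```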
We can see our new containers on ports 9090 (Prometheus) and 3000 (Grafana). Let's start in Prometheus, and set up some metrics queries:
...
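The exact queries you build are up to you, but as a starting point, here are a few PromQL expressions that work against the metric names we defined above (a Histogram automatically publishes _bucket, _sum, and _count series):

```
# Requests per second, averaged over the last 5 minutes
rate(flask_app_requests_total[5m])

# 95th-percentile request latency
histogram_quantile(0.95, rate(flask_app_request_latency_seconds_bucket[5m]))

# Average request latency over the last 5 minutes
rate(flask_app_request_latency_seconds_sum[5m])
  / rate(flask_app_request_latency_seconds_count[5m])
```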
Then once we've got them created, we can head over to Grafana to visualize them:
...
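In Grafana, the first step is adding Prometheus as a data source - point it at http://prometheus:9090, the container name from our docker-compose.yml. You can click through the UI, or, if you'd rather keep it in config like everything else in this project, Grafana can provision data sources from a file mounted into the container at /etc/grafana/provisioning/datasources/. A minimal sketch (the file name is your choice):

```yaml
# datasource.yml -- a minimal sketch of Grafana datasource provisioning
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    # "prometheus" resolves over the Docker network to our Prometheus container
    url: http://prometheus:9090
    isDefault: true
```

From there, build a dashboard panel per query - requests per second and p95 latency make a good first pair.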
Wrapping up
These examples are small and somewhat contrived, in keeping with our theme of exploring these concepts in a small app so we can see how they work without the distraction of scale and complexity. But hopefully you can see from even these examples how much power you have as a developer to see what's happening within your code! This may seem like a lot of work for something that doesn't actually make our game any better or more interesting to play, but the value of metrics is in being able to diagnose more easily when something isn't right.
I hope you've learned a lot this week! We are nearly to the end of Season 2, and I'll tell ya what... it has been such a ride. Our season finale is going to be the long-awaited AI integration - we're going to let a Large Language Model build hangman games for us to play! So tune in next week for another Adventure of Blink!