
Ashok Nagaraj


Cracking the Code Monitoring Puzzle with Autometrics

In the realm of system monitoring and observability, logs present two prominent challenges: comprehensibility and cost-efficiency. When a system is operational, sifting through the abundance of collected logs to locate pertinent information is arduous: the sheer volume generates so much noise that meaningful signals are hard to discern. Numerous vendors offer services to extract valuable insights from log data, but processing and storing logs come with a significant price tag.

As an alternative, metrics emerge as a crucial component in the observability toolkit that many teams resort to. By focusing on quantitative measurements, the advantage of metrics lies in their lower cost compared to dealing with copious amounts of textual data. Additionally, metrics offer a concise and comprehensive understanding of system performance.

Nevertheless, utilizing metrics poses its own set of challenges for developers, who must navigate the following obstacles:

  1. Determining which metrics to track and selecting the appropriate metric type, such as counters or histograms, often presents a dilemma.
  2. Grappling with the intricacies of composing queries in PromQL or similar query languages to obtain the desired data can be a daunting task.
  3. Ensuring that the retrieved data effectively addresses the intended question requires meticulous verification.

These hurdles underscore the difficulties developers encounter when leveraging metrics for system monitoring and analysis.
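To make the second hurdle concrete: even a simple question like "what is this endpoint's error ratio?" requires the developer to choose a metric type, name it, label it, and then write the PromQL by hand. A minimal sketch (the metric name `http_requests_total` and its labels are hypothetical, standing in for whatever you defined yourself):

```python
def error_ratio_query(handler: str, window: str = "5m") -> str:
    """Build the PromQL for one endpoint's error ratio.

    Every piece of this -- metric name, label names, rate window --
    is a decision the developer must get right by hand.
    """
    total = f'rate(http_requests_total{{handler="{handler}"}}[{window}])'
    errors = f'rate(http_requests_total{{handler="{handler}",status="5xx"}}[{window}])'
    return f"sum({errors}) / sum({total})"

print(error_ratio_query("/roll_dice"))
```

Autometrics removes this boilerplate by standardizing the metric names and labels, so queries like this can be generated for you.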


Autometrics offers a simplified approach to enhance code-level observability by implementing decorators and wrappers at the function level. This framework employs standardized metric names and a consistent tagging or labeling scheme for metrics.

Currently, Autometrics utilizes three essential metrics:

  1. function.calls.count: This counter metric effectively monitors the rate of requests and errors.
  2. function.calls.duration: A histogram metric designed to track latency, providing insights into performance.
  3. (Optional) function.calls.concurrent: This gauge metric, if utilized, enables the tracking of concurrent requests, offering valuable concurrency-related information.

Additionally, Autometrics includes the following labels automatically for all three metrics:

  • function: This label identifies the specific function being monitored.
  • module: This label specifies the module associated with the function.

Furthermore, for the function.calls.count metric, the following labels are also incorporated:

  • caller: This label denotes the caller of the function.
  • result: This label captures the outcome or result of the function.
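Conceptually, the decorator wraps each function, counts calls by outcome, and times them. The following is a simplified, dependency-free sketch of that idea (plain dictionaries stand in for the real Prometheus counter and histogram; this is not the library's actual implementation):

```python
import functools
import time
from collections import Counter

CALL_COUNT = Counter()   # stands in for function.calls.count, keyed by (function, module, result)
CALL_DURATIONS = {}      # stands in for function.calls.duration, keyed by (function, module)

def autometrics_sketch(func):
    """Illustration of what an @autometrics-style decorator records."""
    labels = (func.__name__, func.__module__)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            value = func(*args, **kwargs)
            CALL_COUNT[labels + ("ok",)] += 1      # result="ok"
            return value
        except Exception:
            CALL_COUNT[labels + ("error",)] += 1   # result="error"
            raise
        finally:
            # one histogram observation per call
            CALL_DURATIONS.setdefault(labels, []).append(time.monotonic() - start)

    return wrapper

@autometrics_sketch
def roll_dice():
    return 4

roll_dice()
roll_dice()
# CALL_COUNT now records 2 calls with result="ok" for roll_dice
```

The real library registers proper Prometheus metrics instead of dictionaries, and additionally captures the `caller` label, but the control flow is the same: success and failure paths increment the counter with different `result` values, and duration is always observed.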

Beyond these basics, Autometrics supports many other features that help build a complete metrics-based observability platform.


Demo time

Let us instrument a simple Python Flask application for rolling dice. The application runs on localhost and is exposed on port 23456.

Prerequisites
  • Install and configure prometheus
```shell
❯ cat prometheus.yml
scrape_configs:
  - job_name: my-app
    metrics_path: /metrics
    static_configs:
      # host.docker.internal refers to the localhost that will run the flask app
      - targets: ['host.docker.internal:23456']
    scrape_interval: 20s

❯ docker run \
    --name=prometheus \
    -d \
    -p 9090:9090 \
    -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
    prom/prometheus

❯ curl localhost:9090
<a href="/graph">Found</a>.
```
  • Create a sample flask application
```python
❯ cat app.py
from flask import Flask, jsonify, Response
import random

app = Flask(__name__)

@app.route('/')
def home():
    return 'Welcome to the Dice Rolling API!'

@app.route('/roll_dice')
def roll_dice():
    roll = random.randint(1, 6)
    response = {'roll': roll}
    return jsonify(response)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=23456, debug=True)
```
Instrumentation for a Flask application
  • Install the library
    pip install autometrics

  • Create a .env file pointing to the prometheus endpoint

```shell
❯ cat .env
PROMETHEUS_URL=http://localhost:9090/
```
  • Update the script to add @autometrics decorators and expose the metrics at /metrics
```python
❯ cat app.py
from flask import Flask, jsonify, Response
import random

from autometrics import autometrics
from prometheus_client import generate_latest

app = Flask(__name__)

@app.route('/')
def home():
    """ default route """
    return 'Welcome to the Dice Rolling API!'

# order of decorators is important
@app.route('/roll_dice')
@autometrics
def roll_dice():
    """ business logic """
    roll = random.randint(1, 6)
    response = {'roll': roll}
    return jsonify(response)

@app.get("/metrics")
def metrics():
    """ prometheus metrics for scraping """
    return Response(generate_latest())

@app.route('/ping')
@autometrics
def pinger():
    """ health check """
    return {'message': 'pong'}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=23456, debug=True)
```
  • Check for the auto-generated help and exposed metrics (VSCode will show it on hover)
```shell
❯ python
Python 3.11.3 (main, Apr  7 2023, 20:13:31) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import app.py
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'app.py'; 'app' is not a package
>>> import app
>>> print(app.roll_dice.__doc__)
Prometheus Query URLs for Function - roll_dice and Module - app:

Request rate URL :
http://localhost:9090/graph?g0.expr=sum%20by%20%28function%2C%20module%2C%20commit%2C%20version%29%20%28rate%20%28function_calls_count_total%7Bfunction%3D%22roll_dice%22%2Cmodule%3D%22app%22%7D%5B5m%5D%29%20%2A%20on%20%28instance%2C%20job%29%20group_left%28version%2C%20commit%29%20%28last_over_time%28build_info%5B1s%5D%29%20or%20on%20%28instance%2C%20job%29%20up%29%29&g0.tab=0

Latency URL :
http://localhost:9090/graph?g0.expr=sum%20by%20%28le%2C%20function%2C%20module%2C%20commit%2C%20version%29%20%28rate%28function_calls_duration_bucket%7Bfunction%3D%22roll_dice%22%2Cmodule%3D%22app%22%7D%5B5m%5D%29%20%2A%20on%20%28instance%2C%20job%29%20group_left%28version%2C%20commit%29%20%28last_over_time%28build_info%5B1s%5D%29%20or%20on%20%28instance%2C%20job%29%20up%29%29&g0.tab=0

Error Ratio URL :
http://localhost:9090/graph?g0.expr=sum%20by%20%28function%2C%20module%2C%20commit%2C%20version%29%20%28rate%20%28function_calls_count_total%7Bfunction%3D%22roll_dice%22%2Cmodule%3D%22app%22%2C%20result%3D%22error%22%7D%5B5m%5D%29%20%2A%20on%20%28instance%2C%20job%29%20group_left%28version%2C%20commit%29%20%28last_over_time%28build_info%5B1s%5D%29%20or%20on%20%28instance%2C%20job%29%20up%29%29%20/%20sum%20by%20%28function%2C%20module%2C%20commit%2C%20version%29%20%28rate%20%28function_calls_count_total%7Bfunction%3D%22roll_dice%22%2Cmodule%3D%22app%22%7D%5B5m%5D%29%20%2A%20on%20%28instance%2C%20job%29%20group_left%28version%2C%20commit%29%20%28last_over_time%28build_info%5B1s%5D%29%20or%20on%20%28instance%2C%20job%29%20up%29%29&g0.tab=0
-------------------------------------------
```
  • Run the application, generate some load
```shell
for i in $(seq 1 10); do
    http http://127.0.0.1:23456
    http http://127.0.0.1:23456/ping
    http http://127.0.0.1:23456/roll_dice
    sleep 1
done
```
  • Check if metrics are being exposed
```shell
❯ curl -q localhost:23456/metrics | grep function
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1471  100  1471    0     0  95687      0 --:--:-- --:--:-- --:--:--  239k
# HELP function_calls_count_total Autometrics counter for tracking function calls
# TYPE function_calls_count_total counter
function_calls_count_total{caller="dispatch_request",function="roll_dice",module="app",objective_name="",objective_percentile="",result="ok"} 30.0
```
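The docstring links shown earlier are just URL-encoded PromQL against these standardized names. As a sketch (assuming the same localhost Prometheus, and omitting the `build_info`/version join that the generated queries include), you can reconstruct a request-rate link for any function with the standard library:

```python
from urllib.parse import urlencode

PROMETHEUS_URL = "http://localhost:9090"  # matches the .env above

def request_rate_url(function: str, module: str, window: str = "5m") -> str:
    """Build a Prometheus graph URL for a function's request rate,
    mirroring (in simplified form) the links autometrics generates."""
    query = (
        "sum by (function, module) "
        f'(rate(function_calls_count_total{{function="{function}",module="{module}"}}[{window}]))'
    )
    return f"{PROMETHEUS_URL}/graph?{urlencode({'g0.expr': query, 'g0.tab': '0'})}"

print(request_rate_url("roll_dice", "app"))
```

Because every instrumented function shares the same metric name and label scheme, one helper like this works for the whole codebase.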



Note: This framework works with OpenTelemetry as well - more info
