
Sylvain Hellegouarch for Reliably


Bringing reliability closer to you with Reliably and DataDog

As engineers, we care about our users. At least, we ought to :) They depend on us and on our services running just fine. This is reliability in a nutshell.

Site Reliability Engineering, or SRE if you're feeling casual, has gained momentum as a way to codify this view on reliability. This article is not about detailing SRE; instead, it focuses on how we can use one of its tools, Service Level Objectives (SLOs for short), to signal a loss of reliability as close to engineers as we can.

Let's say we have a web application like this one below:

from starlette.applications import Starlette
from starlette.responses import JSONResponse
from starlette.routing import Route


async def homepage(request):
    return JSONResponse({'hello': 'world'})


app = Starlette(debug=True, routes=[
    Route('/', homepage),
])

Nothing fancy about it, just a Hello World example. We can run it as follows:

$ uvicorn --reload server:app 

Here server is the name of the Python module containing that code: server.py. The --reload flag lets uvicorn restart the server automatically whenever we change the code.

We can access this server as follows:

$ curl localhost:8000/ 
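The response should simply be the JSON payload returned by our homepage handler:

{"hello":"world"}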

Now, let's run a basic load against this server using hey:

$ hey -c 3 -q 10 -z 20s http://localhost:8000/

Summary:
  Total:        20.0125 secs
  Slowest:      0.0164 secs
  Fastest:      0.0020 secs
  Average:      0.0046 secs
  Requests/sec: 29.9813

  Total data:   10200 bytes
  Size/request: 17 bytes

Response time histogram:
  0.002 [1]   |
  0.003 [152] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.005 [184] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.006 [195] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.008 [53]  |■■■■■■■■■■■
  0.009 [11]  |■■
  0.011 [1]   |
  0.012 [1]   |
  0.014 [0]   |
  0.015 [0]   |
  0.016 [2]   |

Latency distribution:
  10% in 0.0027 secs
  25% in 0.0033 secs
  50% in 0.0043 secs
  75% in 0.0055 secs
  90% in 0.0067 secs
  95% in 0.0074 secs
  99% in 0.0083 secs

Details (average, fastest, slowest):
  DNS+dialup:  0.0000 secs, 0.0020 secs, 0.0164 secs
  DNS-lookup:  0.0000 secs, 0.0000 secs, 0.0005 secs
  req write:   0.0000 secs, 0.0000 secs, 0.0002 secs
  resp wait:   0.0043 secs, 0.0017 secs, 0.0161 secs
  resp read:   0.0001 secs, 0.0001 secs, 0.0007 secs

Status code distribution:
  [200] 600 responses

This gently loads our server without going overboard: 3 concurrent workers, each throttled at 10 requests per second, for 20 seconds, which gives us the 600 requests we see in the summary.

We likely want to monitor this server, so why not use DataDog to do so? The instrumented version looks like this:

from ddtrace import config, patch
import ddtrace.profiling.auto

from starlette.applications import Starlette
from starlette.responses import JSONResponse
from starlette.routing import Route


async def homepage(request):
    return JSONResponse({'hello': 'world'})


patch(starlette=True)
config.starlette['service_name'] = 'my-test-service'

app = Starlette(debug=True, routes=[
    Route('/', homepage),
])

What differs is that we are importing DataDog's ddtrace library and patching Starlette so that traces are pushed to the local DataDog agent. The agent is started as follows in a different terminal:

$ export DD_API_KEY=...
$ export DD_SITE=datadoghq.eu
$ docker run --rm -it --name dd-agent \
    -v /var/run/docker.sock:/var/run/docker.sock:ro \
    -v /proc/:/host/proc/:ro \
    -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
    -e DD_API_KEY=${DD_API_KEY} \
    -e DD_SITE=${DD_SITE} \
    -e DD_APM_ENABLED=true \
    -e DD_APM_NON_LOCAL_TRAFFIC=true \
    -p 8126:8126/tcp \
    gcr.io/datadoghq/agent:latest
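Before moving on, it can be worth checking that the agent came up properly. One way to do so, assuming the dd-agent container name used above, is to ask the agent for its status:

$ docker exec -it dd-agent agent status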

After a couple of minutes, you'll be able to search for metrics from this application on DataDog. Look for metrics with starlette in the name.
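You can also do that lookup from the command line. As a sketch, assuming the DD_API_KEY, DD_SITE and DD_APP_KEY (application key) variables used later in this article, DataDog's search endpoint can list the metric names it knows about:

$ curl -G -s "https://api.${DD_SITE}/api/v1/search" \
    --data-urlencode "q=metrics:starlette" \
    -H "DD-API-KEY: ${DD_API_KEY}" \
    -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" | jq .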

Could we now trick the application into raising errors to fake a faulty service? Why yes, of course: simply return a 4xx or 5xx response at random from time to time:

import random

from ddtrace import config, patch
import ddtrace.profiling.auto

from starlette.applications import Starlette
from starlette.requests import Request
from starlette.responses import JSONResponse
from starlette.routing import Route


async def index(request: Request) -> JSONResponse:
    if random.random() > 0.91:
        return JSONResponse({'error': 'boom'}, status_code=500)
    return JSONResponse({'hello': 'world'})


patch(starlette=True)
config.starlette['distributed_tracing'] = True
config.starlette['service_name'] = 'my-frontend-service'

app = Starlette(debug=True, routes=[
    Route('/', index),
])

Let's see how this impacts our client. Run our mild load again:

$ hey -c 3 -q 10 -z 20s http://localhost:8000/

Summary:
  Total:        20.0120 secs
  Slowest:      0.0189 secs
  Fastest:      0.0018 secs
  Average:      0.0051 secs
  Requests/sec: 29.9820

  Total data:   10142 bytes
  Size/request: 16 bytes

Response time histogram:
  0.002 [1]   |
  0.004 [146] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.005 [193] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.007 [146] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.009 [101] |■■■■■■■■■■■■■■■■■■■■■
  0.010 [10]  |■■
  0.012 [0]   |
  0.014 [0]   |
  0.016 [0]   |
  0.017 [1]   |
  0.019 [2]   |

Latency distribution:
  10% in 0.0029 secs
  25% in 0.0036 secs
  50% in 0.0050 secs
  75% in 0.0065 secs
  90% in 0.0074 secs
  95% in 0.0079 secs
  99% in 0.0092 secs

Details (average, fastest, slowest):
  DNS+dialup:  0.0000 secs, 0.0018 secs, 0.0189 secs
  DNS-lookup:  0.0000 secs, 0.0000 secs, 0.0005 secs
  req write:   0.0000 secs, 0.0000 secs, 0.0002 secs
  resp wait:   0.0048 secs, 0.0017 secs, 0.0187 secs
  resp read:   0.0002 secs, 0.0001 secs, 0.0011 secs

Status code distribution:
  [200] 542 responses
  [500] 58 responses

Notice how the summary now shows that some responses were in error, as per our change above. Yay, we broke something!

Can we now ask DataDog about these recorded errors? Yes we can:

# datadog info (change them to fit your own)
$ export DD_API_KEY=
$ export DD_APP_KEY=
$ export DD_SITE=datadoghq.eu

# your query data
$ export query="(sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count() - sum:trace.starlette.request.errors{service:my-test-service,resource_name:get_/}.as_count()) / (sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count())"
$ export from=$(date "+%s" -d "15 min ago")
$ export to=$(date "+%s")

$ curl -G -s -X GET "https://api.${DD_SITE}/api/v1/query" \
    --data-urlencode "from=${from}" \
    --data-urlencode "to=${to}" \
    --data-urlencode "query=${query}" \
    -H "Content-Type: application/json" \
    -H "DD-API-KEY: ${DD_API_KEY}" \
    -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" | jq .

The query we are running may look daunting but it is rather straightforward: we take the total number of requests, subtract the ones that were in error, and divide the result by the total again. This gives us the ratio of good requests.
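As a quick sanity check, here is that computation applied to the numbers from the hey run above (600 requests in total, 58 of them in error):

total = 600   # from the status code distribution above
errors = 58   # the 500 responses
good_ratio = (total - errors) / total
print(f"{good_ratio:.1%}")  # roughly 90.3%, well below a 99% target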

Great, we now have a query we can use to create a service level objective (SLO) that will tell us how our service is doing over time. Let's use Reliably for this.

$ reliably slo init
? What is the name of the service you want to declare SLOs for? my-frontend-service
| Paste your 'numerator' (good events) datadog query:
  (sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count() - sum:trace.starlette.request.errors{service:my-test-service,resource_name:get_/}.as_count())
| Paste your 'denominator' (total events) datadog query:
  (sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count())
? What is your target for this SLO (in %)? 99
? What is your observation window for this SLO? custom
? Define your custom observation window PT5M
? What is the name of this SLO? 99% of frontend responses over last 5 minutes are 2xx
SLO '99% of frontend responses over last 5 minutes are 2xx' added to Service 'my-frontend-service'
? Do you want to add another SLO? No
Service 'my-frontend-service' added
? Do you want to add another Service? No
✓ Your manifest has been saved to ./reliably.yaml

In a nutshell, we created a file that contains the definition of the SLO:

apiVersion: reliably.com/v1
kind: Objective
metadata:
  labels:
    name: 99% of requests over last 5 minutes
    service: my-test-service
spec:
  indicatorSelector:
    datadog_denominator_query: sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count()
    datadog_numerator_query: (sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count() / sum:trace.starlette.request.errors{service:my-test-service,resource_name:get_/}.as_count())
  objectivePercent: 99
  window: 1h0m0s

Now we can let Reliably know about it:

$ reliably slo sync 

Finally, while the application is still running with some load injected into it, start fetching data from DataDog using the query we saw earlier, and let Reliably consolidate it over the window duration given in the objective:

$ reliably slo agent -i3 

Now open a new terminal and run the following:

$ reliably slo report -w 

This will show you the SLO report for your service as computed by Reliably.

So what happened exactly? Well, let's zoom in on a section of the SLO:

  indicatorSelector:
    datadog_denominator_query: sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count()
    datadog_numerator_query: (sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count() / sum:trace.starlette.request.errors{service:my-test-service,resource_name:get_/}.as_count())

The indicatorSelector property is where the magic happens. It serves two purposes:

  • giving the reliably slo agent command the means to know which provider to use (here DataDog) and therefore how to fetch the required datapoints, here via the two queries. These datapoints are stored under the name of indicators on Reliably
  • declaring how the objective and its indicators are mapped together

That second point is key. Indicators themselves are not declared as entities (or objects) the way objectives are. Instead, they are merely a stream of values consumed by Reliably when sent by a client (reliably slo agent or the API directly). Upon receiving an indicator, Reliably looks at its labels and matches them against the indicatorSelector of any objective in the current organization. This tells us that objectives and indicators are loosely coupled: the fact that the reliably.yaml manifest contains the selector doesn't define the indicator, only how to match indicators to objectives. A conceptual sketch of this matching is shown below.
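To make that loose coupling a little more concrete, here is a small, purely illustrative sketch of how an incoming indicator's labels could be matched against an objective's indicatorSelector. This is not Reliably's actual implementation or API, just the general idea:

# Illustrative only: this is not Reliably's code, merely the matching idea.

objective = {
    "name": "99% of frontend responses over last 5 minutes are 2xx",
    "indicatorSelector": {
        "datadog_numerator_query": "<good events query>",
        "datadog_denominator_query": "<total events query>",
    },
}

# An indicator is just a value carrying labels, streamed by a client.
indicator = {
    "labels": {
        "datadog_numerator_query": "<good events query>",
        "datadog_denominator_query": "<total events query>",
    },
    "value": 0.903,
}


def matches(objective: dict, indicator: dict) -> bool:
    # An indicator feeds an objective when every selector entry is
    # found, with the same value, among the indicator's labels.
    selector = objective["indicatorSelector"]
    labels = indicator["labels"]
    return all(labels.get(key) == value for key, value in selector.items())


print(matches(objective, indicator))  # True: this indicator feeds that objective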

At this stage, you have a simple declaration of a service level objective that relies on DataDog's data to compute it. Since the SLO is just a file, you can store it alongside your code base and use it as part of your CI/CD pipeline to automate decisions about releasing. We'll see this in a future article using GitHub Actions.

The code for this article can be found on GitHub.
