I'm currently reading Google's SRE book, which is a very interesting read.
In Chapter 6, "Monitoring Distributed Systems", there is a section that explains how to choose an appropriate granularity for measurements.
I don't understand the example the author gives when explaining why the granularity of measurements matters. The passage reads:
> Collecting per-second measurements of CPU load might yield interesting data, but such frequent measurements may be very expensive to collect, store, and analyze. If your monitoring goal calls for high resolution but doesn’t require extremely low latency, you can reduce these costs by performing internal sampling on the server, then configuring an external system to collect and aggregate that distribution over time or across servers.
>
> You might:
>
> - Record the current CPU utilization each second.
> - Using buckets of 5% granularity, increment the appropriate CPU utilization bucket each second.
> - Aggregate those values every minute.
>
> This strategy allows you to observe brief CPU hotspots without incurring very high cost due to collection and retention.
Can someone explain the "5% granularity" part?
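Here is how I currently picture the second and third bullets, as a rough Python sketch of my own. The 20-bucket layout, the helper names, and the random stand-in for the CPU reading are all my guesses, not anything from the book:

```python
import random

# My guess: 20 histogram buckets, each covering a 5%-wide range of CPU
# utilization (0-5%, 5-10%, ..., 95-100%).
NUM_BUCKETS = 20
buckets = [0] * NUM_BUCKETS

def read_cpu_utilization():
    # Stand-in for a real per-second CPU reading; returns a percentage
    # between 0 and 100. In reality this would come from the OS.
    return random.uniform(0, 100)

def record_one_second():
    # The bucket index is the utilization divided by the 5% bucket width,
    # clamped so that exactly 100% lands in the last bucket.
    utilization = read_cpu_utilization()
    index = min(int(utilization // 5), NUM_BUCKETS - 1)
    buckets[index] += 1  # one more second spent in this 5% range

def aggregate_one_minute():
    # After 60 seconds the buckets form a tiny histogram (counts sum to 60)
    # that could be shipped to the monitoring system, then reset.
    global buckets
    snapshot = buckets
    buckets = [0] * NUM_BUCKETS
    return snapshot

# Simulate one minute of per-second sampling.
for _ in range(60):
    record_one_second()

for i, count in enumerate(aggregate_one_minute()):
    if count:
        print(f"{i * 5:3d}%-{(i + 1) * 5:3d}%: {count} seconds")
```

Is the "5% granularity" simply the width of these histogram buckets, i.e. 20 buckets that each cover a 5% range of CPU utilization, or does it mean something else?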