I'm currently reading Google's SRE book, which is a very interesting read.
In Chapter 6, "Monitoring Distributed Systems", there is a section that explains how to choose an appropriate granularity for measurements.
I don't understand the example the author gives when explaining why the granularity of measurements matters. The passage reads:
> Collecting per-second measurements of CPU load might yield interesting data, but such frequent measurements may be very expensive to collect, store, and analyze. If your monitoring goal calls for high resolution but doesn’t require extremely low latency, you can reduce these costs by performing internal sampling on the server, then configuring an external system to collect and aggregate that distribution over time or across servers.
>
> You might:
>
> - Record the current CPU utilization each second.
> - Using buckets of 5% granularity, increment the appropriate CPU utilization bucket each second.
> - Aggregate those values every minute.
>
> This strategy allows you to observe brief CPU hotspots without incurring very high cost due to collection and retention.
Can someone explain the "5% granularity" part?
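Here is how I currently picture the second and third bullets, as a rough Python sketch of my own. The 20-bucket layout, the helper names, and the random stand-in for the CPU reading are all my guesses, not anything from the book:

```python
import random

# My guess: 20 histogram buckets, each covering a 5%-wide range of CPU
# utilization (0-5%, 5-10%, ..., 95-100%).
NUM_BUCKETS = 20
buckets = [0] * NUM_BUCKETS

def read_cpu_utilization():
    # Stand-in for a real per-second CPU reading; returns a percentage
    # between 0 and 100. In reality this would come from the OS.
    return random.uniform(0, 100)

def record_one_second():
    # The bucket index is the utilization divided by the 5% bucket width,
    # clamped so that exactly 100% lands in the last bucket.
    utilization = read_cpu_utilization()
    index = min(int(utilization // 5), NUM_BUCKETS - 1)
    buckets[index] += 1  # one more second spent in this 5% range

def aggregate_one_minute():
    # After 60 seconds the buckets form a tiny histogram (counts sum to 60)
    # that could be shipped to the monitoring system, then reset.
    global buckets
    snapshot = buckets
    buckets = [0] * NUM_BUCKETS
    return snapshot

# Simulate one minute of per-second sampling.
for _ in range(60):
    record_one_second()

for i, count in enumerate(aggregate_one_minute()):
    if count:
        print(f"{i * 5:3d}%-{(i + 1) * 5:3d}%: {count} seconds")
```

Is the "5% granularity" simply the width of these histogram buckets, i.e. 20 buckets that each cover a 5% range of CPU utilization, or does it mean something else?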