@xiaohansong xiaohansong commented Nov 16, 2022

What

In a multi-cloud context, we tag some gauge metrics with their geography or their attempt queue. This can cause a problem in the following scenario:

Say at 13:00 we send num_running_jobs[gcp] = 2 and num_running_jobs[aws] = 2.
At 14:00 we don't find any running aws jobs, so we send only num_running_jobs[gcp] = 3.

On the Datadog dashboard, because these are gauge metrics, not sending aws data at 14:00 will NOT overwrite the previous value, so the dashboard still shows 2 running jobs on aws.

That's the gist of what this PR fixes. Also, due to a previous bug where attempt_queue was overwritten by a null value, we have some gauge data tagged null. This PR attempts to fix that by sending attempt_queue:null = 0 to Datadog so that the stale value gets reset.
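As an illustration only, here is a minimal Python sketch of the zero-reset idea described above: report a value for every known tag on each emission, defaulting to 0 when nothing was observed, so a stale gauge value cannot linger. The helper name `fill_missing_tags` and the tag lists are hypothetical and are not taken from this PR's code; the real change may be structured differently.

```python
from typing import Dict, Iterable, Optional


def fill_missing_tags(
    observed: Dict[Optional[str], int],
    known_tags: Iterable[Optional[str]],
) -> Dict[Optional[str], int]:
    """Return one gauge value per known tag, using 0 for tags with no data.

    Hypothetical helper for illustration; not the PR's actual implementation.
    """
    filled: Dict[Optional[str], int] = {tag: 0 for tag in known_tags}
    filled.update(observed)  # observed values win over the 0 default
    return filled


# Assumed list of geographies; at 14:00 only gcp jobs were found.
KNOWN_GEOGRAPHIES = ["gcp", "aws"]
at_14 = fill_missing_tags({"gcp": 3}, KNOWN_GEOGRAPHIES)

# Including None (the null tag) in the known tags resets its stale value too.
# "default" is a made-up queue name used only for this example.
queues = fill_missing_tags({"default": 5}, ["default", None])
```

With this approach, `at_14` carries an explicit `aws` entry of 0 alongside `gcp` = 3, and `queues` carries an explicit 0 for the null tag, so each emission overwrites whatever the previous one left behind.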

(Separately, the total number of running jobs did not equal aws + gcp because the aggregation was using 'max'. In this case it should be 'sum', because the total adds together the metric values across all tags. That has already been done.)
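To make the arithmetic concrete, here is a small Python sketch using the 13:00 values from the scenario above. It only mimics the two aggregation choices locally; the function names are hypothetical and this is not Datadog's query API.

```python
from typing import Dict


def total_by_max(per_tag: Dict[str, int]) -> int:
    """Aggregate across tags with max: returns the largest single-tag value."""
    return max(per_tag.values())


def total_by_sum(per_tag: Dict[str, int]) -> int:
    """Aggregate across tags with sum: returns the combined value of all tags."""
    return sum(per_tag.values())


# Values sent at 13:00 in the example.
running_jobs = {"gcp": 2, "aws": 2}

max_total = total_by_max(running_jobs)  # 2, undercounts the total
sum_total = total_by_sum(running_jobs)  # 4, equals aws + gcp
```

Only the sum matches aws + gcp, which is why the total-running-jobs aggregate needs 'sum' rather than 'max'.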

@davinchia davinchia left a comment

Feel free to merge in after these are addressed!

@xiaohansong xiaohansong temporarily deployed to more-secrets November 16, 2022 22:36 Inactive
@xiaohansong xiaohansong temporarily deployed to more-secrets November 16, 2022 22:39 Inactive
@xiaohansong xiaohansong merged commit c9287cc into master Nov 16, 2022
@xiaohansong xiaohansong deleted the xiaohan/mfix branch November 16, 2022 23:57
akashkulk pushed a commit that referenced this pull request Dec 2, 2022
* multi cloud gauge metric fix
* avoid casting Map
* improve code comments