What
In a multi-cloud context we tag some gauge metrics with their geography or their attempt queue. This can cause a problem in the following scenario:
Say at 13:00 we send `num_running_jobs[gcp] = 2` and `num_running_jobs[aws] = 2`; at 14:00 our query returns no running aws jobs, so we send only `num_running_jobs[gcp] = 3`.
On the Datadog dashboard, because these are gauge metrics, not sending aws data at 14:00 will NOT overwrite the previous data, so the dashboard still shows 2 running jobs on aws.
That's the gist of the fix. Also, due to a previous bug where attempt_queue was overwritten by a null value, we have some gauge data tagged null. This PR attempts to fix that by sending attempt_queue:null = 0 to Datadog so that the stale value gets reset.
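To illustrate the idea, here is a minimal sketch of zero-filling, assuming the fix works by emitting an explicit 0 for every known tag value that is missing from the current query result. The names `KNOWN_GEOGRAPHIES`, `gauge_values_by_tag`, `emit_gauges`, the `geography:` tag key, and the `emit` callback are all hypothetical and stand in for whatever this codebase and its metric client actually use.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List

# Hypothetical: the tag values we always want to report, even when empty.
KNOWN_GEOGRAPHIES = ("aws", "gcp")


def gauge_values_by_tag(
    running_job_tags: Iterable[str],
    known_tags: Iterable[str] = KNOWN_GEOGRAPHIES,
) -> Dict[str, int]:
    """Count running jobs per tag, filling in 0 for known tags with no jobs."""
    values = {tag: 0 for tag in known_tags}   # start every known tag at 0
    values.update(Counter(running_job_tags))  # overwrite with real counts
    return values


def emit_gauges(
    values: Dict[str, int],
    emit: Callable[[str, int, List[str]], None],
) -> None:
    """Send one gauge point per tag; `emit` is a stand-in for the metric client."""
    for tag, value in sorted(values.items()):
        emit("num_running_jobs", value, [f"geography:{tag}"])


# At 14:00 only gcp jobs are running, but aws is still reported as 0.
sent = []
emit_gauges(
    gauge_values_by_tag(["gcp", "gcp", "gcp"]),
    lambda name, value, tags: sent.append((name, value, tags)),
)
```

Under the same assumption, the stale null tag can be reset by including `"null"` in the known tag values for attempt_queue, so that a 0 is sent for it on every run.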
(Also, as a separate fix: the total number of running jobs did not equal aws + gcp because the aggregate was using 'max'. In this case it should be sum, because the total adds the metric across all tags. That has been done.)
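A small worked example of why the aggregate matters, using the 13:00 values from the scenario above. The dictionary is only an illustration of per-tag gauge values, not how the dashboard stores them.

```python
# Latest gauge value per tag at 13:00, taken from the scenario above.
latest_by_tag = {"gcp": 2, "aws": 2}

# 'max' across tags reports the largest single series, not the total.
total_with_max = max(latest_by_tag.values())

# 'sum' across tags adds every series, which is the total we want.
total_with_sum = sum(latest_by_tag.values())

print(f"max: {total_with_max}, sum: {total_with_sum}")
```

With 'max' the dashboard would show 2 running jobs in total, while 'sum' gives the expected 4.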