Skip to content

Conversation

@BlakeWilliams
Copy link

Motivation / Background

At GitHub we're spiking out tooling that identifies queries in CI, highlighting them to the PR author, linting for index usage, etc. While building this functionality we discovered that the query logs attributed a significant number of queries to a common job that's run inline in our test suite. After some investigation, we discovered that the cause was due to :job (and :controller) context values not being cleared after the job or controller had completed execution.

The job in question is commonly run in our test suite and we saw a ~4x reduction in the number of queries attributed to it after a small monkey patch to clear ActiveSupport::ExecutionContext[:job] post perform.

The accuracy of this tag is valuable in this testing context since it's highlighting a lot of potential optimizations we can make to our test suite and avoids potential confusion why when queries are attributed incorrectly. This can also impact production in some circumstances, like mistakenly attributing queries to the wrong source (e.g. running a perform_now could incorrectly begin attributing queries to that job instead its parent job, or queries made in a controller post-perform).

Detail

Currently ActiveSupport::ExecutionContext#[] is used to set the :job and :controller attributes, which does not reset the values once the job or controller have completed execution. This can result in that context leaking, resulting in query logs that are misattributed.

This leaky behavior can be seen in production (e.g. performing a job via perform_now) but is much more noticeable in the test environment where it's not uncommon to run some combination of jobs/controller actions in a single test.

This updates both uses of ActiveSupport::ExecutionContext#[] to avoid leaking the context and generating inaccurate query logs.

Additional information

I wasn't 100% positive if there was a better place to put these tests but thought they'd be valuable as small/cheap regression tests.

Checklist

Before submitting the PR make sure the following are checked:

  • This Pull Request is related to one change. Unrelated changes should be opened in separate PRs.
  • Commit message has a detailed description of what changed and why. If this PR fixes a related issue include it in the commit message. Ex: [Fix #issue-number]
  • Tests are added or updated if you fix a bug or add a feature.
  • CHANGELOG files are updated for the changed libraries if there is a behavior change or additional feature. Minor bug fixes and documentation changes should not be included.
Currently `ActiveSupport::ExecutionContext#[]` is used to set the `:job` and `:controller` attributes, which does not reset the values once the job or controller have completed execution. This can result in that context leaking, resulting in query logs that are misattributed. This leaky behavior can be seen in production (e.g. performing a job via `perform_now`) but is much more noticeable in the test environment where it's not uncommon to run some combination of jobs/controller actions in a single test. This updates both uses of `ActiveSupport::ExecutionContext#[]` to avoid leaking the context and generating inaccurate query logs.
@byroot
Copy link
Member

byroot commented Mar 2, 2025

Unfortunately I don't think that's exactly the proper fix, because there's some expectation that :controller or :job remains in the context so that's it's still available when you are returning from the middleware chain.

But of course in test that's a bit different.

@BlakeWilliams
Copy link
Author

Unfortunately I don't think that's exactly the proper fix, because there's some expectation that :controller or :job remains in the context so that's it's still available when you are returning from the middleware chain.

I think for :controller that makes a bit more sense, although the behavior is still likely surprising to folks since you're only able to attribute the latter half of the middleware and completely miss out on attributing the initial half, which is (at least anecdotally) much more likely to run queries. So as-is consumers of this API get query logs that seemingly attribute queries to middleware with a significant and hidden gap in coverage.

In at least in certain job scenarios, the existing behavior leads to plain incorrect results. I wrote up a quick repro script to highlight that:

I, [2025-03-02T10:36:37.267797 #98625] INFO -- : ------------------------------- I, [2025-03-02T10:36:37.267858 #98625] INFO -- : BugTest: test_query_association I, [2025-03-02T10:36:37.267865 #98625] INFO -- : ------------------------------- I, [2025-03-02T10:36:37.268202 #98625] INFO -- : Performing OuterJob (Job ID: bc1ca609-45f8-4865-a77c-7c5a25710a88) from Async(default) D, [2025-03-02T10:36:37.272491 #98625] DEBUG -- : Post Load (0.0ms) SELECT "posts".* FROM "posts" ORDER BY "posts"."id" ASC LIMIT 1 /*application='TestApp',job='OuterJob'*/ I, [2025-03-02T10:36:37.272641 #98625] INFO -- : Performing InnerJob (Job ID: 854efa1e-4fd3-4500-96ab-98e219fb7194) from Async(default) Hello world I, [2025-03-02T10:36:37.272711 #98625] INFO -- : Performed InnerJob (Job ID: 854efa1e-4fd3-4500-96ab-98e219fb7194) from Async(default) in 0.1ms D, [2025-03-02T10:36:37.272886 #98625] DEBUG -- : Post Load (0.0ms) SELECT "posts".* FROM "posts" ORDER BY "posts"."id" ASC LIMIT 1 OFFSET 1 /*application='TestApp',job='InnerJob'*/ I, [2025-03-02T10:36:37.272913 #98625] INFO -- : Performed OuterJob (Job ID: bc1ca609-45f8-4865-a77c-7c5a25710a88) from Async(default) in 4.82ms 

Both queries occur in the OuterJob, but after performing the InnerJob all subsequent queries are associated with it. This is incorrect attribution in both test, and prod environments. We'll also see misattribution in controllers too, with queries being associated to the job after the job has already been performed.

As-is though, I'd argue that the existing behavior is often misleading, if not incorrect. I think there's a strong argument for changing the job behavior given how much more common it would be to run into it leaking query tags.

Unfortunately I don't think that's exactly the proper fix

Either way, I would definitely be interested in what you imagine the proper fix might be because I'd love to fix this behavior upstream vs just in our application(s).

@BlakeWilliams
Copy link
Author

BlakeWilliams commented Mar 2, 2025

For completeness sake, here's another repro script but using controllers and middleware highlighting some of what I mentioned above:

I, [2025-03-02T10:58:25.127208 #18849] INFO -- : ------------------------------------------ I, [2025-03-02T10:58:25.127245 #18849] INFO -- : BugTest: test_controller_query_association I, [2025-03-02T10:58:25.127251 #18849] INFO -- : ------------------------------------------ I, [2025-03-02T10:58:25.127924 #18849] INFO -- : Started GET "/" for 127.0.0.1 at 2025-03-02 10:58:25 -0500 D, [2025-03-02T10:58:25.133805 #18849] DEBUG -- : Post Load (0.0ms) SELECT "posts".* FROM "posts" WHERE "posts"."id" = 100 ORDER BY "posts"."id" ASC LIMIT 1 /*application='TestApp'*/ I, [2025-03-02T10:58:25.140603 #18849] INFO -- : Processing by TestController#index as HTML D, [2025-03-02T10:58:25.140950 #18849] DEBUG -- : Post Load (0.0ms) SELECT "posts".* FROM "posts" ORDER BY "posts"."id" ASC LIMIT 1 OFFSET 2 /*action='index',application='TestApp',controller='test'*/ I, [2025-03-02T10:58:25.141101 #18849] INFO -- : Performing InnerJob (Job ID: bc4d5767-6b65-4b5d-a04d-ad7c287bb000) from Async(default) Hello world I, [2025-03-02T10:58:25.141172 #18849] INFO -- : Performed InnerJob (Job ID: bc4d5767-6b65-4b5d-a04d-ad7c287bb000) from Async(default) in 0.12ms D, [2025-03-02T10:58:25.143376 #18849] DEBUG -- : Rendering text template I, [2025-03-02T10:58:25.143420 #18849] INFO -- : Rendered text template (Duration: 0.0ms | GC: 0.0ms) D, [2025-03-02T10:58:25.143756 #18849] DEBUG -- : Post Load (0.1ms) SELECT "posts".* FROM "posts" ORDER BY "posts"."id" ASC LIMIT 1 OFFSET 3 /*action='index',application='TestApp',controller='test',job='InnerJob'*/ I, [2025-03-02T10:58:25.143872 #18849] INFO -- : Completed 200 OK in 3ms (Views: 2.3ms | ActiveRecord: 0.1ms (2 queries, 0 cached) | GC: 0.0ms) D, [2025-03-02T10:58:25.144187 #18849] DEBUG -- : Post Load (0.0ms) SELECT "posts".* FROM "posts" WHERE "posts"."id" = 101 ORDER BY "posts"."id" ASC LIMIT 1 /*action='index',application='TestApp',controller='test',job='InnerJob'*/ F 
  • The :controller is missing in queries before the @app.call(env) call (which makes sense from an implementation perspective, but not a behavior perspective).
  • Once a job been run inline anywhere, all subsequent queries are attributed to the last job run since :job is not reset or designed to handle nested calls.
  • Queries after @app.call(env) in the middleware stack are attributed to both the controller, and the job.

Ultimately, the behavior as-is feels inconsistent, and surprising unless you know about these caveats and implementation details. This data can be so helpful in a number of scenarios (incidents, understanding query patterns, debugging, etc), but only if it's accurate, so I'd really love to help make this data more accurate and actionable.

@byroot
Copy link
Member

byroot commented Mar 3, 2025

although the behavior is still likely surprising to folks since you're only able to attribute the latter half of the middleware and completely miss out on attributing the initial half,

Yes, but it's not like you can identify the controller before going through the router, so it's a bit innevitable.

I would definitely be interested in what you imagine the proper fix might be

I haven't put much thoughts into it yet, I'm just saying that for the case of controllers at least, it's expected that the context remain. If we want to clear it in test environment, I think it's more the job of the test helpers to do it.

For jobs specifically, it might make sense to restore it right away.

@BlakeWilliams
Copy link
Author

I haven't put much thoughts into it yet, I'm just saying that for the case of controllers at least, it's expected that the context remain. If we want to clear it in test environment, I think it's more the job of the test helpers to do it.

That's fair, I've been thinking on that and I think I still disagree with that being the "correct" behavior (and happy to agree to disagree here) given the reasoning above. At least for query logs, I believe that intent changed at some point since it looks like it was behaving as desired before it was merged with (or into?) ActiveSupport::ExecutionContext. It's possible that other mechanisms use this data where this behavior is useful, but I would still assert that as-is, it's misleading.

For jobs specifically, it might make sense to restore it right away.

I think jobs are definitely more likely to cause issues with attribution given their usage patterns. I'd be happy to split the controller and job changes out into separate PRs if that's helpful.

@byroot
Copy link
Member

byroot commented Mar 3, 2025

I'd be happy to split the controller and job changes out into separate PRs if that's helpful.

Sure, let's go with that.

@byroot
Copy link
Member

byroot commented Mar 3, 2025

I believe that intent changed at some point since it looks like it was behaving as desired before it was merged with (or into?) ActiveSupport::ExecutionContext.

Here's the context: #43598 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

2 participants