Skip to content

Conversation

eyalkoren
Copy link
Contributor

When trying to use MDC correlation on threads with null context class loaders (like in the added testWithNullContextClassLoader test), errors are thrown:

2019-07-12 16:33:19.849 [ForkJoinPool.commonPool-worker-3] WARN co.elastic.apm.agent.impl.transaction.TraceContextHolder - Exception while calling Slf4JMdcActivationListener#beforeActivate java.lang.NullPointerException at co.elastic.apm.agent.shaded.weaklockfree.WeakConcurrentMap.get(WeakConcurrentMap.java:96) at co.elastic.apm.agent.shaded.weaklockfree.WeakConcurrentMap$WithInlinedExpunction.get(WeakConcurrentMap.java:375) at co.elastic.apm.agent.cache.WeakKeySoftValueLoadingCache.get(WeakKeySoftValueLoadingCache.java:77) at co.elastic.apm.agent.slf4j.Slf4JMdcActivationListener.beforeActivate(Slf4JMdcActivationListener.java:89) at co.elastic.apm.agent.impl.transaction.TraceContextHolder.activate(TraceContextHolder.java:100) at co.elastic.apm.agent.impl.transaction.AbstractSpan.activate(AbstractSpan.java:203) at co.elastic.apm.agent.impl.async.SpanInScopeBaseWrapper.beforeDelegation(SpanInScopeBaseWrapper.java:47) at co.elastic.apm.agent.impl.async.SpanInScopeRunnableWrapper.run(SpanInScopeRunnableWrapper.java:61) at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402) at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157) 2019-07-12 16:33:19.850 [ForkJoinPool.commonPool-worker-3] WARN co.elastic.apm.agent.impl.transaction.TraceContextHolder - Exception while calling Slf4JMdcActivationListener#afterDeactivate java.lang.NullPointerException at co.elastic.apm.agent.shaded.weaklockfree.WeakConcurrentMap.get(WeakConcurrentMap.java:96) at co.elastic.apm.agent.shaded.weaklockfree.WeakConcurrentMap$WithInlinedExpunction.get(WeakConcurrentMap.java:375) at co.elastic.apm.agent.cache.WeakKeySoftValueLoadingCache.get(WeakKeySoftValueLoadingCache.java:77) at co.elastic.apm.agent.slf4j.Slf4JMdcActivationListener.afterDeactivate(Slf4JMdcActivationListener.java:108) at co.elastic.apm.agent.impl.transaction.TraceContextHolder.deactivate(TraceContextHolder.java:116) at co.elastic.apm.agent.impl.transaction.AbstractSpan.deactivate(AbstractSpan.java:209) at co.elastic.apm.agent.impl.async.SpanInScopeBaseWrapper.afterDelegation(SpanInScopeBaseWrapper.java:64) at co.elastic.apm.agent.impl.async.SpanInScopeRunnableWrapper.run(SpanInScopeRunnableWrapper.java:67) at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402) at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157) 

And the logging correlation doesn't work. Instead, this is an attempt to load the slf4j MDC through the System class loader.
In addition, added a caching for loading errors, so that the MDC will not be attempted to get loaded through a ClassLoader where it already failed.

@eyalkoren eyalkoren self-assigned this Jul 14, 2019
@eyalkoren eyalkoren requested a review from felixbarny July 14, 2019 07:44
@felixbarny
Copy link
Member

felixbarny commented Jul 15, 2019

Seems like that's the case when using the default fork-join pool as that is created by the bootstrap class loader and not by a particular web application, for example.
This means the log correlation does not work when using the default fork-join pool.

@felixbarny
Copy link
Member

The way we could fix that is to use the transaction's initiating class loader to look up slf4j. But we should do that in a separate PR.

@eyalkoren
Copy link
Contributor Author

Seems like that's the case when using the default fork-join pool as that is created by the bootstrap class loader and not by a particular web application

Context class loader is either set explicitly or gets the one from the parent thread. It is not related to loading.

The way we could fix that is to use the transaction's initiating class loader to look up slf4j. But we should do that in a separate PR.

If the rationale is to try and resemble the class visibility of the span, why not use the span class loader? Why are we using the context class loader in the first place?

@felixbarny
Copy link
Member

If the rationale is to try and resemble the class visibility of the span, why not use the span class loader? Why are we using the context class loader in the first place?

For example, in an application server environment, there can be multiple applications for a single JVM. We want to lookup the slf4j classes which correspond to the application which is currently being monitored via the span/transaction.

When using the common fork-join pool (for example via java.util.stream.Stream#parallel), the threads in that pool don't have the same context class loader as the web application.

Using the context class loader is, however, not really the best way to determine the application class loader. That's why we already require the class loader when starting a transaction. In the case of a servlet transaction, that would be set to javax.servlet.ServletContext#getClassLoader which is the most reliable source.

We should store that class loader in the Transaction and use that to look up the slf4j classes. That makes it consistent no matter on which thread a span is active.

@eyalkoren eyalkoren merged commit 45857ce into elastic:master Jul 15, 2019
@eyalkoren eyalkoren deleted the fix-NPE-in-MDC-correlation branch July 15, 2019 10:39
@Lonzak
Copy link

Lonzak commented Sep 14, 2020

Did this fix made it into 1.18? Because I didn't see it in the change logs. (We have the same problem...)

@eyalkoren
Copy link
Contributor Author

@Lonzak This specific fix was merged long ago.
If you are experiencing something similar, please open a new issue with detailed description of the problem, environment, relevant logs etc. so we can look into.

@Lonzak
Copy link

Lonzak commented Sep 16, 2020

@eyalkoren Damn I overlooked the year (2019 vs. 2020) This would explain it...Ok then it must be another issue with the same error...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

3 participants