Skip to content

Conversation

joegallo
Copy link
Contributor

Rather than looping externally and calling the method multiple times, we can loop internally and only call the method once. This is primarily intended as a code cleanup and clearness change, so I went with >refactoring.

As a small additional optimization, I split the instanceof Map checks to first run an instanceof IngestCtxMap check -- these maps are IngestCtxMap at the top level of a path traversal, and then almost exclusively HashMaps at subsequent levels. By having separate paths, the JVM has the opportunity to learn that those concrete types are hot in each place, rather than them both being mixed up with each other (or at least that's my understanding).

There's no need to construct and return a ResolveResult in order to do the looping externally. Instead we can do the looping within resolve and only construct and return a ResolveResult on the last iteration of the loop.
Although an IngestCtxMap is a map, having a separate slot in the if/else chain for it gives the JVM the opportunity to optimize the IngestCtxMap and almost-certainly-a-HashMap cases separately, rather than needing to treat both of them the same.
@joegallo joegallo added :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >refactoring Team:Data Management Meta label for data/management team v8.19.0 v9.1.0 labels Mar 17, 2025
@joegallo joegallo requested a review from masseyke March 17, 2025 18:41
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@joegallo
Copy link
Contributor Author

This is tangentially related to #123891 -- more efficient resolveing will help all processors, but it is especially helpful for the remove processor (since so much of what it does is 'just' resolve-ing things so that it can then remove them).

ResolveResult result = resolve(fieldPath.pathElements, fieldPath.pathElements.length, path, context);
if (result.wasSuccessful) {
return cast(path, result.resolvedObject, clazz);
} else if (ignoreMissing && hasField(path) == false) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think hasField(path) == false is no longer needed. I took a look at the commit that introduced it 6796464f, and the only reason I can think that it would have failed this check would be if the path resolved, but the type of the resolved object was wrong. I believe that now, to get to the else-if condition, the path cannot be in the object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, actually -- I think the code was wrong before this PR and maybe it's just a little more obvious now with this change. #20982 introduced it and would have been 'right', but #84659 would have changed those semantics, I think.

I'm going to see if I can magic up a test that would have been broken by this change just to confirm my thinking, but either way I'll get rid of the hasField check here.

Good eyes, Parker! Thank you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, on further reading I think the semantics haven't changed (phew!). I dropped the check in 994744b and added some more tests in 3ed83bb.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's actually going to be a nice tiny little speedup, too, btw -- there are places in the benchmarks where that hasField(path) check was hot enough to show up in a profile.

@joegallo joegallo requested a review from parkertimmins March 18, 2025 20:04
@joegallo joegallo added the auto-backport Automatically create backport pull requests when merged label Mar 18, 2025
@joegallo joegallo merged commit 3e2cdc7 into elastic:main Mar 18, 2025
17 checks passed
@joegallo joegallo deleted the refactor-ingest-document-resolve branch March 18, 2025 21:38
joegallo added a commit to joegallo/elasticsearch that referenced this pull request Mar 18, 2025
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
8.x
@joegallo
Copy link
Contributor Author

Screenshot 2025-03-24 at 9 24 42 AM

Here's a screenshot from the nightly benchmarks -- there's a big drop in set processors early on from #120573 but then this PR ends up giving a nice additional speedup in set processors, too. Overall we're taking about 40% less time here on this benchmark than before these two changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-backport Automatically create backport pull requests when merged :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >refactoring Team:Data Management Meta label for data/management team v8.19.0 v9.1.0

3 participants