- Notifications
You must be signed in to change notification settings - Fork 25.5k
ESQL: Push more ==
s on text fields to lucene #126641
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
If you do: ``` | WHERE text_field == "cat" ``` we can't push to the text field because it's search index is for individual words. But most text fields have a `.keyword` sub field and we *can* query it's index. EXCEPT! It's normal for these fields to have `ignore_above` in their mapping. In that case we don't push to the field. Very sad. With this change we can push down `==`, but only when the right hand side is shorter than the `ignore_above`. This has pretty much infinite speed gain. An example using a million documents: ``` Before: "took" : 391, After: "took" : 4, ``` But this is going from totally un-indexed linear scans to totally indexed. You can make the "Before" number as high as you want by loading more data.
Hi @nik9000, I've created a changelog YAML for you. |
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/stats/SearchContextStats.java Show resolved Hide resolved
server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java Outdated Show resolved Hide resolved
I don't believe this works for |
Pinging @elastic/es-analytical-engine (Team:Analytics) |
if (query instanceof SingleValueQuery) { | ||
// Already wrapped | ||
return query; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not super proud of this. I kind of thing we should remove this and have folks wrap the query they build themselves. But not in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This exists so Equals
can have a different behavior - it checks the value count of the synthetic source delegate.....
Wait. What if we remove one? Oh no.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I've added a fix for this. I'll push a javadoc explaining it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed 600257a.
I think this is ready for a real review! |
@luigidellaquila could you have a look at this one too? |
I've added this test. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
* </p> | ||
* <p> | ||
* You may be asking "how would the first {@code text_field.raw:foo} query work if the | ||
* value we're searching for is very long? In that case we never use this query at all. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case we never use this query at all
Nit: I wonder if this should be a bigger warning at the top of the javdoc. I could imagine somebody (of us) trying to use this for something, and adding a bug because of it 👀 (Not perfect anyway)
// end::rlikeEscapingTripleQuotes-result[] | ||
; | ||
| ||
mvStringEquals |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a test for an literal over ignored_above
chars + MV?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Literals don't have ignore_above
.
@timestamp:date ,message:text | ||
2023-10-23T13:55:01.543Z,[Connected to 10.1.0.1, Banana] | ||
2023-10-23T13:55:01.544Z,Connected to 10.1.0.1 | ||
2023-10-23T13:55:01.545Z,[Connected to 10.1.0.1, More than one hundred characters long so it isn't indexed by the sub keyword field with ignore_above:100] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about adding also a single-value over ignore_above
? So we have all the cases here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. I should do that.
| ||
@Override | ||
public TransportVersion getMinimalSupportedVersion() { | ||
throw new UnsupportedOperationException(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this not serialized because it's always translated in the local node?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right! I'll leave a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. It looks to me that LucenePushdownPredicates.DEFAULT is used always - how about using that instance directly in code instead of passing it around through TranslatorAware interface?
The serverless failure looks real. Digging into that. |
At the cost of basically an entire day I've discovered that the serverless test failure had nothing to do with serverless. It's actually a bug with the rewrite mechanism of our SingleValueMatchQuery - we think that the query is match_all when it shouldn't be. I'm able to reproduce with an index with two documents - one that contains "foo" and the other that contains ["foo", "bar"]. To hit this you have to have the same number of distinct terms as documents. If each doc has a distinct term we'd *correct* rewrite this to match_none. But if there are duplicates we will *still* rewrite it, this time incorrectly. I'll open a separate PR with this and backport it. RE the DEFAULT pushdown - it's not used in one critical place - during the last layer of rewrites. …On Mon, Apr 21, 2025 at 12:39 PM Costin Leau ***@***.***> wrote: ***@***.**** approved this pull request. LGTM. It looks to me that LucenePushdownPredicates.DEFAULT is used always - how about using that instance directly in code instead of passing it around through TranslatorAware interface? — Reply to this email directly, view it on GitHub <#126641 (review)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AABUXISSP4NNO6BDQQYWLY322UNNXAVCNFSM6AAAAAB24RI6WCVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDOOBRG42DCNBWG4> . You are receiving this because you were mentioned.Message ID: ***@***.***> |
Also! It has to be on a comparison with `keyword` fields. Not number fields. …On Mon, Apr 21, 2025 at 5:49 PM Nikolas Everett ***@***.***> wrote: At the cost of basically an entire day I've discovered that the serverless test failure had nothing to do with serverless. It's actually a bug with the rewrite mechanism of our SingleValueMatchQuery - we think that the query is match_all when it shouldn't be. I'm able to reproduce with an index with two documents - one that contains "foo" and the other that contains ["foo", "bar"]. To hit this you have to have the same number of distinct terms as documents. If each doc has a distinct term we'd *correct* rewrite this to match_none. But if there are duplicates we will *still* rewrite it, this time incorrectly. I'll open a separate PR with this and backport it. RE the DEFAULT pushdown - it's not used in one critical place - during the last layer of rewrites. On Mon, Apr 21, 2025 at 12:39 PM Costin Leau ***@***.***> wrote: > ***@***.**** approved this pull request. > > LGTM. It looks to me that LucenePushdownPredicates.DEFAULT is used always > - how about using that instance directly in code instead of passing it > around through TranslatorAware interface? > > — > Reply to this email directly, view it on GitHub > <#126641 (review)>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/AABUXISSP4NNO6BDQQYWLY322UNNXAVCNFSM6AAAAAB24RI6WCVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDOOBRG42DCNBWG4> > . > You are receiving this because you were mentioned.Message ID: > ***@***.***> > |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the fix for the serverless issue: #127146
While looking to extend this to |
The PR elastic#126641 has a bug with `!=`.
The PR #126641 has a bug with `!=`.
If you do: ``` | WHERE text_field == "cat" ``` we can't push to the text field because it's search index is for individual words. But most text fields have a `.keyword` sub field and we *can* query it's index. EXCEPT! It's normal for these fields to have `ignore_above` in their mapping. In that case we don't push to the field. Very sad. With this change we can push down `==`, but only when the right hand side is shorter than the `ignore_above`. This has pretty much infinite speed gain. An example using a million documents: ``` Before: "took" : 391, After: "took" : 4, ``` But this is going from totally un-indexed linear scans to totally indexed. You can make the "Before" number as high as you want by loading more data.
The PR elastic#126641 has a bug with `!=`.
Backporting with #128156 |
If you do:
we can't push to the text field because it's search index is for individual words. But most text fields have a
.keyword
sub field and we can query it's index. EXCEPT! It's normal for these fields to haveignore_above
in their mapping. In that case we don't push to the field. Very sad.With this change we can push down
==
, but only when the right hand side is shorter than theignore_above
.This has pretty much infinite speed gain. An example using a million documents:
But this is going from totally un-indexed linear scans to totally indexed. You can make the "Before" number as high as you want by loading more data.