- Notifications
You must be signed in to change notification settings - Fork 25.5k
Initial support for TEXT fields in LOOKUP JOIN condition #119473
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This is not sufficient, since we understand it does not consider the term query on the server side possibly performing a CONTAINS query. We should rather rewrite to the underlying keyword sub-field for exact matches.
BytesRef value = bytesRefBlock.getBytesRef(offset, new BytesRef()); | ||
if (field.typeName().equals("text")) { | ||
// Text fields involve case-insensitive contains queries, we need to use lowercase on the term query | ||
return new BytesRef(value.utf8ToString().toLowerCase(Locale.ROOT)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This mostly exists to show that matching against text fields "doesn't work like you think it does" in the ENRICH infrastructure we've borrowed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've deleted this now (or rather I'm throwing an exception, but might want to delete that too), and instead try get the planner to use an underlying KEYWORD sub-field, if it exists, using fa.exactAttribute().
| ||
private record MatchConfig(String fieldName, int channel, DataType type) { | ||
private MatchConfig(FieldAttribute match, Layout.ChannelAndType input) { | ||
// Note, this handles TEXT fields with KEYWORD subfields, and we assume tha has been validated earlier during planning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Note, this handles TEXT fields with KEYWORD subfields, and we assume tha has been validated earlier during planning | |
// Note, this handles TEXT field with KEYWORD subfields, and we assume that has been validated earlier during planning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please clarify "validated" means here? Is it existance of the nested .keyword
field? If so could you please help me understand where such validation is happening?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR is still in draft, but I planned to find that location and add such validation myself. It does not yet exist.
String typeName = fieldType.typeName().equals("text") ? DataType.KEYWORD.typeName() : fieldType.typeName(); | ||
if (typeName.equals(inputDataType.noText().typeName()) == false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it is worth resolving typeName
wirh fromTypeName
here to avoid comparing strings?
Something like
String typeName = fieldType.typeName().equals("text") ? DataType.KEYWORD.typeName() : fieldType.typeName(); | |
if (typeName.equals(inputDataType.noText().typeName()) == false) { | |
if (Objects.equals(DataType.fromTypeName(fieldType.typeName()).noText(), inputDataType.noText()) == false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually thinking of deleting this method entirely, since it is shadowed by the newer validation happening in the Validator. In the current draft I fixed both methods, but in reality we only need the one.
Pinging @elastic/es-analytical-engine (Team:Analytics) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general it's looking good 👍
The only concern I have is the clarity of the error message (added a comment, more like a strong suggestion if you agree with it).
Also, on the same idea, I want to raise the question of documenting this limitation, if it's something we should do.
catch: "bad_request" | ||
| ||
- match: { error.type: "verification_exception" } | ||
- contains: { error.reason: "Found 1 problem\nline 1:55: JOIN left field [color] of type [TEXT] is incompatible with right field [color] of type [TEXT]" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an user, I would be confused about this message. As in "both types are the same, they are not unsupported, why is es|ql telling me they are not compatible".
The actual, more useful, error message would be true reason for the incompatibility: "unsupported [TEXT] data type as JOIN right field"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did think about that when coding this, but left the message the same between all scenarios, both where the types mismatch and where the failure is because the right type was TEXT for two reasons:
- Simple validation code
- More likely to be future compatible with the case where we decide to support TEXT on the right when it does have an exact KEYWORD subfield (which I understand is actually very common).
However, I personally have no objection to adding a clearer error now, and simply removing it again later if we do support that case).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the end-user perspective it would be less confusion imo; and less explanation from us when someone asks :-)). I don't think it would be difficult to remove this afterwards (whenever that will be). +100 from me to have two different error messages.
Attribute leftField = config.leftFields().get(i); | ||
Attribute rightField = config.rightFields().get(i); | ||
if (leftField.dataType() != rightField.dataType()) { | ||
if (leftField.dataType().noText() != rightField.dataType().noText() || rightField.dataType().equals(TEXT)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest treating the TEXT
data type on the hand right side of the join as a separate situation with a separate error message. See my comment from 191_lookup_join_text.yml.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I did investigate this, and started writing a bit in the esql limitations page, before I discovered that there are no docs on LOOKUP JOIN at all, so it is not possible really to document a limitation to an undocumented feature. When we add these docs, we can document the limitation. |
💔 Backport failed
You can use sqren/backport to manually backport by running |
) When the join field on the right hand-side is a TEXT field, we cannot do an exact match. Since ES|QL treats TEXT fields as KEYWORD in all cases, ideally we would like to do the same for JOIN. However, this is achieved on the left-hand index in a way that is not easily achievable on the right-hand side. Comparing filtering and field extraction of left and right: * `FROM left` * FieldExtraction is done using `field.keyword` subfield if it exists, or from `_source` otherwise * Filtering is done by pushing down to Lucene `field.keyword` if it exists, or by not pushing down and filtering the value extracted from `_source` inside the compute engine itself * `LOOKUP JOIN right` * FieldExtraction is done simplistically, with no `_source` extraction * Filtering pushdown can be done with `field.keyword` if it exists, but we have no easy solution to filtering otherwise The decision taken is to disallow joining on TEXT fields, but allow explicit joining on the underlying keyword field (explicit in the query): | left type | right type | result | | --- | --- | --- | | KEYWORD | KEYWORD | ✅ Works | | TEXT | KEYWORD | ✅ Works | | KEYWORD | TEXT | ❌ Type mismatch error | | TEXT | TEXT | ❌ Type mismatch error | ``` FROM test | LOOKUP JOIN `test-lookup` ON color.keyword ``` ``` FROM test | RENAME color AS x | EVAL color.keyword = x | LOOKUP JOIN `test-lookup` ON color.keyword ``` ``` FROM test | EVAL color = color.keyword | LOOKUP JOIN `test-lookup` ON color ``` ``` FROM test | LOOKUP JOIN `test-lookup` ON color ```
…120732) When the join field on the right hand-side is a TEXT field, we cannot do an exact match. Since ES|QL treats TEXT fields as KEYWORD in all cases, ideally we would like to do the same for JOIN. However, this is achieved on the left-hand index in a way that is not easily achievable on the right-hand side. Comparing filtering and field extraction of left and right: * `FROM left` * FieldExtraction is done using `field.keyword` subfield if it exists, or from `_source` otherwise * Filtering is done by pushing down to Lucene `field.keyword` if it exists, or by not pushing down and filtering the value extracted from `_source` inside the compute engine itself * `LOOKUP JOIN right` * FieldExtraction is done simplistically, with no `_source` extraction * Filtering pushdown can be done with `field.keyword` if it exists, but we have no easy solution to filtering otherwise The decision taken is to disallow joining on TEXT fields, but allow explicit joining on the underlying keyword field (explicit in the query): | left type | right type | result | | --- | --- | --- | | KEYWORD | KEYWORD | ✅ Works | | TEXT | KEYWORD | ✅ Works | | KEYWORD | TEXT | ❌ Type mismatch error | | TEXT | TEXT | ❌ Type mismatch error | ``` FROM test | LOOKUP JOIN `test-lookup` ON color.keyword ``` ``` FROM test | RENAME color AS x | EVAL color.keyword = x | LOOKUP JOIN `test-lookup` ON color.keyword ``` ``` FROM test | EVAL color = color.keyword | LOOKUP JOIN `test-lookup` ON color ``` ``` FROM test | LOOKUP JOIN `test-lookup` ON color ```
- method: POST | ||
path: /_query | ||
parameters: [] | ||
capabilities: [lookup_join_text] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uh, I missed it, but this should also declare join_lookup_v11
.
This should help skipping this spec if something changes about basic lookup logic (such as grammar :) )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened a pr for it #120771
When the join field on the right hand-side is a TEXT field, we cannot do an exact match. Since ES|QL treats TEXT fields as KEYWORD in all cases, ideally we would like to do the same for JOIN. However, this is achieved on the left-hand index in a way that is not easily achievable on the right-hand side. Comparing filtering and field extraction of left and right:
FROM left
field.keyword
subfield if it exists, or from_source
otherwisefield.keyword
if it exists, or by not pushing down and filtering the value extracted from_source
inside the compute engine itselfLOOKUP JOIN right
_source
extractionfield.keyword
if it exists, but we have no easy solution to filtering otherwiseThe decision taken is to disallow joining on TEXT fields, but allow explicit joining on the underlying keyword field (explicit in the query):
Examples
KEYWORD-KEYWORD ✅
TEXT-KEYWORD ✅
KEYWORD-TEXT ❌
TEXT-TEXT ❌
Fixes #119062
Checklist: