Skip to content

Conversation

craigtaverner
Copy link
Contributor

@craigtaverner craigtaverner commented Jan 2, 2025

When the join field on the right hand-side is a TEXT field, we cannot do an exact match. Since ES|QL treats TEXT fields as KEYWORD in all cases, ideally we would like to do the same for JOIN. However, this is achieved on the left-hand index in a way that is not easily achievable on the right-hand side. Comparing filtering and field extraction of left and right:

  • FROM left
    • FieldExtraction is done using field.keyword subfield if it exists, or from _source otherwise
    • Filtering is done by pushing down to Lucene field.keyword if it exists, or by not pushing down and filtering the value extracted from _source inside the compute engine itself
  • LOOKUP JOIN right
    • FieldExtraction is done simplistically, with no _source extraction
    • Filtering pushdown can be done with field.keyword if it exists, but we have no easy solution to filtering otherwise

The decision taken is to disallow joining on TEXT fields, but allow explicit joining on the underlying keyword field (explicit in the query):

left type right type result
KEYWORD KEYWORD ✅ Works
TEXT KEYWORD ✅ Works
KEYWORD TEXT ❌ Type mismatch error
TEXT TEXT ❌ Type mismatch error

Examples

KEYWORD-KEYWORD ✅

FROM test | LOOKUP JOIN `test-lookup` ON color.keyword 

TEXT-KEYWORD ✅

FROM test | RENAME color AS x | EVAL color.keyword = x | LOOKUP JOIN `test-lookup` ON color.keyword 

KEYWORD-TEXT ❌

FROM test | EVAL color = color.keyword | LOOKUP JOIN `test-lookup` ON color 

TEXT-TEXT ❌

FROM test | LOOKUP JOIN `test-lookup` ON color 

Fixes #119062

Checklist:

  • Add negative tests for expected error messages
  • Remove type validation from lookup operator, since we already have type validation in the validator
This is not sufficient, since we understand it does not consider the term query on the server side possibly performing a CONTAINS query. We should rather rewrite to the underlying keyword sub-field for exact matches.
@craigtaverner craigtaverner added >non-issue Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) auto-backport Automatically create backport pull requests when merged :Analytics/ES|QL AKA ESQL v9.0.0 v8.18.0 labels Jan 2, 2025
BytesRef value = bytesRefBlock.getBytesRef(offset, new BytesRef());
if (field.typeName().equals("text")) {
// Text fields involve case-insensitive contains queries, we need to use lowercase on the term query
return new BytesRef(value.utf8ToString().toLowerCase(Locale.ROOT));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mostly exists to show that matching against text fields "doesn't work like you think it does" in the ENRICH infrastructure we've borrowed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've deleted this now (or rather I'm throwing an exception, but might want to delete that too), and instead try get the planner to use an underlying KEYWORD sub-field, if it exists, using fa.exactAttribute().


private record MatchConfig(String fieldName, int channel, DataType type) {
private MatchConfig(FieldAttribute match, Layout.ChannelAndType input) {
// Note, this handles TEXT fields with KEYWORD subfields, and we assume tha has been validated earlier during planning
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Note, this handles TEXT fields with KEYWORD subfields, and we assume tha has been validated earlier during planning
// Note, this handles TEXT field with KEYWORD subfields, and we assume that has been validated earlier during planning
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please clarify "validated" means here? Is it existance of the nested .keyword field? If so could you please help me understand where such validation is happening?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR is still in draft, but I planned to find that location and add such validation myself. It does not yet exist.

Comment on lines 83 to 84
String typeName = fieldType.typeName().equals("text") ? DataType.KEYWORD.typeName() : fieldType.typeName();
if (typeName.equals(inputDataType.noText().typeName()) == false) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it is worth resolving typeName wirh fromTypeName here to avoid comparing strings?
Something like

Suggested change
String typeName = fieldType.typeName().equals("text") ? DataType.KEYWORD.typeName() : fieldType.typeName();
if (typeName.equals(inputDataType.noText().typeName()) == false) {
if (Objects.equals(DataType.fromTypeName(fieldType.typeName()).noText(), inputDataType.noText()) == false) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually thinking of deleting this method entirely, since it is shadowed by the newer validation happening in the Validator. In the current draft I fixed both methods, but in reality we only need the one.

@craigtaverner craigtaverner marked this pull request as ready for review January 17, 2025 09:58
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@astefan astefan self-requested a review January 21, 2025 15:32
Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general it's looking good 👍
The only concern I have is the clarity of the error message (added a comment, more like a strong suggestion if you agree with it).
Also, on the same idea, I want to raise the question of documenting this limitation, if it's something we should do.

catch: "bad_request"

- match: { error.type: "verification_exception" }
- contains: { error.reason: "Found 1 problem\nline 1:55: JOIN left field [color] of type [TEXT] is incompatible with right field [color] of type [TEXT]" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an user, I would be confused about this message. As in "both types are the same, they are not unsupported, why is es|ql telling me they are not compatible".

The actual, more useful, error message would be true reason for the incompatibility: "unsupported [TEXT] data type as JOIN right field"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did think about that when coding this, but left the message the same between all scenarios, both where the types mismatch and where the failure is because the right type was TEXT for two reasons:

  • Simple validation code
  • More likely to be future compatible with the case where we decide to support TEXT on the right when it does have an exact KEYWORD subfield (which I understand is actually very common).

However, I personally have no objection to adding a clearer error now, and simply removing it again later if we do support that case).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the end-user perspective it would be less confusion imo; and less explanation from us when someone asks :-)). I don't think it would be difficult to remove this afterwards (whenever that will be). +100 from me to have two different error messages.

Attribute leftField = config.leftFields().get(i);
Attribute rightField = config.rightFields().get(i);
if (leftField.dataType() != rightField.dataType()) {
if (leftField.dataType().noText() != rightField.dataType().noText() || rightField.dataType().equals(TEXT)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest treating the TEXT data type on the hand right side of the join as a separate situation with a separate error message. See my comment from 191_lookup_join_text.yml.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Done.

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@craigtaverner
Copy link
Contributor Author

Also, on the same idea, I want to raise the question of documenting this limitation, if it's something we should do

I did investigate this, and started writing a bit in the esql limitations page, before I discovered that there are no docs on LOOKUP JOIN at all, so it is not possible really to document a limitation to an undocumented feature. When we add these docs, we can document the limitation.

@craigtaverner craigtaverner merged commit ec546e3 into elastic:main Jan 23, 2025
16 checks passed
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
8.x Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 119473

craigtaverner added a commit to craigtaverner/elasticsearch that referenced this pull request Jan 23, 2025
) When the join field on the right hand-side is a TEXT field, we cannot do an exact match. Since ES|QL treats TEXT fields as KEYWORD in all cases, ideally we would like to do the same for JOIN. However, this is achieved on the left-hand index in a way that is not easily achievable on the right-hand side. Comparing filtering and field extraction of left and right: * `FROM left` * FieldExtraction is done using `field.keyword` subfield if it exists, or from `_source` otherwise * Filtering is done by pushing down to Lucene `field.keyword` if it exists, or by not pushing down and filtering the value extracted from `_source` inside the compute engine itself * `LOOKUP JOIN right` * FieldExtraction is done simplistically, with no `_source` extraction * Filtering pushdown can be done with `field.keyword` if it exists, but we have no easy solution to filtering otherwise The decision taken is to disallow joining on TEXT fields, but allow explicit joining on the underlying keyword field (explicit in the query): | left type | right type | result | | --- | --- | --- | | KEYWORD | KEYWORD | ✅ Works | | TEXT | KEYWORD | ✅ Works | | KEYWORD | TEXT | ❌ Type mismatch error | | TEXT | TEXT | ❌ Type mismatch error | ``` FROM test | LOOKUP JOIN `test-lookup` ON color.keyword ``` ``` FROM test | RENAME color AS x | EVAL color.keyword = x | LOOKUP JOIN `test-lookup` ON color.keyword ``` ``` FROM test | EVAL color = color.keyword | LOOKUP JOIN `test-lookup` ON color ``` ``` FROM test | LOOKUP JOIN `test-lookup` ON color ```
elasticsearchmachine pushed a commit that referenced this pull request Jan 23, 2025
…120732) When the join field on the right hand-side is a TEXT field, we cannot do an exact match. Since ES|QL treats TEXT fields as KEYWORD in all cases, ideally we would like to do the same for JOIN. However, this is achieved on the left-hand index in a way that is not easily achievable on the right-hand side. Comparing filtering and field extraction of left and right: * `FROM left` * FieldExtraction is done using `field.keyword` subfield if it exists, or from `_source` otherwise * Filtering is done by pushing down to Lucene `field.keyword` if it exists, or by not pushing down and filtering the value extracted from `_source` inside the compute engine itself * `LOOKUP JOIN right` * FieldExtraction is done simplistically, with no `_source` extraction * Filtering pushdown can be done with `field.keyword` if it exists, but we have no easy solution to filtering otherwise The decision taken is to disallow joining on TEXT fields, but allow explicit joining on the underlying keyword field (explicit in the query): | left type | right type | result | | --- | --- | --- | | KEYWORD | KEYWORD | ✅ Works | | TEXT | KEYWORD | ✅ Works | | KEYWORD | TEXT | ❌ Type mismatch error | | TEXT | TEXT | ❌ Type mismatch error | ``` FROM test | LOOKUP JOIN `test-lookup` ON color.keyword ``` ``` FROM test | RENAME color AS x | EVAL color.keyword = x | LOOKUP JOIN `test-lookup` ON color.keyword ``` ``` FROM test | EVAL color = color.keyword | LOOKUP JOIN `test-lookup` ON color ``` ``` FROM test | LOOKUP JOIN `test-lookup` ON color ```
- method: POST
path: /_query
parameters: []
capabilities: [lookup_join_text]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh, I missed it, but this should also declare join_lookup_v11.
This should help skipping this spec if something changes about basic lookup logic (such as grammar :) )

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened a pr for it #120771

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL auto-backport Automatically create backport pull requests when merged >non-issue Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.18.0 v9.0.0

6 participants