Initial support for TEXT fields in LOOKUP JOIN condition #119473

craigtaverner · 2025-01-02T18:19:39Z

When the join field on the right-hand side is a TEXT field, we cannot do an exact match. Since ES|QL treats TEXT fields as KEYWORD in all cases, ideally we would like to do the same for JOIN. However, the way this is achieved for the left-hand index is not easily reproduced on the right-hand side. Comparing filtering and field extraction on the left and right:

FROM left
- FieldExtraction is done using field.keyword subfield if it exists, or from _source otherwise
- Filtering is done by pushing down to Lucene field.keyword if it exists, or by not pushing down and filtering the value extracted from _source inside the compute engine itself
LOOKUP JOIN right
- FieldExtraction is done simplistically, with no _source extraction
- Filtering pushdown can be done with field.keyword if it exists, but we have no easy solution to filtering otherwise
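
The asymmetry above can be sketched as a toy model. This is only an illustration, not the ES|QL planner code: the class `TextFieldAccess` and both method names are hypothetical, and the `.keyword` suffix simply follows the `field.keyword` convention used in the list.

```java
import java.util.Optional;

// Toy model of the left/right asymmetry; hypothetical names, not the real ES|QL planner.
final class TextFieldAccess {

    // Left side (FROM): an exact value is always reachable,
    // either through the keyword subfield or by falling back to _source.
    static String leftExactSource(String field, boolean hasKeywordSubfield) {
        return hasKeywordSubfield ? field + ".keyword" : "_source";
    }

    // Right side (LOOKUP JOIN): only the keyword subfield is usable,
    // because there is no _source extraction to fall back on.
    static Optional<String> rightExactSource(String field, boolean hasKeywordSubfield) {
        return hasKeywordSubfield ? Optional.of(field + ".keyword") : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(leftExactSource("color", true));
        System.out.println(leftExactSource("color", false));
        System.out.println(rightExactSource("color", true));
        System.out.println(rightExactSource("color", false));
    }
}
```

The empty `Optional` on the right side corresponds to the case where there is "no easy solution", which is what motivates the decision below.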

The decision taken is to disallow joining on TEXT fields, but allow explicit joining on the underlying keyword field (explicit in the query):

left type	right type	result
KEYWORD	KEYWORD	✅ Works
TEXT	KEYWORD	✅ Works
KEYWORD	TEXT	❌ Type mismatch error
TEXT	TEXT	❌ Type mismatch error

Examples

KEYWORD-KEYWORD ✅

FROM test | LOOKUP JOIN `test-lookup` ON color.keyword

TEXT-KEYWORD ✅

FROM test | RENAME color AS x | EVAL color.keyword = x | LOOKUP JOIN `test-lookup` ON color.keyword

KEYWORD-TEXT ❌

FROM test | EVAL color = color.keyword | LOOKUP JOIN `test-lookup` ON color

TEXT-TEXT ❌

FROM test | LOOKUP JOIN `test-lookup` ON color

Fixes #119062

Checklist:

Add negative tests for expected error messages
Remove type validation from lookup operator, since we already have type validation in the validator

This is not sufficient, since it does not account for the server-side term query possibly behaving like a CONTAINS query. We should instead rewrite to the underlying keyword sub-field for exact matches.

nik9000 · 2025-01-02T18:21:23Z

...k/plugin/esql/compute/src/main/java/org/elasticsearch/compute/operator/lookup/QueryList.java

+ BytesRef value = bytesRefBlock.getBytesRef(offset, new BytesRef());
+ if (field.typeName().equals("text")) {
+ // Text fields involve case-insensitive contains queries, we need to use lowercase on the term query
+ return new BytesRef(value.utf8ToString().toLowerCase(Locale.ROOT));


This mostly exists to show that matching against text fields "doesn't work like you think it does" in the ENRICH infrastructure we've borrowed.

I've deleted this now (or rather I'm throwing an exception, but might want to delete that too), and instead try to get the planner to use the underlying KEYWORD sub-field, if it exists, using fa.exactAttribute().
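
To illustrate why a term query against a TEXT field "doesn't work like you think it does", here is a minimal sketch. It is a toy model, not Lucene or the ENRICH code: the analyzer is assumed to only lowercase and split on whitespace, and `TextTermMatch`, `analyze`, and `termMatches` are hypothetical names.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Toy model of term matching against an analyzed text field; not Lucene.
final class TextTermMatch {

    // Simplified analyzer (an assumption): lowercase the value, then split on whitespace.
    static Set<String> analyze(String value) {
        return Arrays.stream(value.toLowerCase(Locale.ROOT).split("\\s+"))
            .filter(token -> token.isEmpty() == false)
            .collect(Collectors.toSet());
    }

    // The term itself is not analyzed; it is compared against the indexed tokens as-is.
    static boolean termMatches(String indexedValue, String term) {
        return analyze(indexedValue).contains(term);
    }

    public static void main(String[] args) {
        System.out.println(termMatches("Dark Red", "Dark Red")); // false: no single token equals the full value
        System.out.println(termMatches("Dark Red", "Red"));      // false: tokens were lowercased at index time
        System.out.println(termMatches("Dark Red", "red"));      // true: behaves like a case-insensitive "contains"
    }
}
```

In this model, lowercasing the term makes some lookups succeed, but the match is on a token rather than on the whole value, which is why an exact KEYWORD sub-field is the safer target.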

idegtiarenko · 2025-01-07T10:33:09Z

...ck/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/planner/LocalExecutionPlanner.java


+ private record MatchConfig(String fieldName, int channel, DataType type) {
+ private MatchConfig(FieldAttribute match, Layout.ChannelAndType input) {
+ // Note, this handles TEXT fields with KEYWORD subfields, and we assume tha has been validated earlier during planning


Suggested change

// Note, this handles TEXT fields with KEYWORD subfields, and we assume tha has been validated earlier during planning

// Note, this handles TEXT field with KEYWORD subfields, and we assume that has been validated earlier during planning

Could you please clarify what "validated" means here? Is it the existence of the nested .keyword field? If so, could you please help me understand where such validation is happening?

The PR is still in draft, but I planned to find that location and add such validation myself. It does not yet exist.

idegtiarenko · 2025-01-07T10:43:36Z

...ck/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/enrich/LookupFromIndexService.java

+ String typeName = fieldType.typeName().equals("text") ? DataType.KEYWORD.typeName() : fieldType.typeName();
+ if (typeName.equals(inputDataType.noText().typeName()) == false) {


I wonder if it is worth resolving typeName with fromTypeName here to avoid comparing strings?
Something like

Suggested change

String typeName = fieldType.typeName().equals("text") ? DataType.KEYWORD.typeName() : fieldType.typeName();

if (typeName.equals(inputDataType.noText().typeName()) == false) {

if (Objects.equals(DataType.fromTypeName(fieldType.typeName()).noText(), inputDataType.noText()) == false) {

I'm actually thinking of deleting this method entirely, since it is shadowed by the newer validation happening in the Validator. In the current draft I fixed both methods, but in reality we only need the one.

elasticsearchmachine · 2025-01-17T09:58:31Z

Pinging @elastic/es-analytical-engine (Team:Analytics)

astefan

In general it's looking good 👍
The only concern I have is the clarity of the error message (added a comment, more like a strong suggestion if you agree with it).
Also, on the same idea, I want to raise the question of documenting this limitation, if it's something we should do.

astefan · 2025-01-22T09:53:02Z

x-pack/plugin/src/yamlRestTest/resources/rest-api-spec/test/esql/191_lookup_join_text.yml

+ catch: "bad_request"
+
+ - match: { error.type: "verification_exception" }
+ - contains: { error.reason: "Found 1 problem\nline 1:55: JOIN left field [color] of type [TEXT] is incompatible with right field [color] of type [TEXT]" }


As a user, I would be confused by this message. As in: "both types are the same, they are not unsupported, so why is ES|QL telling me they are not compatible?"

The actual, more useful, error message would state the true reason for the incompatibility: "unsupported [TEXT] data type as JOIN right field"

I did think about that when coding this, but left the message the same between all scenarios, both where the types mismatch and where the failure is because the right type was TEXT for two reasons:

Simple validation code

More likely to be future compatible with the case where we decide to support TEXT on the right when it does have an exact KEYWORD subfield (which I understand is actually very common).

However, I personally have no objection to adding a clearer error now, and simply removing it again later if we do support that case.

From the end-user perspective it would be less confusing imo, and it would mean less explanation from us when someone asks :-)). I don't think it would be difficult to remove this afterwards (whenever that will be). +100 from me to have two different error messages.

astefan · 2025-01-22T09:55:03Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/join/Join.java

 Attribute leftField = config.leftFields().get(i);
 Attribute rightField = config.rightFields().get(i);
- if (leftField.dataType() != rightField.dataType()) {
+ if (leftField.dataType().noText() != rightField.dataType().noText() || rightField.dataType().equals(TEXT)) {


I suggest treating the TEXT data type on the right-hand side of the join as a separate situation with a separate error message. See my comment on 191_lookup_join_text.yml.
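
A minimal sketch of what two separate checks could look like, under stated assumptions: `JoinFieldCheck`, its `FieldType` enum, and `validate` are hypothetical stand-ins rather than the real `Join` or `DataType` classes. The "incompatible" wording is taken from the YAML test quoted earlier, the "unsupported" wording is the suggestion from that review comment, and the message that was finally merged may differ.

```java
import java.util.Optional;

// Toy validation with two distinct error messages; hypothetical names, not the real Join class.
final class JoinFieldCheck {

    // Small stand-in for the ES|QL DataType enum.
    enum FieldType {
        KEYWORD, TEXT, LONG;

        // TEXT is compared as KEYWORD, mirroring the idea behind noText().
        FieldType noText() {
            return this == TEXT ? KEYWORD : this;
        }
    }

    // Returns an error message, or empty if the pair of join fields is acceptable.
    static Optional<String> validate(String leftName, FieldType left, String rightName, FieldType right) {
        if (right == FieldType.TEXT) {
            // Separate situation: the right side is TEXT, regardless of the left type.
            return Optional.of("unsupported [TEXT] data type as JOIN right field");
        }
        if (left.noText() != right.noText()) {
            // Genuine type mismatch between the two sides.
            return Optional.of("JOIN left field [" + leftName + "] of type [" + left
                + "] is incompatible with right field [" + rightName + "] of type [" + right + "]");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(validate("color", FieldType.TEXT, "color", FieldType.TEXT));
        System.out.println(validate("color", FieldType.KEYWORD, "color", FieldType.LONG));
        System.out.println(validate("color", FieldType.TEXT, "color", FieldType.KEYWORD));
    }
}
```

Checking the right-side TEXT case first means a TEXT-TEXT join no longer reports two identical types as "incompatible".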

astefan

LGTM

craigtaverner · 2025-01-23T11:39:16Z

Also, on the same idea, I want to raise the question of documenting this limitation, if it's something we should do

I did investigate this, and started writing a bit in the esql limitations page, before I discovered that there are no docs on LOOKUP JOIN at all, so it is not possible really to document a limitation to an undocumented feature. When we add these docs, we can document the limitation.

elasticsearchmachine · 2025-01-23T13:06:11Z

💔 Backport failed

Status	Branch	Result
❌	8.x	Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 119473

idegtiarenko · 2025-01-24T07:43:17Z

x-pack/plugin/src/yamlRestTest/resources/rest-api-spec/test/esql/191_lookup_join_text.yml

+ - method: POST
+ path: /_query
+ parameters: []
+ capabilities: [lookup_join_text]


Uh, I missed it, but this should also declare join_lookup_v11.
This should help skip this spec if something changes in the basic lookup logic (such as the grammar :) )

Opened a pr for it #120771

Initial support for TEXT fields in LOOKUP JOIN condition

a4f3b5e

This is not sufficient, since it does not account for the server-side term query possibly behaving like a CONTAINS query. We should instead rewrite to the underlying keyword sub-field for exact matches.

craigtaverner added the >non-issue, Team:Analytics, auto-backport, :Analytics/ES|QL, v9.0.0, and v8.18.0 labels Jan 2, 2025

craigtaverner requested a review from idegtiarenko January 2, 2025 18:19

nik9000 reviewed Jan 2, 2025

craigtaverner added 6 commits January 3, 2025 17:26

Better approach, using KEYWORD subfield

c1744c9

Merge branch 'main' into lookup_join_text

8eb6554

Add new capability to block mixed-cluster tests

69572e5

Merge branch 'main' into lookup_join_text

417d3df

Merge branch 'main' into lookup_join_text

e799ef9

Merge branch 'main' into lookup_join_text

772659d

idegtiarenko reviewed Jan 7, 2025

craigtaverner added 6 commits January 16, 2025 18:50

Simplify lookup-join validation and remove keyword subfield validation

93779f9

Merge remote-tracking branch 'origin/main' into lookup_join_text

cbcd532

Added yaml tests for failing TEXT lookup and added validation for that

0146742

Fixed failing tests

e327808

Missing capability check

77fc5bf

Merge branch 'main' into lookup_join_text

2b2f0dc

craigtaverner marked this pull request as ready for review January 17, 2025 09:58

craigtaverner mentioned this pull request Jan 17, 2025

ESQL: Make LOOKUP more left-joiny #119475

Merged

craigtaverner added 3 commits January 17, 2025 12:02

Merge branch 'main' into lookup_join_text

4380656

Merge branch 'main' into lookup_join_text

6a56279

Simplify long comment

3491006

astefan self-requested a review January 21, 2025 15:32

astefan reviewed Jan 22, 2025

craigtaverner added 2 commits January 22, 2025 16:48

Use clearer error message when joining with right-side TEXT

d5d7cee

Merge branch 'main' into lookup_join_text

879ecd4

astefan approved these changes Jan 22, 2025

craigtaverner added 4 commits January 22, 2025 17:35

Fix failing test

724153c

Merge remote-tracking branch 'origin/main' into lookup_join_text

863cfc9

Merge remote-tracking branch 'origin/main' into lookup_join_text

26a76f9

Merge branch 'main' into lookup_join_text

f0ce28a

craigtaverner merged commit ec546e3 into elastic:main Jan 23, 2025
16 checks passed

elasticsearchmachine added the backport pending label Jan 23, 2025

craigtaverner mentioned this pull request Jan 23, 2025

[8.x] Initial support for TEXT fields in LOOKUP JOIN condition (#119473) #120732

Merged

idegtiarenko reviewed Jan 24, 2025

alex-spies removed the backport pending label Jan 27, 2025

ivancea mentioned this pull request Jan 28, 2025

ESQL: Ignore multivalued key columns in lookup index on JOIN #120726

Merged

6 participants
