Skip to content

Conversation

@mayya-sharipova
Copy link
Contributor

@mayya-sharipova mayya-sharipova commented Apr 25, 2023

Create .synonyms system index that is exposed
under es.synonyms_api_feature_flag.

This is the first task for creating Synonyms API management,
where synonyms will be stored in the .synonyms system index.

Relates to #38523

Create .synonyms system index that is exposed under es.synonyms_api_feature_flag.
@elasticsearchmachine elasticsearchmachine added the Team:Search Meta label for search team label Apr 25, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@elasticsearchmachine
Copy link
Collaborator

Hi @mayya-sharipova, I've created a changelog YAML for you.

@mayya-sharipova
Copy link
Contributor Author

@elasticsearchmachine run elasticsearch-ci/part-2

Copy link
Member

@carlosdelest carlosdelest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A question about creating a separate plugin for managing synonyms - besides that, LGTM

@javanna javanna requested a review from cbuescher April 26, 2023 08:35
@gmarouli gmarouli added v8.9.0 and removed v8.8.0 labels Apr 26, 2023
Copy link
Member

@cbuescher cbuescher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me as a first step. I have a question regarding future plans to where the rest of the API is going to live. Are the endpoints supposed to end up in x-pack, the main server project or the common-analysis module? Since synonyms management is strongly tied to synonym filters and they currently are part of commons-analysis, it makes sense to me to put the system index there as well, but I wonder if putting the API there will work as well, especially for testing etc... Not a blocker, just interested in the more general plan here. Also I left a question around the proposed mapping for the index.

builder.startObject("synonyms");
{
builder.field("type", "object");
builder.field("enabled", "false");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my understanding, this looks like we're leaning towards storing synonyms source-only, that is e.g. in this format:

{	"00001" : "foo => bar",	"00002": "lol, laughing out loud",	"00003": "i-phone, i phone => iphone" } 

Or maybe with more fields, but basically it would mean nothing in there is searchable and we'd have to load the whole source at once and parse stuff from a map? Also for updates this means we'd need to modify the source for every rule that is changed, appended etc...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The format, we are currently leaning towards is this:

{ "synonyms" : [ "foo => bar", "lol, laughing out loud", "i-phone, i phone => iphone" ] }

Where indeed synonyms are stored in source only inside a single field. And indeed for updates we modify the entire document (with its implications).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, so that would mean a slightly different mapping then I guess (probably synonyms as keyword type array then?)

Copy link
Contributor Author

@mayya-sharipova mayya-sharipova Apr 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, it looks more like keyword, but I chose object field type because of performance:

  1. enabled parameter can only be applied to object. This will skip parsing of the contents of the field entirely, which is very performant and what we need. While doing:
"synonyms": { "type": "keyword", "doc_values": false, "index" : false }

will still parse every synonym rule and does some extra work for indexing before retuning with empty index options.

  1. Event it is defined as object, by setting enabled:false, we accept any value even arrays, as we don't do any parsing.

  2. Having object with enabled:false, allows us to modify document structure later without modifying mapping. For example, in case we decide to adopt a format with fields.

WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me, I didn't think about not parsing the docs here at all but its a nice idea as long as we don't have to implement any fast filtering or search on anything. So the parsing logic would only live in whatever piece of code is reading the synonyms from index instead of from file like right now and basically treats source like a blob. Sounds good to me given the current scope of the UI etc...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drive-by comment: should we consider properly disabling parsing for fields that don't have any lucene footprint like in this example, rather than using type object? The current solution looks ok to me, but I wonder if it's solid enough given we discussed rethinking how to disable fields in the past (see #63369). This should not block progress on synonyms management API though.

@mayya-sharipova
Copy link
Contributor Author

@cbuescher Thanks for your review. Answering your questions:

Are the endpoints supposed to end up in x-pack, the main server project or the common-analysis module?

We have split opinion here.

@carlosdelest is in favour to put endpoints (and probably management of .synonym index) into a separate plugin.

I am in favour to keep everything under the common-analysis module for 2 reasons: 1) synonym management API going to be a general feature available to everyone 2) this feature is strongly tied to synonym filters which live in the common-analysis module.
But I am willing to reconsider.

@cbuescher cbuescher self-requested a review April 27, 2023 14:51
Copy link
Member

@cbuescher cbuescher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mayya-sharipova thanks for the additional explanations, LGTM

@mayya-sharipova mayya-sharipova merged commit 07adeb0 into elastic:main Apr 27, 2023
@elasticsearchmachine
Copy link
Collaborator

@mayya-sharipova according to this PR's labels, it should not have a changelog YAML, but I can't delete it because it is closed. Please either delete the changelog from the appropriate branch yourself, or adjust the labels.

@mayya-sharipova mayya-sharipova deleted the synonym_system_index branch June 13, 2023 19:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>non-issue :Search Relevance/Analysis How text is split into tokens Team:Search Meta label for search team v8.9.0

6 participants