Skip to content

Conversation

@Kiriakos1998
Copy link
Contributor

Add support for pattern_replace type of normalizer for a keyword. After these changes, this setting won't be giving an exception.
{ "settings": { "analysis": { "normalizer": { "my_normalizer": { "type": "pattern_replace", "pattern": "^0+", "replacement": "", "all": false } } } }, "mappings": { "properties": { "pagerank": { "type": "keyword", "normalizer": "my_normalizer" } } } }
The implementation used the CustomAnalyzerProvider adding two fields for the TokenFieldFactories and CharFieldFactories that are not pre-configured. So in buildMapping when the type pattern_replace is encountered its filters are retrieved from the charFilters and tokenFilters registered in AnalysisRegistry. Closes(#83005)

@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v8.9.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels Jun 5, 2023
@pxsalehi pxsalehi added :Search Relevance/Analysis How text is split into tokens and removed needs:triage Requires assignment of a team area label labels Jun 6, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Search Meta label for search team label Jun 6, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@cbuescher
Copy link
Member

Hi @Kiriakos1998, thanks for looking at this issue and opening a PR.
I had a quick look and I think most of the changes in the registry classes are not needed.
In theory, whats blocking using a "pattern_replace" filter in normalizers today is the check in CustomNormalizerProvider. Filters that are allowed in normalizers need to implement NormalizingTokenFilterFactory. After that an analysis configuration like this:

PUT /test { "settings": { "analysis": { "normalizer": { "my_normalizer": { "type": "custom", "filter": [ "replace_zeros" ] } }, "filter": { "replace_zeros": { "type": "pattern_replace", "pattern": "0+", "replacement": "", "all" : true } } } }, "mappings": { "properties": { "pagerank": { "type": "keyword", "normalizer": "my_normalizer" } } } } 

Should be accepted and working. This would need some additional tests of course, but again it should be sufficient to add a case for using the "pattern_replace" filter in the rest tests under modules/analysis-common/src/yamlRestTest/resources/rest-api-spec/test/analysis-common/40_token_filters.yml
Take a look at how "elision" filter works in normalizers, that's a filter that is already allowed and has some own configuration.
Please let me know if you want to modify your PR and still work on this.

@Kiriakos1998
Copy link
Contributor Author

Hello @cbuescher thanks for the feedback. I want to modify this PR and work on this issue. I missconcepted what had to be done in the first place but I think it's clear now. We need to be able to implement a filter with pattern_replace type and add it in the custom type normalizer.

@cbuescher
Copy link
Member

Great, have a look at TrimTokenFilterTests and how it tests that everything is working in a normalizer. It should be possible do do a very similar test for PatternReplace.

@cbuescher cbuescher self-assigned this Jun 9, 2023
@cbuescher cbuescher self-requested a review June 9, 2023 12:25
@Kiriakos1998
Copy link
Contributor Author

Hi @cbuescher, just pushed the changes. Indeed, implementing the NormalizingTokenFilterFactory did the job.

@cbuescher
Copy link
Member

Looks great, thanks. I will run our CI tests next.
@elasticmachine test this please

@Kiriakos1998
Copy link
Contributor Author

@cbuescher I think it failed because it was a lot of commits behind the main(an 8.81. tar for Linux distribution was not created). I updated the branch with the latest commits.

@cbuescher
Copy link
Member

Yes, lets try again
@elasticmachine test this please

@cbuescher
Copy link
Member

Looks like the test are okay now. Sorry I forgot to ask for one more little change. Could you add the "pattern_replace" filter to the list of token filters allowed in normalizers in the docs at docs/reference/analysis/normalizers.asciidoc. That would be great. I will re-run tests and merge after that.

@Kiriakos1998
Copy link
Contributor Author

Sure no problem

@Kiriakos1998
Copy link
Contributor Author

Done

@cbuescher
Copy link
Member

@elasticmachine test this please

@cbuescher cbuescher merged commit a8cf4d6 into elastic:main Jun 9, 2023
@cbuescher
Copy link
Member

@Kiriakos1998, Thanks a lot for your PR, it is merged to the 8.9 line now.

@Kiriakos1998 Kiriakos1998 deleted the support_for_pattern_replace_filter_in_keyword_normalizer branch July 2, 2023 10:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>enhancement external-contributor Pull request authored by a developer outside the Elasticsearch team :Search Relevance/Analysis How text is split into tokens Team:Search Meta label for search team v8.9.0

4 participants