Description
At the moment, a field can be ignored and land in the _ignored metadata field either because of ignore_malformed or because of ignore_above.
@ckauf reported feedback from a user who really likes the new default setting for ignore_malformed but finds it non-trivial to determine why a field was ignored: it could be due to either ignore_malformed or ignore_above.
This ambiguity will get worse with #96235 where fields can also be _ignored if the field limit is hit.
In this issue, I'd like to discuss options on how we could add an indication for the reason a field ended up being _ignored.
A potential solution would be to store an additional _ignored_reason metadata field alongside the _ignored field. Both fields would contain an array of strings. We can line up the positions of the two arrays so that we know exactly why each field has been ignored.
For example, if field foo has been ignored because of ignore_malformed and bar has been ignored because of ignore_above, we can store something like this:
{ "_ignored": ["foo", "bar"], "_ignored_reason": ["ignore_malformed", "ignore_above"] }

You might ask: doesn't Lucene de-duplicate and sort keyword doc_values, which would break this alignment? Yes, it does, but the _ignored field isn't stored in doc_values but in a stored field. While we'll want to add doc_values to _ignored in the future (see #59946), we don't necessarily need to remove the stored field. This would come at the expense of storage but it would greatly simplify these troubleshooting workflows.
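To make the positional pairing concrete, here is a minimal sketch of how a client might read the two arrays back. It assumes the proposed _ignored_reason field exists and is returned in the same order as _ignored; neither is the case today, and the helper name `ignored_reasons` is hypothetical.

```python
from typing import Dict, List


def ignored_reasons(doc: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Pair each ignored field with its reason by array position.

    Assumes the proposed (not yet existing) `_ignored_reason` array is
    aligned index-by-index with `_ignored`.
    """
    fields = doc.get("_ignored", [])
    reasons = doc.get("_ignored_reason", [])
    if len(fields) != len(reasons):
        # The arrays are only meaningful if they line up one-to-one.
        raise ValueError("_ignored and _ignored_reason differ in length")

    result: Dict[str, List[str]] = {}
    for field, reason in zip(fields, reasons):
        # A field may appear more than once, e.g. ignored for several reasons.
        result.setdefault(field, []).append(reason)
    return result


doc = {
    "_ignored": ["foo", "bar"],
    "_ignored_reason": ["ignore_malformed", "ignore_above"],
}
print(ignored_reasons(doc))
# {'foo': ['ignore_malformed'], 'bar': ['ignore_above']}
```

The length check reflects why the stored-field ordering matters: if either array were de-duplicated or sorted independently, the positions would no longer correspond.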