Skip to content

Conversation

@craigtaverner
Copy link
Contributor

@craigtaverner craigtaverner commented Mar 31, 2023

Part of the work described in #85274

Within time-series data, two optimisations are done:

Tasks to complete before merging:

  • Yaml tests for time-series
    • Basic tests sorting by tsid ASC and DESC
    • Basic test with terms agg on same field as tsid
    • test for multiple terms and TSID
    • Negative tests (disallowed sort field)
  • Unit tests for time-series
    • Test ASC and DESC for single TSID
    • test ASC and DESC for terms on TSID
    • Test for multiple TSID
    • test for multiple terms and TSID

Final decision on when to activate optimized geo_line:

  • geo_line inside time_series (required)
  • geo_line on position metric, decided to not require this, and instead:
    • Automatically default sort field to @timestamp
    • Throw exception on any other sort field setting

So now the switch between old and new geo_line behaviour is exclusively based on its nesting inside a time_series aggregation.

Work to be done in additional PRs:

@craigtaverner craigtaverner added >enhancement :Analytics/Geo Indexing, search aggregations of geo points and shapes Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) labels Mar 31, 2023
@elasticsearchmachine
Copy link
Collaborator

Hi @craigtaverner, I've created a changelog YAML for you.

@gmarouli gmarouli added v8.9.0 and removed v8.8.0 labels Apr 26, 2023
@craigtaverner craigtaverner force-pushed the geo_line_tsdb branch 2 times, most recently from 5dc6128 to d5fb222 Compare May 17, 2023 13:08
@craigtaverner craigtaverner marked this pull request as ready for review May 30, 2023 13:55
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@craigtaverner craigtaverner changed the title WIP Started geo_line for TSDB work Asset tracking: geo_line for TSDB May 30, 2023
Copy link
Contributor

@iverase iverase left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few comments, it is looking good. My only concerns at the moment is about sort validation on the aggregation when running in time series mode. In particular we discussed the possibility on not defining it for time series so it is implicit?

@craigtaverner
Copy link
Contributor Author

Thanks for the review. I'll certainly add the missing sort-field checks and associated tests, and make other fixes. Probably only next week though.

Starting with YAML tests (which currently pass) and AggregatorTests (currently failing, likely due to mistake in the tests)
* Created TimeSeries version of GeoLineAggregator, and wired it in so that time-series aggregations use it, but current behavior is still identical to non-time-series. * Added both yaml and unit tests for testing that geo_line works with correct results in both time-series and non-time-series cases. * Added additional tests to verify the grouping behaviour of time-series vs. terms aggs, and the combination of the two.
* Started refactoring to re-use simplifier for all buckets
The bucket id can change within a segment, so we need to detect this and save the geo_line.
The original geo_line relied on the BucketedSort for all intelligence. The time-series geo_line uses none of that, and does its own memory management.
And enhanced unit tests to cover multiple groups
Only activate the time-series optimizations if the aggregation is both: * Within a time-series aggregation (ie. tsid and @timestamp ordered) * The geo_line sort field is @timestamp
Also disables the new geo_line for time-series even if the correct sort and point fields are used if the point field is not explicitly configured to be a position metric.
defaultValueSourceType()
);
configs.put(SORT_FIELD.getPreferredName(), sourceConfig);
} else if (false && sourceConfig.fieldContext().field().equals(TIMESTAMP_FIELD.getName()) == false) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is always false?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add test on the builder for the different combinations, for example if the sort field is null and not a time series I expect it to fail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the geo_line builder only changes behaviour when nested inside the time_series builder, the builder tests cannot cover this (they test only parsing, which has not changed, and already allow missing sort fields, so no change there).

However, we added negative tests in the yaml (assert exception is thrown on incompatible sort field), and positive tests in both yaml and aggregator tests (the aggregator tests also verify that different geo_line are produced from the same data when passed through the different algorithms, and that the same geo_line is produced if no truncation/simplification happens, even with different algorithms).

Curiously I think there never was a test for a null sort-field throwing assertion for classic geo_line. That is missing test coverage from the old code. I could add that in this PR, or make a new one for that. In the case of time-series I think I have good coverage, so this PR is OK from that perspective.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do that as a follow up, this PR is already pretty big.

Since the primary criteria for switching to the new algorithm is that geo_line is within a time-series aggregation, we now disallow any other sort field. We test the negative case in the yaml tests, but changed the unit tests to use TermsAggregation to minim the time-series aggregation to get comparable results.
Copy link
Contributor

@iverase iverase left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

The old code only threw error if there was data because the check was done inside the leaf collector just before actually reading the sort field. And there were no tests for missing sort field. This commit adds the tests, and checks early so even if data is missing.
* Test that behaviour is identical with or without POSITION metric * Removed fallback code in builder (was switching to old geo_line without POSITION metric) * Removed two TODO's that are no longer valid concerns
@craigtaverner craigtaverner merged commit 9dbce6f into elastic:main Jun 15, 2023
@craigtaverner craigtaverner changed the title Asset tracking: geo_line for TSDB Asset tracking: line-simplification for geo_line in TSDB Jun 15, 2023
elasticsearchmachine pushed a commit that referenced this pull request Jun 22, 2023
…96953) Recently we added support for line-simplification in geo-lines when in time-series aggregations: #94954. This work did not, however, cover the case of `MergedGeoLines`, which occurs when the same geo_line covers multiple shards on data nodes, and needs to be merged on the coordinating node. The original work would use line-simplification and time-ordering on the data node, but then revert to re-sorting (unnecessary) and truncation (incorrect) on the coordinating node. This PR rectifies that, and adds two fields to the InternalGeoLine: * nonOverlapping * re-sorting is *NOT* required because the sort value ranges do not overlap, so we only sort the set of incoming geo_lines (easy/fast) instead of every single point (complex/slow). * simplified * The approach to memory limiting is by line-simplification instead of truncation. * If the data nodes are doing line-simplification, we want the coordinating node to do that as well For time-series, both of the above are true, and at this point all other cases will have both false. The reason for two booleans is: * They are not necessarily related concepts * We consider supporting line-simplification as a user-requested option in the non-time-series geo_line in future (much nicer than truncation) Fixes #96983
craigtaverner added a commit to craigtaverner/elasticsearch that referenced this pull request Jun 22, 2023
…lastic#96953) Recently we added support for line-simplification in geo-lines when in time-series aggregations: elastic#94954. This work did not, however, cover the case of `MergedGeoLines`, which occurs when the same geo_line covers multiple shards on data nodes, and needs to be merged on the coordinating node. The original work would use line-simplification and time-ordering on the data node, but then revert to re-sorting (unnecessary) and truncation (incorrect) on the coordinating node. This PR rectifies that, and adds two fields to the InternalGeoLine: * nonOverlapping * re-sorting is *NOT* required because the sort value ranges do not overlap, so we only sort the set of incoming geo_lines (easy/fast) instead of every single point (complex/slow). * simplified * The approach to memory limiting is by line-simplification instead of truncation. * If the data nodes are doing line-simplification, we want the coordinating node to do that as well For time-series, both of the above are true, and at this point all other cases will have both false. The reason for two booleans is: * They are not necessarily related concepts * We consider supporting line-simplification as a user-requested option in the non-time-series geo_line in future (much nicer than truncation) Fixes elastic#96983
elasticsearchmachine pushed a commit that referenced this pull request Jun 22, 2023
…96953) (#97005) Recently we added support for line-simplification in geo-lines when in time-series aggregations: #94954. This work did not, however, cover the case of `MergedGeoLines`, which occurs when the same geo_line covers multiple shards on data nodes, and needs to be merged on the coordinating node. The original work would use line-simplification and time-ordering on the data node, but then revert to re-sorting (unnecessary) and truncation (incorrect) on the coordinating node. This PR rectifies that, and adds two fields to the InternalGeoLine: * nonOverlapping * re-sorting is *NOT* required because the sort value ranges do not overlap, so we only sort the set of incoming geo_lines (easy/fast) instead of every single point (complex/slow). * simplified * The approach to memory limiting is by line-simplification instead of truncation. * If the data nodes are doing line-simplification, we want the coordinating node to do that as well For time-series, both of the above are true, and at this point all other cases will have both false. The reason for two booleans is: * They are not necessarily related concepts * We consider supporting line-simplification as a user-requested option in the non-time-series geo_line in future (much nicer than truncation) Fixes #96983
@elasticsearchmachine
Copy link
Collaborator

@craigtaverner according to this PR's labels, I need to update the changelog YAML, but I can't because the PR is closed. Please either update the changelog yourself on the appropriate branch, or adjust the labels. Specifically:

  • The PR is labelled release highlight but the changelog has no highlight section
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/Geo Indexing, search aggregations of geo points and shapes >enhancement release highlight Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.9.0

4 participants