Asset tracking: line-simplification for geo_line in TSDB #94954

craigtaverner · 2023-03-31T13:28:39Z

Part of the work described in #85274

Within time-series data, two optimisations are done:

- Optimizations based on TSDB data ordering (tsid and time-series order)
- Optimizations based on line simplification, from Geometry simplifier (streaming) #94859
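
To illustrate the second optimization, here is a minimal sketch of streaming simplification: points arrive in time order and, once the buffer is full, one interior point is dropped instead of truncating the line. It assumes a Visvalingam-Whyatt style rule (drop the point whose triangle with its two neighbours has the smallest area). The class `StreamingLineSimplifier` and its methods are hypothetical names for illustration, not the simplifier from #94859.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: keeps at most maxPoints by dropping the least significant interior point. */
final class StreamingLineSimplifier {
    private final int maxPoints;
    private final List<double[]> points = new ArrayList<>(); // each entry is {x, y}

    StreamingLineSimplifier(int maxPoints) {
        if (maxPoints < 3) {
            throw new IllegalArgumentException("maxPoints must be at least 3");
        }
        this.maxPoints = maxPoints;
    }

    /** Adds a point in arrival (time) order, simplifying if the buffer overflows. */
    void add(double x, double y) {
        points.add(new double[] { x, y });
        if (points.size() > maxPoints) {
            removeLeastSignificant();
        }
    }

    List<double[]> points() {
        return points;
    }

    private void removeLeastSignificant() {
        int best = 1;
        double bestArea = Double.MAX_VALUE;
        // Endpoints are never removed, so only interior points are candidates.
        for (int i = 1; i < points.size() - 1; i++) {
            double area = triangleArea(points.get(i - 1), points.get(i), points.get(i + 1));
            if (area < bestArea) {
                bestArea = area;
                best = i;
            }
        }
        points.remove(best);
    }

    /** Area of the triangle a-b-c; a small area means b adds little shape to the line. */
    private static double triangleArea(double[] a, double[] b, double[] c) {
        return Math.abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0;
    }
}
```

The linear scan keeps the sketch short; a production simplifier would more likely track the errors in a priority queue so that each removal does not rescan the whole buffer.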

Tasks to complete before merging:

- Yaml tests for time-series
  - Basic tests sorting by tsid ASC and DESC
  - Basic test with terms agg on same field as tsid
  - Test for multiple terms and TSID
  - Negative tests (disallowed sort field)
- Unit tests for time-series
  - Test ASC and DESC for single TSID
  - Test ASC and DESC for terms on TSID
  - Test for multiple TSID
  - Test for multiple terms and TSID

Final decision on when to activate optimized geo_line:

- geo_line inside time_series (required)
- ~~geo_line on position metric~~: decided not to require this, and instead:
  - Automatically default sort field to @timestamp
  - Throw exception on any other sort field setting

So now the switch between old and new geo_line behaviour is exclusively based on its nesting inside a time_series aggregation.
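
The decision above can be sketched as a small rule. This is only an illustration: the class `GeoLineSortFieldRule`, the method `resolve`, and the use of `IllegalArgumentException` are assumptions, not the actual code in `GeoLineAggregationBuilder`.

```java
/** Hypothetical sketch of the sort-field rule; not the actual GeoLineAggregationBuilder code. */
final class GeoLineSortFieldRule {
    static final String TIMESTAMP_FIELD = "@timestamp";

    /**
     * @param insideTimeSeries   whether geo_line is nested inside a time_series aggregation
     * @param requestedSortField the sort field from the request, or null if it was omitted
     * @return the sort field the aggregation should use
     */
    static String resolve(boolean insideTimeSeries, String requestedSortField) {
        if (insideTimeSeries) {
            if (requestedSortField == null) {
                // Inside time_series the sort field defaults to @timestamp.
                return TIMESTAMP_FIELD;
            }
            if (!TIMESTAMP_FIELD.equals(requestedSortField)) {
                // Any other sort field is rejected inside time_series.
                throw new IllegalArgumentException(
                    "geo_line inside time_series must sort on [" + TIMESTAMP_FIELD + "], got [" + requestedSortField + "]"
                );
            }
            return requestedSortField;
        }
        if (requestedSortField == null) {
            // The classic geo_line has no default and needs an explicit sort field.
            throw new IllegalArgumentException("geo_line requires a sort field");
        }
        return requestedSortField;
    }
}
```

With this rule, `resolve(true, null)` yields `@timestamp`, while `resolve(true, "other_field")` and `resolve(false, null)` both fail before any data is read.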

Work to be done in additional PRs:

- ~~Use registered ValueSource->Aggregator mappings to map position metrics to GeoLineAggregator.TimeSeries~~ (cancelled based on decision above)
- Optimize MergedGeoLines for time-series data (i.e. append and re-simplify): Fix time-series geo_line to include reduce phase in MergedGeoLines #96953
- Allow user to turn on line simplification without time-series (geo_line flag): Geoline aggregation - add simplification option #87903
- Allow user to specify geo_bounds within geo_line (simplify kibana queries)

elasticsearchmachine · 2023-03-31T13:29:27Z

Hi @craigtaverner, I've created a changelog YAML for you.

elasticsearchmachine · 2023-05-30T13:55:34Z

Pinging @elastic/es-analytics-geo (Team:Analytics)

test/framework/src/main/java/org/elasticsearch/search/aggregations/AggregatorTestCase.java

x-pack/plugin/spatial/build.gradle

...ial/src/main/java/org/elasticsearch/xpack/spatial/search/aggregations/GeoLineAggregator.java

.../main/java/org/elasticsearch/xpack/spatial/search/aggregations/TimeSeriesGeoLineBuckets.java

.../main/java/org/elasticsearch/xpack/spatial/search/aggregations/GeoLineAggregatorFactory.java

iverase

I left a few comments; it is looking good. My only concern at the moment is about sort validation on the aggregation when running in time series mode. In particular, we discussed the possibility of not defining the sort field for time series, so that it is implicit?

craigtaverner · 2023-06-01T07:20:37Z

Thanks for the review. I'll certainly add the missing sort-field checks and associated tests, and make other fixes. Probably only next week though.

Starting with YAML tests (which currently pass) and AggregatorTests (currently failing, likely due to a mistake in the tests)

* Created TimeSeries version of GeoLineAggregator, and wired it in so that time-series aggregations use it, but current behavior is still identical to non-time-series.
* Added both yaml and unit tests for testing that geo_line works with correct results in both time-series and non-time-series cases.
* Added additional tests to verify the grouping behaviour of time-series vs. terms aggs, and the combination of the two.

* Started refactoring to re-use simplifier for all buckets

The bucket id can change within a segment, so we need to detect this and save the geo_line.
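
A minimal sketch of that detection, assuming documents arrive grouped by bucket (as tsid ordering provides): when the incoming bucket id differs from the current one, the line built so far is saved and a new one is started. `BucketChangeCollector` and its methods are hypothetical names for illustration, not the code in `TimeSeriesGeoLineBuckets`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: builds one line at a time and saves it when the bucket id changes. */
final class BucketChangeCollector {
    private final Map<Long, List<double[]>> saved = new LinkedHashMap<>();
    private final List<double[]> current = new ArrayList<>(); // points of the line in progress
    private long currentBucket = -1; // -1 means no bucket has been seen yet

    /** Collects one point; bucket ids are assumed to be non-negative. */
    void collect(long bucket, double lon, double lat) {
        if (bucket != currentBucket) {
            // The bucket changed within the segment: save the finished line first.
            flush();
            currentBucket = bucket;
        }
        current.add(new double[] { lon, lat });
    }

    /** Saves the line in progress under its bucket id and resets the working buffer. */
    void flush() {
        if (currentBucket >= 0 && !current.isEmpty()) {
            saved.computeIfAbsent(currentBucket, b -> new ArrayList<>()).addAll(current);
        }
        current.clear();
    }

    /** Returns all saved lines, including the one still in progress. */
    Map<Long, List<double[]>> lines() {
        flush();
        return saved;
    }
}
```

Because only one line is in progress at a time, a single working buffer (and a single simplifier) can be reused for every bucket.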

The original geo_line relied on the BucketedSort for all intelligence. The time-series geo_line uses none of that, and does its own memory management.

And enhanced unit tests to cover multiple groups

@timestamp

Only activate the time-series optimizations if the aggregation is both:
* Within a time-series aggregation (i.e. tsid and @timestamp ordered)
* The geo_line sort field is @timestamp

Also disables the new geo_line for time-series, even when the correct sort and point fields are used, if the point field is not explicitly configured as a position metric.

iverase · 2023-06-14T07:34:33Z

...main/java/org/elasticsearch/xpack/spatial/search/aggregations/GeoLineAggregationBuilder.java

+ defaultValueSourceType()
+ );
+ configs.put(SORT_FIELD.getPreferredName(), sourceConfig);
+ } else if (false && sourceConfig.fieldContext().field().equals(TIMESTAMP_FIELD.getName()) == false) {


This is always false?

I think we should add test on the builder for the different combinations, for example if the sort field is null and not a time series I expect it to fail.

Since the geo_line builder only changes behaviour when nested inside the time_series builder, the builder tests cannot cover this (they test only parsing, which has not changed, and already allow missing sort fields, so no change there).

However, we added negative tests in the yaml (assert exception is thrown on incompatible sort field), and positive tests in both yaml and aggregator tests (the aggregator tests also verify that different geo_line are produced from the same data when passed through the different algorithms, and that the same geo_line is produced if no truncation/simplification happens, even with different algorithms).

Curiously I think there never was a test for a null sort-field throwing assertion for classic geo_line. That is missing test coverage from the old code. I could add that in this PR, or make a new one for that. In the case of time-series I think I have good coverage, so this PR is OK from that perspective.

Let's do that as a follow up, this PR is already pretty big.

Since the primary criterion for switching to the new algorithm is that geo_line is within a time-series aggregation, we now disallow any other sort field. We test the negative case in the yaml tests, but changed the unit tests to use TermsAggregation to mimic the time-series aggregation and get comparable results.

iverase

LGTM

The old code only threw an error if there was data, because the check was done inside the leaf collector just before actually reading the sort field. And there were no tests for a missing sort field. This commit adds the tests, and checks early, so the error is thrown even if data is missing.

* Test that behaviour is identical with or without POSITION metric
* Removed fallback code in builder (was switching to old geo_line without POSITION metric)
* Removed two TODOs that are no longer valid concerns

…96953)

Recently we added support for line-simplification in geo-lines when in time-series aggregations: #94954. This work did not, however, cover the case of `MergedGeoLines`, which occurs when the same geo_line covers multiple shards on data nodes, and needs to be merged on the coordinating node. The original work would use line-simplification and time-ordering on the data node, but then revert to re-sorting (unnecessary) and truncation (incorrect) on the coordinating node.

This PR rectifies that, and adds two fields to the InternalGeoLine:

* nonOverlapping
  * re-sorting is *NOT* required because the sort value ranges do not overlap, so we only sort the set of incoming geo_lines (easy/fast) instead of every single point (complex/slow).
* simplified
  * The approach to memory limiting is by line-simplification instead of truncation.
  * If the data nodes are doing line-simplification, we want the coordinating node to do that as well

For time-series, both of the above are true, and at this point all other cases will have both false. The reason for two booleans is:

* They are not necessarily related concepts
* We consider supporting line-simplification as a user-requested option in the non-time-series geo_line in future (much nicer than truncation)

Fixes #96983
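
The `nonOverlapping` case can be sketched as follows: because each incoming line is already sorted and the sort value ranges do not overlap, it is enough to order the lines by their first sort value and concatenate them. The class `NonOverlappingMerge` and the `PartialLine` holder are hypothetical names for illustration, not the actual `MergedGeoLines` code, and the sketch leaves out the re-simplification step that would follow when the merged line exceeds the size limit.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of merging shard-level lines whose sort value ranges do not overlap. */
final class NonOverlappingMerge {

    /** A partial line from one shard: sort values ascending, one encoded point per sort value. */
    static final class PartialLine {
        final double[] sortValues;
        final long[] points;

        PartialLine(double[] sortValues, long[] points) {
            this.sortValues = sortValues;
            this.points = points;
        }
    }

    static PartialLine merge(List<PartialLine> parts) {
        List<PartialLine> ordered = new ArrayList<>();
        int total = 0;
        for (PartialLine part : parts) {
            if (part.sortValues.length > 0) {
                ordered.add(part);
                total += part.sortValues.length;
            }
        }
        // Sort the lines (few) by their first sort value, not every point (many).
        ordered.sort((a, b) -> Double.compare(a.sortValues[0], b.sortValues[0]));

        double[] sortValues = new double[total];
        long[] points = new long[total];
        int offset = 0;
        for (PartialLine part : ordered) {
            // Each part is already internally sorted, so a plain append keeps global order.
            System.arraycopy(part.sortValues, 0, sortValues, offset, part.sortValues.length);
            System.arraycopy(part.points, 0, points, offset, part.points.length);
            offset += part.sortValues.length;
        }
        return new PartialLine(sortValues, points);
    }
}
```

Sorting k lines costs O(k log k) and the append is linear in the number of points, compared with O(n log n) for re-sorting all n points.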

…96953) (#97005) Commit message identical to the one for #96953 above.

elasticsearchmachine · 2023-07-04T17:12:34Z

@craigtaverner according to this PR's labels, I need to update the changelog YAML, but I can't because the PR is closed. Please either update the changelog yourself on the appropriate branch, or adjust the labels. Specifically:

The PR is labelled release highlight but the changelog has no highlight section

craigtaverner added the >enhancement, :Analytics/Geo and Team:Analytics labels Mar 31, 2023

elasticsearchmachine added the v8.8.0 label Mar 31, 2023

gmarouli added v8.9.0 and removed v8.8.0 labels Apr 26, 2023

craigtaverner force-pushed the geo_line_tsdb branch 2 times, most recently from 5dc6128 to d5fb222 May 17, 2023 13:08

craigtaverner force-pushed the geo_line_tsdb branch from 03b2dfd to 48a8817 May 26, 2023 10:02

craigtaverner marked this pull request as ready for review May 30, 2023 13:55

craigtaverner changed the title ~~WIP Started geo_line for TSDB work~~ Asset tracking: geo_line for TSDB May 30, 2023