- Notifications
You must be signed in to change notification settings - Fork 25.6k
Asset tracking: line-simplification for geo_line in TSDB #94954
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| Hi @craigtaverner, I've created a changelog YAML for you. |
5dc6128 to d5fb222 Compare 03b2dfd to 48a8817 Compare | Pinging @elastic/es-analytics-geo (Team:Analytics) |
test/framework/src/main/java/org/elasticsearch/search/aggregations/AggregatorTestCase.java Outdated Show resolved Hide resolved
...ial/src/main/java/org/elasticsearch/xpack/spatial/search/aggregations/GeoLineAggregator.java Outdated Show resolved Hide resolved
.../main/java/org/elasticsearch/xpack/spatial/search/aggregations/TimeSeriesGeoLineBuckets.java Outdated Show resolved Hide resolved
.../main/java/org/elasticsearch/xpack/spatial/search/aggregations/GeoLineAggregatorFactory.java Outdated Show resolved Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few comments, it is looking good. My only concerns at the moment is about sort validation on the aggregation when running in time series mode. In particular we discussed the possibility on not defining it for time series so it is implicit?
| Thanks for the review. I'll certainly add the missing sort-field checks and associated tests, and make other fixes. Probably only next week though. |
8783d21 to e1dd4d4 Compare Starting with YAML tests (which currently pass) and AggregatorTests (currently failing, likely due to mistake in the tests)
* Created TimeSeries version of GeoLineAggregator, and wired it in so that time-series aggregations use it, but current behavior is still identical to non-time-series. * Added both yaml and unit tests for testing that geo_line works with correct results in both time-series and non-time-series cases. * Added additional tests to verify the grouping behaviour of time-series vs. terms aggs, and the combination of the two.
* Started refactoring to re-use simplifier for all buckets
The bucket id can change within a segment, so we need to detect this and save the geo_line.
The original geo_line relied on the BucketedSort for all intelligence. The time-series geo_line uses none of that, and does its own memory management.
And enhanced unit tests to cover multiple groups
Only activate the time-series optimizations if the aggregation is both: * Within a time-series aggregation (ie. tsid and @timestamp ordered) * The geo_line sort field is @timestamp
6b16df7 to 014c395 Compare Also disables the new geo_line for time-series even if the correct sort and point fields are used if the point field is not explicitly configured to be a position metric.
| defaultValueSourceType() | ||
| ); | ||
| configs.put(SORT_FIELD.getPreferredName(), sourceConfig); | ||
| } else if (false && sourceConfig.fieldContext().field().equals(TIMESTAMP_FIELD.getName()) == false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is always false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add test on the builder for the different combinations, for example if the sort field is null and not a time series I expect it to fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the geo_line builder only changes behaviour when nested inside the time_series builder, the builder tests cannot cover this (they test only parsing, which has not changed, and already allow missing sort fields, so no change there).
However, we added negative tests in the yaml (assert exception is thrown on incompatible sort field), and positive tests in both yaml and aggregator tests (the aggregator tests also verify that different geo_line are produced from the same data when passed through the different algorithms, and that the same geo_line is produced if no truncation/simplification happens, even with different algorithms).
Curiously I think there never was a test for a null sort-field throwing assertion for classic geo_line. That is missing test coverage from the old code. I could add that in this PR, or make a new one for that. In the case of time-series I think I have good coverage, so this PR is OK from that perspective.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's do that as a follow up, this PR is already pretty big.
Since the primary criteria for switching to the new algorithm is that geo_line is within a time-series aggregation, we now disallow any other sort field. We test the negative case in the yaml tests, but changed the unit tests to use TermsAggregation to minim the time-series aggregation to get comparable results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
The old code only threw error if there was data because the check was done inside the leaf collector just before actually reading the sort field. And there were no tests for missing sort field. This commit adds the tests, and checks early so even if data is missing.
* Test that behaviour is identical with or without POSITION metric * Removed fallback code in builder (was switching to old geo_line without POSITION metric) * Removed two TODO's that are no longer valid concerns
…96953) Recently we added support for line-simplification in geo-lines when in time-series aggregations: #94954. This work did not, however, cover the case of `MergedGeoLines`, which occurs when the same geo_line covers multiple shards on data nodes, and needs to be merged on the coordinating node. The original work would use line-simplification and time-ordering on the data node, but then revert to re-sorting (unnecessary) and truncation (incorrect) on the coordinating node. This PR rectifies that, and adds two fields to the InternalGeoLine: * nonOverlapping * re-sorting is *NOT* required because the sort value ranges do not overlap, so we only sort the set of incoming geo_lines (easy/fast) instead of every single point (complex/slow). * simplified * The approach to memory limiting is by line-simplification instead of truncation. * If the data nodes are doing line-simplification, we want the coordinating node to do that as well For time-series, both of the above are true, and at this point all other cases will have both false. The reason for two booleans is: * They are not necessarily related concepts * We consider supporting line-simplification as a user-requested option in the non-time-series geo_line in future (much nicer than truncation) Fixes #96983
…lastic#96953) Recently we added support for line-simplification in geo-lines when in time-series aggregations: elastic#94954. This work did not, however, cover the case of `MergedGeoLines`, which occurs when the same geo_line covers multiple shards on data nodes, and needs to be merged on the coordinating node. The original work would use line-simplification and time-ordering on the data node, but then revert to re-sorting (unnecessary) and truncation (incorrect) on the coordinating node. This PR rectifies that, and adds two fields to the InternalGeoLine: * nonOverlapping * re-sorting is *NOT* required because the sort value ranges do not overlap, so we only sort the set of incoming geo_lines (easy/fast) instead of every single point (complex/slow). * simplified * The approach to memory limiting is by line-simplification instead of truncation. * If the data nodes are doing line-simplification, we want the coordinating node to do that as well For time-series, both of the above are true, and at this point all other cases will have both false. The reason for two booleans is: * They are not necessarily related concepts * We consider supporting line-simplification as a user-requested option in the non-time-series geo_line in future (much nicer than truncation) Fixes elastic#96983
…96953) (#97005) Recently we added support for line-simplification in geo-lines when in time-series aggregations: #94954. This work did not, however, cover the case of `MergedGeoLines`, which occurs when the same geo_line covers multiple shards on data nodes, and needs to be merged on the coordinating node. The original work would use line-simplification and time-ordering on the data node, but then revert to re-sorting (unnecessary) and truncation (incorrect) on the coordinating node. This PR rectifies that, and adds two fields to the InternalGeoLine: * nonOverlapping * re-sorting is *NOT* required because the sort value ranges do not overlap, so we only sort the set of incoming geo_lines (easy/fast) instead of every single point (complex/slow). * simplified * The approach to memory limiting is by line-simplification instead of truncation. * If the data nodes are doing line-simplification, we want the coordinating node to do that as well For time-series, both of the above are true, and at this point all other cases will have both false. The reason for two booleans is: * They are not necessarily related concepts * We consider supporting line-simplification as a user-requested option in the non-time-series geo_line in future (much nicer than truncation) Fixes #96983
| @craigtaverner according to this PR's labels, I need to update the changelog YAML, but I can't because the PR is closed. Please either update the changelog yourself on the appropriate branch, or adjust the labels. Specifically:
|
Part of the work described in #85274
Within time-series data, two optimisations are done:
Tasks to complete before merging:
Final decision on when to activate optimized geo_line:
geo_line on position metric, decided to not require this, and instead:@timestampSo now the switch between old and new geo_line behaviour is exclusively based on its nesting inside a time_series aggregation.
Work to be done in additional PRs:
Use registered(cancelled based on decision above)ValueSource->Aggregatormappings to map position metrics toGeoLineAggregator.TimeSeriesMergedGeoLinesfor time-series data (ie. append and re-simplify) - Fix time-series geo_line to include reduce phase in MergedGeoLines #96953