ESQL: Support reading points from doc-values for STATS ST_CENTROID #104218
Conversation
d5ae6e7 to cc4bf7d

Pinging @elastic/es-analytics-geo (Team:Analytics)

Hi @craigtaverner, I've created a changelog YAML for you.
server/src/test/java/org/elasticsearch/index/mapper/GeoPointFieldMapperTests.java
...g/elasticsearch/compute/aggregation/SpatialCentroidCartesianPointSourceValuesAggregator.java
...ava/org/elasticsearch/compute/aggregation/SpatialCentroidGeoPointSourceValuesAggregator.java
With the latest design, the use of forStats==true is limited to the local physical plan, so most of the time it is completely correct to set this to false. Should we bother changing all the callers to this method when the case where it is needed is limited to a single location?
The problem here is that the entire test class PointFieldMapperTests has not been set up for synthetic source tests, which GeoPointFieldMapperTests has, and this support is needed for the relevant tests. We can investigate adding this.
Please raise an issue for this one so we don't lose track of it.
Done - #104504
We've considered alternative names, like forDisplay (opposite of for stats), but every choice has pros and cons, so I've left it like this for now, since this most closely matches current needs.
If we start using this for non-stats use cases, like spatial predicates that benefit from doc-values, we can change the name of the method at that point.
Predicates and top-n running on the results of functions consuming the points will want it.
Added support for inheritance in finding the static methods used in aggregations because, for example, SpatialCentroid can be configured in four different ways (geo vs cartesian, and doc-values vs source values). Used only single-level inheritance for simplicity.
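The single-level inheritance lookup described above can be sketched with plain reflection. This is an illustrative example, not the actual generator code: the class names and the `staticMethods` helper are hypothetical, and the real code operates on generated aggregator classes.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

// Sketch: collect public static methods declared on a class or on its direct
// superclass, mirroring a "single-level inheritance" lookup.
public class StaticMethodScan {
    // Hypothetical base class contributing shared static helpers.
    static class SpatialAggBase {
        public static void combine() {}
    }

    // Hypothetical subclass adding its own static method.
    static class SpatialCentroidDocValues extends SpatialAggBase {
        public static void combineIntermediate() {}
    }

    public static List<String> staticMethods(Class<?> clazz) {
        List<String> names = new ArrayList<>();
        // getMethods() includes public methods inherited from superclasses,
        // so we filter to the class itself and its direct superclass only.
        for (Method m : clazz.getMethods()) {
            boolean declaredHere = m.getDeclaringClass() == clazz;
            boolean declaredInDirectSuper = m.getDeclaringClass() == clazz.getSuperclass();
            if (Modifier.isStatic(m.getModifiers()) && (declaredHere || declaredInDirectSuper)) {
                names.add(m.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Both the inherited and the locally declared static methods are found.
        System.out.println(staticMethods(SpatialCentroidDocValues.class));
    }
}
```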
Look at the changes in AggregateMapper to see how we supported these combinations of generated function names.
This is the heart of the solution for figuring out when to use doc-values and when to use source values. It only runs on the local node, so we only consider doc-values on the local node. It re-writes spatial aggregations to mark fields as doc-values so the right spatial aggregation functions are executed, finds the fields involved, and re-writes the FieldExtractExec to load those fields as doc-values from the index.
Since the re-writing to use forStats is done only on the local node, this knowledge does not need to be serialized across the cluster. It is therefore not required to update the transport version, and this older constructor remains the main constructor for the entire stack. If we decide at some point to move this planning into the logical planner, we would need to serialize this and update the transport version too. At this point we see no need to take that step.
This is where the additional combination of function names is generated. So far only spatial aggregations need this, but the idea can be used by others if there is ever value.
Could you explain what this is about in more detail? I'd kind of imagined the withForStats would be enough to enable the optimization and the mapper would just call the single function. I imagine I'm missing some part of how the linkage works though.
Take a look at SpatialCentroid.supplier(). There are four possible combinations, based on two spatial types GEO_POINT and CARTESIAN_POINT, and two loading types DocValues and SourceValues. The code generator needs to generate functions for all four combinations. The AggregateMapper deals with many possible combinations, but not this specific set of DocValues and SourceValues, so I added these using the same pattern I see here for other combinations. Without this fix the classloader cannot find the classes to load for the four combinations.
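The four-way dispatch described above can be sketched as a simple cross-product of the two spatial types and the two loading strategies. This is a hedged illustration: the enums and the `aggregatorName` helper are assumptions, not the actual `SpatialCentroid.supplier()` code, though the generated names follow the pattern of the real classes seen in this PR (e.g. SpatialCentroidGeoPointDocValuesAggregatorFunction).

```java
// Sketch of choosing one of four generated aggregators: two spatial types
// times two loading strategies. Names are illustrative.
public class CentroidSupplierSketch {
    enum SpatialType { GEO_POINT, CARTESIAN_POINT }
    enum LoadType { DOC_VALUES, SOURCE_VALUES }

    public static String aggregatorName(SpatialType spatial, LoadType load) {
        String spatialPart = spatial == SpatialType.GEO_POINT ? "GeoPoint" : "CartesianPoint";
        String loadPart = load == LoadType.DOC_VALUES ? "DocValues" : "SourceValues";
        return "SpatialCentroid" + spatialPart + loadPart + "Aggregator";
    }

    public static void main(String[] args) {
        // The code generator must produce (and the classloader must find)
        // a class for each of the four combinations.
        for (SpatialType s : SpatialType.values()) {
            for (LoadType l : LoadType.values()) {
                System.out.println(aggregatorName(s, l));
            }
        }
    }
}
```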
Without this fix the classloader cannot find the classes to load for the four combinations.
That's the trick. I suppose we're using some kind of trick to get these linked up. Scalars just use booleans for it. I'd kind of prefer that. But now isn't the time to rework that.
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/planner/AggregateMapper.java
craigtaverner left a comment
Did my own review, with a few suggested changes as well as some explanatory comments.
Craig, leaving a quick comment that I'll get to review this PR, however due to its size it will take some time.
luigidellaquila left a comment
Thank you very much @craigtaverner, LGTM!
I just left a comment; not sure it's really important, but maybe worth checking.
```java
var changed = as.replaceChild(af.withDocValues());
changedAggregates = true;
if (af.field() instanceof Attribute fieldAttribute) {
    foundAttribute.set(fieldAttribute);
```
Just a double-check: are you taking this recent change into consideration? #104387
(I don't have a real use case where this could make a big difference, just wondering if you checked it)
Looking at that now. It turns out that if we nest a function inside the st_centroid then we won't load from doc-values. The queries work, but the optimization is not activated. I'm busy debugging that now to see if there is a simple solution, but this could also be a follow-up PR to enable this case.
I did need to make a small change to the local physical planner to get all tests to pass with nested functions. They are not using the doc-values optimization, but at least the queries run with nested fields.
Let's handle that in a follow-up PR. When dealing with a nested expression, the agg will see the result of that computation, so it becomes a question of whether that operation should load geo from doc-values or source.
Pinging @elastic/es-analytical-engine (Team:Analytics)
```
| SORT scalerank DESC
;

centroid:geo_point | count:long | scalerank:i
```
Not related to this specific test, but for a wider range of value handling, some null values could be added to the "airport" data set.
Added more data with null values and have aggregation working on that. The centroid of null is null. The centroid of data including a few nulls is the centroid of the non-null values.
```java
private static class SpatialDocValuesExtraction extends OptimizerRule<AggregateExec> {
    @Override
    protected PhysicalPlan rule(AggregateExec aggregate) {
        var foundAttribute = new Holder<Attribute>(null);
```
I am not sure if this is to blame or not (a single Holder for attributes, while there may be more than one while searching). So, I tested with FROM airports2 | stats centroid1=ST_CENTROID(location), centroid2=ST_CENTROID(location2) where location and location2 are geo_points. This fails with:

```
Caused by: java.lang.ClassCastException: class org.elasticsearch.compute.data.BytesRefArrayBlock cannot be cast to class org.elasticsearch.compute.data.LongBlock (org.elasticsearch.compute.data.BytesRefArrayBlock and org.elasticsearch.compute.data.LongBlock are in unnamed module of loader java.net.FactoryURLClassLoader @550fa96f)
    at org.elasticsearch.compute.aggregation.spatial.SpatialCentroidGeoPointDocValuesAggregatorFunction.addRawInput(SpatialCentroidGeoPointDocValuesAggregatorFunction.java:64)
    at org.elasticsearch.compute.aggregation.Aggregator.processPage(Aggregator.java:42)
    at org.elasticsearch.compute.operator.AggregationOperator.addInput(AggregationOperator.java:79)
    at org.elasticsearch.compute.operator.Driver.runSingleLoopIteration(Driver.java:214)
    at org.elasticsearch.compute.operator.Driver.run(Driver.java:139)
    at org.elasticsearch.compute.operator.Driver$1.doRun(Driver.java:327)
```

Also, not sure if this is supposed to work or not, or maybe it should return a more descriptive and user-friendly error message.
I had a quick look at this and this seems to be the culprit.
This code works for me:
```java
protected PhysicalPlan rule(AggregateExec aggregate) {
    Set<FieldAttribute> fieldAttributes = new HashSet<>();
    PhysicalPlan plan = aggregate.transformDown(UnaryExec.class, exec -> {
        ...
        if (af.field() instanceof FieldAttribute fieldAttribute) {
            // We need to both mark the field to load differently, and change the spatial function to know to use it
            fieldAttributes.add(fieldAttribute);
            changedAggregates = true;
            ...
        if (exec instanceof FieldExtractExec fieldExtractExec && fieldAttributes.size() > 0) {
            // Tell the field extractor that it should extract the field from doc-values instead of source values
            if (fieldAttributes.retainAll(fieldExtractExec.attributesToExtract())) {
                exec = fieldExtractExec.withForStats(fieldAttributes);
            }
        }
        return exec;
        ...
```
This was a bug, and I changed the Holder to a Set, and wrote a test in spatial.csv-spec that tests this, as well as a PhysicalPlanner test to test the correct behaviour of the planner.
Other than the bug fix for this and the corresponding test, it LGTM.
This test comes from a separate PR - looks like a git foobar.
I cannot find your comment in the file view, was this something that was fixed by merging main?
Or do you mean the spatial test I added to this file? I could delete that but thought a test of the logical plan for a spatial aggregation was not entirely without value.
Includes a test fix for loading and converting nulls to encoded longs.
The local physical planner only marked a single field for stats loading, but marked all spatial aggregations for stats loading, which led to only one aggregation getting the right data, while the rest would get the wrong data.
Now the planner decides whether to load data from doc-values. To remove the confusion of preferDocValues==false in the non-spatial cases, we use an ENUM with the default value of NONE, to make it clear we're leaving the choice up to the field type in all non-spatial cases.
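The three-valued preference described above can be sketched as follows. This is a hedged illustration: the enum name, its constants, and the `loaderFor` helper are assumptions, not the actual ES|QL types; the point is that NONE is distinguishable from an explicit choice, which a plain boolean cannot express.

```java
// Sketch: an enum replaces preferDocValues==false so "no preference"
// (leave the choice to the field type) is a distinct state.
public class DocValuesPreference {
    enum Preference {
        NONE,          // default: leave the choice to the field type (all non-spatial cases)
        DOC_VALUES,    // spatial aggregation: load encoded longs from doc-values
        SOURCE_VALUES  // load the original geometry from source
    }

    static String loaderFor(Preference p, String fieldDefault) {
        return switch (p) {
            case NONE -> fieldDefault;       // field type decides
            case DOC_VALUES -> "LongBlock";
            case SOURCE_VALUES -> "BytesRefBlock";
        };
    }

    public static void main(String[] args) {
        System.out.println(loaderFor(Preference.NONE, "BytesRefBlock"));
        System.out.println(loaderFor(Preference.DOC_VALUES, "BytesRefBlock"));
    }
}
```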
…computers

This was not reproducible on the development machine, but CI machines were sufficiently different to lead to very tiny precision changes over very large Kahan summations. We fixed this by reducing the need for precision checks in clustered integration tests.
This reverts commit 12c6980.
@craigtaverner according to this PR's labels, it should not have a changelog YAML, but I can't delete it because it is closed. Please either delete the changelog from the appropriate branch yourself, or adjust the labels.
## Summary

Adds new agg function as in elastic/elasticsearch#104218

### Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
Support `st_centroid` aggregation on `geo_point` and `cartesian_point` data. All aggregations will combine the original data into new data, and as such we can load from doc-values for performance. This feature is both the first spatial aggregation for ES|QL as well as support for loading points from doc-values instead of from source, but only for spatial aggregations.

Since ES|QL aggregations (ie the `STATS` command) are split into two phases, an initial aggregation on the data nodes, and the final aggregation on the coordinator node, this allows us to only need to support doc-values on the data nodes. The data flow is as follows:

- The field extraction is marked with `forStats==true`, and the GeoPoint and CartesianPoint field types will detect this boolean and return `LongBlock` loaders.
- The `SpatialAggregateFunction` on the data node will be notified that it should support `LongBlock` input, and will decode the long into `(double x, double y)` before aggregating, producing intermediate results (four `DoubleBlock` for xSum, xComp, ySum, yComp and one `LongBlock` for count).
- The `SpatialAggregateFunction` on the coordinator node only receives intermediate results, and will never receive actual points, so does not care what block type was originally used, and will produce the final centroid in WKB format.

This design has a few advantages over the original plan of supporting both WKB and encoded longs in the full stack:

- … (the `LongBlock` support we used in 8.12)
- The choice of `LongBlock` in the data loading from the index can be made very late, at the `LocalPhysicalPlanOptimizer` stage, isolating it to a very small set of changes, and also only on the data nodes at the data loading part.

Note that this solution will not support the use of doc-values in `ST_CENTROID` if some other `STATS` command is run first, because only the first `STATS` command will be run on the data nodes. For example:

In this case the `ST_CENTROID` will use the source values for the calculation. We have decided to favour the simpler solution of only doing local physical planning for doc-values because support of doc-values in the coordinator node requires a much more complex solution with full-stack support of both source and doc-values versions of the point types.

Fixes #104656
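The intermediate state described in the data flow above (compensated sums for x and y plus a count, with nulls ignored) can be sketched as follows. This is an illustration of the idea under stated assumptions, not the generated aggregator code: the class and method names are hypothetical, and the real aggregators work on blocks rather than boxed doubles.

```java
// Sketch: Kahan-compensated centroid accumulation. Nulls are skipped, so the
// centroid of data including a few nulls is the centroid of the non-null values.
public class CentroidSketch {
    double xSum, xComp, ySum, yComp; // sums and Kahan compensation terms
    long count;

    void add(Double x, Double y) {
        if (x == null || y == null) {
            return; // nulls do not contribute to the centroid
        }
        // Kahan summation: carry the low-order bits lost by each addition
        // in the compensation term, so very long sums stay precise.
        double cx = x - xComp;
        double tx = xSum + cx;
        xComp = (tx - xSum) - cx;
        xSum = tx;
        double cy = y - yComp;
        double ty = ySum + cy;
        yComp = (ty - ySum) - cy;
        ySum = ty;
        count++;
    }

    double[] centroid() {
        // The centroid of only-null input is null.
        return count == 0 ? null : new double[] { xSum / count, ySum / count };
    }

    public static void main(String[] args) {
        CentroidSketch c = new CentroidSketch();
        c.add(0.0, 0.0);
        c.add(null, null); // skipped
        c.add(2.0, 4.0);
        double[] result = c.centroid();
        System.out.println(result[0] + ", " + result[1]); // 1.0, 2.0
    }
}
```

In the real two-phase flow, the data node would send (xSum, xComp, ySum, yComp, count) as intermediate results and the coordinator would merge them before dividing.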