ESQL: Support reading points from doc-values for STATS ST_CENTROID #104218

craigtaverner · 2024-01-10T17:06:39Z

Support st_centroid aggregation on geo_point and cartesian_point data. All aggregations combine the original data into new data, and as such we can load from doc-values for performance. This feature is both the first spatial aggregation for ES|QL and the first support for loading points from doc-values instead of from source, though only for spatial aggregations.

Since ES|QL aggregations (i.e. the STATS command) are split into two phases, an initial aggregation on the data nodes and a final aggregation on the coordinator node, we only need to support doc-values on the data nodes. The data flow is as follows:

Data node uses FieldExtractExec to tell the BlockLoaderContext that the blocks should be loaded with forStats==true, and the GeoPoint and cartesian Point field types will detect this boolean and return LongBlock loaders.
The SpatialAggregateFunction on the data node will be notified that it should support LongBlock input, and will decode the long into (double x, double y) before aggregating and producing intermediate results (four DoubleBlocks for xSum, xComp, ySum, yComp and one LongBlock for count).
The SpatialAggregateFunction on the coordinator node only receives intermediate results, and will never receive actual points, so does not care what block type was originally used, and will produce the final centroid in WKB format.
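The data flow above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the class and method names are hypothetical, and the long encoding (two float bit patterns packed into one long) is an assumption made for the example rather than the encoding Elasticsearch uses for point doc-values. The sketch shows a data-node phase that decodes longs into compensated (Kahan) sums, and a coordinator phase that merges only intermediate state.

```java
/** Hypothetical sketch of a two-phase centroid aggregation; not the PR's implementation. */
final class CentroidSketch {
    // Intermediate state: compensated sums for x and y, plus a count.
    double xSum, xComp, ySum, yComp;
    long count;

    /** Illustrative encoding only (an assumption): x in the high 32 bits, y in the low 32 bits. */
    static long encode(float x, float y) {
        return ((long) Float.floatToIntBits(x) << 32) | (Float.floatToIntBits(y) & 0xFFFFFFFFL);
    }

    static double decodeX(long encoded) {
        return Float.intBitsToFloat((int) (encoded >>> 32));
    }

    static double decodeY(long encoded) {
        return Float.intBitsToFloat((int) encoded);
    }

    /** Data-node phase: decode one encoded point and fold it into the running sums. */
    void addEncoded(long encoded) {
        addX(decodeX(encoded));
        addY(decodeY(encoded));
        count++;
    }

    /** Coordinator phase: merge intermediate state only; raw points are never seen here. */
    void addIntermediate(CentroidSketch other) {
        addX(other.xSum);
        addX(-other.xComp); // the compensation term is subtracted from the sum
        addY(other.ySum);
        addY(-other.yComp);
        count += other.count;
    }

    /** Returns {x, y}, or null when no points were aggregated. */
    double[] centroid() {
        if (count == 0) {
            return null;
        }
        return new double[] { (xSum - xComp) / count, (ySum - yComp) / count };
    }

    // Classic Kahan summation step for the x component.
    private void addX(double value) {
        double y = value - xComp;
        double t = xSum + y;
        xComp = (t - xSum) - y;
        xSum = t;
    }

    // Classic Kahan summation step for the y component.
    private void addY(double value) {
        double y = value - yComp;
        double t = ySum + y;
        yComp = (t - ySum) - y;
        ySum = t;
    }

    public static void main(String[] args) {
        CentroidSketch nodeA = new CentroidSketch();
        nodeA.addEncoded(encode(1f, 2f));
        nodeA.addEncoded(encode(3f, 4f));
        CentroidSketch nodeB = new CentroidSketch();
        nodeB.addEncoded(encode(5f, 6f));

        CentroidSketch coordinator = new CentroidSketch();
        coordinator.addIntermediate(nodeA);
        coordinator.addIntermediate(nodeB);
        double[] c = coordinator.centroid();
        System.out.println(c[0] + ", " + c[1]);
    }
}
```

Because the coordinator only consumes the five intermediate values, it does not matter whether the data node read WKB from source or longs from doc-values.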

This design has a few advantages over the original plan of supporting both WKB and encoded longs in the full stack:

Only one backing type for each explicit type (no need to bring back the LongBlock support we used in 8.12)
The decision to switch to LongBlock in the data loading from the index can be made very late, at the LocalPhysicalPlanOptimizer stage, isolating it to a very small set of changes, and also only on the data nodes at the data loading part.

Note that this solution will not support the use of doc-values in ST_CENTROID if some other STATS command is run first, because only the first STATS command will be run on the data nodes. For example:

FROM airports | STATS count = COUNT(*) BY scalerank, location | LIMIT 10 | STATS centroid=ST_CENTROID(location), count=SUM(count)

In this case the ST_CENTROID will use the source values for the calculation. We have decided to favour the simpler solution of only doing local physical planning for doc-values because support of doc-values in the coordinator node requires a much more complex solution with full-stack support of both source and doc-values versions of the point types.

Fixes #104656

elasticsearchmachine · 2024-01-15T15:35:45Z

Pinging @elastic/es-analytics-geo (Team:Analytics)

elasticsearchmachine · 2024-01-16T08:29:47Z

Hi @craigtaverner, I've created a changelog YAML for you.

server/src/test/java/org/elasticsearch/index/mapper/GeoPointFieldMapperTests.java

...g/elasticsearch/compute/aggregation/SpatialCentroidCartesianPointSourceValuesAggregator.java

...ava/org/elasticsearch/compute/aggregation/SpatialCentroidGeoPointSourceValuesAggregator.java

craigtaverner · 2024-01-16T16:17:36Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/planner/PlannerUtils.java

With the latest design, the use of forStats==true is limited to the local physical plan, so most of the time it is completely correct to set this to false. Should we bother changing all the callers to this method when the case where it is needed is limited to a single location?

craigtaverner · 2024-01-16T16:39:35Z

...patial/src/test/java/org/elasticsearch/xpack/spatial/index/mapper/PointFieldMapperTests.java

The problem here is that the entire test class PointFieldMapperTests has not been set up for synthetic source tests, which GeoPointFieldMapperTests has, and this support is needed for the relevant tests. We can investigate adding this.

Please raise an issue for this one to not lose track of it.

Done - #104504

craigtaverner · 2024-01-16T16:41:58Z

.../src/main/java/org/elasticsearch/benchmark/compute/operator/ValuesSourceReaderBenchmark.java

We've considered alternative names, like forDisplay (opposite of for stats), but every choice has pros and cons, so I've left it like this for now, since this most closely matches current needs.

If we start using this for non-stats use cases, like spatial predicates that benefit from doc-values, we can change the name of the method at that point.

Predicates and top-n that run on the results of functions consuming the points will want it.

craigtaverner · 2024-01-16T16:43:47Z

x-pack/plugin/esql/compute/gen/src/main/java/org/elasticsearch/compute/gen/Methods.java

Added support for inheritance in finding the static methods used in aggregations, because, for example, the SpatialCentroid can be configured in four different ways (geo vs cartesian, and doc-values vs source values). Used only single-level inheritance for simplicity.

craigtaverner · 2024-01-16T16:45:36Z

...rc/main/java/org/elasticsearch/xpack/esql/expression/function/aggregate/SpatialCentroid.java

Look at the changes in AggregateMapper to see how we supported these combinations of generated function names.

craigtaverner · 2024-01-16T16:47:32Z

...in/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/LocalPhysicalPlanOptimizer.java

This is the heart of the solution for figuring out when to use doc-values and when to use source values. It only runs on the local node, so we only consider doc-values on the local node, and it re-writes spatial aggregations to mark fields as doc-values so the right spatial aggregation functions are executed, as well as finding the fields involved, and re-writing the FieldExtractExec to load that field as doc-values from the index.

craigtaverner · 2024-01-16T16:50:14Z

...k/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/physical/FieldExtractExec.java

Since the re-writing to use forStats is done only on the local node, this knowledge does not need to be serialised across the cluster. It is therefore not required to update the transport-version, and this older constructor remains the main constructor for the entire stack. If we decide at some point to plan this knowledge in the logical planner, we would need to serialize this, and update the transport version too. At this point we see no need to take that step.

craigtaverner · 2024-01-16T16:51:21Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/planner/AggregateMapper.java

This is where the additional combination of function names is generated. So far only spatial aggregations need this, but the idea can be used by others if there is ever value.

Could you explain what this is about in more detail? I'd kind of imagined the withForStats would be enough to enable the optimization and the mapper would just call the single function. I imagine I'm missing some part of how the linkage works though.

Take a look at SpatialCentroid.supplier(). There are four possible combinations, based on two spatial types GEO_POINT and CARTESIAN_POINT, and two loading types DocValues and SourceValues. The code generator needs to generate functions for all four combinations. The AggregateMapper deals with many possible combinations, but not this specific set of DocValues and SourceValues, so I added these using the same pattern I see here for other combinations. Without this fix the classloader cannot find the classes to load for the four combinations.

Without this fix the classloader cannot find the classes to load for the four combinations.

That's the trick. I suppose we're using some kind of trick to get these linked up. Scalars just use booleans for it. I'd kind of prefer that. But now isn't the time to rework that.
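To make the four combinations in this thread concrete, here is a small sketch that enumerates them. It is not the AggregateMapper code: the class and method names are hypothetical, and the naming pattern is inferred from the generated file names listed in this PR (for example SpatialCentroidGeoPointSourceValuesAggregator), so treat it as an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: enumerate spatial-type x loading-type combinations of aggregator names. */
final class SpatialNameCombinations {
    static List<String> aggregatorNames(String function) {
        String[] spatialTypes = { "GeoPoint", "CartesianPoint" };
        String[] loadingTypes = { "DocValues", "SourceValues" };
        List<String> names = new ArrayList<>();
        for (String spatialType : spatialTypes) {
            for (String loadingType : loadingTypes) {
                // Assumed pattern: <function><spatial type><loading type>Aggregator
                names.add(function + spatialType + loadingType + "Aggregator");
            }
        }
        return names;
    }

    public static void main(String[] args) {
        for (String name : aggregatorNames("SpatialCentroid")) {
            System.out.println(name);
        }
    }
}
```

Each generated name has to resolve to a class at runtime, which is why a missing combination shows up as a class-loading failure rather than a compile error.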

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/planner/AggregateMapper.java

craigtaverner

Did my own review, with a few suggested changes as well as some explanatory comments.

costin · 2024-01-17T03:05:36Z

Craig, leaving a quick comment that I'll get to reviewing this PR; however, due to its size, it will take some time.

This reverts commit cfc4341.

luigidellaquila

Thank you very much @craigtaverner, LGTM!

I just left a comment, not sure it's really important but maybe worth checking

luigidellaquila · 2024-01-18T10:07:43Z

...in/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/LocalPhysicalPlanOptimizer.java

+ var changed = as.replaceChild(af.withDocValues());
+ changedAggregates = true;
+ if (af.field() instanceof Attribute fieldAttribute) {
+ foundAttribute.set(fieldAttribute);


Just a double-check: are you taking this recent change into consideration? #104387
(I don't have a real use case where this could make a big difference, just wondering if you checked it)

Looking at that now. It turns out that if we nest a function inside the st_centroid then we won't load from doc-values. The queries work, but the optimization is not activated. I'm busy debugging that now to see if there is a simple solution, but this could also be a follow-up PR to enable this case.

I did need to make a small change to the local physical planner to get all tests to pass with nested functions. They are not using the doc-values optimization, but at least the queries run with nested fields.

Let's handle that in a follow-up PR. When dealing with a nested expression, the agg will see the result of that computation so it becomes a question on whether that operation should load geo from doc values or source.

elasticsearchmachine · 2024-01-18T10:30:12Z

Pinging @elastic/es-analytical-engine (Team:Analytics)

astefan · 2024-01-18T13:07:14Z

x-pack/plugin/esql/qa/testFixtures/src/main/resources/spatial.csv-spec

+| SORT scalerank DESC
+;
+
+centroid:geo_point | count:long | scalerank:i


Not related to this specific test, but for a more wide range of value handlers some null values could be added to the "airport" data set.

Added more data with null values and have aggregation working on that. The centroid of null is null. The centroid of data including a few nulls is the centroid of the non-null values.
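The null semantics described here can be illustrated with a short sketch. It is hypothetical code, not the aggregator in this PR, and it uses plain sums rather than the Kahan summation the real aggregation uses, purely to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of null handling: nulls are skipped, and all-null input yields null. */
final class NullAwareCentroid {
    /** Each point is {x, y}; a null entry stands for a missing value. */
    static double[] centroid(List<double[]> points) {
        double xSum = 0;
        double ySum = 0;
        long count = 0;
        for (double[] point : points) {
            if (point == null) {
                continue; // nulls do not contribute to the sums or the count
            }
            xSum += point[0];
            ySum += point[1];
            count++;
        }
        // The centroid of nothing (or only nulls) is null.
        return count == 0 ? null : new double[] { xSum / count, ySum / count };
    }

    public static void main(String[] args) {
        List<double[]> points = new ArrayList<>();
        points.add(new double[] { 0, 0 });
        points.add(null);
        points.add(new double[] { 2, 4 });
        double[] c = centroid(points);
        System.out.println(c[0] + ", " + c[1]);
    }
}
```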

astefan · 2024-01-18T13:44:33Z

...in/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/LocalPhysicalPlanOptimizer.java

+ private static class SpatialDocValuesExtraction extends OptimizerRule<AggregateExec> {
+ @Override
+ protected PhysicalPlan rule(AggregateExec aggregate) {
+ var foundAttribute = new Holder<Attribute>(null);


I am not sure if this is to blame or not (a single Holder for attributes, while there may be more than one while searching). So, I tested with FROM airports2 | stats centroid1=ST_CENTROID(location), centroid2=ST_CENTROID(location2) where location and location2 are geo_points. This fails with

Caused by: java.lang.ClassCastException: class org.elasticsearch.compute.data.BytesRefArrayBlock cannot be cast to class org.elasticsearch.compute.data.LongBlock (org.elasticsearch.compute.data.BytesRefArrayBlock and org.elasticsearch.compute.data.LongBlock are in unnamed module of loader java.net.FactoryURLClassLoader @550fa96f)
    at org.elasticsearch.compute.aggregation.spatial.SpatialCentroidGeoPointDocValuesAggregatorFunction.addRawInput(SpatialCentroidGeoPointDocValuesAggregatorFunction.java:64)
    at org.elasticsearch.compute.aggregation.Aggregator.processPage(Aggregator.java:42)
    at org.elasticsearch.compute.operator.AggregationOperator.addInput(AggregationOperator.java:79)
    at org.elasticsearch.compute.operator.Driver.runSingleLoopIteration(Driver.java:214)
    at org.elasticsearch.compute.operator.Driver.run(Driver.java:139)
    at org.elasticsearch.compute.operator.Driver$1.doRun(Driver.java:327)

Also, not sure if this is supposed to work or not, or maybe return a more descriptive and user-friendly error message.

I had a quick look at this and this seems to be the culprit.
This code works for me:

protected PhysicalPlan rule(AggregateExec aggregate) {
    Set<FieldAttribute> fieldAttributes = new HashSet<>();
    PhysicalPlan plan = aggregate.transformDown(UnaryExec.class, exec -> {
        ...
        if (af.field() instanceof FieldAttribute fieldAttribute) {
            // We need to both mark the field to load differently, and change the spatial function to know to use it
            fieldAttributes.add(fieldAttribute);
            changedAggregates = true;
            ...
        if (exec instanceof FieldExtractExec fieldExtractExec && fieldAttributes.size() > 0) {
            // Tell the field extractor that it should extract the field from doc-values instead of source values
            if (fieldAttributes.retainAll(fieldExtractExec.attributesToExtract())) {
                exec = fieldExtractExec.withForStats(fieldAttributes);
            }
        }
        return exec;
        ...

This was a bug, and I changed the Holder to a Set, and wrote a test in spatial.csv-spec that tests this, as well as a PhysicalPlanner test to test the correct behaviour of the planner.
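As a rough illustration of why a Set is needed here, the sketch below models the planner decision with plain field names instead of plan nodes. All names are hypothetical and this is not the optimizer rule itself: it only shows that every field used by a spatial aggregation must be collected, and that only fields the extractor actually loads should be switched to doc-values.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: decide which extracted fields should be loaded from doc-values. */
final class DocValuesMarkingSketch {
    /**
     * @param spatialAggFields fields referenced by spatial aggregations, e.g. ST_CENTROID(location)
     * @param extractedFields  fields the field extractor loads from the index
     */
    static Set<String> docValuesFields(List<String> spatialAggFields, List<String> extractedFields) {
        // A single holder would keep only one of these; a set keeps every aggregated field.
        Set<String> fields = new LinkedHashSet<>(spatialAggFields);
        // retainAll returns true only if the set changed, so its return value is ignored
        // here and the resulting set is checked directly by the caller.
        fields.retainAll(extractedFields);
        return fields;
    }

    public static void main(String[] args) {
        Set<String> marked = docValuesFields(
            List.of("location", "location2"),
            List.of("name", "location", "location2")
        );
        System.out.println(marked);
    }
}
```

With two centroid aggregations over location and location2, both fields end up marked, so neither aggregator receives a block type it was not rewritten to expect.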

astefan

Other than the bug fix for this and the corresponding test, it LGTM.

costin · 2024-01-18T17:44:17Z

...gin/esql/src/test/java/org/elasticsearch/xpack/esql/optimizer/LogicalPlanOptimizerTests.java

This test comes from a separate PR - looks like a git foobar.

I cannot find your comment in the file view, was this something that was fixed by merging main?

Or do you mean the spatial test I added to this file? I could delete that but thought a test of the logical plan for a spatial aggregation was not entirely without value.

Includes a test fix for loading and converting nulls to encoded longs.

The local physical planner only marked a single field for stats loading, but marked all spatial aggregations for stats loading, which led to only one aggregation getting the right data, while the rest would get the wrong data.

Now the planner decides whether to load data from doc-values. To remove the confusion of preferDocValues==false in the non-spatial cases, we use an ENUM with the default value of NONE, to make it clear we're leaving the choice up to the field type in all non-spatial cases.
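The idea of replacing the boolean with an enum can be sketched as below. The enum, its constants other than NONE, and the loader kinds are hypothetical names chosen for this example and are not taken from the PR; the point is only that NONE leaves the decision to the field type, while an explicit preference is something the planner sets for spatial aggregations.

```java
/** Hypothetical sketch of an extraction preference replacing a boolean flag. */
final class ExtractPreferenceSketch {
    /** NONE is the default: the field type decides how to load its values. */
    enum Preference { NONE, DOC_VALUES }

    /** Illustrative loader kinds, assumed for this example only. */
    enum LoaderKind { FIELD_DEFAULT, BYTES_FROM_SOURCE, LONGS_FROM_DOC_VALUES }

    static LoaderKind chooseLoader(boolean isPointField, Preference preference) {
        if (isPointField == false) {
            // Non-spatial fields ignore the preference and use their own default.
            return LoaderKind.FIELD_DEFAULT;
        }
        return preference == Preference.DOC_VALUES
            ? LoaderKind.LONGS_FROM_DOC_VALUES
            : LoaderKind.BYTES_FROM_SOURCE;
    }

    public static void main(String[] args) {
        System.out.println(chooseLoader(true, Preference.DOC_VALUES));
        System.out.println(chooseLoader(true, Preference.NONE));
        System.out.println(chooseLoader(false, Preference.NONE));
    }
}
```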

…computers This was not reproducible on the development machine, but CI machines were sufficiently different to lead to very tiny precision changes over very large Kahan summations. We fixed this by reducing the need for precision checks in clustered integration tests.

This reverts commit 12c6980.

elasticsearchmachine · 2024-01-23T15:15:20Z

@craigtaverner according to this PR's labels, it should not have a changelog YAML, but I can't delete it because it is closed. Please either delete the changelog from the appropriate branch yourself, or adjust the labels.

## Summary
Adds new agg function as in elastic/elasticsearch#104218

### Checklist
- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

craigtaverner added the :Analytics/Geo, Team:Analytics and :Analytics/ES|QL labels Jan 10, 2024

elasticsearchmachine added the v8.13.0 label Jan 10, 2024

craigtaverner force-pushed the esql_point_doc_values branch 3 times, most recently from d5ae6e7 to cc4bf7d January 15, 2024 15:12

craigtaverner marked this pull request as ready for review January 15, 2024 15:35

craigtaverner requested review from costin and iverase January 16, 2024 08:28

craigtaverner added the >enhancement label Jan 16, 2024