Skip to content

Conversation

@craigtaverner
Copy link
Contributor

@craigtaverner craigtaverner commented Jan 10, 2024

Support st_centroid aggregation on geo_point and cartesian_point data. All aggregations will combined the original data into new data, and as such we can load from doc-values for performance. This feature is both the first spatial aggregation for ES|QL as well as support for loading points from doc-values instead of from source, but only for spatial aggregations.

Since ES|QL aggregations (ie the STATS command) are split into two phases, an initial aggregation on the data nodes, and the final aggregation on the coordinator node, this allows us to only need to support doc-values on the data nodes. The data flow is as follows:

  • Data node uses FieldExtractExec to tell the BlockLoaderContext that the blocks should be loaded with forStats==true, and the GeoPoint and cartesian Point field types will detect this boolean and return LongBlock loaders.
  • The SpatialAggregateFunction on the data node will be notified that it should support LongBlock input, and will decode the long into (double x, double y) before aggregating, and producing intermediate results (four DoubleBLock for xSum, xComp, ySum, yCom and one LongBlock for count).
  • The SpatialAggregateFunction on the coordinator node only receives intermediate results, and will never receive actual points, so does not care what block type was originally used, and will produce the final centroid in WKB format.

This design has a few advantages over the original plan of supporting both WKB and encoded longs in the full stack:

  • Only one backing type for each explicit type (no need to bring back the LongBlock support we used in 8.12)
  • The decision to switch to LongBlock in the data loading from the index can be made very late, at the LocalPhysicalPlanOptimizer stage, isolating it to a very small set of changes, and also only on the data nodes at the data loading part.

Note that this solution will not support the use of doc-values in ST_CENTROID if some other STATS command is run first, because only the first STATS command will be run on the data nodes. For example:

FROM airports | STATS count = COUNT(*) BY scalerank, location | LIMIT 10 | STATS centroid=ST_CENTROID(location), count=SUM(count) 

In this case the ST_CENTROID will use the source values for the calculation. We have decided to favour the simpler solution of only doing local physical planning for doc-values because support of doc-values in the coordinator node requires a much more complex solution with full-stack support of both source and doc-values versions of the point types.

Fixes #104656 104656

@craigtaverner craigtaverner added :Analytics/Geo Indexing, search aggregations of geo points and shapes Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) :Analytics/ES|QL AKA ESQL labels Jan 10, 2024
@craigtaverner craigtaverner force-pushed the esql_point_doc_values branch 3 times, most recently from d5ae6e7 to cc4bf7d Compare January 15, 2024 15:12
@craigtaverner craigtaverner marked this pull request as ready for review January 15, 2024 15:35
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@elasticsearchmachine
Copy link
Collaborator

Hi @craigtaverner, I've created a changelog YAML for you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the latest design, the use of forStats==true is limited to the local physical plan, so most of the time it is completely correct to set this to false. Should we bother changing all the callers to this method when the case where it is needed is limited to a single location?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem here is that the entire test class PointFieldMapperTests has not been setup for synthetic source tests, which GeoPointFieldMapperTests has, and this support is needed for the relevant tests. We can investigate adding this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please raise an issue for this one to not loose track of it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done - #104504

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've considered alternative names, like forDisplay (opposite of for stats), but every choice has pros and cons, so I've left it like this for now, since this most closely matches current needs.

If we start using this for non-stats use cases, like spatial predicates that benefit from doc-values, we can change the name of the method at that point.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Predicate and top-n run on the results of functions consuming the points will want it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added support for inheritence in finding the static methods used in aggregations, because, for example, the SpatialCentroid can be configured in four different ways (geo vs cartesian, and doc-values vs source values). Used only single-level inheritance for simplicity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look at the changes in AggregateMapper to see how we supported these combinations of generated function names.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the heart of the solution for figuring out when to use doc-values and when to use source values. It only runs on the local node, so we only consider doc-values on the local node, and it re-writes spatial aggregations to mark fields as doc-values so the right spatial aggregation functions are executed, as well as finding the fields involved, and re-writing the FieldExtractExec to load that field as doc-values from the index.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the re-writing to use forStats is done only on the local node, this knowledge does not need to be serialised across the cluster. It is therefor not required to update the transport-version, and this older constructor remains the main constructor for the entire stack. If we decide at some point to plan this knowledge in the logical planner, we would need to serialize this, and update the transport version too. At this point we see no need to take that step.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where the additional combination of function names is generated. So far only spatial aggregations need this, but the idea can be used by others if there is ever value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain what this is about in more detail? I'd kind of imagined the withForStats would be enough to enable the optimization and the mapper would just call the single function. I imagine I'm missing some part of how the linkage works though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take a look at SpatialCentroid.supplier(). There are four possible combinations, based on two spatial types GEO_POINT and CARTESIAN_POINT, and two loading types DocValues and SourceValues. The code generator needs to generate functions for all four combinations. The AggregateMapper deals with many possible combinations, but not this specific set of DocValues and SourceValues, so I added these using the same pattern I see here for other combinations. Without this fix the classloader cannot find the classes to load for the four combinations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this fix the classloader cannot find the classes to load for the four combinations.

That's the trick. I suppose we're using some kind of trick to get these linked up. Scalars just use booleans for it. I'd kind of prefer that. But now isn't the time to rework that.

Copy link
Contributor Author

@craigtaverner craigtaverner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did my own review, with a few suggested changes as well as some explanatory comments.

@costin
Copy link
Member

costin commented Jan 17, 2024

Craig, leaving a quick comment that I'll get to review this PR however due to its size but it will take some time.

Copy link
Contributor

@luigidellaquila luigidellaquila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @craigtaverner, LGTM!

I just left a comment, not sure it's really important but maybe worth checking

var changed = as.replaceChild(af.withDocValues());
changedAggregates = true;
if (af.field() instanceof Attribute fieldAttribute) {
foundAttribute.set(fieldAttribute);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a double-check: are you taking this recent change into consideration? #104387
(I don't have a real use case where this could make a big difference, just wondering if you checked it)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at that now. It turns out that if we nest a function inside the st_centroid then we won't load from doc-values. The queries work, but the optimization is not activated. I'm busy debugging that now to see if there is a simple solution, but this could also be a followup PR to enabled this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did need to make a small change to the local physical planner to get all tests to pass with nested functions. They are not using the doc-values optimization, but at least the queries run with nested fields.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's handle that in a follow-up PR. When dealing with a nested expression, the agg will see the result of that computation so it becomes a question on whether that operation should load geo from doc values or source.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

| SORT scalerank DESC
;

centroid:geo_point | count:long | scalerank:i
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to this specific test, but for a more wide range of value handlers some null values could be added to the "airport" data set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added more data with null values and have aggregation working on that. The centroid of null is null. The centroid of data including a few nulls is the centroid of the non-null values.

private static class SpatialDocValuesExtraction extends OptimizerRule<AggregateExec> {
@Override
protected PhysicalPlan rule(AggregateExec aggregate) {
var foundAttribute = new Holder<Attribute>(null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this is to blame or not (a single Holder for attributes, while there may be more than one while searching). So, I tested with FROM airports2 | stats centroid1=ST_CENTROID(location), centroid2=ST_CENTROID(location2) where location and location2 are geo_points. This fails with

Caused by: java.lang.ClassCastException: class org.elasticsearch.compute.data.BytesRefArrayBlock cannot be cast to class org.elasticsearch.compute.data.LongBlock (org.elasticsearch.compute.data.BytesRefArrayBlock and org.elasticsearch.compute.data.LongBlock are in unnamed module of loader java.net.FactoryURLClassLoader @550fa96f) at org.elasticsearch.compute.aggregation.spatial.SpatialCentroidGeoPointDocValuesAggregatorFunction.addRawInput(SpatialCentroidGeoPointDocValuesAggregatorFunction.java:64) at org.elasticsearch.compute.aggregation.Aggregator.processPage(Aggregator.java:42) at org.elasticsearch.compute.operator.AggregationOperator.addInput(AggregationOperator.java:79) at org.elasticsearch.compute.operator.Driver.runSingleLoopIteration(Driver.java:214) at org.elasticsearch.compute.operator.Driver.run(Driver.java:139) at org.elasticsearch.compute.operator.Driver$1.doRun(Driver.java:327) 

Also, not sure if this is supposed to work or not, or maybe return a more descriptive and user-friendly error message.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a quick look at this and this seems to be the culprit.
This code works for me:

 protected PhysicalPlan rule(AggregateExec aggregate) { Set<FieldAttribute> fieldAttributes = new HashSet<>(); PhysicalPlan plan = aggregate.transformDown(UnaryExec.class, exec -> { ... if (af.field() instanceof FieldAttribute fieldAttribute) { // We need to both mark the field to load differently, and change the spatial function to know to use it fieldAttributes.add(fieldAttribute); changedAggregates = true; ... if (exec instanceof FieldExtractExec fieldExtractExec && fieldAttributes.size() > 0) { // Tell the field extractor that it should extract the field from doc-values instead of source values if (fieldAttributes.retainAll(fieldExtractExec.attributesToExtract())) { exec = fieldExtractExec.withForStats(fieldAttributes); } } return exec; ... 
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a bug, and I changed the Holder to a Set, and wrote a test in spatial.csv-spec that tests this, as well as a PhysicalPlanner test to test the correct behaviour of the planner.

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the bug fix for this and the corresponding test, it LGTM.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test comes from a separate PR - looks like a git foobar.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot find your comment in the file view, was this something that was fixed by merging main?

Or do you mean the spatial test I added to this file? I could delete that but thought a test of the logical plan for a spatial aggregation was not entirely without value.

Includes a test fix for loading and converting nulls to encoded longs.
The local physical planner only marked a single field for stats loading, but marked all spatial aggregations for stats loading, which load to only one aggregation getting the right data, while the rest would get the wrong data.
Now the planner decides whether to load data from doc-values. To remove the confusion of preferDocValues==false in the non-spatial cases, we use an ENUM with the default value of NONE, to make it clear we're leaving the choice up to the field type in all non-spatial cases.
…computers This was not reproducible on the development machine, but CI machines were sufficiently different to lead to very tiny precision changes over very large Kahan summations. We fixed this by reducing the need for precision checks in clustered integration tests.
@craigtaverner craigtaverner merged commit eb1c490 into elastic:main Jan 23, 2024
@craigtaverner craigtaverner changed the title ESQL: Support reading points from doc-values for STATS ESQL: Support reading points from doc-values for STATS ST_CENTROID Jan 23, 2024
@elasticsearchmachine
Copy link
Collaborator

@craigtaverner according to this PR's labels, it should not have a changelog YAML, but I can't delete it because it is closed. Please either delete the changelog from the appropriate branch yourself, or adjust the labels.

dej611 added a commit to elastic/kibana that referenced this pull request Jan 29, 2024
## Summary Adds new agg function as in elastic/elasticsearch#104218 ### Checklist - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
CoenWarmer pushed a commit to CoenWarmer/kibana that referenced this pull request Feb 15, 2024
## Summary Adds new agg function as in elastic/elasticsearch#104218 ### Checklist - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
fkanout pushed a commit to fkanout/kibana that referenced this pull request Mar 4, 2024
## Summary Adds new agg function as in elastic/elasticsearch#104218 ### Checklist - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL :Analytics/Geo Indexing, search aggregations of geo points and shapes >docs General docs changes >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:Docs Meta label for docs team v8.13.0

7 participants