ESQL: planning perf improvements over many fields #124395

@costin

Description

We've noticed that planning shows up in the profiler when dealing with huge mappings (10k-100k+ fields).
Overall, the goal is to add conditionals that prevent code from executing over such a large number of objects, by avoiding iteration in the first place.
This meta issue contains a list of (potential) improvements to apply in this scenario, broken down into two main buckets:

Avoiding execution

Optimized execution of existing code

  • LogicalVerifier#verify

  • PruneColumns

  • PropagateUnmappedFields

  • PropagateEvalFoldables

  • stop using super inside TypedAttribute/NamedExpression/Attribute/FieldAttribute equals

Currently the equals methods delegate to their parent, which keeps the code compact but causes suboptimal equality checks, since the node's children are compared before its own attributes. Better to compare all of the node's scalar properties first and delegate to the children collection as a last resort; a sketch of the intended ordering follows.
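A minimal, self-contained sketch of that ordering (ExampleAttribute and its fields are hypothetical stand-ins, not the actual ESQL classes):

```java
import java.util.List;
import java.util.Objects;

// Hypothetical node class illustrating the comparison order; the real
// ESQL Attribute hierarchy has more fields and delegates to super.
final class ExampleAttribute {
    private final String name;                     // cheap to compare
    private final int id;                          // cheap to compare
    private final List<ExampleAttribute> children; // potentially huge

    ExampleAttribute(String name, int id, List<ExampleAttribute> children) {
        this.name = name;
        this.id = id;
        this.children = children;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        if (obj == null || getClass() != obj.getClass()) {
            return false;
        }
        ExampleAttribute other = (ExampleAttribute) obj;
        // Cheap scalar checks first so mismatches short-circuit early...
        return id == other.id
            && Objects.equals(name, other.name)
            // ...and the expensive collection comparison runs last.
            && Objects.equals(children, other.children);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name, id, children);
    }
}
```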

  • use collection hashing before performing attributes equality
    To avoid comparing large collections element by element, compare a (cached) hash first and only iterate when the hashes match; see the sketch below.
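A hedged sketch of the idea (AttributeSet and cachedHash are illustrative names, not the actual ESQL API): two collections whose hashes differ cannot be equal, so a single int comparison can skip the element-by-element walk.

```java
import java.util.List;

// Hypothetical wrapper illustrating hash-first comparison of a large
// attribute list; the names here are illustrative, not the ESQL API.
final class AttributeSet {
    private final List<String> attributes;
    private final int cachedHash; // computed once, reused on every equals call

    AttributeSet(List<String> attributes) {
        this.attributes = List.copyOf(attributes);
        this.cachedHash = this.attributes.hashCode();
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        if (obj == null || getClass() != obj.getClass()) {
            return false;
        }
        AttributeSet other = (AttributeSet) obj;
        // Different hashes guarantee inequality: bail out in O(1) instead
        // of iterating over (possibly 100k+) attributes.
        if (cachedHash != other.cachedHash) {
            return false;
        }
        // Equal hashes may still collide, so confirm with the full walk.
        return attributes.equals(other.attributes);
    }

    @Override
    public int hashCode() {
        return cachedHash;
    }
}
```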

  • optimize Node#forEachProperty
    Reorder the short-circuit so the cheap typeToken.isInstance(prop) check runs before the linear children.contains(prop) scan:
    before: prop != children && children.contains(prop) == false && typeToken.isInstance(prop)
    after: prop != children && typeToken.isInstance(prop) && children.contains(prop) == false
    ESQL: Lazy collection copying during node transform #124424

  • look into removing/replacing children.contains(prop) inside Node#forEachProperty
    A LinkedHashSet would make the lookup O(1) while preserving insertion order; however, it would prevent a child from appearing more than once, which is a problem for projections with duplicate fields (keep a,a,a). See the sketch below.
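A quick, self-contained illustration of that concern (hypothetical example, not ESQL code):

```java
import java.util.LinkedHashSet;
import java.util.List;

public class DuplicateChildrenDemo {
    public static void main(String[] args) {
        // Children of a projection like `keep a, a, a` — duplicates matter.
        List<String> children = List.of("a", "a", "a");

        // A LinkedHashSet preserves order and makes contains() O(1)...
        LinkedHashSet<String> asSet = new LinkedHashSet<>(children);

        System.out.println(children); // [a, a, a]
        System.out.println(asSet);    // [a] -- the duplicate fields are lost
    }
}
```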

  • optimize NameId#hashCode to avoid boxing the id into a varargs array (use Long.hashCode(id) instead; see the sketch below) ESQL: Lazy collection copying during node transform #124424
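A before/after sketch, assuming (as the issue implies) that NameId wraps a primitive long and currently hashes it through the varargs-based Objects.hash:

```java
import java.util.Objects;

// Hypothetical minimal NameId, assuming it wraps a primitive long id.
final class NameId {
    private final long id;

    NameId(long id) {
        this.id = id;
    }

    // Before: autoboxes `id` into a Long and allocates an Object[]
    // on every call, just to hash a single primitive.
    int slowHashCode() {
        return Objects.hash(id);
    }

    // After: pure primitive arithmetic, no allocation.
    @Override
    public int hashCode() {
        return Long.hashCode(id);
    }
}
```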

  • replace the Java stream API with regular for-loops

Though concise, stream(), collect(), reduce() & co. are slower than their equivalent for-each loops and pollute stack traces. They have an edge in parallel processing which, depending on the data size, can yield better results; that is not the case here. The sketch below shows the kind of mechanical rewrite meant.
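An illustrative example of the rewrite (not a specific ESQL call site):

```java
import java.util.ArrayList;
import java.util.List;

public class StreamVsLoop {
    // Stream version: concise, but adds lambda/collector overhead and
    // deepens the stack trace when something throws.
    static List<String> upperStream(List<String> names) {
        return names.stream()
            .filter(n -> n.isEmpty() == false)
            .map(String::toUpperCase)
            .toList();
    }

    // For-loop version: same result, cheaper on the hot path and with a
    // flat stack trace.
    static List<String> upperLoop(List<String> names) {
        List<String> result = new ArrayList<>(names.size());
        for (String n : names) {
            if (n.isEmpty() == false) {
                result.add(n.toUpperCase());
            }
        }
        return result;
    }
}
```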
