- Notifications
You must be signed in to change notification settings - Fork 2.5k
[CALCITE-7346] Prevent overflow in metadata row-count when LIMIT/OFFSET literal exceeds Long range #4707
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…ET literal exceeds Long range
4c932c4 to 2299cd5 Compare | } | ||
| | ||
| final long offset = rel.offset instanceof RexLiteral ? RexLiteral.longValue(rel.offset) : 0; | ||
| final double offset = literalNumericValue(rel.offset, 0D); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A large long value cannot be represented exactly as a double, since doubles have only 53 bits of mantissa.
In particular, very large integer values that differ may have the same representation as doubles.
Does this change the semantics of this code?
BigDecimal should be completely safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for pointing that out. The currently used
"long" could indeed have the same issue. The current logic is all about metadata calculation (like row counts), and it’s actually quite easy for the number of rows to exceed the precision limit, especially in scenarios like the optimizer relying on row counts for cost estimation. If this logic doesn’t affect the actual computation of OFFSET/LIMIT in operators, I suppose it’s acceptable, but there is certainly a hidden risk.
Currently, the interfaces in these three classes all use
"Double" as the return type. Changing them to
"BigDecimal" would require significant modifications. Is it worth doing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pragmatically it is very unlikely that the code will deal with collections with more than 2^53 values.
Most metadata information is approximate anyway, counting on exact comparisons would be wrong.
But this should de considered independently for every kind of metadata.
(As a side note, in some computation modes which deal with multisets it is possible for multisets to have many copies of an element, exceeding 2^53 elements in a collection. In our implementation all sets are represented this way.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may not have fully understood your point. Are you leaning towards accepting the changes in my current PR, or are you suggesting closing the PR? Currently, it only involves three metadata classes that compute RowCount.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if long overflows, so does double, but the error handling is different: long will throw, but double will approximate.
Just make it clear what is happening.
| * Returns the numeric value stored in a literal, throwing if it cannot be | ||
| * represented as a finite double. | ||
| */ | ||
| public static double literalNumericValue(@Nullable RexNode node, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function should be called literalValueApproximatedByDouble
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check if the modified code 496e6d8 meets the requirements of the comments.
| * <p>Throws when the literal exceeds {@link Double#MAX_VALUE} instead of | ||
| * silently rounding to infinity. Doubles still approximate large integers (53 | ||
| * bits of mantissa), so the returned value becomes only an approximation when | ||
| * occupancy reaches that limit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when numbers are very large
See CALCITE-7346