@sundy-li sundy-li commented Feb 4, 2024

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

New HyperLogLog implementation

simple_hll is a simple HyperLogLog implementation in Rust. It is designed to be easy to use and to take less storage space (via a sparse HyperLogLog representation).

Adaptive Sparse serialization of hyperloglog

#[derive(serde::Deserialize, borsh::BorshDeserialize)]
enum HyperLogLogVariant<const P: usize> {
    Empty,
    Sparse { data: Vec<(u16, u8)> },
    Full(Vec<u8>),
}
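To illustrate the adaptive layout, here is a minimal sketch of how a dense register array could be mapped onto the three variants. The `to_variant` helper and its 3-bytes-per-entry threshold are assumptions made for illustration, not the logic in simple_hll; the serde/borsh derives are left out so the snippet builds with the standard library only.

```rust
/// Same shape as the enum above (serde/borsh derives omitted in this sketch).
#[derive(Debug, PartialEq)]
enum HyperLogLogVariant<const P: usize> {
    Empty,
    Sparse { data: Vec<(u16, u8)> },
    Full(Vec<u8>),
}

/// Hypothetical helper: pick the most compact variant for a register array.
/// Assumes `registers.len() == 1 << P` and `P <= 16`, so indexes fit in a u16.
fn to_variant<const P: usize>(registers: &[u8]) -> HyperLogLogVariant<P> {
    assert_eq!(registers.len(), 1 << P);

    // Collect (index, value) pairs for the registers that have been touched.
    let non_zero: Vec<(u16, u8)> = registers
        .iter()
        .enumerate()
        .filter(|(_, v)| **v != 0)
        .map(|(i, v)| (i as u16, *v))
        .collect();

    if non_zero.is_empty() {
        HyperLogLogVariant::Empty
    } else if non_zero.len() * 3 < registers.len() {
        // Each sparse entry costs 3 bytes (u16 index + u8 value),
        // so sparse wins only while 3 * entries < number of registers.
        HyperLogLogVariant::Sparse { data: non_zero }
    } else {
        HyperLogLogVariant::Full(registers.to_vec())
    }
}
```

With P = 14, an untouched sketch maps to `Empty`, a sketch with a handful of touched registers maps to `Sparse`, and a mostly filled one falls back to `Full`.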

With P = 14, the serialized size ranges from 1 byte up to 1<<14 = 16k bytes.

Refactor the table function fuse_statistic and test that the NDV results fall within the expected error-rate range

query T
select *, (ndv - 7500) / 7500 < 0.01625 * 6 from fuse_statistic('db_09_0020', 'a')
----
0 7637 1
1 7730 1
2 7530 1
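The `0.01625 * 6` bound in the query above is six times the usual HyperLogLog standard error, 1.04 / sqrt(2^P), evaluated at P = 12. A small sketch of that arithmetic follows; the helper names are hypothetical, and unlike the SQL predicate this version takes the absolute value of the relative error.

```rust
/// Standard error of a HyperLogLog with 2^p registers: 1.04 / sqrt(2^p).
fn hll_standard_error(p: u32) -> f64 {
    1.04 / ((1u64 << p) as f64).sqrt()
}

/// Hypothetical helper: is `ndv` within `sigmas` standard errors of `expected`?
/// Uses the absolute relative error, which is stricter than the one-sided SQL check.
fn within_bound(ndv: f64, expected: f64, p: u32, sigmas: f64) -> bool {
    ((ndv - expected) / expected).abs() < hll_standard_error(p) * sigmas
}
```

For P = 12 the standard error is 1.04 / 64 = 0.01625, so the bound is 0.0975; the three NDV values in the test output (7637, 7730, 7530) are all within it for an expected value of 7500.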

Additional param to set the error rate in approx_count_distinct function

approx_count_distinct(0.1)(number)
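How the error-rate parameter is translated into a precision is not spelled out here. One plausible mapping, assuming the usual 1.04 / sqrt(2^P) bound, is to pick the smallest P whose standard error does not exceed the requested rate. The function name and the supported range of 4 to 18 are assumptions for this sketch.

```rust
/// Hypothetical mapping from a requested error rate to a precision:
/// the smallest p in 4..=18 with 1.04 / sqrt(2^p) <= error_rate.
/// Returns None when the rate is out of range or cannot be met.
fn precision_for_error_rate(error_rate: f64) -> Option<u32> {
    // Reject NaN, non-positive, and >= 1.0 rates.
    if !(error_rate > 0.0 && error_rate < 1.0) {
        return None;
    }
    (4..=18).find(|&p| 1.04 / ((1u64 << p) as f64).sqrt() <= error_rate)
}
```

Under this mapping, an error rate of 0.1 would select P = 7 (128 registers), since 1.04 / sqrt(128) is about 0.092 while P = 6 gives 0.13.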

Refactor column stats to use HyperLogLog with P = 12, giving a maximum size of 4k (1<<12)

/// Takes at most `1<<12 = 4k` bytes of space, with an error rate of `0.01625`
pub type ColumnStatHLL = HyperLogLog<12>;
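For readers unfamiliar with the structure behind this alias, the following is a minimal dense HyperLogLog parameterized by a const generic `P`. It is a textbook-style sketch, not the simple_hll code: the use of the standard library's `DefaultHasher`, the one-byte registers, and the method names are all assumptions, and it expects P between 7 and 16 so that the bias constant applies.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Minimal dense HyperLogLog with 2^P one-byte registers (illustrative only).
pub struct HyperLogLog<const P: usize> {
    registers: Vec<u8>,
}

impl<const P: usize> HyperLogLog<P> {
    pub fn new() -> Self {
        Self { registers: vec![0; 1 << P] }
    }

    pub fn add<T: Hash>(&mut self, item: &T) {
        let mut hasher = DefaultHasher::new();
        item.hash(&mut hasher);
        let hash = hasher.finish();
        // The top P bits select the register.
        let index = (hash >> (64 - P)) as usize;
        // Rank: 1 + number of leading zeros in the remaining 64 - P bits.
        let rest = hash << P;
        let rank = (rest.leading_zeros() as usize).min(64 - P) as u8 + 1;
        if rank > self.registers[index] {
            self.registers[index] = rank;
        }
    }

    pub fn count(&self) -> f64 {
        let m = (1usize << P) as f64;
        // Bias constant for m >= 128.
        let alpha = 0.7213 / (1.0 + 1.079 / m);
        let sum: f64 = self.registers.iter().map(|&r| 2f64.powi(-(r as i32))).sum();
        let raw = alpha * m * m / sum;
        let zeros = self.registers.iter().filter(|&&r| r == 0).count();
        if raw <= 2.5 * m && zeros > 0 {
            // Small-range correction: linear counting on empty registers.
            m * (m / zeros as f64).ln()
        } else {
            raw
        }
    }
}

/// Same alias as above: 2^12 = 4096 registers.
pub type ColumnStatHLL = HyperLogLog<12>;
```

Adding 7,500 distinct values to a `ColumnStatHLL` and calling `count()` should return an estimate close to 7,500, with a typical relative error on the order of the 0.01625 standard error.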

Acknowledgements

Some code and tests are borrowed from or inspired by:

Reference papers:

Thanks to @jimexist and @crepererum for the initial code, and to the paper's author, Otmar Ertl.

Others

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

@github-actions github-actions bot added the pr-refactor this PR changes the code base without new features or bugfix label Feb 4, 2024
@dantengsky dantengsky added ci-cloud Build docker image for cloud test ci-benchmark Benchmark: run all test and removed ci-cloud Build docker image for cloud test labels Feb 4, 2024
github-actions bot commented Feb 4, 2024

Docker Image for PR

  • tag: pr-14585-b60c11e

note: this image tag is only available for internal use,
please check the internal doc for more details.

@sundy-li sundy-li removed the ci-benchmark Benchmark: run all test label Feb 4, 2024
@sundy-li sundy-li marked this pull request as draft February 4, 2024 08:16
sundy-li commented Feb 4, 2024

I found that the join order is not very accurate (due to inaccurate selectivity); we will need to support #14587 later.

@sundy-li sundy-li marked this pull request as ready for review February 4, 2024 10:25
@BohuTANG BohuTANG added the ci-benchmark Benchmark: run all test label Feb 4, 2024
github-actions bot commented Feb 4, 2024

Docker Image for PR

  • tag: pr-14585-19e3ac3

note: this image tag is only available for internal use,
please check the internal doc for more details.

  case_name: &str,
  snapshot_count: u32,
- table_statistic_count: u32,
+ _table_statistic_count: u32,
Can the arg be removed?

Comment on lines +130 to 135
// assert_eq!(
// ts_count, table_statistic_count,
// "case [{}], check snapshot statistics count",
// case_name
// );
assert_eq!(
useless?

@dantengsky dantengsky marked this pull request as draft February 5, 2024 08:30
sundy-li commented Feb 5, 2024

The segment file may grow too much. Converted to draft now.

@sundy-li sundy-li closed this Feb 27, 2024
Labels

ci-benchmark Benchmark: run all test pr-refactor this PR changes the code base without new features or bugfix

6 participants