DartMinHash & Efficient Rejection Sampling: Fast Sketching for Weighted Sets

This crate provides the implementation of DartMinHash (1) and Efficient Rejection Sampling (3) algorithm for estimation of weighted Jaccard similarity. To reproduce the algorithm in the paper, we use the same tabulation hashing idea (4). Mersenne Twister PRNG was used as seed. Other high quality 64-bit hash functions such as xxhash-rust or whyash-rs should also work as well.

Note: DartMinHash is significantly faster than (Efficient) Rejection Sampling (2,3) for very sparse vectors, that is the number of nonezero elements (d) is less than ~5% of vector dimension (D) on average for all vectors. This is especially true for large-scale datasets. However, For RS and ERS, a maxmimum value of weight for input vector must be known (dimension-wise). Otherwise, the estimation is significantly biased (6). Therefore, general applicability is limited by the required priori knowledge of sharp upper bounds for $w_{max}(d)$. Also, ERS is not unbiased (3).

Install & test

Add below lines to your Cargo.toml dependencies. Official release in crates.io is here.

dartminhash = "0.1"

Test case to evaulate the accuracy of the DartMinHash algorithm.

cargo test --release dartminhash_approximates_weighted_jaccard -- --nocapture

Test for true weighted Jaccard from 0.005 to 0.98. Output for DartMinHash:

true weighted Jaccard: 0.9801980198019941 estimated weighted Jaccard: 0.98046875 true weighted Jaccard: 0.9230769230769282 estimated weighted Jaccard: 0.921142578125 true weighted Jaccard: 0.8691588785046781 estimated weighted Jaccard: 0.872802734375 true weighted Jaccard: 0.8181818181818195 estimated weighted Jaccard: 0.8193359375 true weighted Jaccard: 0.7391304347825998 estimated weighted Jaccard: 0.73876953125 true weighted Jaccard: 0.6666666666666655 estimated weighted Jaccard: 0.67138671875 true weighted Jaccard: 0.5999999999999979 estimated weighted Jaccard: 0.6025390625 true weighted Jaccard: 0.5384615384615447 estimated weighted Jaccard: 0.53662109375 true weighted Jaccard: 0.48148148148148356 estimated weighted Jaccard: 0.484619140625 true weighted Jaccard: 0.42857142857142844 estimated weighted Jaccard: 0.4267578125 true weighted Jaccard: 0.37931034482758635 estimated weighted Jaccard: 0.37060546875 true weighted Jaccard: 0.3333333333333311 estimated weighted Jaccard: 0.336181640625 true weighted Jaccard: 0.25000000000000155 estimated weighted Jaccard: 0.245361328125 true weighted Jaccard: 0.17647058823529352 estimated weighted Jaccard: 0.17626953125 true weighted Jaccard: 0.11111111111111141 estimated weighted Jaccard: 0.109130859375 true weighted Jaccard: 0.05263157894736854 estimated weighted Jaccard: 0.05224609375 true weighted Jaccard: 0.02564102564102588 estimated weighted Jaccard: 0.026123046875 true weighted Jaccard: 0.005025125628140735 estimated weighted Jaccard: 0.004638671875

Test case to evaulate the accuracy of RS and ERS algorithms.

cargo test --release ers_approximates_weighted_jaccard -- --nocapture

Test for true weighted Jaccard from 0.005 to 0.98. Output for ERS:

true weighted Jaccard: 0.980198019801974 estimated weighted Jaccard: 0.981689453125 true weighted Jaccard: 0.9230769230769282 estimated weighted Jaccard: 0.930419921875 true weighted Jaccard: 0.869158878504667 estimated weighted Jaccard: 0.873779296875 true weighted Jaccard: 0.818181818181824 estimated weighted Jaccard: 0.813232421875 true weighted Jaccard: 0.7391304347826144 estimated weighted Jaccard: 0.743408203125 true weighted Jaccard: 0.6666666666666666 estimated weighted Jaccard: 0.6650390625 true weighted Jaccard: 0.6000000000000038 estimated weighted Jaccard: 0.593994140625 true weighted Jaccard: 0.5384615384615451 estimated weighted Jaccard: 0.52978515625 true weighted Jaccard: 0.4814814814814917 estimated weighted Jaccard: 0.477783203125 true weighted Jaccard: 0.42857142857142766 estimated weighted Jaccard: 0.4443359375 true weighted Jaccard: 0.3793103448275924 estimated weighted Jaccard: 0.36767578125 true weighted Jaccard: 0.33333333333333287 estimated weighted Jaccard: 0.32275390625 true weighted Jaccard: 0.24999999999999994 estimated weighted Jaccard: 0.25390625 true weighted Jaccard: 0.17647058823529382 estimated weighted Jaccard: 0.1767578125 true weighted Jaccard: 0.11111111111111145 estimated weighted Jaccard: 0.112060546875 true weighted Jaccard: 0.052631578947368515 estimated weighted Jaccard: 0.05126953125 true weighted Jaccard: 0.025641025641025373 estimated weighted Jaccard: 0.030029296875 true weighted Jaccard: 0.00502512562814075 estimated weighted Jaccard: 0.005859375

Usage

DartMinhash:

use dartminhash::dartminhash::DartMinHash; use dartminhash::rng_utils::mt_from_seed; use dartminhash::similarity::jaccard_estimate_from_minhashes; fn main() { let mut rng = mt_from_seed(42); let k = 128; let dartminhash = DartMinHash::new_mt(&mut rng, k); // Weighted inputs: overlap in IDs, but weights differ a bit let sample_a = vec![ (5, 1.2), (17, 0.9), (23, 1.1), (42, 0.95), (100, 1.0), ]; let sample_b = vec![ (5, 1.0), (17, 1.0), (44, 1.1), (100, 1.05), ]; let sketch_a = dartminhash.sketch(&sample_a); let sketch_b = dartminhash.sketch(&sample_b); let est_jaccard = jaccard_estimate_from_minhashes(&sketch_a, &sketch_b); println!("Estimated weighted Jaccard similarity: {:.4}", est_jaccard); }

Efficient Rejection Sampling:

use dartminhash::{ErsWmh}; use dartminhash::rng_utils::mt_from_seed; // Tight, real-valued caps: m_i = max_s w_i(s) (no ceil, no max(1)) fn caps_from_sets(d: usize, sets: &[&[(u64, f64)]]) -> Vec<f64> { let mut m = vec![0.0f64; d]; for s in sets { for &(id, w) in *s { if w > 0.0 { let idx = id as usize; if w > m[idx] { m[idx] = w; } } } } m } fn main() { let mut rng = mt_from_seed(1337); let d: usize = 200_000; let k: u64 = 1024; let L: u64 = 512; // with tight caps you can usually keep this modest // Two weighted vectors let a = vec![(5, 1.2), (17, 0.9), (23, 1.1), (42, 0.95), (100, 1.0)]; let b = vec![(5, 1.0), (17, 1.0), (44, 1.1), (100, 1.05)]; // Caps must dominate both vectors; length d should cover the largest id+1 let m_per_dim: Vec<f64> = caps_from_sets(d, &[&a, &b]); // New constructor takes &[f64] caps let ers = ErsWmh::new_mt(&mut rng, &m_per_dim, k); // ERS returns k (id, rank) pairs; collisions on id estimate Jaccard let sk_a = ers.sketch(&a, Some(L)); let sk_b = ers.sketch(&b, Some(L)); let hits = sk_a.iter().zip(&sk_b).filter(|(x, y)| x.0 == y.0).count(); let j_est = hits as f64 / k as f64; println!("ERS (L={}) estimated weighted Jaccard: {:.4}", L, j_est); }

Choosing L for Efficent Rejection Sampling (ERS)

The best L for achiving a given accuracy is related to the sparsity of the data (see ERS paper here). The author recommended an equation for L: $l=\frac{\alpha}{s}$, where s is the sparsity of the data (d/D, see above) while $\alpha$ is a constant, normally 0.5 to 5. In real-world datasets, $\alpha$ = 5 is better.

References

1.Christiani, T., 2020. Dartminhash: Fast sketching for weighted sets. arXiv preprint arXiv:2005.11547.

2.Shrivastava, A., 2016. Simple and efficient weighted minwise hashing. Advances in Neural Information Processing Systems, 29.

3.Li, X. and Li, P., 2021, May. Rejection sampling for weighted jaccard similarity revisited. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 35, No. 5, pp. 4197-4205).

4.Pǎtraşcu, M. and Thorup, M., 2012. The power of simple tabulation hashing. Journal of the ACM (JACM), 59(3), pp.1-50.

5.Ertl, O. (2025) “TreeMinHash: Fast Sketching for Weighted Jaccard Similarity Estimation”. Zenodo. doi: 10.5281/zenodo.16730965.

6.Ertl, O., 2018, July. Bagminhash-minwise hashing algorithm for weighted sets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 1368-1377).

Name		Name	Last commit message	Last commit date
Latest commit History 38 Commits
src		src
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
DartMinHash_logo.png		DartMinHash_logo.png
LICENSE-MIT		LICENSE-MIT
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

DartMinHash & Efficient Rejection Sampling: Fast Sketching for Weighted Sets

Install & test

Usage

Choosing L for Efficent Rejection Sampling (ERS)

References

About

Uh oh!

Releases 1

Packages

Languages

License

jianshu93/dartminhash-rs

Folders and files

Latest commit

History

Repository files navigation

DartMinHash & Efficient Rejection Sampling: Fast Sketching for Weighted Sets

Install & test

Usage

Choosing L for Efficent Rejection Sampling (ERS)

References

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages