Kernel Divergence Score

Official implementation of the ICML 2025 paper "How Contaminated Is Your Benchmark? Measuring Dataset Leakage in Large Language Models with Kernel Divergence" by Hyeong Kyu Choi*, Maxim Khanov*, Hongxin Wei, and Yixuan Li.

Setup Environment

git clone https://github.com/deeplearning-wisc/kernel-divergence-score.git

cd kernel-divergence-score

conda env create -f environment.yml

conda activate kds

Finally, create a file named "token" (with no file extension) in the directory that contains src/, and put your Hugging Face access token in it.
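One way to create that file from the command line is sketched below. The helper name write_token and the HF_TOKEN environment variable are assumptions made for illustration, not part of the repository. The sketch writes the token without a trailing newline and restricts the file permissions, since the file holds a credential.

```shell
# write_token is a hypothetical helper, not part of the repository.
#   $1: the Hugging Face token string
#   $2: directory that contains src/ (defaults to the current directory)
write_token() {
    # Use a subshell so the restrictive umask does not leak into the caller.
    ( umask 077 && printf '%s' "$1" > "${2:-.}/token" )
}

# Usage, from the repository root (HF_TOKEN is an assumed variable name):
#   write_token "$HF_TOKEN"
```

The file is named exactly token, with no extension, as the setup step requires.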

Experiments

The experiment commands are in scripts/. Each shell script computes kernel divergence scores for contamination rates from 0.0 to 1.0, using seed 0.

sh scripts/wikimia.sh

sh scripts/bookmia.sh

sh scripts/arxivtection.sh

sh scripts/pile.sh
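To run all four benchmarks back to back, a small loop along these lines may be convenient. This is a sketch rather than something shipped with the repository: run_experiments is a hypothetical helper, and it assumes you call it from the repository root so that any relative paths inside the scripts still resolve.

```shell
# run_experiments is a hypothetical helper, not part of the repository.
#   $1: directory holding the experiment scripts (defaults to scripts)
run_experiments() {
    dir="${1:-scripts}"
    for name in wikimia bookmia arxivtection pile; do
        echo "Running $dir/$name.sh"
        # Stop at the first script that fails instead of continuing.
        sh "$dir/$name.sh" || return 1
    done
}

# Usage, from the repository root:
#   run_experiments
```

Stopping at the first failure keeps a broken environment or a missing token from silently producing partial results across the remaining datasets.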

Citation

@inproceedings{choi2024beyond,
  title = {How Contaminated Is Your Benchmark? Measuring Dataset Leakage in Large Language Models with Kernel Divergence},
  author = {Hyeong Kyu Choi and Maxim Khanov and Hongxin Wei and Yixuan Li},
  booktitle = {International Conference on Machine Learning},
  year = {2025}
}
