GitHub - ChenghaoMou/text-dedup: All-in-one text de-duplication

Installation

Only tested with Python 3.10 so far.

pip install text-dedup

or

pip install git+https://github.com/ChenghaoMou/text-dedup

Documentation

Github Pages

Features

This repository contains a collection of text deduplication scripts that are ready to use, or modify based on your needs:

RETSim/UniSim, an embedding-based near deduplication (WIP)
MinHash + MinHashLSH, including a spark implementation suitable for large (TB) datasets
64 or 128 bit SimHash
SuffixArray Substring
Bloom Filter
Exact Hash (document-level, line-level/ccnet)

I also have big plans for the future:

Memory benchmark for streaming processing
Inter-dataset deduplication
Rewrite suffix array in Python
A collections of other deduplication methods: SuperMinHash, ProbMinHash, TreeMinHash, BagMinHash, Optimal Densification for Fast and Accurate Minwise Hashing, Fast Similarity Sketching

However, I do not intent to build a general purpose deduplication library, which was the goal of this repo early on. I will gradually retire the pypi package as well. The reason behind it is that each use-case can be wildly different and requires careful design and consideration. I sincerely encourage you to read the script first (they are relatively short) so you can understand what are at stake here when using it. You can use it to bootstrap your own script, or just use it as a reference.

Acknowledgements

This repository is inspired by the following projects, and is heavily influenced by lessons learned from my own participation in BigScience (Apache 2.0) and BigCode (Apache 2.0). There is a blog post about the journey. Feedbacks are welcome!

Datasketch (MIT)
simhash-py and simhash-cpp (MIT)
Deduplicating Training Data Makes Language Models Better (Apache 2.0)
Gaoya (MIT)

Quick Examples

Native PySpark

MODIFY text_dedup/minhash_spark.py FOR YOUR OWN PROJECT AND DATASET FIRST!

Assuming you have a downloaded dataset (in parquet files) under "./temp-data", you can process with file with your local compute by:

export PYSPARK_PYTHON="path to your python with scipy, xxhash, and numpy installed" spark-submit --executor-memory 16g \ --driver-memory 20g \ --executor-cores 3 \ --num-executors 2 \ --packages graphframes:graphframes:0.8.2-spark3.2-s_2.12 \ --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=./log4j.properties" \ --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=./log4j.properties" \ text_dedup/minhash_spark.py\ --input "./temp-data" \ --output "./temp-output" \ --column "text" \ --threshold 0.7

DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------ DEBUG __main__ - Using B=25, R=10 DEBUG __main__ - Loaded documents: 88803 DEBUG __main__ - args.input='./temp-data' DEBUG __main__ - args.output='./temp-output' DEBUG __main__ - args.threshold=0.7 DEBUG __main__ - args.ngram_size=5 DEBUG __main__ - args.min_length=5 DEBUG __main__ - args.num_perm=250 DEBUG __main__ - args.column='text' DEBUG __main__ - id : bigint DEBUG __main__ - text : string DEBUG __main__ - meta : struct<warc_headers:struct<warc-record-id:string,warc-date:string,content-type:string,content-length:int,warc-type:string,warc-identified-content-language:string,warc-refers-to:string,warc-target-uri:string,warc-block-digest:string>,identification:struct<label:string,prob:float>,annotations:array<string>,line_identifications:array<struct<label:string,prob:float>>> DEBUG __main__ - __id__ : bigint DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------ DEBUG __main__ - Initial edges: 52102 DEBUG __main__ - Edges DataFrame: 52102 DEBUG __main__ - Vertices DataFrame: 50206 DEBUG __main__ - Assignment DataFrame: 50206 DEBUG __main__ - Merging records: 88803 INFO __main__ - Saving with 1 partitions and 44092 rows each DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------ DEBUG __main__ - Number of rows before: 88803 DEBUG __main__ - Number of rows after: 44092 DEBUG __main__ - Percentage of rows kept: 49.65% DEBUG __main__ - Output: ./temp-output DEBUG __main__ - Time: 68.80s DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------

Or take a look at bigcode-v2/run.sh on how to run the job with GCP DataProc.

UniSim (WIP)

Based on Google's RETSim model(Github, Arxiv), it is an embedding based on near-deduplication method.

For a large dataset, it would require GPU(s) for fast inference.

python text_dedup/ann_unisim.py --path truthful_qa --name generation --split validation --output temp --column question

Output:

INFO Load Dataset : 5.56s INFO Index Dataset : 8.13s INFO Clustering : 8.72s INFO Filtering : 0.35s INFO Saving : 0.01s INFO Cleaning : 0.00s INFO Total : 22.77s INFO Before : 817 INFO After : 788

Suffix Array Substring Exact Deduplication

# input python -m text_dedup.suffix_array \ --path "oscar-corpus/OSCAR-2201" \ --name "gl" \ --split "train" \ --cache_dir "./cache" \ --output "output/suffix_array/oscar_gl_dedup" \ --column "text" \ --google_repo_path "/Users/chenghao/Downloads/Projects/text-dedup/deduplicate-text-datasets" \ --use_auth_token true # output INFO Loading : 2.75 seconds INFO Preprocessing : 4.78 seconds INFO SuffixArray : 98.29 seconds INFO SelfSimilar : 4.24 seconds INFO Restore : 0.25 seconds INFO Deduplicate : 6.23 seconds INFO Saving : 8.91 seconds INFO Total : 125.45 seconds INFO Before : 180332342 bytes (88803) INFO After : 97646271 bytes (40404)

MinHash Near Deduplication

# input python -m text_dedup.minhash \ --path "oscar-corpus/OSCAR-2201" \ --name "gl" \ --split "train" \ --cache_dir "./cache" \ --output "output/minhash/oscar_gl_dedup" \ --column "text" \ --batch_size 10000 \ --use_auth_token true # output INFO Loading : 2.62 seconds INFO MinHashing : 0.08 seconds INFO Clustering : 2.20 seconds INFO Filtering : 0.53 seconds INFO Saving : 9.86 seconds INFO Total : 15.29 seconds INFO Data Number (before) : 88803 INFO Data Number (after) : 44124 (49.69%) INFO Duplicate Number : 44679 (50.31%) INFO 🤗 Happy Deduplicating 🤗

SimHash Near Deduplication

# input python -m text_dedup.simhash \ --path "oscar-corpus/OSCAR-2201" \ --name "gl" \ --split "train" \ --cache_dir "./cache" \ --output "output/simhash/oscar_gl_dedup" \ --column "text" \ --batch_size 10000 \ --use_auth_token true # output INFO Loading : 2.60 seconds INFO SimHashing : 0.04 seconds INFO Indexing : 28.88 seconds INFO Filtering : 0.88 seconds INFO Saving : 10.41 seconds INFO Total : 42.80 seconds INFO Data Number (before) : 88803 INFO Data Number (after) : 46163 (51.98%) INFO Duplicate Number : 42640 (48.02%) INFO 🤗 Happy Deduplicating 🤗

Exact Hash Exact Deduplication

# input python -m text_dedup.exact_hash \ --path "oscar-corpus/OSCAR-2201" \ --name "gl" \ --split "train" \ --cache_dir "./cache" \ --output "output/exact_hash/oscar_gl_dedup" \ --column "text" \ --batch_size 1000 \ --use_auth_token true # output INFO Loading : 2.95s INFO Processing : 3.79s INFO Filtering : 0.10s INFO Saving : 2.89s INFO Total : 9.72s INFO Before : 88803 INFO After : 47049

Bloom Filter Exact Deduplication

# input python -m text_dedup.bloom_filter \ --path "oscar-corpus/OSCAR-2201" \ --name "gl" \ --split "train" \ --cache_dir "./cache" \ --output "output/bloom_filter/oscar_gl_dedup" \ --error_rate 1e-5 \ --column "text" \ --use_auth_token true --batch_size 1000 # output INFO Loading : 2.72s INFO Processing : 4.84s INFO Filtering : 0.10s INFO Saving : 2.88s INFO Total : 10.54s INFO Before : 88803 INFO After : 47045

Benchmarks

Note

Spark implementation has some overhead for small datasets, so I recommend using the script only when you have a large dataset and enough compute resources.

pinecone/core-2020-05-10-deduplication

See tests/benchmark_core.py for reproduction.

Algorithm	Precision (Duplicates)	Recall (Duplicates)	Precision (Non Duplicates)	Recall (Non Duplicates)	Macro F1 score	Accuracy	Time
UniSim	0.9307	0.8924	0.9055	0.9394	0.9181	0.9054	1305.79s
MinHash Spark	0.957	0.9445	0.9471	0.959	0.952	0.9202	691.77s
MinHash	0.9594	0.9445	0.9474	0.9616	0.9534	0.924	18.88s
SimHash	0.9042	0.721	0.792	0.9329	0.8481	0.8321	644.36s
Exact Title	0.8302	0.5521	0.7098	0.9065	0.77	0.7456	-
Exact Title Matching ¹	0.830	0.50	0.709	0.992	0.757	0.746	-
Simhash Matching ¹	0.697	0.247	0.598	0.985	0.631	0.616	-
Document Vector Similarity ¹	0.912	0.779	0.861	0.986	0.885	0.883	-
Hybrid Method ¹	0.908	0.828	0.899	0.979	0.904	0.903	-
LaBSE²	0.937	0.923	0.930	0.943	0.933	0.919	-
Multilingual USE²	0.917	0.907	0.918	0.927	0.917	0.909	-
Multilingual E5-Base²	0.931	0.908	0.919	0.939	0.924	0.920	-
MinHash + LSH²	0.929	0.902	0.915	0.938	0.921	0.918	-
RETSim Partial-Dup²	0.945	0.941	0.945	0.949	0.945	0.928	-
RETSim Near-Dup²	0.928	0.937	0.942	0.934	0.935	0.926	-

NEWS-COPY

See tests/benchmark_news.py for reproduction.

Adjusted Rand Index (ARI) on NEWS-COPY dataset:

Model/Algorithm	ARI
SimHash	0.612
MinHash (Spark)	0.740
MinHash	0.742
RETSim Near-Dup + ANN*	0.051
n-gram ³	0.440
SimHash²	0.695
MinHash³	0.737
MinHash²	0.783
Multilingual USE²	0.730
Multilingual E5-Base²	0.742
S-BERT³	0.700
RETSim Partial-Dup²	0.831
RETSim Near-Dup²	0.704
Re-ranking ³	0.937
Bi-encoder ³	0.915

*: I can't seem to reproduce the results from the paper.

License

Apache 2.0

Citations

Generally, you can cite this repository as:

@software{chenghao_mou_2023_8364980, author = {Chenghao Mou and  Chris Ha and  Kenneth Enevoldsen and  Peiyuan Liu}, title = {ChenghaoMou/text-dedup: Reference Snapshot}, month = sep, year = 2023, publisher = {Zenodo}, version = {2023.09.20}, doi = {10.5281/zenodo.8364980}, url = {https://doi.org/10.5281/zenodo.8364980} }

The spark version was born from BigCode (Apache 2.0) and BigScience (Apache 2.0), and you can cite the original paper if you want:

@article{ kocetkov2023the, title={The Stack: 3 {TB} of permissively licensed source code}, author={Denis Kocetkov and Raymond Li and Loubna Ben allal and Jia LI and Chenghao Mou and Yacine Jernite and Margaret Mitchell and Carlos Mu{\~n}oz Ferrandis and Sean Hughes and Thomas Wolf and Dzmitry Bahdanau and Leandro Von Werra and Harm de Vries}, journal={Transactions on Machine Learning Research}, issn={2835-8856}, year={2023}, url={https://openreview.net/forum?id=pxpbTdUEpD}, note={} }

Name		Name	Last commit message	Last commit date
Latest commit History 365 Commits
.github/workflows		.github/workflows
deduplicate-text-datasets @ fcf7432		deduplicate-text-datasets @ fcf7432
docs/source		docs/source
notebooks		notebooks
tests		tests
text_dedup		text_dedup
utils		utils
.gitignore		.gitignore
.gitmodules		.gitmodules
.pre-commit-config.yaml		.pre-commit-config.yaml
CITATION.bib		CITATION.bib
Dockerfile		Dockerfile
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
banner.png		banner.png
cobertura.xml		cobertura.xml
compose.yaml		compose.yaml
log4j.properties		log4j.properties
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Installation

Documentation

Features

Acknowledgements

Quick Examples

Benchmarks

License

Citations

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors 7

Uh oh!

Languages

License

ChenghaoMou/text-dedup

Folders and files

Latest commit

History

Repository files navigation

Installation

Documentation

Features

Acknowledgements

Quick Examples

Benchmarks

License

Citations

Footnotes

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors 7

Uh oh!

Languages

Packages