An approach to detecting semantically similar Python repositories using pre-trained language models.
This repository contains the notebooks and scripts for our approach to detecting semantically similar Python repositories using pre-trained language models.
Currently, our best-performing model is UniXCoder fine-tuned on the code search task with the AdvTest dataset. For evaluations of different language models on repository similarity comparison, please refer to this Jupyter notebook: notebooks/BiEncoder/Embeddings_evaluation.ipynb
More details on the implementation and applications of our approach can be found under the scripts folder.
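As a rough illustration of the bi-encoder idea behind the similarity comparison, the sketch below compares two repositories by the cosine similarity of their embeddings. It is not the code in scripts/pipeline.py: the helper names `mean_pool` and `cosine_similarity` are hypothetical, the assumption that a repository embedding is the mean of its function embeddings is ours for illustration, and the toy vectors stand in for real language-model embeddings.

```python
import math
from typing import List, Sequence


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average a list of equal-length vectors into one vector.

    Assumption for illustration: a repository is represented by the
    mean of its per-function embeddings.
    """
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 2-dimensional "function embeddings" (placeholders, not model output).
repo_a = mean_pool([[1.0, 0.0], [0.0, 1.0]])
repo_b = mean_pool([[1.0, 1.0]])
repo_c = mean_pool([[1.0, -1.0]])

sim_ab = cosine_similarity(repo_a, repo_b)  # same direction
sim_ac = cosine_similarity(repo_a, repo_c)  # orthogonal directions
print(f"a vs b: {sim_ab:.3f}, a vs c: {sim_ac:.3f}")
```

In practice the toy vectors would be replaced by embeddings produced by the chosen language model; a score closer to 1 indicates that the two repositories point in a more similar direction in embedding space.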
RepoSnipy is a neural search engine for discovering similar Python repositories on GitHub, powered by RepoSim. Please feel free to give it a try!
RepoSim
├── LICENSE
├── README.md
├── data
│   ├── df2txt.py  # Convert PoolC dataset for clone detection fine-tuning script
│   ├── repo_topic.json  # Topic-Repos mapping
│   └── repo_topic.py  # Script to select repos from topics
├── notebooks
│   ├── BiEncoder
│   │   ├── Embeddings_evaluation.ipynb  # Evaluations for comparing different language models
│   │   ├── RepoSim.ipynb  # Our approach's implementation
│   │   └── UnixCoder_C4_Evaluation.ipynb
│   └── CrossEncoder
│       ├── Clone_Detection_C4_Evaluation.ipynb
│       ├── HungarianAlgorithm.ipynb  # Cross-encoder approaches for repo similarity comparison
│       └── keonalgorithms-TheAlgorithmsPython.csv  # Evaluation results by HungarianAlgorithm.ipynb
└── scripts
    ├── LICENSE
    ├── PlayGround.ipynb  # For experimenting with repo embeddings
    ├── README.md
    ├── pipeline.py  # Our approach's implementation as a HuggingFace pipeline
    ├── repo_sim.py
    └── requirements.txt

Distributed under the MIT License. See LICENSE for more information.