Source code for our EMNLP 2025 paper: "CodeRAG: Finding Relevant and Necessary Knowledge for Retrieval-Augmented Repository-Level Code Completion" [arXiv].
1. Install uv.
2. Sync dependencies and activate the virtual environment:

```bash
uv sync
source .venv/bin/activate
```

Before running the scripts, download the benchmarks (recceval and cceval) and edit the configuration file `config/config.toml`.

Then execute the Python scripts sequentially:
```bash
python scripts/build_query.py
```
- Generates query strings from the benchmark dataset.
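The actual query construction is defined in `build_query.py`; as a rough, hypothetical sketch (the `prefix` field, the JSONL layout, and the file names are assumptions, not the repository's real format), a retrieval query is often derived from the tail of the unfinished code:

```python
import json

def build_query(example: dict, max_lines: int = 10) -> str:
    """Use the last few lines of the unfinished code as the retrieval query."""
    prefix_lines = example["prefix"].splitlines()  # "prefix" is a hypothetical field name
    return "\n".join(prefix_lines[-max_lines:])

with open("benchmark.jsonl") as f:   # hypothetical input path
    queries = [build_query(json.loads(line)) for line in f]

with open("queries.json", "w") as f:  # hypothetical output path
    json.dump(queries, f, indent=2)
```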
 
```bash
python scripts/retrieve.py
```
- Retrieves the top-k relevant code blocks using the configured retriever.
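The retriever is selected in `config/config.toml`; as one illustrative stand-in (not necessarily what the repository uses), a sparse BM25 retriever over tokenized code blocks looks like this:

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Toy candidate code blocks gathered from the repository.
code_blocks = [
    "def add(a, b):\n    return a + b",
    "class Stack:\n    def push(self, x): ...",
]
bm25 = BM25Okapi([block.split() for block in code_blocks])

query = "def add(x, y):"
top_k = bm25.get_top_n(query.split(), code_blocks, n=1)  # k=1 for illustration
print(top_k)
```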
 
```bash
python scripts/rerank.py
```
- Reranks the retrieved code blocks based on their estimated importance.
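The paper's importance estimator is implemented in `rerank.py`; purely as an illustration of the reranking pattern (the cross-encoder and model name below are assumptions, not the repository's method), candidates can be rescored against the query and sorted:

```python
from sentence_transformers import CrossEncoder  # pip install sentence-transformers

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # hypothetical model choice
query = "def add(x, y):"
candidates = [
    "def add(a, b):\n    return a + b",
    "class Stack:\n    def push(self, x): ...",
]

# Score each (query, block) pair and sort blocks by descending score.
scores = reranker.predict([(query, c) for c in candidates])
reranked = [c for _, c in sorted(zip(scores, candidates), key=lambda t: t[0], reverse=True)]
```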
 
```bash
python scripts/build_prompt.py
```
- Constructs prompts from the retrieved code blocks for the code completion generator.
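A minimal sketch of prompt assembly, assuming a simple comment-prefixed layout (the actual prompt template is defined in `build_prompt.py`):

```python
def build_prompt(retrieved_blocks: list[str], unfinished_code: str) -> str:
    """Place retrieved blocks, marked as comments, above the code to complete."""
    context = "\n".join("# Retrieved context:\n" + block for block in retrieved_blocks)
    return f"{context}\n\n{unfinished_code}"

prompt = build_prompt(["def add(a, b):\n    return a + b"], "def add(x, y):")
```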
 
```bash
python scripts/inference.py
```
- Feeds the prompts to the generator model.
- You can replace this step with your own inference code.
  - Input: a JSON file containing an array of prompt strings.
  - Output: a JSON file containing an array of generated completions.
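If you swap in your own inference code, it only needs to honor the JSON contract above. A minimal sketch, assuming a Hugging Face `transformers` pipeline and hypothetical file names and model choice:

```python
import json
from transformers import pipeline

generator = pipeline("text-generation", model="Salesforce/codegen-350M-mono")  # hypothetical model

with open("prompts.json") as f:  # hypothetical filename: an array of prompt strings
    prompts = json.load(f)

completions = [
    generator(p, max_new_tokens=64, return_full_text=False)[0]["generated_text"]
    for p in prompts
]

with open("completions.json", "w") as f:  # hypothetical filename: an array of completions
    json.dump(completions, f, indent=2)
```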
```bash
python scripts/evaluation.py
```
- Evaluates code completion performance using the inference results.
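`evaluation.py` computes the repository's official metrics; as an illustration only, exact match and edit similarity are two metrics commonly reported for code completion, and the file names below are assumptions:

```python
import json
from difflib import SequenceMatcher

def exact_match(pred: str, ref: str) -> bool:
    return pred.strip() == ref.strip()

def edit_similarity(pred: str, ref: str) -> float:
    return SequenceMatcher(None, pred, ref).ratio()

with open("completions.json") as f:  # hypothetical filenames
    preds = json.load(f)
with open("references.json") as f:
    refs = json.load(f)

em = sum(exact_match(p, r) for p, r in zip(preds, refs)) / len(refs)
es = sum(edit_similarity(p, r) for p, r in zip(preds, refs)) / len(refs)
print(f"Exact Match: {em:.3f}  Edit Similarity: {es:.3f}")
```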
 
If you find this work helpful, please consider citing our paper:
```bibtex
@inproceedings{coderag2025,
  title={CodeRAG: Finding Relevant and Necessary Knowledge for Retrieval-Augmented Repository-Level Code Completion},
  author={Sheng Zhang and Yifan Ding and Shuquan Lian and Shun Song and Hui Li},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2025}
}
```

For questions, please open an issue or contact dingyf@stu.xmu.edu.cn.