News • Quick Start • Evaluation • Citation
CodeMMLU is a comprehensive benchmark designed to evaluate the capabilities of large language models (LLMs) in coding and software knowledge. It builds upon the structure of multiple-choice question answering (MCQA) to cover a wide range of programming tasks and domains, including code generation, defect detection, software engineering principles, and much more.
CodeMMLU comprises over 10,000 questions curated from diverse, high-quality sources. It covers a wide spectrum of software knowledge, including general QA, code generation, defect detection, and code repair across various domains and more than 10 programming languages.
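If you want to inspect the raw questions locally, a minimal sketch using the Hugging Face CLI is shown below. The dataset repo id is an assumption based on the project's GitHub organization, not a path stated in this README; adjust it if the dataset is hosted elsewhere.

```bash
# Assumption: the CodeMMLU questions are published as a Hugging Face dataset
# under the project's organization. Replace the repo id if it differs.
pip install -U "huggingface_hub[cli]"
huggingface-cli download Fsoft-AI4Code/CodeMMLU \
    --repo-type dataset \
    --local-dir ./codemmlu_data
```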
Precise and comprehensive: check out our LEADERBOARD for the latest LLM rankings.
Install CodeMMLU and set up dependencies via pip:

```bash
pip install codemmlu
```
Generate responses for the CodeMMLU MCQA benchmark:
```bash
codemmlu --model_name <your_model_name_or_path> \
    --subset <subset> \
    --backend <backend> \
    --output_dir <your_output_dir>
```
Build `codemmlu` from source:

```bash
git clone https://github.com/Fsoft-AI4Code/CodeMMLU.git
cd CodeMMLU
pip install -e .
```
Note: If you prefer the `vllm` backend, we highly recommend installing vLLM from the official project before installing `codemmlu`.
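For example, a minimal sketch of that install order, assuming a standard pip-based vLLM setup (consult the vLLM documentation for platform-specific wheels):

```bash
# Install vLLM first so its torch/CUDA dependencies are resolved before
# codemmlu is installed on top of them (order follows the note above).
pip install vllm

# Then install codemmlu
pip install codemmlu
```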
Generate responses for CodeMMLU questions with the full set of options:
```bash
codemmlu --model_name <your_model_name_or_path> \
    --peft_model <your_peft_model_name_or_path> \
    --subset all \
    --batch_size 16 \
    --backend [vllm|hf] \
    --max_new_tokens 1024 \
    --temperature 0.0 \
    --output_dir <your_output_dir> \
    --instruction_prefix <special_prefix> \
    --assistant_prefix <special_prefix> \
    --cache_dir <your_cache_dir>
```
API Usage:
```text
codemmlu [-h] [-V] [--subset SUBSET] [--batch_size BATCH_SIZE]
         [--instruction_prefix INSTRUCTION_PREFIX]
         [--assistant_prefix ASSISTANT_PREFIX] [--output_dir OUTPUT_DIR]
         [--model_name MODEL_NAME] [--peft_model PEFT_MODEL]
         [--backend BACKEND] [--max_new_tokens MAX_NEW_TOKENS]
         [--temperature TEMPERATURE] [--prompt_mode PROMPT_MODE]
         [--cache_dir CACHE_DIR] [--trust_remote_code]

==================== CodeMMLU ====================

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         Get version
  --subset SUBSET       Select evaluate subset
  --batch_size BATCH_SIZE
  --instruction_prefix INSTRUCTION_PREFIX
  --assistant_prefix ASSISTANT_PREFIX
  --output_dir OUTPUT_DIR
                        Save generation and result path
  --model_name MODEL_NAME
                        Local path or Huggingface Hub link to load model
  --peft_model PEFT_MODEL
                        Lora config
  --backend BACKEND     LLM generation backend (default: hf)
  --max_new_tokens MAX_NEW_TOKENS
                        Number of max new tokens
  --temperature TEMPERATURE
  --prompt_mode PROMPT_MODE
                        Prompt available: zeroshot, fewshot, cot_zs, cot_fs
  --cache_dir CACHE_DIR
                        Cache for save model download checkpoint and dataset
  --trust_remote_code
```
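As a concrete illustration, a zero-shot chain-of-thought run on the `hf` backend might look like the sketch below; the model name and directories are placeholders chosen for the example, not defaults shipped with the tool:

```bash
# Hypothetical invocation: the model id, output dir, and cache dir are
# placeholders; every flag used here appears in the help output above.
codemmlu --model_name meta-llama/Meta-Llama-3-8B-Instruct \
    --subset all \
    --backend hf \
    --prompt_mode cot_zs \
    --max_new_tokens 1024 \
    --temperature 0.0 \
    --output_dir ./outputs \
    --cache_dir ./cache
```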
List of supported backends:

| Backend | DecoderModel | LoRA |
|---|---|---|
| Transformers (`hf`) | ✅ | ✅ |
| vLLM (`vllm`) | ✅ | ✅ |
To evaluate your model and submit your results to the leaderboard, please follow the instructions in data/README.md.
If you find this repository useful, please consider citing our paper:
```bibtex
@article{nguyen2024codemmlu,
  title   = {CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities},
  author  = {Nguyen, Dung Manh and Phan, Thang Chau and Le, Nam Hai and Doan, Thong T. and Nguyen, Nam V. and Pham, Quang and Bui, Nghi D. Q.},
  journal = {arXiv preprint},
  year    = {2024}
}
```