
FSoft-AI4Code/CodeMMLU

CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities

📰 News • 🚀 Quick Start • 📋 Evaluation • 📌 Citation

📌 About

CodeMMLU is a comprehensive benchmark designed to evaluate the capabilities of large language models (LLMs) in coding and software knowledge. It builds upon the structure of multiple-choice question answering (MCQA) to cover a wide range of programming tasks and domains, including code generation, defect detection, software engineering principles, and much more.

Why CodeMMLU?

  • CodeMMLU comprises over 10,000 questions curated from diverse, high-quality sources. It covers a wide spectrum of software knowledge, including general QA, code generation, defect detection, and code repair across various domains and more than 10 programming languages.

  • Precise and comprehensive: Check out our LEADERBOARD for the latest LLM rankings.

🚀 Quick Start

Install CodeMMLU and set up its dependencies via pip:

pip install codemmlu

Generate responses for the CodeMMLU MCQ benchmark:

codemmlu --model_name <your_model_name_or_path> \
    --subset <subset> \
    --backend <backend> \
    --output_dir <your_output_dir>
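
When the command is launched from a script rather than a shell, it can help to assemble it as an argument list first. The following is a minimal sketch, not part of the codemmlu package: the helper `build_codemmlu_command` and the placeholder values `my-org/my-model` and `./outputs` are hypothetical, and only flags shown in this README are used.

```python
import shlex

# Backends named in this README's backend table.
SUPPORTED_BACKENDS = ("hf", "vllm")


def build_codemmlu_command(model_name, subset, backend, output_dir):
    """Return the quick-start codemmlu invocation as an argument list."""
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"unsupported backend: {backend!r}")
    return [
        "codemmlu",
        "--model_name", model_name,
        "--subset", subset,
        "--backend", backend,
        "--output_dir", output_dir,
    ]


# Placeholder values: "my-org/my-model" and "./outputs" are hypothetical.
command = build_codemmlu_command("my-org/my-model", "all", "hf", "./outputs")

# Print the command in a copy-pasteable form instead of running it.
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would execute it, assuming codemmlu is installed and on the `PATH`.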

📋 Evaluation

Build codemmlu from source:

git clone https://github.com/Fsoft-AI4Code/CodeMMLU.git
cd CodeMMLU
pip install -e .

Note

If you prefer the vllm backend, we highly recommend installing vllm from the official project before installing codemmlu.
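
A script that wraps the evaluation can check whether vllm is importable before it selects a backend. This is a minimal sketch under the assumption that falling back to `hf` is acceptable; the helper `pick_backend` is hypothetical and not part of codemmlu.

```python
import importlib.util


def pick_backend(vllm_installed=None):
    """Return "vllm" when the vllm package is importable, else "hf"."""
    if vllm_installed is None:
        # find_spec returns None when no top-level package named vllm exists.
        vllm_installed = importlib.util.find_spec("vllm") is not None
    return "vllm" if vllm_installed else "hf"


# The result can be passed to the --backend flag.
print(pick_backend())
```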

Generate responses to CodeMMLU questions:

codemmlu --model_name <your_model_name_or_path> \
    --peft_model <your_peft_model_name_or_path> \
    --subset all \
    --batch_size 16 \
    --backend [vllm|hf] \
    --max_new_tokens 1024 \
    --temperature 0.0 \
    --output_dir <your_output_dir> \
    --instruction_prefix <special_prefix> \
    --assistant_prefix <special_prefix> \
    --cache_dir <your_cache_dir>

API Usage

codemmlu [-h] [-V] [--subset SUBSET] [--batch_size BATCH_SIZE]
         [--instruction_prefix INSTRUCTION_PREFIX]
         [--assistant_prefix ASSISTANT_PREFIX]
         [--output_dir OUTPUT_DIR]
         [--model_name MODEL_NAME]
         [--peft_model PEFT_MODEL]
         [--backend BACKEND]
         [--max_new_tokens MAX_NEW_TOKENS]
         [--temperature TEMPERATURE]
         [--prompt_mode PROMPT_MODE]
         [--cache_dir CACHE_DIR]
         [--trust_remote_code]

==================== CodeMMLU ====================

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         Get version
  --subset SUBSET       Select evaluate subset
  --batch_size BATCH_SIZE
  --instruction_prefix INSTRUCTION_PREFIX
  --assistant_prefix ASSISTANT_PREFIX
  --output_dir OUTPUT_DIR
                        Save generation and result path
  --model_name MODEL_NAME
                        Local path or Huggingface Hub link to load model
  --peft_model PEFT_MODEL
                        Lora config
  --backend BACKEND     LLM generation backend (default: hf)
  --max_new_tokens MAX_NEW_TOKENS
                        Number of max new tokens
  --temperature TEMPERATURE
  --prompt_mode PROMPT_MODE
                        Prompt available: zeroshot, fewshot, cot_zs, cot_fs
  --cache_dir CACHE_DIR
                        Cache for save model download checkpoint and dataset
  --trust_remote_code
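
The four values accepted by `--prompt_mode` lend themselves to a sweep, with one run per mode. The sketch below only builds the commands and prints them. The helper `sweep_commands`, the per-mode output directory layout, and the placeholder model name are assumptions made for illustration, not part of codemmlu.

```python
import shlex
from pathlib import PurePosixPath

# Prompt modes listed in the --prompt_mode help text.
PROMPT_MODES = ("zeroshot", "fewshot", "cot_zs", "cot_fs")


def sweep_commands(model_name, output_root, subset="all", backend="hf"):
    """Build one codemmlu argument list per prompt mode."""
    commands = []
    for mode in PROMPT_MODES:
        # Hypothetical layout: each mode writes to its own subdirectory.
        output_dir = PurePosixPath(output_root) / mode
        commands.append([
            "codemmlu",
            "--model_name", model_name,
            "--subset", subset,
            "--backend", backend,
            "--prompt_mode", mode,
            "--output_dir", str(output_dir),
        ])
    return commands


# "my-org/my-model" is a placeholder, not a real checkpoint.
for cmd in sweep_commands("my-org/my-model", "outputs"):
    print(shlex.join(cmd))
```

Keeping each mode in a separate output directory avoids one run overwriting the results of another.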

List of supported backends:

Backend              DecoderModel    LoRA
Transformers (hf)    ✅              ✅
vLLM (vllm)          ✅              ✅

Leaderboard

To evaluate your model and submit your results to the leaderboard, please follow the instructions in data/README.md.

📌 Citation

If you find this repository useful, please consider citing our paper:

@article{nguyen2024codemmlu,
  title={CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities},
  author={Nguyen, Dung Manh and Phan, Thang Chau and Le, Nam Hai and Doan, Thong T. and Nguyen, Nam V. and Pham, Quang and Bui, Nghi D. Q.},
  journal={arXiv preprint},
  year={2024}
}

About

[ICLR 2025] 🚀 CodeMMLU Evaluator: A framework for evaluating language models on the CodeMMLU MCQ benchmark.
