Tree-of-Evolution: Tree-Structured Instruction Evolution for Code Generation in Large Language Models
A novel framework for synthesizing high-quality code instruction data through tree-structured evolution
Data synthesis has become a crucial research area in large language models (LLMs), especially for generating high-quality instruction fine-tuning data to enhance downstream performance. In code generation, a key application of LLMs, manual annotation of code instruction data is costly.
Recent methods, such as Code Evol-Instruct and OSS-Instruct, leverage LLMs to synthesize large-scale code instruction data, significantly improving LLM coding capabilities. However, these approaches face limitations due to unidirectional synthesis and randomness-driven generation, which restrict data quality and diversity.
To overcome these challenges, we introduce Tree-of-Evolution (ToE), a novel framework that:
- 🌳 Models code instruction synthesis with tree structures, exploring multiple evolutionary paths to alleviate the constraints of unidirectional generation
- 🎯 Uses optimization-driven evolution, refining each generation step based on the quality of the previous iteration
- 📈 Achieves superior performance: base models fine-tuned on just 75k synthesized samples match state-of-the-art performance
Results: Our method achieves comparable or superior performance to Qwen2.5-Coder-Instruct (trained on millions of samples) across five coding benchmarks: HumanEval, MBPP, EvalPlus, LiveCodeBench, and BigCodeBench.
We provide the following resources for the community:
| Resource | Description | Link |
|---|---|---|
| Dataset | Tree-of-Evol-75k: High-quality synthesized code instruction data | 🤗 HuggingFace |
| Model (1.5B) | Qwen2.5-Coder-1.5B-Base fine-tuned on Tree-of-Evol-75k | 🤗 HuggingFace |
| Model (7B) | Qwen2.5-Coder-7B-Base fine-tuned on Tree-of-Evol-75k | 🤗 HuggingFace |
| Model (14B) | Qwen2.5-Coder-14B-Base fine-tuned on Tree-of-Evol-75k | 🤗 HuggingFace |
```bash
git clone https://github.com/CodeLLM-Research/Tree-of-Evolution.git
cd Tree-of-Evolution
pip install -r requirements.txt
```

This will clone the repository and install all necessary dependencies.
Important: Before running the framework, you need to set up your API keys:

1. Copy the sample environment file:

   ```bash
   cp .env.sample .env
   ```

2. Edit the `.env` file and add your API keys:

   ```bash
   # OpenAI API Key (required for instruction synthesis and complexity scoring)
   OPENAI_API_KEY=your_openai_api_key_here
   # Add other API keys as needed
   ```

3. Keep your `.env` file secure and never commit it to version control.
Note: The framework requires valid API keys to function properly. Without proper configuration, the synthesis and scoring modules will fail.
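A minimal stdlib-only check along these lines can fail fast when a key is missing (the helper name is ours; the repository's scripts may handle this differently):

```python
import os

def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Fail fast with a clear error when a required API key is not set."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; copy .env.sample to .env and fill it in."
        )
    return value
```

Calling `require_api_key()` at the top of a script surfaces a misconfigured environment immediately instead of failing mid-synthesis.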
The Tree-of-Evolution framework generates high-quality code instruction data through tree-structured evolution. Here's how to use the three main components:
This is the core module that generates evolved instructions using the tree-structured approach with complexity and diversity guidance.
Basic Usage:

```bash
PYTHONPATH=. python src/instruction_synthesis.py \
    --input_path data/seed.function.5k.json \
    --output_dir data/round1_synthesis \
    --model_name gpt-4o \
    --num_threads 10 \
    --temperature 1.0 \
    --max_tokens 2048 \
    --opt_evo  # omit this flag in the first round (no previous samples yet)
```

Parameters:

- `--input_path`: Path to the input JSON file (with complexity and diversity scores)
- `--output_dir`: Directory to store synthesis results
- `--model_name`: LLM used for instruction synthesis (default: `gpt-4o`)
- `--num_threads`: Number of parallel threads (default: 4)
- `--temperature`: Temperature for creative synthesis (default: 0.7)
- `--max_tokens`: Maximum tokens in the response (default: 4096)
- `--opt_evo`: Enable optimization-driven evolution. Omit this flag in the first round, since there are no previously generated samples to optimize against.
Input Format:
```json
[
  {
    "id": "1",
    "content": "Write a Python function to calculate factorial...",
    "self complexity score": 7.5,
    "self diversity score": 0.8
  }
]
```

How it works:
- For root samples (ID like "1"): Uses only the content to generate evolved instructions
- For child samples (ID like "1_2"): Uses content + complexity/diversity scores to guide evolution
- Creates tree-structured evolution paths with multiple generations
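The underscore-delimited ID convention can be captured with a few illustrative helpers (function names are ours, not the repository's; we assume only the ID scheme described above):

```python
def node_depth(node_id: str) -> int:
    """Depth in the evolution tree: "1" is a root (depth 0), "1_2" depth 1."""
    return node_id.count("_")

def child_id(parent_id: str, index: int) -> str:
    """ID of the index-th child evolved from parent_id, e.g. "1" -> "1_3"."""
    return f"{parent_id}_{index}"

def root_of(node_id: str) -> str:
    """Seed sample this node descends from, e.g. "1_2_3" -> "1"."""
    return node_id.split("_", 1)[0]
```

For example, `child_id("1_2", 1)` gives `"1_2_1"`, a second-generation descendant of seed `"1"`.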
This supporting module evaluates the complexity of programming questions using LLM-based judgment. It's used to prepare data for instruction synthesis.
Basic Usage:
```bash
PYTHONPATH=. python src/complexity_scoring.py \
    --input_path data/round1_synthesis/all_synthesized_instructions.json \
    --output_dir data/round1_complexity \
    --model_name gpt-4o \
    --num_threads 10 \
    --temperature 0.0 \
    --max_tokens 2048
```

Parameters:

- `--input_path`: Path to the input JSON file containing programming questions
- `--output_dir`: Directory to store complexity scoring results
- `--model_name`: LLM used for complexity evaluation (default: `gpt-4o`)
- `--num_threads`: Number of parallel threads for processing (default: 4)
- `--temperature`: Temperature for the LLM response (default: 0.0)
- `--max_tokens`: Maximum tokens in the response (default: 2048)
Input Format:
```json
[
  {
    "id": "1",
    "content": "Write a Python function to calculate factorial..."
  }
]
```

Output: Individual `.jsonl` files for each question and a summary file with complexity scores (1-10 scale).
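A minimal sketch of collecting per-question `.jsonl` results (one JSON object per line) into a single list; the file glob and field names are assumptions based on the formats shown here, not the repository's exact layout:

```python
import json
from pathlib import Path

def merge_complexity_results(output_dir: str) -> list:
    """Merge all .jsonl result files in output_dir into one list of records."""
    merged = []
    for path in sorted(Path(output_dir).glob("*.jsonl")):
        with path.open() as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    merged.append(json.loads(line))
    return merged
```

This is the kind of aggregation the summary file performs before the scores feed into diversity scoring.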
This supporting module calculates diversity scores by measuring semantic similarity between samples. It complements complexity scoring to guide instruction synthesis.
Basic Usage:
```bash
PYTHONPATH=. python src/diversity_scoring.py \
    --input_path data/round1_complexity/all_questions_w_complexity_scores.json \
    --output_path data/round1_diversity/questions_w_diversity_scores.json \
    --model_name "Alibaba-NLP/gte-large-en-v1.5" \
    --batch_size 10 \
    --device "auto"
```

Parameters:

- `--input_path`: Path to the complexity scoring results JSON file
- `--output_path`: Path to save diversity scoring results
- `--model_name`: Sentence-transformer model for embeddings (default: `Alibaba-NLP/gte-large-en-v1.5`)
- `--batch_size`: Batch size for embedding computation (default: 32)
- `--device`: Device for computation (`auto`/`cuda`/`mps`/`cpu`, default: `auto`)
Note: For Apple Silicon Macs, use `--device cpu` if you encounter segmentation faults with MPS.
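As a rough sketch of how diversity can be computed from embeddings, here is one plausible definition using plain NumPy on precomputed vectors: each sample's diversity is 1 minus its highest cosine similarity to any other sample. The repository embeds questions with the gte model above, and its exact formula may differ.

```python
import numpy as np

def diversity_scores(embeddings: np.ndarray) -> np.ndarray:
    """Diversity per row of an (n, d) embedding matrix: 1 minus the
    highest cosine similarity to any *other* sample."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T          # pairwise cosine similarities
    np.fill_diagonal(sims, -np.inf)   # exclude self-similarity
    return 1.0 - sims.max(axis=1)
```

Under this definition, duplicated samples score near 0 and near-orthogonal samples score near 1, which matches the 0-1 scale used in the pipeline.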
Here's how to run the complete Tree-of-Evolution pipeline for generating high-quality code instruction data:
Round 1: Initial Evolution (from seed data)
```bash
# Step 1: Synthesize evolved instructions from seed data (without --opt_evo for the first round)
PYTHONPATH=. python src/instruction_synthesis.py \
    --input_path data/seed.function.5k.json \
    --output_dir data/round1_synthesis \
    --model_name gpt-4o \
    --num_threads 10 \
    --temperature 1.0 \
    --max_tokens 2048

# Step 2: Score complexity of synthesized instructions
PYTHONPATH=. python src/complexity_scoring.py \
    --input_path data/round1_synthesis/all_synthesized_instructions.json \
    --output_dir data/round1_complexity \
    --model_name gpt-4o \
    --num_threads 10 \
    --temperature 0.0

# Step 3: Calculate diversity scores
PYTHONPATH=. python src/diversity_scoring.py \
    --input_path data/round1_complexity/all_questions_w_complexity_scores.json \
    --output_path data/round1_diversity/questions_w_diversity_scores.json \
    --model_name "Alibaba-NLP/gte-large-en-v1.5" \
    --batch_size 10 \
    --device auto
```

Round 2: Optimization-Driven Evolution (from Round 1 results)
```bash
# Step 4: Synthesize with optimization-driven evolution (now use --opt_evo)
PYTHONPATH=. python src/instruction_synthesis.py \
    --input_path data/round1_diversity/questions_w_diversity_scores.json \
    --output_dir data/round2_synthesis \
    --model_name gpt-4o \
    --num_threads 10 \
    --temperature 1.0 \
    --max_tokens 2048 \
    --opt_evo

# Step 5: Score complexity of Round 2 synthesized instructions
PYTHONPATH=. python src/complexity_scoring.py \
    --input_path data/round2_synthesis/all_synthesized_instructions.json \
    --output_dir data/round2_complexity \
    --model_name gpt-4o \
    --num_threads 10 \
    --temperature 0.0

# Step 6: Calculate diversity scores for Round 2
PYTHONPATH=. python src/diversity_scoring.py \
    --input_path data/round2_complexity/all_questions_w_complexity_scores.json \
    --output_path data/round2_diversity/questions_w_diversity_scores.json \
    --model_name "Alibaba-NLP/gte-large-en-v1.5" \
    --batch_size 10 \
    --device auto
```

Continue for additional rounds as needed...
```
Seed Data (5k samples)
        ↓
[Instruction Synthesis - Round 1]
Evolved Instructions (~15k samples)
        ↓
[Complexity Scoring]
Complexity Scores (1-10 scale)
        ↓
[Diversity Scoring]
Diversity Scores (0-1 scale)
        ↓
[Instruction Synthesis - Round 2 with --opt_evo]
Further Evolved Instructions (~45k samples)
        ↓
[Complexity + Diversity Scoring]
Scored Instructions for next round
        ↓
[Continue rounds...]
Final Dataset (75k+ high-quality samples)
```

For each round, we apply quality thresholds based on complexity and diversity scores to filter out low-quality samples, ensuring only the most challenging and diverse instructions proceed to the next evolution cycle.
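The threshold filtering step might look like the sketch below; the cutoff values and helper name are illustrative (the paper tunes its own thresholds), and the field names follow the input format shown earlier.

```python
def filter_by_quality(samples, min_complexity=6.0, min_diversity=0.5):
    """Keep only samples that clear both quality thresholds before they
    proceed to the next evolution round."""
    return [
        s for s in samples
        if s.get("self complexity score", 0) >= min_complexity
        and s.get("self diversity score", 0) >= min_diversity
    ]
```

Samples failing either threshold are dropped, so each round's tree only expands from its strongest nodes.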
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
If you find this work useful, please cite our paper:
```bibtex
@inproceedings{luo-etal-2025-tree,
    title = "Tree-of-Evolution: Tree-Structured Instruction Evolution for Code Generation in Large Language Models",
    author = "Luo, Ziyang and Li, Kaixin and Lin, Hongzhan and Tian, Yuchen and Kankanhalli, Mohan and Ma, Jing",
    editor = "Che, Wanxiang and Nabende, Joyce and Shutova, Ekaterina and Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.14/",
    pages = "297--316",
    ISBN = "979-8-89176-251-0"
}
```

