Tree-of-Evolution: Tree-Structured Instruction Evolution for Code Generation in Large Language Models
A novel framework for synthesizing high-quality code instruction data through tree-structured evolution
Data synthesis has become a crucial research area in large language models (LLMs), especially for generating high-quality instruction fine-tuning data to enhance downstream performance. In code generation, a key application of LLMs, manual annotation of code instruction data is costly.
Recent methods, such as Code Evol-Instruct and OSS-Instruct, leverage LLMs to synthesize large-scale code instruction data, significantly improving LLM coding capabilities. However, these approaches face limitations due to unidirectional synthesis and randomness-driven generation, which restrict data quality and diversity.
To overcome these challenges, we introduce Tree-of-Evolution (ToE), a novel framework that:
- 🌳 Models code instruction synthesis with tree structures, exploring multiple evolutionary paths to alleviate the constraints of unidirectional generation
- 🎯 Uses optimization-driven evolution, refining each generation step based on the quality of the previous iteration
- 📈 Achieves superior performance: base models fine-tuned on just 75k synthesized samples match state-of-the-art performance
Results: Our method achieves comparable or superior performance to Qwen2.5-Coder-Instruct (trained on millions of samples) across five coding benchmarks: HumanEval, MBPP, EvalPlus, LiveCodeBench, and BigCodeBench.
We provide the following resources for the community:
| Resource | Description | Link |
|---|---|---|
| Dataset | Tree-of-Evol-75k: High-quality synthesized code instruction data | 🤗 HuggingFace |
| Model (1.5B) | Qwen2.5-Coder-1.5B-Base fine-tuned on Tree-of-Evol-75k | 🤗 HuggingFace |
| Model (7B) | Qwen2.5-Coder-7B-Base fine-tuned on Tree-of-Evol-75k | 🤗 HuggingFace |
| Model (14B) | Qwen2.5-Coder-14B-Base fine-tuned on Tree-of-Evol-75k | 🤗 HuggingFace |
```bash
git clone https://github.com/CodeLLM-Research/Tree-of-Evolution.git
cd Tree-of-Evolution
pip install -r requirements.txt
```

This will clone the repository and install all necessary dependencies.
Important: Before running the framework, you need to set up your API keys:

1. Copy the sample environment file:

   ```bash
   cp .env.sample .env
   ```

2. Edit the `.env` file and add your API keys:

   ```bash
   # OpenAI API Key (required for instruction synthesis and complexity scoring)
   OPENAI_API_KEY=your_openai_api_key_here
   # Add other API keys as needed
   ```

3. Keep your `.env` file secure and never commit it to version control.
Note: The framework requires valid API keys to function properly. Without proper configuration, the synthesis and scoring modules will fail.
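A minimal stdlib-only check along these lines can fail fast when a key is missing (the helper name is ours; the repository's scripts may handle this differently):

```python
import os

def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Fail fast with a clear error when a required API key is not set."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; copy .env.sample to .env and fill it in."
        )
    return value
```

Calling `require_api_key()` at the top of a script surfaces a misconfigured environment immediately instead of failing mid-synthesis.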
The Tree-of-Evolution framework generates high-quality code instruction data through tree-structured evolution. Here's how to use the three main components:
This is the core module that generates evolved instructions using the tree-structured approach with complexity and diversity guidance.
Basic Usage:

```bash
PYTHONPATH=. python src/instruction_synthesis.py \
    --input_path data/seed.function.5k.json \
    --output_dir data/round1_synthesis \
    --model_name gpt-4o \
    --num_threads 10 \
    --temperature 1.0 \
    --max_tokens 2048 \
    --opt_evo  # omit this flag in the first round (no previous samples yet)
```

Parameters:

- `--input_path`: Path to the input JSON file (with complexity and diversity scores)
- `--output_dir`: Directory to store synthesis results
- `--model_name`: LLM used for instruction synthesis (default: `gpt-4o`)
- `--num_threads`: Number of parallel threads (default: 4)
- `--temperature`: Temperature for creative synthesis (default: 0.7)
- `--max_tokens`: Maximum tokens in the response (default: 4096)
- `--opt_evo`: Enable optimization-driven evolution. Omit this flag in the first round, since there are no previously generated samples to optimize against.
Input Format:
```json
[
  {
    "id": "1",
    "content": "Write a Python function to calculate factorial...",
    "self complexity score": 7.5,
    "self diversity score": 0.8
  }
]
```

How it works:
- For root samples (ID like "1"): Uses only the content to generate evolved instructions
- For child samples (ID like "1_2"): Uses content + complexity/diversity scores to guide evolution
- Creates tree-structured evolution paths with multiple generations
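The underscore-delimited ID convention can be captured with a few illustrative helpers (function names are ours, not the repository's; we assume only the ID scheme described above):

```python
def node_depth(node_id: str) -> int:
    """Depth in the evolution tree: "1" is a root (depth 0), "1_2" depth 1."""
    return node_id.count("_")

def child_id(parent_id: str, index: int) -> str:
    """ID of the index-th child evolved from parent_id, e.g. "1" -> "1_3"."""
    return f"{parent_id}_{index}"

def root_of(node_id: str) -> str:
    """Seed sample this node descends from, e.g. "1_2_3" -> "1"."""
    return node_id.split("_", 1)[0]
```

For example, `child_id("1_2", 1)` gives `"1_2_1"`, a second-generation descendant of seed `"1"`.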
This supporting module evaluates the complexity of programming questions using LLM-based judgment. It's used to prepare data for instruction synthesis.
Basic Usage:
```bash
PYTHONPATH=. python src/complexity_scoring.py \
    --input_path data/round1_synthesis/all_synthesized_instructions.json \
    --output_dir data/round1_complexity \
    --model_name gpt-4o \
    --num_threads 10 \
    --temperature 0.0 \
    --max_tokens 2048
```

Parameters:

- `--input_path`: Path to the input JSON file containing programming questions
- `--output_dir`: Directory to store complexity scoring results
- `--model_name`: LLM used for complexity evaluation (default: `gpt-4o`)
- `--num_threads`: Number of parallel threads for processing (default: 4)
- `--temperature`: Temperature for the LLM response (default: 0.0)
- `--max_tokens`: Maximum tokens in the response (default: 2048)
Input Format:
```json
[
  {
    "id": "1",
    "content": "Write a Python function to calculate factorial..."
  }
]
```

Output: Individual `.jsonl` files for each question and a summary file with complexity scores (1-10 scale).
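A minimal sketch of collecting per-question `.jsonl` results (one JSON object per line) into a single list; the file glob and field names are assumptions based on the formats shown here, not the repository's exact layout:

```python
import json
from pathlib import Path

def merge_complexity_results(output_dir: str) -> list:
    """Merge all .jsonl result files in output_dir into one list of records."""
    merged = []
    for path in sorted(Path(output_dir).glob("*.jsonl")):
        with path.open() as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    merged.append(json.loads(line))
    return merged
```

This is the kind of aggregation the summary file performs before the scores feed into diversity scoring.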
This supporting module calculates diversity scores by measuring semantic similarity between samples. It complements complexity scoring to guide instruction synthesis.
Basic Usage:
```bash
PYTHONPATH=. python src/diversity_scoring.py \
    --input_path data/round1_complexity/all_questions_w_complexity_scores.json \
    --output_path data/round1_diversity/questions_w_diversity_scores.json \
    --model_name "Alibaba-NLP/gte-large-en-v1.5" \
    --batch_size 10 \
    --device "auto"
```

Parameters:

- `--input_path`: Path to the complexity scoring results JSON file
- `--output_path`: Path to save diversity scoring results
- `--model_name`: Sentence-transformer model for embeddings (default: `Alibaba-NLP/gte-large-en-v1.5`)
- `--batch_size`: Batch size for embedding computation (default: 32)
- `--device`: Device for computation (`auto`/`cuda`/`mps`/`cpu`, default: `auto`)
Note: For Apple Silicon Macs, use `--device cpu` if you encounter segmentation faults with MPS.
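As a rough sketch of how diversity can be computed from embeddings, here is one plausible definition using plain NumPy on precomputed vectors: each sample's diversity is 1 minus its highest cosine similarity to any other sample. The repository embeds questions with the gte model above, and its exact formula may differ.

```python
import numpy as np

def diversity_scores(embeddings: np.ndarray) -> np.ndarray:
    """Diversity per row of an (n, d) embedding matrix: 1 minus the
    highest cosine similarity to any *other* sample."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T          # pairwise cosine similarities
    np.fill_diagonal(sims, -np.inf)   # exclude self-similarity
    return 1.0 - sims.max(axis=1)
```

Under this definition, duplicated samples score near 0 and near-orthogonal samples score near 1, which matches the 0-1 scale used in the pipeline.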
Here's how to run the complete Tree-of-Evolution pipeline for generating high-quality code instruction data:
Round 1: Initial Evolution (from seed data)
```bash
# Step 1: Synthesize evolved instructions from seed data (without --opt_evo for the first round)
PYTHONPATH=. python src/instruction_synthesis.py \
    --input_path data/seed.function.5k.json \
    --output_dir data/round1_synthesis \
    --model_name gpt-4o \
    --num_threads 10 \
    --temperature 1.0 \
    --max_tokens 2048

# Step 2: Score complexity of synthesized instructions
PYTHONPATH=. python src/complexity_scoring.py \
    --input_path data/round1_synthesis/all_synthesized_instructions.json \
    --output_dir data/round1_complexity \
    --model_name gpt-4o \
    --num_threads 10 \
    --temperature 0.0

# Step 3: Calculate diversity scores
PYTHONPATH=. python src/diversity_scoring.py \
    --input_path data/round1_complexity/all_questions_w_complexity_scores.json \
    --output_path data/round1_diversity/questions_w_diversity_scores.json \
    --model_name "Alibaba-NLP/gte-large-en-v1.5" \
    --batch_size 10 \
    --device auto
```

Round 2: Optimization-Driven Evolution (from Round 1 results)
```bash
# Step 4: Synthesize with optimization-driven evolution (now use --opt_evo)
PYTHONPATH=. python src/instruction_synthesis.py \
    --input_path data/round1_diversity/questions_w_diversity_scores.json \
    --output_dir data/round2_synthesis \
    --model_name gpt-4o \
    --num_threads 10 \
    --temperature 1.0 \
    --max_tokens 2048 \
    --opt_evo

# Step 5: Score complexity of Round 2 synthesized instructions
PYTHONPATH=. python src/complexity_scoring.py \
    --input_path data/round2_synthesis/all_synthesized_instructions.json \
    --output_dir data/round2_complexity \
    --model_name gpt-4o \
    --num_threads 10 \
    --temperature 0.0

# Step 6: Calculate diversity scores for Round 2
PYTHONPATH=. python src/diversity_scoring.py \
    --input_path data/round2_complexity/all_questions_w_complexity_scores.json \
    --output_path data/round2_diversity/questions_w_diversity_scores.json \
    --model_name "Alibaba-NLP/gte-large-en-v1.5" \
    --batch_size 10 \
    --device auto
```

Continue for additional rounds as needed...
```
Seed Data (5k samples)
        ↓
[Instruction Synthesis - Round 1]
Evolved Instructions (~15k samples)
        ↓
[Complexity Scoring]
Complexity Scores (1-10 scale)
        ↓
[Diversity Scoring]
Diversity Scores (0-1 scale)
        ↓
[Instruction Synthesis - Round 2 with --opt_evo]
Further Evolved Instructions (~45k samples)
        ↓
[Complexity + Diversity Scoring]
Scored Instructions for next round
        ↓
[Continue rounds...]
Final Dataset (75k+ high-quality samples)
```

For each round, we apply quality thresholds based on complexity and diversity scores to filter out low-quality samples, ensuring only the most challenging and diverse instructions proceed to the next evolution cycle.
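The threshold filtering step might look like the sketch below; the cutoff values and helper name are illustrative (the paper tunes its own thresholds), and the field names follow the input format shown earlier.

```python
def filter_by_quality(samples, min_complexity=6.0, min_diversity=0.5):
    """Keep only samples that clear both quality thresholds before they
    proceed to the next evolution round."""
    return [
        s for s in samples
        if s.get("self complexity score", 0) >= min_complexity
        and s.get("self diversity score", 0) >= min_diversity
    ]
```

Samples failing either threshold are dropped, so each round's tree only expands from its strongest nodes.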
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
If you find this work useful, please cite our paper:
```bibtex
@inproceedings{luo-etal-2025-tree,
    title = "Tree-of-Evolution: Tree-Structured Instruction Evolution for Code Generation in Large Language Models",
    author = "Luo, Ziyang and Li, Kaixin and Lin, Hongzhan and Tian, Yuchen and Kankanhalli, Mohan and Ma, Jing",
    editor = "Che, Wanxiang and Nabende, Joyce and Shutova, Ekaterina and Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.14/",
    pages = "297--316",
    ISBN = "979-8-89176-251-0"
}
```

