Cooperative Multi-LLM Reinforcement Learning (CoMLRL) is an open-source library for training multiple LLMs to collaborate using Multi-Agent Reinforcement Learning (MARL). It provides implementations of several MARL algorithms for LLM collaboration, together with environments and benchmarks for training and evaluation.
Install CoMLRL from PyPI:

```bash
pip install comlrl  # install PyTorch compatible with your device
```

or from conda-forge:

```bash
conda install -c conda-forge comlrl  # install PyTorch compatible with your device
```

To access the latest features, you can install CoMLRL from source:
```bash
git clone https://github.com/OpenMLRL/CoMLRL.git
cd CoMLRL
pip install -e .  # install PyTorch compatible with your device
```
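After installing, a quick sanity check confirms that CoMLRL and PyTorch import cleanly (whether the package exposes `__version__` is an assumption, so it is read defensively):

```python
# Sanity check: confirm CoMLRL and PyTorch import and report GPU availability.
import torch
import comlrl

print("comlrl:", getattr(comlrl, "__version__", "unknown"))  # __version__ is assumed, not guaranteed
print("cuda available:", torch.cuda.is_available())
```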
MARL trainers to optimize LLM collaboration:
- Multi-Agent REINFORCE: Critic-free policy gradient methods, including MAREINFORCE, MAGRPO, MARLOO, and MAREMAX.
  - Aligned individual response joint with `joint_mode='align'` (see the sketch after this list).
  - Memory-efficient cross joint with `joint_mode='cross'`.
- Multi-Agent Actor-Critic: Critic-based policy gradient methods, including IAC and MAAC.
  - Independent actor-critic (separate critic or value head over the LLM backbone).
  - Centralized critic over joint prompts with separate actors.
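A minimal sketch of selecting the joint mode; treating `joint_mode` as a `MAGRPOConfig` field is an assumption, so check the library documentation for its exact placement:

```python
# Hedged sketch: choosing how per-agent responses are joined for the group update.
# Assumption: joint_mode is accepted by MAGRPOConfig; the released API may place it elsewhere.
from comlrl.trainers.magrpo import MAGRPOConfig

aligned_cfg = MAGRPOConfig(joint_mode="align")  # pair each agent's i-th response with the others' i-th responses
cross_cfg = MAGRPOConfig(joint_mode="cross")    # join responses across combinations (memory-efficient variant)
```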
Environments that simulate real-world tasks for training and evaluating LLM collaboration:
- Writing Collaboration: Multiple LLM agents collaborate on processing articles.
- Code Generation: Generate code solutions for programming problems.
  - MBPP: Mostly Basic Python Problems.
  - HumanEval: handwritten evaluation problems.
  - CoopHumanEval: a variant of HumanEval with a cooperative structure.
- Code Completion: Complete code snippets based on given contexts.
  - ClassEval: complete class-level code based on method stubs and docstrings.
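For reference, these benchmarks are also available on the Hugging Face Hub; the sketch below loads them directly. The Hub IDs are the standard public ones and an assumption about what CoMLRL wraps internally:

```python
# Load the underlying benchmark datasets from the Hugging Face Hub.
# These IDs point at the public datasets, not necessarily CoMLRL's internal loaders.
from datasets import load_dataset

mbpp = load_dataset("mbpp", split="test")                        # Mostly Basic Python Problems
humaneval = load_dataset("openai_humaneval", split="test")       # handwritten evaluation problems
classeval = load_dataset("FudanSELab/ClassEval", split="test")   # class-level code completion
```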
Quick start: train two Qwen2.5 agents with MAGRPO to summarize Reddit posts:
```python
from datasets import load_dataset
from transformers import AutoTokenizer

from comlrl.trainers.magrpo import MAGRPOConfig, MAGRPOTrainer

# Load dataset and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
dataset = load_dataset("trl-lib/tldr", split="train").select(range(128))

# Initialize trainer and start training
trainer = MAGRPOTrainer(
    model="Qwen/Qwen2.5-0.5B",
    num_agents=2,
    tokenizer=tokenizer,
    train_dataset=dataset,
    reward_func=lambda a, b: [abs(max(len(b[0]), 1) / max(len(a[0]), 1) - 3.0)],
    formatters=[lambda example: example["prompt"]] * 2,
    args=MAGRPOConfig(
        per_device_train_batch_size=1,
    ),
)
trainer.train()
```

We welcome contributions from the community! Please see the contributing guidelines for setting up a development environment and submitting contributions.
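The toy `reward_func` above scores only the first response of each agent; a custom reward can score every pair. A hedged sketch, assuming the trainer passes one list of responses per agent and expects one scalar per joint response:

```python
# Hedged sketch of a custom reward, not the library's reference implementation.
# Assumed signature: one list of responses per agent in, one reward per joint response out.
def compression_reward(agent1_responses, agent2_responses):
    rewards = []
    for draft, summary in zip(agent1_responses, agent2_responses):
        ratio = max(len(draft), 1) / max(len(summary), 1)
        rewards.append(-abs(ratio - 3.0))  # highest when the summary is ~1/3 the draft length
    return rewards
```

Under these assumptions, it would be passed as `reward_func=compression_reward` in place of the lambda.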
Thanks to the generous help of our contributors:
Shuo Liu | Tianle Chen | Ryan Amiri | Zeyu Liang
Please cite our paper if you find this library useful in your research:
```bibtex
@misc{liu2025comlrl,
  title={LLM Collaboration With Multi-Agent Reinforcement Learning},
  author={Shuo Liu and Tianle Chen and Zeyu Liang and Xueguang Lyu and Christopher Amato},
  year={2025},
  eprint={2508.04652},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2508.04652},
}
```
