Welcome to CoMLRL's documentation 👋

Cooperative Multi-LLM Reinforcement Learning (CoMLRL) is an open-source library for training multiple LLMs to collaborate using Multi-Agent Reinforcement Learning (MARL). It provides implementations of various MARL algorithms for LLM collaboration and supports different environments and benchmarks.

About

“What are the differences between CoMLRL and other multi-LLM training frameworks?”

Many existing works implement multi-agent training with role-conditioned, parameter-sharing approaches built on single-agent RL frameworks. CoMLRL instead trains truly distinct agents, which better models settings where heterogeneous LLMs possess fundamentally different capabilities and aligns more closely with the subject of study: “multi-agent”.

Compared with other multi-LLM training frameworks, CoMLRL allows agents to be trained either centrally or in a decentralized manner, while their execution is always fully decentralized to ensure efficient inference. CoMLRL implements standard MARL algorithms from scratch to maximize flexibility and customizability while remaining simple to use.

“Does CoMLRL support single-agent fine-tuning?”

Yes! The simplest way is to set num_agents=1 in your trainer. However, since we omit fancy optimizations to keep multi-agent training simple, you may not find the single-agent trainers optimal. LLaMA-Factory, trl, OpenRLHF, and verl are good choices for single-agent fine-tuning.
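
For illustration, a minimal single-agent configuration might look like the sketch below. Only num_agents comes from the answer above; the import path, the remaining constructor arguments, and the reward signature are assumptions made for this sketch, not CoMLRL's verified API.

    # Hypothetical sketch: the import path and all arguments except
    # num_agents are illustrative assumptions, not CoMLRL's verified API.
    from comlrl.trainers import MAGRPOTrainer  # assumed module path

    def brevity_reward(prompt, responses):
        # Placeholder reward; with num_agents=1, responses holds one response.
        return float(len(responses[0].split()) <= 100)

    trainer = MAGRPOTrainer(
        model_name="Qwen/Qwen2.5-0.5B-Instruct",  # assumed argument name
        num_agents=1,                             # single-agent special case
        reward_fn=brevity_reward,                 # assumed argument name
    )
    trainer.train()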

“Does CoMLRL support self-play/self-improvement/self-evolving by MARL?”

Yes! Although we focus on LLM collaboration formalized as a Dec-POMDP, users can still customize the interactions with the environment to implement pipelines such as self-play (Spiral) and self-improvement (MAFT). Users can refer to our multi-turn training for more details.
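
As a purely illustrative example (not CoMLRL's environment interface), the sketch below shows the kind of customized interaction such a pipeline involves: a single policy drafts an answer and then revises it based on its own previous output, i.e. a minimal self-improvement loop. Here generate is a toy stand-in for an LLM call.

    # Purely illustrative sketch, not CoMLRL's environment interface:
    # a two-turn self-improvement loop where a policy revises its own draft.
    def generate(prompt: str) -> str:
        return f"response to: {prompt[:40]}"  # toy stand-in for an LLM call

    def self_improve(task: str) -> str:
        draft = generate(task)                                    # turn 1: draft
        revise_prompt = f"{task}\n\nDraft:\n{draft}\n\nRevise the draft."
        return generate(revise_prompt)                            # turn 2: revision

    print(self_improve("Write an introduction from this abstract ..."))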

“Does CoMLRL support distributed training?”

Not yet. We are currently focusing on CTDE with lightweight training of small-scale LLMs using cooperative MARL as a proof of concept. Resource-intensive distributed training, which involves slow and complex gradient accumulation, is under development and will be open-sourced in the near future.

“What is LLM collaboration?”

LLM collaboration refers to problems in which LLM agents cooperatively solve tasks in a multi-agent system. The task is specified in language and provided to each agent as a prompt, and each agent synchronously generates a response based on its instructions. The set of all agents’ responses jointly forms a solution. Users and systems may validate the solution and provide additional requirements or suggestions to the LLMs. These components form part of the environment for LLM collaboration, whose state may be updated based on the agents’ outputs. The updates are embedded into the prompts for subsequent turns, and the process iterates until the task is completed or a turn limit is reached.
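
The loop below is a schematic of this process. EchoAgent, combine, and validate are toy stand-ins for real LLM inference, joint-solution construction, and solution checking; the sketch illustrates the turn structure only, not any CoMLRL interface.

    # Schematic of the collaboration loop described above, with toy stand-ins
    # for LLM inference (EchoAgent), joint solutions (combine), and checking (validate).
    from typing import List, Tuple

    class EchoAgent:
        """Toy agent: a real agent would query an LLM with the prompt."""
        def __init__(self, name: str):
            self.name = name

        def generate(self, prompt: str) -> str:
            return f"[{self.name}] partial answer for: {prompt[:40]}"

    def combine(responses: List[str]) -> str:
        # The joint solution is formed from all agents' responses.
        return "\n".join(responses)

    def validate(solution: str) -> Tuple[bool, str]:
        # The environment/user checks the solution and returns feedback.
        done = "final" in solution
        return done, "Please refine and mark the answer as final."

    def collaborate(task: str, agents: List[EchoAgent], max_turns: int = 3) -> str:
        prompts = {a.name: task for a in agents}
        solution = ""
        for _ in range(max_turns):
            # Each agent responds synchronously to its own prompt (decentralized execution).
            responses = [a.generate(prompts[a.name]) for a in agents]
            solution = combine(responses)
            done, feedback = validate(solution)
            if done:
                break
            # Feedback is embedded into the prompts for the next turn.
            prompts = {a.name: f"{task}\n\nFeedback: {feedback}" for a in agents}
        return solution

    print(collaborate("Summarize this Reddit post ...", [EchoAgent("A"), EchoAgent("B")]))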

“Why should we fine-tune multi-LLM systems with MARL?”

Many studies have explored LLM-based multi-agent systems that complete tasks with multiple interacting agents. However, the underlying models are typically pretrained separately and are not explicitly optimized for coordination, which limits their performance. In addition, designing effective prompts remains difficult and poorly understood. Cooperative MARL methods, which optimize a team of agents toward a shared objective, have been studied extensively for years. They naturally fit LLM collaboration and motivate us to bring advances from the well-established MARL community to LLM-based MAS.

“What are the benefits of decentralized reasoning?”

Cooperative MARL methods are grounded in the theory of Dec-POMDPs. The agents execute in a decentralized manner, which has many advantages. Unlike knowledge distillation, pruning, or quantization, decentralized execution accelerates LLM inference without incurring information loss. Moreover, decentralization reduces the computational and memory burden of maintaining long-context dependencies and conducting joint decision-making within a single model. By assigning specific subtasks to individual agents, the system achieves more modular, efficient, and lightweight reasoning. In addition, effective cooperation among small local language models can offer a safe and cost-efficient solution for offline and edge intelligence.

Features

  • MARL trainers to optimize LLM collaboration:
    • Multi-Agent REINFORCE: Critic-free policy gradient methods, including MAREINFORCE, MAGRPO, MARLOO, and MAREMAX (see the usage sketch after this list).
      • Aligned joints of individual responses with joint_mode='align'.
      • Memory-efficient cross joints with joint_mode='cross'.
    • Multi-Agent Actor-Critic: Critic-based policy gradient methods, including IAC and MAAC.
      • Independent actor-critic (a separate critic or a value head on the LLM backbone).
      • Centralized critic over joint prompts with separate actors.
  • Environments that simulate real-world tasks for training and evaluating LLM collaboration:
    • Writing Collaboration: Multiple LLM agents collaborate on processing articles.
      • TLDR - Summarizing Reddit posts.
      • ArXiv - Expanding abstracts into introductions.
    • Code Generation: Generate code solutions for programming problems.
    • Code Completion: Complete code snippets based on given contexts.
      • ClassEval - Complete class-level code based on method stubs and docstrings.
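
To make the trainer options above concrete, here is a hypothetical two-agent sketch on a TLDR-style task. Only num_agents and joint_mode come from the feature list above; the import path, the remaining constructor arguments, and the reward signature are assumptions of this sketch rather than CoMLRL's verified API.

    # Hypothetical usage sketch, not CoMLRL's verified API: a two-agent
    # MAGRPO setup using aligned joints of individual responses.
    from comlrl.trainers import MAGRPOTrainer  # assumed module path

    def tldr_reward(prompt, responses):
        # Placeholder shared reward for a summarization-style task:
        # favor short joint outputs produced by the agents together.
        joint = " ".join(responses)
        return max(0.0, 1.0 - len(joint.split()) / 200.0)

    trainer = MAGRPOTrainer(
        model_name="Qwen/Qwen2.5-0.5B-Instruct",  # small backbone per agent
        num_agents=2,                             # two collaborating agents
        joint_mode='align',                       # aligned joints of responses
        reward_fn=tldr_reward,                    # shared cooperative reward
    )
    trainer.train()

Switching joint_mode to 'cross' would select the memory-efficient cross-joint construction listed above instead.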