[docs] Documentation for POCA and cooperative behaviors #5056
Changes from 250 commits
@@ -0,0 +1,26 @@
behaviors:
  PushBlock:
    trainer_type: poca
    hyperparameters:
      batch_size: 1024
      buffer_size: 10240
      learning_rate: 0.0003
      beta: 0.01
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: constant
    network_settings:
      normalize: false
      hidden_units: 256
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    keep_checkpoints: 5
    max_steps: 20000000
    time_horizon: 64
    summary_freq: 60000
    threaded: true
@@ -29,7 +29,7 @@
- [Rewards Summary & Best Practices](#rewards-summary--best-practices)
- [Agent Properties](#agent-properties)
- [Destroying an Agent](#destroying-an-agent)
- [Defining Teams for Multi-agent Scenarios](#defining-teams-for-multi-agent-scenarios)
- [Defining Multi-agent Scenarios](#defining-multi-agent-scenarios)
- [Recording Demonstrations](#recording-demonstrations)

An agent is an entity that can observe its environment, decide on the best

@@ -537,7 +537,7 @@ the padded observations. Note that attention layers are invariant to
the order of the entities, so there is no need to properly "order" the
entities before feeding them into the `BufferSensor`.

The the `BufferSensorComponent` Editor inspector have two arguments:
The `BufferSensorComponent` Editor inspector has two arguments:
- `Observation Size` : This is how many floats each entity will be
  represented with. This number is fixed and all entities must
  have the same representation. For example, if the entities you want to
| | @@ -900,7 +900,9 @@ is always at least one Agent training at all times by either spawning a new | |||||||
| Agent every time one is destroyed or by re-spawning new Agents when the whole | ||||||||
| environment resets. | ||||||||
| | ||||||||
| ## Defining Teams for Multi-agent Scenarios | ||||||||
| ## Defining Multi-agent Scenarios | ||||||||
| | ||||||||
| ### Teams for Adversarial Scenarios | ||||||||
| | ||||||||
| Self-play is triggered by including the self-play hyperparameter hierarchy in | ||||||||
| the [trainer configuration](Training-ML-Agents.md#training-configurations). To | ||||||||
| | @@ -927,6 +929,92 @@ provide examples of symmetric games. To train an asymmetric game, specify | |||||||
| trainer configurations for each of your behavior names and include the self-play | ||||||||
| hyperparameter hierarchy in both. | ||||||||
| | ||||||||
### Groups for Cooperative Scenarios

Cooperative behavior in ML-Agents can be enabled by instantiating a `SimpleMultiAgentGroup`,
typically in an environment controller or similar script, and adding agents to it
using the `RegisterAgent(Agent agent)` method. Using `MultiAgentGroup` enables the
agents within a group to learn how to work together to achieve a common goal (i.e.,
maximize a group-given reward), even if one or more of the group members are removed
before the episode ends. You can then use this group to add or set rewards and to end
or interrupt episodes at the group level using the `AddGroupReward()`, `SetGroupReward()`,
`EndGroupEpisode()`, and `GroupEpisodeInterrupted()` methods. For example:

```csharp
// Requires the Unity.MLAgents namespace (using Unity.MLAgents;)

// Create a Multi Agent Group in Start() or Initialize()
m_AgentGroup = new SimpleMultiAgentGroup();

// Register agents in group at the beginning of an episode
foreach (var agent in AgentList)
{
    m_AgentGroup.RegisterAgent(agent);
}

// If the team scores a goal
m_AgentGroup.AddGroupReward(score);

// If the goal is reached and the episode is over
m_AgentGroup.EndGroupEpisode();
ResetScene();

// If time ran out and we need to interrupt the episode
m_AgentGroup.GroupEpisodeInterrupted();
ResetScene();
```
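
Putting these calls together, a minimal environment-controller sketch might look like the following. This is only an illustration of one possible layout: the class name `GroupEnvController`, the `AgentList` and `MaxEnvironmentSteps` fields, and the `OnGoalScored()`, `OnTaskComplete()`, and `ResetScene()` methods are hypothetical names, not part of the ML-Agents API.

```csharp
using System.Collections.Generic;
using Unity.MLAgents;
using UnityEngine;

public class GroupEnvController : MonoBehaviour
{
    // Agents that belong to this environment instance, assigned in the Inspector.
    public List<Agent> AgentList;

    // Group-level episode length; each Agent's own Max Step stays at 0.
    public int MaxEnvironmentSteps = 5000;

    SimpleMultiAgentGroup m_AgentGroup;
    int m_StepCount;

    void Start()
    {
        m_AgentGroup = new SimpleMultiAgentGroup();
        foreach (var agent in AgentList)
        {
            m_AgentGroup.RegisterAgent(agent);
        }
    }

    void FixedUpdate()
    {
        m_StepCount += 1;
        if (m_StepCount >= MaxEnvironmentSteps)
        {
            // Time ran out: interrupt the episode for the whole group, then reset.
            m_AgentGroup.GroupEpisodeInterrupted();
            ResetScene();
        }
    }

    // Called by scene logic when the team scores.
    public void OnGoalScored(float score)
    {
        m_AgentGroup.AddGroupReward(score);
    }

    // Called when the task is fully solved and the episode should end.
    public void OnTaskComplete()
    {
        m_AgentGroup.EndGroupEpisode();
        ResetScene();
    }

    void ResetScene()
    {
        m_StepCount = 0;
        // Reset block and agent positions here; Agents stay registered for the next episode.
    }
}
```

The only ML-Agents-specific pieces in this sketch are `SimpleMultiAgentGroup` and its `RegisterAgent`, `AddGroupReward`, `EndGroupEpisode`, and `GroupEpisodeInterrupted` methods; everything else is ordinary Unity scene logic.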

Multi Agent Groups are best used with the MA-POCA trainer, which is explicitly designed to train

Suggested change:
Multi Agent Groups are best used with the MA-POCA trainer, which is explicitly designed to train
Multi Agent Groups can only be trained with the MA-POCA trainer, which is explicitly designed to train

Hmm, this isn't exactly true - Multi Agent Groups will run and try to train with PPO but their behaviors won't be very collaborative. I changed it to the stronger-but-not-as-hard "should be trained with".
Suggested change:
cooperative environments. This can be enabled by using the `coma` trainer - see the
cooperative environments. This can be enabled by using the `poca` trainer - see the

Maybe a little image will help?

Added a small diagram of the difference
For an example of how to set up cooperative environments, see the
[Cooperative PushBlock](Learning-Environment-Examples.md#cooperative-push-block) and
[Dungeon Escape](Learning-Environment-Examples.md#dungeon-escape) example environments.

Remove until the environments are actually merged.

Removed
There is some inconsistency in the page between agent and Agent (capitalization)

This is the first time this is mentioned (I think). This section is a summary, so it should be called out earlier.
Suggested change:
* Agents within groups should always set the `Max Steps` parameter the Agent script to 0, meaning
  they will never reach a max step. Instead, handle Max Steps with MultiAgentGroup by ending the episode for the entire
* Agents within groups should always set the `Max Steps` parameter in the Agent script to 0. Instead, handle Max Steps with MultiAgentGroup by ending the episode for the entire
I think this is not specific to GroupTraining and should be called out in a more general documentation page.

I guess we never explicitly called this out since we handle all the max_step stuff for single agent so users don't need to know about this. The only place that used Agent.EpisodeInterrupted is in Match3 where different agents will make moves at different frequencies.

Maybe we can add a section about manually requesting decision and manually handling max_step_reached (in separate PR)?
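
As a small illustration of the suggestion above, the per-agent and group-level timeout handling could look like the sketch below. It reuses the hypothetical `AgentList`, `m_AgentGroup`, `m_StepCount`, `MaxEnvironmentSteps`, and `ResetScene()` names from the controller sketch earlier; none of them are prescribed by ML-Agents.

```csharp
// Per-agent timeout is disabled: leave Max Steps at 0 (in the Inspector or in code).
foreach (var agent in AgentList)
{
    agent.MaxStep = 0;
    m_AgentGroup.RegisterAgent(agent);
}

// The controller decides when time is up for everyone in the group.
if (m_StepCount >= MaxEnvironmentSteps)
{
    m_AgentGroup.GroupEpisodeInterrupted();
    ResetScene();
}
```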
Give an explanation:
"This is because calling EndEpisode will call OnEpisodeBegin, hence resetting the Agent immediately. This is usually not the desired behavior when training a group of Agents."
It is possible to call EndEpisode; it just will most likely not be what the user expects.
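
A short sketch of the difference described in this comment, again reusing the hypothetical `agent`, `m_AgentGroup`, and `ResetScene()` names from the examples above:

```csharp
// Per-agent reset: EndEpisode() immediately triggers this Agent's OnEpisodeBegin(),
// so the single Agent resets on its own while the rest of the group keeps playing.
agent.EndEpisode();

// Group-level reset: mark the episode as finished for every registered Agent and
// let the environment controller reset the scene once, for the whole group.
m_AgentGroup.EndGroupEpisode();
ResetScene();
```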
Added this explanation 👍

disabled or destroyed, right?

destroyed = gone. no way it can be re-registered, right?

Then I would say, if a previously disabled agent is re-enabled it must be re-registered.
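
A hedged sketch of that re-registration; `agentGameObject` is an illustrative name for the re-enabled Agent's GameObject:

```csharp
// A previously disabled Agent that comes back into play must be registered
// with the group again before it can take part in group rewards and episodes.
agentGameObject.SetActive(true);
m_AgentGroup.RegisterAgent(agentGameObject.GetComponent<Agent>());
```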
Suggested change:
training. So calling AddGroupReward() is not equivalent to calling agent.AddReward() on each agent
training. So calling `AddGroupReward()` is not equivalent to calling `agent.AddReward()` on each agent
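
To make the distinction in this suggestion concrete, here is a minimal sketch with hypothetical reward values, reusing the `m_AgentGroup` and `agent` names from the examples above:

```csharp
// Group reward: given to the group as a whole; the MA-POCA centralized critic
// works out how much each Agent's behavior contributed to earning it.
m_AgentGroup.AddGroupReward(1.0f);

// Individual reward: credited only to this Agent, for its own behavior.
agent.AddReward(0.1f);
```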
@@ -456,3 +456,43 @@ drop down. New pieces are spawned randomly at the top, with a chance of being
- Recommended Minimum: 1
- Recommended Maximum: 20
- Benchmark Mean Reward: Depends on the number of tiles.

## Cooperative Push Block



- Set-up: Similar to Push Block, the agents are in an area with blocks that need
  to be pushed into a goal. Small blocks can be pushed by one agent and are worth
  +1, medium blocks require two agents to push in and are worth +2, and large
  blocks require all 3 agents to push and are worth +3.
- Goal: Push all blocks into the goal.
- Agents: The environment contains three Agents in a Multi Agent Group.
- Agent Reward Function (see the sketch after this list):
  - -(1/15000) Existential penalty, as a group reward.
  - +1, +2, or +3 for pushing in a block, added as a group reward.
- Behavior Parameters:
  - Observation space: A single Grid Sensor with separate tags for each block size,
    the goal, the walls, and other agents.
  - Actions: 1 discrete action branch with 7 actions, corresponding to turn clockwise
    and counterclockwise, move along four different face directions, or do nothing.
  - Float Properties: None
- Benchmark Mean Reward:
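
As a rough sketch of how the reward function listed above maps onto the group API from the Agents documentation, the environment controller might make calls like the following. The `m_AgentGroup` and `blockValue` names are illustrative, not taken from the example project.

```csharp
// Existential penalty, applied to the whole group every environment step.
m_AgentGroup.AddGroupReward(-1f / 15000f);

// When a block is pushed into the goal: +1, +2, or +3 depending on its size.
m_AgentGroup.AddGroupReward(blockValue);
```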

## Dungeon Escape



- Set-up: Agents are trapped in a dungeon with a dragon, and must work together to escape.
  To retrieve the key, one of the agents must find and slay the dragon, sacrificing itself
  to do so. The dragon will drop a key for the others to use. The other agents can then pick
  up this key and unlock the dungeon door.
- Goal: Unlock the dungeon door and leave.
- Agents: The environment contains three Agents in a Multi Agent Group and one Dragon, which
  moves in a predetermined pattern.
- Agent Reward Function:
  - +1 group reward if any agent successfully unlocks the door and leaves the dungeon.
- Behavior Parameters:
  - Observation space: A single Grid Sensor with separate tags for the walls, other agents,
    the door, keys, and the dragon.
  - Actions: 1 discrete action branch with 7 actions, corresponding to turn clockwise
    and counterclockwise, move along four different face directions, or do nothing.
  - Float Properties: None
- Benchmark Mean Reward:
@@ -553,7 +553,7 @@ In addition to the three environment-agnostic training methods introduced in the
previous section, the ML-Agents Toolkit provides additional methods that can aid
in training behaviors for specific types of environments.

### Training in Multi-Agent Environments with Self-Play
### Training in Competitive Multi-Agent Environments with Self-Play

ML-Agents provides the functionality to train both symmetric and asymmetric
adversarial games with

@@ -588,6 +588,37 @@ our
[blog post on self-play](https://blogs.unity3d.com/2020/02/28/training-intelligent-adversaries-using-self-play-with-ml-agents/)
for additional information.

### Training In Cooperative Multi-Agent Environments with MA-POCA



ML-Agents provides the functionality for training cooperative behaviors - i.e.,
groups of agents working towards a common goal, where the success of the individual
is linked to the success of the whole group. In such a scenario, agents typically receive
rewards as a group. For instance, if a team of agents wins a game against an opposing
team, everyone is rewarded - even agents who did not directly contribute to the win. This
makes learning what to do as an individual difficult - you may get a win
for doing nothing, and a loss for doing your best.

In ML-Agents, we provide MA-POCA (MultiAgent POsthumous Credit Assignment), which
is a novel multi-agent trainer that trains a _centralized critic_, a neural network
that acts as a "coach" for a whole group of agents. You can then give rewards to the team
as a whole, and the agents will learn how best to contribute to achieving that reward.
Agents can _also_ be given rewards individually, and the team will work together to help the
individual achieve those goals. During an episode, agents can be added or removed from the group,
such as when agents spawn or die in a game. If agents are removed mid-episode (e.g., if teammates die
or are removed from the game), they will still learn whether their actions contributed
to the team winning later, enabling agents to take group-beneficial actions even if
they result in the individual being removed from the game (i.e., self-sacrifice).
MA-POCA can also be combined with self-play to train teams of agents to play against each other.

Should we say "paper coming soon" or something?

I think it is fine to not say anything. Although I am worried someone will coin the name.

To learn more about enabling cooperative behaviors for agents in an ML-Agents environment,
check out [this page](Learning-Environment-Design-Agents.md#cooperative-scenarios).

For further reading, MA-POCA builds on previous work in multi-agent cooperative learning
([Lowe et al.](https://arxiv.org/abs/1706.02275), [Foerster et al.](https://arxiv.org/pdf/1705.08926.pdf),
among others) to enable the above use-cases.
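
To make the self-sacrifice case above concrete, here is a hypothetical sketch in the spirit of the Dungeon Escape example. Names like `slayedDragon`, `dragon.DropKey()`, `m_AgentGroup`, and `ResetScene()` are illustrative scene logic, not ML-Agents API.

```csharp
// The Agent that slays the dragon is removed from play, but because it was
// registered with the group, a later group reward still credits its earlier actions.
if (slayedDragon)
{
    dragon.DropKey();
    gameObject.SetActive(false);   // this Agent is out for the rest of the episode
}

// Later, when a surviving Agent unlocks the door and escapes:
m_AgentGroup.AddGroupReward(1.0f);
m_AgentGroup.EndGroupEpisode();
ResetScene();
```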

### Solving Complex Tasks using Curriculum Learning

Curriculum learning is a way of training a machine learning model where more
