[docs] Documentation for POCA and cooperative behaviors #5056
Changes from 250 commits
@@ -0,0 +1,26 @@
behaviors:
  PushBlock:
    trainer_type: poca
    hyperparameters:
      batch_size: 1024
      buffer_size: 10240
      learning_rate: 0.0003
      beta: 0.01
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: constant
    network_settings:
      normalize: false
      hidden_units: 256
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    keep_checkpoints: 5
    max_steps: 20000000
    time_horizon: 64
    summary_freq: 60000
    threaded: true
@@ -29,7 +29,7 @@
- [Rewards Summary & Best Practices](#rewards-summary--best-practices)
- [Agent Properties](#agent-properties)
- [Destroying an Agent](#destroying-an-agent)
- [Defining Teams for Multi-agent Scenarios](#defining-teams-for-multi-agent-scenarios)
- [Defining Multi-agent Scenarios](#defining-multi-agent-scenarios)
- [Recording Demonstrations](#recording-demonstrations)

An agent is an entity that can observe its environment, decide on the best

@@ -537,7 +537,7 @@ the padded observations. Note that attention layers are invariant to
the order of the entities, so there is no need to properly "order" the
entities before feeding them into the `BufferSensor`.

The the `BufferSensorComponent` Editor inspector have two arguments:
The `BufferSensorComponent` Editor inspector has two arguments:
- `Observation Size` : This is how many floats each entity will be
  represented with. This number is fixed and all entities must
  have the same representation. For example, if the entities you want to
| | @@ -900,7 +900,9 @@ is always at least one Agent training at all times by either spawning a new | |||||||
| Agent every time one is destroyed or by re-spawning new Agents when the whole | ||||||||
| environment resets. | ||||||||
| | ||||||||
| ## Defining Teams for Multi-agent Scenarios | ||||||||
| ## Defining Multi-agent Scenarios | ||||||||
| | ||||||||
| ### Teams for Adversarial Scenarios | ||||||||
| | ||||||||
| Self-play is triggered by including the self-play hyperparameter hierarchy in | ||||||||
| the [trainer configuration](Training-ML-Agents.md#training-configurations). To | ||||||||
| | @@ -927,6 +929,92 @@ provide examples of symmetric games. To train an asymmetric game, specify | |||||||
| trainer configurations for each of your behavior names and include the self-play | ||||||||
| hyperparameter hierarchy in both. | ||||||||
| | ||||||||
### Groups for Cooperative Scenarios

Cooperative behavior in ML-Agents can be enabled by instantiating a `SimpleMultiAgentGroup`,
typically in an environment controller or similar script, and adding agents to it
using the `RegisterAgent(Agent agent)` method. Using `MultiAgentGroup` enables the
agents within a group to learn how to work together to achieve a common goal (i.e.,
maximize a group-given reward), even if one or more of the group members are removed
before the episode ends. You can then use this group to add or set rewards and to end
or interrupt episodes at the group level using the `AddGroupReward()`, `SetGroupReward()`,
`EndGroupEpisode()`, and `GroupEpisodeInterrupted()` methods. For example:

```csharp
// Requires the Unity.MLAgents namespace (using Unity.MLAgents;)

// Create a Multi Agent Group in Start() or Initialize()
m_AgentGroup = new SimpleMultiAgentGroup();

// Register agents in group at the beginning of an episode
foreach (var agent in AgentList)
{
    m_AgentGroup.RegisterAgent(agent);
}

// If the team scores a goal
m_AgentGroup.AddGroupReward(score);

// If the goal is reached and the episode is over
m_AgentGroup.EndGroupEpisode();
ResetScene();

// If time ran out and we need to interrupt the episode
m_AgentGroup.GroupEpisodeInterrupted();
ResetScene();
```
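
Putting these calls together, a minimal environment-controller sketch might look like the following. This is only an illustration of one possible layout: the class name `GroupEnvController`, the `AgentList` and `MaxEnvironmentSteps` fields, and the `OnGoalScored()`, `OnTaskComplete()`, and `ResetScene()` methods are hypothetical names, not part of the ML-Agents API.

```csharp
using System.Collections.Generic;
using Unity.MLAgents;
using UnityEngine;

public class GroupEnvController : MonoBehaviour
{
    // Agents that belong to this environment instance, assigned in the Inspector.
    public List<Agent> AgentList;

    // Group-level episode length; each Agent's own Max Step stays at 0.
    public int MaxEnvironmentSteps = 5000;

    SimpleMultiAgentGroup m_AgentGroup;
    int m_StepCount;

    void Start()
    {
        m_AgentGroup = new SimpleMultiAgentGroup();
        foreach (var agent in AgentList)
        {
            m_AgentGroup.RegisterAgent(agent);
        }
    }

    void FixedUpdate()
    {
        m_StepCount += 1;
        if (m_StepCount >= MaxEnvironmentSteps)
        {
            // Time ran out: interrupt the episode for the whole group, then reset.
            m_AgentGroup.GroupEpisodeInterrupted();
            ResetScene();
        }
    }

    // Called by scene logic when the team scores.
    public void OnGoalScored(float score)
    {
        m_AgentGroup.AddGroupReward(score);
    }

    // Called when the task is fully solved and the episode should end.
    public void OnTaskComplete()
    {
        m_AgentGroup.EndGroupEpisode();
        ResetScene();
    }

    void ResetScene()
    {
        m_StepCount = 0;
        // Reset block and agent positions here; Agents stay registered for the next episode.
    }
}
```

The only ML-Agents-specific pieces in this sketch are `SimpleMultiAgentGroup` and its `RegisterAgent`, `AddGroupReward`, `EndGroupEpisode`, and `GroupEpisodeInterrupted` methods; everything else is ordinary Unity scene logic.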

Multi Agent Groups are best used with the MA-POCA trainer, which is explicitly designed to train

Suggested change:
Multi Agent Groups are best used with the MA-POCA trainer, which is explicitly designed to train
Multi Agent Groups can only be trained with the MA-POCA trainer, which is explicitly designed to train

Hmm, this isn't exactly true - Multi Agent Groups will run and try to train with PPO but their behaviors won't be very collaborative. I changed it to the stronger-but-not-as-hard "should be trained with".
Suggested change:
cooperative environments. This can be enabled by using the `coma` trainer - see the
cooperative environments. This can be enabled by using the `poca` trainer - see the

Maybe a little image will help?

Added a small diagram of the difference
For an example of how to set up cooperative environments, see the
[Cooperative PushBlock](Learning-Environment-Examples.md#cooperative-push-block) and
[Dungeon Escape](Learning-Environment-Examples.md#dungeon-escape) example environments.

Remove until the environments are actually merged.

Removed
There is some inconsistency in the page between agent and Agent (capitalization)

This is the first time this is mentioned (I think). This section is a summary, so it should be called out earlier.
Suggested change:
* Agents within groups should always set the `Max Steps` parameter the Agent script to 0, meaning
  they will never reach a max step. Instead, handle Max Steps with MultiAgentGroup by ending the episode for the entire
* Agents within groups should always set the `Max Steps` parameter in the Agent script to 0. Instead, handle Max Steps with MultiAgentGroup by ending the episode for the entire
I think this is not specific to GroupTraining and should be called out in a more general documentation page.

I guess we never explicitly called this out since we handle all the max_step stuff for single agent so users don't need to know about this. The only place that used Agent.EpisodeInterrupted is in Match3 where different agents will make moves at different frequencies.

Maybe we can add a section about manually requesting decision and manually handling max_step_reached (in separate PR)?
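
As a small illustration of the suggestion above, the per-agent and group-level timeout handling could look like the sketch below. It reuses the hypothetical `AgentList`, `m_AgentGroup`, `m_StepCount`, `MaxEnvironmentSteps`, and `ResetScene()` names from the controller sketch earlier; none of them are prescribed by ML-Agents.

```csharp
// Per-agent timeout is disabled: leave Max Steps at 0 (in the Inspector or in code).
foreach (var agent in AgentList)
{
    agent.MaxStep = 0;
    m_AgentGroup.RegisterAgent(agent);
}

// The controller decides when time is up for everyone in the group.
if (m_StepCount >= MaxEnvironmentSteps)
{
    m_AgentGroup.GroupEpisodeInterrupted();
    ResetScene();
}
```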
Give an explanation:
"This is because calling EndEpisode will call OnEpisodeBegin, hence resetting the Agent immediately. This is usually not the desired behavior when training a group of Agents."
It is possible to call EndEpisode; it just will most likely not be what the user expects.
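
A short sketch of the difference described in this comment, again reusing the hypothetical `agent`, `m_AgentGroup`, and `ResetScene()` names from the examples above:

```csharp
// Per-agent reset: EndEpisode() immediately triggers this Agent's OnEpisodeBegin(),
// so the single Agent resets on its own while the rest of the group keeps playing.
agent.EndEpisode();

// Group-level reset: mark the episode as finished for every registered Agent and
// let the environment controller reset the scene once, for the whole group.
m_AgentGroup.EndGroupEpisode();
ResetScene();
```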
Added this explanation 👍

disabled or destroyed, right?

destroyed = gone. no way it can be re-registered, right?

Then I would say, if a previously disabled agent is re-enabled it must be re-registered.
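
A hedged sketch of that re-registration; `agentGameObject` is an illustrative name for the re-enabled Agent's GameObject:

```csharp
// A previously disabled Agent that comes back into play must be registered
// with the group again before it can take part in group rewards and episodes.
agentGameObject.SetActive(true);
m_AgentGroup.RegisterAgent(agentGameObject.GetComponent<Agent>());
```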
Suggested change:
training. So calling AddGroupReward() is not equivalent to calling agent.AddReward() on each agent
training. So calling `AddGroupReward()` is not equivalent to calling `agent.AddReward()` on each agent
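
To make the distinction in this suggestion concrete, here is a minimal sketch with hypothetical reward values, reusing the `m_AgentGroup` and `agent` names from the examples above:

```csharp
// Group reward: given to the group as a whole; the MA-POCA centralized critic
// works out how much each Agent's behavior contributed to earning it.
m_AgentGroup.AddGroupReward(1.0f);

// Individual reward: credited only to this Agent, for its own behavior.
agent.AddReward(0.1f);
```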
@@ -456,3 +456,43 @@ drop down. New pieces are spawned randomly at the top, with a chance of being
- Recommended Minimum: 1
- Recommended Maximum: 20
- Benchmark Mean Reward: Depends on the number of tiles.

## Cooperative Push Block



- Set-up: Similar to Push Block, the agents are in an area with blocks that need
  to be pushed into a goal. Small blocks can be pushed by one agent and are worth
  +1, medium blocks require two agents to push in and are worth +2, and large
  blocks require all 3 agents to push and are worth +3.
- Goal: Push all blocks into the goal.
- Agents: The environment contains three Agents in a Multi Agent Group.
- Agent Reward Function (see the sketch after this list):
  - -(1/15000) Existential penalty, as a group reward.
  - +1, +2, or +3 for pushing in a block, added as a group reward.
- Behavior Parameters:
  - Observation space: A single Grid Sensor with separate tags for each block size,
    the goal, the walls, and other agents.
  - Actions: 1 discrete action branch with 7 actions, corresponding to turn clockwise
    and counterclockwise, move along four different face directions, or do nothing.
  - Float Properties: None
- Benchmark Mean Reward:
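
As a rough sketch of how the reward function listed above maps onto the group API from the Agents documentation, the environment controller might make calls like the following. The `m_AgentGroup` and `blockValue` names are illustrative, not taken from the example project.

```csharp
// Existential penalty, applied to the whole group every environment step.
m_AgentGroup.AddGroupReward(-1f / 15000f);

// When a block is pushed into the goal: +1, +2, or +3 depending on its size.
m_AgentGroup.AddGroupReward(blockValue);
```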

## Dungeon Escape



- Set-up: Agents are trapped in a dungeon with a dragon, and must work together to escape.
  To retrieve the key, one of the agents must find and slay the dragon, sacrificing itself
  to do so. The dragon will drop a key for the others to use. The other agents can then pick
  up this key and unlock the dungeon door.
- Goal: Unlock the dungeon door and leave.
- Agents: The environment contains three Agents in a Multi Agent Group and one Dragon, which
  moves in a predetermined pattern.
- Agent Reward Function:
  - +1 group reward if any agent successfully unlocks the door and leaves the dungeon.
- Behavior Parameters:
  - Observation space: A single Grid Sensor with separate tags for the walls, other agents,
    the door, keys, and the dragon.
  - Actions: 1 discrete action branch with 7 actions, corresponding to turn clockwise
    and counterclockwise, move along four different face directions, or do nothing.
  - Float Properties: None
- Benchmark Mean Reward:
@@ -553,7 +553,7 @@ In addition to the three environment-agnostic training methods introduced in the
previous section, the ML-Agents Toolkit provides additional methods that can aid
in training behaviors for specific types of environments.

### Training in Multi-Agent Environments with Self-Play
### Training in Competitive Multi-Agent Environments with Self-Play

ML-Agents provides the functionality to train both symmetric and asymmetric
adversarial games with

@@ -588,6 +588,37 @@ our
[blog post on self-play](https://blogs.unity3d.com/2020/02/28/training-intelligent-adversaries-using-self-play-with-ml-agents/)
for additional information.

### Training In Cooperative Multi-Agent Environments with MA-POCA



ML-Agents provides the functionality for training cooperative behaviors - i.e.,
groups of agents working towards a common goal, where the success of the individual
is linked to the success of the whole group. In such a scenario, agents typically receive
rewards as a group. For instance, if a team of agents wins a game against an opposing
team, everyone is rewarded - even agents who did not directly contribute to the win. This
makes learning what to do as an individual difficult - you may get a win
for doing nothing, and a loss for doing your best.

In ML-Agents, we provide MA-POCA (MultiAgent POsthumous Credit Assignment), which
is a novel multi-agent trainer that trains a _centralized critic_, a neural network
that acts as a "coach" for a whole group of agents. You can then give rewards to the team
as a whole, and the agents will learn how best to contribute to achieving that reward.
Agents can _also_ be given rewards individually, and the team will work together to help the
individual achieve those goals. During an episode, agents can be added or removed from the group,
such as when agents spawn or die in a game. If agents are removed mid-episode (e.g., if teammates die
or are removed from the game), they will still learn whether their actions contributed
to the team winning later, enabling agents to take group-beneficial actions even if
they result in the individual being removed from the game (i.e., self-sacrifice).
MA-POCA can also be combined with self-play to train teams of agents to play against each other.

Should we say "paper coming soon" or something?

I think it is fine to not say anything. Although I am worried someone will coin the name.

To learn more about enabling cooperative behaviors for agents in an ML-Agents environment,
check out [this page](Learning-Environment-Design-Agents.md#cooperative-scenarios).

For further reading, MA-POCA builds on previous work in multi-agent cooperative learning
([Lowe et al.](https://arxiv.org/abs/1706.02275), [Foerster et al.](https://arxiv.org/pdf/1705.08926.pdf),
among others) to enable the above use-cases.
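
To make the self-sacrifice case above concrete, here is a hypothetical sketch in the spirit of the Dungeon Escape example. Names like `slayedDragon`, `dragon.DropKey()`, `m_AgentGroup`, and `ResetScene()` are illustrative scene logic, not ML-Agents API.

```csharp
// The Agent that slays the dragon is removed from play, but because it was
// registered with the group, a later group reward still credits its earlier actions.
if (slayedDragon)
{
    dragon.DropKey();
    gameObject.SetActive(false);   // this Agent is out for the rest of the episode
}

// Later, when a surviving Agent unlocks the door and escapes:
m_AgentGroup.AddGroupReward(1.0f);
m_AgentGroup.EndGroupEpisode();
ResetScene();
```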

### Solving Complex Tasks using Curriculum Learning

Curriculum learning is a way of training a machine learning model where more
