Multigpu Feature #3769
Merged
14 commits
b317cbf fix typo, rewrite graph (dzhwinter)
dbaaa49 fix typo, rewrite graph (dzhwinter)
c117185 rewrite graph (dzhwinter)
c8701bd rewrite graph (dzhwinter)
1e5302c "redraw the graph" (dzhwinter)
e0a8b59 "redo the graph" (dzhwinter)
1c63771 Merge remote-tracking branch 'origin/develop' into multigpu (dzhwinter)
ddc2587 Merge remote-tracking branch 'origin/develop' into multigpu (dzhwinter)
988a4a6 Merge remote-tracking branch 'origin/develop' into feature/nccl_doc (dzhwinter)
7389ea9 "add NCCL multi-GPU design doc" (dzhwinter)
ebd0cf1 "add manual allreduce" (dzhwinter)
9b16750 "remove AllReduce2 comments" (dzhwinter)
a2dfabb "fix based on comments" (dzhwinter)
a02a68d "fixed based on comment" (dzhwinter)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,65 @@ | ||
| # Design Doc: NCCL support in Paddle Fluid | ||
| | ||
| ## Abstract | ||
| | ||
| This design doc describes NCCL support in Paddle. We propose an approach that supports the NCCL library on both a single machine and multiple machines. We wrap the NCCL primitives `Broadcast`, `AllReduce`, and `Reduce` as operators so that a single script can use multiple GPUs. | ||
| | ||
| | ||
| ## Motivation | ||
| | ||
| [NCCL](https://developer.nvidia.com/nccl) is an NVIDIA library for multi-GPU communication, optimized for NVIDIA GPUs. It provides routines such as all-gather, all-reduce, broadcast, reduce, and reduce-scatter that achieve high bandwidth over PCIe and the NVLink high-speed interconnect. With the NCCL library, we can easily accelerate parallel training. | ||
| | ||
| - Pros | ||
| 1. Easy to plug in the [NCCL2](https://developer.nvidia.com/nccl) library. | ||
| 1. High performance on NVIDIA GPUs. | ||
| 1. MPI-like primitives, which have a low learning cost for users. | ||
| | ||
| - Cons | ||
| 1. Designed only for NVIDIA GPUs; not a general multi-device solution. | ||
| 1. Although NCCL1 is open-sourced under the BSD license, NCCL2 is no longer open source. | ||
| | ||
| At the beginning of training, the framework needs to distribute the same parameters to every GPU, and it needs to merge the gradients whenever the user requires it. | ||
| | ||
| As a result, training needs three operations: peer-to-peer copy between GPUs, aggregation of gradients/parameters from the GPUs, and broadcast of parameters to the GPUs. Every GPU only needs to run the operator with the correct place information. | ||
| | ||
| Besides, the framework needs interfaces to synchronize model updates across the GPU cards. | ||
| | ||
| ## Implementation | ||
| | ||
| As mentioned above, we wrap the NCCL routines as several kinds of operators. Note that NCCL needs to create a Communicator between the GPUs at the beginning, so an NCCLInit operator is provided. | ||
| | ||
| ### Transpiler | ||
| | ||
| To be compatible with the [parameter server design doc](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/ops/dist_train.md), the transpiler compiles the user-defined operation graph into sub-graphs to be executed on different devices. | ||
| | ||
| 1. The user-defined model is a single-device program. | ||
| | ||
| 2. Broadcast/Reduce operators between GPUs are inserted into the program; for multi-node training, `Send` and `Recv` operators may also be inserted. | ||
| | ||
| *Broadcast and AllReduce on a single machine; Broadcast, AllReduce, and [Send, Recv](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/ops/dist_train.md#graph-converter) on multiple machines.* | ||
| | ||
| <img src="images/multigpu_before_convert.png" width="300"/> | ||
| | ||
| After compiling, the graph is as shown below: | ||
| | ||
| <img src="images/multigpu_allreduce.png" width="1000"/> | ||
| | ||
| Operators are added to the sub-graphs. Every GPU is assigned a role: `rank0`, `rank1`, etc. | ||
| | ||
| - **Broadcast**. The Broadcast operator distributes initialized parameters to all GPUs from the GPU that owns them, e.g. from the `rank0` GPU. | ||
| - **AllReduce**. The AllReduce operator synchronizes parameters/gradients between GPUs. AllReduce is implemented with a ring-based communication method, which avoids a bottleneck on any single GPU. | ||
| | ||
| Note that the AllReduce operator forces the GPUs to synchronize at that point. Whether the whole training process runs in asynchronous or synchronous mode depends on where the AllReduce point sits in the graph. | ||
| | ||
| As shown in the picture, each GPU computes the gradient of `W`, followed by an `AllReduce` operator that accumulates `dW` over the full batch of data. Each GPU then runs the optimization process individually and applies the gradient to its own `W`. | ||
| | ||
| - **AllReduce** | ||
| Note that our AllReduce operator is a ring-based AllReduce implementation. If we used the NCCL2 AllReduce primitive, every GPU would run the optimizer on the full batch of data, wasting the compute resources of (n-1) GPUs. In addition, the NCCL2 built-in AllReduce only uses the communication resources during synchronization, so updating the gradient becomes a subsequent phase. In fact, we can amortize the gradient update time into the communication phase. The process is: | ||
| 1. Every parameter has its root card. That card is responsible for aggregating the gradients from the GPUs. | ||
| Contributor: Maybe we could introduce how to distribute the parameters (round-robin, hash or user-specified)? Contributor Author: No, that's another problem coupled with | ||
| 2. The whole model's parameters are hashed to different root cards to balance the load between GPUs. | ||
| 3. Logically neighboring cards start sending the parameter's gradients to the next card. After one round, the parameter's root card has aggregated the full gradients. | ||
| 4. The root card then optimizes the parameter. | ||
| 5. The root card sends its optimized result to its neighbor, which then sends the parameter to the next card. | ||
| 6. The synchronization round finishes. | ||
| | ||
| The total time cost is 2 * (n-1) * per-parameter-send-time, so we reach the goal of amortizing the update time into the communication phase. | ||
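
The six steps above can be sketched as a small single-process simulation. This is only an illustration under stated assumptions, not Paddle's operators or the NCCL API: the names `root_card` and `ring_sync` are hypothetical, the root card is picked by a CRC32 hash of the parameter name, each parameter is a single scalar, and the optimizer is plain SGD.

```python
import zlib


def root_card(param_name, num_gpus):
    """Hypothetical placement rule: hash the parameter name onto a GPU index."""
    return zlib.crc32(param_name.encode("utf-8")) % num_gpus


def ring_sync(param_name, weights, grads, lr):
    """Simulate one synchronization round for a single scalar parameter.

    weights[i] and grads[i] are the copies held by GPU i.
    Returns (updated weights, root index, number of neighbor-to-neighbor sends).
    """
    n = len(grads)
    root = root_card(param_name, n)
    sends = 0

    # Steps 1-3: pass a running gradient sum around the ring until it reaches the root.
    card = (root + 1) % n
    partial = grads[card]
    for _ in range(n - 1):
        card = (card + 1) % n
        partial += grads[card]
        sends += 1

    # Step 4: only the root card runs the optimizer (plain SGD is assumed here).
    new_weight = weights[root] - lr * partial

    # Steps 5-6: pass the optimized value around the ring to every other card.
    updated = list(weights)
    updated[root] = new_weight
    card = root
    for _ in range(n - 1):
        card = (card + 1) % n
        updated[card] = new_weight
        sends += 1

    return updated, root, sends


if __name__ == "__main__":
    w, root, sends = ring_sync("fc_0.w", [1.0] * 4, [1.0, 2.0, 3.0, 4.0], lr=0.5)
    print(root, sends, w)
```

With four simulated GPUs the round uses six sends, matching the 2 * (n-1) count above, and the optimizer step runs once on the root card instead of once per GPU.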
NCCL2 also supports ring-based AllReduce. See https://github.com/PaddlePaddle/Paddle/wiki/NCCL2-Survey
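That is not the same thing: what we need is more than a ring-based AllReduce. The NCCL2 AllReduce only supports simple operations such as sum and max, and we need to run the optimization inside it.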