As shown in #8550, send_op takes up too much of the time in GPU distributed training. Here are the tasks we need to do to improve performance:
- profile the performance details of send_op -- @gongweibao
- do not copy before sending variables -- @typhoonzero
- do not copy when deserializing variables -- @gongweibao
- use distribute_transpiler_simple to reduce copying -- @typhoonzero
- merge small variables into one message and send them together (see the packing sketch after this list)
- run parameter optimization in parallel -- @typhoonzero
- implement communication using RDMA -- @seiriosPlus
- implement multi GPU multi node dist training using NCCL2 -- @typhoonzero
- asynchronously send gradients after the execution of each backward op (Split send_op into fetch_vars_op and send_vars_op #9161) -- @Yancey1989
- prepare the executor on the pserver before training -- @typhoonzero
- test the maximum throughput of gRPC with large messages
- evaluate whether gRPC streaming can help
- use CUDA pinned memory to enable DMA copies (see the pinned-memory sketch after this list)
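For the "merge small variables into one message" item, a minimal sketch of the idea (the `PackedMessage` struct and function names below are hypothetical, not part of the Paddle codebase): pack several small gradient tensors into one contiguous payload with recorded offsets, so a single RPC replaces one RPC per variable and the per-message gRPC overhead is amortized.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical packed message: one contiguous payload plus per-variable
// (name, offset, length) records so the pserver can slice it apart again.
struct PackedMessage {
  std::vector<std::string> names;
  std::vector<size_t> offsets;
  std::vector<size_t> lengths;
  std::vector<float> payload;
};

// Pack many small gradient buffers into a single message so that one send
// replaces N small sends.
PackedMessage PackSmallVars(
    const std::vector<std::pair<std::string, std::vector<float>>>& grads) {
  PackedMessage msg;
  size_t total = 0;
  for (const auto& g : grads) total += g.second.size();
  msg.payload.reserve(total);
  for (const auto& g : grads) {
    msg.names.push_back(g.first);
    msg.offsets.push_back(msg.payload.size());
    msg.lengths.push_back(g.second.size());
    msg.payload.insert(msg.payload.end(), g.second.begin(), g.second.end());
  }
  return msg;  // serialize msg once and send it in a single RPC
}
```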
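For the CUDA pinned memory item, a sketch of the mechanism (buffer size and stream usage are illustrative only): a page-locked host buffer lets `cudaMemcpyAsync` perform a true DMA transfer, so the device-to-host copy of a gradient can overlap with other work instead of staging through pageable memory.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  const size_t n = 1 << 20;            // illustrative gradient size (floats)
  const size_t bytes = n * sizeof(float);

  float* d_grad = nullptr;
  float* h_pinned = nullptr;
  cudaStream_t stream;

  cudaMalloc(&d_grad, bytes);
  // Page-locked (pinned) host memory: required for truly asynchronous DMA copies.
  cudaMallocHost(&h_pinned, bytes);
  cudaStreamCreate(&stream);

  // The device-to-host copy runs as DMA on the stream and can overlap with
  // other kernels or the next backward op instead of blocking the trainer.
  cudaMemcpyAsync(h_pinned, d_grad, bytes, cudaMemcpyDeviceToHost, stream);

  cudaStreamSynchronize(stream);       // wait only when the bytes are needed
  printf("copied %zu bytes through pinned memory\n", bytes);

  cudaFreeHost(h_pinned);
  cudaFree(d_grad);
  cudaStreamDestroy(stream);
  return 0;
}
```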