Improve Fluid Distributed Training performance #8638

Description

@typhoonzero

As shown in #8550, send_op takes up too much of the time in GPU distributed training. Here are the tasks we need to do to improve performance:

  • profile send_op and collect perf details -- @gongweibao
  • do not copy before sending variables -- @typhoonzero
  • do not copy when deserializing -- @gongweibao
  • use distribute_transpiler_simple to reduce copying -- @typhoonzero
  • merge small variables into one message and send them together (see the packing sketch after this list)
  • run parameter optimization in parallel -- @typhoonzero
  • implement communication using RDMA -- @seiriosPlus
  • implement multi GPU multi node dist training using NCCL2 -- @typhoonzero
  • asynchronously send gradients after each backward op finishes; split send_op into fetch_vars_op and send_vars_op (#9161) -- @Yancey1989
  • prepare the executor on pserver before training. -- @typhoonzero
  • test the maximum throughput of gRPC with large messages
  • evaluate whether gRPC streaming can help
  • use CUDA pinned memory to enable DMA copies (see the pinned-memory sketch after this list)
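
For the "merge small variables" item, a minimal sketch of the idea is below. This is only an illustration, not the actual send_op code; the names `TensorView` and `PackForSend` are made up for this example.

```cpp
// Pack several small gradient tensors into one contiguous buffer so they can
// go out as a single RPC message instead of many tiny ones.
#include <cstring>
#include <vector>

struct TensorView {
  const float* data;
  size_t numel;
};

// Concatenate all inputs; `sizes` records the split points so the receiver
// can recover the individual tensors.
std::vector<float> PackForSend(const std::vector<TensorView>& grads,
                               std::vector<size_t>* sizes) {
  size_t total = 0;
  for (const auto& g : grads) total += g.numel;

  std::vector<float> buffer(total);
  size_t offset = 0;
  for (const auto& g : grads) {
    std::memcpy(buffer.data() + offset, g.data, g.numel * sizeof(float));
    sizes->push_back(g.numel);
    offset += g.numel;
  }
  return buffer;
}
```

Fewer, larger messages amortize the per-RPC overhead that dominates when many small gradients are sent one by one.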
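For the pinned-memory item, a minimal sketch of the idea, assuming the caller already has a device gradient pointer `d_grad`, its element count, and a CUDA stream (all hypothetical names for this example):

```cpp
// Copy a device gradient into page-locked (pinned) host memory so the
// device-to-host transfer can use DMA and overlap with compute on the stream.
#include <cstddef>
#include <cuda_runtime.h>

float* CopyGradToPinnedHost(const float* d_grad, size_t numel,
                            cudaStream_t stream) {
  float* h_pinned = nullptr;
  size_t nbytes = numel * sizeof(float);
  cudaMallocHost(reinterpret_cast<void**>(&h_pinned), nbytes);  // pinned alloc
  cudaMemcpyAsync(h_pinned, d_grad, nbytes, cudaMemcpyDeviceToHost,
                  stream);                                      // async DMA copy
  cudaStreamSynchronize(stream);  // wait before serializing into the RPC message
  return h_pinned;                // caller releases it with cudaFreeHost
}
```

With pageable host memory the driver has to stage the copy through an internal pinned buffer; allocating with cudaMallocHost avoids that extra copy and lets cudaMemcpyAsync actually run asynchronously.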
