This repository was archived by the owner on Jul 7, 2023. It is now read-only.

*Question* distributed training parameter setting and performance tuning #840

@EdwardZhang88

Description

I have trained a simple NMT model with the transformer model on a small dataset, and I am pretty impressed by the result achieved with just 4500 steps. The problem arises when I try to train the network on my GPU cluster, which runs on Kubernetes + Docker. The TF_CONFIG that tensor2tensor expects seems to differ from the one generated by tf-operator, which may have no master at all. If I use the TF_CONFIG generated by tf-operator directly to start distributed t2t training, the workers (or masters) hang with the warning message below, which appears to be caused by a failed create-session call on the chief, according to this issue thread.

TF_CONFIG={"cluster":{"ps":["tensorflow-cluster-gpu-8-ps-fj6v-0:3334","tensorflow-cluster-gpu-8-ps-fj6v-1:3334"],"worker":["tensorflow-cluster-gpu-8-worker-fj6v-0:3333","tensorflow-cluster-gpu-8-worker-fj6v-1:3333"]},"task":{"type":"worker","index":0},"environment":"cloud"} t2t-trainer ... ... --worker-job='/job:worker' ... ... 

leaves the workers hanging on the "waiting for model to be ready" warning message.

I have to explicitly change every "worker" to "master" to get distributed training to work.
Can anyone point out where in the code this binding logic happens?

TF_CONFIG={"cluster":{"ps":["tensorflow-cluster-gpu-8-ps-fj6v-0:3334","tensorflow-cluster-gpu-8-ps-fj6v-1:3334"],"master":["tensorflow-cluster-gpu-8-worker-fj6v-0:3333","tensorflow-cluster-gpu-8-worker-fj6v-1:3333"]},"task":{"type":"master","index":0},"environment":"cloud"} t2t-trainer ... ... --worker-job='/job:master' ... ... 

gets the training started.
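For reference, here is a minimal sketch of the rewrite I apply to the tf-operator TF_CONFIG before launching t2t-trainer. The convert_tf_config helper is my own illustration, not part of tensor2tensor or tf-operator; the only assumption is that the "worker" job and task type are renamed to "master" while the ps entries stay untouched.

# Minimal sketch: rewrite the worker-based TF_CONFIG from tf-operator into
# the master-based layout that t2t-trainer accepts.
import json
import os

def convert_tf_config(tf_config_str):
    cfg = json.loads(tf_config_str)
    # Rename the "worker" job to "master"; ps entries are left as-is.
    cfg["cluster"]["master"] = cfg["cluster"].pop("worker")
    if cfg["task"]["type"] == "worker":
        cfg["task"]["type"] = "master"
    return json.dumps(cfg)

if __name__ == "__main__":
    # t2t-trainer is then launched with --worker-job='/job:master'
    # in this same environment.
    os.environ["TF_CONFIG"] = convert_tf_config(os.environ["TF_CONFIG"])
    print(os.environ["TF_CONFIG"])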

Another issue is that training gets significantly slower in the distributed setting.

With local training on 4 GPUs, global_step/sec is approx. 1.1.
With 1 ps (CPU) + 2 masters (each with 4 GPUs), global_step/sec is approx. 1.0.
However, with 2 ps (CPU) + 2 masters (each with 4 GPUs), global_step/sec is approx. 0.4.

Hyperparameters are all the same. Does this mean that the communication overhead outweighs the computation gains? Is there any way we can improve the performance of distributed training?
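To make the suspected overhead concrete, here is a minimal TF 1.x sketch assuming async between-graph replication with tf.train.replica_device_setter: with two ps tasks the variables are spread round-robin across both ps hosts, so every step exchanges parameters and gradients with both of them. The cluster addresses are the ones from the TF_CONFIG above; whether t2t's own device placement matches this exactly is something I have not verified, so this is an illustration rather than a confirmed diagnosis.

import tensorflow as tf

# Cluster spec taken from the master-based TF_CONFIG above.
cluster = tf.train.ClusterSpec({
    "ps": ["tensorflow-cluster-gpu-8-ps-fj6v-0:3334",
           "tensorflow-cluster-gpu-8-ps-fj6v-1:3334"],
    "master": ["tensorflow-cluster-gpu-8-worker-fj6v-0:3333",
               "tensorflow-cluster-gpu-8-worker-fj6v-1:3333"],
})

# replica_device_setter assigns variables to ps tasks round-robin by default,
# so with 2 ps tasks roughly half the parameters live on each ps host and
# every step talks to both of them.
with tf.device(tf.train.replica_device_setter(
        cluster=cluster, worker_device="/job:master/task:0")):
    w1 = tf.get_variable("w1", shape=[1024, 1024])
    w2 = tf.get_variable("w2", shape=[1024, 1024])

print(w1.device)  # /job:ps/task:0
print(w2.device)  # /job:ps/task:1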

...

Environment information

TensorFlow v1.4.1
Tensor2Tensor v1.2.9
OS: CentOS 7
