This repository was archived by the owner on Jul 7, 2023. It is now read-only.

*Question* distributed training parameter setting and performance tuning #840

@EdwardZhang88

Description

I have trained a simple NMT model with the transformer model on a small dataset, and I am pretty impressed by the result achieved with just 4500 steps. The problem arises when I try to train the network on my GPU cluster, which runs on Kubernetes + Docker. The TF_CONFIG that tensor2tensor expects seems to differ from the one generated by tf-operator, which may have no master at all. If I use the TF_CONFIG generated by tf-operator directly to start distributed t2t training, the workers (or masters) hang with the warning message below, which appears to be caused by a failed create-session call on the chief, according to this issue thread.

TF_CONFIG={"cluster":{"ps":["tensorflow-cluster-gpu-8-ps-fj6v-0:3334","tensorflow-cluster-gpu-8-ps-fj6v-1:3334"],"worker":["tensorflow-cluster-gpu-8-worker-fj6v-0:3333","tensorflow-cluster-gpu-8-worker-fj6v-1:3333"]},"task":{"type":"worker","index":0},"environment":"cloud"} t2t-trainer ... ... --worker-job='/job:worker' ... ... 

leaves the workers hanging on the "waiting for model to be ready" warning message.

I have to explicitly change every "worker" to "master" to get distributed training to work.
Can anyone point out where in the code this binding logic happens?

TF_CONFIG={"cluster":{"ps":["tensorflow-cluster-gpu-8-ps-fj6v-0:3334","tensorflow-cluster-gpu-8-ps-fj6v-1:3334"],"master":["tensorflow-cluster-gpu-8-worker-fj6v-0:3333","tensorflow-cluster-gpu-8-worker-fj6v-1:3333"]},"task":{"type":"master","index":0},"environment":"cloud"} t2t-trainer ... ... --worker-job='/job:master' ... ... 

gets the training started.
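For reference, here is a minimal sketch of the rewrite I apply to the tf-operator TF_CONFIG before launching t2t-trainer. The convert_tf_config helper is my own illustration, not part of tensor2tensor or tf-operator; the only assumption is that the "worker" job and task type are renamed to "master" while the ps entries stay untouched.

# Minimal sketch: rewrite the worker-based TF_CONFIG from tf-operator into
# the master-based layout that t2t-trainer accepts.
import json
import os

def convert_tf_config(tf_config_str):
    cfg = json.loads(tf_config_str)
    # Rename the "worker" job to "master"; ps entries are left as-is.
    cfg["cluster"]["master"] = cfg["cluster"].pop("worker")
    if cfg["task"]["type"] == "worker":
        cfg["task"]["type"] = "master"
    return json.dumps(cfg)

if __name__ == "__main__":
    # t2t-trainer is then launched with --worker-job='/job:master'
    # in this same environment.
    os.environ["TF_CONFIG"] = convert_tf_config(os.environ["TF_CONFIG"])
    print(os.environ["TF_CONFIG"])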

Another issue is that training gets significantly slower in the distributed setting.

With local training on 4 GPUs, global_step/sec is approx. 1.1.
With 1 ps (CPU) + 2 masters (each with 4 GPUs), global_step/sec is approx. 1.0.
However, with 2 ps (CPU) + 2 masters (each with 4 GPUs), global_step/sec is approx. 0.4.

Hyperparameters are all the same. Does this mean that the communication overhead outweighs the computation gains? Is there any way we can improve the performance of distributed training?
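To make the suspected overhead concrete, here is a minimal TF 1.x sketch assuming async between-graph replication with tf.train.replica_device_setter: with two ps tasks the variables are spread round-robin across both ps hosts, so every step exchanges parameters and gradients with both of them. The cluster addresses are the ones from the TF_CONFIG above; whether t2t's own device placement matches this exactly is something I have not verified, so this is an illustration rather than a confirmed diagnosis.

import tensorflow as tf

# Cluster spec taken from the master-based TF_CONFIG above.
cluster = tf.train.ClusterSpec({
    "ps": ["tensorflow-cluster-gpu-8-ps-fj6v-0:3334",
           "tensorflow-cluster-gpu-8-ps-fj6v-1:3334"],
    "master": ["tensorflow-cluster-gpu-8-worker-fj6v-0:3333",
               "tensorflow-cluster-gpu-8-worker-fj6v-1:3333"],
})

# replica_device_setter assigns variables to ps tasks round-robin by default,
# so with 2 ps tasks roughly half the parameters live on each ps host and
# every step talks to both of them.
with tf.device(tf.train.replica_device_setter(
        cluster=cluster, worker_device="/job:master/task:0")):
    w1 = tf.get_variable("w1", shape=[1024, 1024])
    w2 = tf.get_variable("w2", shape=[1024, 1024])

print(w1.device)  # /job:ps/task:0
print(w2.device)  # /job:ps/task:1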

...

Environment information

TensorFlow v1.4.1
Tensor2Tensor v1.2.9
OS: CentOS 7
