*Question* distributed training parameter setting and performance tuning #840
Description
I have trained a simple NMT model using the transformer model on a small dataset, and I am pretty impressed by the result achieved with just 4500 steps. The problem arises when I try to train the network on my GPU cluster, which runs on Kubernetes + Docker. It seems the TF_CONFIG required by tensor2tensor differs from the one produced by tf-operator, which may have no master at all. If I use the TF_CONFIG generated by tf-operator directly to start distributed t2t training, the workers (or masters) hang with the warning message below, which seems to be caused by the chief's create-session call failing, according to this issue thread.
TF_CONFIG={"cluster":{"ps":["tensorflow-cluster-gpu-8-ps-fj6v-0:3334","tensorflow-cluster-gpu-8-ps-fj6v-1:3334"],"worker":["tensorflow-cluster-gpu-8-worker-fj6v-0:3333","tensorflow-cluster-gpu-8-worker-fj6v-1:3333"]},"task":{"type":"worker","index":0},"environment":"cloud"} t2t-trainer ... ... --worker-job='/job:worker' ... ...
will get worker hang on "waiting for model to be ready" warning message.
I have to explicitly change every "worker" to "master" to get distributed training to work.
Can anyone point out where in the code this binding logic happens?
TF_CONFIG={"cluster":{"ps":["tensorflow-cluster-gpu-8-ps-fj6v-0:3334","tensorflow-cluster-gpu-8-ps-fj6v-1:3334"],"master":["tensorflow-cluster-gpu-8-worker-fj6v-0:3333","tensorflow-cluster-gpu-8-worker-fj6v-1:3333"]},"task":{"type":"master","index":0},"environment":"cloud"} t2t-trainer ... ... --worker-job='/job:master' ... ...
will get the training started.
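For reference, here is a minimal sketch of the substitution I currently apply by hand before launching t2t-trainer. The helper name is hypothetical (it is not part of t2t or tf-operator); it only renames the "worker" job to "master" inside TF_CONFIG:

```python
import json
import os

def rewrite_tf_config(raw):
    """Rename the tf-operator style "worker" job to "master" in TF_CONFIG.

    Hypothetical helper illustrating the manual change described above;
    it assumes TF_CONFIG has the usual cluster/task layout.
    """
    cfg = json.loads(raw)
    cluster = cfg["cluster"]
    if "worker" in cluster:
        cluster["master"] = cluster.pop("worker")
    task = cfg.get("task", {})
    if task.get("type") == "worker":
        task["type"] = "master"
    return json.dumps(cfg)

if __name__ == "__main__":
    # Assumes TF_CONFIG was already exported by tf-operator.
    os.environ["TF_CONFIG"] = rewrite_tf_config(os.environ["TF_CONFIG"])
    print(os.environ["TF_CONFIG"])
```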
Another issue is that training gets significantly slower in the distributed setting.
- Local training on 4 GPUs: global_step/sec ≈ 1.1
- 1 ps (CPU) + 2 masters (4 GPUs each): global_step/sec ≈ 1.0
- 2 ps (CPU) + 2 masters (4 GPUs each): global_step/sec ≈ 0.4
Hyperparameters are all the same. Does this mean that the communication overhead outweighs the computation gains? Is there any way to improve the performance of distributed training?
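For context, here is the back-of-envelope arithmetic behind that question, under my (unverified) assumption that each global step processes the same number of examples in every configuration, so global_step/sec is proportional to throughput:

```python
# Measured rates from the list above.
local = 1.1          # 4 GPUs, single machine
dist_1ps = 1.0       # 1 ps + 2 masters (4 GPUs each), 8 GPUs total
dist_2ps = 0.4       # 2 ps + 2 masters (4 GPUs each), 8 GPUs total

ideal = 2.0          # doubling the GPUs would ideally double throughput

for name, rate in [("1 ps + 2 masters", dist_1ps),
                   ("2 ps + 2 masters", dist_2ps)]:
    speedup = rate / local                 # ~0.91x and ~0.36x
    efficiency = speedup / ideal           # ~45% and ~18% of ideal
    print(f"{name}: {speedup:.2f}x local throughput, "
          f"{efficiency:.0%} of the ideal 2x speedup")
```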
...
Environment information
TensorFlow v1.4.1
Tensor2Tensor v1.2.9
OS: CentOS 7