@dzhwinter

No description provided.


## Synchronous SGD

In synchronous SGD, each trainer needs to wait for the other nodes to finish training on every mini-batch, and must not move on to the next epoch if any node lags behind.
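
The waiting behaviour described above can be illustrated with a small in-process simulation. This is only a sketch under stated assumptions: trainers are modelled as threads, the objective is a toy 1-D least-squares problem, and every name (`run_sync_sgd`, `shards`, `apply_update`) is hypothetical rather than part of the Paddle trainer API.

```python
import threading


def run_sync_sgd(shards, steps, lr=0.1):
    """Simulate synchronous SGD on the toy objective mean((w - x)^2).

    Each trainer thread owns one shard of data. All names here are
    hypothetical; this is not Paddle's trainer implementation.
    """
    n = len(shards)
    state = {"w": 0.0}      # shared parameter
    grads = [0.0] * n       # one gradient slot per trainer

    def apply_update():
        # Runs exactly once per step, after every trainer has arrived
        # at the barrier and before any of them is released.
        state["w"] -= lr * sum(grads) / n

    barrier = threading.Barrier(n, action=apply_update)

    def trainer(rank):
        xs = shards[rank]
        for _ in range(steps):
            w = state["w"]
            grads[rank] = sum(2.0 * (w - x) for x in xs) / len(xs)
            barrier.wait()  # block until all trainers finish this step

    threads = [threading.Thread(target=trainer, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["w"]
```

With two trainers holding `[1.0]` and `[3.0]`, the averaged gradient drives `w` toward 2.0, and no trainer can start step `k + 1` before the slowest one has finished step `k`.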

@helinwang helinwang Jun 2, 2017

Here are some terms we usually use:

  • step: one forward-backward step, which computes the gradient.
  • mini-batch: several data instances used in a single step.
  • task: multiple mini-batches; the master server assigns tasks to trainers.
  • pass: all training data, consisting of multiple tasks.
  • epoch: the start of a new pass.

In this line, "epoch" is used with "mini-batch". I think by "epoch" you actually mean "step"?
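
To make the terms above concrete, here is a small sketch of how a data set might be cut into mini-batches and grouped into tasks. The helper names and the fixed-size grouping are assumptions made for illustration; the thread does not specify how the master actually partitions data.

```python
def make_minibatches(data, batch_size):
    """Split the data set into mini-batches (one mini-batch per step)."""
    return [list(data[i:i + batch_size])
            for i in range(0, len(data), batch_size)]


def make_tasks(minibatches, batches_per_task):
    """Group mini-batches into tasks, the unit the master hands out."""
    return [minibatches[i:i + batches_per_task]
            for i in range(0, len(minibatches), batches_per_task)]


def steps_in_pass(tasks):
    """A pass covers every task once; each mini-batch costs one step."""
    return sum(len(task) for task in tasks)
```

For ten data instances, a mini-batch size of 2, and two mini-batches per task, this yields five mini-batches, three tasks (the last one short), and five steps per pass.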


<img src="src/paddle-trainer.png" width="600"/>

To wait for the other trainers in the same epoch, use waitEpochFinish to decide whether an epoch has finished before entering the next training epoch.

The trainer does not need to know about epochs (the start of a new pass); it just gets tasks from the master. So I think waitEpochFinish is not necessary.
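
A rough sketch of that alternative, where the trainer only pulls tasks and the master alone tracks passes, could look like the following. `Master`, `get_task`, and `run_trainer` are hypothetical names for an in-process stand-in, not the real master server interface, and the sketch is single-threaded: a real master would also wait for outstanding tasks to finish before starting a new pass.

```python
import queue


class Master:
    """Hypothetical in-process stand-in for the master server."""

    def __init__(self, tasks, num_passes):
        self._tasks = list(tasks)
        self._passes_left = num_passes
        self._todo = queue.Queue()
        self._refill()

    def _refill(self):
        # A new pass starts here, on the master side only.
        if self._passes_left > 0:
            self._passes_left -= 1
            for task in self._tasks:
                self._todo.put(task)

    def get_task(self):
        """Return the next task, or None when all passes are done."""
        if self._todo.empty():
            self._refill()
        try:
            return self._todo.get_nowait()
        except queue.Empty:
            return None


def run_trainer(master, train_step):
    """Pull tasks until the master has none left; no epoch bookkeeping."""
    steps = 0
    while True:
        task = master.get_task()
        if task is None:
            break
        for minibatch in task:
            train_step(minibatch)
            steps += 1
    return steps
```

The trainer loop never mentions passes or epochs; it stops only when `get_task` returns `None`.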


## Event Handler

To select the trainer that processes Python client events, use the same approach as for parameter initialization: every trainer tries to acquire a distributed lock, and the one that gets it is elected leader. The leader trainer keeps writing to a file or sending metric data to the evaluatorServer, so the Python client can use that data to draw metrics in real time.
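
The lock-based election described above can be sketched in a single process as follows. A `threading.Lock` stands in for the distributed lock, threads stand in for trainers, and `elect_leader` is a hypothetical name; the actual lock service and the evaluatorServer protocol are not specified here.

```python
import threading


def elect_leader(num_trainers):
    """Return the ranks that won the election (exactly one is expected).

    threading.Lock is only a stand-in for a distributed lock; all names
    in this sketch are hypothetical.
    """
    lock = threading.Lock()
    leaders = []

    def trainer(rank):
        # Non-blocking attempt: the first trainer to succeed becomes
        # leader. The lock is never released, so the winner keeps the
        # role for the whole job.
        if lock.acquire(blocking=False):
            leaders.append(rank)

    threads = [threading.Thread(target=trainer, args=(r,))
               for r in range(num_trainers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return leaders
```

Only the elected trainer would then write the metrics file or send metric data; the others skip that work.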

@helinwang helinwang Jun 2, 2017

Maybe it is too early to put the "Event Handler" section into a design doc (we have not reached consensus yet).
Please see: #2364 (comment)

@dzhwinter dzhwinter closed this Aug 14, 2017