24 changes: 24 additions & 0 deletions doc/design/cluster_train/trainer.md
## Design Doc: Trainer

For the trainer's role in the whole system, please refer to the [distributed training design doc](./README.md).

This design doc focuses only on the synchronization between the master server and trainers, and on Python client event processing. For the task dispatch interface, please refer to [master_server](./master_server.md), [data_dispatch](./data_dispatch.md) and so on.

## Synchronous SGD

In synchronous SGD, each trainer needs to wait for all other nodes to finish training on the current mini-batch, and must not proceed to the next step while any node lags behind.
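The per-step barrier described above can be sketched in Go. This is a minimal simulation, not the real trainer: the `runSynchronousSGD` name, trainer goroutines, and counts are all illustrative assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// runSynchronousSGD simulates numTrainers trainers running numSteps
// synchronous steps: within each step, no trainer may move on until
// every trainer has finished computing its gradient.
// It returns the total number of completed trainer-steps.
func runSynchronousSGD(numTrainers, numSteps int) int {
	completed := 0
	for step := 0; step < numSteps; step++ {
		var wg sync.WaitGroup
		var mu sync.Mutex
		for t := 0; t < numTrainers; t++ {
			wg.Add(1)
			go func(id int) {
				defer wg.Done()
				// forward/backward computation for one mini-batch (omitted)
				mu.Lock()
				completed++
				mu.Unlock()
			}(t)
		}
		wg.Wait() // barrier: wait for every trainer before the next step
	}
	return completed
}

func main() {
	fmt.Println(runSynchronousSGD(3, 2)) // prints 6
}
```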
Contributor @helinwang commented on Jun 2, 2017:

Here are some terms we usually use:

  • step: one forward-backward step; computes the gradient.
  • mini-batch: several data instances used in a single step.
  • task: multiple mini-batches; the master server assigns tasks to trainers.
  • pass: all training data, consisting of multiple tasks.
  • epoch: the start of a new pass.

In this line, "epoch" is used with "mini-batch", I think by "epoch" you actually mean "step"?


To wait for the other trainers in the same training mini-batch, the trainer's call to get_params blocks until the pserver has finished the model update.

<img src="src/paddle-trainer.png" width="600"/>
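A minimal sketch of this blocking behavior, using a condition variable; `PServer`, `FinishUpdate`, and `GetParams` here are stand-ins for illustration, not the real pserver interface:

```go
package main

import (
	"fmt"
	"sync"
)

// PServer is a stand-in parameter server: GetParams must not return
// until the model update for the current mini-batch has finished.
type PServer struct {
	mu      sync.Mutex
	cond    *sync.Cond
	updated bool
	params  []float64
}

func NewPServer(params []float64) *PServer {
	p := &PServer{params: params}
	p.cond = sync.NewCond(&p.mu)
	return p
}

// FinishUpdate marks the model update as done and wakes blocked callers.
func (p *PServer) FinishUpdate() {
	p.mu.Lock()
	p.updated = true
	p.mu.Unlock()
	p.cond.Broadcast()
}

// GetParams blocks until the pserver has finished the model update.
func (p *PServer) GetParams() []float64 {
	p.mu.Lock()
	defer p.mu.Unlock()
	for !p.updated {
		p.cond.Wait()
	}
	return p.params
}

func main() {
	ps := NewPServer([]float64{0.1, 0.2})
	go ps.FinishUpdate() // pserver finishes the update asynchronously
	fmt.Println(ps.GetParams()) // prints [0.1 0.2]
}
```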

To wait for the other trainers in the same epoch, the trainer uses waitEpochFinish to decide whether an epoch has finished before entering the next training epoch.
Contributor commented:

The trainer does not need to know about epochs (the start of a new pass); it just gets tasks from the master. So I think waitEpochFinish is not necessary.


```go
// Master Service
// waitEpochFinish blocks the caller until the given epoch has finished.
func (s *Service) waitEpochFinish(dummy int, epochID *int) error
```
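For illustration, here is a self-contained sketch of a trainer calling such an RPC through Go's net/rpc. Note that net/rpc requires an exported method name and puts the request argument first and the reply pointer second, so the prototype is adapted accordingly; `WaitEpochFinish`, the `Service` fields, and the loopback setup are all assumptions for this sketch.

```go
package main

import (
	"fmt"
	"net"
	"net/rpc"
)

// Service is a stand-in for the master service.
type Service struct {
	epochDone chan struct{} // closed when the epoch has finished
}

// WaitEpochFinish blocks the calling trainer until the epoch finishes.
func (s *Service) WaitEpochFinish(epochID int, dummy *int) error {
	<-s.epochDone
	return nil
}

func main() {
	svc := &Service{epochDone: make(chan struct{})}
	srv := rpc.NewServer()
	if err := srv.Register(svc); err != nil {
		panic(err)
	}
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	go srv.Accept(ln)

	client, err := rpc.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	close(svc.epochDone) // pretend all trainers finished the epoch
	var dummy int
	if err := client.Call("Service.WaitEpochFinish", 1, &dummy); err != nil {
		panic(err)
	}
	fmt.Println("epoch finished, trainer may proceed")
}
```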

## Event Handler

To select the trainer that processes Python client events, the same approach as parameter initialization is used: every trainer tries to acquire a distributed lock, and the winner is elected leader. The leader trainer keeps writing a file / sending metric data to the evaluatorServer, so that the Python client can use that data to draw metrics in real time.
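The leader-election step might be sketched as follows; a local `sync.Mutex` with `TryLock` stands in for a real distributed lock service (e.g. one backed by etcd), and `electLeader` and the trainer ids are illustrative assumptions:

```go
package main

import (
	"fmt"
	"sync"
)

// electLeader races numTrainers candidates for a single lock and
// returns the id of the one trainer that acquired it. The lock is
// never released, so exactly one candidate can win.
func electLeader(numTrainers int) int {
	var lock sync.Mutex // stand-in for the distributed lock
	var wg sync.WaitGroup
	leader := make(chan int, 1)
	for id := 0; id < numTrainers; id++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			if lock.TryLock() {
				leader <- id // exactly one trainer wins the election
			}
			// losers keep training; only the leader reports metrics
		}(id)
	}
	wg.Wait()
	return <-leader
}

func main() {
	leader := electLeader(3)
	fmt.Printf("trainer %d is the leader and will send metric data\n", leader)
}
```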
Contributor @helinwang commented on Jun 2, 2017:

Maybe "Event Handler" section is too early to be put into a design doc (we have not reached consensus yet).
Please see: #2364 (comment)