doc/design/cluster_train/README.md (12 additions, 13 deletions)
@@ -54,17 +54,18 @@ The life cycle of a single task is illustrated below:
<img src="src/paddle-task-states.png"/>

1. When a new pass of training starts, all tasks will be placed in the todo queue.
- 1. The master server will dispatch a few tasks to each trainer at a time, put them in the pending queue, and wait for completion.
- 1. The trainer will work on its tasks and tell the master server once a task is completed. The master server will dispatch a new task to that trainer.
- 1. If a task times out, the master server will move it back to the todo queue. The timeout count will increase by one. If the timeout count is above a threshold, the task is likely to cause a trainer to crash, so it will be discarded.
+ 1. When a trainer requests a new task, the master server will dispatch a task from the todo queue to it, put the task in the pending queue, and wait for completion.
+ 1. The trainer will work on its task, tell the master server once the task is completed, and ask for a new task. The master server will then dispatch a new task to that trainer.
+ 1. If a task fails for any reason in the trainer, or takes longer than a specific period of time, the master server will move the task back to the todo queue. The timeout count for that task will increase by one. If the timeout count is above a threshold, the task is likely to cause a trainer to crash, so it will be discarded.
1. The master server will move each completed task to the done queue. When the todo queue is empty, the master server will start a new pass by moving all tasks in the done queue to the todo queue and resetting the timeout counter of all tasks to zero.
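
The task life cycle above can be sketched as a small in-memory state machine. This is only an illustration under stated assumptions, not the master's actual implementation: the class and method names, the default timeout, and the threshold are hypothetical, and the sketch waits for the pending queue to drain (in addition to the todo queue being empty) before starting a new pass.

```python
import time
from collections import deque
from dataclasses import dataclass


@dataclass
class Task:
    task_id: int
    timeout_count: int = 0


class TaskQueues:
    """In-memory sketch of the todo, pending, and done queues (hypothetical names)."""

    def __init__(self, task_ids, timeout_sec=60.0, max_timeouts=3):
        self.todo = deque(Task(i) for i in task_ids)
        self.pending = {}      # task_id -> (Task, time it was dispatched)
        self.done = []
        self.discarded = []
        self.timeout_sec = timeout_sec    # assumed value
        self.max_timeouts = max_timeouts  # assumed threshold
        self.pass_count = 0

    def dispatch(self, now=None):
        """Give one todo task to a requesting trainer; None if todo is empty."""
        if not self.todo:
            return None
        task = self.todo.popleft()
        self.pending[task.task_id] = (task, time.time() if now is None else now)
        return task

    def finish(self, task_id):
        """Trainer reported success: move the task from pending to done."""
        task, _ = self.pending.pop(task_id)
        self.done.append(task)
        self._maybe_start_new_pass()

    def fail(self, task_id):
        """Task failed or timed out: back to todo, or discard past the threshold."""
        task, _ = self.pending.pop(task_id)
        task.timeout_count += 1
        if task.timeout_count > self.max_timeouts:
            self.discarded.append(task)
        else:
            self.todo.append(task)
        self._maybe_start_new_pass()

    def expire(self, now=None):
        """Fail every pending task that has been out longer than timeout_sec."""
        now = time.time() if now is None else now
        late = [tid for tid, (_, t0) in self.pending.items()
                if now - t0 > self.timeout_sec]
        for tid in late:
            self.fail(tid)

    def _maybe_start_new_pass(self):
        # Assumption: also wait for pending tasks before starting a new pass.
        if self.todo or self.pending:
            return
        for task in self.done:
            task.timeout_count = 0
        self.todo.extend(self.done)
        self.done = []
        self.pass_count += 1
```

A master built this way would call `dispatch()` on each trainer request, `finish()` or `fail()` on each trainer report, and `expire()` periodically to catch tasks that never report back.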

### Trainer Process

The trainer process will:

- - Receive tasks from the master.
- - Work on the tasks: calculate and upload gradients to parameter servers, and update the local model by downloading new parameters from parameter servers.
+ - Request tasks from the master.
+ - Work on the tasks.
+ - Upload gradients to parameter servers, and update the local model by downloading new parameters from parameter servers.
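
The trainer loop described by these bullets can be sketched as follows. This is a rough illustration only: the five callables are hypothetical stand-ins for RPCs to the master and the parameter servers, and the design does not specify whether the loop ends when no task is returned.

```python
def run_trainer(get_task, compute_gradient, push_gradient, pull_parameters,
                report_done):
    """Request tasks one at a time until the master has none left.

    All five arguments are hypothetical callables:
      get_task()                  -> a task, or None when nothing is available
      compute_gradient(params, t) -> gradient for task t
      push_gradient(grad)         -> upload the gradient to parameter servers
      pull_parameters()           -> download the current parameters
      report_done(t)              -> tell the master that task t is completed
    """
    params = pull_parameters()
    while True:
        task = get_task()          # request a task from the master
        if task is None:
            break
        grad = compute_gradient(params, task)  # work on the task
        push_gradient(grad)                    # upload the gradient
        params = pull_parameters()             # refresh the local model
        report_done(task)                      # then ask for the next task
    return params
```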

### Parameter Server Process

@@ -119,22 +120,20 @@ When the master is started by the Kubernetes, it executes the following steps at
1. Grabs a unique *master* lock in etcd, which prevents concurrent master instantiations.
1. Recovers the task queues from etcd if they already exist; otherwise, the master will create them.
- 1. Watches the trainer prefix keys `/trainer/` on etcd to find the live trainers.
- 1. Starts dispatching the tasks to the trainers, and updates the task queue using an etcd transaction to ensure the lock is held during the update.
+ 1. Writes its IP address to */master/addr* so that trainers can discover it.
+ 1. Listens for task requests from trainers, dispatches one task per request, and updates the task queue using an etcd transaction to ensure the lock is held during the update.
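
The master startup steps above can be sketched against a tiny in-memory stand-in for etcd. Everything here is an assumption made for illustration: `FakeEtcd` is not a real etcd client API, the key names `/master/lock` and `/master/queues` are hypothetical (only `/master/addr` appears in the design), and a real lock would typically be tied to a lease so it disappears when the master dies.

```python
import json

LOCK_KEY = "/master/lock"      # hypothetical key name
QUEUE_KEY = "/master/queues"   # hypothetical key name
ADDR_KEY = "/master/addr"      # key named in the design


class FakeEtcd:
    """Tiny in-memory stand-in for etcd; NOT a real client API."""

    def __init__(self):
        self.kv = {}

    def put_if_absent(self, key, value):
        """Create key only if it does not exist; report whether we created it."""
        if key in self.kv:
            return False
        self.kv[key] = value
        return True

    def txn_put_if_equal(self, guard_key, guard_value, key, value):
        """Write key only while guard_key still holds guard_value."""
        if self.kv.get(guard_key) != guard_value:
            return False
        self.kv[key] = value
        return True


def start_master(etcd, master_id, addr, task_ids):
    # Step 1: grab the unique master lock.
    if not etcd.put_if_absent(LOCK_KEY, master_id):
        raise RuntimeError("another master instance holds the lock")
    # Step 2: recover the task queues, or create them on first start.
    saved = etcd.kv.get(QUEUE_KEY)
    if saved is None:
        queues = {"todo": list(task_ids), "pending": [], "done": []}
        etcd.kv[QUEUE_KEY] = json.dumps(queues)
    else:
        queues = json.loads(saved)
    # Step 3: publish the address so trainers can discover the master.
    etcd.kv[ADDR_KEY] = addr
    return queues


def dispatch_one(etcd, master_id, queues):
    """Step 4: hand out one task; persist queues only if the lock is still ours."""
    if not queues["todo"]:
        return None
    task = queues["todo"][0]
    updated = {
        "todo": queues["todo"][1:],
        "pending": queues["pending"] + [task],
        "done": queues["done"],
    }
    if not etcd.txn_put_if_equal(LOCK_KEY, master_id, QUEUE_KEY,
                                 json.dumps(updated)):
        raise RuntimeError("lost the master lock; refusing to update queues")
    queues.update(updated)
    return task
```

Guarding the queue write on the lock value mirrors the design's requirement that the lock is held during the update: a master that has lost the lock cannot overwrite state owned by its replacement.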

When the master server process dies for any reason, Kubernetes will restart it. It will be back online, with all state recovered from etcd, within a few minutes.

### Trainer Process

When the trainer is started by Kubernetes, it executes the following steps at startup:

- 1. Watches the available parameter server prefix keys `/ps/` on etcd and waits until the count of parameter servers reaches the desired count.
- 1. Generates a unique ID, and sets key `/trainer/<unique ID>` with its contact address as value. The key will be deleted when the lease expires, so the master will be aware of the trainer being online and offline.
- 1. Waits for tasks from the master to start training.
+ 1. Watches the available parameter server prefix keys `/ps/` on etcd and waits until the count of parameter servers reaches the desired count */ps_desired*.
+ 1. Finds and watches */master/addr* to get the master's address.
+ 1. Requests tasks from the master to start training.
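
The trainer startup steps above can be sketched as a simple readiness check. Note that this sketch polls instead of using etcd watches, purely to stay self-contained; `read_kv` is a hypothetical callable returning a snapshot of etcd keys as a dict, and storing the desired count as a decimal string under `/ps_desired` is an assumption.

```python
import time


def count_prefix(kv, prefix):
    """Count the keys in a snapshot that live under the given prefix."""
    return sum(1 for key in kv if key.startswith(prefix))


def wait_for_cluster(read_kv, poll_sec=1.0, max_polls=60):
    """Block until enough parameter servers are up and the master is known.

    read_kv is a hypothetical callable returning a dict snapshot of etcd.
    Returns the master's address so the trainer can start requesting tasks.
    """
    for _ in range(max_polls):
        kv = read_kv()
        desired = int(kv.get("/ps_desired", 0))   # assumed value format
        live = count_prefix(kv, "/ps/")           # registered parameter servers
        addr = kv.get("/master/addr")             # written by the master
        if desired > 0 and live >= desired and addr:
            return addr
        time.sleep(poll_sec)
    raise TimeoutError("cluster did not become ready in time")
```

Once `wait_for_cluster` returns, the trainer would connect to the returned address and begin requesting tasks.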

- If the trainer's etcd lease expires, it will try to set key `/trainer/<unique ID>` again so that the master server can discover the trainer again.
-
- When a trainer fails, Kubernetes would try to restart it. The recovered trainer would fetch tasks from the TODO queue and go on training.
+ When a trainer fails, Kubernetes would try to restart it. The recovered trainer would fetch tasks from the master and go on training.
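Docker runs on both Windows and macOS, but it actually runs inside a Linux virtual machine. You may need to allocate more CPU and memory to this virtual machine to keep compilation efficient. For details, see [this issue](https://github.com/PaddlePaddle/Paddle/issues/627).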