Commit ae647bf

Browse files
authored
Merge pull request #2111 from Yancey1989/submit-job-doc
Design doc: submit paddle job
# Submit a Distributed Training Job

The user can submit a distributed training job with Python code, rather than with a command-line interface.

## Runtime Environment On Kubernetes

For a distributed training job, there are two Docker images: the *runtime Docker image* and the *base Docker image*. The runtime Docker image is the image that Kubernetes schedules to run during training. The base Docker image is used to build the runtime Docker image.

### Base Docker Image

Usually, the base Docker image is the PaddlePaddle production Docker image, which includes the Paddle binary files and the Python package. Of course, users can specify any image name hosted on any Docker registry to which they have access.

### Runtime Docker Image

The trainer package that the user uploads and some Python dependencies are packaged into a runtime Docker image based on the base Docker image.

- Handle Python Dependencies

  You need to provide a `requirements.txt` file in your `trainer-package` folder. Example:
```txt
pillow
protobuf==3.1.0
```

More [details](https://pip.readthedocs.io/en/1.1/requirements.html) about the requirements file format are in the pip documentation. An example project looks like:
```bash
paddle_example
|-quick_start
  |-trainer.py
  |-dataset.py
  |-requirements.txt
```
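The design above bundles the trainer package and its `requirements.txt` into the runtime image. As a rough sketch of what the build step could do (the `/workspace` path and the `runtime_dockerfile` helper are hypothetical, not the actual Paddle Cloud layout), the build job might generate a Dockerfile like this:

```python
def runtime_dockerfile(base_image, package_name):
    """Sketch: generate a Dockerfile for the runtime image.

    Starts from the base image, copies the trainer package in, and installs
    the Python dependencies declared in its requirements.txt.
    """
    lines = [
        "FROM %s" % base_image,
        "COPY %s /workspace/%s" % (package_name, package_name),
        "RUN pip install -r /workspace/%s/requirements.txt" % package_name,
    ]
    return "\n".join(lines)
```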

## Submit Distributed Training Job With Python Code

<img src="./src/submit-job.png" width="800">

- `paddle.job.dist_train()` will call the Job Server API `/v1/packages` to upload the trainer package and save it on CephFS, and then call `/v1/trainer/job` to submit the PaddlePaddle distributed job.
- `/v1/trainer/job` will start a build job to prepare the runtime Docker image. When the build job is finished, the Job Server will submit the PaddlePaddle distributed job to Kubernetes.
- *NOTE*: For the first version, we will not prepare the runtime Docker image; instead, the package is uploaded to Paddle Cloud, and Paddle Cloud will mount the package in a temporary folder into the base Docker image. We will not support custom Python dependencies in the first version either.
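The upload half of this flow can be sketched with the Python standard library. This is only an illustration: the archive format, content type, and the `pack_trainer_package` / `build_upload_request` helpers are assumptions, not the actual client implementation.

```python
import io
import os
import tarfile
import urllib.request


def pack_trainer_package(package_dir):
    """Pack the trainer package directory into in-memory tar.gz bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        tar.add(package_dir, arcname=os.path.basename(package_dir))
    return buf.getvalue()


def build_upload_request(job_server, package_bytes):
    """Build (but do not send) the POST request for the /v1/packages API."""
    return urllib.request.Request(
        url="%s/v1/packages" % job_server,
        data=package_bytes,
        headers={"Content-Type": "application/gzip"},
        method="POST",
    )
```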
You can call `paddle.job.dist_train` and provide the distributed training configuration as parameters:
```python
paddle.job.dist_train(
    trainer=dist_trainer(),
    paddle_job=PaddleJob(
        job_name="paddle-cloud",
        entry_point="python %s" % __file__,
        trainer_package="/example/word2vec",
        image="yancey1989/paddle-job",
        trainers=10,
        pservers=3,
        trainer_cpu=1,
        trainer_gpu=1,
        trainer_mem="10G",
        pserver_cpu=1,
        pserver_mem="2G"))
```

The parameter `trainer` of `paddle.job.dist_train` is a function, which you can implement as follows:
```python
def dist_trainer():
    def trainer_creator():
        trainer = paddle.v2.trainer.SGD(...)
        trainer.train(...)
    return trainer_creator
```

The pseudocode of `paddle.job.dist_train` is as follows:
```python
def dist_train(trainer, paddle_job):
    # on the cloud, the environment variable RUNNING_ON_CLOUD is set to "YES"
    if os.getenv("RUNNING_ON_CLOUD", "NO") == "NO":
        # running locally: submit the paddle job to the cloud
        paddle_job.submit()
    else:
        # running on the cloud: start the training
        trainer()
```
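The environment-variable gate in the pseudocode can be factored into a small testable helper. `RUNNING_ON_CLOUD` is the variable name used above; the `should_submit` helper name is illustrative.

```python
import os


def should_submit(environ=os.environ):
    """True when running locally, i.e. the job must be submitted to the cloud."""
    return environ.get("RUNNING_ON_CLOUD", "NO") == "NO"
```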

### PaddleJob Parameters

parameter | type | explanation
--- | --- | ---
job_name | str | the unique name of the training job
entry_point | str | entry point for starting the trainer process
trainer_package | str | trainer package file path to which the user has access
image | str | the [base image](#base-docker-image) for building the [runtime image](#runtime-docker-image)
pservers | int | Parameter Server process count
trainers | int | Trainer process count
pserver_cpu | int | CPU count for each Parameter Server process
pserver_mem | str | memory allocated for each Parameter Server process; a plain integer with one of these suffixes: E, P, T, G, M, K
trainer_cpu | int | CPU count for each Trainer process
trainer_mem | str | memory allocated for each Trainer process; a plain integer with one of these suffixes: E, P, T, G, M, K
trainer_gpu | int | GPU count for each Trainer process; if you only want CPU, do not set this parameter
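As a sketch of how the memory quantities above could be interpreted (the `parse_mem` helper is hypothetical; the decimal suffix values mirror Kubernetes resource quantities):

```python
# Decimal suffixes accepted by pserver_mem / trainer_mem, as in Kubernetes
# resource quantities ("10G" means 10 * 10^9 bytes).
_SUFFIXES = {"K": 10**3, "M": 10**6, "G": 10**9,
             "T": 10**12, "P": 10**15, "E": 10**18}


def parse_mem(quantity):
    """Convert a quantity string such as "10G" or "512M" to a byte count."""
    if quantity[-1] in _SUFFIXES:
        return int(quantity[:-1]) * _SUFFIXES[quantity[-1]]
    return int(quantity)  # a bare integer is already a byte count
```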

### Deploy Parameter Server, Trainer and Master Process
- Deploy the PaddlePaddle Parameter Server processes as a Kubernetes ReplicaSet.
- Deploy the PaddlePaddle Trainer processes as a Kubernetes Job.
- Deploy the PaddlePaddle Master processes as a Kubernetes ReplicaSet.

## Job Server

- RESTful API

  The Job Server provides a RESTful HTTP API for receiving the trainer package and displaying PaddlePaddle job related information.
  - `POST /v1/package` receives the trainer package and saves it on CephFS
  - `POST /v1/trainer/job` submits a trainer job
  - `GET /v1/jobs/` lists all jobs
  - `GET /v1/jobs/<job-name>` gets the status of a job
  - `DELETE /v1/jobs/<job-name>` deletes a job
  - `GET /v1/version` gets the Job Server version
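The endpoint list above can be expressed as a small dispatch table. This is a sketch of the routing only, not the Job Server implementation; the handler names are placeholders.

```python
import re

# One row per endpoint: HTTP method, path pattern, handler name (placeholder).
ROUTES = [
    ("POST",   re.compile(r"^/v1/package$"),             "receive_package"),
    ("POST",   re.compile(r"^/v1/trainer/job$"),         "submit_job"),
    ("GET",    re.compile(r"^/v1/jobs/$"),               "list_jobs"),
    ("GET",    re.compile(r"^/v1/jobs/(?P<job>[^/]+)$"), "job_status"),
    ("DELETE", re.compile(r"^/v1/jobs/(?P<job>[^/]+)$"), "delete_job"),
    ("GET",    re.compile(r"^/v1/version$"),             "version"),
]


def dispatch(method, path):
    """Return (handler_name, path_params) for a request, or (None, {})."""
    for route_method, pattern, handler in ROUTES:
        match = pattern.match(path)
        if route_method == method and match:
            return handler, match.groupdict()
    return None, {}
```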

- Build Runtime Docker Image on Kubernetes

  `paddle.job.dist_train` will upload the trainer package to the Job Server, save it on the distributed filesystem, and then start a job that builds the runtime Docker image, which Kubernetes schedules to run during training.

  There are some benefits to building the runtime Docker image on the Job Server:
  - On Paddle Cloud, users run the trainer code in a Jupyter Notebook, which is a Kubernetes Pod. If we executed `docker build` in the Pod, we would have to mount the host's `docker.sock` into the Pod, and the user's code would then connect to the host's Docker Engine directly, which is not safe.
  - Users only need to upload the trainer package files; they do not need to install a Docker engine or a Docker registry as dependencies.
  - If we switch to another image type, such as rkt, users do not need to care about it.

- Deploy Parameter Server, Trainer and Master Processes

  `POST /v1/trainer/job` receives the distributed training parameters and deploys the job as follows:
  - Deploy the PaddlePaddle Parameter Server processes as a Kubernetes ReplicaSet.
  - Deploy the PaddlePaddle Trainer processes as a Kubernetes Job.
  - Deploy the PaddlePaddle Master processes as a Kubernetes ReplicaSet.
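The deployment steps above can be sketched as functions that translate PaddleJob parameters into Kubernetes resource specs. The API versions and spec layout here are illustrative assumptions; the real controller would also fill in images, ports, and environment variables.

```python
def pserver_replicaset(job_name, pservers, pserver_cpu, pserver_mem):
    """Sketch: a ReplicaSet spec for the Parameter Server processes."""
    return {
        "apiVersion": "extensions/v1beta1",
        "kind": "ReplicaSet",
        "metadata": {"name": "%s-pserver" % job_name},
        "spec": {
            "replicas": pservers,
            "template": {"spec": {"containers": [{
                "name": "pserver",
                "resources": {"requests": {
                    "cpu": str(pserver_cpu), "memory": pserver_mem}},
            }]}},
        },
    }


def trainer_job(job_name, trainers, trainer_cpu, trainer_mem):
    """Sketch: a Job spec for the Trainer processes."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": "%s-trainer" % job_name},
        "spec": {
            "parallelism": trainers,
            "completions": trainers,
            "template": {"spec": {"containers": [{
                "name": "trainer",
                "resources": {"requests": {
                    "cpu": str(trainer_cpu), "memory": trainer_mem}},
            }]}},
        },
    }
```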
