Skip to content

Conversation

@Omkaka09
Copy link
Collaborator

@Omkaka09 Omkaka09 commented Jan 23, 2018

hi, @typhoonzero @Yancey1989 , I'd like you to review this issue. In this issue. This is the design doc for paddlepaddle cluster job.

Copy link
Collaborator

@Yancey0623 Yancey0623 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the later review, and how about writing a design doc to describe how to implement a TrainingJobControler and how to implement the simple state machine?

@@ -0,0 +1,62 @@
// trainingjober is a package for managing a TrainingJob
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I don't think trainingjober is a suitable name, we need a Controller to manager the TrainingJob, so how about rename TrainingJober to TrainingJobController ?

type trainingJobEvent struct {
// pet is the TrainingJobEventType of TrainingJob
pet trainingJobEventType
// The job transfer the information fo job
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The job transfer the information fo job , Do you mean job is the reference to TrainingJob ?

job *v1alpha1.TrainingJob
}

// TrainingJober is to manager a specific TrainingJob
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need a TrainingjobController to manager a TrainingJob.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a controller which @m3ngyang committed in another PR #565. We use controller to manager all TrainingJob of a specific cluster and use TrainingJober to manager a specific TrainingJob.

@Omkaka09
Copy link
Collaborator Author

Omkaka09 commented Jan 29, 2018

@Yancey1989 ok~ I will write a design doc to describe the definition of Controller、TrainingJober、TrainingJob and the state machine of a TrainingJob.

@Omkaka09 Omkaka09 changed the title add trainingjober struct Design Doc for PaddlePaddle TrainingJob k8s CRD Feb 5, 2018
@@ -0,0 +1,172 @@
# Design Doc for PaddlePaddle TrainingJob k8s CRD
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Design docs must put under docs/design

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your advice, I will put it under docs/design.

The whole life cycle of TrainingJob is managed through the two layer control of Controller and TrainingJober. As
shown in the following figure:

![image](https://github.com/qizheng09/figure/blob/master/paddlecloud/overall_design.png?raw=true)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put the image in this repo. And, if you are using tools like "OmniGraffle" or something like that, you should put the source file in the repo too, so we can modify the graph later for updates.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This figure should have some explanation of how it runs and what components it consists of.


Here is an example yaml for a PaddlePaddle cluster job.

```yaml
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please try to merge this to current design: https://github.com/PaddlePaddle/cloud/blob/develop/doc/edl/README.md, so we only have one total design here.

image: "paddlepaddle/paddlecloud-job"
# base port of pserver
port: 7164
# ports num default 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These comments seem not necessary, we can explain the details of the yaml file in another markdown file.


The example will create a PaddlePaddle cluster job with a master, 2 pserver and 2 trainer.

### TrainingJober
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TrainingJober is a confusing name.

@m3ngyang
Copy link
Member

m3ngyang commented Feb 8, 2018

too many commits for one pr, you'd better rebase and merge the commits.

A PaddlePaddle training job contains several trainer instances,
several parameter server instances, and one master instance. We would
like to automatically scale the number of training instances and the
several parameter server instances, and one master instance.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Background section should only tell us about "Why we need to do this", designs and thoughts need to put to bottom sections.

- Create and delete TrainingJob
- TrainingJob life cycle management

**The goals of Autoscale Controller including:**
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can call the controller "controller", and the Autoscaler is not a Kubernetes controller actually, It's just a goroutine running at the background to do autoscaling.


```yaml
apiVersion: paddlepaddle.org/v1
apiVersion: paddlepaddle.org/v1alpha1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use paddlepaddle.org/v1 so some code won't need big changes.

v1.9.
![](state_machine.jpg)

We will support TPR first, because some of our clusters is using
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may still need this line somewhere.

@Omkaka09
Copy link
Collaborator Author

@typhoonzero , I fix the comments you mentioned above. How do you think about this commit?

// TrainingJobUpdater is to manager a specific TrainingJob
type TrainingJobUpdater struct {
// job is the job the TrainingJobUpdater manager.
job *v1.TrainingJob
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code format problem


Kubrenetes provide [CustomResourceDefinition (CRD)](https://kubernetes.io/docs/concepts/api-extension/custom-resources/#custom-resources)and [custom controller](https://kubernetes.io/docs/concepts/api-extension/custom-resources/#custom-controllers)
we can use these feature to develop EDL controller, so it can flexiblly running in our
we can use these feature to develop EDL controller, so it can flexibly running in our
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

running -> run

Copy link
Collaborator

@typhoonzero typhoonzero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM++

@typhoonzero typhoonzero merged commit 002506f into PaddlePaddle:develop Feb 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

4 participants