-   Notifications  You must be signed in to change notification settings 
- Fork 79
Design Doc for PaddlePaddle TrainingJob k8s CRD #563
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the later review, and how about writing a design doc to describe how to implement a TrainingJobControler and how to implement the simple state machine?
| @@ -0,0 +1,62 @@ | |||
| // trainingjober is a package for managing a TrainingJob | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I don't think trainingjober is a suitable name, we need a Controller to manager the TrainingJob, so how about rename TrainingJober to TrainingJobController ?
| type trainingJobEvent struct { | ||
| // pet is the TrainingJobEventType of TrainingJob | ||
| pet trainingJobEventType | ||
| // The job transfer the information fo job | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The job transfer the information fo job , Do you mean job is the reference to TrainingJob ?
| job *v1alpha1.TrainingJob | ||
| } | ||
|  | ||
| // TrainingJober is to manager a specific TrainingJob | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need a TrainingjobController to manager a TrainingJob.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| @Yancey1989 ok~ I will write a design doc to describe the definition of Controller、TrainingJober、TrainingJob and the state machine of a TrainingJob. | 
   go/pkg/design_doc.md  Outdated    
 | @@ -0,0 +1,172 @@ | |||
| # Design Doc for PaddlePaddle TrainingJob k8s CRD | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Design docs must put under docs/design
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your advice, I will put it under docs/design.
   doc/design/traingjob_crd_design.md  Outdated    
 | The whole life cycle of TrainingJob is managed through the two layer control of Controller and TrainingJober. As | ||
| shown in the following figure: | ||
|  | ||
|  | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put the image in this repo. And, if you are using tools like "OmniGraffle" or something like that, you should put the source file in the repo too, so we can modify the graph later for updates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This figure should have some explanation of how it runs and what components it consists of.
   doc/design/traingjob_crd_design.md  Outdated    
 |  | ||
| Here is an example yaml for a PaddlePaddle cluster job. | ||
|  | ||
| ```yaml | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please try to merge this to current design: https://github.com/PaddlePaddle/cloud/blob/develop/doc/edl/README.md, so we only have one total design here.
   doc/design/traingjob_crd_design.md  Outdated    
 | image: "paddlepaddle/paddlecloud-job" | ||
| # base port of pserver | ||
| port: 7164 | ||
| # ports num default 1 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These comments seem not necessary, we can explain the details of the yaml file in another markdown file.
   doc/design/traingjob_crd_design.md  Outdated    
 |  | ||
| The example will create a PaddlePaddle cluster job with a master, 2 pserver and 2 trainer. | ||
|  | ||
| ### TrainingJober | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TrainingJober is a confusing name.
| too many commits for one pr, you'd better rebase and merge the commits. | 
| A PaddlePaddle training job contains several trainer instances, | ||
| several parameter server instances, and one master instance. We would | ||
| like to automatically scale the number of training instances and the | ||
| several parameter server instances, and one master instance. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Background section should only tell us about "Why we need to do this", designs and thoughts need to put to bottom sections.
   doc/edl/README.md  Outdated    
 | - Create and delete TrainingJob | ||
| - TrainingJob life cycle management | ||
|  | ||
| **The goals of Autoscale Controller including:** | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can call the controller "controller", and the Autoscaler is not a Kubernetes controller actually, It's just a goroutine running at the background to do autoscaling.
   doc/edl/README.md  Outdated    
 |  | ||
| ```yaml | ||
| apiVersion: paddlepaddle.org/v1 | ||
| apiVersion: paddlepaddle.org/v1alpha1 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use paddlepaddle.org/v1 so some code won't need big changes.
| v1.9. | ||
|  | ||
|  | ||
| We will support TPR first, because some of our clusters is using | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may still need this line somewhere.
| @typhoonzero , I fix the comments you mentioned above. How do you think about this commit? | 
Update EDL design doc
   doc/edl/README.md  Outdated    
 | // TrainingJobUpdater is to manager a specific TrainingJob | ||
| type TrainingJobUpdater struct { | ||
| // job is the job the TrainingJobUpdater manager. | ||
| job *v1.TrainingJob | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code format problem
   doc/edl/README.md  Outdated    
 |  | ||
| Kubrenetes provide [CustomResourceDefinition (CRD)](https://kubernetes.io/docs/concepts/api-extension/custom-resources/#custom-resources)and [custom controller](https://kubernetes.io/docs/concepts/api-extension/custom-resources/#custom-controllers) | ||
| we can use these feature to develop EDL controller, so it can flexiblly running in our | ||
| we can use these feature to develop EDL controller, so it can flexibly running in our | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
running -> run
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM++
hi, @typhoonzero @Yancey1989 , I'd like you to review this issue. In this issue. This is the design doc for paddlepaddle cluster job.