- Notifications
You must be signed in to change notification settings - Fork 79
Add TrainingJober. #521
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add TrainingJober. #521
Conversation
| | ||
| // Destroy destroys resource and pods. | ||
| func (c *TrainingJober) Destroy(job *paddlejob.TrainingJob) { | ||
| c.Complete(job) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
为何cleanupPserver和cleanupMaster 在Complete函数里面,而cleanupTrainer在Destroy函数里面,和Complete是并列关系?
3个cleanup看起来应该是并列关系?考虑删掉Complete,把3个cleanup都放在Destroy。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
我的思路是这样的,当TrainerJob完成之后,Master和Pserver应该被清除掉。因为他们占用了Pod资源,但是TrainerJob还不能被清除掉,用户还要看日志。而且TrainerJob完成之后就不占用了资源了。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
懂了,有道理!
| err := fmt.Errorf("trainerjob_err:%v master_err:%v pserver_err:%v", | ||
| terr, merr, perr) | ||
| log.Error(err.Error()) | ||
| return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
某一个出错,但是另外的可能创建成功了,需要cleanup。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| return err | ||
| } | ||
| | ||
| if m == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
可能出现err == nil而同时m == nil的情况吗?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这种的也是应该创建吧?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
哦,我的意思是要是不会出现就不需要处理这个情况了。
明白了,是nil就代表还不存在,所以需要创建。
| log.Error(err.Error()) | ||
| return err | ||
| } | ||
| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里可能在create之前判断下job是否存在:
- 对于重复提交的情况,这里是不是可以判断下,trainer/pserver/master只要其中一个存在就报异常呢?
- 使用户使用
paddlectl kill {jobname}也有可能存在pod删不掉的情况,例如Contrainer删除超时等,所以这里可能还需要判断下Pod的存在情况。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
对于重复提交的情况,这里是不是可以判断下
后边判断了
trainer/pserver/master只要其中一个存在就报异常呢?使用户使用paddlectl kill {jobname}也有可能存在pod删不掉的情况,例如Contrainer删除超时等,所以这里可能还需要判断下Pod的存在情况。
这个判断应该在提交TrainingJob资源的地方做。controller的对用户来说是异步执行的。
我准备在paddlecloud里边加上类似的判断。
| if t == nil { | ||
| if err := c.createTrainer(job); err != nil { | ||
| if m == nil { | ||
| c.cleanupMaster(namespace, mname) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里没有很明白,为什么m==nil的时候需要clearnup master呢?m==nil不是表示master不存在,是不是不需要再cleanup了?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
m==nil 表示 这轮创建之前Master不存在, 但是能走到这里代表Master创建且成功了,所以当createTrainer出错的时候要清理。
隐含的是把 Tainer,Pserver,Master当做一个整体。并且,只清理自己创建的。不清理其他的。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
既然master已经在上一个if中创建成功了,那么么应该不会进入到m == nil分支中了吧? 所以看起来这里永远不会被执行
| c.cleanupTrainer(namespace, tname) | ||
| } | ||
| | ||
| if err := c.createPserver(job); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
如果创建PServer失败,应该对之前创建trainer和创建master的操作进行回滚。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry,放错地方了。还需要继续仔细测试。
Yancey0623 left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM++
Fix #462