Skip to content

Conversation

@gongweibao
Copy link
Collaborator

Fix #462

controller parse "TrainingJob" resource to paddle job, including master, pserver and trainer.

@gongweibao gongweibao changed the title Submittrainingjobs Add trainingjober. Dec 12, 2017
@gongweibao gongweibao changed the title Add trainingjober. Add TrainingJober. Dec 12, 2017

// Destroy destroys resource and pods.
func (c *TrainingJober) Destroy(job *paddlejob.TrainingJob) {
c.Complete(job)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

为何cleanupPservercleanupMaster Complete函数里面,而cleanupTrainerDestroy函数里面,和Complete是并列关系?

3个cleanup看起来应该是并列关系?考虑删掉Complete,把3个cleanup都放在Destroy

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我的思路是这样的,当TrainerJob完成之后,Master和Pserver应该被清除掉。因为他们占用了Pod资源,但是TrainerJob还不能被清除掉,用户还要看日志。而且TrainerJob完成之后就不占用了资源了。

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

懂了,有道理!

err := fmt.Errorf("trainerjob_err:%v master_err:%v pserver_err:%v",
terr, merr, perr)
log.Error(err.Error())
return err
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

某一个出错,但是另外的可能创建成功了,需要cleanup。

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

return err
}

if m == nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

可能出现err == nil而同时m == nil的情况吗?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这种的也是应该创建吧?

Copy link
Collaborator

@helinwang helinwang Dec 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

哦,我的意思是要是不会出现就不需要处理这个情况了。
明白了,是nil就代表还不存在,所以需要创建。

log.Error(err.Error())
return err
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里可能在create之前判断下job是否存在:

  • 对于重复提交的情况,这里是不是可以判断下,trainer/pserver/master只要其中一个存在就报异常呢?
  • 使用户使用paddlectl kill {jobname}也有可能存在pod删不掉的情况,例如Contrainer删除超时等,所以这里可能还需要判断下Pod的存在情况。
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

对于重复提交的情况,这里是不是可以判断下

后边判断了

trainer/pserver/master只要其中一个存在就报异常呢?使用户使用paddlectl kill {jobname}也有可能存在pod删不掉的情况,例如Contrainer删除超时等,所以这里可能还需要判断下Pod的存在情况。

这个判断应该在提交TrainingJob资源的地方做。controller的对用户来说是异步执行的。
我准备在paddlecloud里边加上类似的判断。

if t == nil {
if err := c.createTrainer(job); err != nil {
if m == nil {
c.cleanupMaster(namespace, mname)
Copy link
Collaborator

@Yancey0623 Yancey0623 Dec 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里没有很明白,为什么m==nil的时候需要clearnup master呢?m==nil不是表示master不存在,是不是不需要再cleanup了?

Copy link
Collaborator Author

@gongweibao gongweibao Dec 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m==nil 表示 这轮创建之前Master不存在, 但是能走到这里代表Master创建且成功了,所以当createTrainer出错的时候要清理。
隐含的是把 Tainer,Pserver,Master当做一个整体。并且,只清理自己创建的。不清理其他的。

Copy link
Collaborator

@Yancey0623 Yancey0623 Dec 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

既然master已经在上一个if中创建成功了,那么么应该不会进入到m == nil分支中了吧? 所以看起来这里永远不会被执行

c.cleanupTrainer(namespace, tname)
}

if err := c.createPserver(job); err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

如果创建PServer失败,应该对之前创建trainer和创建master的操作进行回滚。

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry,放错地方了。还需要继续仔细测试。

Copy link
Collaborator

@Yancey0623 Yancey0623 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM++

@gongweibao gongweibao merged commit a4292d5 into PaddlePaddle:develop Dec 14, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

3 participants