Skip to content

Conversation

@asm582
Copy link
Member

@asm582 asm582 commented Jul 27, 2022

Completion status is now represented as:

completionstatus: "Complete,Failed"

In the current design services and podgroups are ignored since they do not have completion status. below scenarios were tested:

Screen Shot 2022-07-27 at 12 45 27 PM

@asm582 asm582 changed the base branch from master to quota-management July 27, 2022 16:48
Copy link
Collaborator

@dmatch01 dmatch01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add a failure test case for handling non-pod creating generic items? The easiest way is to duplicate one of the new cases and change the service object to include completionstatus. This will ensure that any end user inadvertently adding this field will not force a controller failure. Thanks!

Comment on lines +1615 to +1636
//Set Appwrapper state to complete if all items in Appwrapper
//are completed
if awNew.Status.State != arbv1.AppWrapperStateCompleted {
cc.GetAllObjectsOwned(awNew)
}

if awNew.Status.State == arbv1.AppWrapperStateRunningHoldCompletion {
awNew.Status.QueueJobState = arbv1.AppWrapperCondRunningHoldCompletion
cond := GenerateAppWrapperCondition(arbv1.AppWrapperCondRunningHoldCompletion, v1.ConditionTrue, "SomeItemsCompleted", "")
awNew.Status.Conditions = append(awNew.Status.Conditions, cond)
awNew.Status.FilterIgnore = true // Update AppWrapperCondCompleted
cc.updateEtcd(awNew, "[syncQueueJob] setRunningHoldCompletion")
}

if awNew.Status.State == arbv1.AppWrapperStateCompleted {
awNew.Status.QueueJobState = arbv1.AppWrapperCondCompleted
cond := GenerateAppWrapperCondition(arbv1.AppWrapperCondCompleted, v1.ConditionTrue, "PodsCompleted", "")
awNew.Status.Conditions = append(awNew.Status.Conditions, cond)
awNew.Status.FilterIgnore = true // Update AppWrapperCondCompleted
cc.updateEtcd(awNew, "[syncQueueJob] setCompleted")
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All aw.Status.State changes are done in manageQueueJob(). Can we move the code to this fn. for consistency? The logic not quite intuitive as there are multiple queues involved thus a quick discussion would likely help here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An appwrapper is set to running state insyncQueueJob, newly added states are obtained after AW turns running, hence I added these in the same method

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.status.state is being updated in the call cc.GetAllObjectsOwned(awNew). All state changes are done in managequeuejob(). Suggest moving all the logic to managequeuejob() to be consistent. The syncQueueJob() only update the .status.queuejobstate field.

countCompletionRequired := 0
ignoredItems := 0
for _, genericItem := range cqj.Spec.AggrResources.GenericItems {
objectName := genericItem.GenericTemplate
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the generic item does not have the new completionstatus field, can we skip this logic for performance reasons? Please let me know your thoughts.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inside Appwrapper we are finding items of 3 types of items completionRequired, completedItems and ignoredItems, so we would need to iterate all the items I believe

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment response below.

Comment on lines +579 to +582
//ignore service and podgroups since these items do not have status
if status == "ignore" {
ignoredItems = ignoredItems + 1
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above regarding logic checking of new completionstatus

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we start ignoring items without completionstatus I think none of the ray jobs will complete with the current design.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct. We don't want to presume completion status unless the end user requests the AW generic item to be considered. This allows for end users to add run to completion objects that they specify. If the end user does not want the generic item to be considered for completion we have that flexibility. Please share your thoughts.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I think the above is not entirely true. please consider the case, if AW has job and deployment as two items and only job is tracked, we still set the status as RunningHoldCompletion. So non-compute items are treated differently than the compute items.

Comment on lines +467 to +472
//service and podgroup has no conditions return ALWAYS true
//TODO: make it configurable?
if gvk.GroupKind().Kind == "Service" || gvk.GroupKind().Kind == "PodGroup" {
return "ignore"
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be removed if generic item does not have the new field completionstatus correct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, service or podgroup does not have completion status. from our initial design the idea was to hold AW in RunningHoldCompletion if any item other than podgroup or service is not complete.
Screen Shot 2022-08-01 at 6 15 15 PM

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see the challenge here. As part of the AW pod phases (pending/running/completed...) are tracked. If there is a check of the pods when setting state, this would be sufficient. Iterating over each generic item that is not tracked would not be necessary. only when all tracked items indicate completionstatus the state should be changed (of which is being done). The change here is that the decision to set state to RunningHoldCompletion or Complete is when all pods associated to AW are in Failed/Completed phase states. These counts are already tracked in the AW and can be referenced from the .status field. Please let me know your thoughts. Thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please refer to this comment #192 (comment)

for _, job := range inEtcd.Items {
completionRequiredBlock := aw.CompletionStatus
if len(completionRequiredBlock) > 0 {
conditions := job.Object["status"].(map[string]interface{})["conditions"].([]interface{})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this logic is dealing with any object (including custom resources), should probably check to ensure there is a conditions field and likely to occur but also a check for status field

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have seen status field has other objects, for instance, status may have conditions array or loadbalancer object, does it make sense to stick with .status.conditions.type for now?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with not adding the logic but a new test case should be added to handle such and event. The test case should create an object without conditions that tracks. For instance, duplicate on of the test cases and add an addition generic item that is a new namespace that tracks (completionstatus). This is a failure test case so the controller should handle this problem gracefully.

@dmatch01 dmatch01 self-requested a review August 2, 2022 12:17
Copy link
Collaborator

@dmatch01 dmatch01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were quite a bit more format changes. If they can be reduced in this PR that would be helpful. Feel free to submit a separate PR with just format change which would be great! Thanks!

an item in Appwrapper should be a candidate to complete an appwrapper.
this field should contain conditions that make appwrapper complete, for instance :-
completionconditions could be "Complete", "Failed" to make appwrapper completes.
To compare status we look at level .status.conditions[0].type.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We look at all conditions not just the first one correct? Suggest using status.conditions[].type if that is correct.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we do look for all conditions.

Comment on lines +579 to +582
//ignore service and podgroups since these items do not have status
if status == "ignore" {
ignoredItems = ignoredItems + 1
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct. We don't want to presume completion status unless the end user requests the AW generic item to be considered. This allows for end users to add run to completion objects that they specify. If the end user does not want the generic item to be considered for completion we have that flexibility. Please share your thoughts.

countCompletionRequired := 0
ignoredItems := 0
for _, genericItem := range cqj.Spec.AggrResources.GenericItems {
objectName := genericItem.GenericTemplate
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment response below.

Comment on lines +1615 to +1636
//Set Appwrapper state to complete if all items in Appwrapper
//are completed
if awNew.Status.State != arbv1.AppWrapperStateCompleted {
cc.GetAllObjectsOwned(awNew)
}

if awNew.Status.State == arbv1.AppWrapperStateRunningHoldCompletion {
awNew.Status.QueueJobState = arbv1.AppWrapperCondRunningHoldCompletion
cond := GenerateAppWrapperCondition(arbv1.AppWrapperCondRunningHoldCompletion, v1.ConditionTrue, "SomeItemsCompleted", "")
awNew.Status.Conditions = append(awNew.Status.Conditions, cond)
awNew.Status.FilterIgnore = true // Update AppWrapperCondCompleted
cc.updateEtcd(awNew, "[syncQueueJob] setRunningHoldCompletion")
}

if awNew.Status.State == arbv1.AppWrapperStateCompleted {
awNew.Status.QueueJobState = arbv1.AppWrapperCondCompleted
cond := GenerateAppWrapperCondition(arbv1.AppWrapperCondCompleted, v1.ConditionTrue, "PodsCompleted", "")
awNew.Status.Conditions = append(awNew.Status.Conditions, cond)
awNew.Status.FilterIgnore = true // Update AppWrapperCondCompleted
cc.updateEtcd(awNew, "[syncQueueJob] setCompleted")
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.status.state is being updated in the call cc.GetAllObjectsOwned(awNew). All state changes are done in managequeuejob(). Suggest moving all the logic to managequeuejob() to be consistent. The syncQueueJob() only update the .status.queuejobstate field.

Comment on lines +467 to +472
//service and podgroup has no conditions return ALWAYS true
//TODO: make it configurable?
if gvk.GroupKind().Kind == "Service" || gvk.GroupKind().Kind == "PodGroup" {
return "ignore"
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see the challenge here. As part of the AW pod phases (pending/running/completed...) are tracked. If there is a check of the pods when setting state, this would be sufficient. Iterating over each generic item that is not tracked would not be necessary. only when all tracked items indicate completionstatus the state should be changed (of which is being done). The change here is that the decision to set state to RunningHoldCompletion or Complete is when all pods associated to AW are in Failed/Completed phase states. These counts are already tracked in the AW and can be referenced from the .status field. Please let me know your thoughts. Thanks!

Copy link
Collaborator

@dmatch01 dmatch01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized this branch is on the multi-cluster repo. Could generate the PR from your fork and delete this branch? Thanks!

@asm582
Copy link
Member Author

asm582 commented Aug 3, 2022

I just realized this branch is on the multi-cluster repo. Could generate the PR from your fork and delete this branch? Thanks!
Yes, sure

@asm582
Copy link
Member Author

asm582 commented Aug 3, 2022

Closing the PR to create new PR from fork.

@asm582 asm582 closed this Aug 3, 2022
@asm582 asm582 deleted the pod_comp_stat_2 branch August 3, 2022 12:45
@asm582
Copy link
Member Author

asm582 commented Aug 5, 2022

Can you also add a failure test case for handling non-pod creating generic items? The easiest way is to duplicate one of the new cases and change the service object to include completionstatus. This will ensure that any end user inadvertently adding this field will not force a controller failure. Thanks!

The controller will not fail, If the user adds completionstatus to the service then AW status will be Complete as in the current design Service and Podgroups are ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

3 participants