-
Couldn't load subscription status.
- Fork 64
Pod comp stat 2 #192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pod comp stat 2 #192
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also add a failure test case for handling non-pod creating generic items? The easiest way is to duplicate one of the new cases and change the service object to include completionstatus. This will ensure that any end user inadvertently adding this field will not force a controller failure. Thanks!
| //Set Appwrapper state to complete if all items in Appwrapper | ||
| //are completed | ||
| if awNew.Status.State != arbv1.AppWrapperStateCompleted { | ||
| cc.GetAllObjectsOwned(awNew) | ||
| } | ||
| | ||
| if awNew.Status.State == arbv1.AppWrapperStateRunningHoldCompletion { | ||
| awNew.Status.QueueJobState = arbv1.AppWrapperCondRunningHoldCompletion | ||
| cond := GenerateAppWrapperCondition(arbv1.AppWrapperCondRunningHoldCompletion, v1.ConditionTrue, "SomeItemsCompleted", "") | ||
| awNew.Status.Conditions = append(awNew.Status.Conditions, cond) | ||
| awNew.Status.FilterIgnore = true // Update AppWrapperCondCompleted | ||
| cc.updateEtcd(awNew, "[syncQueueJob] setRunningHoldCompletion") | ||
| } | ||
| | ||
| if awNew.Status.State == arbv1.AppWrapperStateCompleted { | ||
| awNew.Status.QueueJobState = arbv1.AppWrapperCondCompleted | ||
| cond := GenerateAppWrapperCondition(arbv1.AppWrapperCondCompleted, v1.ConditionTrue, "PodsCompleted", "") | ||
| awNew.Status.Conditions = append(awNew.Status.Conditions, cond) | ||
| awNew.Status.FilterIgnore = true // Update AppWrapperCondCompleted | ||
| cc.updateEtcd(awNew, "[syncQueueJob] setCompleted") | ||
| } | ||
| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All aw.Status.State changes are done in manageQueueJob(). Can we move the code to this fn. for consistency? The logic not quite intuitive as there are multiple queues involved thus a quick discussion would likely help here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An appwrapper is set to running state insyncQueueJob, newly added states are obtained after AW turns running, hence I added these in the same method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.status.state is being updated in the call cc.GetAllObjectsOwned(awNew). All state changes are done in managequeuejob(). Suggest moving all the logic to managequeuejob() to be consistent. The syncQueueJob() only update the .status.queuejobstate field.
| countCompletionRequired := 0 | ||
| ignoredItems := 0 | ||
| for _, genericItem := range cqj.Spec.AggrResources.GenericItems { | ||
| objectName := genericItem.GenericTemplate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the generic item does not have the new completionstatus field, can we skip this logic for performance reasons? Please let me know your thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inside Appwrapper we are finding items of 3 types of items completionRequired, completedItems and ignoredItems, so we would need to iterate all the items I believe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment response below.
| //ignore service and podgroups since these items do not have status | ||
| if status == "ignore" { | ||
| ignoredItems = ignoredItems + 1 | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment above regarding logic checking of new completionstatus
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we start ignoring items without completionstatus I think none of the ray jobs will complete with the current design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct. We don't want to presume completion status unless the end user requests the AW generic item to be considered. This allows for end users to add run to completion objects that they specify. If the end user does not want the generic item to be considered for completion we have that flexibility. Please share your thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I think the above is not entirely true. please consider the case, if AW has job and deployment as two items and only job is tracked, we still set the status as RunningHoldCompletion. So non-compute items are treated differently than the compute items.
| //service and podgroup has no conditions return ALWAYS true | ||
| //TODO: make it configurable? | ||
| if gvk.GroupKind().Kind == "Service" || gvk.GroupKind().Kind == "PodGroup" { | ||
| return "ignore" | ||
| } | ||
| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be removed if generic item does not have the new field completionstatus correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I see the challenge here. As part of the AW pod phases (pending/running/completed...) are tracked. If there is a check of the pods when setting state, this would be sufficient. Iterating over each generic item that is not tracked would not be necessary. only when all tracked items indicate completionstatus the state should be changed (of which is being done). The change here is that the decision to set state to RunningHoldCompletion or Complete is when all pods associated to AW are in Failed/Completed phase states. These counts are already tracked in the AW and can be referenced from the .status field. Please let me know your thoughts. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please refer to this comment #192 (comment)
| for _, job := range inEtcd.Items { | ||
| completionRequiredBlock := aw.CompletionStatus | ||
| if len(completionRequiredBlock) > 0 { | ||
| conditions := job.Object["status"].(map[string]interface{})["conditions"].([]interface{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this logic is dealing with any object (including custom resources), should probably check to ensure there is a conditions field and likely to occur but also a check for status field
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have seen status field has other objects, for instance, status may have conditions array or loadbalancer object, does it make sense to stick with .status.conditions.type for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok with not adding the logic but a new test case should be added to handle such and event. The test case should create an object without conditions that tracks. For instance, duplicate on of the test cases and add an addition generic item that is a new namespace that tracks (completionstatus). This is a failure test case so the controller should handle this problem gracefully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There were quite a bit more format changes. If they can be reduced in this PR that would be helpful. Feel free to submit a separate PR with just format change which would be great! Thanks!
| an item in Appwrapper should be a candidate to complete an appwrapper. | ||
| this field should contain conditions that make appwrapper complete, for instance :- | ||
| completionconditions could be "Complete", "Failed" to make appwrapper completes. | ||
| To compare status we look at level .status.conditions[0].type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We look at all conditions not just the first one correct? Suggest using status.conditions[].type if that is correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we do look for all conditions.
| //ignore service and podgroups since these items do not have status | ||
| if status == "ignore" { | ||
| ignoredItems = ignoredItems + 1 | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct. We don't want to presume completion status unless the end user requests the AW generic item to be considered. This allows for end users to add run to completion objects that they specify. If the end user does not want the generic item to be considered for completion we have that flexibility. Please share your thoughts.
| countCompletionRequired := 0 | ||
| ignoredItems := 0 | ||
| for _, genericItem := range cqj.Spec.AggrResources.GenericItems { | ||
| objectName := genericItem.GenericTemplate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment response below.
| //Set Appwrapper state to complete if all items in Appwrapper | ||
| //are completed | ||
| if awNew.Status.State != arbv1.AppWrapperStateCompleted { | ||
| cc.GetAllObjectsOwned(awNew) | ||
| } | ||
| | ||
| if awNew.Status.State == arbv1.AppWrapperStateRunningHoldCompletion { | ||
| awNew.Status.QueueJobState = arbv1.AppWrapperCondRunningHoldCompletion | ||
| cond := GenerateAppWrapperCondition(arbv1.AppWrapperCondRunningHoldCompletion, v1.ConditionTrue, "SomeItemsCompleted", "") | ||
| awNew.Status.Conditions = append(awNew.Status.Conditions, cond) | ||
| awNew.Status.FilterIgnore = true // Update AppWrapperCondCompleted | ||
| cc.updateEtcd(awNew, "[syncQueueJob] setRunningHoldCompletion") | ||
| } | ||
| | ||
| if awNew.Status.State == arbv1.AppWrapperStateCompleted { | ||
| awNew.Status.QueueJobState = arbv1.AppWrapperCondCompleted | ||
| cond := GenerateAppWrapperCondition(arbv1.AppWrapperCondCompleted, v1.ConditionTrue, "PodsCompleted", "") | ||
| awNew.Status.Conditions = append(awNew.Status.Conditions, cond) | ||
| awNew.Status.FilterIgnore = true // Update AppWrapperCondCompleted | ||
| cc.updateEtcd(awNew, "[syncQueueJob] setCompleted") | ||
| } | ||
| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.status.state is being updated in the call cc.GetAllObjectsOwned(awNew). All state changes are done in managequeuejob(). Suggest moving all the logic to managequeuejob() to be consistent. The syncQueueJob() only update the .status.queuejobstate field.
| //service and podgroup has no conditions return ALWAYS true | ||
| //TODO: make it configurable? | ||
| if gvk.GroupKind().Kind == "Service" || gvk.GroupKind().Kind == "PodGroup" { | ||
| return "ignore" | ||
| } | ||
| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I see the challenge here. As part of the AW pod phases (pending/running/completed...) are tracked. If there is a check of the pods when setting state, this would be sufficient. Iterating over each generic item that is not tracked would not be necessary. only when all tracked items indicate completionstatus the state should be changed (of which is being done). The change here is that the decision to set state to RunningHoldCompletion or Complete is when all pods associated to AW are in Failed/Completed phase states. These counts are already tracked in the AW and can be referenced from the .status field. Please let me know your thoughts. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized this branch is on the multi-cluster repo. Could generate the PR from your fork and delete this branch? Thanks!
|
| Closing the PR to create new PR from fork. |
The controller will not fail, If the user adds |

Completion status is now represented as:
completionstatus: "Complete,Failed"In the current design services and podgroups are ignored since they do not have completion status. below scenarios were tested: