Member

@asm582 asm582 commented Aug 30, 2022

Read resource requirements from the custompodresources section of the AppWrapper to dispatch jobs.

Tested the code with 1 Ray head and multiple workers, each requesting 1 GPU. The derived resource requirements are shown below:

I0830 00:46:48.840672 1 genericresource.go:389] [GetResources] Requested total allocation resource from pods `cpu 1000.00, memory 1000000000.00, GPU 0`.
I0830 00:46:48.840700 1 genericresource.go:389] [GetResources] Requested total allocation resource from pods `cpu 7000.00, memory 7000000000.00, GPU 7`.

Users must set the following fields in every AppWrapper:

custompodresources:
- replicas: 7
  requests:
    cpu: 1
    memory: 1G
    nvidia.com/gpu: 1
  limits:
    cpu: 1
    memory: 1G
    nvidia.com/gpu: 1
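To make the arithmetic behind the second log line concrete, here is a minimal Python sketch of how a custompodresources entry could be summed into a total request. It is not the code in genericresource.go. The helper names (`parse_quantity`, `total_request`, `format_totals`), the default of 1 replica, and the handled suffixes are assumptions for illustration only.

```python
# Subset of Kubernetes quantity suffixes (decimal and binary).
_SUFFIXES = {
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}


def parse_quantity(value):
    """Convert a quantity such as 1, "500m", or "1G" to a float in base units."""
    text = str(value)
    if text.endswith("m"):  # milli-units, e.g. "500m" CPU
        return float(text[:-1]) / 1000
    # Check two-letter suffixes ("Gi") before one-letter ones ("G").
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return float(text[:-len(suffix)]) * _SUFFIXES[suffix]
    return float(text)


def total_request(custom_pod_resources):
    """Sum replicas * requests over all entries (hypothetical helper).

    CPU is reported in millicores and memory in bytes, matching the
    units that appear in the log output.
    """
    totals = {"cpu": 0.0, "memory": 0.0, "gpu": 0}
    for entry in custom_pod_resources:
        replicas = int(entry.get("replicas", 1))  # assumed default
        requests = entry.get("requests", {})
        totals["cpu"] += replicas * parse_quantity(requests.get("cpu", 0)) * 1000
        totals["memory"] += replicas * parse_quantity(requests.get("memory", 0))
        totals["gpu"] += replicas * int(requests.get("nvidia.com/gpu", 0))
    return totals


def format_totals(totals):
    """Render totals in the same style as the log line."""
    return f"cpu {totals['cpu']:.2f}, memory {totals['memory']:.2f}, GPU {totals['gpu']}"


workers = [{
    "replicas": 7,
    "requests": {"cpu": 1, "memory": "1G", "nvidia.com/gpu": 1},
    "limits": {"cpu": 1, "memory": "1G", "nvidia.com/gpu": 1},
}]
print(format_totals(total_request(workers)))
# cpu 7000.00, memory 7000000000.00, GPU 7
```

With 7 replicas of 1 CPU, 1G memory, and 1 GPU each, the sketch reproduces the totals reported in the log above.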
@asm582 asm582 requested a review from dmatch01 August 30, 2022 00:58
Member Author

asm582 commented Aug 30, 2022

The logs below show the previous job remaining queued while the next job is dispatched and its resources are subtracted from the idle cluster resources:

I0830 01:03:29.635224 1 queuejob_controller_ex.go:708] [getAggAvaiResPri] 2022-08-30 01:03:29.635141943 +0000 UTC m=+1571.825028075: Evaluating job: ray-head-urm-exp to calculate aggregated resources.
I0830 01:03:29.635274 1 queuejob_controller_ex.go:804] [getAggAvaiResPri] Schedulable idle cluster resources: cpu 13150.00, memory 124290547712.00, GPU 0, subtracting dispatched resources: cpu 0.00, memory 0.00, GPU 0 and adding preemptable cluster resources: cpu 0.00, memory 0.00, GPU 0
I0830 01:03:29.635356 1 queuejob_controller_ex.go:808] [getAggAvaiResPri] cpu 13150.00, memory 124290547712.00, GPU 0 available resources to schedule
I0830 01:03:32.637805 1 queuejob_controller_ex.go:699] [getAggAvaiResPri] Idle cluster resources cpu 13150.00, memory 124290547712.00, GPU 0
I0830 01:03:32.637819 1 queuejob_controller_ex.go:708] [getAggAvaiResPri] 2022-08-30 01:03:32.637813066 +0000 UTC m=+1574.827698705: Evaluating job: ray-head-urm-exp to calculate aggregated resources.
I0830 01:03:32.637831 1 queuejob_controller_ex.go:708] [getAggAvaiResPri] 2022-08-30 01:03:32.637826011 +0000 UTC m=+1574.827711558: Evaluating job: ray-head-urm-exp-tiny to calculate aggregated resources.
I0830 01:03:32.637889 1 queuejob_controller_ex.go:804] [getAggAvaiResPri] Schedulable idle cluster resources: cpu 13150.00, memory 124290547712.00, GPU 0, subtracting dispatched resources: cpu 0.00, memory 0.00, GPU 0 and adding preemptable cluster resources: cpu 0.00, memory 0.00, GPU 0
I0830 01:03:32.637896 1 queuejob_controller_ex.go:808] [getAggAvaiResPri] cpu 13150.00, memory 124290547712.00, GPU 0 available resources to schedule
I0830 01:03:49.651238 1 queuejob_controller_ex.go:699] [getAggAvaiResPri] Idle cluster resources cpu 11150.00, memory 122290547712.00, GPU 0
I0830 01:03:49.651257 1 queuejob_controller_ex.go:708] [getAggAvaiResPri] 2022-08-30 01:03:49.651249619 +0000 UTC m=+1591.841135259: Evaluating job: ray-head-urm-exp to calculate aggregated resources.
I0830 01:03:49.651270 1 queuejob_controller_ex.go:708] [getAggAvaiResPri] 2022-08-30 01:03:49.651264413 +0000 UTC m=+1591.841150259: Evaluating job: ray-head-urm-exp-tiny to calculate aggregated resources.
I0830 01:03:49.651447 1 queuejob_controller_ex.go:804] [getAggAvaiResPri] Schedulable idle cluster resources: cpu 11150.00, memory 122290547712.00, GPU 0, subtracting dispatched resources: cpu 0.00, memory 0.00, GPU 0 and adding preemptable cluster resources: cpu 0.00, memory 0.00, GPU 0
I0830 01:03:49.651467 1 queuejob_controller_ex.go:808] [getAggAvaiResPri] cpu 11150.00, memory 122290547712.00, GPU 0 available resources to schedule
I0830 01:04:09.667327 1 queuejob_controller_ex.go:699] [getAggAvaiResPri] Idle cluster resources cpu 11150.00, memory 122290547712.00, GPU 0
I0830 01:04:09.667327 1 queuejob_controller_ex.go:708] [getAggAvaiResPri] 2022-08-30 01:04:09.667327077 +0000 UTC m=+1611.857212113: Evaluating job: ray-head-urm-exp to calculate aggregated resources.
I0830 01:04:09.667327 1 queuejob_controller_ex.go:708] [getAggAvaiResPri] 2022-08-30 01:04:09.667327077 +0000 UTC m=+1611.857212113: Evaluating job: ray-head-urm-exp-tiny to calculate aggregated resources.
I0830 01:04:09.667444 1 queuejob_controller_ex.go:804] [getAggAvaiResPri] Schedulable idle cluster resources: cpu 11150.00, memory 122290547712.00, GPU 0, subtracting dispatched resources: cpu 0.00, memory 0.00, GPU 0 and adding preemptable cluster resources: cpu 0.00, memory 0.00, GPU 0
I0830 01:04:09.667444 1 queuejob_controller_ex.go:808] [getAggAvaiResPri] cpu 11150.00, memory 122290547712.00, GPU 0 available resources to schedule
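
The availability calculation these log lines describe (idle, minus dispatched, plus preemptable) can be sketched in Python as follows. This is an illustration, not the controller's code in queuejob_controller_ex.go. The `Resources` type, `available_to_schedule`, and `fits_in` are hypothetical names. The request sizes are also assumptions: the GPU request reuses the 7-worker totals from the first comment, and the 2000 millicore / 2 GB request is inferred only from the drop in idle resources between 01:03:32 and 01:03:49.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpu: float = 0.0     # millicores, as printed in the logs
    memory: float = 0.0  # bytes
    gpu: int = 0

    def __add__(self, other):
        return Resources(self.cpu + other.cpu,
                         self.memory + other.memory,
                         self.gpu + other.gpu)

    def __sub__(self, other):
        return Resources(self.cpu - other.cpu,
                         self.memory - other.memory,
                         self.gpu - other.gpu)

    def fits_in(self, capacity):
        """True if every dimension of this request is within capacity."""
        return (self.cpu <= capacity.cpu
                and self.memory <= capacity.memory
                and self.gpu <= capacity.gpu)


def available_to_schedule(idle, dispatched, preemptable):
    """Idle resources, minus dispatched, plus preemptable."""
    return idle - dispatched + preemptable


# Values taken from the 01:03:32 log entries.
idle = Resources(13150.0, 124290547712.0, 0)
free = available_to_schedule(idle, Resources(), Resources())

gpu_job = Resources(7000.0, 7_000_000_000.0, 7)   # assumed request size
tiny_job = Resources(2000.0, 2_000_000_000.0, 0)  # inferred, not logged

print(gpu_job.fits_in(free))   # False
print(tiny_job.fits_in(free))  # True
print(free - tiny_job)         # what remains once the small job is placed
```

Under these assumptions, a request that needs GPUs cannot be placed while the cluster reports GPU 0, whereas the smaller request fits, and subtracting it leaves cpu 11150 and memory 122290547712, the idle values logged from 01:03:49 onward.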
asm582 and others added 2 commits August 30, 2022 09:04
Collaborator

@dmatch01 dmatch01 left a comment

lgtm, awaiting test cases.

asm582 and others added 4 commits August 30, 2022 09:42
Test cases for CustomPodResources.
Signed-off-by: dmatch01 <darroyo@us.ibm.com>
Signed-off-by: dmatch01 <darroyo@us.ibm.com>
Corrected test case expectations.
Collaborator

@dmatch01 dmatch01 left a comment

lgtm

@dmatch01 dmatch01 merged commit f9ef292 into project-codeflare:quota-management Aug 30, 2022