This topic describes how to effectively use and configure idle resources when you submit training jobs by using subscription resource quotas in Deep Learning Containers (DLC).
Overview
Platform for AI (PAI) allows for flexible quota allocation and assignment tailored to your business scenarios. Training jobs from different business teams consume their respective quotas. However, during certain periods, some quotas may remain idle, while others may experience queuing due to quota shortages, resulting in resource mismatches and inefficiencies.
In large-scale clusters with complex organizational structures, optimizing resource utilization is a critical objective for computing power services. To address the mismatch described above, DLC offers the idle resource feature, which allows you to submit computing jobs that run on idle resources. This improves overall resource utilization without disrupting regular business operations.
How it works:
Idle resource jobs use idle resources from the current or other quotas without being constrained by the total or remaining resources within a quota.
When the owner of the borrowed resources needs them back, the idle resource jobs are terminated and the borrowed resources are automatically returned.
Idle resource jobs can use the AIMaster and EasyCkpt capabilities of PAI, which help jobs resume automatically after termination and reduce wasted computing power.
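The admission rule described above can be illustrated with a toy model. This is a minimal sketch for intuition only and is not the PAI scheduler: the `Quota` class, the `can_admit` function, and the use of GPU counts as the resource unit are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    """Hypothetical stand-in for a resource quota (not a PAI API object)."""
    name: str
    total_gpus: int
    used_gpus: int = 0  # GPUs held by regular (non-idle) jobs

    @property
    def free_gpus(self) -> int:
        return self.total_gpus - self.used_gpus


def can_admit(job_gpus: int, own: Quota, all_quotas: list[Quota], idle: bool) -> bool:
    """Return True if the job fits under this toy admission rule."""
    if not idle:
        # A regular job is bounded by what remains in its own quota.
        return job_gpus <= own.free_gpus
    # An idle resource job may draw on unused capacity in any quota,
    # regardless of the total or remaining resources of its own quota.
    return job_gpus <= sum(q.free_gpus for q in all_quotas)


team_a = Quota("team-a", total_gpus=8, used_gpus=8)  # fully used
team_b = Quota("team-b", total_gpus=8, used_gpus=2)  # 6 GPUs idle
quotas = [team_a, team_b]

print(can_admit(4, team_a, quotas, idle=False))  # regular job would queue
print(can_admit(4, team_a, quotas, idle=True))   # idle job borrows from team-b
```

In this model, a regular job in a full quota has to queue, while the same job submitted as an idle resource job can start on capacity that another quota is not using.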
Prerequisites
A subscription resource quota is created and associated with the workspace. The quota can consist of general computing resources or Lingjun resources. For more information, see Overview.
Submit a DLC job by using idle resources
When you submit a DLC training job through the console, you can enable Idle Resources in the Resource Information section. The following table describes the key parameters. For more information, see Submit training jobs.
Parameter
Description
Resource Quota
Select a general computing resource quota or a Lingjun resource quota.
Note: To perform high-performance AI training and computing, use Lingjun resources. Lingjun resources are supported only in the China (Ulanqab) and Singapore regions.
Idle Resources
Valid values:
Acceptable: The job may use idle computing resources or resources from the associated quota.
Only Idle Resources: The job exclusively uses idle computing resources and no resources from the associated quota.
Jobs that use idle resources run on resources beyond the associated quota and may be terminated when those idle resources are reclaimed.
Make sure that your code incorporates a checkpoint mechanism to facilitate seamless job restarts and resumption. For more information, see Use EasyCkpt to save and resume foundation model trainings.
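The following is a minimal sketch of what a checkpoint mechanism can look like, using only the Python standard library. It does not use EasyCkpt or any PAI API. The function names, the JSON state layout, and the `preempt_at` argument (which simulates termination) are assumptions made for illustration. A real training job would save model and optimizer state to durable storage instead.

```python
import json
import os
import tempfile
from typing import Optional


def save_checkpoint(path: str, state: dict) -> None:
    """Write state atomically so a kill mid-write never leaves a torn file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "samples_seen": 0}


def train(path: str, total_steps: int, ckpt_every: int,
          preempt_at: Optional[int] = None) -> dict:
    """Run a stand-in training loop that resumes from the last checkpoint."""
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        if preempt_at is not None and step == preempt_at:
            return state  # simulated termination: unsaved progress is lost
        state["samples_seen"] += 32  # stand-in for one real training step
        state["step"] = step + 1
        if state["step"] % ckpt_every == 0 or state["step"] == total_steps:
            save_checkpoint(path, state)
    return state


ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")  # hypothetical location
train(ckpt, total_steps=10, ckpt_every=3, preempt_at=7)  # terminated at step 7
final = train(ckpt, total_steps=10, ckpt_every=3)        # resumes from step 6
print(final)  # {'step': 10, 'samples_seen': 320}
```

The first run is cut off after step 7, but the last saved checkpoint is at step 6, so the restarted run repeats only one step. A shorter checkpoint interval reduces repeated work at the cost of more frequent writes.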
Automatic Fault Tolerance
To mitigate the risk of idle computing jobs being interrupted due to resource scarcity, and to enhance the efficiency and effective utilization of computing power, we recommend that you enable the Automatic Fault Tolerance feature. This ensures that when idle resources are reclaimed, the system seamlessly allocates alternative resources to resume the job. For detailed configuration instructions, refer to AIMaster: Elastic fault tolerance engine.
Monitor DLC job resource usage
The DLC job list and the job details page show whether a job is running on idle resources:
In Quota: The job uses resources in the associated quota.
Not in Quota: The job uses idle computing resources.
If idle resources used by a job are preempted or reclaimed, the status of the job pod on the details page changes to Preempted.
When a non-idle resource job from the borrowed quota group is dequeued and cannot be scheduled due to insufficient resources, the system reclaims resources for the quota group to facilitate job scheduling. At this point, the status of the idle resource job changes to Preempted.
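The reclaim behavior described above can be sketched as follows. This is an illustrative model only, not the PAI scheduler: the `RunningJob` class, the `reclaim_for` function, GPU counts as the resource unit, and the choice to preempt idle jobs in list order are all assumptions. The source does not state which idle resource jobs are selected first.

```python
from dataclasses import dataclass


@dataclass
class RunningJob:
    """Hypothetical record of a job running in the borrowed quota group."""
    name: str
    gpus: int
    idle: bool  # True if submitted as an idle resource job
    status: str = "Running"


def reclaim_for(pending_gpus: int, free_gpus: int,
                running: list[RunningJob]) -> list[str]:
    """Preempt idle resource jobs until a pending regular job fits.

    Returns the names of the preempted jobs (empty if none were needed,
    or if reclaiming every idle job would still not be enough).
    """
    reclaimable = sum(j.gpus for j in running if j.idle and j.status == "Running")
    if free_gpus + reclaimable < pending_gpus:
        return []  # preempting would not help, so leave idle jobs alone
    preempted = []
    for job in running:
        if free_gpus >= pending_gpus:
            break  # enough resources have been returned
        if job.idle and job.status == "Running":
            job.status = "Preempted"
            free_gpus += job.gpus
            preempted.append(job.name)
    return preempted


jobs = [
    RunningJob("regular-1", gpus=4, idle=False),
    RunningJob("idle-1", gpus=2, idle=True),
    RunningJob("idle-2", gpus=2, idle=True),
]
# A regular job needs 3 GPUs but only 1 is free in its quota group.
print(reclaim_for(pending_gpus=3, free_gpus=1, running=jobs))  # ['idle-1']
```

In this example, preempting one idle resource job returns enough capacity, so the second idle resource job keeps running.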
References
To reduce the risk of idle resource jobs being interrupted due to resource scarcity, and to make more effective use of computing resources, we recommend that you use AIMaster: Elastic fault tolerance engine. AIMaster helps a job recover and continue running after preemption. We also recommend that you use EasyCkpt to save and resume foundation model trainings. EasyCkpt is a tool from the PAI team that is designed to minimize the loss of training progress during job preemption and to facilitate automatic job resumption and recovery.