PAI-DLC lets you quickly create standalone or distributed training jobs. It uses Kubernetes to launch compute nodes, which eliminates the need to manually purchase machines and configure runtime environments, so you can keep your existing workflow. The service is ideal for users who need to start training jobs quickly; it supports multiple deep learning frameworks and offers flexible resource configuration options.
Prerequisites
Activate PAI and create a workspace using your Alibaba Cloud account. To do this, log on to the PAI console, select a region at the top of the page, and then complete the one-click authorization and activation. For more information, see Activate PAI and create a workspace.
You must grant permissions to the operating account. If you use an Alibaba Cloud account, you can skip this step. If you use a Resource Access Management (RAM) user, the user must have one of the following roles: Algorithm Developer, Algorithm O&M, or Workspace Administrator. For instructions, see the Member role configuration section in Manage workspaces.
Create a job in the console
If you are new to DLC, we recommend that you create jobs using the console. You can also create jobs using an SDK or the command line.
Go to the Create Job page.
Log on to the PAI console. At the top of the page, select the destination region and the target workspace. Then, click Enter DLC.
On the Deep Learning Containers (DLC) page, click Create Job.
Configure the parameters for the training job in the following sections.
Basic information
Configure the Job Name and Labels.
Environment context
Parameter
Description
Node Image
In addition to Official Images, the following image types are also supported:
Custom Image: You can use a custom image that is added to PAI. The image repository must be set to allow public pulls, or the image must be stored in Container Registry (ACR). For more information, see Custom images.
Note: When you select Lingjun resources for the resource quota and use a custom image, you must manually install RDMA to fully use the high-performance RDMA network of Lingjun. For more information, see RDMA: Use a high-performance network for distributed training.
Image URL: You can configure the URL of a custom image or an official image that is accessible over the Internet.
If the URL points to a private image, click Enter Username And Password and configure the username and password for the image repository.
To improve the image pull speed, see Accelerate image pulling.
Dataset
Datasets provide the data files required for model training. Two types of datasets are supported:
Custom Dataset: You can create a custom dataset to store the required data files. You can set whether the dataset is Read-only and select a dataset version from the Version List.
Public Dataset: PAI provides public datasets that can only be mounted in read-only mode.
Mount Path: The path where the dataset is mounted in the DLC container, such as `/mnt/data`. Your code reads the dataset from this path (a short read example appears after this section). For more information about mount configurations, see Use cloud storage in DLC training jobs.
Important: If you configure a CPFS dataset, you must configure a virtual private cloud (VPC) for DLC that is the same as the VPC of the CPFS. Otherwise, the submitted job may remain in the preparing state for a long time.
Direct Mount
You can also directly mount a data source path to read required data or store intermediate and result files.
Supported data source types: OSS, General-purpose NAS file system, Extreme NAS file system, and BMCPFS (only available for Lingjun resources).
Advanced Configuration: You can use advanced configurations to enable specific features for different data source types. For example:
OSS: Set `{"mountType":"ossfs"}` in the advanced configuration to mount OSS storage using ossfs.
General-purpose NAS and CPFS: Set the nconnect parameter in the advanced configuration to improve the throughput for DLC containers that access NAS, for example, `{"nconnect":"<example_value>"}`. Replace <example_value> with a positive integer. For more information, see How do I resolve poor performance when accessing NAS on a Linux OS?.
For more information, see Use cloud storage in DLC training jobs.
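As an example of the mount behavior described above, a dataset mounted in the container is visible to your code as an ordinary local directory. A minimal sketch, assuming the example mount path `/mnt/data` from above:

```python
import os

data_dir = '/mnt/data'  # The Mount Path configured for the dataset.

# Mounted datasets behave like local files and directories.
for name in sorted(os.listdir(data_dir)):
    print(os.path.join(data_dir, name))
```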
Startup Command
Set the startup command for the job. Shell commands are supported. DLC automatically injects common environment variables for PyTorch and TensorFlow, such as `MASTER_ADDR` and `WORLD_SIZE`, which you can reference as `$variable_name`. The following are examples of startup commands:

Run Python:

```
python -c "print('Hello World')"
```

PyTorch multi-node, multi-GPU distributed training:

```
python -m torch.distributed.launch \
    --nproc_per_node=2 \
    --master_addr=${MASTER_ADDR} \
    --master_port=${MASTER_PORT} \
    --nnodes=${WORLD_SIZE} \
    --node_rank=${RANK} \
    train.py --epochs=100
```

Set a shell script path as the startup command:

```
/ml/input/config/launch.sh
```
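For reference, training code can also read the injected variables directly from the process environment. A minimal Python sketch; the variable names are those listed above, and the fallback values are illustrative only:

```python
import os

# DLC injects these variables into each node of a distributed job.
master_addr = os.environ.get('MASTER_ADDR', 'localhost')
master_port = int(os.environ.get('MASTER_PORT', '23456'))
world_size = int(os.environ.get('WORLD_SIZE', '1'))
rank = int(os.environ.get('RANK', '0'))

print(f'rank {rank}/{world_size} reports to {master_addr}:{master_port}')
```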
Resource information
Parameter
Description
Resource Type
The default value is General Computing. Only the China (Ulanqab), Singapore, China (Shenzhen), China (Beijing), China (Shanghai), and China (Hangzhou) regions support selecting Lingjun AI Computing Service.
Resource Source
Public Resources:
Billing method: Pay-as-you-go.
Scenarios: Public resources may experience queuing delays. We recommend using them for scenarios with a relatively small number of jobs and low timeliness requirements.
Limits: The maximum resources supported for a job are 2 GPUs and 8 CPU cores. To exceed this limit, contact your business manager to increase the resource limit.
Resource Quota: Includes general computing resources or Lingjun resources.
Billing method: Subscription.
Scenarios: Suitable for scenarios with a relatively large number of jobs that require high-assurance execution.
Special parameters:
Resource Quota: You can set the number of resources such as GPUs and CPUs. To prepare a resource quota, see Add a resource quota.
Priority: The execution priority of concurrently running jobs. The value ranges from 1 to 9, where 1 is the lowest priority.
Preemptible Resources:
Billing method: Pay-as-you-go.
Scenarios: If you want to reduce resource costs, you can use preemptible resources, which usually offer a certain discount.
Limits: Stable availability is not guaranteed. Resources may not be immediately preempted or may be reclaimed. For more information, see Use preemptible jobs.
Framework
The following deep learning training frameworks and tools are supported: TensorFlow, PyTorch, ElasticBatch, XGBoost, OneFlow, MPIJob, and Ray.
Note: When you select Lingjun resources for the Resource Quota, you can submit only TensorFlow, PyTorch, ElasticBatch, MPIJob, and Ray jobs.
Job Resources
Based on the Framework you select, you can configure resources for nodes of type Worker, PS, Chief, Evaluator, and GraphLearn. When you select the Ray framework, you can click Add Role to customize the Worker role to run a mix of heterogeneous resources.
Use public resources: You can configure the following parameters:
Number Of Nodes: The number of nodes to run the DLC job.
Resource Specifications: You can select resource specifications. The console displays the corresponding prices. For more billing information, see DLC billing.
Use a resource quota: You can configure the number of nodes, CPU (cores), GPU (cards), memory (GiB), and shared memory (GiB) for each node type. You can also configure the following special parameters:
Schedule On Specified Nodes: You can run jobs on specified compute nodes.
Idle Resources: When you use idle resources, jobs can run on the idle resources of other quotas, which improves resource utilization. However, when the jobs of the original quota need these resources back, the job that runs on idle resources is terminated and the resources are automatically returned. For more information, see Use idle resources.
CPU Affinity: Enabling CPU affinity binds the processes in a container or pod to specific CPU cores for execution. This can reduce CPU cache misses and context switches, thereby improving CPU utilization and application performance. It is suitable for performance-sensitive and real-time scenarios.
Use preemptible resources: In addition to the number of nodes and resource specifications, you can also configure the Bid parameter, which sets the maximum bid for requesting preemptible resources. You can select one of the following bidding methods:
By discount: The maximum price is based on the market price of the resource specification, with discrete options from a 10% to 90% discount. This represents the upper limit for bidding. When the maximum bid for preemptible resources is greater than or equal to the market price and inventory is sufficient, the request for preemptible resources can be granted.
By price: The maximum bid price is within the market price range.
VPC configuration
If you do not configure a VPC, the public network and a public gateway are used. Because the public gateway has limited bandwidth, jobs may experience performance issues or fail to run.
After you configure a VPC and select the corresponding vSwitch and security group, network bandwidth, stability, and security are improved. The cluster where the job runs can then directly access services within this VPC.
Important: When you use a VPC, ensure that the job's resource group instances and the dataset storage (OSS) are in the same VPC and region, and that the network can reach the code repository.
When you use a CPFS dataset, you must configure a VPC. The selected VPC must be the same as the VPC of the CPFS. Otherwise, the submitted DLC training job may remain in the Preparing state for an extended period.
When you submit a DLC job using Lingjun preemptible resources, you must configure a VPC.
You can also configure a Public Network Access Gateway in one of the following two ways:
Public Gateway: The network bandwidth is limited and may be insufficient for high concurrency or for downloading large files.
Private Gateway: To overcome the bandwidth limitations of a public gateway, you can create an Internet NAT gateway in the DLC's VPC, attach elastic IP addresses (EIPs), and configure an SNAT entry. For more information, see Improve public network access speed through a private gateway.
Fault tolerance and diagnosis
Parameter
Description
Automatic Fault Tolerance
Turn on the Automatic Fault Tolerance switch and configure the parameters. The system will provide job detection and control capabilities to promptly detect and avoid errors at the job's algorithm layer, thereby improving GPU utilization. For more information, see AIMaster: An elastic and automatic fault tolerance engine.
Note: After you enable automatic fault tolerance, the system starts an AIMaster instance that runs alongside the job instance and consumes some computing resources. The resource usage of the AIMaster instance is as follows:
Resource quota: 1 CPU core and 1 GiB of memory.
Public resources: Uses the ecs.c6.large specification.
Health Check
Turn on the Health Check switch. The health check performs a comprehensive inspection of the resources involved in the training, automatically isolates faulty nodes, and triggers an automated O&M process in the background. This effectively reduces the possibility of encountering problems in the early stages of job training and improves the training success rate. For more information, see SanityCheck: Computing power health check.
Note: The health check feature is supported only for PyTorch training jobs that are submitted based on a Lingjun resource quota, and only when the number of GPUs is greater than 0.
Roles and permissions
The following section describes the instance RAM role configuration. For more information about this feature, see Configure a DLC RAM role.
Instance RAM role
Description
PAI Default Role
Operates based on the AliyunPAIDLCDefaultRole service role, which has fine-grained permissions that cover only MaxCompute (ODPS) and OSS access. A temporary access credential issued based on the PAI default role:
Has the same permissions as the DLC instance owner when accessing MaxCompute tables.
Can only access the default OSS bucket configured for the current workspace when accessing OSS.
Custom Role
Select or enter a custom RAM role. When accessing cloud products based on a temporary STS credential within the instance, the permissions will be the same as those of the custom role.
Do Not Associate A Role
Does not associate a RAM role with the DLC job. This is the default option.
After configuring the parameters, click OK.
References
After you submit a training job, you can perform the following operations:
View the basic information, resource usage, and operation logs of the job. For more information, see View training details.
Manage jobs, including cloning, stopping, and deleting jobs. For more information, see Manage training jobs.
View the results analysis report using TensorBoard. For more information, see Visualization tool TensorBoard.
Set up monitoring and alerts for jobs. For more information, see Training monitoring and alerts.
View the billing details for the job. For more information, see Bill details.
Forward the logs of DLC jobs in the current workspace to a specific Simple Log Service (SLS) Logstore for custom analysis. For more information, see Subscribe to task logs.
Create a message notification rule in the Event Hub of the PAI workspace to track and monitor the status of DLC jobs. For more information, see Notifications.
For information about potential issues and their solutions when you run DLC jobs, see DLC FAQ.
For use cases of DLC, see DLC use cases.
Appendix
Create a job using an SDK or the command line
Python SDK
Step 1: Install the Alibaba Cloud Credentials tool
When you use an Alibaba Cloud SDK to call OpenAPI for resource operations, you must install the Credentials tool to configure your credential information. The requirements are as follows:
Python 3.7 or later.
Alibaba Cloud SDK V2.0.
pip install alibabacloud_credentials
Step 2: Obtain an AccessKey
This example uses AccessKey information to configure access credentials. To prevent your account information from being exposed, we recommend that you configure the AccessKey as environment variables. The environment variable names for the ID and secret are ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET, respectively.
To obtain AccessKey information, see Create an AccessKey.
To set environment variables, see Configure environment variables.
For other credential configuration methods, see Install the Credentials tool.
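A minimal sketch showing that the Credentials tool picks up the two environment variables automatically, assuming both are set before the process starts:

```python
import os

from alibabacloud_credentials.client import Client as CredClient

# The Credentials tool reads these variables from the environment; fail early
# if they are missing so that API calls do not fail later with an opaque error.
for var in ('ALIBABA_CLOUD_ACCESS_KEY_ID', 'ALIBABA_CLOUD_ACCESS_KEY_SECRET'):
    if not os.environ.get(var):
        raise RuntimeError(f'{var} is not set')

cred = CredClient()  # Uses the default credential provider chain.
```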
Step 3: Install the Python SDK
Install the workspace SDK.
pip install alibabacloud_aiworkspace20210204==3.0.1
Install the DLC SDK.
pip install alibabacloud_pai_dlc20201203==1.4.17
Step 4: Submit the job
Submit a job using public resources
The following code shows how to create and submit a job.
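A minimal sketch for public resources, assuming that the `ecs_spec` field of `JobSpec` selects the node specification (as the `EcsSpec` key does in the preemptible-resource examples later in this appendix) and that no resource quota ID is passed; treat it as a starting point rather than a definitive sample:

```python
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_tea_openapi.models import Config
from alibabacloud_pai_dlc20201203.client import Client
from alibabacloud_pai_dlc20201203.models import CreateJobRequest, JobSpec

region = 'cn-hangzhou'
cred = CredClient()  # Reads the AccessKey from environment variables.
client = Client(config=Config(
    credential=cred,
    region_id=region,
    endpoint=f'pai-dlc.{region}.aliyuncs.com',
))

# With public resources, select a node specification instead of a resource quota.
# The ecs_spec value here is illustrative; pick a specification shown in the console.
spec = JobSpec(
    type='Worker',
    image='registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.15-cpu-py36-ubuntu18.04',
    pod_count=1,
    ecs_spec='ecs.c6.large',
)

req = CreateJobRequest(
    workspace_id='<Replace with your WorkspaceID>',
    display_name='sample-dlc-public-job',
    job_type='TFJob',
    job_specs=[spec],
    user_command='echo "Hello World"',
)
print('job id:', client.create_job(req).body.job_id)
```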
Submit a job using a subscription resource quota
Log on to the PAI console.
On the workspace list page, view your workspace ID.
View the resource quota ID of your dedicated resource group.
Use the following code to create and submit a job. For a list of available public images, see Step 2: Prepare an image.
```python
from alibabacloud_pai_dlc20201203.client import Client
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_tea_openapi.models import Config
from alibabacloud_pai_dlc20201203.models import (
    CreateJobRequest,
    JobSpec,
    ResourceConfig,
    GetJobRequest,
)

# Initialize a client to access the DLC API.
region = 'cn-hangzhou'

# The AccessKey of an Alibaba Cloud account has permissions to access all APIs. We
# recommend that you use a RAM user for API access or daily O&M.
# We strongly recommend that you do not save your AccessKey ID and AccessKey secret in
# your project code. This can lead to an AccessKey leak and threaten the security of all
# resources in your account.
# This example uses the Credentials SDK to read the AccessKey from environment variables
# for identity verification.
cred = CredClient()
client = Client(
    config=Config(
        credential=cred,
        region_id=region,
        endpoint=f'pai-dlc.{region}.aliyuncs.com',
    )
)

# Declare the resource configuration for the job. For image selection, refer to the
# public image list in the documentation or provide your own image URL.
spec = JobSpec(
    type='Worker',
    image='registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.15-cpu-py36-ubuntu18.04',
    pod_count=1,
    resource_config=ResourceConfig(cpu='1', memory='2Gi'),
)

# Declare the execution content of the job.
req = CreateJobRequest(
    resource_id='<Replace with your resource quota ID>',
    workspace_id='<Replace with your WorkspaceID>',
    display_name='sample-dlc-job',
    job_type='TFJob',
    job_specs=[spec],
    user_command='echo "Hello World"',
)

# Submit the job and get the job ID.
response = client.create_job(req)
job_id = response.body.job_id

# Query the job status.
job = client.get_job(job_id, GetJobRequest()).body
print('job status:', job.status)

# View the command that the job runs.
print(job.user_command)
```
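The status query above returns a single snapshot. To block until the job reaches a terminal state, you can poll the same `get_job` call; a sketch that reuses `client` and `job_id` from the sample above (the terminal status strings are assumptions; check the values your jobs actually return):

```python
import time

from alibabacloud_pai_dlc20201203.models import GetJobRequest

# Poll until the job leaves the queued/running states.
while True:
    status = client.get_job(job_id, GetJobRequest()).body.status
    print('job status:', status)
    if status in ('Succeeded', 'Failed', 'Stopped'):
        break
    time.sleep(30)
```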
Submit a job using preemptible resources
SpotDiscountLimit (Spot discount)
```python
#!/usr/bin/env python3
from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import CreateJobRequest

region_id = '<region-id>'  # The ID of the region in which the DLC job resides, such as cn-hangzhou.
cred = CredClient()
workspace_id = '12****'  # The ID of the workspace to which the DLC job belongs.

dlc_client = DLCClient(
    Config(credential=cred,
           region_id=region_id,
           endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
           protocol='http'))

create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
    'WorkspaceId': workspace_id,
    'DisplayName': 'sample-spot-job',
    'JobType': 'PyTorchJob',
    'JobSpecs': [
        {
            "Type": "Worker",
            "Image": "dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/pytorch-training:1.12-cpu-py39-ubuntu20.04",
            "PodCount": 1,
            "EcsSpec": 'ecs.g7.xlarge',
            "SpotSpec": {
                "SpotStrategy": "SpotWithPriceLimit",
                "SpotDiscountLimit": 0.4,
            }
        },
    ],
    'UserVpc': {
        "VpcId": "vpc-0jlq8l7qech3m2ta2****",
        "SwitchId": "vsw-0jlc46eg4k3pivwpz8****",
        "SecurityGroupId": "sg-0jl4bd9wwh5auei9****",
    },
    "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
}))
job_id = create_job_resp.body.job_id
print(f'jobId is {job_id}')
```
SpotPriceLimit (Spot price)
```python
#!/usr/bin/env python3
from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import CreateJobRequest

region_id = '<region-id>'
cred = CredClient()
workspace_id = '12****'

dlc_client = DLCClient(
    Config(credential=cred,
           region_id=region_id,
           endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
           protocol='http'))

create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
    'WorkspaceId': workspace_id,
    'DisplayName': 'sample-spot-job',
    'JobType': 'PyTorchJob',
    'JobSpecs': [
        {
            "Type": "Worker",
            "Image": "dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/pytorch-training:1.12-cpu-py39-ubuntu20.04",
            "PodCount": 1,
            "EcsSpec": 'ecs.g7.xlarge',
            "SpotSpec": {
                "SpotStrategy": "SpotWithPriceLimit",
                "SpotPriceLimit": 0.011,
            }
        },
    ],
    'UserVpc': {
        "VpcId": "vpc-0jlq8l7qech3m2ta2****",
        "SwitchId": "vsw-0jlc46eg4k3pivwpz8****",
        "SecurityGroupId": "sg-0jl4bd9wwh5auei9****",
    },
    "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
}))
job_id = create_job_resp.body.job_id
print(f'jobId is {job_id}')
```
The following table describes the key parameters:
| Parameter | Description |
| --- | --- |
| SpotStrategy | The bidding policy. The bidding type parameters take effect only if you set this parameter to SpotWithPriceLimit. |
| SpotDiscountLimit | The spot discount bidding type. |
| SpotPriceLimit | The spot price bidding type. |
| UserVpc | This parameter is required when you use Lingjun resources to submit jobs. Configure the VPC, vSwitch, and security group ID for the region in which the job resides. |
Command line
Step 1: Download the client and perform user authentication
Download the Linux 64-bit or macOS version of the client tool based on your operating system and complete user authentication. For more information, see Preparations.
Step 2: Submit the job
Log on to the PAI console.
On the workspace list page, view your workspace ID.
View your resource quota ID.
Prepare the parameter file `tfjob.params` by referring to the following content. For more information about how to configure the parameter file, see Submit command.

```
name=test_cli_tfjob_001
workers=1
worker_cpu=4
worker_gpu=0
worker_memory=4Gi
worker_shared_memory=4Gi
worker_image=registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu16.04
command=echo good && sleep 120
resource_id=<Replace with your resource quota ID>
workspace_id=<Replace with your WorkspaceID>
```
Use the following command to submit the job by passing the parameter file. The DLC job is submitted to the specified workspace and resource quota.
./dlc submit tfjob --job_file ./tfjob.params
Use the following command to view the DLC job that you submitted.
./dlc get job <jobID>
Advanced parameter list
| Parameter (key) | Supported framework types | Parameter description | Parameter value (value) |
| --- | --- | --- | --- |
|  | ALL | Configures a custom resource release rule. This is optional. If not configured, all pod resources are released when the job ends. If configured, only pod-exit is currently supported, which releases a pod's resources when the pod exits. | pod-exit |
|  | ALL | Specifies whether to enable the IBGDA feature when loading the GPU driver. |  |
|  | ALL | Specifies whether to install the GDRCopy kernel module. Version 2.4.4 is currently installed. |  |
|  | ALL | Specifies whether to enable NUMA. |  |
|  | ALL | When a job is submitted, checks whether the total resources (node specifications) in the quota can meet the specifications of all roles in the job. |  |
|  | PyTorch | Specifies whether to allow network communication between workers. After this is enabled, the domain name of each worker is its worker name. |  |
|  | PyTorch | Lets users define the network ports to open on each worker. It can be used together with the worker communication parameter above. If not configured, only port 23456 is opened on the master by default; therefore, avoid including port 23456 in this custom port list. Important: This parameter and the port-count parameter below are mutually exclusive. | A semicolon-separated string, where each item is a port number or a port range connected by a hyphen. |
|  | PyTorch | Lets users request a number of network ports to open on each worker. It can be used together with the worker communication parameter above. If not configured, only port 23456 is opened on the master by default. DLC randomly allocates ports for the worker based on the requested number and passes the allocated port numbers to the worker through an environment variable. | Integer (maximum 65536) |
|  | Ray | When the framework is Ray, you can manually configure RayRuntimeEnv to define the runtime environment. Important: The environment variable and third-party library configurations are overwritten by this configuration. | Environment variables and third-party libraries |
|  | Ray | External GCS Redis address. | String |
|  | Ray | External GCS Redis username. | String |
|  | Ray | External GCS Redis password. | String |
|  | Ray | Number of retries for the Submitter. | Positive integer (int) |
|  | Ray | Configures shared memory for a node, for example, 1 GiB of shared memory for each node. | Positive integer (int) |