PAI-DLC lets you quickly create standalone or distributed training jobs. It uses Kubernetes to launch compute nodes, which eliminates the need to manually purchase machines and configure runtime environments, so you can keep your existing workflow. The service is ideal for users who need to start training jobs quickly; it supports multiple deep learning frameworks and offers flexible resource configuration options.
Prerequisites
Activate PAI and create a workspace using your Alibaba Cloud account. To do this, log on to the PAI console, select a region at the top of the page, and then complete the one-click authorization and activation. For more information, see Activate PAI and create a workspace.
You must grant permissions to the operating account. If you use an Alibaba Cloud account, you can skip this step. If you use a Resource Access Management (RAM) user, the user must have one of the following roles: Algorithm Developer, Algorithm O&M, or Workspace Administrator. For instructions, see the Member role configuration section in Manage workspaces.
Create a job in the console
If you are new to DLC, we recommend that you create jobs using the console. You can also create jobs using an SDK or the command line.
Go to the Create Job page.
Log on to the PAI console. At the top of the page, select the destination region and the target workspace. Then, click Enter DLC.
On the Deep Learning Containers (DLC) page, click Create Job.
Configure the parameters for the training job in the following sections.
Basic information
Configure the Job Name and Labels.
Environment context
Parameter
Description
Node Image
In addition to Official Images, the following image types are also supported:
Custom Image: You can use a custom image that is added to PAI. The image repository must be set to allow public pulls, or the image must be stored in Container Registry (ACR). For more information, see Custom images.
Note: When you select Lingjun resources for the resource quota and use a custom image, you must manually install RDMA to fully use the high-performance RDMA network of Lingjun. For more information, see RDMA: Use a high-performance network for distributed training.
Image URL: You can configure the URL of a custom image or an official image that is accessible over the Internet.
If the URL points to a private image, click Enter Username And Password and configure the username and password for the image repository.
To improve the image pull speed, see Accelerate image pulling.
Dataset
Datasets provide the data files required for model training. Two types of datasets are supported:
Custom Dataset: You can create a custom dataset to store the required data files. You can set whether the dataset is Read-only and select a dataset version from the Version List.
Public Dataset: PAI provides public datasets that can only be mounted in read-only mode.
Mount Path: The path where the dataset is mounted in the DLC container, such as `/mnt/data`. Your code reads the dataset from this path (a short read example appears after this section). For more information about mount configurations, see Use cloud storage in DLC training jobs.
Important: If you configure a CPFS dataset, you must configure a virtual private cloud (VPC) for DLC that is the same as the VPC of the CPFS. Otherwise, the submitted job may remain in the preparing state for a long time.
Direct Mount
You can also directly mount a data source path to read required data or store intermediate and result files.
Supported data source types: OSS, General-purpose NAS file system, Extreme NAS file system, and BMCPFS (only available for Lingjun resources).
Advanced Configuration: You can use advanced configurations to enable specific features for different data source types. For example:
OSS: Set `{"mountType":"ossfs"}` in the advanced configuration to mount OSS storage using ossfs.
General-purpose NAS and CPFS: Set the nconnect parameter in the advanced configuration to improve the throughput for DLC containers that access NAS, for example, `{"nconnect":"<example_value>"}`. Replace <example_value> with a positive integer. For more information, see How do I resolve poor performance when accessing NAS on a Linux OS?.
For more information, see Use cloud storage in DLC training jobs.
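As an example of the mount behavior described above, a dataset mounted in the container is visible to your code as an ordinary local directory. A minimal sketch, assuming the example mount path `/mnt/data` from above:

```python
import os

data_dir = '/mnt/data'  # The Mount Path configured for the dataset.

# Mounted datasets behave like local files and directories.
for name in sorted(os.listdir(data_dir)):
    print(os.path.join(data_dir, name))
```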
Startup Command
Set the startup command for the job. Shell commands are supported. DLC automatically injects common environment variables for PyTorch and TensorFlow, such as `MASTER_ADDR` and `WORLD_SIZE`, which you can reference as `$variable_name`. The following are examples of startup commands:

Run Python:

```
python -c "print('Hello World')"
```

PyTorch multi-node, multi-GPU distributed training:

```
python -m torch.distributed.launch \
    --nproc_per_node=2 \
    --master_addr=${MASTER_ADDR} \
    --master_port=${MASTER_PORT} \
    --nnodes=${WORLD_SIZE} \
    --node_rank=${RANK} \
    train.py --epochs=100
```

Set a shell script path as the startup command:

```
/ml/input/config/launch.sh
```
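For reference, training code can also read the injected variables directly from the process environment. A minimal Python sketch; the variable names are those listed above, and the fallback values are illustrative only:

```python
import os

# DLC injects these variables into each node of a distributed job.
master_addr = os.environ.get('MASTER_ADDR', 'localhost')
master_port = int(os.environ.get('MASTER_PORT', '23456'))
world_size = int(os.environ.get('WORLD_SIZE', '1'))
rank = int(os.environ.get('RANK', '0'))

print(f'rank {rank}/{world_size} reports to {master_addr}:{master_port}')
```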
Resource information
Parameter
Description
Resource Type
The default value is General Computing. Only the China (Ulanqab), Singapore, China (Shenzhen), China (Beijing), China (Shanghai), and China (Hangzhou) regions support selecting Lingjun AI Computing Service.
Resource Source
Public Resources:
Billing method: Pay-as-you-go.
Scenarios: Public resources may experience queuing delays. We recommend using them for scenarios with a relatively small number of jobs and low timeliness requirements.
Limits: The maximum resources supported for a job are 2 GPUs and 8 CPU cores. To exceed this limit, contact your business manager to increase the resource limit.
Resource Quota: Includes general computing resources or Lingjun resources.
Billing method: Subscription.
Scenarios: Suitable for scenarios with a relatively large number of jobs that require high-assurance execution.
Special parameters:
Resource Quota: You can set the number of resources such as GPUs and CPUs. To prepare a resource quota, see Add a resource quota.
Priority: The execution priority of concurrently running jobs. The value ranges from 1 to 9, where 1 is the lowest priority.
Preemptible Resources:
Billing method: Pay-as-you-go.
Scenarios: If you want to reduce resource costs, you can use preemptible resources, which usually offer a certain discount.
Limits: Stable availability is not guaranteed. Resources may not be immediately preempted or may be reclaimed. For more information, see Use preemptible jobs.
Framework
The following deep learning training frameworks and tools are supported: TensorFlow, PyTorch, ElasticBatch, XGBoost, OneFlow, MPIJob, and Ray.
Note: When you select Lingjun resources for the Resource Quota, you can submit only TensorFlow, PyTorch, ElasticBatch, MPIJob, and Ray jobs.
Job Resources
Based on the Framework you select, you can configure resources for nodes of type Worker, PS, Chief, Evaluator, and GraphLearn. When you select the Ray framework, you can click Add Role to customize the Worker role to run a mix of heterogeneous resources.
Use public resources: You can configure the following parameters:
Number Of Nodes: The number of nodes to run the DLC job.
Resource Specifications: You can select resource specifications. The console displays the corresponding prices. For more billing information, see DLC billing.
Use a resource quota: You can configure the number of nodes, CPU (cores), GPU (cards), memory (GiB), and shared memory (GiB) for each node type. You can also configure the following special parameters:
Schedule On Specified Nodes: You can run jobs on specified compute nodes.
Idle Resources: When you use idle resources, jobs can run on the idle resources of other quotas, which improves resource utilization. However, when the jobs of the original quota need these resources back, the job that runs on idle resources is terminated and the resources are automatically returned. For more information, see Use idle resources.
CPU Affinity: Enabling CPU affinity binds the processes in a container or pod to specific CPU cores for execution. This can reduce CPU cache misses and context switches, thereby improving CPU utilization and application performance. It is suitable for performance-sensitive and real-time scenarios.
Use preemptible resources: In addition to the number of nodes and resource specifications, you can also configure the Bid parameter, which sets the maximum bid for requesting preemptible resources. You can select one of the following bidding methods:
By discount: The maximum price is based on the market price of the resource specification, with discrete options from a 10% to 90% discount. This represents the upper limit for bidding. When the maximum bid for preemptible resources is greater than or equal to the market price and inventory is sufficient, the request for preemptible resources can be granted.
By price: The maximum bid price is within the market price range.
VPC configuration
If you do not configure a VPC, the public network and a public gateway are used. Because the public gateway has limited bandwidth, jobs may experience performance issues or fail to run.
After you configure a VPC and select the corresponding vSwitch and security group, network bandwidth, stability, and security are improved. The cluster where the job runs can then directly access services within this VPC.
Important: When you use a VPC, ensure that the job's resource group instances and the dataset storage (OSS) are in the same VPC and region, and that the network can reach the code repository.
When you use a CPFS dataset, you must configure a VPC. The selected VPC must be the same as the VPC of the CPFS. Otherwise, the submitted DLC training job may remain in the Preparing state for an extended period.
When you submit a DLC job using Lingjun preemptible resources, you must configure a VPC.
You can also configure a Public Network Access Gateway in one of the following two ways:
Public Gateway: The network bandwidth is limited and may be insufficient for high concurrency or for downloading large files.
Private Gateway: To overcome the bandwidth limitations of a public gateway, you can create an Internet NAT gateway in the DLC's VPC, attach elastic IP addresses (EIPs), and configure an SNAT entry. For more information, see Improve public network access speed through a private gateway.
Fault tolerance and diagnosis
Parameter
Description
Automatic Fault Tolerance
Turn on the Automatic Fault Tolerance switch and configure the parameters. The system will provide job detection and control capabilities to promptly detect and avoid errors at the job's algorithm layer, thereby improving GPU utilization. For more information, see AIMaster: An elastic and automatic fault tolerance engine.
Note: After you enable automatic fault tolerance, the system starts an AIMaster instance that runs alongside the job instance and consumes some computing resources. The resource usage of the AIMaster instance is as follows:
Resource quota: 1 CPU core and 1 GiB of memory.
Public resources: Uses the ecs.c6.large specification.
Health Check
Turn on the Health Check switch. The health check performs a comprehensive inspection of the resources involved in the training, automatically isolates faulty nodes, and triggers an automated O&M process in the background. This effectively reduces the possibility of encountering problems in the early stages of job training and improves the training success rate. For more information, see SanityCheck: Computing power health check.
Note: The health check feature is supported only for PyTorch training jobs that are submitted based on a Lingjun resource quota, and only when the number of GPUs is greater than 0.
Roles and permissions
The following section describes the instance RAM role configuration. For more information about this feature, see Configure a DLC RAM role.
Instance RAM role
Description
PAI Default Role
Operates based on the AliyunPAIDLCDefaultRole service role, which has fine-grained permissions that cover only MaxCompute (ODPS) and OSS access. A temporary access credential issued based on the PAI default role:
Has the same permissions as the DLC instance owner when accessing MaxCompute tables.
Can only access the default OSS bucket configured for the current workspace when accessing OSS.
Custom Role
Select or enter a custom RAM role. When accessing cloud products based on a temporary STS credential within the instance, the permissions will be the same as those of the custom role.
Do Not Associate A Role
Does not associate a RAM role with the DLC job. This is the default option.
After configuring the parameters, click OK.
References
After you submit a training job, you can perform the following operations:
View the basic information, resource usage, and operation logs of the job. For more information, see View training details.
Manage jobs, including cloning, stopping, and deleting jobs. For more information, see Manage training jobs.
View the results analysis report using TensorBoard. For more information, see Visualization tool TensorBoard.
Set up monitoring and alerts for jobs. For more information, see Training monitoring and alerts.
View the billing details for the job. For more information, see Bill details.
Forward the logs of DLC jobs in the current workspace to a specific Simple Log Service (SLS) Logstore for custom analysis. For more information, see Subscribe to task logs.
Create a message notification rule in the Event Hub of the PAI workspace to track and monitor the status of DLC jobs. For more information, see Notifications.
For information about potential issues and their solutions when you run DLC jobs, see DLC FAQ.
For use cases of DLC, see DLC use cases.
Appendix
Create a job using an SDK or the command line
Python SDK
Step 1: Install the Alibaba Cloud Credentials tool
When you use an Alibaba Cloud SDK to call OpenAPI for resource operations, you must install the Credentials tool to configure your credential information. The requirements are as follows:
Python 3.7 or later.
Alibaba Cloud SDK V2.0.
pip install alibabacloud_credentials
Step 2: Obtain an AccessKey
This example uses AccessKey information to configure access credentials. To prevent your account information from being exposed, we recommend that you configure the AccessKey as environment variables. The environment variable names for the ID and secret are ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET, respectively.
To obtain AccessKey information, see Create an AccessKey.
To set environment variables, see Configure environment variables.
For other credential configuration methods, see Install the Credentials tool.
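A minimal sketch showing that the Credentials tool picks up the two environment variables automatically, assuming both are set before the process starts:

```python
import os

from alibabacloud_credentials.client import Client as CredClient

# The Credentials tool reads these variables from the environment; fail early
# if they are missing so that API calls do not fail later with an opaque error.
for var in ('ALIBABA_CLOUD_ACCESS_KEY_ID', 'ALIBABA_CLOUD_ACCESS_KEY_SECRET'):
    if not os.environ.get(var):
        raise RuntimeError(f'{var} is not set')

cred = CredClient()  # Uses the default credential provider chain.
```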
Step 3: Install the Python SDK
Install the workspace SDK.
pip install alibabacloud_aiworkspace20210204==3.0.1
Install the DLC SDK.
pip install alibabacloud_pai_dlc20201203==1.4.17
Step 4: Submit the job
Submit a job using public resources
The following code shows how to create and submit a job.
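A minimal sketch for public resources, assuming that the `ecs_spec` field of `JobSpec` selects the node specification (as the `EcsSpec` key does in the preemptible-resource examples later in this appendix) and that no resource quota ID is passed; treat it as a starting point rather than a definitive sample:

```python
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_tea_openapi.models import Config
from alibabacloud_pai_dlc20201203.client import Client
from alibabacloud_pai_dlc20201203.models import CreateJobRequest, JobSpec

region = 'cn-hangzhou'
cred = CredClient()  # Reads the AccessKey from environment variables.
client = Client(config=Config(
    credential=cred,
    region_id=region,
    endpoint=f'pai-dlc.{region}.aliyuncs.com',
))

# With public resources, select a node specification instead of a resource quota.
# The ecs_spec value here is illustrative; pick a specification shown in the console.
spec = JobSpec(
    type='Worker',
    image='registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.15-cpu-py36-ubuntu18.04',
    pod_count=1,
    ecs_spec='ecs.c6.large',
)

req = CreateJobRequest(
    workspace_id='<Replace with your WorkspaceID>',
    display_name='sample-dlc-public-job',
    job_type='TFJob',
    job_specs=[spec],
    user_command='echo "Hello World"',
)
print('job id:', client.create_job(req).body.job_id)
```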
Submit a job using a subscription resource quota
Log on to the PAI console.
On the workspace list page, view your workspace ID.
View the resource quota ID of your dedicated resource group.
Use the following code to create and submit a job. For a list of available public images, see Step 2: Prepare an image.
```python
from alibabacloud_pai_dlc20201203.client import Client
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_tea_openapi.models import Config
from alibabacloud_pai_dlc20201203.models import (
    CreateJobRequest,
    JobSpec,
    ResourceConfig,
    GetJobRequest,
)

# Initialize a client to access the DLC API.
region = 'cn-hangzhou'

# The AccessKey of an Alibaba Cloud account has permissions to access all APIs. We
# recommend that you use a RAM user for API access or daily O&M.
# We strongly recommend that you do not save your AccessKey ID and AccessKey secret in
# your project code. This can lead to an AccessKey leak and threaten the security of all
# resources in your account.
# This example uses the Credentials SDK to read the AccessKey from environment variables
# for identity verification.
cred = CredClient()
client = Client(
    config=Config(
        credential=cred,
        region_id=region,
        endpoint=f'pai-dlc.{region}.aliyuncs.com',
    )
)

# Declare the resource configuration for the job. For image selection, refer to the
# public image list in the documentation or provide your own image URL.
spec = JobSpec(
    type='Worker',
    image='registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.15-cpu-py36-ubuntu18.04',
    pod_count=1,
    resource_config=ResourceConfig(cpu='1', memory='2Gi'),
)

# Declare the execution content of the job.
req = CreateJobRequest(
    resource_id='<Replace with your resource quota ID>',
    workspace_id='<Replace with your WorkspaceID>',
    display_name='sample-dlc-job',
    job_type='TFJob',
    job_specs=[spec],
    user_command='echo "Hello World"',
)

# Submit the job and get the job ID.
response = client.create_job(req)
job_id = response.body.job_id

# Query the job status.
job = client.get_job(job_id, GetJobRequest()).body
print('job status:', job.status)

# View the command that the job runs.
print(job.user_command)
```
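The status query above returns a single snapshot. To block until the job reaches a terminal state, you can poll the same `get_job` call; a sketch that reuses `client` and `job_id` from the sample above (the terminal status strings are assumptions; check the values your jobs actually return):

```python
import time

from alibabacloud_pai_dlc20201203.models import GetJobRequest

# Poll until the job leaves the queued/running states.
while True:
    status = client.get_job(job_id, GetJobRequest()).body.status
    print('job status:', status)
    if status in ('Succeeded', 'Failed', 'Stopped'):
        break
    time.sleep(30)
```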
Submit a job using preemptible resources
SpotDiscountLimit (Spot discount)
```python
#!/usr/bin/env python3
from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import CreateJobRequest

region_id = '<region-id>'  # The ID of the region in which the DLC job resides, such as cn-hangzhou.
cred = CredClient()
workspace_id = '12****'  # The ID of the workspace to which the DLC job belongs.

dlc_client = DLCClient(
    Config(credential=cred,
           region_id=region_id,
           endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
           protocol='http'))

create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
    'WorkspaceId': workspace_id,
    'DisplayName': 'sample-spot-job',
    'JobType': 'PyTorchJob',
    'JobSpecs': [
        {
            "Type": "Worker",
            "Image": "dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/pytorch-training:1.12-cpu-py39-ubuntu20.04",
            "PodCount": 1,
            "EcsSpec": 'ecs.g7.xlarge',
            "SpotSpec": {
                "SpotStrategy": "SpotWithPriceLimit",
                "SpotDiscountLimit": 0.4,
            }
        },
    ],
    'UserVpc': {
        "VpcId": "vpc-0jlq8l7qech3m2ta2****",
        "SwitchId": "vsw-0jlc46eg4k3pivwpz8****",
        "SecurityGroupId": "sg-0jl4bd9wwh5auei9****",
    },
    "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
}))
job_id = create_job_resp.body.job_id
print(f'jobId is {job_id}')
```
SpotPriceLimit (Spot price)
```python
#!/usr/bin/env python3
from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import CreateJobRequest

region_id = '<region-id>'
cred = CredClient()
workspace_id = '12****'

dlc_client = DLCClient(
    Config(credential=cred,
           region_id=region_id,
           endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
           protocol='http'))

create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
    'WorkspaceId': workspace_id,
    'DisplayName': 'sample-spot-job',
    'JobType': 'PyTorchJob',
    'JobSpecs': [
        {
            "Type": "Worker",
            "Image": "dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/pytorch-training:1.12-cpu-py39-ubuntu20.04",
            "PodCount": 1,
            "EcsSpec": 'ecs.g7.xlarge',
            "SpotSpec": {
                "SpotStrategy": "SpotWithPriceLimit",
                "SpotPriceLimit": 0.011,
            }
        },
    ],
    'UserVpc': {
        "VpcId": "vpc-0jlq8l7qech3m2ta2****",
        "SwitchId": "vsw-0jlc46eg4k3pivwpz8****",
        "SecurityGroupId": "sg-0jl4bd9wwh5auei9****",
    },
    "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
}))
job_id = create_job_resp.body.job_id
print(f'jobId is {job_id}')
```
The following table describes the key parameters:
| Parameter | Description |
| --- | --- |
| SpotStrategy | The bidding policy. The bidding type parameters take effect only if you set this parameter to SpotWithPriceLimit. |
| SpotDiscountLimit | The spot discount bidding type. |
| SpotPriceLimit | The spot price bidding type. |
| UserVpc | This parameter is required when you use Lingjun resources to submit jobs. Configure the VPC, vSwitch, and security group ID for the region in which the job resides. |
Command line
Step 1: Download the client and perform user authentication
Download the Linux 64-bit or macOS version of the client tool based on your operating system and complete user authentication. For more information, see Preparations.
Step 2: Submit the job
Log on to the PAI console.
On the workspace list page, view your workspace ID.
View your resource quota ID.
Prepare the parameter file `tfjob.params` by referring to the following content. For more information about how to configure the parameter file, see Submit command.

```
name=test_cli_tfjob_001
workers=1
worker_cpu=4
worker_gpu=0
worker_memory=4Gi
worker_shared_memory=4Gi
worker_image=registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu16.04
command=echo good && sleep 120
resource_id=<Replace with your resource quota ID>
workspace_id=<Replace with your WorkspaceID>
```
Use the following command to submit the job by passing the parameter file. The DLC job is submitted to the specified workspace and resource quota.
./dlc submit tfjob --job_file ./tfjob.params
Use the following command to view the DLC job that you submitted.
./dlc get job <jobID>
Advanced parameter list
| Parameter (key) | Supported framework types | Parameter description | Parameter value (value) |
| --- | --- | --- | --- |
|  | ALL | Configures a custom resource release rule. This is optional. If not configured, all pod resources are released when the job ends. If configured, only pod-exit is currently supported, which releases a pod's resources when the pod exits. | pod-exit |
|  | ALL | Specifies whether to enable the IBGDA feature when loading the GPU driver. |  |
|  | ALL | Specifies whether to install the GDRCopy kernel module. Version 2.4.4 is currently installed. |  |
|  | ALL | Specifies whether to enable NUMA. |  |
|  | ALL | When a job is submitted, checks whether the total resources (node specifications) in the quota can meet the specifications of all roles in the job. |  |
|  | PyTorch | Specifies whether to allow network communication between workers. After this is enabled, the domain name of each worker is its worker name. |  |
|  | PyTorch | Lets users define the network ports to open on each worker. It can be used together with the worker communication parameter above. If not configured, only port 23456 is opened on the master by default; therefore, avoid including port 23456 in this custom port list. Important: This parameter and the port-count parameter below are mutually exclusive. | A semicolon-separated string, where each item is a port number or a port range connected by a hyphen. |
|  | PyTorch | Lets users request a number of network ports to open on each worker. It can be used together with the worker communication parameter above. If not configured, only port 23456 is opened on the master by default. DLC randomly allocates ports for the worker based on the requested number and passes the allocated port numbers to the worker through an environment variable. | Integer (maximum 65536) |
|  | Ray | When the framework is Ray, you can manually configure RayRuntimeEnv to define the runtime environment. Important: The environment variable and third-party library configurations are overwritten by this configuration. | Environment variables and third-party libraries |
|  | Ray | External GCS Redis address. | String |
|  | Ray | External GCS Redis username. | String |
|  | Ray | External GCS Redis password. | String |
|  | Ray | Number of retries for the Submitter. | Positive integer (int) |
|  | Ray | Configures shared memory for a node, for example, 1 GiB of shared memory for each node. | Positive integer (int) |