Vertex AI training clusters use the Simple Linux Utility for Resource Management (Slurm) as the orchestrator for managing and scheduling jobs on your cluster.
Slurm is a widely-used, open-source cluster management and job scheduling system known for its scalability and fault tolerance.
Key capabilities of Slurm
- Slurm allocates a set of compute nodes for the exclusive use of a specific job for a defined period. This ensures a job has dedicated access to the resources it needs to run without interference.
- Slurm provides a framework for managing the complete lifecycle of a job—from submission and execution to monitoring and completion. This system is specifically designed to handle parallel jobs that run across a set of allocated nodes.
- Slurm maintains a queue of pending jobs, using a sophisticated prioritization engine to arbitrate access to compute resources. By considering factors like job size, user priority, and wait time, this system ensures fair and efficient resource utilization across the cluster.
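For orientation, the following is a minimal, hypothetical batch script that exercises these capabilities; the partition name, resource requests, and file name are placeholders rather than values from your cluster.

```bash
#!/bin/bash
# Hypothetical batch job: request two nodes for one hour on a placeholder partition.
#SBATCH --job-name=example-job
#SBATCH --partition=a4
#SBATCH --nodes=2
#SBATCH --time=01:00:00

# Print the hostname of each allocated node.
srun hostname
```

You would submit this script with sbatch example-job.sh and follow its progress through the queue with squeue -u $USER.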
Basic cluster configuration
Before you can run jobs, you must define the fundamental structure of your Slurm cluster. This section details the essential configuration settings, including how to organize compute nodes into partitions, specify a dedicated login node pool, and configure a shared home directory for your users.
Partitions
Partitions group nodes into logical sets, which can be useful for managing different machine types or access tiers. They are defined as a list within the partitions field of the slurm_spec.
Each partition object has the following required fields:
- id: A unique identifier for the partition.
- node_pool_ids: A list containing the IDs of one or more node pools that belong to this partition.
For example:
"partitions": [ { "id": "a4", "node_pool_ids": [ "a4" ] } ] Login nodes
Login nodes

The login node pool provides dedicated nodes that serve as the primary entry point for users to interact with the cluster. The login_node_pool_id field specifies the unique identifier for this pool.
For example:
"login_node_pool_id": "login" Home directory storage
The home_directory_storage field specifies the Filestore instance to be mounted as the /home directory on all nodes in the cluster. This provides a shared, persistent home directory for all users.
You must provide the full resource name of the Filestore instance for this value.
For example:
"home_directory_storage": "projects/PROJECT_ID/locations/REGION-ZONE/instances/FILESTORE_INSTANCE_NAME"
Advanced Slurm configuration
Vertex AI training clusters lets you customize a select set of slurm.conf parameters, but be aware that these settings can only be configured during initial cluster creation and can't be changed afterward.
Accounting
Vertex AI training clusters lets you use built-in accounting features to track resource usage within your cluster. For a complete guide on how to monitor metrics like job-specific CPU time and memory usage, review the official Slurm accounting documentation.
| Parameter | Value | Example |
|---|---|---|
| AccountingStorageEnforce | Comma-separated strings | associations,limits,qos |
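Once accounting is enabled on the cluster, the standard Slurm accounting commands are available. For example, you can report a job's elapsed time, CPU time, and peak memory with sacct; the job ID below is a placeholder:

```bash
# Report elapsed time, total CPU time, and peak resident memory for a job.
sacct -j 12345 --format=JobID,JobName,Partition,Elapsed,TotalCPU,MaxRSS,State
```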
Preemption and priority
To manage how jobs are scheduled and prioritized, Vertex AI training clusters lets you configure Slurm's job preemption. Preemption works with the multifactor priority plugin to determine if running jobs should be paused to make way for higher-priority work.
For a complete conceptual overview, review the official Slurm documentation on the multifactor priority plugin and preemption.
Preemption parameters
| Parameter | Value | Example |
|---|---|---|
| PREEMPT_TYPE | String | preempt/partition_prio |
| PREEMPT_MODE | Comma-separated strings | SUSPEND,GANG |
| PREEMPT_EXEMPT_TIME | String | 00:00:00 |
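Because these settings can't be changed after cluster creation, it can be worth confirming what was applied. One way is to read the live configuration from the controller:

```bash
# Print the preemption-related settings from the running Slurm configuration.
scontrol show config | grep -i preempt
```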
Priority parameters
| Parameter | Value | Example |
|---|---|---|
| PRIORITY_TYPE | String | priority/multifactor |
| PRIORITY_WEIGHT_AGE | Integer | 0 |
| PRIORITY_WEIGHT_ASSOC | Integer | 0 |
| PRIORITY_WEIGHT_FAIRSHARE | Integer | 0 |
| PRIORITY_WEIGHT_JOB_SIZE | Integer | 0 |
| PRIORITY_WEIGHT_PARTITION | Integer | 0 |
| PRIORITY_WEIGHT_QOS | Integer | 0 |
| PRIORITY_WEIGHT_TRES | Comma-separated strings | cpu=100,mem=150 |
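The applied priority settings can be confirmed the same way by filtering the live configuration for the priority parameters:

```bash
# Print the multifactor priority settings from the running Slurm configuration.
scontrol show config | grep -i '^priority'
```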
Prolog and epilog scripts
You can configure custom Bash scripts to run automatically at the start (prolog) and end (epilog) of each job using the following fields:
- prolog_bash_scripts: A list of strings, where each string contains the full content of a Bash script to be executed before the job begins.
- epilog_bash_scripts: A list of strings, where each string contains the full content of a Bash script to be executed after the job completes.
This is useful for setting up a unique job environment or performing automated cleanup tasks.
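As a hypothetical illustration, a prolog could create a per-job scratch directory and the matching epilog could remove it; Slurm exports the job ID to both scripts as SLURM_JOB_ID:

```bash
#!/bin/bash
# Hypothetical prolog: create a scratch directory for the job.
mkdir -p "/tmp/scratch-${SLURM_JOB_ID}"
```

```bash
#!/bin/bash
# Hypothetical epilog: clean up the scratch directory created by the prolog.
rm -rf "/tmp/scratch-${SLURM_JOB_ID}"
```

In the cluster specification, each script is passed as a single string (with newlines escaped as \n) in the corresponding list, as shown in the example below.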
Example cluster specification
The following example shows a complete JSON configuration for creating a training cluster. You can adapt this specification for your own needs.
```json
{
  // ... other cluster configurations ...
  "orchestratorSpec": {
    "slurmSpec": {
      "partitions": [
        {
          "id": "a4",
          "node_pool_ids": ["a4"]
        }
      ],
      "login_node_pool_id": "login",
      "home_directory_storage": "projects/PROJECT_ID/locations/REGION-ZONE/instances/FILESTORE_INSTANCE_ID",
      "accounting": {
        "accounting_storage_enforce": "ACCOUNTING_STORAGE_ENFORCE"
      },
      "scheduling": {
        "priority_type": "PRIORITY_TYPE",
        "priority_weight_age": PRIORITY_WEIGHT_AGE,
        "priority_weight_assoc": PRIORITY_WEIGHT_ASSOC,
        "priority_weight_fairshare": PRIORITY_WEIGHT_FAIRSHARE,
        "priority_weight_job_size": PRIORITY_WEIGHT_JOB_SIZE,
        "priority_weight_partition": PRIORITY_WEIGHT_PARTITION,
        "priority_weight_qos": PRIORITY_WEIGHT_QOS,
        "priority_weight_tres": "PRIORITY_WEIGHT_TRES",
        "preempt_type": "PREEMPT_TYPE",
        "preempt_mode": "PREEMPT_MODE",
        "preempt_exempt_time": "PREEMPT_EXEMPT_TIME"
      },
      "prolog_bash_scripts": [
        "#!/bin/bash\necho 'First prolog script running'",
        "#!/bin/bash\necho 'Second prolog script running'"
      ],
      "epilog_bash_scripts": [
        "#!/bin/bash\necho 'Epilog script running'"
      ]
      // ... other Slurm settings ...
    }
  }
}
```
Cluster management and operations
Managing a running cluster
Once your cluster is created with the chosen accounting and preemption settings, you can use Slurm's command-line tools to manage user accounts and monitor job scheduling.
Account management with sacctmgr
The sacctmgr command is the primary tool for managing user and account information in the Slurm database. For example, to add a new user to an account and grant them access to a partition, run the following command:
```bash
sudo sacctmgr add User Accounts=<account> Partition=<partition> <user>
```

For a comprehensive list of all sacctmgr options, review the official Slurm accounting documentation.
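Other common operations follow the same pattern. For instance, a hypothetical flow might create an account first and then verify the resulting associations; the account name and description below are placeholders:

```bash
# Create an account that users can be associated with.
sudo sacctmgr add account research Description="Research team"

# List user/account associations to verify the result.
sacctmgr show associations format=Cluster,Account,User,Partition
```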
Checking job priority
To check the priority components of each job in the queue, use the sprio utility. This is useful for understanding why certain jobs are scheduled to run before others.
See the sprio utility documentation for detailed usage.
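For example, the following commands show each pending job's priority broken down by factor, along with the weight configured for each factor:

```bash
# Show the priority factors for pending jobs in long format.
sprio -l

# Show the configured weight for each priority factor.
sprio --weights
```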
Preemption examples
The official Slurm documentation provides several working examples of different preemption strategies. You can find these on the Slurm Preemption page.
What's next
The following resources focus on the final steps of the machine learning lifecycle: managing, deploying, and monitoring your trained models.
- Deploy your model for inference: Deploy your trained model to a Vertex AI endpoint to serve online inference requests at scale.
- Manage your model's lifecycle: Use Vertex AI Model Registry to version, compare, and manage your models. A pipeline can be configured to automatically register a new model upon successful training.
- Monitor your pipeline runs and model performance:
- Pipeline Monitoring: Track the execution graph, artifacts, and performance of your pipeline runs to debug issues and optimize your orchestration.
- Model Monitoring: After deployment, set up monitoring to detect drift and anomalies in your model's prediction performance, which helps maintain the model's accuracy over time.
- Optimize costs and manage the cluster lifecycle: When using automated pipelines, manage the cluster's lifecycle by considering run frequency.
- For infrequent runs, add a final pipeline step to delete the cluster to save costs. This typically involves creating a custom pipeline component that calls the delete function.
- For frequent runs, leave the cluster active to reduce job startup time.