Source: examples/training/torchtitan
TorchTitan: Large-Scale LLM Training with SkyPilot#
TorchTitan is a PyTorch native platform for large-scale LLM training, featuring multi-dimensional parallelisms (FSDP2, Tensor/Pipeline/Context Parallel), distributed checkpointing, torch.compile, and Float8 support.
This example demonstrates how to run TorchTitan on your Kubernetes clusters, or on any hyperscalers and neoclouds, using SkyPilot, in addition to TorchTitan's own instructions for running on Slurm.
Quick start#
Here is how to train Llama 3.1 8B on 2 nodes, each with 8 H100 (or H200) GPUs:
```bash
# Install SkyPilot (if not already installed)
# More cloud setup instructions in: https://docs.skypilot.co/en/latest/getting-started/installation.html
pip install "skypilot[kubernetes,aws]"  # or your cloud: [gcp], [azure], etc.

# Launch a cluster and start training
export HF_TOKEN=...  # if using a gated model from the HF Hub
sky launch -c torchtitan-multinode torchtitan.yaml --env HF_TOKEN

# Tail logs
sky logs torchtitan-multinode

# Terminate the cluster when done
sky down torchtitan-multinode
```

Configuration#
The provided torchtitan.yaml configuration:
Sets up a 2-node cluster with 8 H100 (or H200) GPUs per node
Installs PyTorch nightly and TorchTitan requirements
Downloads the Llama 3.1 tokenizer
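Optionally, before launching you can check where the requested GPUs are available and override the resources from the CLI. A minimal sketch (the H200 override and node count are illustrative, not required by the example):

```bash
# Show which clouds/regions or Kubernetes clusters can provide 8x H100
sky show-gpus H100:8

# Override the accelerator type or node count at launch time
sky launch -c torchtitan-multinode torchtitan.yaml --gpus H200:8 --num-nodes 2 --env HF_TOKEN
```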
Available model configurations#
TorchTitan includes pre-configured training recipes for:
Llama 3.1 8B: llama3_8b.toml

Llama 3.1 70B: llama3_70b.toml

Llama 3.1 405B: llama3_405b.toml
Each configuration file specifies model architecture, parallelism strategies, and training hyperparameters optimized for different scales.
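To see exactly what a recipe configures, or to discover other recipes in the repository, you can inspect the TOML files directly. A quick sketch using the paths referenced above (run locally; the cluster's setup step clones the repo for you):

```bash
git clone https://github.com/pytorch/torchtitan.git
cd torchtitan

# List the Llama 3 training recipes and skim the 8B one
ls torchtitan/models/llama3/train_configs/
cat torchtitan/models/llama3/train_configs/llama3_8b.toml
```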
To use a specific training recipe, you can set it through the CONFIG_FILE env var:
```bash
sky launch -c torchtitan-multinode torchtitan.yaml \
  --env HF_TOKEN \
  --env CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_70b.toml  # relative to the torchtitan repo root
```

Scaling Up#
To scale up your training, you can increase the number of nodes or try larger models:
```bash
# Scale to more nodes
sky launch -c torchtitan-8node torchtitan.yaml --num-nodes 8

# Try different model sizes by overriding CONFIG_FILE
sky launch -c torchtitan-llama3-70b torchtitan.yaml --env CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_70b.toml
```

Why SkyPilot for Distributed Training?#
Simple multi-node setup: SkyPilot automatically provides environment variables (SKYPILOT_NODE_RANK, SKYPILOT_NODE_IPS, etc.) that integrate seamlessly with PyTorch distributed training, with no manual networking configuration needed.

Auto-recovery: Built-in fault tolerance automatically recovers from node failures and spot preemptions, resuming from checkpoints (see the managed-jobs sketch after this list).

Easily run on Kubernetes or clouds without code changes: SkyPilot offers a simple interface to run TorchTitan on any infrastructure:

```bash
sky launch --infra k8s torchtitan.yaml
```
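For long training runs where auto-recovery matters most, one option is to run the same YAML as a SkyPilot managed job, which relaunches the task after node failures or spot preemptions (resuming from a checkpoint also requires checkpointing to be enabled in your TorchTitan config). A minimal sketch, with an arbitrary job name:

```bash
# Run torchtitan.yaml as a managed job; SkyPilot relaunches it after failures or preemptions
sky jobs launch -n torchtitan-job torchtitan.yaml --env HF_TOKEN

# Check job status and stream training logs
sky jobs queue
sky jobs logs -n torchtitan-job
```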
Multi-node training details#
The configuration automatically:
Detects the head node IP and sets it as the master address
Configures the correct node rank for each node
Sets up the distributed environment for PyTorch’s torchrun with key settings:
--nnodes: Uses $SKYPILOT_NUM_NODES to specify the total number of nodes

--nproc_per_node: Uses $SKYPILOT_NUM_GPUS_PER_NODE for GPUs per node

--node_rank: Uses $SKYPILOT_NODE_RANK to identify each node's position

--master_addr: Extracts the head node IP from $SKYPILOT_NODE_IPS

--master_port: Sets the communication port to 8008 for distributed coordination
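To make the mapping concrete, here is roughly what these variables look like on node 0 of the 2-node cluster above (the IP addresses are placeholders). SKYPILOT_NODE_IPS is a newline-separated list, and its first entry becomes torchrun's rendezvous address:

```bash
# Example values (IPs are placeholders):
#   SKYPILOT_NUM_NODES=2
#   SKYPILOT_NUM_GPUS_PER_NODE=8
#   SKYPILOT_NODE_RANK=0
#   SKYPILOT_NODE_IPS="10.0.0.1
#   10.0.0.2"

# The first IP in the list is used as --master_addr, exactly as in torchtitan.yaml below
HEAD_NODE_IP=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
echo "master_addr=$HEAD_NODE_IP node_rank=$SKYPILOT_NODE_RANK nnodes=$SKYPILOT_NUM_NODES"
```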
Included files#
torchtitan.yaml
```yaml
# SkyPilot configuration for TorchTitan multi-node training
# This configuration reproduces the functionality of multinode_trainer.slurm
#
# To launch:
#   sky launch -c torchtitan-cluster sky.yaml
#
# To stop:
#   sky down torchtitan-cluster
#
# To monitor:
#   sky status --refresh

name: torchtitan-multinode

resources:
  accelerators: {H100:8, H200:8}
  disk_size: 1024GB

num_nodes: 2

workdir: .

envs:
  CONFIG_FILE: "./torchtitan/models/llama3/train_configs/llama3_8b.toml"
  HF_TOKEN: ""

setup: |
  git clone https://github.com/pytorch/torchtitan.git
  cd torchtitan
  pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126 --force-reinstall
  pip install -r requirements.txt
  python scripts/download_hf_assets.py --repo_id meta-llama/Llama-3.1-8B --assets tokenizer --hf_token=$HF_TOKEN

run: |
  # Get head node IP (first node in the list)
  HEAD_NODE_IP=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  echo "Head node IP: $HEAD_NODE_IP"

  # SKYPILOT_NODE_RANK is automatically set by SkyPilot
  torchrun \
    --nnodes $SKYPILOT_NUM_NODES \
    --nproc_per_node $SKYPILOT_NUM_GPUS_PER_NODE \
    --node_rank $SKYPILOT_NODE_RANK \
    --master_addr=$HEAD_NODE_IP \
    --master_port=8008 \
    -m torchtitan.train \
    --job.config_file $CONFIG_FILE \
    --training.dataset c4_test
```