Accelerate
Accelerate is a library designed to simplify PyTorch distributed training on any type of setup by uniting the most common distributed training frameworks, Fully Sharded Data Parallel (FSDP) and DeepSpeed, under a single interface. Trainer is powered by Accelerate under the hood, enabling big model loading and distributed training.
This guide shows two ways to use Accelerate with Transformers, using FSDP as the backend. The first method demonstrates distributed training with Trainer, and the second method demonstrates adapting a native PyTorch training loop. For more detailed information about Accelerate, refer to the Accelerate documentation.
pip install accelerate
Start by running accelerate config in the command line to answer a series of prompts about your training system. This creates and saves a configuration file to help Accelerate correctly set up training based on your setup.
accelerate config
Depending on your setup and the answers you provide, an example configuration file for distributed training with FSDP on one machine with two GPUs may look like the following.
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_transformer_layer_cls_to_wrap: BertLayer
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Trainer
Pass the path to the saved configuration file to the fsdp_config parameter in TrainingArguments, and then pass your TrainingArguments to Trainer.
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="your-model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    fsdp_config="path/to/fsdp_config",
    fsdp="full_shard",
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
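The snippet above assumes the model, dataset, tokenizer, data collator, and metrics function already exist. Below is a minimal sketch of one possible setup, using a hypothetical BERT sequence-classification fine-tune on the IMDb dataset; the checkpoint and dataset are illustrative placeholders, not part of the original example.

import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
)

# Placeholder checkpoint and dataset -- swap in your own.
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Tokenize the raw text; the "train" and "test" splits are used by Trainer above.
raw_dataset = load_dataset("imdb")
dataset = raw_dataset.map(
    lambda examples: tokenizer(examples["text"], truncation=True), batched=True
)

# Dynamically pad batches to the longest sequence in each batch.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Simple accuracy metric for evaluation.
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)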
Native PyTorch
Accelerate can also be added to any PyTorch training loop to enable distributed training. The Accelerator is the main entry point for adapting your PyTorch code to work with Accelerate. It automatically detects your distributed training setup and initializes all the necessary components for training. You don’t need to explicitly place your model on a device because Accelerator knows which device to move your model to.
from accelerate import Accelerator

accelerator = Accelerator()
device = accelerator.device
All PyTorch objects (model, optimizer, scheduler, dataloaders) should now be passed to the prepare method. This method moves your model to the appropriate device or devices, adapts the optimizer and scheduler to use AcceleratedOptimizer and AcceleratedScheduler, and creates a new shardable dataloader.
train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)
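For context, the objects passed to prepare can be created with standard PyTorch and Transformers code. Here is a minimal sketch, assuming a hypothetical sequence-classification task and that tokenized_train_dataset and tokenized_eval_dataset already exist; none of these names come from the original example.

import torch
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
from transformers import AutoModelForSequenceClassification, get_scheduler

# Placeholder model -- adjust to your task.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Assumes tokenized_train_dataset and tokenized_eval_dataset are already prepared.
train_dataloader = DataLoader(tokenized_train_dataset, shuffle=True, batch_size=16)
eval_dataloader = DataLoader(tokenized_eval_dataset, batch_size=16)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

num_epochs = 2
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

# Progress bar used in the training loop below.
progress_bar = tqdm(range(num_training_steps))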
Replace loss.backward in your training loop with Accelerate's backward method, which scales the gradients and determines the appropriate backward method to use for your framework (for example, DeepSpeed or Megatron).
for epoch in range(num_epochs):
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
Combine everything into a function and make it callable as a script.
from accelerate import Accelerator

def main():
    accelerator = Accelerator()

    model, optimizer, training_dataloader, scheduler = accelerator.prepare(
        model, optimizer, training_dataloader, scheduler
    )

    for batch in training_dataloader:
        optimizer.zero_grad()
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()

if __name__ == "__main__":
    main()
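The skeleton above leaves the model, dataloader, scheduler, and loss function undefined. For reference, here is a more complete, self-contained sketch adapted to a Transformers model; the checkpoint, dataset, and hyperparameters are illustrative placeholders under the same assumptions as earlier, not part of the original example.

import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    get_scheduler,
)


def main():
    accelerator = Accelerator()

    # Placeholder checkpoint and dataset -- swap in your own.
    checkpoint = "bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

    dataset = load_dataset("imdb", split="train")
    dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True), batched=True)
    dataset = dataset.remove_columns(["text"]).rename_column("label", "labels")
    dataset.set_format("torch")

    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    train_dataloader = DataLoader(dataset, shuffle=True, batch_size=16, collate_fn=data_collator)

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    num_epochs = 2
    num_training_steps = num_epochs * len(train_dataloader)
    scheduler = get_scheduler(
        "linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
    )

    model, optimizer, train_dataloader, scheduler = accelerator.prepare(
        model, optimizer, train_dataloader, scheduler
    )

    model.train()
    for epoch in range(num_epochs):
        for batch in train_dataloader:
            optimizer.zero_grad()
            outputs = model(**batch)    # Transformers models return the loss when labels are passed
            loss = outputs.loss
            accelerator.backward(loss)  # scales gradients and uses the right backward for the backend
            optimizer.step()
            scheduler.step()


if __name__ == "__main__":
    main()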
From the command line, call accelerate launch to run your training script. Any additional arguments or parameters can be passed here as well.
To launch your training script on two GPUs, add the --num_processes argument.
accelerate launch --num_processes=2 your_script.py
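If you are working inside a notebook instead of the command line, Accelerate also provides notebook_launcher, which starts a training function on multiple processes. A minimal sketch, assuming the main() function defined above takes no arguments:

from accelerate import notebook_launcher

# Run main() on 2 processes (for example, 2 GPUs) from a notebook.
notebook_launcher(main, args=(), num_processes=2)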
Refer to the Launching Accelerate scripts guide for more details.