
Conversation

@dggaytan
Contributor

Adding torch accelerator support to FSDP2 example

Updates to FSDP2 example:

  • Script Renaming and Documentation Updates:

    • Renamed train.py to example.py and updated references in README.md to reflect the new filename. Added instructions to install dependencies via requirements.txt before running the example.
  • GPU Verification and Device Initialization:

    • Added a verify_min_gpu_count function to ensure at least two GPUs are available before running the example.
    • Updated device initialization in main() to dynamically detect and configure the device type using torch.accelerator, improving compatibility with different hardware setups (see the sketch right after this list).
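
A minimal sketch of what these two changes might look like (verify_min_gpu_count is the name used in this PR; its body, the get_device helper, and the CPU fallback are assumptions, not the merged code):

import torch

def verify_min_gpu_count(min_gpus: int = 2) -> bool:
    # Assumed check: require at least `min_gpus` visible accelerator devices.
    return torch.accelerator.is_available() and torch.accelerator.device_count() >= min_gpus

def get_device(rank: int) -> torch.device:
    # Detect the accelerator type (cuda, xpu, ...) at runtime instead of hard-coding cuda.
    if torch.accelerator.is_available():
        device_type = torch.accelerator.current_accelerator()
        return torch.device(f"{device_type}:{rank}")
    # Assumed fallback when no accelerator is present.
    return torch.device("cpu")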

New supporting files:

  • Dependency Management:

    • Added a requirements.txt file listing required dependencies (torch>=2.7 and numpy).
  • Script for Running Examples:

    • Introduced run_example.sh to simplify launching the FSDP2 example.
  • Integration into Distributed Examples:

    • Added a new function distributed_FSDP2 in run_distributed_examples.sh to include the FSDP2 example in the distributed testing workflow.

    CC: @msaroufim @malfet @dvrogozh

@netlify

netlify bot commented Jul 21, 2025

Deploy Preview for pytorch-examples-preview canceled.

🔨 Latest commit: 5e960d8
🔍 Latest deploy log: https://app.netlify.com/projects/pytorch-examples-preview/deploys/68826ce9e58ebb000857417b
meta-cla bot added the cla signed label on Jul 21, 2025
torch.distributed.init_process_group(backend="nccl", device_id=device)
if torch.accelerator.is_available():
device_type = torch.accelerator.current_accelerator()
device: torch.device = torch.device(f"{device_type}:{rank}")
Contributor

Why do we need device: torch.device = instead of just device =?

Contributor Author

It was just a flag for me, but I'll change it to use just torch.device

Contributor Author

done :)

Comment on lines 47 to 48
backend = torch.distributed.get_default_backend_for_device(device)
torch.distributed.init_process_group(backend=backend, device_id=device)
Contributor

I think these 2 lines should work for cpu as well. You can simplify the code:

if torch.accelerator.is_available():
    ...
else:
    device = torch.device("cpu")
backend = torch.distributed.get_default_backend_for_device(device)
torch.distributed.init_process_group(backend=backend, device_id=device)
Contributor Author

done
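
Putting the review suggestions together, the initialization presumably ends up roughly like this sketch (the LOCAL_RANK lookup is an assumption about how the example derives the rank):

import os
import torch
import torch.distributed

# Assumes a torchrun launch, which sets LOCAL_RANK for each process.
rank = int(os.environ.get("LOCAL_RANK", 0))

if torch.accelerator.is_available():
    device_type = torch.accelerator.current_accelerator()
    device = torch.device(f"{device_type}:{rank}")
else:
    device = torch.device("cpu")

# Let PyTorch choose the backend that matches the device (e.g. nccl for cuda, gloo for cpu).
backend = torch.distributed.get_default_backend_for_device(device)
torch.distributed.init_process_group(backend=backend, device_id=device)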

Signed-off-by: dggaytan <diana.gaytan.munoz@intel.com>
dggaytan force-pushed the dggaytan/distributed_FSDP2 branch from 1f0d7d3 to 5e960d8 on July 24, 2025 17:27
dggaytan requested a review from dvrogozh on July 24, 2025 17:27
soumith merged commit 5a4ca92 into pytorch:main on Aug 6, 2025
9 checks passed
@soumith
Member

soumith commented Aug 6, 2025

thank you!

