@weifengpy

Run FSDP2 on a transformer model:

torchrun --nproc_per_node 2 train.py 
  • On the first run, it creates a "checkpoints" folder and saves the state dicts there
  • On the second run, it loads from the previous checkpoints
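
The first-run/second-run behavior described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in train.py: the `load_or_save` helper and the `state.json` file name are hypothetical, and a JSON file stands in for the model and optimizer state dicts that the real example saves.

```python
import json
from pathlib import Path

CKPT_DIR = Path("checkpoints")  # folder name taken from the description above
CKPT_FILE = "state.json"        # hypothetical file name, for illustration only


def load_or_save(state, ckpt_dir=CKPT_DIR):
    """Load a saved state if one exists; otherwise save `state` and return it.

    Returns (state, loaded), where `loaded` says whether a checkpoint was found.
    """
    path = Path(ckpt_dir) / CKPT_FILE
    if path.exists():
        # Second and later runs: resume from the previous checkpoint.
        return json.loads(path.read_text()), True
    # First run: create the folder and write the state there.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(state))
    return state, False
```

Calling `load_or_save({"step": 0})` twice in a row writes the file on the first call and reads it back on the second, which mirrors running the torchrun command twice.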

To enable explicit prefetching

torchrun --nproc_per_node 2 train.py --explicit-prefetch 

To enable mixed precision

torchrun --nproc_per_node 2 train.py --mixed-precision 

To showcase the DCP API

torchrun --nproc_per_node 2 train.py --dcp-api 
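
For readers unfamiliar with how such switches are usually wired up, here is a small sketch of parsing the three flags shown above with `argparse`. The flag names come from the commands in this description; the `build_parser` function, the help strings, and the assumption that each flag is a plain on/off switch defaulting to off are illustrative and may not match train.py.

```python
import argparse


def build_parser():
    """Build a parser for the three optional switches (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="FSDP2 example (sketch)")
    # Each flag is assumed to be a boolean switch that is off unless passed.
    parser.add_argument("--explicit-prefetch", action="store_true",
                        help="enable explicit prefetching")
    parser.add_argument("--mixed-precision", action="store_true",
                        help="enable mixed precision")
    parser.add_argument("--dcp-api", action="store_true",
                        help="use the DCP API for checkpointing")
    return parser
```

`argparse` turns the dashes into underscores, so `--mixed-precision` is read back as `args.mixed_precision`.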
weifengpy added 2 commits May 8, 2025 16:40
Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
@netlify
netlify bot commented May 8, 2025

Deploy Preview for pytorch-examples-preview canceled.

🔨 Latest commit: d281dcd
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-examples-preview/deploys/681d4749abc5e40008eda968
@weifengpy weifengpy requested review from mori360 and wconstab May 8, 2025 23:54
torchrun --nproc_per_node 2 train.py --mixed-precision
```

To showcse DCP API

typo

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
cd distributed/FSDP2
torchrun --nproc_per_node 2 train.py
```
* For 1st time, it creates a "checkpoints" folder and save state dicts there

save -> saves

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
@weifengpy weifengpy merged commit 7092296 into pytorch:main May 9, 2025
8 checks passed
3 participants