@jafraustro (Contributor) commented Jul 8, 2025

Update DDP to use the accelerator API and switch to torchrun for distributed launches

CC: @dvrogozh, @msaroufim

@jafraustro jafraustro marked this pull request as ready for review July 8, 2025 15:06
netlify bot commented Jul 8, 2025

Deploy Preview for pytorch-examples-preview canceled.

Latest commit: afdd3ce
Latest deploy log: https://app.netlify.com/projects/pytorch-examples-preview/deploys/68712b124833f100080d2c69

@dvrogozh (Contributor) left a comment

LGTM. @jafraustro: please CC reviewers in the PR description.

@soumith (Member) commented Jul 10, 2025

The CI is failing for the Distributed examples because something can't find numpy.

@jafraustro (Contributor Author) commented

> The CI is failing for the Distributed examples because something can't find numpy.

Hi, I changed the torch version in the requirements.txt file.

× No solution found when resolving dependencies:
╰─▶ Because only torch<=2.7.1 is available and you require torch>=2.8

- Replace deprecated launch utility with torchrun (see PyTorch docs: https://pytorch.org/docs/stable/distributed.html#launch-utility)
- Update README to reflect torchrun usage
- Remove main.py (no longer referenced in documentation)
- Update CI to test example.py script instead

Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
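
For readers following the torchrun switch: torchrun passes each worker its position in the job through the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables, so a launched script reads them from the environment. The sketch below is only an illustration and is not code from this PR. `LaunchInfo` and `read_launch_info` are hypothetical names, and the single-process fallback values are an assumption made so the script can also be run without a launcher.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LaunchInfo:
    """Hypothetical container for the values torchrun exports to each worker."""
    rank: int        # global rank of this process across all nodes
    local_rank: int  # rank of this process on its own node
    world_size: int  # total number of processes in the job


def read_launch_info(env: Optional[Mapping[str, str]] = None) -> LaunchInfo:
    """Read RANK, LOCAL_RANK and WORLD_SIZE from the environment.

    The defaults (rank 0, world size 1) are an assumption for running the
    script directly, without torchrun.
    """
    env = os.environ if env is None else env
    return LaunchInfo(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
    )
```

A script using this helper could be started with a command such as `torchrun --nproc_per_node=2 example.py`, where the process count of 2 is only an example value.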

@soumith (Member) commented Jul 11, 2025

It's failing now with some new errors.

Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>

@jafraustro (Contributor Author) commented

> It's failing now with some new errors.

Hello @soumith,

The errors occurred because there were not enough GPUs available. To address this, I added a minimum GPU verification step, similar to the approach used in the tensor_parallel_example.py example. This ensures the script only runs when the required number of GPUs is present.
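
A minimum-device check of the kind described above might look roughly like the following. This is a sketch under stated assumptions, not the code merged in this PR. The helper names `available_accelerators` and `has_min_devices` are hypothetical, and the fallback to 0 devices when torch is not installed is an assumption added so the check degrades gracefully.

```python
def available_accelerators() -> int:
    """Return the number of visible accelerator devices (0 if none are usable)."""
    try:
        import torch
    except ImportError:
        # Assumption: treat a missing torch install as "no devices".
        return 0
    # torch.accelerator is the device-agnostic API in recent PyTorch releases.
    accel = getattr(torch, "accelerator", None)
    if accel is not None and accel.is_available():
        return accel.device_count()
    # Older releases: fall back to the CUDA-specific query.
    if torch.cuda.is_available():
        return torch.cuda.device_count()
    return 0


def has_min_devices(min_devices: int, device_count: int) -> bool:
    """True when at least `min_devices` devices are present."""
    return device_count >= min_devices
```

An entry point could call `has_min_devices(2, available_accelerators())` before initializing the process group, then print a message and exit cleanly when the result is false, so that CI runners with too few devices skip the example instead of failing.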

@soumith soumith merged commit f84bcb3 into pytorch:main Jul 14, 2025
8 checks passed

@soumith (Member) commented Jul 14, 2025

thank you!
