Lightning-AI / pytorch-lightning
Issues
state:open label:distributed
NCCL timeout while doing multi gpu training
Labels: bug (Something isn't working); distributed (Generic distributed-related topic); help wanted (Open to be worked on); repro needed (The issue is missing a reproducible example); ver: 2.4.x
Status: Open.
#20832 in Lightning-AI/pytorch-lightning · aditya-sanas opened on May 17, 2025
`mark_forward_method` does not work with `ModelParallelStrategy`
Labels: bug (Something isn't working); distributed (Generic distributed-related topic); strategy: fsdp (Fully Sharded Data Parallel); ver: 2.5.x; waiting on author (Waiting on user action, correction, or update)
Status: Open.
#20710 in Lightning-AI/pytorch-lightning · tonyf opened on Apr 12, 2025
multi-node training runs crash because `ddp_weakref` is `None` during backward
Labels: bug (Something isn't working); distributed (Generic distributed-related topic); strategy: ddp (DistributedDataParallel); ver: 2.5.x
Status: Open.
#20706 in Lightning-AI/pytorch-lightning · mishooax opened on Apr 10, 2025
self.manual_backward() makes all gradients gone
Labels: bug (Something isn't working); distributed (Generic distributed-related topic); optimization; ver: 2.5.x
Status: Open.
#20685 in Lightning-AI/pytorch-lightning · samsara-ku opened on Mar 31, 2025
self.all_gather does not work on on_train_epoch_end
Labels: bug (Something isn't working); distributed (Generic distributed-related topic); ver: 2.5.x
Status: Open.
#20683 in Lightning-AI/pytorch-lightning · yoniaflalo opened on Mar 28, 2025
`ModelCheckpoint` not saving best model
Labels: bug (Something isn't working); callback: model checkpoint; distributed (Generic distributed-related topic); ver: 2.5.x
Status: Open.
#20657 in Lightning-AI/pytorch-lightning · sravan953 opened on Mar 19, 2025
ModelCheckpoint broadcast fails on multiple GPUs
Labels: bug (Something isn't working); callback: model checkpoint; distributed (Generic distributed-related topic); ver: 2.5.x
Status: Open.
#20597 in Lightning-AI/pytorch-lightning · navotoz opened on Feb 20, 2025
Slurm multi-node work fine but multi-gpu doesn't
Labels: bug (Something isn't working); distributed (Generic distributed-related topic); environment: slurm; ver: 2.4.x
Status: Open.
#20438 in Lightning-AI/pytorch-lightning · atifkhanncl opened on Nov 22, 2024
Multi-gpu training with slurm times out
Labels: bug (Something isn't working); distributed (Generic distributed-related topic); environment: slurm; ver: 2.3.x
Status: Open.
#20434 in Lightning-AI/pytorch-lightning · nightingal3 opened on Nov 19, 2024
NCCL backend fails during multi-node, multi-GPU training
Labels: bug (Something isn't working); distributed (Generic distributed-related topic); environment: slurm; ver: 2.4.x
Status: Open.
#20306 in Lightning-AI/pytorch-lightning · raketenolli opened on Sep 26, 2024
environment variable WORLD_SIZE is incorrectly set to 1 after trainer.fit is done
Labels: bug (Something isn't working); distributed (Generic distributed-related topic); strategy: ddp (DistributedDataParallel); trainer; ver: 2.4.x
Status: Open.
#20232 in Lightning-AI/pytorch-lightning · simon-ging opened on Aug 28, 2024
Sometimes I get Dataset Errors when using the lightning module in a distributed manner
Labels: bug (Something isn't working); data; distributed (Generic distributed-related topic)
Status: Open.
#20088 in Lightning-AI/pytorch-lightning · asusdisciple opened on Jul 15, 2024