
multi-node training runs crash because ddp_weakref is None during backward #20706

@mishooax


Bug description

Multi-node / multi-GPU training fails midway through a run because ddp_weakref is None when the backward pass reaches the DDP code, so the reducer cannot be accessed. This appears to be similar to the issue reported in #20390. I was unable to reproduce it with a small model, and the exact point of failure (epoch/step) varies between training runs. Any ideas? 🙏

[rank13]: Traceback (most recent call last):
[rank13]:   File "/lib/python3.11/site-packages/hydra/main.py", line 94, in decorated_main
[rank13]:     _run_hydra(
[rank13]:   File "/lib/python3.11/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
[rank13]:     _run_app(
[rank13]:   File "/lib/python3.11/site-packages/hydra/_internal/utils.py", line 457, in _run_app
[rank13]:     run_and_report(
[rank13]:   File "/lib/python3.11/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
[rank13]:     raise ex
[rank13]:   File "/lib/python3.11/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
[rank13]:   File "/lib/python3.11/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
[rank13]:     lambda: hydra.run(
[rank13]:     _ = ret.return_value
[rank13]:     DOPTrainer(config).train()
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 539, in fit
[rank13]:     call._call_and_handle_interrupt(
[rank13]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank13]:     return function(*args, **kwargs)
[rank13]:     self._run(model, ckpt_path=ckpt_path)
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 982, in _run
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1026, in _run_stage
[rank13]:     self.fit_loop.run()
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/loops/fit_loop.py", line 216, in run
[rank13]:     self.advance(data_fetcher)
[rank13]:     self._optimizer_step(batch_idx, closure)
[rank13]:     output = fn(*args, **kwargs)
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/core/optimizer.py", line 154, in step
[rank13]:     step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 270, in optimizer_step
[rank13]:     optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
[rank13]:     return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 146, in __call__
[rank13]:     self._result = self.closure(*args, **kwargs)
[rank13]:   File "/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank13]:     return func(*args, **kwargs)
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 140, in closure
[rank13]:     self._backward_fn(step_output.closure_loss)
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 241, in backward_fn
[rank13]:     call._call_strategy_hook(self.trainer, "backward", loss, optimizer)
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 323, in _call_strategy_hook
[rank13]:     output = fn(*args, **kwargs)
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/strategies/strategy.py", line 213, in backward
[rank13]:     self.precision_plugin.backward(closure_loss, self.lightning_module, optimizer, *args, **kwargs)
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/plugins/precision/precision.py", line 73, in backward
[rank13]:     model.backward(tensor, *args, **kwargs)
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/core/module.py", line 1097, in backward
[rank13]:     loss.backward(*args, **kwargs)
[rank13]:   File "/lib/python3.11/site-packages/torch/_tensor.py", line 626, in backward
[rank13]:     torch.autograd.backward(
[rank13]:   File "/lib/python3.11/site-packages/torch/autograd/__init__.py", line 347, in backward
[rank13]:     _engine_run_backward(
[rank13]:   File "/lib/python3.11/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
[rank13]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank13]:   File "/lib/python3.11/site-packages/torch/autograd/function.py", line 307, in apply
[rank13]:     return user_fn(self, *args)
[rank13]:   File "/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 260, in backward
[rank13]:     reducer = ddp_weakref.reducer
[rank13]: AttributeError: 'NoneType' object has no attribute 'reducer'
[rank12]:[E410 06:07:50.769590035 ProcessGroupNCCL.cpp:629] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=253173, OpType=ALLGATHER, NumelIn=91356, NumelOut=730848, Timeout(ms)=600000) ran for 600009 milliseconds before timing out.
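
For context, the frame that raises is `reducer = ddp_weakref.reducer` in torch/nn/parallel/distributed.py, i.e. a weak reference to the DDP wrapper was dereferenced and came back as None. Below is a minimal sketch of that mechanism only (stand-in classes, not the PyTorch or Lightning source): a weakref call returns None once its referent has been garbage collected, which is presumably what happens here if the DDP wrapper no longer exists by the time this backward hook runs.

```python
import weakref


class Reducer:
    pass


class FakeDDP:
    """Stand-in for the DDP wrapper; the backward hook only holds a weak reference to it."""

    def __init__(self):
        self.reducer = Reducer()


ddp = FakeDDP()
ddp_ref = weakref.ref(ddp)

ddp_weakref = ddp_ref()        # wrapper still alive -> returns the wrapper
print(ddp_weakref.reducer)     # ok

del ddp_weakref                # drop the temporary strong reference
del ddp                        # wrapper is collected (e.g. the model is re-wrapped or torn down)

ddp_weakref = ddp_ref()        # now returns None
reducer = ddp_weakref.reducer  # AttributeError: 'NoneType' object has no attribute 'reducer'
```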

What version are you seeing the problem on?

v2.5

How to reproduce the bug
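
No self-contained reproducer yet. As a placeholder, the sketch below shows the kind of multi-node configuration that exercises the failing path (`Trainer.fit` with the ddp strategy, matching the traceback); the module, dataset, and device counts are hypothetical, and per the description above a toy model like this does not actually reproduce the crash.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModule(pl.LightningModule):
    """Placeholder LightningModule; the real model is much larger."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())


def main():
    # In the real run this is launched through a Hydra entry point (see traceback).
    ds = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
    trainer = pl.Trainer(
        strategy="ddp",    # the failure surfaces in the DDP strategy's backward
        accelerator="gpu",
        devices=8,         # hypothetical per-node GPU count
        num_nodes=2,       # multi-node, as in the report
        max_epochs=10,
    )
    trainer.fit(ToyModule(), DataLoader(ds, batch_size=64))


if __name__ == "__main__":
    main()
```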

Error messages and logs


Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.5.0):
#- PyTorch Version (e.g., 2.5):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):

More info

No response

cc @justusschock @lantiga


Labels

bug (Something isn't working), distributed (Generic distributed-related topics), strategy: ddp (DistributedDataParallel), ver: 2.5.x
