
multi-node training runs crash because ddp_weakref is None during backward #20706

@mishooax


Bug description

Multi-node / multi-GPU training fails midway through a run because ddp_weakref is None when the backward pass reaches the DDP code, so the reducer cannot be accessed. This appears to be similar to the issue reported in #20390. I was unable to reproduce it with a small model, and the exact point of failure (epoch/step) varies between training runs. Any ideas? 🙏

[rank13]: Traceback (most recent call last):
[rank13]:   File "/lib/python3.11/site-packages/hydra/main.py", line 94, in decorated_main
[rank13]:     _run_hydra(
[rank13]:   File "/lib/python3.11/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
[rank13]:     _run_app(
[rank13]:   File "/lib/python3.11/site-packages/hydra/_internal/utils.py", line 457, in _run_app
[rank13]:     run_and_report(
[rank13]:   File "/lib/python3.11/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
[rank13]:     raise ex
[rank13]:   File "/lib/python3.11/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
[rank13]:   File "/lib/python3.11/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
[rank13]:     lambda: hydra.run(
[rank13]:     _ = ret.return_value
[rank13]:     DOPTrainer(config).train()
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 539, in fit
[rank13]:     call._call_and_handle_interrupt(
[rank13]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank13]:     return function(*args, **kwargs)
[rank13]:     self._run(model, ckpt_path=ckpt_path)
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 982, in _run
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1026, in _run_stage
[rank13]:     self.fit_loop.run()
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/loops/fit_loop.py", line 216, in run
[rank13]:     self.advance(data_fetcher)
[rank13]:     self._optimizer_step(batch_idx, closure)
[rank13]:     output = fn(*args, **kwargs)
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/core/optimizer.py", line 154, in step
[rank13]:     step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 270, in optimizer_step
[rank13]:     optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
[rank13]:     return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 146, in __call__
[rank13]:     self._result = self.closure(*args, **kwargs)
[rank13]:   File "/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank13]:     return func(*args, **kwargs)
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 140, in closure
[rank13]:     self._backward_fn(step_output.closure_loss)
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 241, in backward_fn
[rank13]:     call._call_strategy_hook(self.trainer, "backward", loss, optimizer)
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 323, in _call_strategy_hook
[rank13]:     output = fn(*args, **kwargs)
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/strategies/strategy.py", line 213, in backward
[rank13]:     self.precision_plugin.backward(closure_loss, self.lightning_module, optimizer, *args, **kwargs)
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/plugins/precision/precision.py", line 73, in backward
[rank13]:     model.backward(tensor, *args, **kwargs)
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/core/module.py", line 1097, in backward
[rank13]:     loss.backward(*args, **kwargs)
[rank13]:   File "/lib/python3.11/site-packages/torch/_tensor.py", line 626, in backward
[rank13]:     torch.autograd.backward(
[rank13]:   File "/lib/python3.11/site-packages/torch/autograd/__init__.py", line 347, in backward
[rank13]:     _engine_run_backward(
[rank13]:   File "/lib/python3.11/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
[rank13]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank13]:   File "/lib/python3.11/site-packages/torch/autograd/function.py", line 307, in apply
[rank13]:     return user_fn(self, *args)
[rank13]:   File "/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 260, in backward
[rank13]:     reducer = ddp_weakref.reducer
[rank13]: AttributeError: 'NoneType' object has no attribute 'reducer'
[rank12]:[E410 06:07:50.769590035 ProcessGroupNCCL.cpp:629] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=253173, OpType=ALLGATHER, NumelIn=91356, NumelOut=730848, Timeout(ms)=600000) ran for 600009 milliseconds before timing out.
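
For context, the frame that raises is `reducer = ddp_weakref.reducer` in torch/nn/parallel/distributed.py, i.e. a weak reference to the DDP wrapper was dereferenced and came back as None. Below is a minimal sketch of that mechanism only (stand-in classes, not the PyTorch or Lightning source): a weakref call returns None once its referent has been garbage collected, which is presumably what happens here if the DDP wrapper no longer exists by the time this backward hook runs.

```python
import weakref


class Reducer:
    pass


class FakeDDP:
    """Stand-in for the DDP wrapper; the backward hook only holds a weak reference to it."""

    def __init__(self):
        self.reducer = Reducer()


ddp = FakeDDP()
ddp_ref = weakref.ref(ddp)

ddp_weakref = ddp_ref()        # wrapper still alive -> returns the wrapper
print(ddp_weakref.reducer)     # ok

del ddp_weakref                # drop the temporary strong reference
del ddp                        # wrapper is collected (e.g. the model is re-wrapped or torn down)

ddp_weakref = ddp_ref()        # now returns None
reducer = ddp_weakref.reducer  # AttributeError: 'NoneType' object has no attribute 'reducer'
```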

What version are you seeing the problem on?

v2.5

How to reproduce the bug
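
No self-contained reproducer yet. As a placeholder, the sketch below shows the kind of multi-node configuration that exercises the failing path (`Trainer.fit` with the ddp strategy, matching the traceback); the module, dataset, and device counts are hypothetical, and per the description above a toy model like this does not actually reproduce the crash.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModule(pl.LightningModule):
    """Placeholder LightningModule; the real model is much larger."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())


def main():
    # In the real run this is launched through a Hydra entry point (see traceback).
    ds = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
    trainer = pl.Trainer(
        strategy="ddp",    # the failure surfaces in the DDP strategy's backward
        accelerator="gpu",
        devices=8,         # hypothetical per-node GPU count
        num_nodes=2,       # multi-node, as in the report
        max_epochs=10,
    )
    trainer.fit(ToyModule(), DataLoader(ds, batch_size=64))


if __name__ == "__main__":
    main()
```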

Error messages and logs


Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.5.0):
#- PyTorch Version (e.g., 2.5):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):

More info

No response

cc @justusschock @lantiga


Labels

bug (Something isn't working), distributed (Generic distributed-related topics), strategy: ddp (DistributedDataParallel), ver: 2.5.x
