
Conversation

@tchaton (Contributor) commented Mar 8, 2021

What does this PR do?

This PR:

  • updates the broadcast function to rely on broadcast_object_list from PyTorch.
  • adds reduce_boolean_decision for ModelCheckpoint and EarlyStopping using all_gather (still missing for Horovod). See the sketch after the Fixes line below.

Fixes #6343
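
As a rough sketch of both ideas: the function names below mirror the PR description, but the bodies are illustrative assumptions for this review, not the actual Lightning implementation.

    import torch
    import torch.distributed as dist

    def broadcast(obj, src=0):
        # Rely on PyTorch's broadcast_object_list to send an arbitrary
        # picklable object from rank `src` to every other rank.
        if not dist.is_available() or not dist.is_initialized():
            return obj
        obj_list = [obj]  # broadcast_object_list mutates the list in place
        dist.broadcast_object_list(obj_list, src=src)
        return obj_list[0]

    def reduce_boolean_decision(decision):
        # Gather every rank's vote with all_gather and proceed only when
        # all ranks agree, so ModelCheckpoint/EarlyStopping stay in sync.
        if not dist.is_available() or not dist.is_initialized():
            return decision
        vote = torch.tensor(int(decision))  # NCCL backends need this on a CUDA device
        votes = [torch.zeros_like(vote) for _ in range(dist.get_world_size())]
        dist.all_gather(votes, vote)
        return all(bool(v.item()) for v in votes)

Whether the reduction should be all-agree or any-agree is a design choice; all_gather makes either easy, since every rank sees every vote.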

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

tchaton and others added 30 commits March 4, 2021 12:55
…sult_store.py Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
@carmocca (Contributor)

GPU tests are failing

@awaelchli awaelchli added the checkpointing Related to checkpointing label Mar 11, 2021
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
@awaelchli awaelchli added the distributed Generic distributed-related topic label Mar 11, 2021
Comment on lines 36 to 38:

    self.global_rank = None
    self.local_rank = None
    self.world_size = None
@awaelchli (Contributor) Mar 13, 2021


Suggested change:

    - self.global_rank = None
    - self.local_rank = None
    - self.world_size = None
    + self.global_rank = 0
    + self.local_rank = 0
    + self.world_size = 1

Can we keep these as before and not set them to None, so as not to break any existing behavior?

@tchaton (Contributor, Author) Mar 14, 2021


Hey @awaelchli. I actually think there might be a bug behind those defaults, which could explain issues where memory is reported on GPU 0 when training on devices [1, 2], for example.
I will revert them in this PR, but it is worth investigating whether everything is fine. See the illustration below.
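
To make the suspicion concrete, here is a minimal, purely hypothetical sketch (the class and attribute usage are invented for illustration, not Lightning's actual internals) of how a default rank of 0 can make memory stats target the wrong device:

    import torch

    class PluginSketch:  # hypothetical stand-in for a training-type plugin
        def __init__(self):
            # With a default of 0, any code path that reads local_rank
            # before the real rank is assigned silently targets GPU 0.
            self.local_rank = 0

        def memory_stats(self):
            # When training on devices [1, 2], this reports GPU 0's memory
            # if local_rank was never overwritten with the real rank.
            return torch.cuda.memory_allocated(self.local_rank)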

@tchaton tchaton merged commit 0544efd into master Mar 14, 2021
@tchaton tchaton deleted the bugfix/broadcast_2 branch March 14, 2021 17:14
@carmocca carmocca mentioned this pull request Mar 15, 2021
SeanNaren pushed a commit that referenced this pull request Mar 16, 2021
* resolve bug * update * update changelog * update PR * Update pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * add todo * resolve issues * resolve flake8 * update * add coverage for reduce * wip * restore back to brodbact * remove test.py * resolve flake8 * update * check world size * resolve test * update * use pytorch version when defined * update on comments * update on comments * flake8 * resolve bugs * Update CHANGELOG.md Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * update * update * update * update * remove test * update * resolve flake8 * update * update * update * proxy * update * update * resolve typo * prune * update parallel * update Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> (cherry picked from commit 0544efd)
Borda pushed a commit that referenced this pull request Mar 16, 2021 (same cherry-picked commit message as above)
lexierule pushed a commit that referenced this pull request Mar 16, 2021 (same cherry-picked commit message as above)
@amogkam amogkam mentioned this pull request Mar 18, 2021

Labels

  • bug: Something isn't working
  • checkpointing: Related to checkpointing
  • distributed: Generic distributed-related topic

6 participants