
Conversation

@tchaton (Contributor) commented Mar 8, 2021

What does this PR do?

This PR:

  • updates the broadcast function to rely on broadcast_object_list from PyTorch.
  • adds reduce_boolean_decision for ModelCheckpoint and EarlyStopping using all_gather (still missing for Horovod). See the sketch after the Fixes line below.

Fixes #6343
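
As a rough sketch of both ideas: the function names below mirror the PR description, but the bodies are illustrative assumptions for this review, not the actual Lightning implementation.

    import torch
    import torch.distributed as dist

    def broadcast(obj, src=0):
        # Rely on PyTorch's broadcast_object_list to send an arbitrary
        # picklable object from rank `src` to every other rank.
        if not dist.is_available() or not dist.is_initialized():
            return obj
        obj_list = [obj]  # broadcast_object_list mutates the list in place
        dist.broadcast_object_list(obj_list, src=src)
        return obj_list[0]

    def reduce_boolean_decision(decision):
        # Gather every rank's vote with all_gather and proceed only when
        # all ranks agree, so ModelCheckpoint/EarlyStopping stay in sync.
        if not dist.is_available() or not dist.is_initialized():
            return decision
        vote = torch.tensor(int(decision))  # NCCL backends need this on a CUDA device
        votes = [torch.zeros_like(vote) for _ in range(dist.get_world_size())]
        dist.all_gather(votes, vote)
        return all(bool(v.item()) for v in votes)

Whether the reduction should be all-agree or any-agree is a design choice; all_gather makes either easy, since every rank sees every vote.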

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

tchaton and others added 30 commits March 4, 2021 12:55
…sult_store.py Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
@carmocca (Contributor)

GPU tests are failing

@awaelchli awaelchli added the checkpointing Related to checkpointing label Mar 11, 2021
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
@awaelchli awaelchli added the distributed Generic distributed-related topic label Mar 11, 2021
Comment on lines 36 to 38:

    self.global_rank = None
    self.local_rank = None
    self.world_size = None
@awaelchli (Contributor) Mar 13, 2021


Suggested change:

    - self.global_rank = None
    - self.local_rank = None
    - self.world_size = None
    + self.global_rank = 0
    + self.local_rank = 0
    + self.world_size = 1

Can we keep these as before and not set them to None, so as not to break any existing behavior?

@tchaton (Contributor, Author) Mar 14, 2021


Hey @awaelchli. I actually think there might be a bug behind those defaults, which could explain issues where memory is reported on GPU 0 when training on devices [1, 2], for example.
I will revert them in this PR, but it is worth investigating whether everything is fine. See the illustration below.
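
To make the suspicion concrete, here is a minimal, purely hypothetical sketch (the class and attribute usage are invented for illustration, not Lightning's actual internals) of how a default rank of 0 can make memory stats target the wrong device:

    import torch

    class PluginSketch:  # hypothetical stand-in for a training-type plugin
        def __init__(self):
            # With a default of 0, any code path that reads local_rank
            # before the real rank is assigned silently targets GPU 0.
            self.local_rank = 0

        def memory_stats(self):
            # When training on devices [1, 2], this reports GPU 0's memory
            # if local_rank was never overwritten with the real rank.
            return torch.cuda.memory_allocated(self.local_rank)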

@tchaton tchaton merged commit 0544efd into master Mar 14, 2021
@tchaton tchaton deleted the bugfix/broadcast_2 branch March 14, 2021 17:14
@carmocca carmocca mentioned this pull request Mar 15, 2021
SeanNaren pushed a commit that referenced this pull request Mar 16, 2021
* resolve bug * update * update changelog * update PR * Update pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * add todo * resolve issues * resolve flake8 * update * add coverage for reduce * wip * restore back to brodbact * remove test.py * resolve flake8 * update * check world size * resolve test * update * use pytorch version when defined * update on comments * update on comments * flake8 * resolve bugs * Update CHANGELOG.md Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * update * update * update * update * remove test * update * resolve flake8 * update * update * update * proxy * update * update * resolve typo * prune * update parallel * update Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> (cherry picked from commit 0544efd)
Borda pushed a commit that referenced this pull request Mar 16, 2021 (same cherry-picked commit message as above)
lexierule pushed a commit that referenced this pull request Mar 16, 2021 (same cherry-picked commit message as above)
@amogkam amogkam mentioned this pull request Mar 18, 2021

Labels

  • bug: Something isn't working
  • checkpointing: Related to checkpointing
  • distributed: Generic distributed-related topic

6 participants