
Conversation


@japdubengsub japdubengsub commented Sep 5, 2024

This pull request updates the CometML logger to support the recent Comet SDK. Logger initialization is now unified around the comet_ml.start() method for ease of use, and the unit tests have been updated accordingly.
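For readers skimming the diff, here is a minimal usage sketch of the intent. It assumes the refactored CometLogger keeps forwarding the usual Comet arguments (api_key, workspace, project) to comet_ml.start(); the exact parameter names are defined by the diff itself and may differ.

```python
# Hedged sketch of the intended usage, not the authoritative API of this PR.
# The keyword names below (api_key, workspace, project) are assumed to mirror
# comet_ml.start(); see src/lightning/pytorch/loggers/comet.py for the final names.
from lightning.pytorch import Trainer
from lightning.pytorch.loggers import CometLogger

comet_logger = CometLogger(
    api_key="YOUR_COMET_API_KEY",  # usually resolved from COMET_API_KEY instead
    workspace="my-workspace",      # assumption: forwarded to comet_ml.start()
    project="my-project",          # assumption: forwarded to comet_ml.start()
)

trainer = Trainer(max_epochs=3, logger=comet_logger)
# trainer.fit(model, train_dataloader, val_dataloader)  # model/data defined elsewhere
```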


📚 Documentation preview 📚: https://pytorch-lightning--2.org.readthedocs.build/en/2/

@github-actions github-actions bot added the pl label Sep 5, 2024

github-actions bot commented Sep 5, 2024

⛈️ Required checks status: Has failure 🔴

Warning
This job will need to be re-run to merge your PR. If you do not have write access to the repository, you can ask Lightning-AI/lai-frameworks to re-run it. If you push a new commit, all of CI will re-trigger.

Groups summary

🔴 pytorch_lightning: Tests workflow
Check ID Status
pl-cpu (macOS-13, lightning, 3.9, 2.1, oldest) failure
pl-cpu (macOS-14, lightning, 3.10, 2.1) failure
pl-cpu (macOS-14, lightning, 3.11, 2.2) failure
pl-cpu (macOS-14, lightning, 3.11, 2.3) failure
pl-cpu (macOS-14, lightning, 3.12, 2.4) failure
pl-cpu (ubuntu-20.04, lightning, 3.9, 2.1, oldest) failure
pl-cpu (ubuntu-20.04, lightning, 3.10, 2.1) failure
pl-cpu (ubuntu-20.04, lightning, 3.11, 2.2) failure
pl-cpu (ubuntu-20.04, lightning, 3.11, 2.3) failure
pl-cpu (ubuntu-20.04, lightning, 3.12, 2.4) failure
pl-cpu (windows-2022, lightning, 3.9, 2.1, oldest) failure
pl-cpu (windows-2022, lightning, 3.10, 2.1) failure
pl-cpu (windows-2022, lightning, 3.11, 2.2) failure
pl-cpu (windows-2022, lightning, 3.11, 2.3) failure
pl-cpu (windows-2022, lightning, 3.12, 2.4) failure
pl-cpu (macOS-14, pytorch, 3.9, 2.1) failure
pl-cpu (ubuntu-20.04, pytorch, 3.9, 2.1) failure
pl-cpu (windows-2022, pytorch, 3.9, 2.1) failure
pl-cpu (macOS-12, pytorch, 3.10, 2.1) failure
pl-cpu (ubuntu-22.04, pytorch, 3.10, 2.1) failure
pl-cpu (windows-2022, pytorch, 3.10, 2.1) failure

These checks are required after the changes to src/lightning/pytorch/loggers/comet.py, tests/tests_pytorch/loggers/conftest.py, tests/tests_pytorch/loggers/test_comet.py.

🟡 pytorch_lightning: Azure GPU
Check ID Status
pytorch-lightning (GPUs) (testing Lightning latest) no_status
pytorch-lightning (GPUs) (testing PyTorch latest) no_status

These checks are required after the changes to src/lightning/pytorch/loggers/comet.py, tests/tests_pytorch/loggers/conftest.py, tests/tests_pytorch/loggers/test_comet.py.

🟡 pytorch_lightning: Benchmarks
Check ID Status
lightning.Benchmarks no_status

These checks are required after the changes to src/lightning/pytorch/loggers/comet.py.

🔴 pytorch_lightning: Docs
Check ID Status
docs-make (pytorch, doctest) success
docs-make (pytorch, html) failure

These checks are required after the changes to src/lightning/pytorch/loggers/comet.py.

🟢 mypy
Check ID Status
mypy success

These checks are required after the changes to src/lightning/pytorch/loggers/comet.py.

🟡 install
Check ID Status
install-pkg (ubuntu-22.04, fabric, 3.9) no_status
install-pkg (ubuntu-22.04, fabric, 3.11) no_status
install-pkg (ubuntu-22.04, pytorch, 3.9) no_status
install-pkg (ubuntu-22.04, pytorch, 3.11) no_status
install-pkg (ubuntu-22.04, lightning, 3.9) no_status
install-pkg (ubuntu-22.04, lightning, 3.11) no_status
install-pkg (ubuntu-22.04, notset, 3.9) no_status
install-pkg (ubuntu-22.04, notset, 3.11) no_status
install-pkg (macOS-12, fabric, 3.9) no_status
install-pkg (macOS-12, fabric, 3.11) no_status
install-pkg (macOS-12, pytorch, 3.9) no_status
install-pkg (macOS-12, pytorch, 3.11) no_status
install-pkg (macOS-12, lightning, 3.9) no_status
install-pkg (macOS-12, lightning, 3.11) no_status
install-pkg (macOS-12, notset, 3.9) no_status
install-pkg (macOS-12, notset, 3.11) no_status
install-pkg (windows-2022, fabric, 3.9) no_status
install-pkg (windows-2022, fabric, 3.11) no_status
install-pkg (windows-2022, pytorch, 3.9) no_status
install-pkg (windows-2022, pytorch, 3.11) no_status
install-pkg (windows-2022, lightning, 3.9) no_status
install-pkg (windows-2022, lightning, 3.11) no_status
install-pkg (windows-2022, notset, 3.9) no_status
install-pkg (windows-2022, notset, 3.11) no_status

These checks are required after the changes to src/lightning/pytorch/loggers/comet.py.


Thank you for your contribution! 💜

Note
This comment is automatically generated and updates for 60 minutes every 180 seconds. If you have any other questions, contact carmocca for help.

@japdubengsub japdubengsub marked this pull request as draft September 5, 2024 12:19

@Lothiraldan Lothiraldan left a comment


The following example fails with this branch but passes with the latest version of Lightning.

Lightning 2.4.0, experiment: https://www.comet.com/lothiraldan/comet-example-pytorch-lightning/64e6b0df893b435c93f54f1bc48a8958

Output:

CometLogger will be initialized in online mode
COMET INFO: Experiment is live on comet.com https://www.comet.com/lothiraldan/comet-example-pytorch-lightning/64e6b0df893b435c93f54f1bc48a8958
COMET INFO: Couldn't find a Git repository in '/tmp' nor in any parent directory. Set `COMET_GIT_DIRECTORY` if your Git Repository is elsewhere.
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
----------------------------------------------------------------------------------------------------
distributed_backend=gloo
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------
COMET INFO: Experiment is live on comet.com https://www.comet.com/lothiraldan/comet-example-pytorch-lightning/64e6b0df893b435c93f54f1bc48a8958

  | Name | Type | Params | Mode
----------------------------------------
0 | l1 | Linear | 7.9 K | train
----------------------------------------
7.9 K Trainable params
0 Non-trainable params
7.9 K Total params
0.031 Total estimated model params size (MB)
1 Modules in train mode
0 Modules in eval mode

Sanity Checking: | | 0/? [00:00<?, ?it/s]
/home/lothiraldan/.virtualenvs/tempenv-60a6200361ab/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=3` in the `DataLoader` to improve performance.
Sanity Checking DataLoader 0: 100%|██████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 38.05it/s]
/home/lothiraldan/.virtualenvs/tempenv-60a6200361ab/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py:431: It is recommended to use `self.log('val_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
/home/lothiraldan/.virtualenvs/tempenv-60a6200361ab/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=3` in the `DataLoader` to improve performance.
Epoch 2: 100%|███████████████████████████████████████████████████████████████████| 469/469 [00:31<00:00, 14.73it/s, v_num=8958]
`Trainer.fit` stopped: `max_epochs=3` reached.
Epoch 2: 100%|███████████████████████████████████████████████████████████████████| 469/469 [00:31<00:00, 14.73it/s, v_num=8958]
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Comet.ml ExistingExperiment Summary
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Data:
COMET INFO: display_summary_level : 1
COMET INFO: name : upset_soil_1490
COMET INFO: url : https://www.comet.com/lothiraldan/comet-example-pytorch-lightning/64e6b0df893b435c93f54f1bc48a8958
COMET INFO: Metrics [count] (min, max):
COMET INFO: train_loss [28] : (0.4863688051700592, 1.2028049230575562)
COMET INFO: val_loss [3] : (0.9357529878616333, 0.9526914358139038)
COMET INFO: Others:
COMET INFO: Created from : pytorch-lightning
COMET INFO: Parameters:
COMET INFO: layer_size : 784
COMET INFO: Uploads:
COMET INFO: model graph : 1
COMET INFO:
COMET INFO: Please wait for metadata to finish uploading (timeout is 3600 seconds)
COMET INFO: Uploading 1651 metrics, params and output messages True
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Data:
COMET INFO: display_summary_level : 1
COMET INFO: name : upset_soil_1490
COMET INFO: url : https://www.comet.com/lothiraldan/comet-example-pytorch-lightning/64e6b0df893b435c93f54f1bc48a8958
COMET INFO: Others:
COMET INFO: Created from : pytorch-lightning
COMET INFO: Parameters:
COMET INFO: batch_size : 64
COMET INFO: Uploads:
COMET INFO: environment details : 1
COMET INFO: filename : 1
COMET INFO: installed packages : 1
COMET INFO: source_code : 2 (17.51 KB)
COMET INFO:

This branch, experiment: https://www.comet.com/lothiraldan/comet-example-pytorch-lightning/26baa02c5c7244b4a5dc48a72e84392e

Output:

COMET INFO: Experiment is live on comet.com https://www.comet.com/lothiraldan/comet-example-pytorch-lightning/26baa02c5c7244b4a5dc48a72e84392e
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
COMET INFO: Couldn't find a Git repository in '/tmp' nor in any parent directory. Set `COMET_GIT_DIRECTORY` if your Git Repository is elsewhere.
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
----------------------------------------------------------------------------------------------------
distributed_backend=gloo
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------

  | Name | Type | Params | Mode
----------------------------------------
0 | l1 | Linear | 7.9 K | train
----------------------------------------
7.9 K Trainable params
0 Non-trainable params
7.9 K Total params
0.031 Total estimated model params size (MB)
1 Modules in train mode
0 Modules in eval mode
W0906 18:27:21.134000 140399680829248 torch/multiprocessing/spawn.py:146] Terminating process 4052339 via signal SIGTERM
Traceback (most recent call last):
  File "/tmp/Comet_and_Pytorch_Lightning.py", line 86, in <module>
    main()
  File "/tmp/Comet_and_Pytorch_Lightning.py", line 76, in main
    trainer.fit(model, train_loader, eval_loader)
  File "/home/lothiraldan/project/cometml/pytorch-lightning/src/lightning/pytorch/trainer/trainer.py", line 538, in fit
    call._call_and_handle_interrupt(
  File "/home/lothiraldan/project/cometml/pytorch-lightning/src/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lothiraldan/project/cometml/pytorch-lightning/src/lightning/pytorch/strategies/launchers/multiprocessing.py", line 144, in launch
    while not process_context.join():
              ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lothiraldan/.virtualenvs/tempenv-5fbd1040246d4/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 189, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/lothiraldan/.virtualenvs/tempenv-5fbd1040246d4/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 76, in _wrap
    fn(i, *args)
  File "/home/lothiraldan/project/cometml/pytorch-lightning/src/lightning/pytorch/strategies/launchers/multiprocessing.py", line 173, in _wrapping_function
    results = function(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lothiraldan/project/cometml/pytorch-lightning/src/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/lothiraldan/project/cometml/pytorch-lightning/src/lightning/pytorch/trainer/trainer.py", line 964, in _run
    _log_hyperparams(self)
  File "/home/lothiraldan/project/cometml/pytorch-lightning/src/lightning/pytorch/loggers/utilities.py", line 93, in _log_hyperparams
    logger.log_hyperparams(hparams_initial)
  File "/home/lothiraldan/.virtualenvs/tempenv-5fbd1040246d4/lib/python3.12/site-packages/lightning_utilities/core/rank_zero.py", line 42, in wrapped_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/lothiraldan/project/cometml/pytorch-lightning/src/lightning/pytorch/loggers/comet.py", line 282, in log_hyperparams
    self.experiment.__internal_api__log_parameters__(
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute '__internal_api__log_parameters__'
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Data:
COMET INFO: display_summary_level : 1
COMET INFO: name : sleepy_monastery_3541
COMET INFO: url : https://www.comet.com/lothiraldan/comet-example-pytorch-lightning/26baa02c5c7244b4a5dc48a72e84392e
COMET INFO: Parameters:
COMET INFO: batch_size : 64
COMET INFO: Uploads:
COMET INFO: environment details : 1
COMET INFO: filename : 1
COMET INFO: installed packages : 1
COMET INFO: source_code : 2 (14.93 KB)
COMET INFO:

Please investigate what is happening.
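The example script itself is not shown in this thread, so the following is only a rough reconstruction consistent with the logged output (one Linear layer of ~7.9 K params, layer_size 784, batch_size 64, max_epochs=3, two CPU processes over gloo). All model and data details are hypothetical; the original presumably used a real dataset such as MNIST.

```python
# Hypothetical reconstruction of /tmp/Comet_and_Pytorch_Lightning.py -- the real
# script is not included above. Details are guessed from the logged summary.
import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader, TensorDataset

from lightning.pytorch import LightningModule, Trainer
from lightning.pytorch.loggers import CometLogger


class LitClassifier(LightningModule):
    def __init__(self, layer_size: int = 784):
        super().__init__()
        self.save_hyperparameters()          # makes the Trainer call logger.log_hyperparams()
        self.l1 = nn.Linear(layer_size, 10)  # ~7.9 K params, matching the summary above

    def forward(self, x):
        return self.l1(x.view(x.size(0), -1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", F.cross_entropy(self(x), y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


def main():
    # Random stand-in data; the original example presumably used MNIST.
    x, y = torch.randn(1024, 784), torch.randint(0, 10, (1024,))
    train_loader = DataLoader(TensorDataset(x, y), batch_size=64)
    eval_loader = DataLoader(TensorDataset(x, y), batch_size=64)

    trainer = Trainer(
        max_epochs=3,
        accelerator="cpu",
        devices=2,
        strategy="ddp_spawn",   # two processes, gloo backend, as in the logs
        logger=CometLogger(),   # credentials resolved from Comet env vars / config
    )
    trainer.fit(LitClassifier(), train_loader, eval_loader)


if __name__ == "__main__":
    main()
```

The traceback suggests that, inside the spawned rank-0 worker, self.experiment resolves to None by the time log_hyperparams() runs, so the call to __internal_api__log_parameters__ fails; the Lightning 2.4.0 output, by contrast, appears to re-attach to the run in the workers (note the ExistingExperiment summary there).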

pl-ghost and others added 2 commits September 8, 2024 16:38
update tutorials to `3f8a254d` Co-authored-by: Borda <Borda@users.noreply.github.com>
@japdubengsub (Author)

Did some testing with the following Trainer() params.

CPU

Devices Strategy Status
1 None Works
2 None Works
1 ddp_spawn Works
2 ddp_spawn Works
1 ddp_fork HANG
2 ddp_fork CRASH: torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGSEGV
1 ddp_notebook HANG
2 ddp_notebook CRASH: torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGSEGV
1 fsdp ValueError: The strategy fsdp requires a GPU accelerator, but got: cpu
2 fsdp ValueError: The strategy fsdp requires a GPU accelerator, but got: cpu

GPU

Devices Strategy Status
1 None Works
2 None Works
1 ddp_spawn Works
2 ddp_spawn Works
1 ddp_fork CRASH: torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGSEGV
2 ddp_fork CRASH: torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGSEGV
1 ddp_notebook CRASH: torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGSEGV
2 ddp_notebook CRASH: torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGSEGV
1 fsdp Works
2 fsdp Works

MULTI-NODE (two VM nodes, each with one CUDA device)

Devices Nodes Strategy Status
1 2 ddp Works

With or without the current PR, everything works the same.
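For reference, the sweep above can be expressed roughly as the loop below. This is only a sketch: the actual test script is not shown, TinyModel and make_loader are hypothetical stand-ins, and in practice each combination would be run in a fresh process (several combinations hang or crash, as noted in the tables).

```python
# Hedged sketch of the devices/strategy matrix tested above; model and data
# are trivial stand-ins, not the real test setup.
import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader, TensorDataset

from lightning.pytorch import LightningModule, Trainer
from lightning.pytorch.loggers import CometLogger


class TinyModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.mse_loss(self.layer(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


def make_loader():
    x, y = torch.randn(64, 8), torch.randn(64, 1)
    return DataLoader(TensorDataset(x, y), batch_size=16)


if __name__ == "__main__":
    # (devices, strategy) pairs from the CPU table; "auto" stands in for "None".
    matrix = [(d, s) for s in ("auto", "ddp_spawn", "ddp_fork", "ddp_notebook", "fsdp") for d in (1, 2)]

    for devices, strategy in matrix:
        trainer = Trainer(
            accelerator="cpu",       # switch to "gpu" for the GPU table
            devices=devices,
            strategy=strategy,
            max_epochs=1,
            limit_train_batches=4,
            logger=CometLogger(),    # credentials resolved from Comet env vars / config
        )
        trainer.fit(TinyModel(), make_loader())
```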

@alexkuzmik

@japdubengsub Very nice job on the testing, Sasha!
@Lothiraldan I also made a few runs, and they worked well. The results are the same as those I got with a previous branch.

japdubengsub and others added 7 commits September 11, 2024 17:38
update tutorials to `d5273534` Co-authored-by: Borda <Borda@users.noreply.github.com>
…ning-AI#20267) * build(deps): bump Lightning-AI/utilities from 0.11.6 to 0.11.7 Bumps [Lightning-AI/utilities](https://github.com/lightning-ai/utilities) from 0.11.6 to 0.11.7. - [Release notes](https://github.com/lightning-ai/utilities/releases) - [Changelog](https://github.com/Lightning-AI/utilities/blob/main/CHANGELOG.md) - [Commits](Lightning-AI/utilities@v0.11.6...v0.11.7) --- updated-dependencies: - dependency-name: Lightning-AI/utilities dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> * Apply suggestions from code review --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
…ing-AI#20266) Bumps [peter-evans/create-pull-request](https://github.com/peter-evans/create-pull-request) from 6 to 7. - [Release notes](https://github.com/peter-evans/create-pull-request/releases) - [Commits](peter-evans/create-pull-request@v6...v7) --- updated-dependencies: - dependency-name: peter-evans/create-pull-request dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
) * Update favicon * Update favicons - all sizes
TresYap and others added 12 commits January 6, 2025 18:50
When loading a pytorch-lightning model from MLFlow, I get `TypeError: Type parameter +_R_co without a default follows type parameter with a default`. This happens whenever doing `import pytorch_lightning as pl` which is done by packages like MLFlow. Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com> Co-authored-by: Luca Antiga <luca.antiga@gmail.com>
* Fix TBPTT example * Make example self-contained * Update imports * Add test
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com> Co-authored-by: Jirka B <j.borovec+github@gmail.com>
* test: flaky terminated with signal SIGABRT * str
* Update twine to 6.0.1 for Python 3.13 * Pin pkginfo * Go with twine 6.0.1
…ning-AI#20569) Bumps [Lightning-AI/utilities](https://github.com/lightning-ai/utilities) from 0.11.9 to 0.12.0. - [Release notes](https://github.com/lightning-ai/utilities/releases) - [Changelog](https://github.com/Lightning-AI/utilities/blob/main/CHANGELOG.md) - [Commits](Lightning-AI/utilities@v0.11.9...v0.12.0) --- updated-dependencies: - dependency-name: Lightning-AI/utilities dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
@github-actions github-actions bot added the data label Feb 17, 2025
amorehead and others added 17 commits February 18, 2025 11:24
Co-authored-by: Haifeng Jin <haifeng-jin@users.noreply.github.com>
…#20574) Co-authored-by: Haifeng Jin <haifeng-jin@users.noreply.github.com>
…latency significantly. (Lightning-AI#20594) * Move save_hparams_to_yaml to log_hparams instead of auto save with metric * Fix params to be optional * Adjust test * Fix test_csv, test_no_name --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
… to ensure `tensorboard` logs can sync to `wandb` (Lightning-AI#20610)
) * Add checkpoint artifact path prefix to MLflow logger Add a new `checkpoint_artifact_path_prefix` parameter to the MLflow logger. * Modify `src/lightning/pytorch/loggers/mlflow.py` to include the new parameter in the `MLFlowLogger` class constructor and use it in the `after_save_checkpoint` method. * Update the documentation in `docs/source-pytorch/visualize/loggers.rst` to include the new `checkpoint_artifact_path_prefix` parameter. * Add a new test in `tests/tests_pytorch/loggers/test_mlflow.py` to verify the functionality of the `checkpoint_artifact_path_prefix` parameter and ensure it is used in the artifact path. * Add CHANGELOG * Fix MLflow logger test for `checkpoint_path_prefix` * Update stale documentation --------- Co-authored-by: Luca Antiga <luca.antiga@gmail.com>
…ning-AI#20631) * build(deps): bump Lightning-AI/utilities from 0.12.0 to 0.14.0 Bumps [Lightning-AI/utilities](https://github.com/lightning-ai/utilities) from 0.12.0 to 0.14.0. - [Release notes](https://github.com/lightning-ai/utilities/releases) - [Changelog](https://github.com/Lightning-AI/utilities/blob/main/CHANGELOG.md) - [Commits](Lightning-AI/utilities@v0.12.0...v0.14.0) --- updated-dependencies: - dependency-name: Lightning-AI/utilities dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * Apply suggestions from code review --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
* Allow a custom parser class when using LightningCLI * Update changelog
* ci: resolve standalone testing * faster * merge * printenv * here * list * prune * process * printf * stdout * ./ * -e * .coverage * all * rev * notes * notes * notes
* bump: testing with future torch 2.6 * bump `typing-extensions` * TORCHINDUCTOR_CACHE_DIR * bitsandbytes * Apply suggestions from code review * _TORCH_LESS_EQUAL_2_6 --------- Co-authored-by: Luca Antiga <luca.antiga@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Luca Antiga <luca@lightning.ai>
…ger-update # Conflicts: #	src/lightning/pytorch/CHANGELOG.md