
ModelCheckpoint not saving best model #20657

@sravan953

Bug description

In DDP, the ModelCheckpoint callback (configuration below) does not save the best model even though later epochs reach a lower validation loss. Configuration:

checkpoint_callback = ModelCheckpoint(
    dirpath=path_save_model,
    filename="best_loss",
    monitor="val_loss",
    mode="min",
    every_n_epochs=1,
    verbose=True,
)

Here is a snippet of the output log:

Epoch 292: 100% 40/40 [00:49<00:00, 1.24s/it, v_num=0, l1=0.00589, msssim=0.00441, lpips=0.0184, loss=0.0287, lr=0.000128, val_l1=0.00746, val_msssim=0.0122, val_lpips=0.0299, val_loss=0.0495]
Epoch 292, global step 11720: 'val_loss' reached 0.04955 (best 0.04955), saving model to '/qumulo/sravan/Projects/pmc_lit/experiments/e3_uwsyn_multi_noprompt/250318_2110_30k_uw50pc_ghosting/best_loss.ckpt' as top 1
Epoch 297: 100% 40/40 [00:49<00:00, 1.24s/it, v_num=0, l1=0.0141, msssim=0.00869, lpips=0.0113, loss=0.0341, lr=0.000124, val_l1=0.00779, val_msssim=0.0121, val_lpips=0.0266, val_loss=0.0466]
Epoch 297, global step 11920: 'val_loss' was not in top 1

Epoch 297 shows val_loss=0.0466, which is lower than the recorded best of 0.04955, so its checkpoint should have been saved to disk.
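To make the expectation concrete, here is a minimal sketch of the top-1 comparison I would expect with mode="min". It is a simplified model written for this report, not Lightning's actual implementation; the helper name `improves` is hypothetical, and the two values are taken from the log above (the epoch 297 value is the rounded number from the progress bar).

```python
from typing import Optional


def improves(current: float, best: Optional[float], mode: str = "min") -> bool:
    """Return True if `current` should replace `best` as the top-1 value.

    Hypothetical helper: a simplified stand-in for the callback's check,
    not the library's code.
    """
    if mode not in ("min", "max"):
        raise ValueError(f"mode must be 'min' or 'max', got {mode!r}")
    if best is None:
        # Nothing saved yet, so the first value always counts.
        return True
    return current < best if mode == "min" else current > best


# (epoch, val_loss) pairs copied from the log snippet in this report.
history = [(292, 0.04955), (297, 0.0466)]

best: Optional[float] = None
saved_epochs = []
for epoch, val_loss in history:
    if improves(val_loss, best, mode="min"):
        best = val_loss
        saved_epochs.append(epoch)

print(saved_epochs)  # [292, 297]
```

Under this model both epochs would be saved. Since the callback reports "was not in top 1" at epoch 297, the value it compared may not be the 0.0466 shown in the progress bar, but that is only a guess and the log does not confirm it.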

What version are you seeing the problem on?

v2.5

How to reproduce the bug

Error messages and logs

Epoch 292: 100% 40/40 [00:49<00:00, 1.24s/it, v_num=0, l1=0.00589, msssim=0.00441, lpips=0.0184, loss=0.0287, lr=0.000128, val_l1=0.00746, val_msssim=0.0122, val_lpips=0.0299, val_loss=0.0495]
Epoch 292, global step 11720: 'val_loss' reached 0.04955 (best 0.04955), saving model to '250318_2110_30k_uw50pc/best_loss.ckpt' as top 1
Epoch 297: 100% 40/40 [00:49<00:00, 1.24s/it, v_num=0, l1=0.0141, msssim=0.00869, lpips=0.0113, loss=0.0341, lr=0.000124, val_l1=0.00779, val_msssim=0.0121, val_lpips=0.0266, val_loss=0.0466]
Epoch 297, global step 11920: 'val_loss' was not in top 1

Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.5.0): 2.5.0
#- PyTorch Version (e.g., 2.5): 2.3.1+cu121
#- Python version (e.g., 3.12): 3.10.12
#- OS (e.g., Linux): Ubuntu 22.04.3 LTS
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source): pip

More info

No response

cc @justusschock
