Releases: Lightning-AI/pytorch-lightning
Simplifications & new docs
This release focused on a ton of bug fixes, small optimizations to training, and most importantly, clean new docs!
Major changes
We have released new documentation; please bear with us as we fix broken links and patch in missing pieces.
The project has moved to the new PyTorchLightning organization, so the repository no longer lives under WilliamFalcon/PyTorchLightning.
We have added our own custom TensorBoard logger as the default logger.
We have upgraded continuous integration to speed up automatic testing.
We have fixed GAN training by adding support for multiple optimizers.
Complete changelog
Added
- Added support for resuming from a specific checkpoint via the `resume_from_checkpoint` argument (#516) (see the sketch after this list)
- Added support for the `ReduceLROnPlateau` scheduler (#320)
- Added support for Apex mode `O2` in conjunction with Data Parallel (#493)
- Added option (`save_top_k`) to save the top k models in the `ModelCheckpoint` class (#128)
- Added `on_train_start` and `on_train_end` hooks to `ModelHooks` (#598)
- Added `TensorBoardLogger` (#607)
- Added support for weight summary of model with multiple inputs (#543)
- Added `map_location` argument to `load_from_metrics` and `load_from_checkpoint` (#625)
- Added option to disable validation by setting `val_percent_check=0` (#649)
- Added `NeptuneLogger` class (#648)
- Added `WandbLogger` class (#627)
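For illustration, a minimal sketch of two of the options above; the checkpoint path is purely hypothetical and the argument names are as used in this release line:

```python
from pytorch_lightning import Trainer

trainer = Trainer(
    resume_from_checkpoint='checkpoints/epoch_9.ckpt',  # resume from a specific checkpoint (path is illustrative)
    val_percent_check=0,                                # disable validation entirely
)
```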
Changed
- Changed the default progress bar to print to stdout instead of stderr (#531)
- Renamed `step_idx` to `step`, `epoch_idx` to `epoch`, `max_num_epochs` to `max_epochs` and `min_num_epochs` to `min_epochs` (#589)
- Renamed several `Trainer` attributes (#567): `total_batch_nb` to `total_batches`, `nb_val_batches` to `num_val_batches`, `nb_training_batches` to `num_training_batches`, `max_nb_epochs` to `max_epochs`, `min_nb_epochs` to `min_epochs`, and `nb_test_batches` to `num_test_batches`
- Changed gradient logging to use parameter names instead of indexes (#660)
- Changed the default logger to `TensorBoardLogger` (#609) (see the sketch after this list)
- Changed the directory for tensorboard logging to be the same as model checkpointing (#706)
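As a small, hedged illustration of the new default logger (the import path shown is the one we recall for this release line and may differ in other versions):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.logging import TensorBoardLogger

# TensorBoardLogger is now the default, but it can also be constructed explicitly
logger = TensorBoardLogger(save_dir='lightning_logs', name='my_experiment')
trainer = Trainer(logger=logger)
```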
Deprecated
- Deprecated `max_nb_epochs` and `min_nb_epochs` (#567)
- Deprecated the `on_sanity_check_start` hook in `ModelHooks` (#598)
Removed
- Removed the `save_best_only` argument from `ModelCheckpoint`, use `save_top_k=1` instead (#128)
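For anyone migrating, a hedged before/after sketch (the checkpoint directory is illustrative):

```python
from pytorch_lightning.callbacks import ModelCheckpoint

# before (no longer supported):
# checkpoint = ModelCheckpoint(filepath='checkpoints/', save_best_only=True)

# after: keep only the single best model
checkpoint = ModelCheckpoint(filepath='checkpoints/', save_top_k=1)
```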
Fixed
- Fixed a bug which occurred when using Adagrad with cuda (#554)
- Fixed a bug where training would be on the GPU despite setting `gpus=0` or `gpus=[]` (#561)
- Fixed an error with `print_nan_gradients` when some parameters do not require gradient (#579)
- Fixed a bug where the progress bar would show an incorrect number of total steps during the validation sanity check when using multiple validation data loaders (#597)
- Fixed support for PyTorch 1.1.0 (#552)
- Fixed an issue with early stopping when using a `val_check_interval < 1.0` in `Trainer` (#492)
- Fixed bugs relating to the `CometLogger` object that would cause it to not work properly (#481)
- Fixed a bug that would occur when returning `-1` from `on_batch_start` following an early exit or when the batch was `None` (#509)
- Fixed a potential race condition with several processes trying to create checkpoint directories (#530)
- Fixed a bug where batch 'segments' would remain on the GPU when using `truncated_bptt > 1` (#532)
- Fixed a bug when using `IterableDataset` (#547)
- Fixed a bug where `.item` was called on non-tensor objects (#602)
- Fixed a bug where `Trainer.train` would crash on an uninitialized variable if the trainer was run after resuming from a checkpoint that was already at `max_epochs` (#608)
- Fixed a bug where early stopping would begin two epochs early (#617)
- Fixed a bug where `num_training_batches` and `num_test_batches` would sometimes be rounded down to zero (#649)
- Fixed a bug where an additional batch would be processed when manually setting `num_training_batches` (#653)
- Fixed a bug when batches did not have a `.copy` method (#701)
- Fixed a bug when using `log_gpu_memory=True` in Python 3.6 (#715)
- Fixed a bug where checkpoint writing could exit before completion, giving incomplete checkpoints (#689)
- Fixed a bug where `on_train_end` was not called when early stopping (#723)
Contributors
@akhti, @alumae, @awaelchli, @Borda, @borisdayma, @ctlaltdefeat, @dreamgonfly, @elliotwaite, @fdiehl, @goodok, @haossr, @HarshSharma12, @Ir1d, @jakubczakon, @jeffling, @kuynzereb, @MartinPernus, @matthew-z, @MikeScarp, @mpariente, @neggert, @rwesterman, @ryanwongsa, @schwobr, @tullie, @vikmary, @VSJMilewski, @williamFalcon, @YehCF
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Generalization!
Generalization release
The main focus of this release was on adding flexibility and generalization to support broad research cases.
The next release will be Dec 7th (we release every 30 days).
Internal Facebook support
@lorenzoFabbri @tullie @myleott @ashwinb @shootingsoul @vreis
These features were added to support FAIR, FAIAR and broader ML across other FB teams.
In general, we can expose any part that isn't exposed yet, wherever someone might want to override the Lightning implementation.
- Added truncated back propagation through time support (thanks @tullie).
```python
Trainer(truncated_bptt_steps=2)
```

- Added iterable datasets.

```python
# return an IterableDataset
def train_dataloader(...):
    ds = IterableDataset(...)
    return DataLoader(ds)

# set validation to a fixed number of batches
# (checks val every 100 train batches)
Trainer(val_check_interval=100)
```

- Added the ability to customize `backward` and other training parts:

```python
from apex import amp  # only needed when use_amp is True


def backward(self, use_amp, loss, optimizer):
    """
    Override backward with your own implementation if you need to

    :param use_amp: Whether amp was requested or not
    :param loss: Loss is already scaled by accumulated grads
    :param optimizer: Current optimizer being used
    :return:
    """
    if use_amp:
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
    else:
        loss.backward()
```

- DDP custom implementation support (override these hooks):

```python
import os

import torch.distributed as dist

# note: the import path of LightningDistributedDataParallel may differ by version
from pytorch_lightning.overrides.data_parallel import LightningDistributedDataParallel


def configure_ddp(self, model, device_ids):
    """
    Override to init DDP in a different way or use your own wrapper.
    Must return model.

    :param model:
    :param device_ids:
    :return: DDP wrapped model
    """
    model = LightningDistributedDataParallel(
        model,
        device_ids=device_ids,
        find_unused_parameters=True
    )
    return model


def init_ddp_connection(self, proc_rank, world_size):
    """
    Connect all procs in the world using the env:// init
    Use the first node as the root address
    """
    # use slurm job id for the port number
    # guarantees unique ports across jobs from same grid search
    try:
        # use the last 4 numbers in the job id as the id
        default_port = os.environ['SLURM_JOB_ID']
        default_port = default_port[-4:]

        # all ports should be in the 10k+ range
        default_port = int(default_port) + 15000

    except Exception:
        default_port = 12910

    # if user gave a port number, use that one instead
    try:
        default_port = os.environ['MASTER_PORT']
    except Exception:
        os.environ['MASTER_PORT'] = str(default_port)

    # figure out the root node addr
    try:
        root_node = os.environ['SLURM_NODELIST'].split(' ')[0]
    except Exception:
        root_node = '127.0.0.2'

    root_node = self.trainer.resolve_root_node_address(root_node)
    os.environ['MASTER_ADDR'] = root_node
    dist.init_process_group('nccl', rank=proc_rank, world_size=world_size)
```

- Support for your own apex init or implementation.

```python
def configure_apex(self, amp, model, optimizers, amp_level):
    """
    Override to init AMP your own way
    Must return a model and list of optimizers

    :param amp:
    :param model:
    :param optimizers:
    :param amp_level:
    :return: Apex wrapped model and optimizers
    """
    model, optimizers = amp.initialize(
        model, optimizers, opt_level=amp_level,
    )
    return model, optimizers
```

- DDP2 implementation (inspired by parlai and @stephenroller).
DDP2 acts as DP in the node and DDP across nodes.
As a result, an optional method, `training_end`, is introduced,
where you can use the outputs of `training_step` (performed on each GPU with a portion of the batch)
to do something with the outputs of all batches on the node (e.g. negative sampling).

```python
Trainer(distributed_backend='ddp2')


def training_step(...):
    # x is 1/nb_gpus of the full batch
    out = model(x)
    return {'out': out}


def training_end(self, outputs):
    # all_outs has outs from ALL gpus
    all_outs = outputs['out']
    loss = softmax(all_outs)
    return {'loss': loss}
```

Logging
- More logger diversity including Comet.ml.
- Versioned logs for all loggers.
- Switched from print to logging.
Progress bar
- The progress bar now has a full bar for the whole train + val epoch and a second bar visible only during validation.
Loading
- Checkpoints now store hparams.
- No need to pass tags.csv to restore state because it lives in the checkpoint.
Slurm resubmit with apex + ddp
- Fixes an issue where restoring weights with DDP blew out GPU memory (weights are now loaded on CPU first, then moved to GPU).
- Saves apex state automatically and restores it when loading a checkpoint.
Refactoring
- Internal code made modular through mixins for readability and to minimize merge conflicts.
Docs
- Tons of doc improvements.
Thanks!
Thank you to the amazing contributor community! Especially @neggert and @Borda for reviewing PRs and taking care of a good number of GitHub issues. The community is thriving and has really embraced making Lightning better.
Great job everyone!
Simpler interface, new features
0.5.1
Simpler interface
All trainers now have a default logger, early stopping and checkpoint object. To modify the behavior, pass in your own versions of those.
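A minimal sketch of overriding those defaults, assuming the `early_stop_callback` and `checkpoint_callback` Trainer arguments used around this release (names may differ slightly across versions; the checkpoint directory is illustrative):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

# pass your own versions to replace the defaults the Trainer would create
early_stop = EarlyStopping(monitor='val_loss', patience=5)
checkpoint = ModelCheckpoint(filepath='checkpoints/')

trainer = Trainer(
    early_stop_callback=early_stop,
    checkpoint_callback=checkpoint,
)
```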
- Removed collisions with logger versions by tying the version to the job id.
Features
- Added a new DDP implementation (`ddp2`). It uses DP within a node but allows multiple nodes. Useful for models which need negative samples, etc.

```python
Trainer(distributed_backend='ddp2')
```

- Support for LBFGS. If you pass in LBFGS, Lightning handles the closure for you automatically (see the sketch after this list).
- No longer need to set the master port; Lightning does it for you using the job id.
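A hedged sketch of the LBFGS support: returning LBFGS from `configure_optimizers` is enough, and Lightning builds the closure passed to `optimizer.step()` itself. The module below is a hypothetical example, not part of the library:

```python
import torch
import pytorch_lightning as ptl


class LBFGSExample(ptl.LightningModule):  # hypothetical module for illustration
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(10, 1)

    def configure_optimizers(self):
        # returning LBFGS is enough; Lightning constructs the closure
        # and passes it to optimizer.step() for you
        return torch.optim.LBFGS(self.parameters(), lr=0.1)
```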
Minor changes
- `training_step` and `validation_end` now return two separate dicts, one for the progress bar and one for logging.
- Added options to memory printing: `'min_max'` logs only the max/min memory use; `'all'` logs all the GPUs on the root node.
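A quick, hedged sketch of those options, assuming they are passed via the `log_gpu_memory` Trainer argument referenced elsewhere in this changelog:

```python
from pytorch_lightning import Trainer

# log only the min/max GPU memory use on the root node
trainer = Trainer(log_gpu_memory='min_max')

# or log memory for every GPU on the root node
trainer = Trainer(log_gpu_memory='all')
```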
API clean up
This release has breaking API changes. See #124 for all details.
Syntax changes are:
- In trainer options, use `train`, `test`, `val`
- For data loaders: `val_dataloader`, `test_dataloader`, `train_dataloader`
- `data_batch` -> `batch`
- `prog` -> `progress`
- `gradient_clip` -> `gradient_clip_val`
- `add_log_row_interval` -> `row_log_interval`

Various ddp improvements
This release does the following:
- Moves SLURM resubmission from test-tube to PL (which removes the need for the cluster parameter).
- Cluster checkpointing is now done by Lightning (not test-tube). It also no longer requires a checkpoint object to restore weights when on a cluster.
- Loads all models on CPU when restoring weights to avoid OOM issues in PyTorch. Users now need to move the model to GPU manually; however, when using Lightning, it will move the model to the correct GPUs automatically (see the sketch after this list).
- Fixes various subtle bugs in DDP implementation.
- Documentation updates
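Outside of Lightning, the CPU-first restore pattern described above looks roughly like this (the path, checkpoint key, and stand-in model are illustrative only):

```python
import torch
from torch import nn

# stand-in for your model; replace with your own architecture
my_model = nn.Linear(10, 1)

# load weights on CPU first to avoid GPU OOM, then move to GPU yourself
checkpoint = torch.load('model.ckpt', map_location='cpu')
my_model.load_state_dict(checkpoint['state_dict'])
my_model = my_model.cuda()
```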
New features
- `validation_step` and `val_dataloader` are now optional.
- Enabled multiple dataloaders for validation (see the sketch after this list).
- Support for the latest test-tube logger, optimized for PyTorch 1.2.0.
- `lr_scheduler` is now activated after each epoch.
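A hedged sketch of multiple validation dataloaders: return a list from `val_dataloader`. The dataset attributes below are hypothetical, and hook signatures changed across versions, so treat this purely as an illustration:

```python
import pytorch_lightning as ptl
from torch.utils.data import DataLoader

# method of your LightningModule (shown standalone for brevity);
# self.val_set_a / self.val_set_b are hypothetical datasets you define
@ptl.data_loader
def val_dataloader(self):
    return [
        DataLoader(self.val_set_a, batch_size=32),
        DataLoader(self.val_set_b, batch_size=32),
    ]
```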
Stable fully-featured release
0.4.0
0.4.0 is the first public release after a short period of testing with public users. Thanks for all the help ironing out bugs to get Lightning to run on everything from notebooks to local machines to servers.
This release includes:
- Extensively tested code.
- Cleaner API to accommodate the various research use cases
New features
- No need for an experiment object in the trainer.
- Training continuation (not just weights, but also epoch, global step, etc...)
- If the folder the checkpoint callback uses has weights, it loads the last weights automatically.
- `training_step` and `validation_step` no longer reduce outputs automatically. This fixes issues with reducing generated outputs (for example, images or text).
- 16-bit precision can now be used with a single GPU (no DP or DDP in this case). This bypasses a compatibility issue between NVIDIA apex and PyTorch for DP + 16-bit training (see the sketch below).
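A hedged sketch of 16-bit on a single GPU in this era; the `use_amp` flag and list-style `gpus` argument are as best we recall for this version, and apex must be installed separately:

```python
from pytorch_lightning import Trainer

# single GPU + 16-bit: no DP/DDP involved, so the apex/PyTorch DP issue is avoided
trainer = Trainer(gpus=[0], use_amp=True)
```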
Simple data loader
Simplified data loader.
Added a decorator to do lazy loading internally:
Old:
```python
@property
def tng_dataloader(self):
    if self._tng_dataloader is None:
        self._tng_dataloader = DataLoader(...)
    return self._tng_dataloader
```

Now:

```python
@ptl.data_loader
def tng_dataloader(self):
    return DataLoader(...)
```

Tests!
Fully tested!
Includes:
- Code coverage (99%)
- Full tests that run multiple models in different configs
- Full tests that test specific functionality in trainer.