Releases: Lightning-AI/pytorch-lightning
Simplifications & new docs
This release focused on a ton of bug fixes, small optimizations to training, and most importantly, clean new docs!
Major changes
We have released new documentation; please bear with us as we fix broken links and patch in missing pieces.
The project has moved to the new PyTorchLightning organization, so the repository no longer lives under WilliamFalcon/PyTorchLightning.
We have added our own custom TensorBoard logger as the default logger.
We have upgraded continuous integration to speed up automatic testing.
We have fixed GAN training by adding support for multiple optimizers.
Complete changelog
Added
- Added support for resuming from a specific checkpoint via the `resume_from_checkpoint` argument (#516) (see the sketch after this list)
- Added support for the `ReduceLROnPlateau` scheduler (#320)
- Added support for Apex mode `O2` in conjunction with Data Parallel (#493)
- Added option (`save_top_k`) to save the top k models in the `ModelCheckpoint` class (#128)
- Added `on_train_start` and `on_train_end` hooks to `ModelHooks` (#598)
- Added `TensorBoardLogger` (#607)
- Added support for weight summary of model with multiple inputs (#543)
- Added `map_location` argument to `load_from_metrics` and `load_from_checkpoint` (#625)
- Added option to disable validation by setting `val_percent_check=0` (#649)
- Added `NeptuneLogger` class (#648)
- Added `WandbLogger` class (#627)
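For illustration, a minimal sketch of two of the options above; the checkpoint path is purely hypothetical and the argument names are as used in this release line:

```python
from pytorch_lightning import Trainer

trainer = Trainer(
    resume_from_checkpoint='checkpoints/epoch_9.ckpt',  # resume from a specific checkpoint (path is illustrative)
    val_percent_check=0,                                # disable validation entirely
)
```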
Changed
- Changed the default progress bar to print to stdout instead of stderr (#531)
- Renamed `step_idx` to `step`, `epoch_idx` to `epoch`, `max_num_epochs` to `max_epochs` and `min_num_epochs` to `min_epochs` (#589)
- Renamed several `Trainer` attributes (#567): `total_batch_nb` to `total_batches`, `nb_val_batches` to `num_val_batches`, `nb_training_batches` to `num_training_batches`, `max_nb_epochs` to `max_epochs`, `min_nb_epochs` to `min_epochs`, and `nb_test_batches` to `num_test_batches`
- Changed gradient logging to use parameter names instead of indexes (#660)
- Changed the default logger to `TensorBoardLogger` (#609) (see the sketch after this list)
- Changed the directory for tensorboard logging to be the same as model checkpointing (#706)
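As a small, hedged illustration of the new default logger (the import path shown is the one we recall for this release line and may differ in other versions):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.logging import TensorBoardLogger

# TensorBoardLogger is now the default, but it can also be constructed explicitly
logger = TensorBoardLogger(save_dir='lightning_logs', name='my_experiment')
trainer = Trainer(logger=logger)
```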
Deprecated
- Deprecated `max_nb_epochs` and `min_nb_epochs` (#567)
- Deprecated the `on_sanity_check_start` hook in `ModelHooks` (#598)
Removed
- Removed the `save_best_only` argument from `ModelCheckpoint`, use `save_top_k=1` instead (#128)
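For anyone migrating, a hedged before/after sketch (the checkpoint directory is illustrative):

```python
from pytorch_lightning.callbacks import ModelCheckpoint

# before (no longer supported):
# checkpoint = ModelCheckpoint(filepath='checkpoints/', save_best_only=True)

# after: keep only the single best model
checkpoint = ModelCheckpoint(filepath='checkpoints/', save_top_k=1)
```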
Fixed
- Fixed a bug which occurred when using Adagrad with cuda (#554)
- Fixed a bug where training would be on the GPU despite setting `gpus=0` or `gpus=[]` (#561)
- Fixed an error with `print_nan_gradients` when some parameters do not require gradient (#579)
- Fixed a bug where the progress bar would show an incorrect number of total steps during the validation sanity check when using multiple validation data loaders (#597)
- Fixed support for PyTorch 1.1.0 (#552)
- Fixed an issue with early stopping when using a `val_check_interval < 1.0` in `Trainer` (#492)
- Fixed bugs relating to the `CometLogger` object that would cause it to not work properly (#481)
- Fixed a bug that would occur when returning `-1` from `on_batch_start` following an early exit or when the batch was `None` (#509)
- Fixed a potential race condition with several processes trying to create checkpoint directories (#530)
- Fixed a bug where batch 'segments' would remain on the GPU when using `truncated_bptt > 1` (#532)
- Fixed a bug when using `IterableDataset` (#547)
- Fixed a bug where `.item` was called on non-tensor objects (#602)
- Fixed a bug where `Trainer.train` would crash on an uninitialized variable if the trainer was run after resuming from a checkpoint that was already at `max_epochs` (#608)
- Fixed a bug where early stopping would begin two epochs early (#617)
- Fixed a bug where `num_training_batches` and `num_test_batches` would sometimes be rounded down to zero (#649)
- Fixed a bug where an additional batch would be processed when manually setting `num_training_batches` (#653)
- Fixed a bug when batches did not have a `.copy` method (#701)
- Fixed a bug when using `log_gpu_memory=True` in Python 3.6 (#715)
- Fixed a bug where checkpoint writing could exit before completion, giving incomplete checkpoints (#689)
- Fixed a bug where `on_train_end` was not called when early stopping (#723)
Contributors
@akhti, @alumae, @awaelchli, @Borda, @borisdayma, @ctlaltdefeat, @dreamgonfly, @elliotwaite, @fdiehl, @goodok, @haossr, @HarshSharma12, @Ir1d, @jakubczakon, @jeffling, @kuynzereb, @MartinPernus, @matthew-z, @MikeScarp, @mpariente, @neggert, @rwesterman, @ryanwongsa, @schwobr, @tullie, @vikmary, @VSJMilewski, @williamFalcon, @YehCF
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Generalization!
Generalization release
The main focus of this release was on adding flexibility and generalization to support broad research cases.
The next release will be Dec 7th (we release every 30 days).
Internal Facebook support
@lorenzoFabbri @tullie @myleott @ashwinb @shootingsoul @vreis
These features were added to support FAIR, FAIAR and broader ML across other FB teams.
In general, we can expose any part that isn't exposed yet, wherever someone might want to override the Lightning implementation.
- Added truncated back propagation through time support (thanks @tullie).
```python
Trainer(truncated_bptt_steps=2)
```

- Added iterable datasets.

```python
# return an IterableDataset
def train_dataloader(...):
    ds = IterableDataset(...)
    return DataLoader(ds)

# set validation to a fixed number of batches
# (checks val every 100 train batches)
Trainer(val_check_interval=100)
```

- Added the ability to customize `backward` and other training parts:

```python
from apex import amp  # only needed when use_amp is True


def backward(self, use_amp, loss, optimizer):
    """
    Override backward with your own implementation if you need to

    :param use_amp: Whether amp was requested or not
    :param loss: Loss is already scaled by accumulated grads
    :param optimizer: Current optimizer being used
    :return:
    """
    if use_amp:
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
    else:
        loss.backward()
```

- DDP custom implementation support (override these hooks):

```python
import os

import torch.distributed as dist

# note: the import path of LightningDistributedDataParallel may differ by version
from pytorch_lightning.overrides.data_parallel import LightningDistributedDataParallel


def configure_ddp(self, model, device_ids):
    """
    Override to init DDP in a different way or use your own wrapper.
    Must return model.

    :param model:
    :param device_ids:
    :return: DDP wrapped model
    """
    model = LightningDistributedDataParallel(
        model,
        device_ids=device_ids,
        find_unused_parameters=True
    )
    return model


def init_ddp_connection(self, proc_rank, world_size):
    """
    Connect all procs in the world using the env:// init
    Use the first node as the root address
    """
    # use slurm job id for the port number
    # guarantees unique ports across jobs from same grid search
    try:
        # use the last 4 numbers in the job id as the id
        default_port = os.environ['SLURM_JOB_ID']
        default_port = default_port[-4:]

        # all ports should be in the 10k+ range
        default_port = int(default_port) + 15000

    except Exception:
        default_port = 12910

    # if user gave a port number, use that one instead
    try:
        default_port = os.environ['MASTER_PORT']
    except Exception:
        os.environ['MASTER_PORT'] = str(default_port)

    # figure out the root node addr
    try:
        root_node = os.environ['SLURM_NODELIST'].split(' ')[0]
    except Exception:
        root_node = '127.0.0.2'

    root_node = self.trainer.resolve_root_node_address(root_node)
    os.environ['MASTER_ADDR'] = root_node
    dist.init_process_group('nccl', rank=proc_rank, world_size=world_size)
```

- Support for your own apex init or implementation.

```python
def configure_apex(self, amp, model, optimizers, amp_level):
    """
    Override to init AMP your own way
    Must return a model and list of optimizers

    :param amp:
    :param model:
    :param optimizers:
    :param amp_level:
    :return: Apex wrapped model and optimizers
    """
    model, optimizers = amp.initialize(
        model, optimizers, opt_level=amp_level,
    )
    return model, optimizers
```

- DDP2 implementation (inspired by parlai and @stephenroller).
DDP2 acts as DP in the node and DDP across nodes.
As a result, an optional method, `training_end`, is introduced,
where you can use the outputs of `training_step` (performed on each GPU with a portion of the batch)
to do something with the outputs of all batches on the node (e.g. negative sampling).

```python
Trainer(distributed_backend='ddp2')


def training_step(...):
    # x is 1/nb_gpus of the full batch
    out = model(x)
    return {'out': out}


def training_end(self, outputs):
    # all_outs has outs from ALL gpus
    all_outs = outputs['out']
    loss = softmax(all_outs)
    return {'loss': loss}
```

Logging
- More logger diversity including Comet.ml.
- Versioned logs for all loggers.
- Switched from print to logging.
Progress bar
- The progress bar now has a full bar for the whole train + val epoch and a second bar visible only during validation.
Loading
- Checkpoints now store hparams.
- No need to pass tags.csv to restore state because it lives in the checkpoint.
Slurm resubmit with apex + ddp
- Fixes an issue where restoring weights with DDP blew out GPU memory (weights are now loaded on CPU first, then moved to GPU).
- Saves apex state automatically and restores it when loading a checkpoint.
Refactoring
- Internal code made modular through mixins for readability and to minimize merge conflicts.
Docs
- Tons of doc improvements.
Thanks!
Thank you to the amazing contributor community! Especially @neggert and @Borda for reviewing PRs and taking care of a good number of GitHub issues. The community is thriving and has really embraced making Lightning better.
Great job everyone!
Simpler interface, new features
0.5.1
Simpler interface
All trainers now have a default logger, early stopping and checkpoint object. To modify the behavior, pass in your own versions of those.
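A minimal sketch of overriding those defaults, assuming the `early_stop_callback` and `checkpoint_callback` Trainer arguments used around this release (names may differ slightly across versions; the checkpoint directory is illustrative):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

# pass your own versions to replace the defaults the Trainer would create
early_stop = EarlyStopping(monitor='val_loss', patience=5)
checkpoint = ModelCheckpoint(filepath='checkpoints/')

trainer = Trainer(
    early_stop_callback=early_stop,
    checkpoint_callback=checkpoint,
)
```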
- Removed collisions with logger versions by tying the version to the job id.
Features
- Added a new DDP implementation (`ddp2`). It uses DP within a node but allows multiple nodes. Useful for models which need negative samples, etc.

```python
Trainer(distributed_backend='ddp2')
```

- Support for LBFGS. If you pass in LBFGS, Lightning handles the closure for you automatically (see the sketch after this list).
- No longer need to set the master port; Lightning does it for you using the job id.
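A hedged sketch of the LBFGS support: returning LBFGS from `configure_optimizers` is enough, and Lightning builds the closure passed to `optimizer.step()` itself. The module below is a hypothetical example, not part of the library:

```python
import torch
import pytorch_lightning as ptl


class LBFGSExample(ptl.LightningModule):  # hypothetical module for illustration
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(10, 1)

    def configure_optimizers(self):
        # returning LBFGS is enough; Lightning constructs the closure
        # and passes it to optimizer.step() for you
        return torch.optim.LBFGS(self.parameters(), lr=0.1)
```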
Minor changes
- `training_step` and `validation_end` now return two separate dicts, one for the progress bar and one for logging.
- Added options to memory printing: `'min_max'` logs only the max/min memory use; `'all'` logs all the GPUs on the root node.
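A quick, hedged sketch of those options, assuming they are passed via the `log_gpu_memory` Trainer argument referenced elsewhere in this changelog:

```python
from pytorch_lightning import Trainer

# log only the min/max GPU memory use on the root node
trainer = Trainer(log_gpu_memory='min_max')

# or log memory for every GPU on the root node
trainer = Trainer(log_gpu_memory='all')
```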
API clean up
This release has breaking API changes. See #124 for all details.
Syntax changes are:
- In trainer options, use `train`, `test`, `val`
- For data loaders: `val_dataloader`, `test_dataloader`, `train_dataloader`
- `data_batch` -> `batch`
- `prog` -> `progress`
- `gradient_clip` -> `gradient_clip_val`
- `add_log_row_interval` -> `row_log_interval`

Various ddp improvements
This release does the following:
- Moves SLURM resubmission from test-tube to PL (which removes the need for the cluster parameter).
- Cluster checkpointing is now done by Lightning (not test-tube). It also no longer requires a checkpoint object to restore weights when on a cluster.
- Loads all models on CPU when restoring weights to avoid OOM issues in PyTorch. Users now need to move the model to GPU manually; however, when using Lightning, it will move the model to the correct GPUs automatically (see the sketch after this list).
- Fixes various subtle bugs in DDP implementation.
- Documentation updates
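Outside of Lightning, the CPU-first restore pattern described above looks roughly like this (the path, checkpoint key, and stand-in model are illustrative only):

```python
import torch
from torch import nn

# stand-in for your model; replace with your own architecture
my_model = nn.Linear(10, 1)

# load weights on CPU first to avoid GPU OOM, then move to GPU yourself
checkpoint = torch.load('model.ckpt', map_location='cpu')
my_model.load_state_dict(checkpoint['state_dict'])
my_model = my_model.cuda()
```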
New features
- `validation_step` and `val_dataloader` are now optional.
- Enabled multiple dataloaders for validation (see the sketch after this list).
- Support for the latest test-tube logger, optimized for PyTorch 1.2.0.
- `lr_scheduler` is now activated after each epoch.
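A hedged sketch of multiple validation dataloaders: return a list from `val_dataloader`. The dataset attributes below are hypothetical, and hook signatures changed across versions, so treat this purely as an illustration:

```python
import pytorch_lightning as ptl
from torch.utils.data import DataLoader

# method of your LightningModule (shown standalone for brevity);
# self.val_set_a / self.val_set_b are hypothetical datasets you define
@ptl.data_loader
def val_dataloader(self):
    return [
        DataLoader(self.val_set_a, batch_size=32),
        DataLoader(self.val_set_b, batch_size=32),
    ]
```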
Stable fully-featured release
0.4.0
0.4.0 is the first public release after a short period of testing with public users. Thanks for all the help ironing out bugs to get Lightning to run on everything from notebooks to local machines to servers.
This release includes:
- Extensively tested code.
- Cleaner API to accommodate the various research use cases
New features
- No need for an experiment object in the trainer.
- Training continuation (not just weights, but also epoch, global step, etc...)
- If the folder the checkpoint callback uses has weights, it loads the last weights automatically.
- `training_step` and `validation_step` no longer reduce outputs automatically. This fixes issues with reducing generated outputs (for example, images or text).
- 16-bit precision can now be used with a single GPU (no DP or DDP in this case). This bypasses a compatibility issue between NVIDIA apex and PyTorch for DP + 16-bit training (see the sketch below).
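A hedged sketch of 16-bit on a single GPU in this era; the `use_amp` flag and list-style `gpus` argument are as best we recall for this version, and apex must be installed separately:

```python
from pytorch_lightning import Trainer

# single GPU + 16-bit: no DP/DDP involved, so the apex/PyTorch DP issue is avoided
trainer = Trainer(gpus=[0], use_amp=True)
```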
Simple data loader
Simplified data loader.
Added a decorator to do lazy loading internally:
Old:
```python
@property
def tng_dataloader(self):
    if self._tng_dataloader is None:
        self._tng_dataloader = DataLoader(...)
    return self._tng_dataloader
```

Now:

```python
@ptl.data_loader
def tng_dataloader(self):
    return DataLoader(...)
```

Tests!
Fully tested!
Includes:
- Code coverage (99%)
- Full tests that run multiple models in different configs
- Full tests that test specific functionality in trainer.