Fix layer decay to work as intended with optimizers #2532
Closed
Summary
In the current `create_optimizer` implementation, the `layer_decay` parameter does not have any effect. This pull request removes the unused `lr_scale` and instead directly sets the `lr` values of the parameter groups, scaled according to their layer depth. This change leads to significantly improved training performance, as the learning rate is now correctly applied per layer.

What was changed
Removed the unused `lr_scale` parameter in param groups; each group's `lr` is now set directly.

Results
A simple CIFAR-10 test shows that the previous implementation had minimal effect when setting `layer_decay`, while the new version properly applies the decay and improves performance.
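Why the old behavior was a no-op can be illustrated with a minimal sketch (my own repro of the mechanism, not code from this PR): a stock PyTorch optimizer stores unknown param-group keys such as `lr_scale` but never reads them in `step()`, so unless some other component consumes the key, the per-layer scaling silently does nothing.

```python
import torch

# A param group carrying an extra "lr_scale" key: the optimizer keeps
# the key in its group dict but ignores it during the update.
p = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.SGD([{"params": [p], "lr_scale": 0.1}], lr=1.0)
p.grad = torch.ones(1)
opt.step()

# The update used lr=1.0; if lr_scale were honored, p would be -0.1.
print(p.item())  # -1.0
```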
Experimental Setup
Using the AdamW optimizer from `create_optimizer_v2`, with PyTorch Lightning (deterministic=True).

I chose to explicitly apply the `lr` values in the `param_groups_layer_decay` function, as relying on an `lr_scale` parameter would have required significantly more effort to propagate and support correctly across all relevant components.
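The approach can be sketched as follows; this is an illustrative simplification under assumed names (the `layer_depth` key and the helper below are hypothetical, not timm's exact `param_groups_layer_decay` code), showing how each group's `lr` is set directly from its layer depth instead of carrying an `lr_scale`:

```python
# Illustrative sketch: set per-group lr directly, scaled by layer depth.
# Earlier layers (smaller depth) get geometrically smaller learning rates.
def apply_layer_decay(param_groups, base_lr, layer_decay, num_layers):
    """Set each group's lr to base_lr * layer_decay ** (num_layers - depth)."""
    for group in param_groups:
        depth = group["layer_depth"]  # assumed per-group depth annotation
        group["lr"] = base_lr * layer_decay ** (num_layers - depth)
    return param_groups

groups = [{"params": [], "layer_depth": d} for d in range(3)]
apply_layer_decay(groups, base_lr=1e-3, layer_decay=0.5, num_layers=2)
print([g["lr"] for g in groups])  # [0.00025, 0.0005, 0.001]
```

Because `lr` is a key every PyTorch optimizer already reads, no extra plumbing is needed in schedulers or other components.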