Gradient Cache is a production-ready PyTorch extension that reduces GPU memory usage by 90%+ during neural network training through intelligent gradient compression and CPU offloading.
- 90%+ Memory Savings: Compress gradients by 100x with minimal accuracy impact
- Larger Batch Sizes: Train with 2-3x larger batches on the same hardware
- Simple Integration: Just 3 lines of code to add to any training loop
- Universal Compatibility: Works with any PyTorch model and optimizer
- Production Ready: Tested on A100 and T4 GPUs with real models
| Model | Parameters | Memory Saved | Compression |
|---|---|---|---|
| GPT-2 Small | 124M | 479 MB/step | 100x |
| GPT-2 Medium | 350M | ~1.3 GB/step | 100x |
| Custom NN | 50M | 144 MB/step | 100x |
pip install gradient-cacheOr install from source:
git clone https://github.com/JonSnow1807/gradient-cache cd gradient-cache pip install -e .Add gradient cache to any PyTorch training loop with just 3 lines:
import gradient_cache # Create your model model = create_your_model().cuda() # Add gradient cache (1 line) hook_manager = gradient_cache.create_gradient_cache(model, compression_ratio=100) # Normal training loop optimizer = torch.optim.Adam(model.parameters()) for batch in dataloader: loss = model(batch).mean() loss.backward() # Compress gradients (1 line) hook_manager.compress_and_free_gradients() # Restore gradients and update (1 line) hook_manager.apply_gradients() optimizer.step() optimizer.zero_grad()Use the decorator for automatic integration:
from metaflow import FlowSpec, step import gradient_cache class MyTrainingFlow(FlowSpec): @step @gradient_cache.optimize(compression_ratio=100) def train(self): # Your training code - no changes needed! model = create_model() optimizer = torch.optim.Adam(model.parameters()) # ... rest of trainingimport pytorch_lightning as pl import gradient_cache class MyModel(pl.LightningModule): def __init__(self): super().__init__() self.model = create_model() self.hook_manager = gradient_cache.create_gradient_cache(self.model) def training_step(self, batch, batch_idx): loss = self.model(batch).mean() return loss def on_after_backward(self): self.hook_manager.compress_and_free_gradients() def optimizer_step(self, *args, **kwargs): self.hook_manager.apply_gradients() super().optimizer_step(*args, **kwargs)# Conservative - 10x compression (keep 10%) hook_manager = gradient_cache.create_gradient_cache(model, compression_ratio=10) # Aggressive - 1000x compression (keep 0.1%) hook_manager = gradient_cache.create_gradient_cache(model, compression_ratio=1000)# Don't compress embeddings or output layers hook_manager = gradient_cache.GradientCacheHookManager( model, compression_ratio=100, exclude_layers=['embedding', 'lm_head'] )# Enable verbose mode hook_manager = gradient_cache.create_gradient_cache(model, verbose=True) # Get compression statistics stats = hook_manager.get_compression_summary() print(f"Compression ratio: {stats['overall_compression_ratio']:.1f}x") print(f"Memory saved: {stats['memory_saved_mb']:.1f} MB")- Gradient Computation: Normal backward pass computes gradients
- Compression: Keep only top 1% of gradient values by magnitude
- CPU Offload: Move compressed gradients to system RAM
- GPU Memory Release: Free GPU memory for next batch
- Gradient Restoration: Restore gradients for optimizer step
- Cost Savings: Use smaller, cheaper GPU instances
- Larger Models: Train models that don't fit in GPU memory
- Faster Research: Iterate quickly with larger batch sizes
- Easy Integration: No model architecture changes needed
Run the test suite:
python tests/test_gradient_cache.pyIf you use Gradient Cache in your research, please cite:
@software{gradient_cache, title = {Gradient Cache: GPU Memory-Efficient Training}, author = {Gradient Cache Contributors}, year = {2024}, url = {https://github.com/gradient-cache/gradient-cache} }Apache License 2.0 - see LICENSE for details.
We welcome contributions! Please submit issues and pull requests on GitHub.
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Built with β€οΈ for the ML community