This repository contains a full experimental evaluation of speculative decoding using a lightweight draft model (gpt2, 124M params) to accelerate generation from a larger target model (gpt2-xl, 1.5B params).
The goal of this project was to measure how performance (latency, tokens/sec, acceptance rate, and overall speedup) varies across two settings (the sweep is sketched below):
- Temperatures
- Gamma values (number of draft tokens proposed per step)
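For concreteness, the sweep has roughly this shape. The grids and the `run_benchmark` helper below are illustrative placeholders, not the repository's actual values or API:

```python
# Illustrative parameter sweep; the repo's actual grids may differ.
import itertools

def run_benchmark(temperature: float, gamma: int) -> None:
    """Hypothetical entry point: would time speculative vs. baseline decoding."""
    ...

temperatures = [0.0, 0.3, 0.7, 1.0, 1.3]   # sampling temperature for draft and target
gammas = [1, 2, 3, 4, 6, 8, 12, 16]        # draft tokens proposed per verification step

# 5 temperatures x 8 gammas = 40 (temperature, gamma) combinations
for temperature, gamma in itertools.product(temperatures, gammas):
    run_benchmark(temperature=temperature, gamma=gamma)
```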
For this experiment, I built a complete benchmarking setup, implemented a custom speculative decoding loop, instrumented detailed metrics, and generated a comparative visualization showing how gamma and temperature influence the final speedup.
- Implemented speculative decoding from scratch using PyTorch + HuggingFace Transformers (a minimal sketch of the loop follows this list)
- Benchmarked generation time, tokens/sec, acceptance rate, and overall speedup
- Explored 40+ combinations of (temperature, gamma)
- Identified the globally optimal speedup across all settings
- Generated a final visualization summarizing how gamma affects latency and speedup at each temperature
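To make the core idea concrete, here is a minimal sketch of one speculative decoding step. It uses the greedy accept-the-matching-prefix variant for clarity; the repository's loop additionally handles temperature sampling, where acceptance follows the standard probabilistic accept/reject rule. The model names match the README, but the function and variable names are illustrative, not the repo's API:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # gpt2 and gpt2-xl share a tokenizer
draft = AutoModelForCausalLM.from_pretrained("gpt2")       # 124M draft model
target = AutoModelForCausalLM.from_pretrained("gpt2-xl")   # 1.5B target model

@torch.no_grad()
def speculative_step(input_ids, gamma=4):
    """One step: propose `gamma` draft tokens, verify all of them with a single target pass."""
    # 1) Draft model proposes gamma tokens autoregressively (greedy here for clarity).
    draft_ids = input_ids
    for _ in range(gamma):
        logits = draft(draft_ids).logits[:, -1, :]
        next_id = logits.argmax(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_id], dim=-1)

    # 2) One target forward pass scores the prompt plus all proposed tokens.
    target_logits = target(draft_ids).logits
    # Logits at position i predict token i+1, so this slice yields gamma+1 predictions:
    # one for each proposed token, plus one "bonus" token past the last proposal.
    target_preds = target_logits[:, input_ids.shape[1] - 1 :, :].argmax(dim=-1)
    proposed = draft_ids[:, input_ids.shape[1] :]

    # 3) Accept the longest prefix on which draft and target agree (assumes batch size 1).
    matches = (target_preds[:, :gamma] == proposed)[0]
    n_accepted = int(matches.long().cumprod(dim=0).sum())

    # 4) Keep the accepted tokens and always append one target-verified token,
    #    so every step makes progress even when nothing is accepted.
    accepted = proposed[:, :n_accepted]
    bonus = target_preds[:, n_accepted : n_accepted + 1]
    return torch.cat([input_ids, accepted, bonus], dim=-1), n_accepted
```

A single step can then be timed to recover per-step acceptance and throughput:

```python
import time

ids = tokenizer("Speculative decoding works because", return_tensors="pt").input_ids
start = time.perf_counter()
out, n_acc = speculative_step(ids, gamma=4)
elapsed = time.perf_counter() - start
print(f"accepted {n_acc}/4 draft tokens, {(out.shape[1] - ids.shape[1]) / elapsed:.1f} tok/s")
```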
- Key observations:
  - Moderate gamma values produce the best speedup (the simple cost model below shows why)
  - Very large gamma → low acceptance → net slowdown, since rejected draft work is wasted
  - Temperature directly influences acceptance rate and stability
  - The draft model can meaningfully accelerate decoding when tuned properly
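The gamma trade-off follows from the standard analytic model of speculative decoding, which assumes each draft token is accepted independently with probability `alpha`. The numbers below are illustrative, not measured results from this repo:

```python
# Back-of-the-envelope model for why moderate gamma wins.
def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected walltime speedup under i.i.d. acceptance (requires 0 <= alpha < 1).

    alpha: probability each draft token is accepted
    gamma: draft tokens proposed per verification step
    c:     draft-model cost relative to the target (e.g. ~0.1 for gpt2 vs gpt2-xl)
    """
    expected_tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)  # tokens emitted per step
    cost_per_step = gamma * c + 1                               # gamma draft passes + 1 target pass
    return expected_tokens / cost_per_step

# Speedup rises with gamma at first, then falls once the extra draft work
# outweighs the shrinking chance that a long proposal survives verification.
for gamma in (1, 2, 4, 8, 16):
    print(gamma, round(expected_speedup(alpha=0.8, gamma=gamma, c=0.1), 2))
```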
This repository provides a reproducible, visual picture of speculative decoding efficiency, useful for research, inference optimization, and system-level ML work.
