Speculative Decoding Speedup Experiment

This repository contains a full experimental evaluation of speculative decoding using a lightweight draft model (gpt2, 124M params) to accelerate generation from a larger target model (gpt2-xl, 1.5B params).
The goal of this project was to measure how performance (latency, tokens/sec, acceptance rate, and overall speedup) varies across two settings:

  • Temperatures
  • Gamma values (number of draft tokens proposed per step)

Through this experiment, I built a complete benchmarking setup, implemented a custom speculative decoding loop, instrumented it with detailed metrics, and generated a comparative visualization showing how gamma and temperature influence the final speedup.
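
At its core, each iteration of the loop is a draft-then-verify step: gpt2 proposes gamma tokens autoregressively, gpt2-xl scores all of them in a single forward pass, and each drafted token is accepted with probability min(1, q(x)/p(x)), where q is the target's probability and p is the draft's. The sketch below shows one such step under those assumptions; the function and variable names are mine, and the repository's actual loop likely differs (KV caching, batching, device placement, and stopping criteria are omitted).

```python
# Minimal sketch of one speculative decoding step (gpt2 draft, gpt2-xl target).
# Names like `speculative_step` are illustrative, not the repository's API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # both models share the GPT-2 vocabulary
draft = AutoModelForCausalLM.from_pretrained("gpt2").eval()
target = AutoModelForCausalLM.from_pretrained("gpt2-xl").eval()

@torch.no_grad()
def speculative_step(input_ids, gamma=4, temperature=1.0):
    """Propose `gamma` draft tokens, verify them with the target in one pass,
    and return (extended input_ids, number of drafted tokens accepted)."""
    # 1) Draft model proposes gamma tokens autoregressively.
    draft_ids, draft_probs = input_ids, []
    for _ in range(gamma):
        logits = draft(draft_ids).logits[:, -1, :] / temperature
        p = torch.softmax(logits, dim=-1)
        tok = torch.multinomial(p, 1)
        draft_probs.append(p)
        draft_ids = torch.cat([draft_ids, tok], dim=-1)

    # 2) Target model scores the prompt plus all drafted tokens in a single forward pass.
    t_probs = torch.softmax(target(draft_ids).logits / temperature, dim=-1)

    # 3) Accept each drafted token with probability min(1, q/p); on the first
    #    rejection, resample that position from the residual distribution max(0, q - p).
    n_prompt, accepted = input_ids.shape[1], 0
    for i in range(gamma):
        tok = draft_ids[0, n_prompt + i]
        q = t_probs[0, n_prompt + i - 1, tok]   # target prob of the drafted token
        p = draft_probs[i][0, tok]              # draft prob of the same token
        if torch.rand(()) < torch.clamp(q / p, max=1.0):
            accepted += 1
        else:
            residual = torch.clamp(t_probs[0, n_prompt + i - 1] - draft_probs[i][0], min=0)
            next_tok = torch.multinomial(residual / residual.sum(), 1)
            break
    else:
        # All gamma tokens accepted: sample one bonus token from the target.
        next_tok = torch.multinomial(t_probs[0, -1], 1)

    out = torch.cat([input_ids[0], draft_ids[0, n_prompt:n_prompt + accepted], next_tok])
    return out.unsqueeze(0), accepted
```

A full generation then simply repeats `speculative_step` until enough new tokens have been produced; the per-step `accepted` counts are what feed the acceptance-rate metric.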


What This Experiment Achieved

  • Implemented speculative decoding from scratch using PyTorch + HuggingFace Transformers
  • Benchmarked generation time, tokens/sec, acceptance rate, and overall speedup
  • Explored 40+ combinations of (temperature, gamma); a benchmarking sketch follows this list
  • Identified the globally optimal speedup across all settings
  • Generated a final visualization summarizing how gamma affects latency and speedup at each temperature
  • Key observations:
    • Moderate gamma values produce the best speedup
    • Very large gamma → low acceptance → slower performance
    • Temperature directly influences acceptance and stability
    • The draft model can meaningfully accelerate decoding when tuned properly
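
To make the sweep concrete, the sketch below shows one way the (temperature, gamma) grid could be benchmarked against plain gpt2-xl sampling. It reuses `speculative_step`, `target`, and `tokenizer` from the sketch above; the prompt, grid values, token budget, and column names are illustrative placeholders rather than the exact settings used in this repository.

```python
# Hedged sketch of the (temperature, gamma) sweep; values below are placeholders.
import csv
import time
import torch

PROMPT = "Speculative decoding works by"   # illustrative prompt
MAX_NEW_TOKENS = 64                        # illustrative token budget
temperatures = [0.5, 0.8, 1.0, 1.2]        # placeholder grid
gammas = [1, 2, 4, 6, 8]

@torch.no_grad()
def run_speculative(temperature, gamma):
    """Generate ~MAX_NEW_TOKENS with the draft+target loop; return (latency, tokens/sec, acceptance rate)."""
    ids = tokenizer(PROMPT, return_tensors="pt").input_ids
    start_len, accepted_total, steps = ids.shape[1], 0, 0
    t0 = time.perf_counter()
    while ids.shape[1] - start_len < MAX_NEW_TOKENS:
        ids, accepted = speculative_step(ids, gamma=gamma, temperature=temperature)
        accepted_total += accepted
        steps += 1
    elapsed = time.perf_counter() - t0
    new_tokens = ids.shape[1] - start_len
    return elapsed, new_tokens / elapsed, accepted_total / (steps * gamma)

@torch.no_grad()
def run_baseline(temperature):
    """Time plain autoregressive sampling from gpt2-xl alone; return tokens/sec."""
    ids = tokenizer(PROMPT, return_tensors="pt").input_ids
    t0 = time.perf_counter()
    target.generate(ids, do_sample=True, temperature=temperature,
                    max_new_tokens=MAX_NEW_TOKENS, pad_token_id=tokenizer.eos_token_id)
    return MAX_NEW_TOKENS / (time.perf_counter() - t0)

rows = []
for temp in temperatures:
    baseline_tps = run_baseline(temp)
    for gamma in gammas:
        latency, tps, acceptance = run_speculative(temp, gamma)
        rows.append({"temperature": temp, "gamma": gamma, "latency_s": round(latency, 3),
                     "tokens_per_s": round(tps, 2), "acceptance_rate": round(acceptance, 3),
                     "speedup": round(tps / baseline_tps, 3)})

with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```

The "moderate gamma is best" observation matches the analysis in the speculative decoding paper listed under Sources: with per-token acceptance rate α and a draft-to-target cost ratio c, the expected improvement factor is (1 - α^(γ+1)) / ((1 - α)(γc + 1)), which rises with γ at first and then falls once the extra rejected drafts outweigh the saved target calls.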

This repository provides a reproducible, visual picture of speculative decoding efficiency, useful for research, inference optimization, and system-level ML work.


Final Plot

[Figure: Speculative Decoding Results, showing latency and speedup as a function of gamma at each temperature]
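
If the sweep above has written results.csv, a plot along these lines reproduces the comparison (the column names follow the benchmarking sketch and are assumptions, not necessarily the repository's actual schema):

```python
# Sketch of the final comparison plot, assuming the hypothetical results.csv from the sweep sketch.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("results.csv")
fig, (ax_lat, ax_spd) = plt.subplots(1, 2, figsize=(10, 4))
for temp, grp in df.groupby("temperature"):
    ax_lat.plot(grp["gamma"], grp["latency_s"], marker="o", label=f"T={temp}")
    ax_spd.plot(grp["gamma"], grp["speedup"], marker="o", label=f"T={temp}")
ax_lat.set(xlabel="gamma", ylabel="latency (s)", title="Latency vs. gamma")
ax_spd.set(xlabel="gamma", ylabel="speedup vs. gpt2-xl alone", title="Speedup vs. gamma")
ax_spd.legend()
fig.tight_layout()
fig.savefig("speculative_decoding_results.png", dpi=150)
```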


Sources

  1. OpenAI GPT-2 Model Card
  2. GPT-2 XL Model Card
  3. Speculative Decoding Paper
  4. Speculative Decoding Tech Review Paper by DeepMind
