Top 23 Python Inference Projects
vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
- Project mention: Getting Started with Mooncake: Installation, Execution & Troubleshooting | dev.to | 2025-12-11
```bash
git clone -b v0.8.5 https://github.com/vllm-project/vllm.git --recursive
cd vllm
python use_existing_torch.py
```
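Once it's built, a minimal offline-inference sketch with vLLM's Python API looks roughly like this; the OPT-125M checkpoint is just a small example model, not anything the article prescribes:

```python
# Minimal vLLM offline-inference sketch (the model choice is illustrative).
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")            # any vLLM-supported HF model
params = SamplingParams(temperature=0.8, max_tokens=64)

# generate() batches prompts and returns one RequestOutput per prompt.
for out in llm.generate(["The capital of France is"], params):
    print(out.outputs[0].text)
```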
DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
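On the inference side, a rough sketch of wrapping an existing Hugging Face model with DeepSpeed's optimized kernels via `init_inference`; the GPT-2 checkpoint and fp16 setting are illustrative assumptions, and a CUDA GPU is assumed:

```python
# Sketch: DeepSpeed inference engine around a Hugging Face model.
# Assumes a CUDA GPU; checkpoint and dtype are arbitrary examples.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# init_inference swaps in DeepSpeed's optimized transformer kernels.
engine = deepspeed.init_inference(model, dtype=torch.float16,
                                  replace_with_kernel_inject=True)

inputs = tokenizer("DeepSpeed makes inference", return_tensors="pt").to("cuda")
out = engine.module.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0]))
```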
sglang
SGLang is a fast serving framework for large language models and vision language models.
No, it's not Harmony; Z.ai has their own format, which they modified slightly for this release (by removing the required newlines from their previous format). You can see their tool-call parsing code here: https://github.com/sgl-project/sglang/blob/34013d9d5a591e3c0...
ml-engineering
Machine Learning Engineering Open Book
For kernel-level performance tuning, you can use the occupancy calculator, as pointed out by jplusqualt, or you can profile your kernel with Nsight Compute, which will give you a ton of info.
But for model-wide performance, you basically have to come up with your own calculation to estimate the FLOPs required by your model, and from that figure out how well your model is maxing out the GPU's capabilities (MFU/HFU); a back-of-the-envelope sketch follows below.
Here is a more in-depth example of how you might do this: https://github.com/stas00/ml-engineering/tree/master/trainin...
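A back-of-the-envelope version of that calculation, using the common ~6 × params × tokens approximation for transformer training FLOPs per step; every number below is an illustrative assumption, not a measurement from the linked repo:

```python
# Rough MFU estimate for a decoder-only transformer.
# All numbers are illustrative assumptions; substitute your own.
n_params = 7e9                  # model parameters
tokens_per_step = 8 * 2048      # global batch size * sequence length
step_time_s = 1.0               # measured wall-clock seconds per step
peak_flops_per_gpu = 312e12     # e.g. A100 peak BF16 throughput
n_gpus = 8

# ~6 FLOPs per parameter per token covers forward + backward passes.
achieved_flops = 6 * n_params * tokens_per_step / step_time_s
mfu = achieved_flops / (peak_flops_per_gpu * n_gpus)
print(f"Estimated MFU: {mfu:.1%}")   # ~27.6% with the numbers above
```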
text-generation-inference
Large Language Model Text Generation Inference
Resource: TGI (Text Generation Inference)
server
The Triton Inference Server provides an optimized cloud and edge inferencing solution. (by triton-inference-server)
Project mention: Gluon: a GPU programming language based on the same compiler stack as Triton | news.ycombinator.com | 2025-09-17
Also it REALLY jams me up that this is a thing, complicating discussions: https://github.com/triton-inference-server/server
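For reference, a hedged sketch of querying a running Triton server with the `tritonclient` HTTP API; the model and tensor names ("my_model", "INPUT0", "OUTPUT0") are assumptions that must match your model's config.pbtxt:

```python
# Sketch: calling a Triton Inference Server model over HTTP.
# Model/tensor names are hypothetical placeholders.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

data = np.random.rand(1, 3).astype(np.float32)
inp = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
inp.set_data_from_numpy(data)

result = client.infer(model_name="my_model", inputs=[inp])
print(result.as_numpy("OUTPUT0"))
```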
inference
Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source, speech, and multimodal models on cloud, on-prem, or your laptop — all through one unified, production-ready inference API.
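Since Xinference exposes an OpenAI-compatible endpoint, the "single line" swap is typically just the base URL. A sketch assuming the default local port (9997) and a hypothetical model name; adjust both to whatever you started with `xinference launch`:

```python
# Sketch: pointing the stock OpenAI client at a local Xinference server.
# Port and model name are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="qwen2.5-instruct",  # hypothetical: whichever model you launched
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```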
adversarial-robustness-toolbox
Adversarial Robustness Toolbox (ART) - Python Library for Machine Learning Security - Evasion, Poisoning, Extraction, Inference - Red and Blue Teams
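As a quick taste of the evasion side, a sketch running an FGSM attack against a scikit-learn classifier; the toy dataset and epsilon are illustrative:

```python
# Sketch: FGSM evasion attack with ART on a scikit-learn model.
# Toy data and eps value are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from art.estimators.classification import SklearnClassifier
from art.attacks.evasion import FastGradientMethod

x = np.random.rand(100, 4).astype(np.float32)
y = (x.sum(axis=1) > 2).astype(int)

clf = LogisticRegression().fit(x, y)
art_clf = SklearnClassifier(model=clf)   # ART wrapper exposing gradients

attack = FastGradientMethod(estimator=art_clf, eps=0.2)
x_adv = attack.generate(x=x)
print("accuracy on adversarial inputs:", clf.score(x_adv, y))
```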
gpustack
GPUStack doesn't seem to have the lowest-common-denominator problem, and it supports many architectures.
https://github.com/gpustack/gpustack
FastDeploy
High-performance Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle
optimum
🚀 Accelerate inference and training of 🤗 Transformers, Diffusers, TIMM and Sentence Transformers with easy to use hardware optimization tools
- Project mention: Gemma 3 270M re-implemented in pure PyTorch for local tinkering | news.ycombinator.com | 2025-08-20
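A hedged sketch of Optimum's ONNX Runtime path, exporting a Transformers checkpoint (chosen arbitrarily here) and running it through the same tokenizer workflow:

```python
# Sketch: exporting a Transformers model to ONNX Runtime via Optimum.
# The checkpoint is an arbitrary example.
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Optimum makes ONNX export painless", return_tensors="pt")
print(model(**inputs).logits)
```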
inference
Turn any computer or edge device into a command center for your computer vision projects. (by roboflow)
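A minimal sketch with the SDK's `get_model` entry point; the model alias and image path are assumptions, and some models require a Roboflow API key:

```python
# Sketch: running a vision model with Roboflow's inference SDK.
# Model alias and image path are placeholders; an API key may be needed.
from inference import get_model

model = get_model(model_id="yolov8n-640")
results = model.infer("path/to/image.jpg")
print(results)
```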
dstack
dstack is an open-source control plane for running development, training, and inference jobs on GPUs—across hyperscalers, neoclouds, or on-prem.
Project mention: Orchestrating GPUs in data centers and private clouds | news.ycombinator.com | 2025-02-18
Super excited to hear any feedback.
[1] https://github.com/dstackai/dstack/issues/2184
transformer-deploy
Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀
- Project mention: Show HN: OSS app to find LLMs across multiple LLM providers (Azure, AWS, etc.) | news.ycombinator.com | 2025-09-08
Python Inference related posts
- GLM-4.7: Advancing the Coding Capability
- Qwen3-Omni-Flash-2025-12-01: a next-generation native multimodal large model
- Show HN: ElasticMM – 4.2× Faster Multimodal LLM Serving (NeurIPS 2025 Oral)
- Show HN: OSS app to find LLMs across multiple LLM providers (Azure, AWS, etc.)
- Show HN: Any-LLM chat demo – switch between ChatGPT, Claude, Ollama, in one chat
- Kitten TTS: 25MB CPU-Only, Open-Source Voice Model
- How Distillation Makes AI Models Smaller and Cheaper
Index
What are some of the best open-source Inference projects in Python? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | vllm | 65,886 |
| 2 | ColossalAI | 41,299 |
| 3 | DeepSpeed | 41,052 |
| 4 | sglang | 21,914 |
| 5 | faster-whisper | 19,503 |
| 6 | ml-engineering | 16,071 |
| 7 | text-generation-inference | 10,710 |
| 8 | server | 10,131 |
| 9 | inference | 8,864 |
| 10 | adversarial-robustness-toolbox | 5,734 |
| 11 | superduper | 5,235 |
| 12 | torch2trt | 4,828 |
| 13 | open_model_zoo | 4,330 |
| 14 | gpustack | 4,249 |
| 15 | FastDeploy | 3,598 |
| 16 | optimum | 3,217 |
| 17 | ao | 2,573 |
| 18 | inference | 2,117 |
| 19 | DeepSpeed-MII | 2,083 |
| 20 | dstack | 1,982 |
| 21 | transformer-deploy | 1,689 |
| 22 | any-llm | 1,505 |
| 23 | budgetml | 1,345 |