Top 23 Python Inference Projects
vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
- Project mention: Getting Started with Mooncake: Installation, Execution & Troubleshooting | dev.to | 2025-12-11
```bash
git clone -b v0.8.5 https://github.com/vllm-project/vllm.git --recursive
cd vllm
python use_existing_torch.py
```
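Once it's built, a minimal offline-inference sketch with vLLM's Python API looks roughly like this; the OPT-125M checkpoint is just a small example model, not anything the article prescribes:

```python
# Minimal vLLM offline-inference sketch (the model choice is illustrative).
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")            # any vLLM-supported HF model
params = SamplingParams(temperature=0.8, max_tokens=64)

# generate() batches prompts and returns one RequestOutput per prompt.
for out in llm.generate(["The capital of France is"], params):
    print(out.outputs[0].text)
```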
DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
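On the inference side, a rough sketch of wrapping an existing Hugging Face model with DeepSpeed's optimized kernels via `init_inference`; the GPT-2 checkpoint and fp16 setting are illustrative assumptions, and a CUDA GPU is assumed:

```python
# Sketch: DeepSpeed inference engine around a Hugging Face model.
# Assumes a CUDA GPU; checkpoint and dtype are arbitrary examples.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# init_inference swaps in DeepSpeed's optimized transformer kernels.
engine = deepspeed.init_inference(model, dtype=torch.float16,
                                  replace_with_kernel_inject=True)

inputs = tokenizer("DeepSpeed makes inference", return_tensors="pt").to("cuda")
out = engine.module.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0]))
```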
sglang
SGLang is a fast serving framework for large language models and vision language models.
No, it's not Harmony; Z.ai has their own format, which they modified slightly for this release (by removing the required newlines from their previous format). You can see their tool-call parsing code here: https://github.com/sgl-project/sglang/blob/34013d9d5a591e3c0...
ml-engineering
Machine Learning Engineering Open Book
For kernel-level performance tuning, you can use the occupancy calculator, as pointed out by jplusqualt, or you can profile your kernel with Nsight Compute, which will give you a ton of info.
But for model-wide performance, you basically have to come up with your own calculation to estimate the FLOPs required by your model, and from that figure out how well your model is maxing out the GPU's capabilities (MFU/HFU); a back-of-the-envelope sketch follows below.
Here is a more in-depth example of how you might do this: https://github.com/stas00/ml-engineering/tree/master/trainin...
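A back-of-the-envelope version of that calculation, using the common ~6 × params × tokens approximation for transformer training FLOPs per step; every number below is an illustrative assumption, not a measurement from the linked repo:

```python
# Rough MFU estimate for a decoder-only transformer.
# All numbers are illustrative assumptions; substitute your own.
n_params = 7e9                  # model parameters
tokens_per_step = 8 * 2048      # global batch size * sequence length
step_time_s = 1.0               # measured wall-clock seconds per step
peak_flops_per_gpu = 312e12     # e.g. A100 peak BF16 throughput
n_gpus = 8

# ~6 FLOPs per parameter per token covers forward + backward passes.
achieved_flops = 6 * n_params * tokens_per_step / step_time_s
mfu = achieved_flops / (peak_flops_per_gpu * n_gpus)
print(f"Estimated MFU: {mfu:.1%}")   # ~27.6% with the numbers above
```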
text-generation-inference
Large Language Model Text Generation Inference
Resource: TGI (Text Generation Inference)
server
The Triton Inference Server provides an optimized cloud and edge inferencing solution. (by triton-inference-server)
Project mention: Gluon: a GPU programming language based on the same compiler stack as Triton | news.ycombinator.com | 2025-09-17
Also it REALLY jams me up that this is a thing, complicating discussions: https://github.com/triton-inference-server/server
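For reference, a hedged sketch of querying a running Triton server with the `tritonclient` HTTP API; the model and tensor names ("my_model", "INPUT0", "OUTPUT0") are assumptions that must match your model's config.pbtxt:

```python
# Sketch: calling a Triton Inference Server model over HTTP.
# Model/tensor names are hypothetical placeholders.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

data = np.random.rand(1, 3).astype(np.float32)
inp = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
inp.set_data_from_numpy(data)

result = client.infer(model_name="my_model", inputs=[inp])
print(result.as_numpy("OUTPUT0"))
```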
inference
Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source, speech, and multimodal models on cloud, on-prem, or your laptop — all through one unified, production-ready inference API.
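Since Xinference exposes an OpenAI-compatible endpoint, the "single line" swap is typically just the base URL. A sketch assuming the default local port (9997) and a hypothetical model name; adjust both to whatever you started with `xinference launch`:

```python
# Sketch: pointing the stock OpenAI client at a local Xinference server.
# Port and model name are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="qwen2.5-instruct",  # hypothetical: whichever model you launched
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```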
adversarial-robustness-toolbox
Adversarial Robustness Toolbox (ART) - Python Library for Machine Learning Security - Evasion, Poisoning, Extraction, Inference - Red and Blue Teams
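As a quick taste of the evasion side, a sketch running an FGSM attack against a scikit-learn classifier; the toy dataset and epsilon are illustrative:

```python
# Sketch: FGSM evasion attack with ART on a scikit-learn model.
# Toy data and eps value are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from art.estimators.classification import SklearnClassifier
from art.attacks.evasion import FastGradientMethod

x = np.random.rand(100, 4).astype(np.float32)
y = (x.sum(axis=1) > 2).astype(int)

clf = LogisticRegression().fit(x, y)
art_clf = SklearnClassifier(model=clf)   # ART wrapper exposing gradients

attack = FastGradientMethod(estimator=art_clf, eps=0.2)
x_adv = attack.generate(x=x)
print("accuracy on adversarial inputs:", clf.score(x_adv, y))
```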
gpustack
GPUStack doesn't seem to have the lowest-common-denominator problem, and it supports many architectures.
https://github.com/gpustack/gpustack
FastDeploy
High-performance Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle
optimum
🚀 Accelerate inference and training of 🤗 Transformers, Diffusers, TIMM and Sentence Transformers with easy to use hardware optimization tools
- Project mention: Gemma 3 270M re-implemented in pure PyTorch for local tinkering | news.ycombinator.com | 2025-08-20
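A hedged sketch of Optimum's ONNX Runtime path, exporting a Transformers checkpoint (chosen arbitrarily here) and running it through the same tokenizer workflow:

```python
# Sketch: exporting a Transformers model to ONNX Runtime via Optimum.
# The checkpoint is an arbitrary example.
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Optimum makes ONNX export painless", return_tensors="pt")
print(model(**inputs).logits)
```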
inference
Turn any computer or edge device into a command center for your computer vision projects. (by roboflow)
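A minimal sketch with the SDK's `get_model` entry point; the model alias and image path are assumptions, and some models require a Roboflow API key:

```python
# Sketch: running a vision model with Roboflow's inference SDK.
# Model alias and image path are placeholders; an API key may be needed.
from inference import get_model

model = get_model(model_id="yolov8n-640")
results = model.infer("path/to/image.jpg")
print(results)
```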
dstack
dstack is an open-source control plane for running development, training, and inference jobs on GPUs—across hyperscalers, neoclouds, or on-prem.
Project mention: Orchestrating GPUs in data centers and private clouds | news.ycombinator.com | 2025-02-18
Super excited to hear any feedback.
[1] https://github.com/dstackai/dstack/issues/2184
transformer-deploy
Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀
- Project mention: Show HN: OSS app to find LLMs across multiple LLM providers (Azure, AWS, etc.) | news.ycombinator.com | 2025-09-08
Python Inference related posts
- GLM-4.7: Advancing the Coding Capability
- Qwen3-Omni-Flash-2025-12-01: a next-generation native multimodal large model
- Show HN: ElasticMM – 4.2× Faster Multimodal LLM Serving (NeurIPS 2025 Oral)
- Show HN: OSS app to find LLMs across multiple LLM providers (Azure, AWS, etc.)
- Show HN: Any-LLM chat demo – switch between ChatGPT, Claude, Ollama, in one chat
- Kitten TTS: 25MB CPU-Only, Open-Source Voice Model
- How Distillation Makes AI Models Smaller and Cheaper
Index
What are some of the best open-source Inference projects in Python? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | vllm | 65,886 |
| 2 | ColossalAI | 41,299 |
| 3 | DeepSpeed | 41,052 |
| 4 | sglang | 21,914 |
| 5 | faster-whisper | 19,503 |
| 6 | ml-engineering | 16,071 |
| 7 | text-generation-inference | 10,710 |
| 8 | server | 10,131 |
| 9 | inference | 8,864 |
| 10 | adversarial-robustness-toolbox | 5,734 |
| 11 | superduper | 5,235 |
| 12 | torch2trt | 4,828 |
| 13 | open_model_zoo | 4,330 |
| 14 | gpustack | 4,249 |
| 15 | FastDeploy | 3,598 |
| 16 | optimum | 3,217 |
| 17 | ao | 2,573 |
| 18 | inference | 2,117 |
| 19 | DeepSpeed-MII | 2,083 |
| 20 | dstack | 1,982 |
| 21 | transformer-deploy | 1,689 |
| 22 | any-llm | 1,505 |
| 23 | budgetml | 1,345 |