Python Inference

Open-source Python projects categorized as Inference

Top 23 Python Inference Projects

  1. vllm

    A high-throughput and memory-efficient inference and serving engine for LLMs

    Project mention: Getting Started with Mooncake: Installation, Execution & Troubleshooting | dev.to | 2025-12-11

    git clone -b v0.8.5 https://github.com/vllm-project/vllm.git --recursive
    cd vllm
    python use_existing_torch.py
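The commands above build vLLM from source; with a regular `pip install vllm`, the same engine can be driven from Python through its offline-inference API. A minimal sketch (the model name and sampling settings are illustrative, and the import is deferred so the snippet loads even on a machine without vllm or a GPU):

```python
def run_vllm_demo() -> None:
    # Requires `pip install vllm` and a CUDA-capable GPU.
    from vllm import LLM, SamplingParams

    llm = LLM(model="facebook/opt-125m")  # small model, purely for illustration
    params = SamplingParams(temperature=0.8, max_tokens=64)
    for out in llm.generate(["The capital of France is"], params):
        print(out.outputs[0].text)

# Call run_vllm_demo() on a machine with a supported GPU.
```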

  3. ColossalAI

    Making large AI models cheaper, faster and more accessible

  4. DeepSpeed

    DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

    Project mention: All Data and AI Weekly #193 - June 9, 2025 | dev.to | 2025-06-09
  5. sglang

    SGLang is a fast serving framework for large language models and vision language models.

    Project mention: GLM-4.7: Advancing the Coding Capability | news.ycombinator.com | 2025-12-22

    No, it's not Harmony; Z.ai has their own format, which they modified slightly for this release (by removing the required newlines from their previous format). You can see their tool call parsing code here: https://github.com/sgl-project/sglang/blob/34013d9d5a591e3c0...

  6. faster-whisper

    Faster Whisper transcription with CTranslate2
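As a hedged sketch of typical faster-whisper usage (the model size, compute type, and file path are placeholder choices; the library import is deferred so the snippet loads without the package installed), together with a small pure-Python helper for SRT-style timestamps:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT-style HH:MM:SS,mmm timestamp."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def transcribe(path: str, model_size: str = "small") -> None:
    # Requires `pip install faster-whisper`.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, compute_type="int8")  # int8 is CPU-friendly
    segments, info = model.transcribe(path)
    for seg in segments:
        print(f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}  {seg.text}")
```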

  7. ml-engineering

    Machine Learning Engineering Open Book

    Project mention: Real-time Nvidia GPU dashboard | news.ycombinator.com | 2025-10-06

    For kernel-level performance tuning you can use the occupancy calculator as pointed out by jplusqualt or you can profile your kernel with Nsight compute which will give you a ton of info.

    But for model-wide performance, you basically have to come up with your own calculation to estimate the FLOPs required by your model and based on that figure out how well your model is maxing out the GPU capabilities (MFU/HFU).

    Here is a more in-depth example on how you might do this: https://github.com/stas00/ml-engineering/tree/master/trainin...
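The back-of-the-envelope calculation described above can be sketched in a few lines. Using the common ~6·N·T approximation for the training FLOPs of a dense N-parameter transformer over T tokens (2·N·T for the forward pass, roughly 3x that including the backward pass), MFU is achieved FLOP/s divided by the accelerator's peak FLOP/s. All numbers below are illustrative:

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training FLOPs for a dense transformer: ~6 * params * tokens."""
    return 6.0 * n_params * n_tokens

def mfu(n_params: float, tokens_per_sec: float, peak_flops: float) -> float:
    """Model FLOPs Utilization: achieved FLOP/s over the hardware peak FLOP/s."""
    achieved = train_flops(n_params, tokens_per_sec)  # FLOPs per second
    return achieved / peak_flops

# Illustrative numbers: a 7B-parameter model training at 3,000 tokens/s
# on a GPU with a nominal 312 TFLOP/s BF16 peak (A100-class).
print(f"MFU: {mfu(7e9, 3_000, 312e12):.1%}")  # → MFU: 40.4%
```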

  8. text-generation-inference

    Large Language Model Text Generation Inference

    Project mention: Complete Large Language Model (LLM) Learning Roadmap | dev.to | 2025-04-11

    Resource: TGI (Text Generation Inference)

  10. server

    The Triton Inference Server provides an optimized cloud and edge inferencing solution. (by triton-inference-server)

    Project mention: Gluon: a GPU programming language based on the same compiler stack as Triton | news.ycombinator.com | 2025-09-17

    Also it REALLY jams me up that this is a thing, complicating discussions: https://github.com/triton-inference-server/server

  11. inference

    Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source, speech, and multimodal models on cloud, on-prem, or your laptop — all through one unified, production-ready inference API.
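The "single line" in question is typically the client's base URL: Xinference exposes an OpenAI-compatible endpoint, so existing OpenAI-client code can be pointed at a local deployment. A hedged sketch (the endpoint URL, default port, and model name are assumptions about your deployment; the import is deferred so the snippet loads without the `openai` package):

```python
def chat_via_xinference(prompt: str, base_url: str = "http://localhost:9997/v1") -> str:
    # Requires `pip install openai` and a running Xinference server.
    from openai import OpenAI

    client = OpenAI(base_url=base_url, api_key="not-needed-locally")
    resp = client.chat.completions.create(
        model="qwen2.5-instruct",  # whichever model you launched in Xinference
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```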

  12. adversarial-robustness-toolbox

    Adversarial Robustness Toolbox (ART) - Python Library for Machine Learning Security - Evasion, Poisoning, Extraction, Inference - Red and Blue Teams

  13. superduper

    Superduper: End-to-end framework for building custom AI applications and agents.

  14. torch2trt

    An easy to use PyTorch to TensorRT converter

  15. open_model_zoo

    Pre-trained Deep Learning models and demos (high quality and extremely fast)

  16. gpustack

    GPU cluster manager for optimized AI model deployment

    Project mention: Ollama has a native front end chatbot now | news.ycombinator.com | 2025-07-30

    GPUStack doesn't seem to have the problem of lowest common denominator but supports many architectures.

    https://github.com/gpustack/gpustack

  17. FastDeploy

    High-performance Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle

  18. optimum

    🚀 Accelerate inference and training of 🤗 Transformers, Diffusers, TIMM and Sentence Transformers with easy to use hardware optimization tools

  19. ao

    PyTorch native quantization and sparsity for training and inference

    Project mention: Gemma 3 270M re-implemented in pure PyTorch for local tinkering | news.ycombinator.com | 2025-08-20
  20. inference

    Turn any computer or edge device into a command center for your computer vision projects. (by roboflow)

  21. DeepSpeed-MII

    MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.

  22. dstack

    dstack is an open-source control plane for running development, training, and inference jobs on GPUs—across hyperscalers, neoclouds, or on-prem.

    Project mention: Orchestrating GPUs in data centers and private clouds | news.ycombinator.com | 2025-02-18

    Super excited to hear any feedback.

    [1] https://github.com/dstackai/dstack/issues/2184

  23. transformer-deploy

    Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀

  24. any-llm

    Communicate with an LLM provider using a single interface

    Project mention: Show HN: OSS app to find LLMs across multiple LLM providers (Azure, AWS, etc.) | news.ycombinator.com | 2025-09-08
  25. budgetml

    Deploy an ML inference service on a budget in less than 10 lines of code.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python Inference discussion

Python Inference related posts

  • GLM-4.7: Advancing the Coding Capability

    2 projects | news.ycombinator.com | 22 Dec 2025
  • Qwen3-Omni-Flash-2025-12-01: a next-generation native multimodal large model

    3 projects | news.ycombinator.com | 10 Dec 2025
  • Show HN: ElasticMM – 4.2× Faster Multimodal LLM Serving (NeurIPS 2025 Oral)

    1 project | news.ycombinator.com | 14 Dec 2025
  • Show HN: OSS app to find LLMs across multiple LLM providers (Azure, AWS, etc.)

    1 project | news.ycombinator.com | 8 Sep 2025
  • Show HN: Any-LLM chat demo – switch between ChatGPT, Claude, Ollama, in one chat

    1 project | news.ycombinator.com | 22 Aug 2025
  • Kitten TTS: 25MB CPU-Only, Open-Source Voice Model

    19 projects | news.ycombinator.com | 5 Aug 2025
  • How Distillation Makes AI Models Smaller and Cheaper

    2 projects | news.ycombinator.com | 24 Jul 2025

Index

What are some of the best open-source Inference projects in Python? This list will help you:

# Project Stars
1 vllm 65,886
2 ColossalAI 41,299
3 DeepSpeed 41,052
4 sglang 21,914
5 faster-whisper 19,503
6 ml-engineering 16,071
7 text-generation-inference 10,710
8 server 10,131
9 inference 8,864
10 adversarial-robustness-toolbox 5,734
11 superduper 5,235
12 torch2trt 4,828
13 open_model_zoo 4,330
14 gpustack 4,249
15 FastDeploy 3,598
16 optimum 3,217
17 ao 2,573
18 inference 2,117
19 DeepSpeed-MII 2,083
20 dstack 1,982
21 transformer-deploy 1,689
22 any-llm 1,505
23 budgetml 1,345

