Top 23 Python Evaluation Projects
ragas
- Project mention: Evaluate and Improve Your Agents: Automated Evaluation with RAGAS for Production Agents | dev.to | 2025-10-15
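For a sense of the API, here is a minimal sketch of the classic ragas `evaluate` call; exact names vary between ragas versions, and it assumes `OPENAI_API_KEY` is set for the LLM judge:

```python
from datasets import Dataset  # pip install ragas datasets
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# One RAG interaction: question, retrieved contexts, and generated answer.
data = {
    "question": ["When was the first Super Bowl played?"],
    "contexts": [[
        "The first AFL-NFL World Championship Game was played on January 15, 1967."
    ]],
    "answer": ["The first Super Bowl was played on January 15, 1967."],
}

# ragas evaluates a HuggingFace Dataset and returns per-metric scores.
result = evaluate(Dataset.from_dict(data), metrics=[faithfulness, answer_relevancy])
print(result)  # e.g. {'faithfulness': 1.0, 'answer_relevancy': 0.98}
```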
opencompass
OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama 3, Mistral, InternLM2, GPT-4, LLaMA 2, Qwen, GLM, Claude, etc.) over 100+ datasets.
- Project mention: DeepFabric – Generate High-Quality Synthetic Datasets at Scale | news.ycombinator.com | 2025-09-26
AutoRAG
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
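A minimal sketch of AutoRAG's trial-based workflow as documented in its quick-start, assuming you already have QA and corpus parquet files plus a YAML config describing the pipeline variants to compare (the file paths here are hypothetical):

```python
from autorag.evaluator import Evaluator  # pip install AutoRAG

# AutoRAG runs an AutoML-style "trial": it builds each RAG pipeline variant
# described in the YAML config, evaluates every one against the QA dataset,
# and reports which combination of modules performs best.
evaluator = Evaluator(
    qa_data_path="data/qa.parquet",          # hypothetical paths
    corpus_data_path="data/corpus.parquet",
)
evaluator.start_trial("config/rag_config.yaml")
```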
uptrain
UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. It provides grades for 20+ preconfigured checks (covering language, code, and embedding use cases), performs root-cause analysis on failure cases, and gives insights on how to resolve them.
- Project mention: Launch HN: Lucidic (YC W25) – Debug, test, and evaluate AI agents in production | news.ycombinator.com | 2025-07-30
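A minimal sketch of UpTrain's documented `EvalLLM` entry point; the data fields and the OpenAI key are assumptions:

```python
from uptrain import EvalLLM, Evals  # pip install uptrain

data = [{
    "question": "Which photosynthesis pathway do cacti use?",
    "context": "Cacti use crassulacean acid metabolism (CAM) photosynthesis.",
    "response": "Cacti use CAM photosynthesis.",
}]

eval_llm = EvalLLM(openai_api_key="sk-...")  # credentials for the judge model

# Each check returns a graded score plus an explanation of any failure.
results = eval_llm.evaluate(
    data=data,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.FACTUAL_ACCURACY, Evals.RESPONSE_COMPLETENESS],
)
print(results)
```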
ragbits
- Project mention: Just dropped ragbits v1.0 and create-ragbits-app – spin up a RAG app in minutes | news.ycombinator.com | 2025-06-04
Whether you're prototyping or scaling, this stack is built to grow with you — with real tooling, not just examples.
Source code: https://github.com/deepsense-ai/ragbits
Would love to hear your feedback or ideas — and if you’re building RAG apps, give create-ragbits-app a shot and tell us how it goes.
intellagent
A framework for comprehensive diagnosis and optimization of agents using simulated, realistic synthetic interactions
- Project mention: Making Sure AI Agents Play Nice: A Look at How We Evaluate Them | dev.to | 2025-05-01
When it comes to evaluating conversational agents, there are some smart ways to do it. Take a framework like IntellAgent. It uses AI to test other AI! It's a three-step process designed to make testing more thorough and realistic than just having a person manually try things out.
semantic-kitti-api
SemanticKITTI API for visualizing dataset, processing data, and evaluating results.
deepfabric
Generate high-quality synthetic datasets at scale.
- Project mention: Fine-Tuning with GRPO Datasets: A Developer's Guide to DeepFabric's GRPO Formatter | dev.to | 2025-10-21
long-form-factuality
Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models".
TrustLLM
errant
ERRor ANnotation Toolkit: Automatically extract and classify grammatical errors in parallel original and corrected sentences.
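ERRANT's Python API mirrors that description: parse the original and corrected sentences, then extract typed edits. A minimal sketch following the project's README (requires a spaCy English model to be installed):

```python
import errant  # pip install errant

annotator = errant.load("en")
orig = annotator.parse("This are gramamtical sentence .")
cor = annotator.parse("This is a grammatical sentence .")

# Align the two sentences and classify each edit with an error type,
# e.g. R:VERB:SVA (subject-verb agreement) or R:SPELL (spelling).
for e in annotator.annotate(orig, cor):
    print(e.o_start, e.o_end, e.o_str, "->", e.c_str, e.type)
```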
FActScore
A package to evaluate factuality of long-form generation. Original implementation of our EMNLP 2023 paper "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation"
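A minimal sketch of the documented `FactScorer` interface, which decomposes each generation into atomic facts and checks them against retrieved Wikipedia passages; the key file path and the inputs here are assumptions:

```python
from factscore.factscorer import FactScorer  # pip install factscore

fs = FactScorer(openai_key="api.key")  # path to a file holding the OpenAI key

# Each topic names the Wikipedia entity its long-form generation is about.
topics = ["Marie Curie"]
generations = ["Marie Curie was a physicist and chemist who ..."]

out = fs.get_score(topics, generations)
print(out["score"])  # fraction of atomic facts supported by the evidence
```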
Python Evaluation related posts
- Gemini 3 Flash: frontier intelligence built for speed
- Show HN: EvalView pytest style tests for AI agents (budgets, hallucinations)
- Benchmark that evaluates LLMs using 759 NYT Connections puzzles
- NYT Connections LLM Benchmark
- 7 Ways to Create High-Quality Evaluation Datasets for LLMs
- Show HN: Hegelion – Force your LLM to argue with itself before answering
- Gemini 3
Index
What are some of the best open-source Evaluation projects in Python? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | ragas | 11,831 |
| 2 | opencompass | 6,461 |
| 3 | Kiln | 4,485 |
| 4 | AutoRAG | 4,485 |
| 5 | promptbench | 2,767 |
| 6 | evaluate | 2,383 |
| 7 | uptrain | 2,326 |
| 8 | lighteval | 2,206 |
| 9 | avalanche | 1,998 |
| 10 | EvalAI | 1,967 |
| 11 | ragbits | 1,599 |
| 12 | pycm | 1,490 |
| 13 | torch-fidelity | 1,157 |
| 14 | intellagent | 1,158 |
| 15 | semantic-kitti-api | 875 |
| 16 | deepfabric | 675 |
| 17 | long-form-factuality | 660 |
| 18 | ranx | 628 |
| 19 | TrustLLM | 619 |
| 20 | simpleeval | 557 |
| 21 | reclist | 467 |
| 22 | errant | 456 |
| 23 | FActScore | 411 |