Top 23 Python Evaluation Projects
ragas
- Project mention: Evaluate and Improve Your Agents: Automated Evaluation with RAGAS for Production Agents | dev.to | 2025-10-15
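For a sense of the API, here is a minimal sketch of the classic ragas `evaluate` call; exact names vary between ragas versions, and it assumes `OPENAI_API_KEY` is set for the LLM judge:

```python
from datasets import Dataset  # pip install ragas datasets
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# One RAG interaction: question, retrieved contexts, and generated answer.
data = {
    "question": ["When was the first Super Bowl played?"],
    "contexts": [[
        "The first AFL-NFL World Championship Game was played on January 15, 1967."
    ]],
    "answer": ["The first Super Bowl was played on January 15, 1967."],
}

# ragas evaluates a HuggingFace Dataset and returns per-metric scores.
result = evaluate(Dataset.from_dict(data), metrics=[faithfulness, answer_relevancy])
print(result)  # e.g. {'faithfulness': 1.0, 'answer_relevancy': 0.98}
```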
opencompass
OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama 3, Mistral, InternLM2, GPT-4, LLaMA 2, Qwen, GLM, Claude, etc.) over 100+ datasets.
- Project mention: DeepFabric – Generate High-Quality Synthetic Datasets at Scale | news.ycombinator.com | 2025-09-26
AutoRAG
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
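A minimal sketch of AutoRAG's trial-based workflow as documented in its quick-start, assuming you already have QA and corpus parquet files plus a YAML config describing the pipeline variants to compare (the file paths here are hypothetical):

```python
from autorag.evaluator import Evaluator  # pip install AutoRAG

# AutoRAG runs an AutoML-style "trial": it builds each RAG pipeline variant
# described in the YAML config, evaluates every one against the QA dataset,
# and reports which combination of modules performs best.
evaluator = Evaluator(
    qa_data_path="data/qa.parquet",          # hypothetical paths
    corpus_data_path="data/corpus.parquet",
)
evaluator.start_trial("config/rag_config.yaml")
```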
uptrain
UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. It provides grades for 20+ preconfigured checks (covering language, code, and embedding use cases), performs root-cause analysis on failure cases, and gives insights on how to resolve them.
- Project mention: Launch HN: Lucidic (YC W25) – Debug, test, and evaluate AI agents in production | news.ycombinator.com | 2025-07-30
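A minimal sketch of UpTrain's documented `EvalLLM` entry point; the data fields and the OpenAI key are assumptions:

```python
from uptrain import EvalLLM, Evals  # pip install uptrain

data = [{
    "question": "Which photosynthesis pathway do cacti use?",
    "context": "Cacti use crassulacean acid metabolism (CAM) photosynthesis.",
    "response": "Cacti use CAM photosynthesis.",
}]

eval_llm = EvalLLM(openai_api_key="sk-...")  # credentials for the judge model

# Each check returns a graded score plus an explanation of any failure.
results = eval_llm.evaluate(
    data=data,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.FACTUAL_ACCURACY, Evals.RESPONSE_COMPLETENESS],
)
print(results)
```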
ragbits
- Project mention: Just dropped ragbits v1.0 and create-ragbits-app – spin up a RAG app in minutes | news.ycombinator.com | 2025-06-04
Whether you're prototyping or scaling, this stack is built to grow with you — with real tooling, not just examples.
Source code: https://github.com/deepsense-ai/ragbits
Would love to hear your feedback or ideas — and if you’re building RAG apps, give create-ragbits-app a shot and tell us how it goes.
intellagent
A framework for comprehensive diagnosis and optimization of agents using simulated, realistic synthetic interactions
- Project mention: Making Sure AI Agents Play Nice: A Look at How We Evaluate Them | dev.to | 2025-05-01
When it comes to evaluating conversational agents, there are some smart ways to do it. Take a framework like IntellAgent. It uses AI to test other AI! It's a three-step process designed to make testing more thorough and realistic than just having a person manually try things out.
semantic-kitti-api
SemanticKITTI API for visualizing dataset, processing data, and evaluating results.
deepfabric
Generate high-quality synthetic datasets at scale.
- Project mention: Fine-Tuning with GRPO Datasets: A Developer's Guide to DeepFabric's GRPO Formatter | dev.to | 2025-10-21
long-form-factuality
Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models".
TrustLLM
errant
ERRor ANnotation Toolkit: Automatically extract and classify grammatical errors in parallel original and corrected sentences.
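ERRANT's Python API mirrors that description: parse the original and corrected sentences, then extract typed edits. A minimal sketch following the project's README (requires a spaCy English model to be installed):

```python
import errant  # pip install errant

annotator = errant.load("en")
orig = annotator.parse("This are gramamtical sentence .")
cor = annotator.parse("This is a grammatical sentence .")

# Align the two sentences and classify each edit with an error type,
# e.g. R:VERB:SVA (subject-verb agreement) or R:SPELL (spelling).
for e in annotator.annotate(orig, cor):
    print(e.o_start, e.o_end, e.o_str, "->", e.c_str, e.type)
```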
FActScore
A package to evaluate factuality of long-form generation. Original implementation of our EMNLP 2023 paper "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation"
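A minimal sketch of the documented `FactScorer` interface, which decomposes each generation into atomic facts and checks them against retrieved Wikipedia passages; the key file path and the inputs here are assumptions:

```python
from factscore.factscorer import FactScorer  # pip install factscore

fs = FactScorer(openai_key="api.key")  # path to a file holding the OpenAI key

# Each topic names the Wikipedia entity its long-form generation is about.
topics = ["Marie Curie"]
generations = ["Marie Curie was a physicist and chemist who ..."]

out = fs.get_score(topics, generations)
print(out["score"])  # fraction of atomic facts supported by the evidence
```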
Python Evaluation related posts
- Gemini 3 Flash: frontier intelligence built for speed
- Show HN: EvalView pytest style tests for AI agents (budgets, hallucinations)
- Benchmark that evaluates LLMs using 759 NYT Connections puzzles
- NYT Connections LLM Benchmark
- 7 Ways to Create High-Quality Evaluation Datasets for LLMs
- Show HN: Hegelion – Force your LLM to argue with itself before answering
- Gemini 3
Index
What are some of the best open-source Evaluation projects in Python? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | ragas | 11,831 |
| 2 | opencompass | 6,461 |
| 3 | Kiln | 4,485 |
| 4 | AutoRAG | 4,485 |
| 5 | promptbench | 2,767 |
| 6 | evaluate | 2,383 |
| 7 | uptrain | 2,326 |
| 8 | lighteval | 2,206 |
| 9 | avalanche | 1,998 |
| 10 | EvalAI | 1,967 |
| 11 | ragbits | 1,599 |
| 12 | pycm | 1,490 |
| 13 | torch-fidelity | 1,157 |
| 14 | intellagent | 1,158 |
| 15 | semantic-kitti-api | 875 |
| 16 | deepfabric | 675 |
| 17 | long-form-factuality | 660 |
| 18 | ranx | 628 |
| 19 | TrustLLM | 619 |
| 20 | simpleeval | 557 |
| 21 | reclist | 467 |
| 22 | errant | 456 |
| 23 | FActScore | 411 |