Top 15 Python evaluation-framework Projects
-
deepeval
Project mention: Has anyone ever used the Python framework "Deepeval"? | news.ycombinator.com | 2025-06-23
-
lm-evaluation-harness
I am a former SWE intern at AWS India (was extended a return offer) on the Q for Business team, where I worked on integrating bidirectional streaming with TTS models to bring low-latency voice mode to our application. I am also an active contributor to Pickle's Glass [ https://github.com/pickle-com/glass ] and HyprNote [ https://www.linkedin.com/company/hyprnote/ ].
My previous experience includes:
Experience: WorldQuant (quant research), AstraZeneca (ML engineer for drug discovery using LLMs), TransHumanity (AI engineering for traffic), Alma [ https://www.linkedin.com/company/tryalma/ ] (AI engineering to simplify the visa process for immigration), and the University of Missouri (biomedical GraphRAG research).
Competitions: Top 0.016% in the Amazon ML Challenge 2024 (12th of 75,000 teams), 11th of 50,000 teams in Amazon HackOn 2024, and UC Berkeley AgentX 2025.
Open Source: EleutherAI's LM Evaluation Harness [ https://github.com/EleutherAI/lm-evaluation-harness ], Hugging Face's nanoVLM [ https://github.com/huggingface/nanoVLM ] (integrating metrics and GRPO training techniques), Pickle's Glass (added search functionality, adding knowledge base support for enterprises), and HyprNote [ https://github.com/fastrepl/hyprnote ] (removing silent responses and quantizing models for effective summarization).
I've attached all my profile links.
- Project mention: DeepFabric – Generate High-Quality Synthetic Datasets at Scale | news.ycombinator.com | 2025-09-26
-
AgentLab
AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.
-
tonic_validate
Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.
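To make "metrics for RAG responses" concrete, here is a toy, purely lexical sketch of one such metric: how much of the generated answer is grounded in the retrieved context. This is not tonic_validate's API, and real tools typically use LLM-judged or embedding-based scoring; the snippet only illustrates the shape of a per-response metric.

```python
# Illustrative sketch of a per-response RAG metric (NOT tonic_validate's API):
# a crude "answer is grounded in the retrieved context" score via token overlap.
import re

def token_set(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def context_overlap(answer: str, retrieved_contexts: list[str]) -> float:
    """Fraction of answer tokens that also appear in the retrieved context."""
    answer_tokens = token_set(answer)
    if not answer_tokens:
        return 0.0
    context_tokens = set().union(*(token_set(c) for c in retrieved_contexts))
    return len(answer_tokens & context_tokens) / len(answer_tokens)

print(context_overlap(
    "Paris is the capital of France.",
    ["France's capital city is Paris.", "The Eiffel Tower is in Paris."],
))  # high overlap suggests the answer stays close to the retrieved context
```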
-
eureka-ml-insights
A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.
-
MATEval
MATEval is the first multi-agent framework simulating human collaborative discussion for open-ended text evaluation.
Project mention: Making Sure AI Agents Play Nice: A Look at How We Evaluate Them | dev.to | 2025-05-01
MATEval is a cool framework specifically for evaluating data using multiple agents. It has a team of language model agents discuss and debate the accuracy and clarity of complex datasets, using techniques like structured discussions, agents reflecting on their own thinking, breaking down problems step by step (chain-of-thought), and refining their assessments based on feedback. It shows how a team of agents can evaluate things more thoroughly than a single agent.
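The discussion pattern is easy to prototype. Below is a minimal, illustrative sketch of discussion-style multi-agent evaluation in plain Python; it is not MATEval's actual API, and `ask_llm` is a hypothetical stand-in for whatever chat-completion client you use.

```python
# Illustrative sketch of discussion-based multi-agent evaluation (not MATEval's API).
from dataclasses import dataclass

def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in; wire this to your own chat-completion client."""
    raise NotImplementedError("plug in a real LLM call here")

@dataclass
class Agent:
    name: str
    persona: str  # e.g. "Focus on factual accuracy."

def evaluate_with_discussion(text: str, agents: list[Agent], rounds: int = 2) -> str:
    """Agents take turns critiquing `text`, each seeing the discussion so far
    (structured discussion + chain-of-thought); a final call consolidates the
    transcript into one verdict (reflection and refinement)."""
    transcript: list[str] = []
    for _ in range(rounds):
        for agent in agents:
            prompt = (
                f"You are {agent.name}. {agent.persona}\n"
                f"Text under evaluation:\n{text}\n\n"
                "Discussion so far:\n" + "\n".join(transcript) + "\n\n"
                "Think step by step, then give a critique and a 1-5 score."
            )
            transcript.append(f"{agent.name}: {ask_llm(prompt)}")
    return ask_llm(
        "Summarize the discussion into a final evaluation (score + rationale):\n"
        + "\n".join(transcript)
    )

# Example: a small panel with complementary personas.
panel = [Agent("Critic", "Focus on factual accuracy."),
         Agent("Editor", "Focus on clarity and completeness.")]
# report = evaluate_with_discussion("<text to evaluate>", panel)
```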
-
tvallogging
A tool for evaluating and tracking your RAG experiments. This repo contains the Python SDK for logging to Tonic Validate.
-
spark-llm-eval
Spark-native LLM evaluation framework with confidence intervals, significance testing, and Databricks integration.
The framework is open source and available on PyPI (pip install spark-llm-eval). If you're doing LLM evaluation at scale on Spark/Databricks, I'd love to hear what works and what doesn't. The space is evolving fast, and I don't pretend to have all the answers.
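For a sense of what "confidence intervals over evaluation scores on Spark" can look like, here is an illustrative sketch using plain PySpark and NumPy: it computes a percentile-bootstrap 95% CI over per-example scores, collecting the scores to the driver for simplicity. This is not spark-llm-eval's API; refer to the package's own documentation for its actual interface.

```python
# Illustrative only: bootstrap CI over per-example eval scores with plain PySpark.
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("llm-eval-ci").getOrCreate()

# Assume one row per evaluated example with a numeric score in [0, 1].
scores_df = spark.createDataFrame(
    [(i, s) for i, s in enumerate([0.9, 0.7, 1.0, 0.4, 0.8, 0.95, 0.6, 0.85])],
    ["example_id", "score"],
)

scores = np.array([row.score for row in scores_df.select("score").collect()])

# Percentile bootstrap: resample with replacement, recompute the mean each time.
rng = np.random.default_rng(0)
boot_means = [rng.choice(scores, size=scores.size, replace=True).mean()
              for _ in range(2000)]
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={scores.mean():.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```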
Python evaluation-framework related posts
- Why I Built a Spark-Native LLM Evaluation Framework
- Show HN: E2E Testing for Chatbots
- End-to-end testing of Gen AI Apps
- Tech Trend Blog list over 200 blogs
- Revival Hijacking: How Deleted PyPI Packages Become Threats
- Has anyone ever used the Python framework "Deepeval"?
- Launch HN: Confident AI (YC W25) – Open-source evaluation framework for LLM apps
Index
What are some of the best open-source evaluation-framework projects in Python? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | deepeval | 12,683 |
| 2 | lm-evaluation-harness | 11,007 |
| 3 | Kiln | 4,485 |
| 4 | lighteval | 2,206 |
| 5 | continuous-eval | 515 |
| 6 | AgentLab | 486 |
| 7 | tonic_validate | 320 |
| 8 | MixEval | 252 |
| 9 | PyDGN | 222 |
| 10 | eureka-ml-insights | 173 |
| 11 | vectory | 71 |
| 12 | MATEval | 28 |
| 13 | tvallogging | 8 |
| 14 | SigmaEval | 1 |
| 15 | spark-llm-eval | 2 |