Top 15 Python evaluation-framework Projects
-
deepeval
Project mention: Has anyone ever used the Python framework "Deepeval"? | news.ycombinator.com | 2025-06-23
-
lm-evaluation-harness
I am a former SWE intern at AWS India (was extended a return offer) on the Q for Business team, where I worked on integrating bidirectional streaming with TTS models to bring low-latency voice mode to our application. I am also an active contributor to Pickle's Glass [ https://github.com/pickle-com/glass ] and HyprNote [ https://www.linkedin.com/company/hyprnote/ ].
My previous experience includes:
Experience: WorldQuant (quant research), AstraZeneca (ML engineer for drug discovery using LLMs), TransHumanity (AI engineering for traffic), Alma [ https://www.linkedin.com/company/tryalma/ ] (AI engineering to simplify the visa process for immigration), and the University of Missouri (biomedical GraphRAG research).
Competitions: Top 0.016% in the Amazon ML Challenge 2024 (12th of 75,000 teams), 11th of 50,000 teams in Amazon HackOn 2024, and UC Berkeley AgentX 2025.
Open Source: EleutherAI's LM Evaluation Harness [ https://github.com/EleutherAI/lm-evaluation-harness ], Hugging Face's nanoVLM [ https://github.com/huggingface/nanoVLM ] (integrating metrics and GRPO training techniques), Pickle's Glass (added search functionality, adding knowledge base support for enterprises), and HyprNote [ https://github.com/fastrepl/hyprnote ] (removing silent responses and quantizing models for effective summarization).
I've attached all my profile links.
- Project mention: DeepFabric – Generate High-Quality Synthetic Datasets at Scale | news.ycombinator.com | 2025-09-26
-
AgentLab
AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.
-
tonic_validate
Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.
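To make "metrics for RAG responses" concrete, here is a toy, purely lexical sketch of one such metric: how much of the generated answer is grounded in the retrieved context. This is not tonic_validate's API, and real tools typically use LLM-judged or embedding-based scoring; the snippet only illustrates the shape of a per-response metric.

```python
# Illustrative sketch of a per-response RAG metric (NOT tonic_validate's API):
# a crude "answer is grounded in the retrieved context" score via token overlap.
import re

def token_set(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def context_overlap(answer: str, retrieved_contexts: list[str]) -> float:
    """Fraction of answer tokens that also appear in the retrieved context."""
    answer_tokens = token_set(answer)
    if not answer_tokens:
        return 0.0
    context_tokens = set().union(*(token_set(c) for c in retrieved_contexts))
    return len(answer_tokens & context_tokens) / len(answer_tokens)

print(context_overlap(
    "Paris is the capital of France.",
    ["France's capital city is Paris.", "The Eiffel Tower is in Paris."],
))  # high overlap suggests the answer stays close to the retrieved context
```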
-
eureka-ml-insights
A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.
-
MATEval
MATEval is the first multi-agent framework simulating human collaborative discussion for open-ended text evaluation.
Project mention: Making Sure AI Agents Play Nice: A Look at How We Evaluate Them | dev.to | 2025-05-01
MATEval is a cool framework specifically for evaluating data using multiple agents. It has a team of language model agents discuss and debate the accuracy and clarity of complex datasets, using techniques like structured discussions, agents reflecting on their own thinking, breaking down problems step by step (chain-of-thought), and refining their assessments based on feedback. It shows how a team of agents can evaluate things more thoroughly than a single agent.
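The discussion pattern is easy to prototype. Below is a minimal, illustrative sketch of discussion-style multi-agent evaluation in plain Python; it is not MATEval's actual API, and `ask_llm` is a hypothetical stand-in for whatever chat-completion client you use.

```python
# Illustrative sketch of discussion-based multi-agent evaluation (not MATEval's API).
from dataclasses import dataclass

def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in; wire this to your own chat-completion client."""
    raise NotImplementedError("plug in a real LLM call here")

@dataclass
class Agent:
    name: str
    persona: str  # e.g. "Focus on factual accuracy."

def evaluate_with_discussion(text: str, agents: list[Agent], rounds: int = 2) -> str:
    """Agents take turns critiquing `text`, each seeing the discussion so far
    (structured discussion + chain-of-thought); a final call consolidates the
    transcript into one verdict (reflection and refinement)."""
    transcript: list[str] = []
    for _ in range(rounds):
        for agent in agents:
            prompt = (
                f"You are {agent.name}. {agent.persona}\n"
                f"Text under evaluation:\n{text}\n\n"
                "Discussion so far:\n" + "\n".join(transcript) + "\n\n"
                "Think step by step, then give a critique and a 1-5 score."
            )
            transcript.append(f"{agent.name}: {ask_llm(prompt)}")
    return ask_llm(
        "Summarize the discussion into a final evaluation (score + rationale):\n"
        + "\n".join(transcript)
    )

# Example: a small panel with complementary personas.
panel = [Agent("Critic", "Focus on factual accuracy."),
         Agent("Editor", "Focus on clarity and completeness.")]
# report = evaluate_with_discussion("<text to evaluate>", panel)
```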
-
tvallogging
A tool for evaluating and tracking your RAG experiments. This repo contains the Python SDK for logging to Tonic Validate.
-
spark-llm-eval
Spark-native LLM evaluation framework with confidence intervals, significance testing, and Databricks integration.
The framework is open source and available on PyPI (pip install spark-llm-eval). If you're doing LLM evaluation at scale on Spark/Databricks, I'd love to hear what works and what doesn't. The space is evolving fast, and I don't pretend to have all the answers.
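For a sense of what "confidence intervals over evaluation scores on Spark" can look like, here is an illustrative sketch using plain PySpark and NumPy: it computes a percentile-bootstrap 95% CI over per-example scores, collecting the scores to the driver for simplicity. This is not spark-llm-eval's API; refer to the package's own documentation for its actual interface.

```python
# Illustrative only: bootstrap CI over per-example eval scores with plain PySpark.
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("llm-eval-ci").getOrCreate()

# Assume one row per evaluated example with a numeric score in [0, 1].
scores_df = spark.createDataFrame(
    [(i, s) for i, s in enumerate([0.9, 0.7, 1.0, 0.4, 0.8, 0.95, 0.6, 0.85])],
    ["example_id", "score"],
)

scores = np.array([row.score for row in scores_df.select("score").collect()])

# Percentile bootstrap: resample with replacement, recompute the mean each time.
rng = np.random.default_rng(0)
boot_means = [rng.choice(scores, size=scores.size, replace=True).mean()
              for _ in range(2000)]
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={scores.mean():.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```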
Python evaluation-framework related posts
- Why I Built a Spark-Native LLM Evaluation Framework
- Show HN: E2E Testing for Chatbots
- End-to-end testing of Gen AI Apps
- Tech Trend Blog list over 200 blogs
- Revival Hijacking: How Deleted PyPI Packages Become Threats
- Has anyone ever used the Python framework "Deepeval"?
- Launch HN: Confident AI (YC W25) – Open-source evaluation framework for LLM apps
Index
What are some of the best open-source evaluation-framework projects in Python? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | deepeval | 12,683 |
| 2 | lm-evaluation-harness | 11,007 |
| 3 | Kiln | 4,485 |
| 4 | lighteval | 2,206 |
| 5 | continuous-eval | 515 |
| 6 | AgentLab | 486 |
| 7 | tonic_validate | 320 |
| 8 | MixEval | 252 |
| 9 | PyDGN | 222 |
| 10 | eureka-ml-insights | 173 |
| 11 | vectory | 71 |
| 12 | MATEval | 28 |
| 13 | tvallogging | 8 |
| 14 | SigmaEval | 1 |
| 15 | spark-llm-eval | 2 |