Python evaluation-framework

Open-source Python projects categorized as evaluation-framework

Top 15 Python evaluation-framework Projects

  1. deepeval

    The LLM Evaluation Framework

    Project mention: Has anyone ever used the Python framework "Deepeval"? | news.ycombinator.com | 2025-06-23
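    A minimal sanity check using deepeval's pytest-style API, per its README (assumes an OPENAI_API_KEY for the judge model; metric names can shift between releases):

      # test_chatbot.py -- run with: deepeval test run test_chatbot.py
      from deepeval import assert_test
      from deepeval.metrics import AnswerRelevancyMetric
      from deepeval.test_case import LLMTestCase

      def test_answer_relevancy():
          # LLM-as-judge metric: how relevant is actual_output to input?
          metric = AnswerRelevancyMetric(threshold=0.7)
          test_case = LLMTestCase(
              input="What if these shoes don't fit?",
              # Replace with the real output of your LLM app.
              actual_output="We offer a 30-day full refund at no extra cost.",
          )
          assert_test(test_case, [metric])
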
  2. lm-evaluation-harness

    A framework for few-shot evaluation of language models.

    Project mention: Ask HN: Who is hiring? (September 2025) | news.ycombinator.com | 2025-09-01

    The linked post is a job-seeker's résumé; the relevant part lists EleutherAI's lm-evaluation-harness ( https://github.com/EleutherAI/lm-evaluation-harness ) among the poster's open-source contributions.
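    For orientation, the harness's documented Python entry point (the CLI equivalent is lm_eval --model hf --tasks hellaswag); argument names may differ slightly across releases, so check the README:

      import lm_eval

      # Zero-shot HellaSwag on a small Hugging Face model; "hf" selects
      # the transformers backend, configured via model_args.
      results = lm_eval.simple_evaluate(
          model="hf",
          model_args="pretrained=EleutherAI/pythia-160m",
          tasks=["hellaswag"],
          num_fewshot=0,
      )
      print(results["results"]["hellaswag"])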

  3. Kiln

    Easily build AI systems with Evals, RAG, Agents, fine-tuning, synthetic data, and more.

    Project mention: DeepFabric – Generate High-Quality Synthetic Datasets at Scale | news.ycombinator.com | 2025-09-26
  4. lighteval

    Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends.

  5. continuous-eval

    Data-Driven Evaluation for LLM-Powered Applications

  6. AgentLab

    AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.

  7. tonic_validate

    Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.
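    A short sketch reconstructed from tonic_validate's README from memory -- treat the exact names (Benchmark, ValidateScorer, the callback's dict keys) as assumptions to verify against the repo:

      from tonic_validate import Benchmark, ValidateScorer

      def get_rag_response(question: str) -> dict:
          # Stand-in for your RAG pipeline: return the answer plus the
          # retrieved context used to produce it.
          return {
              "llm_answer": "Paris.",
              "llm_context_list": ["Paris is the capital of France."],
          }

      benchmark = Benchmark(
          questions=["What is the capital of France?"],
          answers=["Paris"],
      )
      # Default metrics use an LLM judge, so an OpenAI key is expected.
      run = ValidateScorer().score(benchmark, get_rag_response)
      print(run.overall_scores)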

  8. MixEval

    The official evaluation suite and dynamic data release for MixEval.

  9. PyDGN

    A research library for automating experiments on Deep Graph Networks

  10. eureka-ml-insights

    A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.

  11. vectory

    Vectory provides a collection of tools to track and compare embedding versions.

  12. MATEval

    MATEval is the first multi-agent framework simulating human collaborative discussion for open-ended text evaluation.

    Project mention: Making Sure AI Agents Play Nice: A Look at How We Evaluate Them | dev.to | 2025-05-01

    MATEval is a cool framework specifically for evaluating data using multiple agents. It has a team of language model agents discuss and debate the accuracy and clarity of complex datasets, using techniques like structured discussions, agents reflecting on their own thinking, breaking down problems step-by-step (chain-of-thought), and refining their assessments based on feedback. It shows how a team of agents can evaluate things more thoroughly than a single agent.
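    That discuss-reflect-refine loop is easy to sketch. Below is a hypothetical, framework-free rendition of the pattern the paper describes; call_llm is a stand-in for whatever chat model you use, and none of these names come from MATEval itself:

      def call_llm(prompt: str) -> str:
          """Stand-in for a chat-model call; plug in your own client."""
          raise NotImplementedError

      def panel_evaluate(text: str, n_agents: int = 3, n_rounds: int = 2) -> str:
          # Each agent drafts an independent chain-of-thought assessment.
          opinions = [
              call_llm(f"Agent {i}: step by step, assess the accuracy "
                       f"and clarity of this text:\n{text}")
              for i in range(n_agents)
          ]
          # Agents read each other's assessments and revise their own.
          for _ in range(n_rounds):
              transcript = "\n---\n".join(opinions)
              opinions = [
                  call_llm(f"Agent {i}: given the panel's current views:\n"
                           f"{transcript}\nreflect and revise your assessment.")
                  for i in range(n_agents)
              ]
          # Merge the discussion into a single verdict.
          return call_llm("Summarize the panel's final verdict:\n"
                          + "\n---\n".join(opinions))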

  13. tvallogging

    A tool for evaluating and tracking your RAG experiments. This repo contains the Python SDK for logging to Tonic Validate.

  14. spark-llm-eval

    Spark-native LLM evaluation framework with confidence intervals, significance testing, and Databricks integration.

    Project mention: Why I Built a Spark-Native LLM Evaluation Framework | dev.to | 2025-12-16

    The framework is open source and available on PyPI (pip install spark-llm-eval). If you're doing LLM evaluation at scale on Spark/Databricks, I'd love to hear what works and what doesn't. The space is evolving fast, and I don't pretend to have all the answers.
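    I have not verified spark-llm-eval's own API, so here is only the general pattern it advertises, in plain PySpark: score each row, then attach a normal-approximation confidence interval to the aggregate accuracy. All names below are illustrative, not the library's:

      import math
      from pyspark.sql import SparkSession, functions as F

      spark = SparkSession.builder.getOrCreate()

      # Toy eval set: one row per (prompt, expected, model_output).
      df = spark.createDataFrame(
          [("2+2?", "4", "4"), ("Capital of France?", "Paris", "Lyon")],
          ["prompt", "expected", "output"],
      )

      # Exact-match scoring; real setups use a judge model or fuzzier metric.
      scored = df.withColumn(
          "correct", (F.col("output") == F.col("expected")).cast("double")
      )
      row = scored.agg(F.avg("correct").alias("acc"),
                       F.count("*").alias("n")).first()

      # 95% Wald interval on mean accuracy.
      half = 1.96 * math.sqrt(row.acc * (1 - row.acc) / row.n)
      print(f"accuracy = {row.acc:.2f} +/- {half:.2f} (n={row.n})")
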
  15. SigmaEval

    E2E testing of Gen AI apps.

    Project mention: Show HN: E2E Testing for Chatbots | news.ycombinator.com | 2025-10-29

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python evaluation-framework related posts

  • Why I Built a Spark-Native LLM Evaluation Framework

    1 project | dev.to | 16 Dec 2025
  • Show HN: E2E Testing for Chatbots

    2 projects | news.ycombinator.com | 29 Oct 2025
  • End-to-end testing of Gen AI Apps

    1 project | dev.to | 11 Oct 2025
  • Tech Trend Blog list over 200 blogs

    4 projects | dev.to | 6 Sep 2025
  • Revival Hijacking: How Deleted PyPI Packages Become Threats

    3 projects | dev.to | 2 Aug 2025
  • Has anyone ever used the Python framework "Deepeval"?

    1 project | news.ycombinator.com | 23 Jun 2025
  • Launch HN: Confident AI (YC W25) – Open-source evaluation framework for LLM apps

    2 projects | news.ycombinator.com | 20 Feb 2025

Index

What are some of the best open-source evaluation-framework projects in Python? This list will help you:

# Project Stars
1 deepeval 12,683
2 lm-evaluation-harness 11,007
3 Kiln 4,485
4 lighteval 2,206
5 continuous-eval 515
6 AgentLab 486
7 tonic_validate 320
8 MixEval 252
9 PyDGN 222
10 eureka-ml-insights 173
11 vectory 71
12 MATEval 28
13 tvallogging 8
14 spark-llm-eval 2
15 SigmaEval 1

