Python Evaluation

Open-source Python projects categorized as Evaluation

Top 23 Python Evaluation Projects

  1. ragas

    Supercharge Your LLM Application Evaluations 🚀

    Project mention: Evaluate and Improve Your Agents: Automated Evaluation with RAGAS for Production Agents | dev.to | 2025-10-15
  3. opencompass

    OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama 3, Mistral, InternLM2, GPT-4, LLaMA 2, Qwen, GLM, Claude, etc.) across 100+ datasets.

  4. Kiln

    Easily build AI systems with Evals, RAG, Agents, fine-tuning, synthetic data, and more.

    Project mention: DeepFabric – Generate High-Quality Synthetic Datasets at Scale | news.ycombinator.com | 2025-09-26
  5. AutoRAG

    AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

  6. promptbench

    A unified evaluation framework for large language models

  7. evaluate

    🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
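
As a rough illustration of the predictions/references interface that `evaluate` metrics expose, here is a plain-Python accuracy sketch (this is an illustration of the concept, not the library's implementation):

```python
# Plain-Python sketch of an accuracy metric, mirroring the
# predictions/references shape that `evaluate` metrics accept.
def compute_accuracy(predictions, references):
    if len(predictions) != len(references):
        raise ValueError("predictions and references must be the same length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}

print(compute_accuracy(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0]))
# → {'accuracy': 0.75}
```

With the library itself, the equivalent is roughly `metric = evaluate.load("accuracy")` followed by `metric.compute(predictions=..., references=...)`.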

  8. uptrain

    UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform root cause analysis on failure cases and give insights on how to resolve them.

    Project mention: Launch HN: Lucidic (YC W25) – Debug, test, and evaluate AI agents in production | news.ycombinator.com | 2025-07-30
  10. lighteval

    Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

  11. avalanche

    Avalanche: an End-to-End Library for Continual Learning based on PyTorch.

  12. EvalAI

    Evaluating the state of the art in AI

  13. ragbits

    Building blocks for rapid development of GenAI applications

    Project mention: Just dropped ragbits v1.0 and create-ragbits-app – spin up a RAG app in minutes | news.ycombinator.com | 2025-06-04

    Whether you're prototyping or scaling, this stack is built to grow with you, with real tooling, not just examples.

    Source code: https://github.com/deepsense-ai/ragbits

    Would love to hear your feedback or ideas. And if you're building RAG apps, give create-ragbits-app a shot and tell us how it goes.

  14. pycm

    Multi-class confusion matrix library in Python
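
The core data structure here can be sketched in a few lines of plain Python: a nested table counting how often each actual class was predicted as each other class (a concept sketch, not pycm's implementation):

```python
def confusion_matrix(actual, predicted):
    """Build a {actual_class: {predicted_class: count}} table
    over all labels seen in either vector."""
    labels = sorted(set(actual) | set(predicted))
    table = {a: {p: 0 for p in labels} for a in labels}
    for a, p in zip(actual, predicted):
        table[a][p] += 1
    return table

cm = confusion_matrix(["cat", "dog", "cat", "bird"],
                      ["cat", "dog", "dog", "bird"])
# cm["cat"]["dog"] is 1: one "cat" was misclassified as "dog".
```

pycm builds this kind of table (roughly via `ConfusionMatrix(actual_vector=..., predict_vector=...)`) and then derives dozens of per-class and overall statistics from it.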

  15. torch-fidelity

    High-fidelity performance metrics for generative models in PyTorch

  16. intellagent

    A framework for comprehensive diagnosis and optimization of agents using simulated, realistic synthetic interactions

    Project mention: Making Sure AI Agents Play Nice: A Look at How We Evaluate Them | dev.to | 2025-05-01

    When it comes to evaluating conversational agents, there are some smart ways to do it. Take a framework like IntellAgent: it uses AI to test other AI, in a three-step process designed to make testing more thorough and realistic than having a person manually try things out.

  17. semantic-kitti-api

    SemanticKITTI API for visualizing dataset, processing data, and evaluating results.

  18. deepfabric

    Curate High Quality Datasets, Train, Evaluate and Ship! 🚀

    Project mention: Fine-Tuning with GRPO Datasets: A Developer's Guide to DeepFabric's GRPO Formatter | dev.to | 2025-10-21

    DeepFabric Documentation

  19. long-form-factuality

    Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models".

  20. ranx

    ⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍
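
To make "ranking evaluation" concrete, here is a plain-Python sketch of one classic metric, mean reciprocal rank (an illustration of the idea, not ranx's implementation):

```python
def mean_reciprocal_rank(rankings, relevant):
    """rankings: {query: [doc ids in ranked order]},
    relevant: {query: set of relevant doc ids}.
    MRR averages 1/rank of the first relevant hit per query."""
    total = 0.0
    for q, ranked in rankings.items():
        rr = 0.0
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant[q]:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(rankings)

# First relevant doc at rank 2 for q1 (0.5) and rank 1 for q2 (1.0) → 0.75.
score = mean_reciprocal_rank({"q1": ["d2", "d1"], "q2": ["d3"]},
                             {"q1": {"d1"}, "q2": {"d3"}})
```

ranx computes this metric and many others (NDCG, MAP, etc.) over its qrels/run abstractions, with vectorized implementations rather than loops like the above.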

  21. TrustLLM

    [ICML 2024] TrustLLM: Trustworthiness in Large Language Models

    Project mention: Evaluating Trustworthiness in DistilGPT2 | dev.to | 2025-06-08

    TrustLLM

  22. simpleeval

    Simple Safe Sandboxed Extensible Expression Evaluator for Python
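
The core idea behind a safe expression evaluator can be sketched with the stdlib `ast` module: parse the expression and walk the tree, allowing only whitelisted node types. This is a concept sketch, not simpleeval's implementation:

```python
import ast
import operator

# Whitelist of allowed operators; anything outside it raises.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def safe_eval(expr):
    """Evaluate a numeric expression; reject everything else
    (names, calls, attribute access) by raising ValueError."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"disallowed expression: {type(node).__name__}")
    return walk(ast.parse(expr, mode="eval"))

print(safe_eval("2 + 3 * 4"))  # → 14
```

simpleeval itself exposes this as roughly `simple_eval("2 + 3 * 4")`, with configurable operators, functions, and names on top.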

  23. reclist

    Behavioral "black-box" testing for recommender systems

  24. errant

    ERRor ANnotation Toolkit: Automatically extract and classify grammatical errors in parallel original and corrected sentences.
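
The extraction step can be sketched with stdlib `difflib`: align the original and corrected token sequences and keep the non-equal spans. ERRANT itself uses linguistically informed alignment and also classifies each edit by error type, which this sketch omits:

```python
import difflib

def extract_edits(original, corrected):
    """Token-level edit spans between an original and a corrected sentence,
    as (operation, original_span, corrected_span) tuples."""
    orig, corr = original.split(), corrected.split()
    sm = difflib.SequenceMatcher(a=orig, b=corr)
    return [
        (tag, " ".join(orig[i1:i2]), " ".join(corr[j1:j2]))
        for tag, i1, i2, j1, j2 in sm.get_opcodes()
        if tag != "equal"
    ]

edits = extract_edits("She go to school", "She goes to school")
# → [('replace', 'go', 'goes')]
```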

  25. FActScore

    A package to evaluate factuality of long-form generation. Original implementation of our EMNLP 2023 paper "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation"
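
The metric itself reduces to a simple ratio once a generation has been split into atomic facts and each fact judged against a knowledge source. A plain-Python sketch of that final step (the hard parts, fact decomposition and verification, are what the package actually provides):

```python
def factual_precision(atomic_facts):
    """atomic_facts: list of (fact, supported) pairs, where `supported`
    records whether the fact was verified against a knowledge source.
    FActScore-style precision is the fraction of supported facts."""
    if not atomic_facts:
        return 0.0
    supported = sum(1 for _, ok in atomic_facts if ok)
    return supported / len(atomic_facts)

score = factual_precision([
    ("born in 1980", True),
    ("won a Nobel Prize", False),
    ("lives in Paris", True),
    ("is a chemist", True),
])  # → 0.75
```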

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python Evaluation related posts

  • Gemini 3 Flash: frontier intelligence built for speed

    4 projects | news.ycombinator.com | 17 Dec 2025
  • Show HN: EvalView pytest style tests for AI agents (budgets, hallucinations)

    1 project | news.ycombinator.com | 17 Dec 2025
  • Benchmark that evaluates LLMs using 759 NYT Connections puzzles

    1 project | news.ycombinator.com | 12 Dec 2025
  • NYT Connections LLM Benchmark

    1 project | news.ycombinator.com | 11 Dec 2025
  • 7 Ways to Create High-Quality Evaluation Datasets for LLMs

    2 projects | dev.to | 27 Nov 2025
  • Show HN: Hegelion – Force your LLM to argue with itself before answering

    1 project | news.ycombinator.com | 24 Nov 2025
  • Gemini 3

    1 project | news.ycombinator.com | 18 Nov 2025

Index

What are some of the best open-source Evaluation projects in Python? This list will help you:

# Project Stars
1 ragas 11,831
2 opencompass 6,461
3 Kiln 4,485
4 AutoRAG 4,485
5 promptbench 2,767
6 evaluate 2,383
7 uptrain 2,326
8 lighteval 2,206
9 avalanche 1,998
10 EvalAI 1,967
11 ragbits 1,599
12 pycm 1,490
13 torch-fidelity 1,157
14 intellagent 1,158
15 semantic-kitti-api 875
16 deepfabric 675
17 long-form-factuality 660
18 ranx 628
19 TrustLLM 619
20 simpleeval 557
21 reclist 467
22 errant 456
23 FActScore 411
