The LLM Evaluation Framework
[NeurIPS D&B '25] The one-stop repository for large language model (LLM) unlearning. Supports TOFU, MUSE, WMDP, and many unlearning methods with easy feature extensibility.
LangFair is a Python library for conducting use-case-level LLM bias and fairness assessments.
[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
A measure of estimated confidence that outputs generated by Transformer-based language models are not hallucinated; a common baseline for such a measure is sketched below.
Evaluates LLM responses and computes their accuracy; a minimal accuracy-scoring sketch also appears below.
Tools for systematic large language model evaluations.
VerifyAI is a simple UI application for testing GenAI outputs.
A Streamlit application that provides a user-friendly interface for evaluating large language models (LLMs) using the beyondllm package.
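None of the entries above spell out how such a confidence measure is computed, so the following is only a minimal sketch, assuming one common baseline: the mean token log-probability a causal language model assigns to its own output. The model choice (gpt2) and the function name mean_token_logprob are illustrative assumptions, not the API of any repository listed here.

```python
# Minimal sketch, assuming mean token log-probability as a rough confidence proxy.
# `mean_token_logprob` and the choice of "gpt2" are illustrative, not taken from
# any repository listed above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def mean_token_logprob(text: str) -> float:
    """Average log-probability the model assigns to the tokens of `text`.

    Higher values mean the model finds the text more predictable, which is
    often used as a heuristic signal that the output is less likely hallucinated.
    """
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    # The Hugging Face causal-LM loss is the mean negative log-likelihood per
    # predicted token, so its negation is the mean token log-probability.
    return -outputs.loss.item()

print(mean_token_logprob("The capital of France is Paris."))
```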
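Likewise, for the accuracy-style scoring mentioned above, here is a minimal, self-contained sketch of normalized exact-match accuracy. The helper names normalize and exact_match_accuracy are hypothetical and not drawn from any listed repository.

```python
# Minimal sketch of exact-match accuracy scoring for LLM responses.
# The function names and normalization choices are illustrative assumptions.

def normalize(text: str) -> str:
    """Lowercase and strip surrounding whitespace and trailing punctuation."""
    return text.strip().strip(".!?").lower()

def exact_match_accuracy(responses: list[str], references: list[str]) -> float:
    """Fraction of responses that exactly match their reference after normalization."""
    if len(responses) != len(references):
        raise ValueError("responses and references must be the same length")
    if not responses:
        return 0.0
    hits = sum(
        normalize(resp) == normalize(ref)
        for resp, ref in zip(responses, references)
    )
    return hits / len(responses)

if __name__ == "__main__":
    responses = ["Paris.", "42", "blue whale"]
    references = ["paris", "42", "Blue Whale"]
    print(f"Accuracy: {exact_match_accuracy(responses, references):.2f}")  # Accuracy: 1.00
```

Exact match is only the simplest baseline; practical evaluation suites often layer semantic-similarity or LLM-as-judge scoring on top of it.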