A benchmark suite for evaluating LLMs and SLMs (small language models) on coding and software-engineering tasks. Includes HumanEval, MBPP, SWE-bench, and BigCodeBench with an interactive Streamlit UI. Supports cloud APIs (OpenAI, Anthropic, Google) and local models via Ollama, and tracks pass rates, latency, token usage, and cost.
python benchmark evaluation gemini openai code-generation claude streamlit humaneval llm ollama swe-bench mbpp bigcodebench
Updated Dec 3, 2025 - Python
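
The pass rates reported for benchmarks like HumanEval and MBPP are conventionally the pass@k metric. Below is a minimal sketch of the standard unbiased estimator from the HumanEval paper, assuming `n` generated samples per problem of which `c` pass the unit tests; the helper name and the example numbers are illustrative, not taken from this repository's code.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn from n total (c of them correct) passes the tests.
    Illustrative helper, not necessarily this repository's implementation."""
    if n - c < k:
        return 1.0  # every size-k draw must contain at least one correct sample
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable running product
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical example: 200 samples per task, 37 of which pass
print(pass_at_k(200, 37, 1))   # 0.185 (= 37/200)
print(pass_at_k(200, 37, 10))  # higher, since 10 draws get a chance to hit a passing sample
```

For k = 1 this reduces to c / n, the plain fraction of samples that pass, which is why pass@1 is often described simply as the pass rate.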