A flexible framework for comparing Retrieval-Augmented Generation (RAG) systems side-by-side, with support for subjective quality evaluation using LLMs.
- Multi-tool Support: Compare multiple RAG tools in parallel
- Flexible Adapters: Easy-to-extend adapter pattern for adding new tools
- Multiple Output Formats: Display, JSON, Markdown, and summary formats
- Performance Metrics: Automatic latency measurement and result statistics
- LLM Evaluation: Support for subjective quality assessment using Claude Opus 4.1
- Rich CLI: Beautiful terminal output with tables and panels
- Comprehensive Testing: 90+ tests ensuring reliability
- Python 3.9+
- uv - Fast Python package installer and resolver
To install uv:
```bash
# On macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Or with Homebrew
brew install uv

# Or with pip
pip install uv
```
```bash
# Clone the repository
git clone https://github.com/ansari-project/ragdiff.git
cd ragdiff

# Install dependencies with uv
uv sync --all-extras   # Install all dependencies including dev tools

# Or install only core dependencies
uv sync

# Or install with goodmem support
uv sync --extra goodmem

# Copy environment template
cp .env.example .env
# Edit .env and add your API keys
```
Create a `configs/tools.yaml` file:
```yaml
tools:
  mawsuah:
    api_key_env: VECTARA_API_KEY
    corpus_id: ${VECTARA_CORPUS_ID}
    base_url: https://api.vectara.io
    timeout: 30
  goodmem:
    api_key_env: GOODMEM_API_KEY
    base_url: https://api.goodmem.ai
    timeout: 30

llm:
  model: claude-opus-4-1-20250805
  api_key_env: ANTHROPIC_API_KEY
```
```bash
# Compare all configured tools
uv run python -m src.cli compare "What is Islamic inheritance law?"

# Compare specific tools
uv run python -m src.cli compare "Your query" --tool mawsuah --tool goodmem

# Adjust number of results
uv run python -m src.cli compare "Your query" --top-k 10
```
```bash
# Default display format (side-by-side)
uv run python -m src.cli compare "Your query"

# JSON output
uv run python -m src.cli compare "Your query" --format json

# Markdown output
uv run python -m src.cli compare "Your query" --format markdown

# Summary output
uv run python -m src.cli compare "Your query" --format summary

# Save to file
uv run python -m src.cli compare "Your query" --output results.json --format json
```
Run multiple queries and get comprehensive analysis:
```bash
# Basic batch comparison
uv run python -m src.cli batch inputs/tafsir-test-queries.txt \
  --config configs/tafsir.yaml \
  --top-k 10 \
  --format json

# With LLM evaluation (generates holistic summary)
uv run python -m src.cli batch inputs/tafsir-test-queries.txt \
  --config configs/tafsir.yaml \
  --evaluate \
  --top-k 10 \
  --format json

# Custom output directory
uv run python -m src.cli batch inputs/tafsir-test-queries.txt \
  --config configs/tafsir.yaml \
  --evaluate \
  --output-dir my-results \
  --format jsonl
```
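The query file passed to `batch` is plain text. A minimal sketch of its contents, assuming one query per line (the example queries here are illustrative, not taken from the repository):

```text
What is Islamic inheritance law?
How is zakat calculated?
What does the tafsir literature say about patience?
```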
The batch command with `--evaluate` generates:
- Individual query results in JSON/JSONL/CSV format
- Latency statistics (P50, P95, P99)
- LLM evaluation summary showing wins and quality scores
- Holistic summary (markdown file) with:
  - Query-by-query breakdown with winners and scores
  - Common themes: win distribution and recurring issues
  - Key differentiators: what makes the winner better and where the loser falls short
  - Overall verdict with a production recommendation
Convert holistic summary to PDF:
```bash
# Generate PDF from markdown summary
python md2pdf.py outputs/holistic_summary_TIMESTAMP.md
```
```bash
# List available tools
uv run python -m src.cli list-tools

# Validate configuration
uv run python -m src.cli validate-config

# Run quick test
uv run python -m src.cli quick-test

# Get help
uv run python -m src.cli --help
uv run python -m src.cli compare --help
uv run python -m src.cli batch --help
```
```
ragdiff/
├── src/
│   ├── core/              # Core models and configuration
│   │   ├── models.py      # Data models (RagResult, ComparisonResult, etc.)
│   │   └── config.py      # Configuration management
│   ├── adapters/          # Tool adapters
│   │   ├── base.py        # Base adapter implementing SearchVectara interface
│   │   ├── mawsuah.py     # Vectara/Mawsuah adapter
│   │   ├── goodmem.py     # Goodmem adapter with mock fallback
│   │   └── factory.py     # Adapter factory
│   ├── comparison/        # Comparison engine
│   │   └── engine.py      # Parallel/sequential search execution
│   ├── display/           # Display formatters
│   │   └── formatter.py   # Multiple output format support
│   └── cli.py             # Typer CLI implementation
├── tests/                 # Comprehensive test suite
├── configs/               # Configuration files
└── requirements.txt       # Python dependencies
```
The tool follows the SPIDER protocol for systematic development:
- Specification: Clear goals for subjective RAG comparison
- Planning: Phased implementation approach
- Implementation: Clean architecture with separation of concerns
- Defense: Comprehensive test coverage (90+ tests)
- Evaluation: Expert review and validation
- Commit: Version control with clear history
- BaseRagTool: Abstract base implementing the SearchVectara interface (see the sketch after this list)
- Adapters: Tool-specific implementations (Mawsuah, Goodmem)
- ComparisonEngine: Orchestrates parallel/sequential searches
- ComparisonFormatter: Handles multiple output formats
- Config: Manages YAML configuration with environment variables
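A minimal sketch of how an adapter relates to the base class. Only the `search` signature and the `RagResult` import are taken from the project; the constructor and docstrings are illustrative assumptions:

```python
from abc import ABC, abstractmethod
from typing import List

from ..core.models import RagResult


class BaseRagTool(ABC):
    """Simplified sketch of the adapter base class.

    In the project, the base adapter also implements the SearchVectara
    interface; that detail is omitted here.
    """

    def __init__(self, name: str, config: dict):
        # Tool name and per-tool settings loaded from configs/tools.yaml
        self.name = name
        self.config = config

    @abstractmethod
    def search(self, query: str, top_k: int = 5) -> List[RagResult]:
        """Return normalized results for a query from the underlying tool."""
        ...
```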
- Create a new adapter in `src/adapters/`:
```python
from typing import List

from .base import BaseRagTool
from ..core.models import RagResult


class MyToolAdapter(BaseRagTool):
    def search(self, query: str, top_k: int = 5) -> List[RagResult]:
        # Implement tool-specific search
        results = self.client.search(query, limit=top_k)
        return [self._convert_to_rag_result(r) for r in results]
```
- Register in `src/adapters/factory.py`:
```python
ADAPTER_REGISTRY["mytool"] = MyToolAdapter
```
- Add configuration in `configs/tools.yaml`:
```yaml
tools:
  mytool:
    api_key_env: MYTOOL_API_KEY
    base_url: https://api.mytool.com
```
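Once registered and configured, the new adapter can be selected like any other tool:

```bash
uv run python -m src.cli compare "Your query" --tool mytool
```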
```bash
# Run all tests
uv run pytest tests/

# Run specific test file
uv run pytest tests/test_cli.py

# Run with coverage
uv run pytest tests/ --cov=src
```
The project uses:
- Black for formatting
- Ruff for linting
- MyPy for type checking
```bash
# Format code with Black
uv run black src/ tests/

# Check linting with Ruff
uv run ruff check src/ tests/

# Type checking with MyPy
uv run mypy src/
```
Required environment variables:
- `VECTARA_API_KEY`: For Mawsuah/Vectara access
- `VECTARA_CORPUS_ID`: Vectara corpus ID
- `GOODMEM_API_KEY`: For Goodmem access (optional, uses mock if not set)
- `ANTHROPIC_API_KEY`: For LLM evaluation (optional)
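A populated `.env` (copied from `.env.example`) might look like this; the values are placeholders:

```bash
VECTARA_API_KEY=your-vectara-api-key
VECTARA_CORPUS_ID=your-corpus-id
GOODMEM_API_KEY=your-goodmem-api-key
ANTHROPIC_API_KEY=your-anthropic-api-key
```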
[Your License]
Contributions welcome! Please follow the existing code style and add tests for new features.
Built following the SPIDER protocol for systematic development.