Overview

Purpose and Scope

This document provides a high-level introduction to the rag-pipelines repository, a framework for building domain-specific Retrieval-Augmented Generation (RAG) systems. This overview covers the repository's purpose, standardized architecture, core components, and available domain-specific implementations.

For detailed information on specific topics:

Installation and first-time setup: see Getting Started
Architectural patterns and design decisions: see Architecture and Design Patterns
Individual domain pipeline implementations: see Domain-Specific Pipelines
Core utility APIs and usage: see Core Utilities

Sources: README.md (lines 1-147)

What is RAG Pipelines?

The rag-pipelines repository is a production-grade framework for implementing Retrieval-Augmented Generation systems tailored to specific knowledge domains. Unlike general-purpose RAG solutions, this framework emphasizes:

Domain specialization: Pre-configured pipelines for medical and financial domains with domain-appropriate metadata schemas, retrieval strategies, and evaluation metrics
Standardized architecture: All pipelines follow a consistent two-phase pattern (indexing and execution) orchestrated via LangGraph state machines
Configuration-driven deployment: YAML-based configuration enables new pipeline deployment without code changes
Production-ready evaluation: Built-in comprehensive evaluation using DeepEval metrics and Confident AI tracing

The repository currently implements six production pipelines across two domains:

Medical domain (4 pipelines): HealthBench, MedCaseReasoning, MetaMedQA, PubMedQA
Financial domain (2 pipelines): FinanceBench, Earnings Calls

Sources: README.md (lines 13-35), pyproject.toml (lines 1-10)

Repository Structure and Code Organization

The following diagram maps the repository's physical structure to its logical components:

[Diagram: Repository Layout]

The repository follows a clear separation of concerns:

src/rag_pipelines/utils/: Reusable core utilities imported by all pipelines
Domain pipeline directories: Self-contained modules with indexing scripts, RAG execution scripts, and YAML configurations
baml_client/ and baml_src/: Auto-generated BAML workflow code (excluded from linting and type checking per pyproject.toml, lines 90-94)
Root-level configs: Project-wide settings for dependencies, testing, and code quality

Sources: README.md (lines 1-147), pyproject.toml (lines 1-166), .gitignore (lines 1-208)

Core Architectural Patterns

All pipelines in this repository follow three fundamental architectural patterns:

1. Two-Phase Pipeline Architecture

Each pipeline operates in two distinct phases:

Phase 1 (Indexing): Offline ETL process executed via *_indexing.py scripts, configured by *_indexing_config.yml
Phase 2 (RAG Execution): Online query-time workflow executed via *_rag.py scripts, configured by *_rag_config.yml

For detailed phase specifications, see Two-Phase RAG Pattern.

Sources: README.md (lines 36-51)

2. LangGraph State Machine Orchestration

Phase 2 (RAG Execution) is implemented as a StateGraph from the langgraph library. Each pipeline defines:

RAGState: A TypedDict containing question, context, retrieved documents, metadata, answer, and evaluation results
Pipeline nodes: Functions that read from and write to RAGState, implementing the five standard nodes listed under Phase 2: RAG Execution
Graph structure: Directed edges defining execution order between nodes
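
The state-passing idea can be sketched without any dependencies. The snippet below is an illustration only, not the repository's code: the field names are assumptions derived from the RAGState description above, `extract_metadata_node` and `apply_update` are hypothetical names, and the merge step mimics how a node's partial update is folded back into the shared state rather than calling langgraph itself.

```python
from typing import Any, TypedDict


class RAGState(TypedDict, total=False):
    """Hypothetical state shape; field names are assumptions, not the repo's definitions."""

    question: str
    context: str
    retrieved_documents: list[str]
    metadata: dict[str, Any]
    answer: str
    evaluation_results: dict[str, float]


def extract_metadata_node(state: RAGState) -> RAGState:
    """Read the question and return only the keys this node updates."""
    # Stand-in logic: a real node would call an LLM-backed extractor here.
    words = state["question"].lower().split()
    mentions_year = any(word.strip("?.,").isdigit() for word in words)
    return {"metadata": {"mentions_year": mentions_year}}


def apply_update(state: RAGState, update: RAGState) -> RAGState:
    """Merge a node's partial update into a new state, overwriting existing keys."""
    merged: RAGState = {**state, **update}
    return merged


initial: RAGState = {"question": "What was revenue in 2022?"}
updated = apply_update(initial, extract_metadata_node(initial))
```

Returning only the changed keys keeps each node small and makes it easy to see which part of the state a node owns.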

For LangGraph implementation details, see LangGraph State Workflows and RAGState and Pipeline Nodes.

Sources: README.md (line 17), pyproject.toml (line 16)

3. Configuration-Driven Design

All pipeline customization occurs via YAML configuration files, not code modification:

Configuration File	Purpose	Key Sections
`*_indexing_config.yml`	Controls Phase 1 indexing process	`data_loader`, `chunker`, `metadata_extractor`, `embeddings`, `vector_store`
`*_rag_config.yml`	Controls Phase 2 RAG execution	`metadata_extractor`, `retriever`, `reranker`, `llm`, `evaluation_metrics`
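
As a rough illustration of how the key sections in the table could be checked before a run, the sketch below validates an already-parsed configuration (a plain dict, as a YAML loader would produce). Treating every listed section as required is an assumption, and `REQUIRED_SECTIONS`, `missing_sections`, and the sample values are hypothetical names and data, not part of the repository.

```python
# Section names come from the table above; treating all of them as required is an assumption.
REQUIRED_SECTIONS = {
    "indexing": ["data_loader", "chunker", "metadata_extractor", "embeddings", "vector_store"],
    "rag": ["metadata_extractor", "retriever", "reranker", "llm", "evaluation_metrics"],
}


def missing_sections(config: dict, phase: str) -> list[str]:
    """Return the expected top-level sections that are absent from a parsed config."""
    return [name for name in REQUIRED_SECTIONS[phase] if name not in config]


# Hypothetical parsed *_rag_config.yml content; the values are placeholders.
example_rag_config = {
    "metadata_extractor": {},
    "retriever": {},
    "llm": {},
}

# Lists the sections that still need to be added to the example config.
print(missing_sections(example_rag_config, "rag"))
```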

For configuration schema and usage, see Configuration Management.

Sources: README.md (line 23)

Core Utility Components

The src/rag_pipelines/utils/ directory provides four core utilities used by all pipelines:

Component Overview Table

Component	Class Name	Primary Purpose	Key Methods
Metadata Extraction	`MetadataExtractor`	LLM-powered structured metadata extraction from text	`extract_metadata()`
Contextual Reranking	`ContextualReranker`	Relevance-based document reordering using transformer models	`rerank()`
Document Loading	`UnstructuredDocumentLoader`, `UnstructuredAPIDocumentLoader`	PDF/document parsing via Unstructured API	`load()`, `lazy_load()`
Document Chunking	`UnstructuredChunker`	Strategy-based document chunking (basic, by_title)	`chunk_documents()`

Metadata Extraction Flow

The MetadataExtractor class in src/rag_pipelines/utils/metadata_extractor.py converts user-provided JSON schemas into Pydantic models, then uses LLM structured output to extract metadata conforming to that schema. Only successfully extracted (non-null) fields appear in the result dictionary.
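
The "non-null fields only" behavior can be illustrated with a small dependency-free sketch. It is not the MetadataExtractor implementation: it skips the Pydantic conversion and the LLM call entirely, assumes the schema uses the standard JSON Schema `properties` layout, and `keep_extracted_fields` plus the sample schema are hypothetical.

```python
from typing import Any


def keep_extracted_fields(schema: dict[str, Any], raw: dict[str, Any]) -> dict[str, Any]:
    """Keep only schema-declared fields whose extracted value is not None."""
    declared = schema.get("properties", {})
    return {name: raw[name] for name in declared if raw.get(name) is not None}


# Hypothetical schema and a hypothetical raw result standing in for LLM output.
example_schema = {
    "properties": {
        "company": {"type": "string"},
        "fiscal_year": {"type": "integer"},
        "quarter": {"type": "string"},
    }
}
example_raw = {"company": "Acme", "fiscal_year": None, "unrelated": 1}

result = keep_extracted_fields(example_schema, example_raw)
```

In this sketch, fields that are null, missing, or not declared in the schema are all dropped, so downstream filters only see values the extractor actually produced.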

For detailed API documentation, see Metadata Extraction.

Sources: README.md (lines 52-84)

Contextual Reranking Architecture

The ContextualReranker class in src/rag_pipelines/utils/contextual_reranker.py uses Contextual AI's instruction-following reranker models to score and reorder retrieved documents based on query relevance. It also supports custom instructions for domain-specific reranking behavior.
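
The score-then-reorder step can be sketched as follows. This is a simplified stand-in, not the ContextualReranker API: a naive token-overlap function replaces the Contextual AI model, custom instructions are not modeled, and `overlap_score`, `rerank_documents`, and `top_n` are hypothetical names.

```python
from typing import Callable, Optional


def overlap_score(query: str, document: str) -> float:
    """Toy relevance score: fraction of query tokens that appear in the document."""
    query_tokens = set(query.lower().split())
    document_tokens = set(document.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & document_tokens) / len(query_tokens)


def rerank_documents(
    query: str,
    documents: list[str],
    score: Callable[[str, str], float] = overlap_score,
    top_n: Optional[int] = None,
) -> list[str]:
    """Order documents by descending score; ties keep their original order."""
    ranked = sorted(documents, key=lambda doc: score(query, doc), reverse=True)
    return ranked if top_n is None else ranked[:top_n]


candidates = [
    "milvus stores vectors",
    "net revenue grew in 2022",
    "revenue guidance",
]
reranked = rerank_documents("net revenue 2022", candidates)
```

Passing the scorer as a parameter keeps the ordering logic independent of whichever model produces the scores.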

For detailed API documentation, see Contextual Reranking.

Sources: README.md (lines 54-61)

Technology Stack and Dependencies

Primary Dependencies

The following table lists core dependencies with their roles and version constraints from pyproject.toml (lines 13-39):

Category	Package	Version Constraint	Purpose
Orchestration	`langgraph`	Latest	State machine workflow orchestration
LangChain Core	`langchain-core`	Latest	Base abstractions for LLM applications
LLM Integration	`langchain-groq`	Latest	Groq API client for LLM inference
Embeddings	`langchain-huggingface`, `sentence-transformers`	Latest	Dense vector embeddings
Vector Database	`langchain-milvus`	Latest	Milvus vector store integration
Document Processing	`langchain-unstructured`, `unstructured[pdf]`	Latest	PDF parsing and chunking
Evaluation	`deepeval`	`>=3.7.0`	RAG evaluation metrics
Workflow DSL	`baml-py`	`==0.214.0`	BAML declarative workflows
Data Loading	`datasets`	Latest	HuggingFace datasets
Financial Data	`edgartools`	Latest	SEC EDGAR filings

External Service Integrations

Environment variables required for external service authentication are documented in README.md (lines 109-116). For setup instructions, see Getting Started.

Sources: pyproject.toml (lines 13-39), README.md (lines 109-116)

Domain-Specific Pipeline Catalog

The repository provides six production-ready RAG pipelines organized by domain:

Medical Domain Pipelines

Pipeline	Module Path	Dataset Source	Primary Use Case	Milvus Collection
HealthBench	`src/rag_pipelines/healthbench/`	`Tonic/Health-Bench` (HuggingFace)	Multi-turn medical conversations with expert rubric evaluation	`healthbench`
MedCaseReasoning	`src/rag_pipelines/medcasereasoning/`	Medical case studies	Clinical case analysis and diagnostic reasoning	`medcasereasoning`
MetaMedQA	`src/rag_pipelines/metamedqa/`	`qiaojin/MetaMedQA` (HuggingFace)	USMLE medical exam preparation and medical textbook QA	`metamedqa`
PubMedQA	`src/rag_pipelines/pubmedqa/`	`qiaojin/PubMedQA` (HuggingFace)	Biomedical research questions from PubMed articles	`pubmedqa`

For detailed medical pipeline documentation, see Medical Domain Pipelines.

Sources: README.md (lines 29-32)

Financial Domain Pipelines

Pipeline	Module Path	Dataset Source	Primary Use Case	Milvus Collection
FinanceBench	`src/rag_pipelines/financebench/`	`patronus-ai/financebench` (GitHub)	SEC filings QA (10-K, 10-Q, 8-K)	`financebench`
Earnings Calls	`src/rag_pipelines/earnings_calls/`	`lamini/earnings-calls-qa` (HuggingFace)	Earnings call transcript analysis for 2800+ companies	`earnings_calls`

For detailed financial pipeline documentation, see Financial Domain Pipelines.

Sources: README.md (lines 33-34)

Shared Architecture Pattern

All six pipelines implement the same code structure:

<pipeline_name>/
├── <pipeline_name>_indexing.py          # Phase 1: Indexing script
├── <pipeline_name>_rag.py               # Phase 2: RAG execution script
├── <pipeline_name>_indexing_config.yml  # Indexing configuration
└── <pipeline_name>_rag_config.yml       # RAG execution configuration

This standardization enables:

Code reuse: All pipelines share the same core utilities from src/rag_pipelines/utils/
Consistent patterns: Developers familiar with one pipeline can immediately understand others
Rapid deployment: New domain pipelines require only configuration and dataset specification
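
Because the file names follow one convention, the four paths for any pipeline can be derived from its name. The helper below is a small sketch of that convention, assuming pipelines live under src/rag_pipelines/ as in the catalog tables; `pipeline_files` and its dictionary keys are hypothetical and not part of the repository.

```python
from pathlib import PurePosixPath


def pipeline_files(name: str, root: str = "src/rag_pipelines") -> dict[str, PurePosixPath]:
    """Derive the four conventional file paths for a pipeline from its name."""
    base = PurePosixPath(root) / name
    return {
        "indexing_script": base / f"{name}_indexing.py",
        "rag_script": base / f"{name}_rag.py",
        "indexing_config": base / f"{name}_indexing_config.yml",
        "rag_config": base / f"{name}_rag_config.yml",
    }


# Print the conventional paths for one of the cataloged pipelines.
for role, path in pipeline_files("pubmedqa").items():
    print(f"{role}: {path}")
```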

Sources: README.md (lines 118-138)

Pipeline Execution Workflow

Phase 1: Indexing Execution

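To populate a pipeline's vector database, run that pipeline's *_indexing.py script, which is configured by the matching *_indexing_config.yml.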

The indexing script performs:

Dataset loading from HuggingFace or GitHub
Document chunking via UnstructuredChunker
Metadata extraction via MetadataExtractor (LLM-powered)
Embedding generation via HuggingFaceEmbeddings
Storage in Milvus collection with hybrid search (dense + BM25 sparse vectors)

Sources: README.md (lines 118-127)

Phase 2: RAG Execution

To run RAG evaluation on a pipeline, run that pipeline's *_rag.py script, which is configured by the matching *_rag_config.yml.

The RAG script executes a LangGraph StateGraph with five sequential nodes:

MetadataExtractionNode: Extracts query metadata for filtering
DocumentRetrievalNode: Performs hybrid search in Milvus
DocumentRerankerNode: Reranks via ContextualReranker
AnswerGenerationNode: Generates answer via ChatGroq LLM
EvaluationNode: Evaluates against ground truth via DeepEval metrics
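
The sequential order of the five nodes can be sketched with plain functions. This is a dependency-free illustration rather than the repository's LangGraph code: every node body is a placeholder (no Milvus, reranker, LLM, or DeepEval calls), the state keys are assumptions, and `run_pipeline` is a hypothetical runner standing in for the compiled graph.

```python
from typing import Any, Callable

State = dict[str, Any]
Node = Callable[[State], State]


def metadata_extraction(state: State) -> State:
    # Placeholder: a real node would extract filterable metadata from the question.
    return {"metadata": {"question_length": len(state["question"])}}


def document_retrieval(state: State) -> State:
    # Placeholder: a real node would run a hybrid search against the vector store.
    return {"documents": ["placeholder passage B", "placeholder passage A"]}


def document_reranker(state: State) -> State:
    # Placeholder: alphabetical order stands in for model-based relevance scoring.
    return {"documents": sorted(state["documents"])}


def answer_generation(state: State) -> State:
    # Placeholder: a real node would prompt an LLM with the top-ranked documents.
    return {"answer": f"Answer based on: {state['documents'][0]}"}


def evaluation(state: State) -> State:
    # Placeholder: a real node would compute evaluation metrics against ground truth.
    return {"evaluation": {"has_answer": bool(state["answer"])}}


# Node names follow the list above; the order defines the execution sequence.
NODES: list[tuple[str, Node]] = [
    ("MetadataExtractionNode", metadata_extraction),
    ("DocumentRetrievalNode", document_retrieval),
    ("DocumentRerankerNode", document_reranker),
    ("AnswerGenerationNode", answer_generation),
    ("EvaluationNode", evaluation),
]


def run_pipeline(question: str) -> State:
    """Run each node in order, merging its partial update into the state."""
    state: State = {"question": question}
    trace: list[str] = []
    for name, node in NODES:
        state = {**state, **node(state)}
        trace.append(name)
    state["trace"] = trace
    return state


final_state = run_pipeline("What does the placeholder corpus say?")
```

Each node depends only on keys written by earlier nodes, which is why a fixed linear order is sufficient here.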

For detailed node specifications, see RAGState and Pipeline Nodes.

Sources: README.md (lines 130-138)

Evaluation and Quality Assurance

Evaluation Framework

The repository uses DeepEval (version >=3.7.0 per pyproject.toml, line 28) for comprehensive RAG evaluation:

Metric Category	Metrics	Purpose
Retrieval Quality	Contextual Recall, Contextual Precision	Measures quality of retrieved documents
Generation Quality	Answer Relevancy, Faithfulness	Measures LLM answer quality and factual grounding
Overall Relevancy	Contextual Relevancy	End-to-end relevance assessment

Evaluation results are traced to Confident AI for debugging and performance analysis.

For evaluation configuration and metric interpretation, see Evaluation and Tracing.

Sources: README.md (lines 20-21, 50), pyproject.toml (line 28)

Code Quality Infrastructure

The repository implements multi-layered quality assurance. Key quality tools are configured in pyproject.toml (lines 45-166):

Ruff: Linting and formatting (excludes baml_client/, baml_src/)
mypy: Strict type checking (excludes generated code and tests)
pytest: Test framework with async support and coverage
pre-commit: Enforces quality gates before commits

For development setup and contribution guidelines, see Development Guide.

Sources: pyproject.toml (lines 45-166), .gitignore (lines 1-208)

Getting Started

To begin using the repository:

Install dependencies via uv package manager
Configure environment variables (Groq API, Milvus, Unstructured)
Run indexing for desired pipeline
Execute RAG evaluation to test pipeline

For detailed setup instructions, see Getting Started.

For contributing code or extending pipelines, see Development Guide.

Sources: README.md (lines 86-138)