codejay12/arxiv-paper-curator

The Mother of AI Project

Phase 1 RAG Systems: arXiv Paper Curator

A Learner-Focused Journey into Production RAG Systems

Learn to build modern AI systems from the ground up through hands-on implementation

Master the most in-demand AI engineering skills: RAG (Retrieval-Augmented Generation)


[Figure: RAG Architecture]

📖 About This Course

This is a learner-focused project where you'll build a complete research assistant system that automatically fetches academic papers, understands their content, and answers your research questions using advanced RAG techniques.

The arXiv Paper Curator will teach you to build a production-grade RAG system using industry best practices. You'll master the architecture, implementation, and deployment of AI systems that professionals use in the real world.

By the end of this course, you'll have your own AI research assistant and the skills to build similar systems for any domain.


🚀 Quick Start

📋 Prerequisites

  • Docker Desktop (with Docker Compose)
  • Python 3.12+
  • UV Package Manager (Install Guide)
  • 8GB+ RAM and 20GB+ free disk space

⚡ Get Started

    # 1. Clone and setup
    git clone <repository-url>
    cd arxiv-paper-curator
    uv sync

    # 2. Start all services
    docker compose up --build -d

    # 3. Verify everything works
    curl http://localhost:8000/health

📚 Weekly Learning Path

| Week | Topic | Blog Post | Code Release |
|------|-------|-----------|--------------|
| Week 0 | The Mother of AI project - 6 phases | The Mother of AI project | - |
| Week 1 | Infrastructure Foundation | The Infrastructure That Powers RAG Systems | week1.0 |
| Week 2 | Data Ingestion Pipeline | Building Data Ingestion Pipelines for RAG | week2.0 |
| Week 3 | Hybrid Search Implementation | Coming Soon | Coming Soon |
| Week 4 | Advanced Chunking & Retrieval | Coming Soon | Coming Soon |
| Week 5 | Full RAG Pipeline | Coming Soon | Coming Soon |
| Week 6 | Production Deployment | Coming Soon | Coming Soon |

📥 Clone a specific week's release:

    # Clone a specific week's code
    git clone --branch <WEEK_TAG> https://github.com/jamwithai/arxiv-paper-curator
    cd arxiv-paper-curator
    uv sync
    docker compose down -v
    docker compose up --build -d

    # Replace <WEEK_TAG> with: week1.0, week2.0, etc.

📊 Access Your Services

| Service | URL | Purpose |
|---------|-----|---------|
| API Documentation | http://localhost:8000/docs | Interactive API testing |
| Airflow Dashboard | http://localhost:8080 | Workflow management |
| OpenSearch Dashboards | http://localhost:5601 | Hybrid search engine UI |

NOTE: Check airflow/simple_auth_manager_passwords.json.generated for Airflow username and password
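
The endpoints in the table above can also be polled from a short script. The following is a minimal sketch using only the standard library. The names `probe` and `check_services` are hypothetical and not part of the project, and it assumes a healthy service answers a plain GET with a status below 400 (the Airflow and Dashboards pages may redirect to a login screen first).

```python
from typing import Callable, Dict
from urllib.error import URLError
from urllib.request import urlopen

# URLs copied from the service table; adjust them if you remap ports.
SERVICES: Dict[str, str] = {
    "API Documentation": "http://localhost:8000/docs",
    "Airflow Dashboard": "http://localhost:8080",
    "OpenSearch Dashboards": "http://localhost:5601",
}


def probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if a GET to `url` succeeds with a status below 400."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except (URLError, OSError):
        # Connection refused, timeouts, and HTTP error statuses land here.
        return False


def check_services(
    services: Dict[str, str], probe_fn: Callable[[str], bool] = probe
) -> Dict[str, bool]:
    """Map each service name to whether its URL responded."""
    return {name: probe_fn(url) for name, url in services.items()}
```

Calling `check_services(SERVICES)` once the containers are up returns a dictionary of service names to booleans. The `probe_fn` parameter exists so the check can be exercised without a running stack.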


📚 Week 1: Infrastructure Foundation ✅

Start here! Master the infrastructure that powers modern RAG systems.

🎯 Learning Objectives

  • Complete infrastructure setup with Docker Compose
  • FastAPI development with automatic documentation and health checks
  • PostgreSQL database configuration and management
  • OpenSearch hybrid search engine setup
  • Ollama local LLM service configuration
  • Service orchestration and health monitoring
  • Professional development environment with code quality tools

🏗️ Architecture Overview

[Figure: Week 1 Infrastructure Setup]

Infrastructure Components:

  • FastAPI: REST endpoints with async support (Port 8000)
  • PostgreSQL 16: Paper metadata storage (Port 5432)
  • OpenSearch 2.19: Search engine with dashboards (Ports 9200, 5601)
  • Apache Airflow 3.0: Workflow orchestration (Port 8080)
  • Ollama: Local LLM server (Port 11434)

📓 Setup Guide

    # Launch the Week 1 notebook
    uv run jupyter notebook notebooks/week1/week1_setup.ipynb

✅ Success Criteria

Complete when you can:

📖 Deep Dive

Blog Post: The Infrastructure That Powers RAG Systems - Detailed walkthrough and production insights


📚 Week 2: Data Ingestion Pipeline ✅

Building on Week 1 infrastructure: Learn to fetch, process, and store academic papers automatically.

🎯 Learning Objectives

  • arXiv API integration with rate limiting and retry logic
  • Scientific PDF parsing using Docling
  • Automated data ingestion pipelines with Apache Airflow
  • Metadata extraction and storage workflows
  • Complete paper processing from API to database

🏗️ Architecture Overview

[Figure: Week 2 Data Ingestion Architecture]

Data Pipeline Components:

  • MetadataFetcher: 🎯 Main orchestrator coordinating the entire pipeline
  • ArxivClient: Rate-limited paper fetching with retry logic
  • PDFParserService: Docling-powered scientific document processing
  • Airflow DAGs: Automated daily paper ingestion workflows
  • PostgreSQL Storage: Structured paper metadata and content
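
To make "rate-limited fetching with retry logic" concrete, here is a minimal sketch of the general pattern: enforce a minimum gap between calls and retry failures with exponential backoff. It is not the project's `ArxivClient`. The class `RateLimitedRetrier` and all of its parameters are hypothetical, and the three-second default interval is an assumption based on arXiv's published guidance for API clients.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


class RateLimitedRetrier:
    """Space calls at least `min_interval` seconds apart and retry failures.

    Hypothetical helper for illustration; not the project's ArxivClient.
    """

    def __init__(
        self,
        min_interval: float = 3.0,
        max_retries: int = 3,
        base_delay: float = 1.0,
        sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.min_interval = min_interval
        self.max_retries = max_retries
        self.base_delay = base_delay
        self._sleep = sleep
        self._clock = clock
        self._last_call: Optional[float] = None

    async def _wait_for_slot(self) -> None:
        # Sleep just long enough to honour the minimum spacing.
        if self._last_call is not None:
            remaining = self.min_interval - (self._clock() - self._last_call)
            if remaining > 0:
                await self._sleep(remaining)
        self._last_call = self._clock()

    async def call(self, fn: Callable[[], Awaitable[T]]) -> T:
        """Run `fn`, retrying with exponential backoff on any exception."""
        for attempt in range(self.max_retries + 1):
            await self._wait_for_slot()
            try:
                return await fn()
            except Exception:
                if attempt == self.max_retries:
                    raise
                # Backoff doubles each attempt: base, 2*base, 4*base, ...
                await self._sleep(self.base_delay * 2**attempt)
        raise AssertionError("unreachable")
```

A caller would wrap each request in a zero-argument coroutine function, for example `await retrier.call(lambda: fetch_page(url))`, where `fetch_page` stands in for whatever HTTP call is being protected. The injectable `sleep` and `clock` make the timing behaviour testable without real delays.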

📓 Implementation Guide

    # Launch the Week 2 notebook
    uv run jupyter notebook notebooks/week2/week2_arxiv_integration.ipynb

💻 Code Examples

arXiv API Integration:

    # Example: Fetch papers with rate limiting
    from src.services.arxiv.factory import make_arxiv_client

    async def fetch_recent_papers():
        client = make_arxiv_client()
        papers = await client.search_papers(
            query="cat:cs.AI",
            max_results=10,
            from_date="20240801",
            to_date="20240807"
        )
        return papers
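
For orientation, the arguments passed to `search_papers` plausibly correspond to a query against the public arXiv API. The sketch below builds such a URL with the standard library. `build_arxiv_query_url` is a hypothetical helper, and mapping `from_date`/`to_date` onto a `submittedDate` range filter is an assumption about how the client works, not something taken from the project's code.

```python
from urllib.parse import urlencode

ARXIV_API_URL = "http://export.arxiv.org/api/query"


def build_arxiv_query_url(
    query: str, max_results: int, from_date: str, to_date: str
) -> str:
    """Build an arXiv API URL restricted to a submission-date window.

    Dates are YYYYMMDD strings; the window runs from 00:00 on
    `from_date` to 23:59 on `to_date`.
    """
    date_filter = f"submittedDate:[{from_date}0000 TO {to_date}2359]"
    params = {
        "search_query": f"{query} AND {date_filter}",
        "start": 0,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    # urlencode percent-escapes ':' and '[' / ']' and turns spaces into '+'.
    return f"{ARXIV_API_URL}?{urlencode(params)}"
```

Calling `build_arxiv_query_url("cat:cs.AI", 10, "20240801", "20240807")` yields a URL that can be opened in a browser to inspect the raw Atom feed the API returns.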

PDF Processing Pipeline:

    # Example: Parse PDF with Docling
    from src.services.pdf_parser.factory import make_pdf_parser_service

    async def process_paper_pdf(pdf_url: str):
        parser = make_pdf_parser_service()
        parsed_content = await parser.parse_pdf_from_url(pdf_url)
        return parsed_content  # Structured content with text, tables, figures

Complete Ingestion Workflow:

    # Example: Full paper ingestion pipeline
    from src.services.metadata_fetcher import make_metadata_fetcher

    async def ingest_papers():
        fetcher = make_metadata_fetcher()
        results = await fetcher.fetch_and_store_papers(
            query="cat:cs.AI",
            max_results=5,
            from_date="20240807"
        )
        return results  # Papers stored in database with full content

✅ Success Criteria

Complete when you can:

  • Fetch papers from arXiv API: Test in Week 2 notebook
  • Parse PDF content with Docling: View extracted structured content
  • Run Airflow DAG: arxiv_paper_ingestion executes successfully
  • Verify database storage: Papers appear in PostgreSQL with full content
  • API endpoints work: /papers returns stored papers with metadata

📖 Deep Dive

Blog Post: Building Data Ingestion Pipelines for RAG - arXiv API integration and PDF processing


📚 Future Weeks: Complete RAG System

Building on Weeks 1-2 foundation: Advanced RAG techniques and production deployment.

Future Weeks Overview (6-Week Course)

  • Week 3: OpenSearch hybrid search implementation with BM25 + semantic vectors
  • Week 4: Context-aware chunking and retrieval evaluation with nDCG metrics
  • Week 5: Full RAG pipeline with LLM integration and prompt optimization
  • Week 6: Observability with Langfuse, A/B testing, and production deployment
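
As a preview of the nDCG metric mentioned for Week 4, here is a minimal sketch of the standard formula with linear gain. The function names are illustrative and not taken from the course code. It assumes the ideal ranking is computed from the same list of graded relevance judgments that is passed in.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg(relevances: Sequence[float], k: int) -> float:
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0
```

Here `relevances` lists the graded relevance of each retrieved document in the order the system returned them. A perfectly ordered list scores 1.0, and placing relevant documents lower in the ranking pulls the score toward 0.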

🔧 Reference & Development Guide

🛠️ Technology Stack

| Service | Purpose | Status |
|---------|---------|--------|
| FastAPI | REST API with automatic docs | ✅ Ready |
| PostgreSQL 16 | Paper metadata and content storage | ✅ Ready |
| OpenSearch 2.19 | Hybrid search engine | ✅ Ready |
| Apache Airflow 3.0 | Workflow automation | ✅ Ready |
| Ollama | Local LLM serving | ✅ Ready |

Development Tools: UV, Ruff, MyPy, Pytest, Docker Compose

🏗️ Project Structure

    arxiv-paper-curator/
    ├── src/                                # Main application code
    │   ├── main.py                         # FastAPI application
    │   ├── routers/                        # API endpoints
    │   ├── models/                         # Database models (SQLAlchemy)
    │   ├── repositories/                   # Data access layer
    │   ├── schemas/                        # Pydantic validation schemas
    │   ├── services/                       # Business logic
    │   │   ├── arxiv/                      # ✨ NEW: arXiv API client
    │   │   ├── pdf_parser/                 # ✨ NEW: Docling PDF processing
    │   │   ├── metadata_fetcher.py         # ✨ NEW: Complete ingestion pipeline
    │   │   └── ollama/                     # Ollama LLM service
    │   ├── db/                             # Database configuration
    │   ├── config.py                       # Environment configuration
    │   └── dependencies.py                 # Dependency injection
    │
    ├── notebooks/                          # Learning materials
    │   ├── week1/                          # Week 1: Infrastructure setup
    │   │   └── week1_setup.ipynb           # Complete setup guide
    │   └── week2/                          # ✨ NEW: Week 2 materials
    │       └── week2_data_ingestion.ipynb  # Data pipeline guide
    │
    ├── airflow/                            # Workflow orchestration
    │   ├── dags/                           # Workflow definitions
    │   │   ├── arxiv_ingestion/            # ✨ NEW: arXiv ingestion modules
    │   │   └── arxiv_paper_ingestion.py    # ✨ NEW: Main ingestion DAG
    │   └── requirements-airflow.txt        # ✨ NEW: Airflow dependencies
    │
    ├── tests/                              # Comprehensive test suite
    ├── static/                             # Assets (images, GIFs)
    └── compose.yml                         # Service orchestration

🔧 Essential Commands

Using the Makefile (Recommended)

    # View all available commands
    make help

    # Quick workflow
    make start     # Start all services
    make health    # Check all services health
    make test      # Run tests
    make stop      # Stop services

All Available Commands

| Command | Description |
|---------|-------------|
| make start | Start all services |
| make stop | Stop all services |
| make restart | Restart all services |
| make status | Show service status |
| make logs | Show service logs |
| make health | Check all services health |
| make setup | Install Python dependencies |
| make format | Format code |
| make lint | Lint and type check |
| make test | Run tests |
| make test-cov | Run tests with coverage |
| make clean | Clean up everything |

Direct Commands (Alternative)

    # If you prefer using commands directly
    docker compose up --build -d    # Start services
    docker compose ps               # Check status
    docker compose logs             # View logs
    uv run pytest                   # Run tests

🎓 Target Audience

| Who | Why |
|-----|-----|
| AI/ML Engineers | Learn production RAG architecture beyond tutorials |
| Software Engineers | Build end-to-end AI applications with best practices |
| Data Scientists | Implement production AI systems using modern tools |

🛠️ Troubleshooting

Common Issues:

  • Services not starting? Wait 2-3 minutes, check docker compose logs
  • Port conflicts? Stop other services using ports 8000, 8080, 5432, 9200
  • Memory issues? Increase Docker Desktop memory allocation
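
To spot port conflicts before starting the stack, a quick check along these lines can help. This is a minimal sketch using only the standard library. The helper names are hypothetical, and the port list is taken from the Week 1 infrastructure components on the assumption that each service is published on the same host port.

```python
import socket
from typing import Dict, List

# Host ports listed in the Week 1 infrastructure components.
REQUIRED_PORTS: Dict[str, int] = {
    "FastAPI": 8000,
    "PostgreSQL": 5432,
    "OpenSearch": 9200,
    "OpenSearch Dashboards": 5601,
    "Airflow": 8080,
    "Ollama": 11434,
}


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising.
        return sock.connect_ex((host, port)) == 0


def find_conflicts(ports: Dict[str, int]) -> List[str]:
    """List the services whose port is already taken."""
    return [name for name, port in ports.items() if port_in_use(port)]
```

Run `find_conflicts(REQUIRED_PORTS)` while the stack is stopped. Any name it returns points to a port that another process is already holding. Once the stack is running, every entry is expected to show up as in use.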

Get Help:

  • Check the comprehensive Week 1 notebook troubleshooting section
  • Review service logs: docker compose logs [service-name]
  • Complete reset: docker compose down --volumes && docker compose up --build -d

💰 Cost Structure

This course is completely free! Any costs are minimal and apply only to optional services:

  • Local Development: $0 (everything runs locally)
  • Optional Cloud APIs: ~$2-5 for external LLM services (if chosen)

🎉 Ready to Start Your AI Engineering Journey?

Begin with the Week 1 setup notebook and build your first production RAG system!

For learners who want to master modern AI engineering

Built with love by Jam With AI


📄 License

MIT License - see LICENSE file for details.
