The Mother of AI Project

Phase 1 RAG Systems: arXiv Paper Curator

A Learner-Focused Journey into Production RAG Systems

Learn to build modern AI systems from the ground up through hands-on implementation

Master the most in-demand AI engineering skills: RAG (Retrieval-Augmented Generation)

📖 About This Course

This is a learner-focused project where you'll build a complete research assistant system that automatically fetches academic papers, understands their content, and answers your research questions using advanced RAG techniques.

The arXiv Paper Curator will teach you to build a production-grade RAG system using industry best practices. You'll master the architecture, implementation, and deployment of AI systems that professionals use in the real world.

By the end of this course, you'll have your own AI research assistant and the skills to build similar systems for any domain.

🚀 Quick Start

📋 Prerequisites

Docker Desktop (with Docker Compose)
Python 3.12+
UV Package Manager (Install Guide)
8GB+ RAM and 20GB+ free disk space

⚡ Get Started

# 1. Clone and setup git clone <repository-url> cd arxiv-paper-curator uv sync # 2. Start all services docker compose up --build -d # 3. Verify everything works curl http://localhost:8000/health

📚 Weekly Learning Path

Week	Topic	Blog Post	Code Release
Week 0	The Mother of AI project - 6 phases	The Mother of AI project	-
Week 1	Infrastructure Foundation	The Infrastructure That Powers RAG Systems	week1.0
Week 2	Data Ingestion Pipeline	Building Data Ingestion Pipelines for RAG	week2.0
Week 3	Hybrid Search Implementation	Coming Soon	Coming Soon
Week 4	Advanced Chunking & Retrieval	Coming Soon	Coming Soon
Week 5	Full RAG Pipeline	Coming Soon	Coming Soon
Week 6	Production Deployment	Coming Soon	Coming Soon

📥 Clone a specific week's release:

# Clone a specific week's code git clone --branch <WEEK_TAG> https://github.com/jamwithai/arxiv-paper-curator cd arxiv-paper-curator uv sync docker compose down -v docker compose up --build -d # Replace <WEEK_TAG> with: week1.0, week2.0, etc.

📊 Access Your Services

Service	URL	Purpose
API Documentation	http://localhost:8000/docs	Interactive API testing
Airflow Dashboard	http://localhost:8080	Workflow management
OpenSearch Dashboards	http://localhost:5601	Hybrid search engine UI

NOTE: Check airflow/simple_auth_manager_passwords.json.generated for Airflow username and password

📚 Week 1: Infrastructure Foundation ✅

Start here! Master the infrastructure that powers modern RAG systems.

🎯 Learning Objectives

Complete infrastructure setup with Docker Compose
FastAPI development with automatic documentation and health checks
PostgreSQL database configuration and management
OpenSearch hybrid search engine setup
Ollama local LLM service configuration
Service orchestration and health monitoring
Professional development environment with code quality tools

🏗️ Architecture Overview

Infrastructure Components:

FastAPI: REST endpoints with async support (Port 8000)
PostgreSQL 16: Paper metadata storage (Port 5432)
OpenSearch 2.19: Search engine with dashboards (Ports 9200, 5601)
Apache Airflow 3.0: Workflow orchestration (Port 8080)
Ollama: Local LLM server (Port 11434)

📓 Setup Guide

# Launch the Week 1 notebook uv run jupyter notebook notebooks/week1/week1_setup.ipynb

✅ Success Criteria

Complete when you can:

Start all services with docker compose up -d
Access API docs at http://localhost:8000/docs
Login to Airflow at http://localhost:8080
Browse OpenSearch at http://localhost:5601
All tests pass: uv run pytest

📖 Deep Dive

Blog Post: The Infrastructure That Powers RAG Systems - Detailed walkthrough and production insights

📚 Week 2: Data Ingestion Pipeline ✅

Building on Week 1 infrastructure: Learn to fetch, process, and store academic papers automatically.

🎯 Learning Objectives

arXiv API integration with rate limiting and retry logic
Scientific PDF parsing using Docling
Automated data ingestion pipelines with Apache Airflow
Metadata extraction and storage workflows
Complete paper processing from API to database

🏗️ Architecture Overview

Data Pipeline Components:

MetadataFetcher: 🎯 Main orchestrator coordinating the entire pipeline
ArxivClient: Rate-limited paper fetching with retry logic
PDFParserService: Docling-powered scientific document processing
Airflow DAGs: Automated daily paper ingestion workflows
PostgreSQL Storage: Structured paper metadata and content

📓 Implementation Guide

# Launch the Week 2 notebook  uv run jupyter notebook notebooks/week2/week2_arxiv_integration.ipynb

💻 Code Examples

arXiv API Integration:

# Example: Fetch papers with rate limiting from src.services.arxiv.factory import make_arxiv_client async def fetch_recent_papers(): client = make_arxiv_client() papers = await client.search_papers( query="cat:cs.AI", max_results=10, from_date="20240801", to_date="20240807" ) return papers

PDF Processing Pipeline:

# Example: Parse PDF with Docling from src.services.pdf_parser.factory import make_pdf_parser_service async def process_paper_pdf(pdf_url: str): parser = make_pdf_parser_service() parsed_content = await parser.parse_pdf_from_url(pdf_url) return parsed_content # Structured content with text, tables, figures

Complete Ingestion Workflow:

# Example: Full paper ingestion pipeline from src.services.metadata_fetcher import make_metadata_fetcher async def ingest_papers(): fetcher = make_metadata_fetcher() results = await fetcher.fetch_and_store_papers( query="cat:cs.AI", max_results=5, from_date="20240807" ) return results # Papers stored in database with full content

✅ Success Criteria

Complete when you can:

Fetch papers from arXiv API: Test in Week 2 notebook
Parse PDF content with Docling: View extracted structured content
Run Airflow DAG: arxiv_paper_ingestion executes successfully
Verify database storage: Papers appear in PostgreSQL with full content
API endpoints work: /papers returns stored papers with metadata

📖 Deep Dive

Blog Post: Building Data Ingestion Pipelines for RAG - arXiv API integration and PDF processing

📚 Future Weeks: Complete RAG System

Building on Weeks 1-2 foundation: Advanced RAG techniques and production deployment.

Future Weeks Overview (6-Week Course)

Week 3: OpenSearch hybrid search implementation with BM25 + semantic vectors
Week 4: Context-aware chunking and retrieval evaluation with nDCG metrics
Week 5: Full RAG pipeline with LLM integration and prompt optimization
Week 6: Observability with Langfuse, A/B testing, and production deployment

🔧 Reference & Development Guide

🛠️ Technology Stack

Service	Purpose	Status
FastAPI	REST API with automatic docs	✅ Ready
PostgreSQL 16	Paper metadata and content storage	✅ Ready
OpenSearch 2.19	Hybrid search engine	✅ Ready
Apache Airflow 3.0	Workflow automation	✅ Ready
Ollama	Local LLM serving	✅ Ready

Development Tools: UV, Ruff, MyPy, Pytest, Docker Compose

🏗️ Project Structure

arxiv-paper-curator/ ├── src/ # Main application code │ ├── main.py # FastAPI application │ ├── routers/ # API endpoints │ ├── models/ # Database models (SQLAlchemy) │ ├── repositories/ # Data access layer │ ├── schemas/ # Pydantic validation schemas │ ├── services/ # Business logic │ │ ├── arxiv/ # ✨ NEW: arXiv API client │ │ ├── pdf_parser/ # ✨ NEW: Docling PDF processing │ │ ├── metadata_fetcher.py # ✨ NEW: Complete ingestion pipeline │ │ └── ollama/ # Ollama LLM service │ ├── db/ # Database configuration │ ├── config.py # Environment configuration │ └── dependencies.py # Dependency injection │ ├── notebooks/ # Learning materials │ ├── week1/ # Week 1: Infrastructure setup │ │ └── week1_setup.ipynb # Complete setup guide │ └── week2/ # ✨ NEW: Week 2 materials │ └── week2_data_ingestion.ipynb # Data pipeline guide │ ├── airflow/ # Workflow orchestration │ ├── dags/ # Workflow definitions │ │ ├── arxiv_ingestion/ # ✨ NEW: arXiv ingestion modules │ │ └── arxiv_paper_ingestion.py # ✨ NEW: Main ingestion DAG │ └── requirements-airflow.txt # ✨ NEW: Airflow dependencies │ ├── tests/ # Comprehensive test suite ├── static/ # Assets (images, GIFs) └── compose.yml # Service orchestration

🔧 Essential Commands

Using the Makefile (Recommended)

# View all available commands make help # Quick workflow make start # Start all services make health # Check all services health make test # Run tests make stop # Stop services

All Available Commands

Command	Description
`make start`	Start all services
`make stop`	Stop all services
`make restart`	Restart all services
`make status`	Show service status
`make logs`	Show service logs
`make health`	Check all services health
`make setup`	Install Python dependencies
`make format`	Format code
`make lint`	Lint and type check
`make test`	Run tests
`make test-cov`	Run tests with coverage
`make clean`	Clean up everything

Direct Commands (Alternative)

# If you prefer using commands directly docker compose up --build -d # Start services docker compose ps # Check status docker compose logs # View logs uv run pytest # Run tests

🎓 Target Audience

Who	Why
AI/ML Engineers	Learn production RAG architecture beyond tutorials
Software Engineers	Build end-to-end AI applications with best practices
Data Scientists	Implement production AI systems using modern tools

🛠️ Troubleshooting

Common Issues:

Services not starting? Wait 2-3 minutes, check docker compose logs
Port conflicts? Stop other services using ports 8000, 8080, 5432, 9200
Memory issues? Increase Docker Desktop memory allocation

Get Help:

Check the comprehensive Week 1 notebook troubleshooting section
Review service logs: docker compose logs [service-name]
Complete reset: docker compose down --volumes && docker compose up --build -d

💰 Cost Structure

This course is completely free! You'll only need minimal costs for optional services:

Local Development: $0 (everything runs locally)
Optional Cloud APIs: ~$2-5 for external LLM services (if chosen)

🎉 Ready to Start Your AI Engineering Journey?

Begin with the Week 1 setup notebook and build your first production RAG system!

For learners who want to master modern AI engineering

Built with love by Jam With AI

📄 License

MIT License - see LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
airflow		airflow
notebooks		notebooks
src		src
static		static
tests		tests
.env.example		.env.example
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
Dockerfile		Dockerfile
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
compose.yml		compose.yml
pyproject.toml		pyproject.toml
uv.lock		uv.lock

License

codejay12/arxiv-paper-curator

Folders and files

Latest commit

History

Repository files navigation

The Mother of AI Project

Phase 1 RAG Systems: arXiv Paper Curator

A Learner-Focused Journey into Production RAG Systems

📖 About This Course

🚀 Quick Start

📋 Prerequisites

⚡ Get Started

📚 Weekly Learning Path

📊 Access Your Services

NOTE: Check airflow/simple_auth_manager_passwords.json.generated for Airflow username and password

📚 Week 1: Infrastructure Foundation ✅

🎯 Learning Objectives

🏗️ Architecture Overview

📓 Setup Guide

✅ Success Criteria

📖 Deep Dive

📚 Week 2: Data Ingestion Pipeline ✅

🎯 Learning Objectives

🏗️ Architecture Overview

📓 Implementation Guide

💻 Code Examples

✅ Success Criteria

📖 Deep Dive

📚 Future Weeks: Complete RAG System

Future Weeks Overview (6-Week Course)

🔧 Reference & Development Guide

🛠️ Technology Stack

🏗️ Project Structure

🔧 Essential Commands

Using the Makefile (Recommended)

All Available Commands

Direct Commands (Alternative)

🎓 Target Audience

🛠️ Troubleshooting

💰 Cost Structure

🎉 Ready to Start Your AI Engineering Journey?

📄 License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages