Learn to build modern AI systems from the ground up through hands-on implementation
Master one of the most in-demand AI engineering skills: RAG (Retrieval-Augmented Generation)
This is a learner-focused project where you'll build a complete research assistant system that automatically fetches academic papers, understands their content, and answers your research questions using advanced RAG techniques.
The arXiv Paper Curator will teach you to build a production-grade RAG system using industry best practices. You'll master the architecture, implementation, and deployment of AI systems that professionals use in the real world.
By the end of this course, you'll have your own AI research assistant and the skills to build similar systems for any domain.
- Docker Desktop (with Docker Compose)
- Python 3.12+
- UV Package Manager (Install Guide)
- 8GB+ RAM and 20GB+ free disk space
```bash
# 1. Clone and set up
git clone <repository-url>
cd arxiv-paper-curator
uv sync

# 2. Start all services
docker compose up --build -d

# 3. Verify everything works
curl http://localhost:8000/health
```

| Week | Topic | Blog Post | Code Release |
|---|---|---|---|
| Week 0 | The Mother of AI project - 6 phases | The Mother of AI project | - |
| Week 1 | Infrastructure Foundation | The Infrastructure That Powers RAG Systems | week1.0 |
| Week 2 | Data Ingestion Pipeline | Building Data Ingestion Pipelines for RAG | week2.0 |
| Week 3 | Hybrid Search Implementation | Coming Soon | Coming Soon |
| Week 4 | Advanced Chunking & Retrieval | Coming Soon | Coming Soon |
| Week 5 | Full RAG Pipeline | Coming Soon | Coming Soon |
| Week 6 | Production Deployment | Coming Soon | Coming Soon |
📥 Clone a specific week's release:
```bash
# Clone a specific week's code
git clone --branch <WEEK_TAG> https://github.com/jamwithai/arxiv-paper-curator
cd arxiv-paper-curator
uv sync
docker compose down -v
docker compose up --build -d

# Replace <WEEK_TAG> with: week1.0, week2.0, etc.
```

| Service | URL | Purpose |
|---|---|---|
| API Documentation | http://localhost:8000/docs | Interactive API testing |
| Airflow Dashboard | http://localhost:8080 | Workflow management |
| OpenSearch Dashboards | http://localhost:5601 | Hybrid search engine UI |
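To confirm that the endpoints listed above respond after startup, a small stdlib-only script can poll each one. This is a minimal sketch, not part of the repository: the `/health` path comes from the `curl` check in the quick start, while `SERVICES`, `http_ok`, `wait_for_services`, and the attempt and delay values are hypothetical names and defaults chosen for illustration.

```python
import time
import urllib.error
import urllib.request

# URLs from the quick start and the service table; adjust if you remapped ports.
SERVICES = {
    "api": "http://localhost:8000/health",
    "airflow": "http://localhost:8080",
    "opensearch-dashboards": "http://localhost:5601",
}


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx or 3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP 4xx/5xx all count as "not ready".
        return False


def wait_for_services(services, fetch=http_ok, attempts=10, delay=5.0, sleep=time.sleep):
    """Poll until every service is up or attempts run out; return name -> bool."""
    status = {name: False for name in services}
    for attempt in range(attempts):
        for name, url in services.items():
            if not status[name]:  # skip services that already answered
                status[name] = fetch(url)
        if all(status.values()):
            break
        if attempt < attempts - 1:
            sleep(delay)
    return status
```

Calling `wait_for_services(SERVICES)` once `docker compose up --build -d` has returned yields a dictionary such as `{"api": True, ...}`. The `fetch` and `sleep` parameters exist only so the polling logic can be exercised without real services.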
Start here! Master the infrastructure that powers modern RAG systems.
- Complete infrastructure setup with Docker Compose
- FastAPI development with automatic documentation and health checks
- PostgreSQL database configuration and management
- OpenSearch hybrid search engine setup
- Ollama local LLM service configuration
- Service orchestration and health monitoring
- Professional development environment with code quality tools
Infrastructure Components:
- FastAPI: REST endpoints with async support (Port 8000)
- PostgreSQL 16: Paper metadata storage (Port 5432)
- OpenSearch 2.19: Search engine with dashboards (Ports 9200, 5601)
- Apache Airflow 3.0: Workflow orchestration (Port 8080)
- Ollama: Local LLM server (Port 11434)
```bash
# Launch the Week 1 notebook
uv run jupyter notebook notebooks/week1/week1_setup.ipynb
```

Complete when you can:
- Start all services with `docker compose up -d`
- Access API docs at http://localhost:8000/docs
- Log in to Airflow at http://localhost:8080
- Browse OpenSearch at http://localhost:5601
- All tests pass: `uv run pytest`
Blog Post: The Infrastructure That Powers RAG Systems - Detailed walkthrough and production insights
Building on Week 1 infrastructure: Learn to fetch, process, and store academic papers automatically.
- arXiv API integration with rate limiting and retry logic
- Scientific PDF parsing using Docling
- Automated data ingestion pipelines with Apache Airflow
- Metadata extraction and storage workflows
- Complete paper processing from API to database
Data Pipeline Components:
- MetadataFetcher: 🎯 Main orchestrator coordinating the entire pipeline
- ArxivClient: Rate-limited paper fetching with retry logic
- PDFParserService: Docling-powered scientific document processing
- Airflow DAGs: Automated daily paper ingestion workflows
- PostgreSQL Storage: Structured paper metadata and content
```bash
# Launch the Week 2 notebook
uv run jupyter notebook notebooks/week2/week2_arxiv_integration.ipynb
```

arXiv API Integration:
```python
# Example: Fetch papers with rate limiting
from src.services.arxiv.factory import make_arxiv_client

async def fetch_recent_papers():
    client = make_arxiv_client()
    papers = await client.search_papers(
        query="cat:cs.AI",
        max_results=10,
        from_date="20240801",
        to_date="20240807"
    )
    return papers
```

PDF Processing Pipeline:
```python
# Example: Parse PDF with Docling
from src.services.pdf_parser.factory import make_pdf_parser_service

async def process_paper_pdf(pdf_url: str):
    parser = make_pdf_parser_service()
    parsed_content = await parser.parse_pdf_from_url(pdf_url)
    return parsed_content  # Structured content with text, tables, figures
```

Complete Ingestion Workflow:
```python
# Example: Full paper ingestion pipeline
from src.services.metadata_fetcher import make_metadata_fetcher

async def ingest_papers():
    fetcher = make_metadata_fetcher()
    results = await fetcher.fetch_and_store_papers(
        query="cat:cs.AI",
        max_results=5,
        from_date="20240807"
    )
    return results  # Papers stored in database with full content
```

Complete when you can:
- Fetch papers from arXiv API: Test in Week 2 notebook
- Parse PDF content with Docling: View extracted structured content
- Run Airflow DAG: `arxiv_paper_ingestion` executes successfully
- Verify database storage: Papers appear in PostgreSQL with full content
- API endpoints work: `/papers` returns stored papers with metadata
Blog Post: Building Data Ingestion Pipelines for RAG - arXiv API integration and PDF processing
Building on Weeks 1-2 foundation: Advanced RAG techniques and production deployment.
- Week 3: OpenSearch hybrid search implementation with BM25 + semantic vectors
- Week 4: Context-aware chunking and retrieval evaluation with nDCG metrics
- Week 5: Full RAG pipeline with LLM integration and prompt optimization
- Week 6: Observability with Langfuse, A/B testing, and production deployment
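Two of the ideas above fit in a few lines of plain Python. This sketch is illustrative only: it merges a keyword ranking and a vector ranking with reciprocal rank fusion (RRF), which is one common way to combine BM25 and semantic results and not necessarily the method the course uses, then scores the fused list with nDCG. The document IDs, relevance grades, and the function names `rrf_fuse`, `dcg`, and `ndcg_at_k` are made up for the example.

```python
import math


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Combine ranked ID lists: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def dcg(gains: list[float]) -> float:
    """Discounted cumulative gain: the gain at position i is divided by log2(i + 1)."""
    return sum(gain / math.log2(position + 1) for position, gain in enumerate(gains, start=1))


def ndcg_at_k(ranked_ids: list[str], relevance: dict[str, float], k: int) -> float:
    """DCG of the top-k results divided by the DCG of the ideal ordering."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


bm25_hits = ["p1", "p2", "p3"]    # hypothetical keyword (BM25) ranking
vector_hits = ["p3", "p1", "p4"]  # hypothetical embedding ranking
fused = rrf_fuse([bm25_hits, vector_hits])
relevance = {"p1": 3, "p3": 2, "p4": 1}  # hypothetical graded judgments
print(fused, round(ndcg_at_k(fused, relevance, k=3), 3))
```

Documents that appear in both rankings (`p1`, `p3`) rise to the top of the fused list, and nDCG falls below 1.0 whenever a relevant document such as `p4` lands outside the top k.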
| Service | Purpose | Status |
|---|---|---|
| FastAPI | REST API with automatic docs | ✅ Ready |
| PostgreSQL 16 | Paper metadata and content storage | ✅ Ready |
| OpenSearch 2.19 | Hybrid search engine | ✅ Ready |
| Apache Airflow 3.0 | Workflow automation | ✅ Ready |
| Ollama | Local LLM serving | ✅ Ready |
Development Tools: UV, Ruff, MyPy, Pytest, Docker Compose
```
arxiv-paper-curator/
├── src/                                    # Main application code
│   ├── main.py                             # FastAPI application
│   ├── routers/                            # API endpoints
│   ├── models/                             # Database models (SQLAlchemy)
│   ├── repositories/                       # Data access layer
│   ├── schemas/                            # Pydantic validation schemas
│   ├── services/                           # Business logic
│   │   ├── arxiv/                          # ✨ NEW: arXiv API client
│   │   ├── pdf_parser/                     # ✨ NEW: Docling PDF processing
│   │   ├── metadata_fetcher.py             # ✨ NEW: Complete ingestion pipeline
│   │   └── ollama/                         # Ollama LLM service
│   ├── db/                                 # Database configuration
│   ├── config.py                           # Environment configuration
│   └── dependencies.py                     # Dependency injection
│
├── notebooks/                              # Learning materials
│   ├── week1/                              # Week 1: Infrastructure setup
│   │   └── week1_setup.ipynb               # Complete setup guide
│   └── week2/                              # ✨ NEW: Week 2 materials
│       └── week2_data_ingestion.ipynb      # Data pipeline guide
│
├── airflow/                                # Workflow orchestration
│   ├── dags/                               # Workflow definitions
│   │   ├── arxiv_ingestion/                # ✨ NEW: arXiv ingestion modules
│   │   └── arxiv_paper_ingestion.py        # ✨ NEW: Main ingestion DAG
│   └── requirements-airflow.txt            # ✨ NEW: Airflow dependencies
│
├── tests/                                  # Comprehensive test suite
├── static/                                 # Assets (images, GIFs)
└── compose.yml                             # Service orchestration
```

```bash
# View all available commands
make help

# Quick workflow
make start    # Start all services
make health   # Check all services health
make test     # Run tests
make stop     # Stop services
```

| Command | Description |
|---|---|
| `make start` | Start all services |
| `make stop` | Stop all services |
| `make restart` | Restart all services |
| `make status` | Show service status |
| `make logs` | Show service logs |
| `make health` | Check all services health |
| `make setup` | Install Python dependencies |
| `make format` | Format code |
| `make lint` | Lint and type check |
| `make test` | Run tests |
| `make test-cov` | Run tests with coverage |
| `make clean` | Clean up everything |
```bash
# If you prefer using commands directly
docker compose up --build -d   # Start services
docker compose ps              # Check status
docker compose logs            # View logs
uv run pytest                  # Run tests
```

| Who | Why |
|---|---|
| AI/ML Engineers | Learn production RAG architecture beyond tutorials |
| Software Engineers | Build end-to-end AI applications with best practices |
| Data Scientists | Implement production AI systems using modern tools |
Common Issues:
- Services not starting? Wait 2-3 minutes, then check `docker compose logs`
- Port conflicts? Stop other services using ports 8000, 8080, 5432, 9200
- Memory issues? Increase Docker Desktop memory allocation
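For the port-conflict case, it can help to see which of those ports are already taken before starting the stack. The following is a minimal sketch using only the standard library. The port numbers come from the list above, while `DEFAULT_PORTS`, `port_in_use`, and `busy_ports` are hypothetical helper names, not part of the project.

```python
import socket

# Host ports named in the port-conflict tip above.
DEFAULT_PORTS = (8000, 8080, 5432, 9200)


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


def busy_ports(ports=DEFAULT_PORTS) -> list[int]:
    """List the ports from `ports` that are already taken."""
    return [port for port in ports if port_in_use(port)]
```

Run `busy_ports()` while the stack is stopped: any port it returns belongs to another process and would need to be freed or remapped before `docker compose up`.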
Get Help:
- Check the comprehensive Week 1 notebook troubleshooting section
- Review service logs: `docker compose logs [service-name]`
- Complete reset: `docker compose down --volumes && docker compose up --build -d`
This course is completely free! The only costs are minimal and come from optional services:
- Local Development: $0 (everything runs locally)
- Optional Cloud APIs: ~$2-5 for external LLM services (if chosen)
Begin with the Week 1 setup notebook and build your first production RAG system!
For learners who want to master modern AI engineering
Built with love by Jam With AI
MIT License - see LICENSE file for details.

