Pronnoy Goswami edited this page Mar 7, 2025 · 3 revisions

🚀 RAG Data Ingestion Pipeline for ML Workloads

Overview

This repository provides a scalable and modular pipeline for ingesting large-scale datasets into vector databases to power Retrieval-Augmented Generation (RAG) applications. The pipeline is optimized for handling millions of records, enabling fast, efficient similarity search to enhance LLM applications.

Key Features

✅ Parallel Embedding Generation: Uses Ray to distribute computation across multiple GPUs and CPUs.
✅ Vector Storage in OpenSearch: Implements Hierarchical Navigable Small World (HNSW) indexing for fast approximate nearest neighbor (ANN) search.
✅ Vector Storage in PostgreSQL (pgvector): Supports exact k-NN retrieval for searches that require precise results.
✅ Optimized ETL Pipeline: Converts large-scale unstructured text data into vector embeddings efficiently.
✅ Scalability & Performance: Designed to handle millions of records for large-scale ML workloads.

🔗 How to Use

Follow these steps to set up and run the pipeline:

1️⃣ Convert Raw Data to Parquet Format

To improve processing speed and reduce storage overhead, convert your JSONL data into Parquet format:

python src/convert.py

2️⃣ Generate Vector Embeddings with Ray

Leverage Ray for distributed embedding generation across multiple GPUs:

python src/embeddings.py
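The fan-out pattern behind this step can be sketched as below. The encoder here is a deterministic stand-in (a real pipeline would call a GPU model such as a sentence-transformer, and could pin each task to a GPU with `@ray.remote(num_gpus=1)`); the batch shape and 384-dimension are illustrative assumptions:

```python
import hashlib

import numpy as np

def embed_batch(texts, dim=384):
    """Stand-in encoder: one deterministic unit-length vector per input text."""
    out = []
    for text in texts:
        seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)
        vec = np.random.default_rng(seed).standard_normal(dim)
        out.append(vec / np.linalg.norm(vec))
    return np.stack(out)

def embed_with_ray(batches, dim=384):
    """Fan batches out as Ray tasks and gather the results."""
    import ray  # deferred import so the sketch reads without Ray installed
    ray.init(ignore_reinit_error=True)
    remote_embed = ray.remote(embed_batch)
    futures = [remote_embed.remote(batch, dim) for batch in batches]
    return np.concatenate(ray.get(futures))
```

The key idea is that each batch is an independent task, so throughput scales roughly with the number of workers until the model or I/O becomes the bottleneck.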

3️⃣ Store Embeddings in OpenSearch & PostgreSQL

Index the generated embeddings into OpenSearch and PostgreSQL (pgvector):

python src/opensearch_store.py
python src/pgvector_store.py
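The shape of the two stores can be sketched as follows. The index body targets the OpenSearch k-NN plugin's HNSW method and the DDL targets the pgvector extension; field names, the table name, and the 384-d dimension are illustrative assumptions, and the actual client calls (`opensearchpy.helpers.bulk`, a psycopg `executemany`) are left out:

```python
# Hypothetical HNSW-backed index body for the OpenSearch k-NN plugin.
INDEX_BODY = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "embedding": {
                "type": "knn_vector",
                "dimension": 384,
                "method": {"name": "hnsw", "space_type": "cosinesimil"},
            },
        }
    },
}

# Matching pgvector DDL: exact k-NN works without any index;
# an optional IVFFlat index trades a little recall for speed.
PGVECTOR_DDL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS documents (
    id bigint PRIMARY KEY,
    text text,
    embedding vector(384)
);
"""

def to_bulk_actions(index_name, records):
    """Yield action dicts consumable by opensearchpy.helpers.bulk()."""
    for i, rec in enumerate(records):
        yield {"_index": index_name, "_id": i, "_source": rec}
```

Keeping both stores lets the ANN index serve latency-sensitive traffic while pgvector answers queries that need exact results.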

4️⃣ Query for Contextual Document Retrieval

Run queries to fetch similar documents for RAG-based applications:

python src/query.py
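A sketch of the two query shapes involved (the `embedding` field and `documents` table are assumptions carried over from the storage step; the repository's `src/query.py` may differ):

```python
def opensearch_knn_query(vector, k=5):
    """ANN query body for the OpenSearch k-NN plugin."""
    return {"size": k, "query": {"knn": {"embedding": {"vector": vector, "k": k}}}}

# Exact k-NN in pgvector: <=> is the cosine-distance operator
# (<-> would give L2 distance). Parameters are bound by the DB driver.
PGVECTOR_KNN_SQL = (
    "SELECT id, text FROM documents "
    "ORDER BY embedding <=> %s::vector LIMIT %s"
)
```

In a RAG application the top-k hits from either store become the context passed to the LLM prompt.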

Project Structure

rag-data-ingestion-pipeline/
│-- data/
│   │-- raw/
│   │   ├── data.jsonl
│   │-- processed/
│   │   ├── data.parquet
│-- src/
│   │-- convert.py             # Converts JSONL to Parquet
│   │-- embeddings.py          # Handles embedding generation with Ray
│   │-- opensearch_store.py    # Stores embeddings in OpenSearch
│   │-- pgvector_store.py      # Stores embeddings in PostgreSQL
│   │-- query.py               # Queries vector databases for retrieval
│   │-- pipeline.py            # Main script to run ingestion pipeline
│-- requirements.txt           # Python dependencies
│-- README.md                  # Project documentation

🔥 Performance Insights

  • Ray speeds up embedding generation by distributing workload across GPUs.
  • OpenSearch provides fast ANN search, while pgvector ensures precise k-NN retrieval.
  • Batching queries reduces latency: bulk retrieval is significantly faster than per-query execution.
  • Proper index configuration (HNSW in OpenSearch, IVF in pgvector) enhances performance.
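The batching point above can be sketched with plain Python: group query bodies into chunks and send each chunk as one newline-delimited `_msearch` request, so a whole batch travels in a single round trip (the index name and chunk size are illustrative):

```python
import json

def chunked(items, size):
    """Split a list of queries into fixed-size batches."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def msearch_payload(index_name, query_bodies):
    """Build one OpenSearch _msearch request body from many query bodies:
    alternating header/query lines, newline-delimited, trailing newline."""
    lines = []
    for body in query_bodies:
        lines.append(json.dumps({"index": index_name}))
        lines.append(json.dumps(body))
    return "\n".join(lines) + "\n"
```

Amortizing connection and scheduling overhead across a batch is where the bulk-retrieval speedup comes from.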

🚀 Future Enhancements

  • Support for other vector databases like Pinecone and FAISS.
  • Integration with streaming data sources for real-time ingestion.
  • Advanced index tuning for even faster retrieval.

📒 Contributing

We welcome contributions! Feel free to submit PRs, suggest improvements, or open issues. Let's build scalable ML infrastructure together. 🔥

📜 License

This project is licensed under the MIT License.


Have feedback or ideas? Let's discuss! 🚀