Pronnoy Goswami edited this page Mar 7, 2025 · 3 revisions

🚀 RAG Data Ingestion Pipeline for ML Workloads

Overview

This repository provides a scalable and modular pipeline for ingesting large-scale datasets into vector databases to power Retrieval-Augmented Generation (RAG) applications. The pipeline is optimized for handling millions of records, enabling fast, efficient similarity search to enhance LLM applications.

Key Features

✅ Parallel Embedding Generation: Uses Ray to distribute computation across multiple GPUs and CPUs.
✅ Vector Storage in OpenSearch: Implements Hierarchical Navigable Small World (HNSW) indexing for fast approximate nearest neighbor (ANN) search.
✅ Vector Storage in PostgreSQL (pgvector): Supports exact k-NN retrieval for searches that require precise results.
✅ Optimized ETL Pipeline: Converts large-scale unstructured text data into vector embeddings efficiently.
✅ Scalability & Performance: Designed to handle millions of records for large-scale ML workloads.

🔗 How to Use

Follow these steps to set up and run the pipeline:

1️⃣ Convert Raw Data to Parquet Format

To improve processing speed and reduce storage overhead, convert your JSONL data into Parquet format:

python src/convert.py

2️⃣ Generate Vector Embeddings with Ray

Leverage Ray for distributed embedding generation across multiple GPUs:

python src/embeddings.py
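The fan-out pattern behind this step can be sketched as below. The encoder here is a deterministic stand-in (a real pipeline would call a GPU model such as a sentence-transformer, and could pin each task to a GPU with `@ray.remote(num_gpus=1)`); the batch shape and 384-dimension are illustrative assumptions:

```python
import hashlib

import numpy as np

def embed_batch(texts, dim=384):
    """Stand-in encoder: one deterministic unit-length vector per input text."""
    out = []
    for text in texts:
        seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)
        vec = np.random.default_rng(seed).standard_normal(dim)
        out.append(vec / np.linalg.norm(vec))
    return np.stack(out)

def embed_with_ray(batches, dim=384):
    """Fan batches out as Ray tasks and gather the results."""
    import ray  # deferred import so the sketch reads without Ray installed
    ray.init(ignore_reinit_error=True)
    remote_embed = ray.remote(embed_batch)
    futures = [remote_embed.remote(batch, dim) for batch in batches]
    return np.concatenate(ray.get(futures))
```

The key idea is that each batch is an independent task, so throughput scales roughly with the number of workers until the model or I/O becomes the bottleneck.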

3️⃣ Store Embeddings in OpenSearch & PostgreSQL

Index the generated embeddings into OpenSearch and PostgreSQL (pgvector):

python src/opensearch_store.py
python src/pgvector_store.py
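The shape of the two stores can be sketched as follows. The index body targets the OpenSearch k-NN plugin's HNSW method and the DDL targets the pgvector extension; field names, the table name, and the 384-d dimension are illustrative assumptions, and the actual client calls (`opensearchpy.helpers.bulk`, a psycopg `executemany`) are left out:

```python
# Hypothetical HNSW-backed index body for the OpenSearch k-NN plugin.
INDEX_BODY = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "embedding": {
                "type": "knn_vector",
                "dimension": 384,
                "method": {"name": "hnsw", "space_type": "cosinesimil"},
            },
        }
    },
}

# Matching pgvector DDL: exact k-NN works without any index;
# an optional IVFFlat index trades a little recall for speed.
PGVECTOR_DDL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS documents (
    id bigint PRIMARY KEY,
    text text,
    embedding vector(384)
);
"""

def to_bulk_actions(index_name, records):
    """Yield action dicts consumable by opensearchpy.helpers.bulk()."""
    for i, rec in enumerate(records):
        yield {"_index": index_name, "_id": i, "_source": rec}
```

Keeping both stores lets the ANN index serve latency-sensitive traffic while pgvector answers queries that need exact results.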

4️⃣ Query for Contextual Document Retrieval

Run queries to fetch similar documents for RAG-based applications:

python src/query.py
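A sketch of the two query shapes involved (the `embedding` field and `documents` table are assumptions carried over from the storage step; the repository's `src/query.py` may differ):

```python
def opensearch_knn_query(vector, k=5):
    """ANN query body for the OpenSearch k-NN plugin."""
    return {"size": k, "query": {"knn": {"embedding": {"vector": vector, "k": k}}}}

# Exact k-NN in pgvector: <=> is the cosine-distance operator
# (<-> would give L2 distance). Parameters are bound by the DB driver.
PGVECTOR_KNN_SQL = (
    "SELECT id, text FROM documents "
    "ORDER BY embedding <=> %s::vector LIMIT %s"
)
```

In a RAG application the top-k hits from either store become the context passed to the LLM prompt.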

Project Structure

rag-data-ingestion-pipeline/
│-- data/
│   │-- raw/
│   │   ├── data.jsonl
│   │-- processed/
│   │   ├── data.parquet
│-- src/
│   │-- convert.py             # Converts JSONL to Parquet
│   │-- embeddings.py          # Handles embedding generation with Ray
│   │-- opensearch_store.py    # Stores embeddings in OpenSearch
│   │-- pgvector_store.py      # Stores embeddings in PostgreSQL
│   │-- query.py               # Queries vector databases for retrieval
│   │-- pipeline.py            # Main script to run ingestion pipeline
│-- requirements.txt           # Python dependencies
│-- README.md                  # Project documentation

🔥 Performance Insights

  • Ray speeds up embedding generation by distributing workload across GPUs.
  • OpenSearch provides fast ANN search, while pgvector ensures precise k-NN retrieval.
  • Batching queries reduces latency: bulk retrieval is significantly faster than per-query execution.
  • Proper index configuration (HNSW in OpenSearch, IVF in pgvector) enhances performance.
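The batching point above can be sketched with plain Python: group query bodies into chunks and send each chunk as one newline-delimited `_msearch` request, so a whole batch travels in a single round trip (the index name and chunk size are illustrative):

```python
import json

def chunked(items, size):
    """Split a list of queries into fixed-size batches."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def msearch_payload(index_name, query_bodies):
    """Build one OpenSearch _msearch request body from many query bodies:
    alternating header/query lines, newline-delimited, trailing newline."""
    lines = []
    for body in query_bodies:
        lines.append(json.dumps({"index": index_name}))
        lines.append(json.dumps(body))
    return "\n".join(lines) + "\n"
```

Amortizing connection and scheduling overhead across a batch is where the bulk-retrieval speedup comes from.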

🚀 Future Enhancements

  • Support for other vector databases like Pinecone and FAISS.
  • Integration with streaming data sources for real-time ingestion.
  • Advanced index tuning for even faster retrieval.

📒 Contributing

We welcome contributions! Feel free to submit PRs, suggest improvements, or open issues. Let's build scalable ML infrastructure together. 🔥

📜 License

This project is licensed under the MIT License.


Have feedback or ideas? Let's discuss! 🚀