
HackerXAPI - Built for HackRx πŸš€

This is a production-ready system built in Rust that combines high-performance asynchronous processing, AI/ML integration via the Gemini API, multi-format document handling, and multi-layer security.

πŸ—οΈ System Architecture Overview

Our API implements a multi-layered architecture to tackle the problem statement and all test cases.


Architecture

```text
HackerXAPI Architecture
β”œβ”€β”€ main.rs            (Interactive CLI)
β”œβ”€β”€ server.rs          (API Gateway)
β”œβ”€β”€ final_challenge.rs (Contest Logic)
β”œβ”€β”€ AI Layer
β”‚   β”œβ”€β”€ ai/embed.rs    (Vector Database Layer)
β”‚   └── ai/gemini.rs   (LLM Intelligence Layer)
β”œβ”€β”€ Processing Layer
β”‚   β”œβ”€β”€ pdf.rs         (Document Processing)
β”‚   └── ocr.rs         (OCR Pipeline)
└── MySQL              (Persistent Vector Store)
```

✨ Features

  • Intelligent Document Processing: Handles a wide array of file types (PDF, DOCX, XLSX, PPTX, JPEG, PNG, TXT) with a robust fallback chain.
  • High-Performance AI: Leverages the Gemini API with optimized chunking, parallel processing, and smart context filtering for fast, relevant responses.
  • Enterprise-Grade Security: Features multi-layer security, including extensive prompt injection sanitization and parameterized SQL queries.
  • Scalable Architecture: Built with a stateless design for horizontal scaling, tokio for async operations, and CPU-aware parallelization.
  • Interactive Management: Includes a menu-driven CLI for easy server management, status monitoring, and graceful shutdowns.

πŸ›οΈ Architecture Overview

The system is designed as a series of specialized layers, from the user-facing API and CLI down to the persistent database storage.

```mermaid
flowchart TD
    %% Entry Point
    A[main.rs CLI Menu] -->|Start Server| B[Axum Server :8000]
    A -->|Show Status| A2[Status Placeholder]
    A -->|Exit| A3[Program Exit]

    %% Server Request Handler
    B -->|POST /api/v1/hackrx/run| C[server::hackrx_run]
    C --> C1{Bearer Token Valid?}
    C1 -->|No| E401([401 Unauthorized])
    C1 -->|Yes| C2[generate_filename_from_url]

    %% Document Processing Pipeline
    C2 --> D1{File exists locally?}
    D1 -->|No| D2[download_file with extension validation]
    D1 -->|Yes| D3[Skip download]
    D2 --> D4[extract_file_text]
    D3 --> D4

    %% Multi-Format Text Extraction
    subgraph Extraction [Text Extraction Layer]
        D4 --> EXT1{File Extension?}
        EXT1 -->|PDF| EXT_PDF[Parallel PDF processing with pdftk/qpdf]
        EXT1 -->|DOCX| EXT_DOCX[ZIP extraction to XML parsing]
        EXT1 -->|XLSX| EXT_XLSX[Calamine spreadsheet to text]
        EXT1 -->|PPTX| EXT_PPTX[ImageMagick or LibreOffice to OCR]
        EXT1 -->|PNG/JPEG| EXT_IMG[Direct OCR with ocrs CLI]
        EXT1 -->|TXT| EXT_TXT[Token regex extraction]
        EXT_PDF --> TXT_OUT[Save to pdfs/filename.txt]
        EXT_DOCX --> TXT_OUT
        EXT_XLSX --> TXT_OUT
        EXT_PPTX --> TXT_OUT
        EXT_IMG --> TXT_OUT
        EXT_TXT --> TXT_OUT
    end

    %% Embeddings and Vector Storage
    TXT_OUT --> EMB_START[get_policy_chunk_embeddings]
    subgraph Embeddings [Vector Embeddings System]
        EMB_START --> EMB1{Embeddings exist in MySQL?}
        EMB1 -->|Yes| EMB_LOAD[Load from pdf_embeddings table]
        EMB1 -->|No| EMB_CHUNK[Chunk text into 33k char pieces]
        EMB_CHUNK --> EMB_API[Parallel Gemini Embedding API calls]
        EMB_API --> EMB_STORE[Batch store to MySQL]
        EMB_LOAD --> EMB_RETURN[Return chunk embeddings]
        EMB_STORE --> EMB_RETURN
    end

    %% Context-Aware Retrieval
    EMB_RETURN --> CTX_START[rewrite_policy_with_context]
    subgraph Context_RAG [Context Selection RAG]
        CTX_START --> CTX1[Embed combined questions]
        CTX1 --> CTX2[Cosine similarity calculation]
        CTX2 --> CTX3[Select top 10 relevant chunks]
        CTX3 --> CTX4[Write contextfiltered.txt]
    end

    %% Answer Generation
    CTX4 --> ANS_START[answer_questions]
    subgraph Answer_Gen [Answer Generation]
        ANS_START --> ANS1[Load filtered context]
        ANS1 --> ANS2[Sanitize against prompt injection]
        ANS2 --> ANS3[Gemini 2.0 Flash API call]
        ANS3 --> ANS4[Parse structured JSON response]
        ANS4 --> ANS_END[Extract answers array]
    end

    %% Final Response
    ANS_END --> SUCCESS([200 OK JSON Response])

    %% Error Handling
    C --> ERR_HANDLER[Error Handler]
    ERR_HANDLER --> ERR_RESPONSE([4xx/5xx Error Response])

    %% External Dependencies
    subgraph External [External Tools & Services]
        EXT_TOOLS[pdftk, qpdf, ImageMagick, LibreOffice, ocrs, pdftoppm]
        MYSQL_DB[(MySQL Database)]
        GEMINI_API[Google Gemini API]
    end
    Extraction -.-> EXT_TOOLS
    Embeddings -.-> MYSQL_DB
    Embeddings -.-> GEMINI_API
    Answer_Gen -.-> GEMINI_API
```

πŸ”§ Core Components

🧠 ai - AI & Embedding Layer

This layer handles all interactions with the AI model and vector embeddings, featuring performance optimizations and smart context filtering.

Performance Optimizations

  • Chunking Strategy: Text is split into 33,000-character chunks, keeping each request within the Gemini embedding API's payload limit.
  • Parallel Processing: Handles up to 50 concurrent requests using futures::stream for high throughput (see the sketch after this list).
  • Database Caching: Caches embedding vectors in MySQL (native JSON data type) to avoid redundant and costly API calls.
  • Batch Operations: Uses functions like batch_store_pdf_embeddings for efficient bulk database insertions.
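
As a rough illustration of how chunking and parallel embedding fit together, here is a minimal sketch built on futures::stream. The constants mirror the numbers above; embed_chunk and its error type are hypothetical stand-ins for the project's Gemini embedding wrapper.

```rust
use futures::stream::{self, StreamExt};

const CHUNK_SIZE: usize = 33_000; // characters per chunk, as described above
const PARALLEL_REQS: usize = 50;  // concurrent embedding calls

type EmbedResult = Result<Vec<f32>, Box<dyn std::error::Error + Send + Sync>>;

// Hypothetical stand-in for the project's Gemini embedding call.
async fn embed_chunk(chunk: String) -> EmbedResult {
    todo!("POST `chunk` to the Gemini embedding endpoint")
}

async fn embed_all(text: &str) -> Vec<EmbedResult> {
    // Split on char boundaries so multi-byte UTF-8 text cannot panic.
    let chunks: Vec<String> = text
        .chars()
        .collect::<Vec<_>>()
        .chunks(CHUNK_SIZE)
        .map(|c| c.iter().collect())
        .collect();

    // Up to PARALLEL_REQS embedding requests are in flight at once.
    stream::iter(chunks)
        .map(embed_chunk)
        .buffer_unordered(PARALLEL_REQS)
        .collect()
        .await
}
```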

Smart Context Filtering

  • Top-K Retrieval: Fetches the top 10 most relevant document chunks for any given query (sketched after this list).
  • Similarity Threshold: Enforces a minimum relevance score of 0.5 (cosine similarity) to ensure context quality.
  • Combined Query Embedding: Creates a single, more effective embedding when multiple user questions are asked at once.
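
The filtering described above amounts to a score-filter-sort pass. This is a minimal sketch, assuming chunks arrive as (text, embedding) pairs; it relies on the cosine_similarity function shown under "Advanced Vector Operations" below.

```rust
// Score every chunk, drop anything under the 0.5 threshold, and keep the
// K best, highest similarity first.
fn top_k_chunks<'a>(
    query: &[f32],
    chunks: &'a [(String, Vec<f32>)], // (chunk_text, embedding)
    k: usize,
) -> Vec<&'a str> {
    let mut scored: Vec<(f32, &str)> = chunks
        .iter()
        .map(|(text, emb)| (cosine_similarity(query, emb), text.as_str()))
        .filter(|(score, _)| *score >= 0.5) // similarity threshold
        .collect();
    scored.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap_or(std::cmp::Ordering::Equal));
    scored.into_iter().take(k).map(|(_, text)| text).collect()
}
```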

Enterprise-Level Security (gemini.rs)

  • Prompt Injection Defense: Proactively sanitizes all user input against a list of 22 known prompt injection patterns before it ever reaches the LLM.

Advanced Vector Operations

```rust
// Cosine similarity with proper zero-magnitude handling.
fn cosine_similarity(vec1: &[f32], vec2: &[f32]) -> f32 {
    let dot_product: f32 = vec1.iter().zip(vec2.iter()).map(|(a, b)| a * b).sum();
    let magnitude1: f32 = vec1.iter().map(|v| v * v).sum::<f32>().sqrt();
    let magnitude2: f32 = vec2.iter().map(|v| v * v).sum::<f32>().sqrt();
    // Zero-length vectors have no direction; report zero similarity
    // instead of dividing by zero.
    if magnitude1 == 0.0 || magnitude2 == 0.0 {
        return 0.0;
    }
    dot_product / (magnitude1 * magnitude2)
}
```

gemini.rs - LLM Integration Layer

This component showcases enterprise-level security and reliability in its integration with the Gemini model.

Security Features

```rust
fn sanitize_policy(content: &str) -> String {
    let dangerous_patterns = [
        r"(?i)ignore\s+previous\s+instructions",
        r"(?i)disregard\s+the\s+above",
        r"(?i)pretend\s+to\s+be",
        // ... 22 different injection patterns
    ];
    // Regex-based sanitization: strip every matching pattern from the input.
    let mut sanitized = content.to_string();
    for pattern in dangerous_patterns {
        let re = regex::Regex::new(pattern).expect("pattern is valid");
        sanitized = re.replace_all(&sanitized, "").into_owned();
    }
    sanitized
}
```

Advanced API Patterns

  • Structured Output: Enforces a JSON schema for consistent and predictable LLM responses (see the sketch after this list).
  • Cache Busting: Uses UUIDs to ensure request uniqueness where needed.
  • Response Validation: Implements multi-layer JSON parsing.
  • Prompt Engineering: Constructs context-aware prompts for more accurate results.
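
To make the structured-output pattern concrete, here is a sketch of a Gemini generateContent request that pins the response to a JSON schema. The prompt layout, the schema fields, and the ask_gemini helper are illustrative; only the responseMimeType/responseSchema mechanism is the Gemini API's own.

```rust
use serde_json::json;

// Hypothetical helper: ask Gemini 2.0 Flash for a strictly-JSON answer payload.
async fn ask_gemini(
    api_key: &str,
    context: &str,
    questions: &str,
) -> Result<serde_json::Value, reqwest::Error> {
    let body = json!({
        "contents": [{
            "parts": [{ "text": format!("{context}\n\nQuestions:\n{questions}") }]
        }],
        // Structured output: the model must return JSON matching this schema.
        "generationConfig": {
            "responseMimeType": "application/json",
            "responseSchema": {
                "type": "OBJECT",
                "properties": {
                    "answers": { "type": "ARRAY", "items": { "type": "STRING" } }
                },
                "required": ["answers"]
            }
        }
    });

    reqwest::Client::new()
        .post(format!(
            "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key={api_key}"
        ))
        .json(&body)
        .send()
        .await?
        .json()
        .await
}
```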

πŸ“„ Document Processing Pipeline

The system supports the following file types for text extraction, dispatching on the file extension:

```rust
let text = match ext.as_str() {
    "docx" => convert_docx_to_pdf(file_path)?,
    "xlsx" => convert_xlsx_to_pdf(file_path)?,
    "pdf" => extract_pdf_text_sync(file_path),
    "jpeg" | "png" => crate::ocr::extract_text_with_ocrs(file_path),
    "pptx" => extract_text_from_pptx(file_path),
    "txt" => extract_token_from_text(file_path),
    // Anything else is rejected by the extension whitelist (illustrative arm).
    other => return Err(format!("unsupported file type: {other}").into()),
};
```

Performance Engineering

  • CPU-Aware Parallelization: Uses num_cpus::get() to spawn an optimal number of threads for processing.
  • Memory-Safe Concurrency: Leverages Arc<String> for safe, shared ownership of data across parallel tasks.
  • Chunk-based PDF Processing: Intelligently splits large PDFs into chunks to be processed in parallel across CPU cores.
  • Tool Fallback Chain: Implements a resilient processing strategy, trying pdftk, then qpdf, and finally falling back to estimation if needed (see the sketch after this list).
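
The following is a minimal sketch of that fallback chain for one concrete task, counting a PDF's pages; the command-line invocations are standard pdftk/qpdf usage, while the function itself is illustrative.

```rust
use std::process::Command;

// Try pdftk, then qpdf; a `None` result tells the caller to fall back
// to estimation.
fn count_pages(path: &str) -> Option<usize> {
    // `pdftk file.pdf dump_data` prints a "NumberOfPages: N" line.
    if let Ok(out) = Command::new("pdftk").args([path, "dump_data"]).output() {
        if out.status.success() {
            let text = String::from_utf8_lossy(&out.stdout);
            if let Some(line) = text.lines().find(|l| l.starts_with("NumberOfPages:")) {
                return line.split(':').nth(1)?.trim().parse().ok();
            }
        }
    }
    // `qpdf --show-npages file.pdf` prints just the page count.
    if let Ok(out) = Command::new("qpdf").args(["--show-npages", path]).output() {
        if out.status.success() {
            return String::from_utf8_lossy(&out.stdout).trim().parse().ok();
        }
    }
    None
}
```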

PDF Processing

```rust
let page_ranges: Vec<(usize, usize)> = (0..num_cores)
    .map(|i| {
        let start = i * pages_per_chunk + 1;
        let end = ((i + 1) * pages_per_chunk).min(total_pages);
        (start, end)
    })
    .collect();
```
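
Each (start, end) range is then handed to its own worker, so a large PDF is split and text-extracted in parallel across all available cores.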

Optical Character Recognition

The system also uses OCR to extract text from images and PPTX files.

Multi-Tool Pipeline:

  • Primary: ImageMagick direct conversion.
  • Fallback: A LibreOffice β†’ PDF β†’ Images chain.
  • OCR Engine: Uses ocrs-cli for the final text extraction.
  • Format Chain: A dedicated PPTX β†’ Images β†’ OCR β†’ Text chain (sketched after the next list).

Quality Optimization:

  • DPI Settings: Balances quality vs. speed with a 150 DPI setting.
  • Background Processing: Enforces a white background and alpha removal for better accuracy.
  • Slide Preservation: Maintains original slide order and numbering throughout the process.
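
Here is a minimal sketch of the LibreOffice fallback path of that chain, under the assumption that ocrs prints recognized text to stdout; file naming and the pptx_to_text helper are illustrative.

```rust
use std::process::Command;

// PPTX β†’ PDF β†’ 150 DPI PNGs β†’ OCR, keeping slides in their original order.
fn pptx_to_text(pptx: &str, workdir: &str) -> std::io::Result<String> {
    // 1. Convert the deck to PDF with headless LibreOffice.
    Command::new("libreoffice")
        .args(["--headless", "--convert-to", "pdf", "--outdir", workdir, pptx])
        .status()?;

    // 2. Rasterize each page at 150 DPI (the quality/speed balance above).
    let stem = std::path::Path::new(pptx)
        .file_stem()
        .and_then(|s| s.to_str())
        .unwrap_or("slides");
    let pdf = format!("{workdir}/{stem}.pdf");
    Command::new("pdftoppm")
        .args(["-r", "150", "-png", &pdf, &format!("{workdir}/slide")])
        .status()?;

    // 3. OCR each slide image in sorted order to preserve slide numbering.
    let mut pages: Vec<_> = std::fs::read_dir(workdir)?
        .filter_map(|entry| entry.ok().map(|e| e.path()))
        .filter(|p| p.extension().is_some_and(|ext| ext == "png"))
        .collect();
    pages.sort();
    let mut text = String::new();
    for page in pages {
        let out = Command::new("ocrs").arg(&page).output()?;
        text.push_str(&String::from_utf8_lossy(&out.stdout));
        text.push('\n');
    }
    Ok(text)
}
```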

🌐 Server Architecture & API Design

The server implements intelligent request routing and security.

Security Middleware:

```rust
let auth = headers
    .get("authorization")
    .and_then(|value| value.to_str().ok());

if auth.is_none() || !auth.unwrap().starts_with("Bearer ") {
    return Err(StatusCode::UNAUTHORIZED);
}
```

  • URL-to-Filename Generation: Intelligently detects file types from URLs (sketched after this list).
  • Special Endpoint Handling: Dedicated logic for handling endpoints in documents.
  • File Existence Checking: Avoids redundant downloads by checking for existing vectors in the database first.
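
A minimal sketch of the URL-to-filename idea: take the last path segment, ignoring the query string. The helper name and the default are hypothetical; the real generate_filename_from_url also validates the extension against the whitelist.

```rust
fn filename_from_url(url: &str) -> String {
    // Drop any query string or fragment before inspecting the path.
    let path = url.split(['?', '#']).next().unwrap_or(url);
    path.rsplit('/')
        .next()
        .filter(|name| !name.is_empty())
        .unwrap_or("document.pdf") // illustrative default
        .to_string()
}
```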

Advanced Features:

  • Final Challenge Detection: Special handling for contest-specific files.
  • Error Response Standardization: Returns errors in a consistent JSON format.
  • Performance Monitoring: Includes request timing and logging for observability.

main.rs - Interactive CLI

The main.rs module provides a user-friendly, menu-driven interface for managing the server.

Menu-Driven Architecture:

  • Graceful Shutdown: Handles Ctrl+C for proper cleanup before exiting.
  • Server Management: Allows starting and stopping the server with status monitoring.
  • Error Recovery: Robustly handles invalid user input without crashing.

πŸš€ Advanced Technical Patterns

Async Programming Mastery

Tokio Runtime Utilization:

```rust
// Offload CPU-bound extraction to the blocking thread pool so async
// worker threads stay free to serve requests.
tokio::task::spawn_blocking(move || extract_file_text_sync(&file_path)).await?
```

Concurrency Patterns:

  • Stream Processing: Uses buffer_unordered(PARALLEL_REQS) for high-throughput, parallel stream processing.
  • Future Composition: Employs tokio::select! for gracefully handling multiple asynchronous operations, such as a task and a shutdown signal (see the sketch after this list).
  • Blocking Task Spawning: Correctly offloads CPU-bound work to a dedicated thread pool to avoid blocking the async runtime.
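
A minimal sketch of the select!-based shutdown pattern; run_server stands in for the real Axum serve future.

```rust
use std::error::Error;

async fn run_server() -> Result<(), Box<dyn Error>> {
    todo!("bind the Axum server on :8000")
}

#[tokio::main]
async fn main() {
    // Race the server future against Ctrl+C so either path exits cleanly.
    tokio::select! {
        result = run_server() => {
            if let Err(e) = result {
                eprintln!("server error: {e}");
            }
        }
        _ = tokio::signal::ctrl_c() => {
            println!("Ctrl+C received, shutting down gracefully");
        }
    }
}
```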

Database Architecture

Connection Pool Management:

```rust
static DB_POOL: Lazy<Pool> = Lazy::new(|| {
    // Connection string comes from the environment (see the setup section).
    let database_url =
        std::env::var("MYSQL_CONNECTION").expect("MYSQL_CONNECTION must be set");
    let opts = Opts::from_url(&database_url).expect("Invalid database URL");
    Pool::new(opts).expect("Failed to create database pool")
});
```

Performance Optimizations:

  • Batch Insertions: Commits multiple embeddings in a single transaction for efficiency (see the sketch after this list).
  • Index Strategy: Uses dedicated indexes like idx_pdf_filename and idx_chunk_index for fast lookups.
  • JSON Storage: Uses MySQL's native JSON data type for optimal embedding storage and retrieval.
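
A sketch of what such a batch insertion can look like with the mysql crate's exec_batch, in the spirit of batch_store_pdf_embeddings; the function signature and the serde_json-based serialization are assumptions, while the column names follow the schema in the setup section.

```rust
use mysql::params;
use mysql::prelude::*;

fn store_embeddings(
    conn: &mut mysql::PooledConn,
    filename: &str,
    chunks: &[(usize, String, Vec<f32>)], // (chunk_index, chunk_text, embedding)
) -> mysql::Result<()> {
    // One prepared statement, executed for every chunk in a single batch.
    conn.exec_batch(
        r"INSERT INTO pdf_embeddings (pdf_filename, chunk_text, chunk_index, embedding)
          VALUES (:pdf_filename, :chunk_text, :chunk_index, :embedding)",
        chunks.iter().map(|(idx, text, emb)| params! {
            "pdf_filename" => filename,
            "chunk_text" => text.as_str(),
            "chunk_index" => *idx,
            // Vectors go into the native JSON column as serialized arrays.
            "embedding" => serde_json::to_string(emb).expect("vector serializes"),
        }),
    )
}
```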

Memory Management & Safety

Rust Best Practices:

  • RAII Pattern: Guarantees automatic cleanup of temporary files and other resources when they go out of scope.
  • Arc<T>: Employs Arc for safe, shared ownership of data in parallel processing contexts.
  • Result<T, E>: Uses comprehensive error propagation throughout the application for robust failure handling.
  • Option<T>: Ensures null safety across the entire codebase.

πŸ›‘οΈ Security & Reliability Features

Multi-Layer Security

  • Input Sanitization: Defends against prompt injection attacks.
  • File Type Validation: Uses a whitelist-based approach for processing file types (sketched after this list).
  • Payload Limits: Enforces request size limits (e.g., 35 KB on embedding payloads) to stay within API limits; the cap can be raised for a significant performance gain if your quota allows.
  • SQL Injection Prevention: Exclusively uses parameterized queries to protect the database.
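
The whitelist check itself is simple; a minimal sketch, with the accepted set mirroring the support matrix above:

```rust
// Only extensions the pipeline can actually process are accepted.
const ALLOWED_EXTENSIONS: &[&str] = &["pdf", "docx", "xlsx", "pptx", "jpeg", "png", "txt"];

fn is_allowed(ext: &str) -> bool {
    ALLOWED_EXTENSIONS.contains(&ext.to_ascii_lowercase().as_str())
}
```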

Error Handling Strategy

Graceful Degradation:

  • Tool Fallbacks: Implements a chain of multiple OCR and file conversion tools to maximize success rates.
  • File Recovery: Reuses existing files to recover from partial processing failures.
  • API Resilience: Provides proper HTTP status codes and clear, standardized error messages.

πŸ“Š Performance Characteristics

Scalability Metrics

  • Concurrent Embeddings: Processes up to 50 parallel requests; this ceiling exists because of Gemini API rate limits, and raising it improves throughput if your quota allows.
  • Chunk Processing: Utilizes CPU-core optimized parallel processing for large PDFs.
  • Database & Caching: Leverages connection pooling and file/embedding caching to avoid redundant work and wasted tokens.

Quality Thresholds

  • Relevance Filter: A 0.5 cosine similarity score is the minimum for context retrieval.
  • Context Window: Uses the top 10 chunks to provide context to the LLM; a larger window can raise accuracy further at the cost of more tokens per request.
  • OCR Quality: Balances speed and accuracy with a 150 DPI setting.

🎯 Production-Ready Features

  • Stateless Design: Each request is independent, making the service easy to scale horizontally.
  • Observability: Includes comprehensive logging and timing measurements for every request.
  • Configuration: All configuration is managed via environment variables for easy deployment.
  • Resource Management: Temporary files are cleaned up automatically via the RAII pattern.
  • API Standards: Adheres to RESTful design principles with proper HTTP semantics.

What Is Unique Here?

  • Built in Rust: We chose Rust to make the API as fast as possible.
  • Persistent Vector Store: The MySQL Database is perfect for company level usage of the system, where a document is queried constantly by both employees and clients.
  • Handles all Documents: A chain of tools with fallbacks ensures that the system handles as many document types as possible.
  • Context-Aware Embedding: Combines multiple questions into a single embedding for token efficiency.
  • Prompt Injection Protection: Sanitizes user-supplied content against known injection patterns before it reaches the LLM.

Get It Running

1. Install Rust

```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

2. Install System Dependencies

This command is for Debian/Ubuntu-based systems.

```bash
sudo apt-get update
sudo apt-get install pdftk-java qpdf poppler-utils libglib2.0-dev libcairo2-dev libpoppler-glib-dev bc libreoffice imagemagick
```

3. Install Rust Tools

```bash
cargo install miniserve
cargo install ocrs-cli --locked
```

4. Configure Environment

Create a .env file from the example:

```bash
cp .envexample .env
```

5. Setup Database

Create a MySQL database and run the following schema:

```sql
CREATE TABLE pdf_embeddings (
    id INTEGER PRIMARY KEY AUTO_INCREMENT,
    pdf_filename VARCHAR(255) NOT NULL,
    chunk_text TEXT NOT NULL,
    chunk_index INTEGER NOT NULL,
    embedding JSON NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_pdf_filename (pdf_filename),
    INDEX idx_chunk_index (chunk_index)
);
```

Then, populate your .env file with the database connection string and your Gemini API key:

```bash
MYSQL_CONNECTION=mysql://username:password@localhost:3306/your_database
GEMINI_KEY=your_gemini_api_key
```

6. Run the Application

```bash
cargo run
```

7. Testing

The repository includes three scripts with various payloads to test the API with different document types:

```bash
./test.sh
./sim.sh
./simr4.sh
```

πŸ”§ Requirements

  • Rust (latest stable)
  • MySQL database
  • Google Gemini API key
  • System packages for document processing (listed in step 2)
  • OCR tools for image text extraction (listed in step 3)