This is a production-ready system built in Rust that combines high-performance asynchronous processing, AI/ML integration with the Gemini API, multi-format document handling, and security practices.
Our API implements a multi-layered architecture to tackle the problem statement and all test cases.
## HackerXAPI Architecture

```
├── main.rs            (Interactive CLI)
├── server.rs          (API Gateway)
├── final_challenge.rs (Contest Logic)
├── AI Layer:
│   ├── embed.rs       (Vector Database Layer)
│   └── gemini.rs      (LLM Intelligence Layer)
├── Processing Layer:
│   ├── pdf.rs         (Document Processing)
│   └── ocr.rs         (OCR Pipeline)
└── MySQL              (Persistent Vector Store)
```

- Intelligent Document Processing: Handles a wide array of file types (PDF, DOCX, XLSX, PPTX, JPEG, PNG, TXT) with a robust fallback chain.
- High-Performance AI: Leverages the Gemini API with optimized chunking, parallel processing, and smart context filtering for fast, relevant responses.
- Enterprise-Grade Security: Features multi-layer security, including extensive prompt injection sanitization and parameterized SQL queries.
- Scalable Architecture: Built with a stateless design, `tokio` for async operations, and CPU-aware parallelization for horizontal scaling.
- Interactive Management: Includes a menu-driven CLI for easy server management, status monitoring, and graceful shutdowns.
The system is designed as a series of specialized layers, from the user-facing API and CLI down to the persistent database storage.
```mermaid
flowchart TD
    %% Entry Point
    A[main.rs CLI Menu] -->|Start Server| B[Axum Server :8000]
    A -->|Show Status| A2[Status Placeholder]
    A -->|Exit| A3[Program Exit]

    %% Server Request Handler
    B -->|POST /api/v1/hackrx/run| C[server::hackrx_run]
    C --> C1{Bearer Token Valid?}
    C1 -->|No| E401([401 Unauthorized])
    C1 -->|Yes| C2[generate_filename_from_url]

    %% Document Processing Pipeline
    C2 --> D1{File exists locally?}
    D1 -->|No| D2[download_file with extension validation]
    D1 -->|Yes| D3[Skip download]
    D2 --> D4[extract_file_text]
    D3 --> D4

    %% Multi-Format Text Extraction
    subgraph Extraction [Text Extraction Layer]
        D4 --> EXT1{File Extension?}
        EXT1 -->|PDF| EXT_PDF[Parallel PDF processing with pdftk/qpdf]
        EXT1 -->|DOCX| EXT_DOCX[ZIP extraction to XML parsing]
        EXT1 -->|XLSX| EXT_XLSX[Calamine spreadsheet to text]
        EXT1 -->|PPTX| EXT_PPTX[ImageMagick or LibreOffice to OCR]
        EXT1 -->|PNG/JPEG| EXT_IMG[Direct OCR with ocrs CLI]
        EXT1 -->|TXT| EXT_TXT[Token regex extraction]
        EXT_PDF --> TXT_OUT[Save to pdfs/filename.txt]
        EXT_DOCX --> TXT_OUT
        EXT_XLSX --> TXT_OUT
        EXT_PPTX --> TXT_OUT
        EXT_IMG --> TXT_OUT
        EXT_TXT --> TXT_OUT
    end

    %% Embeddings and Vector Storage
    TXT_OUT --> EMB_START[get_policy_chunk_embeddings]
    subgraph Embeddings [Vector Embeddings System]
        EMB_START --> EMB1{Embeddings exist in MySQL?}
        EMB1 -->|Yes| EMB_LOAD[Load from pdf_embeddings table]
        EMB1 -->|No| EMB_CHUNK[Chunk text into 33k char pieces]
        EMB_CHUNK --> EMB_API[Parallel Gemini Embedding API calls]
        EMB_API --> EMB_STORE[Batch store to MySQL]
        EMB_LOAD --> EMB_RETURN[Return chunk embeddings]
        EMB_STORE --> EMB_RETURN
    end

    %% Context-Aware Retrieval
    EMB_RETURN --> CTX_START[rewrite_policy_with_context]
    subgraph Context_RAG [Context Selection RAG]
        CTX_START --> CTX1[Embed combined questions]
        CTX1 --> CTX2[Cosine similarity calculation]
        CTX2 --> CTX3[Select top 10 relevant chunks]
        CTX3 --> CTX4[Write contextfiltered.txt]
    end

    %% Answer Generation
    CTX4 --> ANS_START[answer_questions]
    subgraph Answer_Gen [Answer Generation]
        ANS_START --> ANS1[Load filtered context]
        ANS1 --> ANS2[Sanitize against prompt injection]
        ANS2 --> ANS3[Gemini 2.0 Flash API call]
        ANS3 --> ANS4[Parse structured JSON response]
        ANS4 --> ANS_END[Extract answers array]
    end

    %% Final Response
    ANS_END --> SUCCESS([200 OK JSON Response])

    %% Error Handling
    C --> ERR_HANDLER[Error Handler]
    ERR_HANDLER --> ERR_RESPONSE([4xx/5xx Error Response])

    %% External Dependencies
    subgraph External [External Tools & Services]
        EXT_TOOLS[pdftk, qpdf, ImageMagick, LibreOffice, ocrs, pdftoppm]
        MYSQL_DB[(MySQL Database)]
        GEMINI_API[Google Gemini API]
    end
    Extraction -.-> EXT_TOOLS
    Embeddings -.-> MYSQL_DB
    Embeddings -.-> GEMINI_API
    Answer_Gen -.-> GEMINI_API
```

This layer handles all interactions with the AI model and vector embeddings, featuring performance optimizations and smart context filtering.
- Chunking Strategy: Text is split into `33,000`-character chunks, which is optimal for the Gemini API.
- Parallel Processing: Handles up to `50` concurrent requests using `futures::stream` for high throughput (see the sketch below).
- Database Caching: Caches embedding vectors in MySQL to avoid redundant and costly API calls.
- Batch Operations: Uses functions like `batch_store_pdf_embeddings` for efficient bulk database insertions.
- Top-K Retrieval: Fetches the top `10` most relevant document chunks for any given query.
- Similarity Threshold: Enforces a minimum relevance score of `0.5` (cosine similarity) to ensure context quality.
- Combined Query Embedding: Creates a single, more effective embedding when multiple user questions are asked at once.
- Prompt Injection Defense: Proactively sanitizes all user input against a list of over 22 known prompt injection patterns to protect the LLM.
```rust
// Cosine similarity with proper error handling
fn cosine_similarity(vec1: &[f32], vec2: &[f32]) -> f32 {
    let dot_product: f32 = vec1.iter().zip(vec2.iter()).map(|(a, b)| a * b).sum();
    let magnitude1: f32 = vec1.iter().map(|v| v * v).sum::<f32>().sqrt();
    let magnitude2: f32 = vec2.iter().map(|v| v * v).sum::<f32>().sqrt();
    // ... proper zero-magnitude handling
}
```
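To make the chunking and parallelism numbers above concrete, here is a minimal sketch of the approach: split the document into 33,000-character pieces and embed up to 50 of them concurrently with `futures::stream`. The `embed_chunk` helper is a hypothetical stand-in for the project's Gemini embedding call, not the real implementation.

```rust
use futures::stream::{self, StreamExt};

const CHUNK_SIZE: usize = 33_000; // characters per chunk
const PARALLEL_REQS: usize = 50;  // concurrent embedding requests

// Hypothetical stand-in for the Gemini embedding call.
async fn embed_chunk(_chunk: String) -> Vec<f32> {
    vec![0.0; 768] // the real implementation returns the API's embedding vector
}

/// Chunk a document and embed the chunks in parallel, keeping each
/// chunk's index so results can be stored in order.
async fn embed_document(text: &str) -> Vec<(usize, Vec<f32>)> {
    let chunks: Vec<String> = text
        .chars()
        .collect::<Vec<_>>()
        .chunks(CHUNK_SIZE)
        .map(|c| c.iter().collect())
        .collect();

    stream::iter(chunks.into_iter().enumerate())
        .map(|(i, chunk)| async move { (i, embed_chunk(chunk).await) })
        .buffer_unordered(PARALLEL_REQS) // at most 50 requests in flight
        .collect()
        .await
}
```

Because `buffer_unordered` yields results in completion order, the chunk index travels with each embedding so results can still be written back in document order.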
This component showcases enterprise-level security and reliability in its integration with the Gemini model.
```rust
fn sanitize_policy(content: &str) -> String {
    let dangerous_patterns = [
        r"(?i)ignore\s+previous\s+instructions",
        r"(?i)disregard\s+the\s+above",
        r"(?i)pretend\s+to\s+be",
        // ... 22 different injection patterns
    ];
    // Regex-based sanitization
}
```

- Structured Output: Enforces a JSON schema for consistent and predictable LLM responses (see the sketch after this list).
- Cache Busting: Uses UUIDs to ensure request uniqueness where needed.
- Response Validation: Implements multi-layer JSON parsing.
- Prompt Engineering: Constructs context-aware prompts for more accurate results.
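To illustrate the structured-output and prompt-construction points above, the sketch below assembles a `generateContent` request body whose `generationConfig` pins the reply to a JSON schema with an `answers` array. The prompt wording, the schema, and the field layout are illustrative assumptions, not a copy of the project's exact request.

```rust
use serde_json::json;

/// Build a generateContent request body that forces a JSON response shaped
/// as {"answers": ["...", "..."]}. Prompt text and schema are illustrative.
fn build_request(context: &str, questions: &[String]) -> serde_json::Value {
    json!({
        "contents": [{
            "parts": [{
                "text": format!(
                    "Answer the questions using only the policy context below.\n\nContext:\n{context}\n\nQuestions:\n{}",
                    questions.join("\n")
                )
            }]
        }],
        "generationConfig": {
            "responseMimeType": "application/json",
            "responseSchema": {
                "type": "OBJECT",
                "properties": {
                    "answers": { "type": "ARRAY", "items": { "type": "STRING" } }
                },
                "required": ["answers"]
            }
        }
    })
}
```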
The system supports the following file types for text extraction.

File Type Support Matrix:

| File Type | Extraction Path |
| --- | --- |
| PDF | Parallel PDF processing with pdftk/qpdf |
| DOCX | ZIP extraction and XML parsing |
| XLSX | Calamine spreadsheet-to-text conversion |
| PPTX | ImageMagick or LibreOffice conversion, then OCR |
| JPEG/PNG | Direct OCR with the ocrs CLI |
| TXT | Token regex extraction |
```rust
match ext.as_str() {
    "docx" => convert_docx_to_pdf(file_path)?,
    "xlsx" => convert_xlsx_to_pdf(file_path)?,
    "pdf" => extract_pdf_text_sync(file_path),
    "jpeg" | "png" => crate::ocr::extract_text_with_ocrs(file_path),
    "pptx" => extract_text_from_pptx(file_path),
    "txt" => extract_token_from_text(file_path),
}
```

- CPU-Aware Parallelization: Uses `num_cpus::get()` to spawn an optimal number of threads for processing.
- Memory-Safe Concurrency: Leverages `Arc<String>` for safe, shared ownership of data across parallel tasks.
- Chunk-based PDF Processing: Intelligently splits large PDFs into chunks to be processed in parallel across CPU cores.
- Tool Fallback Chain: Implements a resilient processing strategy, trying `pdftk`, then `qpdf`, and finally falling back to estimation if needed (a sketch of this chain follows the code below).
```rust
let page_ranges: Vec<(usize, usize)> = (0..num_cores)
    .map(|i| {
        let start = i * pages_per_chunk + 1;
        let end = ((i + 1) * pages_per_chunk).min(total_pages);
        (start, end)
    })
    .collect();
```
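As a sketch of the tool fallback chain, page counting might look like this: ask `pdftk` first, fall back to `qpdf`, and finally estimate from file size. The commands are standard invocations of those tools, but the estimation heuristic and function shape are illustrative assumptions rather than the project's literal code.

```rust
use std::process::Command;

/// Determine the page count of a PDF: pdftk, then qpdf, then a rough estimate.
fn count_pages(path: &str) -> usize {
    // 1. pdftk: `pdftk file.pdf dump_data` prints a "NumberOfPages: N" line.
    if let Ok(out) = Command::new("pdftk").args([path, "dump_data"]).output() {
        if out.status.success() {
            let text = String::from_utf8_lossy(&out.stdout);
            if let Some(n) = text
                .lines()
                .find_map(|l| l.strip_prefix("NumberOfPages: "))
                .and_then(|v| v.trim().parse().ok())
            {
                return n;
            }
        }
    }

    // 2. qpdf: `qpdf --show-npages file.pdf` prints the page count.
    if let Ok(out) = Command::new("qpdf").args(["--show-npages", path]).output() {
        if out.status.success() {
            if let Ok(n) = String::from_utf8_lossy(&out.stdout).trim().parse() {
                return n;
            }
        }
    }

    // 3. Last resort: estimate from file size (~50 KB per page, illustrative).
    std::fs::metadata(path)
        .map(|m| ((m.len() / 50_000) as usize).max(1))
        .unwrap_or(1)
}
```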
The system also uses OCR to parse text from images and PPTX files.

Multi-Tool Pipeline:
- Primary: `ImageMagick` direct conversion.
- Fallback: A `LibreOffice` → PDF → Images chain.
- OCR Engine: Uses `ocrs-cli` for the final text extraction.
- Format Chain: A dedicated PPTX → Images → OCR → Text chain.
Quality Optimization:
- DPI Settings: Balances quality vs. speed with a `150 DPI` setting (see the sketch after this list).
- Background Processing: Enforces a white background and alpha removal for better accuracy.
- Slide Preservation: Maintains original slide order and numbering throughout the process.
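A minimal sketch of how those quality settings translate into an ImageMagick invocation, assuming the slides have already been converted to a PDF; the exact flags and output naming used by the project may differ.

```rust
use std::process::Command;

/// Render a PDF (e.g. a LibreOffice-converted PPTX) to numbered PNGs for OCR.
/// Flags mirror the quality settings described above.
fn render_for_ocr(pdf_path: &str, out_dir: &str) -> std::io::Result<bool> {
    let status = Command::new("convert")
        .args([
            "-density", "150",      // 150 DPI: quality vs. speed trade-off
            pdf_path,
            "-background", "white", // force a white background
            "-alpha", "remove",     // strip transparency for cleaner OCR
        ])
        .arg(format!("{out_dir}/slide-%03d.png")) // numbering preserves slide order
        .status()?;
    Ok(status.success())
}
```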
The server implements intelligent request routing and security.
Security Middleware:
```rust
let auth = headers.get("authorization")
    .and_then(|value| value.to_str().ok());

if auth.is_none() || !auth.unwrap().starts_with("Bearer ") {
    return Err(StatusCode::UNAUTHORIZED);
}
```

- URL-to-Filename Generation: Intelligently detects file types from URLs (see the sketch after this list).
- Special Endpoint Handling: Dedicated logic for handling endpoints in documents.
- File Existence Checking: Avoids redundant downloads by checking for existing vectors in the database first.
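A minimal sketch of the URL-to-filename idea, assuming a hash-based naming scheme: take the extension from the URL path and derive a stable name from a hash of the full URL, so repeated requests for the same document reuse the cached file. The real `generate_filename_from_url` may differ in its details, and the `pdf` default here is an assumption.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derive a stable local filename from a document URL (illustrative).
fn generate_filename_from_url(url: &str) -> String {
    // Extension: suffix of the path, ignoring any query string.
    let path = url.split('?').next().unwrap_or(url);
    let ext = path
        .rsplit('.')
        .next()
        .filter(|e| ["pdf", "docx", "xlsx", "pptx", "jpeg", "png", "txt"].contains(e))
        .unwrap_or("pdf"); // assumed default when no known extension is found

    // Stable name: hash of the full URL, so the same URL maps to the same file.
    let mut hasher = DefaultHasher::new();
    url.hash(&mut hasher);
    format!("{:016x}.{ext}", hasher.finish())
}
```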
Advanced Features:
- Final Challenge Detection: Special handling for contest-specific files.
- Error Response Standardization: Returns errors in a consistent JSON format.
- Performance Monitoring: Includes request timing and logging for observability.
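To illustrate the standardized error responses mentioned above, an Axum handler can return a status code paired with a JSON body; the `error`/`detail` field names here are assumptions, not the project's actual payload shape.

```rust
use axum::{http::StatusCode, Json};
use serde_json::{json, Value};

/// Build a consistent JSON error payload (field names are illustrative).
fn error_response(status: StatusCode, detail: &str) -> (StatusCode, Json<Value>) {
    let body = json!({
        "error": status.canonical_reason().unwrap_or("error"),
        "detail": detail,
    });
    // `(StatusCode, Json<Value>)` implements IntoResponse, so handlers can
    // return this tuple directly on failure.
    (status, Json(body))
}
```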
This module provides a user-friendly, menu-driven interface for managing the server.
Menu-Driven Architecture:
- Graceful Shutdown: Handles `Ctrl+C` for proper cleanup before exiting.
- Server Management: Allows starting and stopping the server with status monitoring.
- Error Recovery: Robustly handles invalid user input without crashing.
Tokio Runtime Utilization:
```rust
tokio::task::spawn_blocking(move || extract_file_text_sync(&file_path)).await?
```

Concurrency Patterns:
- Stream Processing: Uses `buffer_unordered(PARALLEL_REQS)` for high-throughput, parallel stream processing.
- Future Composition: Employs `tokio::select!` for gracefully handling multiple asynchronous operations, such as a task and a shutdown signal (see the sketch after this list).
- Blocking Task Spawning: Correctly offloads CPU-bound work to a dedicated thread pool to avoid blocking the async runtime.
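A minimal sketch of the `tokio::select!` pattern referenced above: race the main task against a Ctrl+C signal so the process can exit gracefully. This illustrates the pattern only; the project's shutdown path also performs its own cleanup.

```rust
use tokio::signal;

async fn run_until_shutdown<F>(task: F)
where
    F: std::future::Future<Output = ()>,
{
    tokio::select! {
        // The main piece of work (e.g. serving requests).
        _ = task => {
            println!("task finished");
        }
        // Ctrl+C: break out and let RAII clean up temp files, pools, etc.
        _ = signal::ctrl_c() => {
            println!("shutdown signal received, exiting gracefully");
        }
    }
}
```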
Connection Pool Management:
```rust
static DB_POOL: Lazy<Pool> = Lazy::new(|| {
    let opts = Opts::from_url(&database_url).expect("Invalid database URL");
    Pool::new(opts).expect("Failed to create database pool")
});
```

Performance Optimizations:
- Batch Insertions: Commits multiple embeddings in a single transaction for efficiency (see the sketch after this list).
- Index Strategy: Uses dedicated indexes like `idx_pdf_filename` and `idx_chunk_index` for fast lookups.
- JSON Storage: Uses MySQL's native `JSON` data type for optimal embedding storage and retrieval.
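A sketch of what `batch_store_pdf_embeddings` boils down to, using the `mysql` crate's `exec_batch` so all chunk rows go out as one batched statement. The function signature and JSON serialization are assumptions; the table and column names follow the schema shown in the setup section.

```rust
use mysql::{params, prelude::Queryable, Pool};

/// Store a document's chunk embeddings in one batched statement (illustrative).
fn batch_store_pdf_embeddings(
    pool: &Pool,
    pdf_filename: &str,
    chunks: &[(usize, String, Vec<f32>)], // (chunk_index, chunk_text, embedding)
) -> Result<(), mysql::Error> {
    let mut conn = pool.get_conn()?;
    conn.exec_batch(
        "INSERT INTO pdf_embeddings (pdf_filename, chunk_text, chunk_index, embedding)
         VALUES (:pdf_filename, :chunk_text, :chunk_index, :embedding)",
        chunks.iter().map(|(idx, text, emb)| {
            params! {
                "pdf_filename" => pdf_filename,
                "chunk_text" => text.as_str(),
                "chunk_index" => *idx as u64,
                // Serialize the vector to JSON text for the JSON column.
                "embedding" => serde_json::to_string(emb).unwrap_or_default(),
            }
        }),
    )
}
```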
Rust Best Practices:
- RAII Pattern: Guarantees automatic cleanup of temporary files and other resources when they go out of scope.
- `Arc<T>`: Employs `Arc` for safe, shared ownership of data in parallel processing contexts.
- `Result<T, E>`: Uses comprehensive error propagation throughout the application for robust failure handling.
- `Option<T>`: Ensures null safety across the entire codebase.
- Input Sanitization: Defends against prompt injection attacks.
- File Type Validation: Uses a whitelist-based approach for processing file types.
- Payload Limits: Enforces request limits (e.g., 35 KB on embeddings) to stay within API limits; this cap can be removed for a significant performance gain.
- SQL Injection Prevention: Exclusively uses parameterized queries to protect the database.
Graceful Degradation:
- Tool Fallbacks: Implements a chain of multiple OCR and file conversion tools to maximize success rates.
- File Recovery: Reuses existing files to recover from partial processing failures.
- API Resilience: Provides proper HTTP status codes and clear, standardized error messages.
- Concurrent Embeddings: Processes up to 50 parallel requests; this is, of course, limited by API rate limits, and lifting the cap would improve performance greatly.
- Chunk Processing: Utilizes CPU-core-optimized parallel processing for large PDFs.
- Database & Caching: Leverages connection pooling and file caching to make the most efficient use of tokens and avoid redundant work.
- Relevance Filter: A `0.5` cosine similarity score is the minimum for context retrieval (see the sketch after this list).
- Context Window: Uses the top 10 chunks to provide optimal context to the LLM; a larger context window increases accuracy even further.
- OCR Quality: Balances speed and accuracy with a `150 DPI` setting.
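As a sketch of how the relevance filter and context window combine: score each chunk against the query embedding, keep those at or above the 0.5 threshold, and hand the 10 most similar ones to the LLM. The function shape is illustrative; the thresholds mirror the numbers above.

```rust
/// Keep chunks scoring at least 0.5 against the query embedding,
/// then select the 10 most similar ones as LLM context.
fn select_context(query: &[f32], chunks: &[(String, Vec<f32>)]) -> Vec<String> {
    let cosine = |a: &[f32], b: &[f32]| -> f32 {
        let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
        let (na, nb) = (
            a.iter().map(|x| x * x).sum::<f32>().sqrt(),
            b.iter().map(|x| x * x).sum::<f32>().sqrt(),
        );
        if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
    };

    let mut scored: Vec<(f32, &String)> = chunks
        .iter()
        .map(|(text, emb)| (cosine(query, emb.as_slice()), text))
        .filter(|(score, _)| *score >= 0.5) // relevance threshold
        .collect();

    // Highest similarity first.
    scored.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap_or(std::cmp::Ordering::Equal));

    scored.into_iter().take(10).map(|(_, t)| t.clone()).collect()
}
```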
- Stateless Design: Each request is independent, making it easy to scale and multithread.
- Observability: Includes comprehensive logging and timing measurements for every case.
- Configuration: All configuration is managed via environment variables for easy deployment.
- Resource Management: Temporary files are cleaned up automatically via the RAII pattern.
- API Standards: Adheres to RESTful design principles with proper HTTP semantics.
- Built in Rust: We chose Rust to make the API as fast as possible.
- Persistent Vector Store: The MySQL database suits company-level usage, where a document is queried constantly by both employees and clients.
- Handles All Documents: A chain of tools with fallbacks ensures that the system handles as many document types as possible.
- Context-Aware Embedding: Combines multiple questions into a single embedding for token efficiency.
- Prompt Injection Protection: Sanitizes all input against known injection patterns before it reaches the LLM.
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

The following command is for Debian/Ubuntu-based systems:
```bash
sudo apt-get update
sudo apt-get install pdftk-java qpdf poppler-utils libglib2.0-dev libcairo2-dev libpoppler-glib-dev bc libreoffice imagemagick
```

Install the OCR and helper tools via cargo:

```bash
cargo install miniserve
cargo install ocrs-cli --locked
```

Create a .env file from the example:
```bash
cp .envexample .env
```

Create a MySQL database and run the following schema:
```sql
CREATE TABLE pdf_embeddings (
    id INTEGER PRIMARY KEY AUTO_INCREMENT,
    pdf_filename VARCHAR(255) NOT NULL,
    chunk_text TEXT NOT NULL,
    chunk_index INTEGER NOT NULL,
    embedding JSON NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_pdf_filename (pdf_filename),
    INDEX idx_chunk_index (chunk_index)
);
```

Then, populate your .env file with the database connection string and your Gemini API key:
```bash
MYSQL_CONNECTION=mysql://username:password@localhost:3306/your_database
GEMINI_KEY=your_gemini_api_key
```

Start the server:

```bash
cargo run
```

The repository includes three scripts with various payloads to test the API with different document types:
```bash
./test.sh
./sim.sh
./simr4.sh
```

Prerequisites:

- Rust (latest stable)
- MySQL database
- Google Gemini API key
- System packages for document processing (listed in step 2)
- OCR tools for image text extraction (listed in step 3)