Scalable GenAI Invoice PDF Data Extractor

One-liner:
AI-powered backend system to extract structured data from multi-page invoices and PDFs and Images at scale, flag errors, and store results efficiently.

Features

Batch & Async Processing: Upload multiple invoices; files queued in RabbitMQ.
High Concurrency: Single worker can process ~100 files in parallel; scale horizontally with multiple workers.
GenAI-Powered Extraction: Extracts vendor details, invoice metadata, line items, totals, and payment info.
Error Detection: Flags mismatches in subtotal, sales tax, shipping, and total.
Multi-format Support: PDFs, images, scanned, and multi-page invoices.
Token-Efficient YAML Output: Saves 20–30% of AI token costs; converted to JSON for DB storage.

High-Level Architecture

Scalable-Genai-Invoice-PDF-Data-Extractor-LLD

Workflow Summary:

Users upload invoices via /api/file/upload.
Files queued in RabbitMQ.
Workers fetch id's and run GenAI extraction (parallel processing).
YAML output generated → converted to JSON → stored in MongoDB.
Validation checks applied; errors flagged.
Processed results accessible via API.

Dead Letter Queue (DLQ)

Invoices or jobs that fail processing (e.g., unreadable files, persistent GenAI errors) are automatically routed to a Dead Letter Queue (DLQ).
DLQ allows you to review failed jobs, retry them manually, or trigger alerts.
Keeps the main RabbitMQ queue unblocked and ensures smooth processing of other invoices.
Common use cases:
- Corrupted PDF or image
- Unsupported invoice format
- Persistent AI extraction failure
- or any other Error

Retry Strategy with Exponential Backoff

Failed jobs are retried automatically using an exponential backoff strategy.
Wait time increases after each failed attempt (e.g., 60s → 120s → 180s → 240s) to prevent overloading the system.
Jobs exceeding the maximum retry count are sent to the Dead Letter Queue (DLQ) for manual review or alerts.
Ensures smooth processing while handling temporary errors gracefully.

Tech Stack

Layer	Technology
Backend Framework	Express.js
Database	MongoDB
Queue / Async Jobs	RabbitMQ
Rate Limiting	Redis
Containerization	Docker & Docker Compose
AI / Data Extraction	Generative AI (Gemini)

Installation & Setup

🐳 Run with Docker

⚠️ Ensure .env includes your Gemini API key.

# Build and start containers docker compose --build -d # Access the app http://localhost:3000

💻 Run Directly

# Start required services docker compose up -d # Start development server npm run dev # Start background worker npm run worker # OR start with PM2 for production npm run start

Scripts

"scripts": { "dev": "node --env-file=.env --watch index.js", "worker": "node --env-file=.env worker/file.worker.js", "queue:flush": "node --env-file=.env scripts/amqp.flush.js", "start": "pm2 start ecosystem.config.cjs" }

Environment Variables

GEMINI_API_KEY=<your-api-key> AMQP_URL=amqp://user:password@localhost:5672 MONGODB_URI=mongodb://localhost:27017/invoices REDIS_HOST=localhost REDIS_PORT=6379 MAX_FILE_SIZE=200 PORT=3000 WORKER_CONCURRENCY=100

API Endpoints

1. Upload Files

Endpoint: /api/file/upload
Method: POST
Request: Multipart/form-data (multiple files)
Form Key: files
Response:

{ "message": "Files uploaded and queued", "file_ids": [array of file id's] }

2. Fetch All Files (with filters)

Endpoint: /api/file
Method: GET
Request Body Example:
filters (status): pending, processing, processed, error
flag: true or false (to fetch the flagged files having error in calculation or other)

{ "flag": true, "status": "processed" }

3. Fetch File by ID

Endpoint: /api/file/{invoice_id}
Method: GET
FormData Example: Include file(s) to process if needed

Sample Output

items: - name: "Activator CREAM 5-gal 5 gallons/pailt" quantity: 2 rate: 269.00 - name: "AEB2461 PTFE 10\"x36YDS 5mil Glass Cloth Fabric / No Adh." quantity: 5 rate: 64.00 - name: "Permabond 106 Cyanoacrylate 1-oz 10 bottles/case" quantity: 7 rate: 139.00 - name: "Araldite 2014 HT Epoxy Paste GRAY 50ml 2:1 6 cartridges/box | 120 cartridges/case" quantity: 1 rate: 719.00 payment_info: subtotal: 1187 sales_tax_percentage: 8 shipping_handling_cost: 50 total: 1330.75 errors: - "Subtotal mismatch: calculated 2550.00 vs declared 1187.00" - "Sales Tax amount mismatch: calculated 204.00 vs declared 94.75 for sales tax rate 8%" - "Total mismatch: calculated 2804.00 vs declared 1330.75"

Stored in MongoDB as JSON (YAML converted to JSON internally).

Validation & Error Handling

Flags subtotal, sales tax, shipping, and total mismatches.
Invalid invoices are stored with an errors array for review.

Scaling Workers & Concurrency

Worker Concurrency: Default 100 files per worker.
Multiple Workers: Run multiple workers to scale horizontally.
File Size Limit: Default 200MB; configurable.
Rate Limiting: ioredis ensures stable API usage.
Tips: Monitor CPU/memory and RabbitMQ queues for optimal throughput.

Token Efficiency

YAML formatting reduces AI token usage by ~20–30%.
reduced characters by 30% (in current scenario) → faster output.
Read more: How I Saved Millions in GenAI Token Costs

Postman Collection

Import Postman collection from postman_collection.json.
Base URL: http://localhost:3000
Test endpoints: upload invoices, fetch all files, fetch file by ID.

License

This project is licensed under the MIT License.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
clients		clients
consts		consts
controller		controller
middleware		middleware
models		models
postman		postman
processors		processors
routes		routes
scripts		scripts
services		services
utils		utils
validation		validation
worker		worker
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml
ecosystem.config.cjs		ecosystem.config.cjs
index.js		index.js
package.json		package.json
pnpm-lock.yaml		pnpm-lock.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Scalable GenAI Invoice PDF Data Extractor

Table of Contents

Features

High-Level Architecture

Dead Letter Queue (DLQ)

Retry Strategy with Exponential Backoff

Tech Stack

Installation & Setup

🐳 Run with Docker

💻 Run Directly

Scripts

Environment Variables

API Endpoints

1. Upload Files

2. Fetch All Files (with filters)

3. Fetch File by ID

Sample Output

Validation & Error Handling

Scaling Workers & Concurrency

Token Efficiency

Postman Collection

License

About

Uh oh!

Releases

Packages

Languages

License

codeterrayt/Scalable-Genai-Invoice-PDF-Data-Extractor

Folders and files

Latest commit

History

Repository files navigation

Scalable GenAI Invoice PDF Data Extractor

Table of Contents

Features

High-Level Architecture

Dead Letter Queue (DLQ)

Retry Strategy with Exponential Backoff

Tech Stack

Installation & Setup

🐳 Run with Docker

💻 Run Directly

Scripts

Environment Variables

API Endpoints

1. Upload Files

2. Fetch All Files (with filters)

3. Fetch File by ID

Sample Output

Validation & Error Handling

Scaling Workers & Concurrency

Token Efficiency

Postman Collection

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages