One-liner:
AI-powered backend system that extracts structured data from multi-page invoices (PDFs and images) at scale, flags errors, and stores results efficiently.
- Features
- High-Level Architecture
- Dead Letter Queue (DLQ)
- Retry Strategy with Exponential Backoff
- Tech Stack
- Installation & Setup
- Scripts
- Environment Variables
- API Endpoints
- Sample Output
- Validation & Error Handling
- Scaling Workers & Concurrency
- Token Efficiency
- Postman Collection
- License

Features:
- Batch & Async Processing: Upload multiple invoices at once; files are queued in RabbitMQ.
- High Concurrency: A single worker can process ~100 files in parallel; scale horizontally by running multiple workers.
- GenAI-Powered Extraction: Extracts vendor details, invoice metadata, line items, totals, and payment info.
- Error Detection: Flags mismatches in subtotal, sales tax, shipping, and total.
- Multi-format Support: PDFs, images, scanned documents, and multi-page invoices.
- Token-Efficient YAML Output: The model returns YAML, which saves 20–30% of AI token costs; the output is converted to JSON for DB storage.

High-Level Architecture:

Workflow Summary:
- Users upload invoices via `/api/file/upload`.
- Files are queued in RabbitMQ.
- Workers fetch file IDs from the queue and run GenAI extraction (parallel processing).
- YAML output generated → converted to JSON → stored in MongoDB.
- Validation checks applied; errors flagged.
- Processed results accessible via API.

Dead Letter Queue (DLQ):
- Invoices or jobs that fail processing (e.g., unreadable files, persistent GenAI errors) are automatically routed to a Dead Letter Queue (DLQ).
- The DLQ lets you review failed jobs, retry them manually, or trigger alerts.
- It keeps the main RabbitMQ queue unblocked so other invoices continue to process smoothly.
- Common failure causes:
  - Corrupted PDF or image
  - Unsupported invoice format
  - Persistent AI extraction failure
  - Any other error
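
One common way to get this behaviour in RabbitMQ is a dead-letter exchange attached to the main queue. The sketch below is illustrative only and is not taken from the project: the names `invoice.files`, `invoice.dlx`, and `invoice.files.dlq` are hypothetical, and `channel` is assumed to be an already-open amqplib channel.

```javascript
// Hypothetical names; the project's real queue and exchange names may differ.
const MAIN_QUEUE = "invoice.files";
const DLX = "invoice.dlx";
const DLQ = "invoice.files.dlq";

// Describe the topology as plain data so it is easy to inspect and test.
function buildTopology() {
  return {
    exchange: { name: DLX, type: "direct", options: { durable: true } },
    deadLetterQueue: { name: DLQ, options: { durable: true } },
    mainQueue: {
      name: MAIN_QUEUE,
      options: {
        durable: true,
        // Messages rejected without requeue are republished to this exchange.
        deadLetterExchange: DLX,
        deadLetterRoutingKey: DLQ,
      },
    },
  };
}

// Declare the exchange and both queues on an amqplib channel (assumed open).
async function applyTopology(channel, topology = buildTopology()) {
  const { exchange, deadLetterQueue, mainQueue } = topology;
  await channel.assertExchange(exchange.name, exchange.type, exchange.options);
  await channel.assertQueue(deadLetterQueue.name, deadLetterQueue.options);
  await channel.bindQueue(deadLetterQueue.name, exchange.name, deadLetterQueue.name);
  await channel.assertQueue(mainQueue.name, mainQueue.options);
}
```

With a setup like this, a worker that gives up on a message can call `channel.nack(msg, false, false)` (reject without requeue), and RabbitMQ moves the message to the DLQ while the main queue keeps flowing.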

Retry Strategy with Exponential Backoff:
- Failed jobs are retried automatically using an exponential backoff strategy.
- The wait time increases after each failed attempt (e.g., 60s → 120s → 180s → 240s) to avoid overloading the system.
- Jobs that exceed the maximum retry count are sent to the Dead Letter Queue (DLQ) for manual review or alerts.
- Temporary errors are handled gracefully while processing of other jobs continues smoothly.
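
The retry decision can be sketched as a small pure function. Note that the sample delays listed above grow by a fixed 60 seconds per attempt, whereas a strictly exponential schedule doubles each time; the source does not say which one the worker uses, so this sketch shows both. The function names and the `MAX_RETRIES` value are assumptions made for illustration.

```javascript
const BASE_DELAY_SECONDS = 60; // first delay in the example schedule
const MAX_RETRIES = 4;         // assumption: the actual limit is not documented

// Fixed-step schedule: 60s, 120s, 180s, 240s for attempts 1..4.
function linearDelay(attempt) {
  return BASE_DELAY_SECONDS * attempt;
}

// Doubling schedule: 60s, 120s, 240s, 480s for attempts 1..4.
function exponentialDelay(attempt) {
  return BASE_DELAY_SECONDS * 2 ** (attempt - 1);
}

// Decide what happens to a job that has just failed for the `attempt`-th time.
function nextAction(attempt, delayFn = exponentialDelay) {
  if (attempt > MAX_RETRIES) {
    return { action: "dead-letter" }; // retries exhausted: hand over to the DLQ
  }
  return { action: "retry", delaySeconds: delayFn(attempt) };
}
```

Keeping the schedule in one function makes it easy to swap strategies or add jitter later without touching the worker loop.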

Tech Stack:

| Layer | Technology |
|---|---|
| Backend Framework | Express.js |
| Database | MongoDB |
| Queue / Async Jobs | RabbitMQ |
| Rate Limiting | Redis |
| Containerization | Docker & Docker Compose |
| AI / Data Extraction | Generative AI (Gemini) |

Installation & Setup:

⚠️ Ensure `.env` includes your Gemini API key.

```bash
# Build and start containers
docker compose up --build -d

# Access the app at http://localhost:3000
```

```bash
# Start required services
docker compose up -d

# Start development server
npm run dev

# Start background worker
npm run worker

# OR start with PM2 for production
npm run start
```

Scripts:

```json
"scripts": {
  "dev": "node --env-file=.env --watch index.js",
  "worker": "node --env-file=.env worker/file.worker.js",
  "queue:flush": "node --env-file=.env scripts/amqp.flush.js",
  "start": "pm2 start ecosystem.config.cjs"
}
```

Environment Variables:

```
GEMINI_API_KEY=<your-api-key>
AMQP_URL=amqp://user:password@localhost:5672
MONGODB_URI=mongodb://localhost:27017/invoices
REDIS_HOST=localhost
REDIS_PORT=6379
MAX_FILE_SIZE=200
PORT=3000
WORKER_CONCURRENCY=100
```

API Endpoints:

Upload invoices:
- Endpoint: `/api/file/upload`
- Method: POST
- Request: Multipart/form-data (multiple files)
- Form Key: files
- Response:

```json
{
  "message": "Files uploaded and queued",
  "file_ids": ["<file-id-1>", "<file-id-2>"]
}
```

Fetch all files:
- Endpoint: `/api/file`
- Method: GET
- Request body fields:
  - `status` (filter): `pending`, `processing`, `processed`, or `error`
  - `flag`: `true` or `false`; set to `true` to fetch files flagged for calculation or other errors
- Request body example:

```json
{
  "flag": true,
  "status": "processed"
}
```

Fetch file by ID:
- Endpoint: `/api/file/{invoice_id}`
- Method: GET
- FormData Example: Include file(s) to process if needed
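
As a rough client-side illustration of the upload and fetch-by-ID endpoints, the sketch below uses the `fetch`, `FormData`, and `Blob` globals available in Node.js 18+. The helper names are hypothetical, the base URL assumes the default `PORT=3000` on localhost, and the error handling is an assumption since the API's error responses are not documented here.

```javascript
const BASE_URL = "http://localhost:3000"; // assumes the default PORT=3000

// Build the multipart body: every file goes under the "files" form key.
function buildUploadForm(files) {
  const form = new FormData();
  for (const { name, data, type } of files) {
    form.append("files", new Blob([data], { type }), name);
  }
  return form;
}

// POST the files; the documented response shape is { message, file_ids }.
async function uploadInvoices(files) {
  const response = await fetch(`${BASE_URL}/api/file/upload`, {
    method: "POST",
    body: buildUploadForm(files), // fetch sets the multipart boundary itself
  });
  if (!response.ok) {
    throw new Error(`Upload failed with status ${response.status}`);
  }
  return response.json();
}

// GET one file's result using an ID returned by the upload call.
async function fetchInvoice(fileId) {
  const response = await fetch(`${BASE_URL}/api/file/${encodeURIComponent(fileId)}`);
  if (!response.ok) {
    throw new Error(`Fetch failed with status ${response.status}`);
  }
  return response.json();
}
```

A typical flow would call `uploadInvoices([...])`, keep the returned `file_ids`, and then poll `fetchInvoice(id)` until the file reaches the `processed` or `error` status.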

Sample Output:

```yaml
items:
  - name: "Activator CREAM 5-gal 5 gallons/pailt"
    quantity: 2
    rate: 269.00
  - name: "AEB2461 PTFE 10\"x36YDS 5mil Glass Cloth Fabric / No Adh."
    quantity: 5
    rate: 64.00
  - name: "Permabond 106 Cyanoacrylate 1-oz 10 bottles/case"
    quantity: 7
    rate: 139.00
  - name: "Araldite 2014 HT Epoxy Paste GRAY 50ml 2:1 6 cartridges/box | 120 cartridges/case"
    quantity: 1
    rate: 719.00
payment_info:
  subtotal: 1187
  sales_tax_percentage: 8
  shipping_handling_cost: 50
  total: 1330.75
errors:
  - "Subtotal mismatch: calculated 2550.00 vs declared 1187.00"
  - "Sales Tax amount mismatch: calculated 204.00 vs declared 94.75 for sales tax rate 8%"
  - "Total mismatch: calculated 2804.00 vs declared 1330.75"
```

Stored in MongoDB as JSON (YAML is converted to JSON internally).

Validation & Error Handling:
- Flags subtotal, sales tax, shipping, and total mismatches.
- Invalid invoices are stored with an `errors` array for review.
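
The subtotal and total checks can be sketched as follows, using the field names and figures from the sample output. This is a simplified illustration rather than the project's validator: the function name, the one-cent tolerance, and the message builder are assumptions, and the sales-tax-amount check is left out because the sample `payment_info` does not show a declared tax amount field.

```javascript
const TOLERANCE = 0.01; // assumption: ignore differences under one cent

const money = (value) => value.toFixed(2);

// Recompute the subtotal and total from the line items and compare them
// with the declared values; return a list of human-readable mismatches.
function validateInvoice(invoice) {
  const errors = [];
  const { subtotal, sales_tax_percentage, shipping_handling_cost, total } =
    invoice.payment_info;

  const calcSubtotal = invoice.items.reduce(
    (sum, item) => sum + item.quantity * item.rate,
    0
  );
  if (Math.abs(calcSubtotal - subtotal) > TOLERANCE) {
    errors.push(
      `Subtotal mismatch: calculated ${money(calcSubtotal)} vs declared ${money(subtotal)}`
    );
  }

  const calcTax = (calcSubtotal * sales_tax_percentage) / 100;
  const calcTotal = calcSubtotal + calcTax + shipping_handling_cost;
  if (Math.abs(calcTotal - total) > TOLERANCE) {
    errors.push(
      `Total mismatch: calculated ${money(calcTotal)} vs declared ${money(total)}`
    );
  }
  return errors;
}

// Quantities and rates from the sample output (item names omitted for brevity).
const sampleInvoice = {
  items: [
    { quantity: 2, rate: 269.0 },
    { quantity: 5, rate: 64.0 },
    { quantity: 7, rate: 139.0 },
    { quantity: 1, rate: 719.0 },
  ],
  payment_info: {
    subtotal: 1187,
    sales_tax_percentage: 8,
    shipping_handling_cost: 50,
    total: 1330.75,
  },
};

console.log(validateInvoice(sampleInvoice));
// → "Subtotal mismatch: calculated 2550.00 vs declared 1187.00"
// → "Total mismatch: calculated 2804.00 vs declared 1330.75"
```

The line items sum to 2550.00 (538 + 320 + 973 + 719); adding 8% tax (204.00) and 50.00 shipping gives 2804.00, which is why both declared values are flagged.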

Scaling Workers & Concurrency:
- Worker Concurrency: Defaults to 100 files per worker (`WORKER_CONCURRENCY`).
- Multiple Workers: Run additional worker processes to scale horizontally.
- File Size Limit: Defaults to 200 MB (`MAX_FILE_SIZE`); configurable.
- Rate Limiting: Redis-backed rate limiting (via ioredis) keeps API usage stable.
- Tips: Monitor CPU, memory, and RabbitMQ queue depth to tune throughput.
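
To make the per-worker concurrency cap concrete, here is a generic bounded-concurrency helper. It is a sketch of the general technique, not the project's worker code: `runWithConcurrency` and its `handler` callback are hypothetical names, and the real worker may rely on RabbitMQ's consumer prefetch instead.

```javascript
// Run `handler` over `items` with at most `limit` calls in flight at once.
// Results keep the input order. If a handler rejects, the returned promise
// rejects, but tasks that already started are not cancelled.
async function runWithConcurrency(items, limit, handler) {
  const results = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each runner claims items one at a time until none are left.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await handler(items[index], index);
    }
  }

  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

A worker could call this as `runWithConcurrency(fileIds, Number(process.env.WORKER_CONCURRENCY), processFile)`, where `processFile` is a placeholder for the extraction step. With amqplib, a similar cap on unacknowledged messages is usually set through `channel.prefetch(count)`.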

Token Efficiency:
- YAML formatting reduces AI token usage by ~20–30%.
- In the current scenario, output character count dropped by 30%, which also means faster output.
- Read more: How I Saved Millions in GenAI Token Costs
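
The size difference is easy to see by serializing the same object both ways. The sketch below uses a deliberately tiny YAML emitter written for this illustration (a real project would use a YAML library), compares against pretty-printed JSON, and measures characters only, which is a rough proxy for tokens; actual savings depend on the model's tokenizer and the data.

```javascript
// Strings are emitted as JSON string literals, which are valid YAML scalars.
function scalar(value) {
  return typeof value === "string" ? JSON.stringify(value) : String(value);
}

// Minimal YAML emitter for plain objects, arrays, strings, numbers, booleans.
// Illustrative only: empty arrays/objects and other edge cases are not handled.
function toYaml(value, indent = 0) {
  const pad = " ".repeat(indent);
  if (Array.isArray(value)) {
    return value
      .map((item) =>
        item !== null && typeof item === "object"
          ? // Emit the nested value one level deeper, then fold it onto the dash.
            `${pad}- ${toYaml(item, indent + 2).trimStart()}`
          : `${pad}- ${scalar(item)}`
      )
      .join("\n");
  }
  if (value !== null && typeof value === "object") {
    return Object.entries(value)
      .map(([key, v]) =>
        v !== null && typeof v === "object"
          ? `${pad}${key}:\n${toYaml(v, indent + 2)}`
          : `${pad}${key}: ${scalar(v)}`
      )
      .join("\n");
  }
  return `${pad}${scalar(value)}`;
}

const example = {
  items: [{ name: "Widget", quantity: 2, rate: 269 }],
  payment_info: { subtotal: 538, total: 538 },
};

const asYaml = toYaml(example);
const asJson = JSON.stringify(example, null, 2);
const saved = (1 - asYaml.length / asJson.length) * 100;
console.log(asYaml);
console.log(`YAML: ${asYaml.length} chars, JSON: ${asJson.length} chars, saved ${saved.toFixed(1)}%`);
```

Most of the saving comes from dropping braces, brackets, commas, and quoted keys, so deeply nested invoices with many line items tend to benefit the most.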

Postman Collection:
- Import the Postman collection from `postman_collection.json`.
- Base URL: `http://localhost:3000`
- Test endpoints: upload invoices, fetch all files, fetch file by ID.

License:
This project is licensed under the MIT License.