A sophisticated document processing system that uses AI and OCR to extract financial amounts from medical documents and invoices. The system combines multiple extraction strategies including LLM processing, regex patterns, and fallback mechanisms to ensure high accuracy.
Video demo of the API endpoints: https://drive.google.com/file/d/1iOxblNLLriSwdP66iAkR8vNQVmD0upM6/view?usp=drivesdk

You can explore and test the API routes through the Swagger UI:

- Live deployment: https://ai-powered-amount-extraction-in-medi-docs.onrender.com/api-docs/
- Local development: http://localhost:3000/api-docs
- Multi-Modal Extraction: Combines LLM (Gemini) processing with regex fallback
- OCR Integration: Tesseract-based text extraction from images
- Smart Pattern Matching: Advanced regex patterns for various invoice formats
- Currency Detection: Automatic detection of USD, INR, EUR, GBP, JPY
- Robust Error Handling: Multiple fallback strategies for maximum reliability
- RESTful API: Clean API endpoints for easy integration
- Real-time Processing: Fast document processing with detailed logging
- Main orchestrator for the entire processing pipeline
- Coordinates OCR, LLM, and fallback extraction
- Handles error recovery and result validation
- Provides comprehensive logging and debugging
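
The orchestration described above can be pictured roughly as follows. This is a minimal sketch, not the actual contents of pipelineService.js: the names `cleanJson`, `regexFallback`, and `runPipeline`, the injected `callLlm` function, the `method` field, and the `no_amounts_found` status are all hypothetical. The pattern is a variant of the keyword-number regex documented in this README, with the keyword captured so it can be reported as the amount type.

```javascript
// Hypothetical helper: pull the outermost {...} object out of an LLM reply,
// ignoring any prose or Markdown fencing around it.
function cleanJson(raw) {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) throw new Error("No JSON object found");
  return JSON.parse(raw.slice(start, end + 1));
}

// Hypothetical regex fallback for "Total 2000" style inputs.
function regexFallback(text) {
  const pattern = /\b(total|subtotal|amount|bill|net)\b\s+(\d[\d,.]*\d*)/gi;
  const amounts = [];
  for (const match of text.matchAll(pattern)) {
    const value = Number(match[2].replace(/,/g, ""));
    if (Number.isFinite(value)) {
      amounts.push({ type: match[1].toLowerCase(), value, source: `text: '${match[0]}'` });
    }
  }
  return amounts;
}

// Try the LLM first; fall back to regex when the call or the JSON parsing fails,
// or when the LLM returns no amounts.
async function runPipeline(text, callLlm) {
  try {
    const parsed = cleanJson(await callLlm(text));
    if (Array.isArray(parsed.amounts) && parsed.amounts.length > 0) {
      return { ...parsed, status: "ok", method: "llm" };
    }
  } catch (err) {
    console.warn(`LLM extraction failed, using regex fallback: ${err.message}`);
  }
  const amounts = regexFallback(text);
  return { amounts, status: amounts.length > 0 ? "ok" : "no_amounts_found", method: "regex" };
}
```

Passing `callLlm` in as a parameter keeps the sketch independent of any particular client library and makes the fallback path easy to exercise with a stub.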
- AI-powered extraction using Google Gemini models
- Supports multiple model fallbacks (gemini-2.0-flash, gemini-1.5-pro)
- Intelligent retry logic with exponential backoff
- JSON response parsing and validation
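
The retry and model-fallback behaviour could look something like the sketch below. It assumes the retry values documented in this README's configuration (3 attempts per model, 1000 ms initial backoff, 10000 ms cap) and the two model names listed above. `backoffDelay`, `generateWithFallback`, and the injected `callModel(model, prompt)` function are hypothetical names, not the real llmService.js API.

```javascript
// Retry settings mirroring the values documented in this README.
const RETRY = { attemptsPerModel: 3, initialBackoffMs: 1000, maxBackoffMs: 10000 };
const MODELS = ["gemini-2.0-flash", "gemini-1.5-pro"];

// Delay before the next attempt: doubles each time, capped at maxBackoffMs.
function backoffDelay(attempt, { initialBackoffMs, maxBackoffMs } = RETRY) {
  return Math.min(initialBackoffMs * 2 ** attempt, maxBackoffMs);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Try each model in order; retry a failing model with growing delays
// before moving on to the next one.
async function generateWithFallback(prompt, callModel, options = RETRY, models = MODELS) {
  let lastError;
  for (const model of models) {
    for (let attempt = 0; attempt < options.attemptsPerModel; attempt++) {
      try {
        return { model, text: await callModel(model, prompt) };
      } catch (err) {
        lastError = err;
        // Wait before retrying the same model; skip the wait after its final attempt.
        if (attempt < options.attemptsPerModel - 1) {
          await sleep(backoffDelay(attempt, options));
        }
      }
    }
  }
  throw lastError;
}
```

With the default settings, a model that keeps failing is retried after 1 s and then 2 s before the next model is tried. Adding random jitter to the delay is a common refinement that is left out here for brevity.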
- Tesseract integration for image-to-text conversion
- Automatic DPI detection and optimization
- Support for multiple image formats
- Text preprocessing and cleaning
- Character substitution for OCR error correction
- Smart number parsing with context awareness
- Confidence scoring for extracted amounts
- Handles various number formats and separators
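
As a rough illustration of the normalization steps above, the sketch below corrects digit look-alikes and parses numbers with different separators. The substitution table and the function names `fixOcrDigits` and `parseAmount` are assumptions made for this example. The actual rules in normalizationService.js may differ, and confidence scoring is not covered.

```javascript
// Illustrative substitutions for letters OCR commonly confuses with digits.
const OCR_DIGIT_FIXES = { O: "0", o: "0", l: "1", I: "1", S: "5", B: "8" };

// Replace look-alike letters only inside tokens that already contain a digit,
// so ordinary words such as "Bill" or "Subtotal" are left alone.
function fixOcrDigits(text) {
  return text.replace(/\S*\d\S*/g, (token) =>
    token.replace(/[OolISB]/g, (ch) => OCR_DIGIT_FIXES[ch])
  );
}

// Parse "1,200.00", "1.200,50", or "5,000" into a number; return null on failure.
function parseAmount(raw) {
  const cleaned = raw.replace(/[^\d.,]/g, "");
  const lastDot = cleaned.lastIndexOf(".");
  const lastComma = cleaned.lastIndexOf(",");
  let normalized;
  if (lastDot !== -1 && lastComma !== -1) {
    // Both separators present: whichever comes last is the decimal separator.
    const decimal = lastDot > lastComma ? "." : ",";
    const thousands = decimal === "." ? "," : ".";
    normalized = cleaned.split(thousands).join("").replace(decimal, ".");
  } else {
    // One kind of separator or none: assume commas group thousands
    // and a dot, if any, is the decimal point.
    normalized = cleaned.replace(/,/g, "");
  }
  const value = Number(normalized);
  return normalized !== "" && Number.isFinite(value) ? value : null;
}

console.log(fixOcrDigits("Total: 1O0.5O")); // Total: 100.50
console.log(parseAmount("$1,200.00"));      // 1200
```

The digit-token heuristic is deliberately simple: an alphanumeric identifier such as an invoice number would also be rewritten, so a real implementation would need more context than this sketch uses.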
- Clean, modern interface for document upload
- Real-time processing status updates
- Results display with detailed breakdown
- Formatters (utils/formatters.js): Currency and number formatting
- Validators (utils/validators.js): Input validation and sanitization
- Standard Invoices: Total, subtotal, tax, discount, shipping
- Medical Bills: Treatment costs, insurance amounts, patient payments
- Receipts: Itemized amounts, taxes, tips
- Statements: Account balances, payments, charges
- total_bill: Final amount due
- paid: Amount already paid
- due: Outstanding balance
- tax: Tax amounts (sales tax, GST, VAT)
- discount: Discount amounts
- shipping: Shipping and delivery costs
- Node.js (v14 or higher)
- Python 3.7+ (for Tesseract OCR)
- Tesseract OCR installed on system
```bash
cd backend
npm install

# Install Tesseract OCR
# Ubuntu/Debian:
sudo apt-get install tesseract-ocr
# macOS:
brew install tesseract
# Windows:
# Download from: https://github.com/UB-Mannheim/tesseract/wiki

# Set up environment variables
cp .env.example .env
# Edit .env with your Gemini API key
```

```bash
cd frontend
npm install
npm start
```

```env
# Gemini AI Configuration
GEMINI_API_KEY=your_gemini_api_key_here
GEMINI_MODEL=gemini-2.0-flash
GEMINI_FALLBACK_MODEL=gemini-1.5-pro

# Server Configuration
PORT=3001
NODE_ENV=development

# OCR Configuration
TESSERACT_DPI=300
TESSERACT_LANG=eng
```

```http
POST /api/process
Content-Type: multipart/form-data

Body:
- file: (image file) - Document image to process
- text: (string) - Direct text input (alternative to file)
```

```http
GET /api/health
```

Example response from POST /api/process:

```json
{
  "currency": "USD",
  "amounts": [
    {
      "type": "total_bill",
      "value": 34850,
      "source": "text: 'Total, USD: 34850.00'"
    },
    {
      "type": "tax",
      "value": 6150,
      "source": "text: 'Sales Tax, USD: 6150.00'"
    },
    {
      "type": "discount",
      "value": 1800,
      "source": "text: 'Discount, USD. 1800.00'"
    }
  ],
  "status": "ok"
}
```

- Upload Document: Drag and drop or select an image file
- Process: Click "Process Document" to start extraction
- View Results: See extracted amounts with confidence scores
- Download: Export results as JSON or CSV
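
The CSV export in the last step could be produced from the JSON response along these lines. This is only a sketch: the column layout and the function names `csvField` and `toCsv` are assumptions, and the frontend's real export code is not shown in this README.

```javascript
// Escape a CSV field: quote it when it contains commas, quotes, or newlines.
function csvField(value) {
  const s = String(value);
  return /[",\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

// Flatten an extraction result into CSV text, one row per amount.
function toCsv(result) {
  const header = ["type", "value", "currency", "source"];
  const rows = result.amounts.map((a) => [a.type, a.value, result.currency, a.source]);
  return [header, ...rows].map((row) => row.map(csvField).join(",")).join("\n");
}

// Sample data taken from the response format documented in this README.
const example = {
  currency: "USD",
  amounts: [
    { type: "total_bill", value: 34850, source: "text: 'Total, USD: 34850.00'" },
    { type: "tax", value: 6150, source: "text: 'Sales Tax, USD: 6150.00'" },
  ],
  status: "ok",
};

console.log(toCsv(example));
```

The `source` values contain commas, so they are wrapped in double quotes in the output, which keeps each row at four columns when opened in a spreadsheet.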
```
Total: $1,200.00
Amount: ₹5,000
Due: €250.50
```

```
Total, USD: 30000.00
Discount, INR. 1800.00
Sales Tax, EUR: 6150.00
```

```
Total 2000
Amount 1500
Due 500
```

```
Subtotal: 1000 Discount: 100 Total: 900
Paid: $500 Balance: $400
```
The system uses sophisticated regex patterns to handle various formats:
```javascript
// Currency with symbols
/([$€£¥₹%])\s*(\d[\d,.]*\d*)/g

// Currency codes
/(?:usd|inr|eur|gbp|jpy)\s*[:\-\.]\s*(\d[\d,.]*\d*)/gi

// Keyword-number combinations
/\b(?:total|subtotal|amount|bill|net)\b\s+(\d[\d,.]*\d*)/gi
```

```
├── backend/
│   ├── src/
│   │   ├── services/
│   │   │   ├── pipelineService.js      # Main processing pipeline
│   │   │   ├── llmService.js           # AI/LLM integration
│   │   │   ├── ocrService.js           # OCR processing
│   │   │   └── normalizationService.js # Text normalization
│   │   ├── routes/
│   │   │   ├── api.js                  # API endpoints
│   │   │   └── index.js                # Route handlers
│   │   └── index.js                    # Server entry point
│   ├── uploads/                        # Temporary file storage
│   └── package.json
├── frontend/
│   ├── src/
│   │   ├── pages/Home/                 # Main application page
│   │   ├── utils/                      # Utility functions
│   │   └── App.js                      # React app component
│   └── package.json
└── README.md
```

```bash
# Backend tests
cd backend
npm test

# Frontend tests
cd frontend
npm test
```

```bash
# Linting
npm run lint

# Formatting
npm run format

# Type checking (if using TypeScript)
npm run type-check
```

```javascript
// Model settings
generationConfig: {
  temperature: 0.1,      // Low temperature for consistency
  topK: 10,
  topP: 0.8,
  maxOutputTokens: 512   // Optimized for speed
}

// Retry settings
attemptsPerModel: 3,
initialBackoffMs: 1000,
maxBackoffMs: 10000
```

```javascript
// Tesseract options
{
  dpi: 300,     // Optimal DPI for text recognition
  lang: 'eng',  // Language model
  oem: 1,       // OCR Engine Mode
  psm: 6        // Page Segmentation Mode
}
```

- Cause: Malformed JSON response from AI
- Solution: Enhanced JSON cleaning and fallback to regex
- Cause: Poor image quality or resolution
- Solution: Automatic DPI adjustment and text preprocessing
- Cause: Unusual invoice formats
- Solution: Multiple fallback patterns and comprehensive extraction
- Cause: Ambiguous currency indicators
- Solution: Context-aware detection with INR default
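
Currency detection with an INR default might be sketched as below. The lookup order (explicit codes first, then symbols) and the mapping of `$` to USD are assumptions for this example, and `detectCurrency` is a hypothetical name. The five currencies are the ones this README lists as supported.

```javascript
// Symbols for the five supported currencies. Treating "$" as USD is an assumption.
const SYMBOL_TO_CODE = { "$": "USD", "₹": "INR", "€": "EUR", "£": "GBP", "¥": "JPY" };

function detectCurrency(text, fallback = "INR") {
  // Explicit currency codes are the least ambiguous signal, so check them first.
  const codeMatch = text.match(/\b(usd|inr|eur|gbp|jpy)\b/i);
  if (codeMatch) return codeMatch[1].toUpperCase();

  // Next, look for a currency symbol directly in front of a number.
  const symbolMatch = text.match(/([$€£¥₹])\s*\d/);
  if (symbolMatch) return SYMBOL_TO_CODE[symbolMatch[1]];

  // No indicator found: use the documented default.
  return fallback;
}

console.log(detectCurrency("Total, USD: 30000.00")); // USD
console.log(detectCurrency("Due: €250.50"));         // EUR
console.log(detectCurrency("Total 2000"));           // INR
```

This version returns the first indicator it finds, so a document that mixes several currencies would need per-amount detection instead of a single document-level result.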
Enable detailed logging by setting:
```env
DEBUG=true
LOG_LEVEL=debug
```

- OCR Processing: ~2-5 seconds per document
- LLM Processing: ~1-3 seconds per document
- Regex Fallback: ~100-500ms per document
- Total Pipeline: ~3-8 seconds average
- Image Quality: Use high-resolution images (300+ DPI)
- Text Clarity: Ensure good contrast and minimal noise
- Format Consistency: Use standard invoice formats when possible
- Batch Processing: Process multiple documents in parallel
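
For the batch-processing tip, one way to process several documents in parallel without overwhelming the OCR or LLM backends is a small concurrency limiter. The sketch below is an assumption about how a client might do this: `processBatch` and the `worker` callback are hypothetical names, and the default limit of 3 is arbitrary.

```javascript
// Run `worker` over `items` with at most `limit` tasks in flight at once.
// Results keep the input order; failures are recorded instead of aborting the batch.
async function processBatch(items, worker, limit = 3) {
  const results = new Array(items.length);
  let next = 0;

  // Each lane repeatedly claims the next unprocessed index until none remain.
  async function runLane() {
    while (next < items.length) {
      const index = next++;
      try {
        results[index] = { status: "ok", value: await worker(items[index]) };
      } catch (err) {
        results[index] = { status: "error", error: err.message };
      }
    }
  }

  const lanes = Array.from({ length: Math.min(limit, items.length) }, runLane);
  await Promise.all(lanes);
  return results;
}
```

In practice `worker` would be a function that submits one document to `POST /api/process` and resolves with the parsed response.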
- No Data Persistence: Documents are processed in memory only
- Temporary Storage: Files are deleted after processing
- API Key Security: Environment variable protection
- Input Validation: Comprehensive sanitization
- Use HTTPS in production
- Implement rate limiting
- Validate file types and sizes
- Monitor API usage
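
The file type and size check from the list above could be as simple as the following sketch. The allowed MIME types, the 10 MB limit, and the `validateUpload` name are illustrative assumptions. The `file` argument is assumed to carry `mimetype` and `size` fields, as upload middleware such as multer provides.

```javascript
// Illustrative limits; adjust to the formats and sizes the deployment should accept.
const ALLOWED_MIME_TYPES = new Set(["image/png", "image/jpeg", "image/tiff"]);
const MAX_FILE_BYTES = 10 * 1024 * 1024; // 10 MB

// Return { valid, errors } for an uploaded file descriptor.
function validateUpload(file) {
  if (!file) return { valid: false, errors: ["No file provided"] };

  const errors = [];
  if (!ALLOWED_MIME_TYPES.has(file.mimetype)) {
    errors.push(`Unsupported file type: ${file.mimetype}`);
  }
  if (file.size > MAX_FILE_BYTES) {
    errors.push(`File too large: ${file.size} bytes (max ${MAX_FILE_BYTES})`);
  }
  return { valid: errors.length === 0, errors };
}

console.log(validateUpload({ mimetype: "image/png", size: 2048 }));
```

A client-reported MIME type can be spoofed, so a stricter setup would also inspect the file's leading bytes before handing it to OCR.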
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Submit a pull request
- Follow ESLint configuration
- Write comprehensive tests
- Document new features
- Maintain backward compatibility
This project is licensed under the MIT License - see the LICENSE file for details.
- Issues: Create a GitHub issue for bugs or feature requests
- Documentation: Check this README and inline code comments
- Community: Join our Discord server for discussions
Q: Why is the system not detecting amounts in my document? A: Check image quality, ensure text is clear, and verify the document format matches supported patterns.
Q: Can I use my own LLM API key? A: Yes, set the GEMINI_API_KEY environment variable with your key.
Q: How accurate is the extraction? A: Accuracy depends on document quality and format. High-quality invoices typically achieve 90%+ accuracy.
Q: Can I process documents in other languages? A: Currently optimized for English, but can be extended with additional Tesseract language models.
- Initial release with core functionality
- LLM integration with Gemini
- OCR processing with Tesseract
- Comprehensive regex patterns
- RESTful API endpoints
- React frontend interface
- Enhanced JSON parsing for LLM responses
- Improved pattern matching for various formats
- Added support for "Total 2000" style inputs
- Better error handling and fallback mechanisms
- Performance optimizations
Built with ❤️ for accurate financial document processing
