Commit a040483

Add OCR_AGENT_CACHE_SIZE environment variable (#4066)
## Problem

OCR agents used unlimited caching, causing excessive memory usage. Each cached OCR agent consumes different amounts of memory, but can easily consume ~800MB.

## Solution

Add `OCR_AGENT_CACHE_SIZE` environment variable to limit cached OCR agents per process.

- **Default**: 1 cached agent
- **Configurable**: Set to 0 to disable caching, or higher for more languages

1 parent 869ef45 commit a040483

4 files changed: +16 -2 lines changed

CHANGELOG.md

Lines changed: 9 additions & 0 deletions
@@ -1,3 +1,12 @@
+## 0.18.10
+
+### Enhancements
+
+### Features
+- **Add OCR_AGENT_CACHE_SIZE environment variable** Added configurable cache size for OCR agents to control memory usage.
+
+### Fixes
+
 ## 0.18.10-dev0
 
 ### Enhancements

unstructured/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.18.10-dev0"  # pragma: no cover
+__version__ = "0.18.10"  # pragma: no cover

unstructured/partition/utils/config.py

Lines changed: 5 additions & 0 deletions
@@ -111,6 +111,11 @@ def OCR_AGENT(self) -> str:
         """OCR Agent to use"""
         return self._get_string("OCR_AGENT", OCR_AGENT_TESSERACT)
 
+    @property
+    def OCR_AGENT_CACHE_SIZE(self) -> int:
+        """Maximum number of OCR agents to cache per process"""
+        return self._get_int("OCR_AGENT_CACHE_SIZE", 1)
+
     @property
     def EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD(self) -> int:
         """extra image block content to add around an identified element(`Image`, `Table`) region
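
The new property reads an integer from the environment and falls back to 1 when the variable is unset. The body of `_get_int` is not part of this diff, so the sketch below is only an assumption about how such a lookup might behave; `get_int_from_env` is a hypothetical stand-in, not the library's helper.

```python
import os


def get_int_from_env(var: str, default: int) -> int:
    """Hypothetical stand-in for the config class's `_get_int` helper."""
    raw = os.environ.get(var)
    if raw is None:
        # Variable not set: use the default supplied by the property.
        return default
    try:
        return int(raw)
    except ValueError as exc:
        raise ValueError(f"{var} must be an integer, got {raw!r}") from exc


# Unset: the default of 1 cached agent applies.
os.environ.pop("OCR_AGENT_CACHE_SIZE", None)
unset_value = get_int_from_env("OCR_AGENT_CACHE_SIZE", 1)

# Set: the value from the environment wins.
os.environ["OCR_AGENT_CACHE_SIZE"] = "3"
set_value = get_int_from_env("OCR_AGENT_CACHE_SIZE", 1)
```

How the real helper treats malformed values is not visible in this commit; raising `ValueError` here is an illustrative choice.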

unstructured/partition/utils/ocr_models/ocr_interface.py

Lines changed: 1 addition & 1 deletion
@@ -34,7 +34,7 @@ def get_agent(cls, language: str) -> OCRAgent:
         return cls.get_instance(ocr_agent_cls_qname, language)
 
     @staticmethod
-    @functools.lru_cache(maxsize=None)
+    @functools.lru_cache(maxsize=env_config.OCR_AGENT_CACHE_SIZE)
     def get_instance(ocr_agent_module: str, language: str) -> "OCRAgent":
         module_name, class_name = ocr_agent_module.rsplit(".", 1)
         if module_name not in OCR_AGENT_MODULES_WHITELIST:
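
The effect of the `maxsize` change can be sketched with a toy loader. `make_loader` and its `created` list are hypothetical illustration names, and `object()` stands in for an expensive OCR agent. The sketch compares the new default (`1`), the disabled setting (`0`), and the previous unbounded behavior (`None`).

```python
import functools


def make_loader(cache_size):
    """Build a cached loader and a log of every real construction."""
    created = []

    @functools.lru_cache(maxsize=cache_size)
    def get_instance(language: str) -> object:
        created.append(language)  # records each cache miss
        return object()  # stand-in for an expensive OCR agent

    return get_instance, created


# maxsize=1 (new default): switching language evicts the previous agent.
get_one, created_one = make_loader(1)
for lang in ["eng", "eng", "deu", "eng"]:
    get_one(lang)

# maxsize=0: caching is disabled, so every call constructs a new agent.
get_zero, created_zero = make_loader(0)
for lang in ["eng", "eng"]:
    get_zero(lang)

# maxsize=None (old behavior): nothing is ever evicted.
get_all, created_all = make_loader(None)
for lang in ["eng", "deu", "eng"]:
    get_all(lang)
```

With `maxsize=1`, alternating between languages rebuilds an agent on every switch, which is the trade-off for the lower memory ceiling. Decorator arguments are evaluated when the class body runs, so the variable presumably has to be set before `ocr_interface` is first imported. Changing it later in the same process would not resize the cache.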
