Fix: Prevent auto-splitting of French accented words in text recognition #16994

Ihebdhouibi · 2025-11-06T10:26:08Z

Pull Request: Fix Auto-Splitting of French Accented Words in Text Recognition

📋 Summary

This PR fixes a bug where French words containing accented characters (é, è, à, ç, etc.) and contractions (n'êtes, l'été) were incorrectly split at each accented character during OCR text recognition word grouping.

🐛 Problem Description

Issue

The BaseRecLabelDecode.get_word_info() method in ppocr/postprocess/rec_postprocess.py only recognized basic ASCII letters (a-z, A-Z) as word characters. Accented characters used in French and other Latin-based languages were incorrectly classified as "splitters", causing words to be broken apart.

Example of the Bug

Before the fix:

Input: "été" (summer)
Output: 3 separate words: ["é", "t", "é"] ❌
Input: "français" (French)
Output: 3 separate words: ["fran", "ç", "ais"] ❌
Input: "n'êtes" (you are)
Output: 3 separate words: ["n", "'", "êtes"] ❌

After the fix:

Input: "été" → Output: 1 word: ["été"] ✅
Input: "français" → Output: 1 word: ["français"] ✅
Input: "n'êtes" → Output: 1 word: ["n'êtes"] ✅

✨ Solution

Changes Made

Added unicodedata import for Unicode character category detection
Implemented is_latin_char() helper function that properly identifies Latin letters with diacritics
Modified get_word_info() method to include accented characters in word grouping logic
Added apostrophe handling for French contractions

Technical Details

The fix uses Python's unicodedata module to check if a character belongs to the Letter category (L*) and has a Latin or French-based Unicode name. This ensures that characters like:

é (LATIN SMALL LETTER E WITH ACUTE)
è (LATIN SMALL LETTER E WITH GRAVE)
à (LATIN SMALL LETTER A WITH GRAVE)
ç (LATIN SMALL LETTER C WITH CEDILLA)

...are correctly recognized as word characters.

📁 Files Modified

Core Changes

ppocr/postprocess/rec_postprocess.py
- Added unicodedata import
- Added is_latin_char() function
- Modified BaseRecLabelDecode.get_word_info() method

Test Files

test_french_accents.py (new)
- Comprehensive test suite for French accented character handling
- Tests various scenarios: simple accents, contractions, mixed text

🧪 Testing

Test Coverage

The included test script validates:

Simple accented words: été, élève
Words with ç: français
Contractions with apostrophes: n'êtes, C'était
Words with à: à demain
Complex sentences with multiple accents

Running Tests

python test_french_accents.py

🔄 Backward Compatibility

✅ Fully backward compatible

This fix:

Only adds new functionality (recognition of accented characters)
Does not change behavior for existing ASCII text
Does not modify the API or function signatures
Uses standard library (unicodedata) - no new dependencies

All existing functionality remains unchanged. Code that worked before will continue to work exactly as before, with the added benefit of proper French (and other Latin-based language) support.

🌍 Impact

Languages Benefited

This fix improves OCR text recognition for all Latin-based languages that use diacritics, including:

French: é, è, ê, à, â, ù, û, ç, ï, etc.
Spanish: á, é, í, ó, ú, ñ, ü
Portuguese: ã, õ, á, é, í, ó, ú, â, ê, ô, ç
German: ä, ö, ü, ß
Italian: à, è, é, ì, ò, ù
And many more...

Use Cases

Document digitization in French-speaking regions
Multilingual OCR applications
Legal and administrative document processing
Educational material processing
International business document handling

📊 Performance Impact

Negligible performance impact:

The is_latin_char() function is only called for non-ASCII characters
Uses efficient unicodedata standard library functions
No additional loops or complex operations
Same time complexity as the original implementation

🔍 Code Quality

✅ Passes all pre-commit hooks:

black (code formatting)
flake8 (linting)
trailing whitespace check
line ending normalization

📝 Related Issues

This fix addresses the issue where French and other Latin-based language texts are incorrectly segmented during OCR post-processing, improving the accuracy and usability of PaddleOCR for international users.

✅ Checklist

Code follows project style guidelines
Self-review completed
Comments added for complex logic
No breaking changes
Test script included
Documentation updated (this PR doc)
All pre-commit hooks pass

🙏 Acknowledgments

This fix was developed and tested on real-world French OCR scenarios, ensuring practical applicability and effectiveness.

Ready for review and merge! 🚀

CLAassistant · 2025-11-06T10:26:16Z

All committers have signed the CLA.

paddle-bot · 2025-11-06T10:26:16Z

Thanks for your contribution!

Added support for Latin characters with diacritics (é, è, à, ç, etc.) and French contractions (n'êtes) in word grouping logic of BaseRecLabelDecode.get_word_info(). This fix ensures that French words are no longer split at accented characters during OCR text recognition.

paddle-bot bot added the contrib/contributor Contributor-related discussion or task. label Nov 6, 2025

Ihebdhouibi force-pushed the fix-auto-split-french-words branch from 2113bca to c37b052 Compare November 7, 2025 08:23

paddle-bot bot added the contributor label Nov 12, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix: Prevent auto-splitting of French accented words in text recognition #16994

Fix: Prevent auto-splitting of French accented words in text recognition #16994

Ihebdhouibi commented Nov 6, 2025

CLAassistant commented Nov 6, 2025 •

edited

Loading

paddle-bot bot commented Nov 6, 2025

Labels

2 participants

Fix: Prevent auto-splitting of French accented words in text recognition #16994

Are you sure you want to change the base?

Fix: Prevent auto-splitting of French accented words in text recognition #16994

Conversation

Ihebdhouibi commented Nov 6, 2025

Pull Request: Fix Auto-Splitting of French Accented Words in Text Recognition

📋 Summary

🐛 Problem Description

Issue

Example of the Bug

✨ Solution

Changes Made

Technical Details

📁 Files Modified

Core Changes

Test Files

🧪 Testing

Test Coverage

Running Tests

🔄 Backward Compatibility

🌍 Impact

Languages Benefited

Use Cases

📊 Performance Impact

🔍 Code Quality

📝 Related Issues

✅ Checklist

🙏 Acknowledgments

CLAassistant commented Nov 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

paddle-bot bot commented Nov 6, 2025

Labels

2 participants

CLAassistant commented Nov 6, 2025 •

edited

Loading