Fix: Prevent auto-splitting of French accented words in text recognition #16994
+163 −1
Pull Request: Fix Auto-Splitting of French Accented Words in Text Recognition
📋 Summary
This PR fixes a bug where French words containing accented characters (é, è, à, ç, etc.) and contractions (n'êtes, l'été) were incorrectly split at each accented character during OCR text recognition word grouping.
🐛 Problem Description
Issue
The `BaseRecLabelDecode.get_word_info()` method in `ppocr/postprocess/rec_postprocess.py` only recognized basic ASCII letters (a-z, A-Z) as word characters. Accented characters used in French and other Latin-based languages were incorrectly classified as "splitters", causing words to be broken apart.
Example of the Bug
Before the fix:
- Input: "été" (summer). Output: 3 separate words: ["é", "t", "é"] ❌
- Input: "français" (French). Output: 3 separate words: ["fran", "ç", "ais"] ❌
- Input: "n'êtes" (you are). Output: 3 separate words: ["n", "'", "êtes"] ❌
After the fix:
- "été" → Output: 1 word: ["été"] ✅
- "français" → Output: 1 word: ["français"] ✅
- "n'êtes" → Output: 1 word: ["n'êtes"] ✅
✨ Solution
Changes Made
- `unicodedata` import for Unicode character category detection
- `is_latin_char()` helper function that identifies Latin letters with diacritics
- `get_word_info()` method updated to include accented characters in the word grouping logic
Technical Details
The fix uses Python's `unicodedata` module to check whether a character belongs to a Letter category (L*) and has a Latin or French-based Unicode name. This ensures that characters like: ... are correctly recognized as word characters.
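The check described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the name `is_latin_letter` is hypothetical, and matching on the substring `LATIN` in the Unicode character name is an assumption about how "has a Latin name" might be tested.

```python
import unicodedata


def is_latin_letter(ch: str) -> bool:
    """Hypothetical stand-in for the PR's is_latin_char() helper."""
    if len(ch) != 1:
        return False
    # ASCII fast path: plain a-z / A-Z need no Unicode database lookup.
    if ch.isascii():
        return ch.isalpha()
    # Letter categories all start with "L" (Lu, Ll, Lt, Lm, Lo).
    if not unicodedata.category(ch).startswith("L"):
        return False
    # Assumption: a Latin letter is one whose Unicode name mentions LATIN,
    # e.g. "LATIN SMALL LETTER E WITH ACUTE" for "é".
    return "LATIN" in unicodedata.name(ch, "")


if __name__ == "__main__":
    for ch in "éçà'1中":
        # Show the category and name that drive the decision for each character.
        print(repr(ch), unicodedata.category(ch), unicodedata.name(ch, "?"), is_latin_letter(ch))
```

A check like this only sees precomposed characters. If the input could contain decomposed sequences (a base letter followed by a combining mark, which has category `Mn`), normalizing with `unicodedata.normalize("NFC", text)` beforehand would likely be needed.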
📁 Files Modified
Core Changes
`ppocr/postprocess/rec_postprocess.py`
- `unicodedata` import
- `is_latin_char()` function
- `BaseRecLabelDecode.get_word_info()` method
Test Files
`test_french_accents.py` (new)
🧪 Testing
Test Coverage
The included test script validates:
- `été`, `élève`
- `français`
- `n'êtes`, `C'était`
- `à demain`
Running Tests
🔄 Backward Compatibility
✅ Fully backward compatible
This fix:
- Uses only the Python standard library (`unicodedata`) - no new dependencies
All existing functionality remains unchanged. Code that worked before will continue to work exactly as before, with the added benefit of proper support for French and other Latin-based languages.
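One way to sanity-check the compatibility claim is to run the same grouping loop with an ASCII-only predicate and a Latin-aware one, then compare the results. The sketch below is a simplified model, not the real `get_word_info()`: `group_words`, `ascii_only`, and `latin_aware` are hypothetical names, only letters count as word characters, whitespace is dropped, and apostrophe joining (as in `n'êtes`) is deliberately left out.

```python
import unicodedata
from typing import Callable, List


def ascii_only(ch: str) -> bool:
    # Pre-fix behaviour as described in this PR: only a-z / A-Z are word characters.
    return ch.isascii() and ch.isalpha()


def latin_aware(ch: str) -> bool:
    # Post-fix idea (assumed): also accept non-ASCII letters with a LATIN Unicode name.
    if ch.isascii():
        return ch.isalpha()
    return unicodedata.category(ch).startswith("L") and "LATIN" in unicodedata.name(ch, "")


def group_words(text: str, is_word_char: Callable[[str], bool]) -> List[str]:
    """Group consecutive word characters; emit each non-space splitter on its own."""
    words: List[str] = []
    current = ""
    for ch in text:
        if is_word_char(ch):
            current += ch
            continue
        if current:
            words.append(current)
            current = ""
        if not ch.isspace():
            words.append(ch)
    if current:
        words.append(current)
    return words


if __name__ == "__main__":
    for sample in ("hello world", "français", "été"):
        print(sample, group_words(sample, ascii_only), group_words(sample, latin_aware))
```

In this model, pure-ASCII input produces the same groups under both predicates, while `français` goes from `["fran", "ç", "ais"]` to `["français"]`.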
🌍 Impact
Languages Benefited
This fix improves OCR text recognition for all Latin-based languages that use diacritics, including:
Use Cases
📊 Performance Impact
Negligible performance impact:
- The `is_latin_char()` function is only called for non-ASCII characters
- Uses `unicodedata` standard library functions
🔍 Code Quality
✅ Passes all pre-commit hooks:
📝 Related Issues
This fix addresses the issue where French and other Latin-based language texts are incorrectly segmented during OCR post-processing, improving the accuracy and usability of PaddleOCR for international users.
✅ Checklist
🙏 Acknowledgments
This fix was developed and tested on real-world French OCR scenarios, ensuring practical applicability and effectiveness.
Ready for review and merge! 🚀