Skip to content

Conversation

@Ihebdhouibi
Copy link

Pull Request: Fix Auto-Splitting of French Accented Words in Text Recognition

📋 Summary

This PR fixes a bug where French words containing accented characters (é, è, à, ç, etc.) and contractions (n'êtes, l'été) were incorrectly split at each accented character during OCR text recognition word grouping.

🐛 Problem Description

Issue

The BaseRecLabelDecode.get_word_info() method in ppocr/postprocess/rec_postprocess.py only recognized basic ASCII letters (a-z, A-Z) as word characters. Accented characters used in French and other Latin-based languages were incorrectly classified as "splitters", causing words to be broken apart.

Example of the Bug

Before the fix:

  • Input: "été" (summer)

  • Output: 3 separate words: ["é", "t", "é"]

  • Input: "français" (French)

  • Output: 3 separate words: ["fran", "ç", "ais"]

  • Input: "n'êtes" (you are)

  • Output: 3 separate words: ["n", "'", "êtes"]

After the fix:

  • Input: "été" → Output: 1 word: ["été"]
  • Input: "français" → Output: 1 word: ["français"]
  • Input: "n'êtes" → Output: 1 word: ["n'êtes"]

✨ Solution

Changes Made

  1. Added unicodedata import for Unicode character category detection
  2. Implemented is_latin_char() helper function that properly identifies Latin letters with diacritics
  3. Modified get_word_info() method to include accented characters in word grouping logic
  4. Added apostrophe handling for French contractions

Technical Details

The fix uses Python's unicodedata module to check if a character belongs to the Letter category (L*) and has a Latin or French-based Unicode name. This ensures that characters like:

  • é (LATIN SMALL LETTER E WITH ACUTE)
  • è (LATIN SMALL LETTER E WITH GRAVE)
  • à (LATIN SMALL LETTER A WITH GRAVE)
  • ç (LATIN SMALL LETTER C WITH CEDILLA)

...are correctly recognized as word characters.

📁 Files Modified

Core Changes

  • ppocr/postprocess/rec_postprocess.py
    • Added unicodedata import
    • Added is_latin_char() function
    • Modified BaseRecLabelDecode.get_word_info() method

Test Files

  • test_french_accents.py (new)
    • Comprehensive test suite for French accented character handling
    • Tests various scenarios: simple accents, contractions, mixed text

🧪 Testing

Test Coverage

The included test script validates:

  • Simple accented words: été, élève
  • Words with ç: français
  • Contractions with apostrophes: n'êtes, C'était
  • Words with à: à demain
  • Complex sentences with multiple accents

Running Tests

python test_french_accents.py

🔄 Backward Compatibility

Fully backward compatible

This fix:

  • Only adds new functionality (recognition of accented characters)
  • Does not change behavior for existing ASCII text
  • Does not modify the API or function signatures
  • Uses standard library (unicodedata) - no new dependencies

All existing functionality remains unchanged. Code that worked before will continue to work exactly as before, with the added benefit of proper French (and other Latin-based language) support.

🌍 Impact

Languages Benefited

This fix improves OCR text recognition for all Latin-based languages that use diacritics, including:

  • French: é, è, ê, à, â, ù, û, ç, ï, etc.
  • Spanish: á, é, í, ó, ú, ñ, ü
  • Portuguese: ã, õ, á, é, í, ó, ú, â, ê, ô, ç
  • German: ä, ö, ü, ß
  • Italian: à, è, é, ì, ò, ù
  • And many more...

Use Cases

  • Document digitization in French-speaking regions
  • Multilingual OCR applications
  • Legal and administrative document processing
  • Educational material processing
  • International business document handling

📊 Performance Impact

Negligible performance impact:

  • The is_latin_char() function is only called for non-ASCII characters
  • Uses efficient unicodedata standard library functions
  • No additional loops or complex operations
  • Same time complexity as the original implementation

🔍 Code Quality

✅ Passes all pre-commit hooks:

  • black (code formatting)
  • flake8 (linting)
  • trailing whitespace check
  • line ending normalization

📝 Related Issues

This fix addresses the issue where French and other Latin-based language texts are incorrectly segmented during OCR post-processing, improving the accuracy and usability of PaddleOCR for international users.

✅ Checklist

  • Code follows project style guidelines
  • Self-review completed
  • Comments added for complex logic
  • No breaking changes
  • Test script included
  • Documentation updated (this PR doc)
  • All pre-commit hooks pass

🙏 Acknowledgments

This fix was developed and tested on real-world French OCR scenarios, ensuring practical applicability and effectiveness.


Ready for review and merge! 🚀

@CLAassistant
Copy link

CLAassistant commented Nov 6, 2025

CLA assistant check
All committers have signed the CLA.

@paddle-bot
Copy link

paddle-bot bot commented Nov 6, 2025

Thanks for your contribution!

@paddle-bot paddle-bot bot added the contrib/contributor Contributor-related discussion or task. label Nov 6, 2025
Added support for Latin characters with diacritics (é, è, à, ç, etc.) and French contractions (n'êtes) in word grouping logic of BaseRecLabelDecode.get_word_info(). This fix ensures that French words are no longer split at accented characters during OCR text recognition.
@Ihebdhouibi Ihebdhouibi force-pushed the fix-auto-split-french-words branch from 2113bca to c37b052 Compare November 7, 2025 08:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contrib/contributor Contributor-related discussion or task. contributor

2 participants