Closed
Description
As of v0.3.3, pySBD shows destructive behavior in some edge cases even when the clean option is set to False.
When dealing with OCR text, pySBD removes the whitespace after some runs of consecutive periods: it is dropped after ".." and "....", while "..." is handled correctly.
To reproduce

    import pysbd

    splitter = pysbd.Segmenter(language="fr", clean=False)

    text = "Maissen se chargea du reste .. Logiquement,"
    print(splitter.segment(text))

    text = "Maissen se chargea du reste ... Logiquement,"
    print(splitter.segment(text))

    text = "Maissen se chargea du reste .... Logiquement,"
    print(splitter.segment(text))

Actual output
Please note the missing whitespace after the final period in the examples with ".." and "....".

    ['Maissen se chargea du reste .', '.', 'Logiquement,']
    ['Maissen se chargea du reste ... ', 'Logiquement,']
    ['Maissen se chargea du reste .', '...', 'Logiquement,']

Expected output

    ['Maissen se chargea du reste .', '. ', 'Logiquement,']
    ['Maissen se chargea du reste ... ', 'Logiquement,']
    ['Maissen se chargea du reste .', '... ', 'Logiquement,']

In general, pySBD works well. Many thanks @nipunsadvilkar. I can also look into this as soon as I find some time and open a pull request.
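For a regression test, the expected behavior could be phrased as a round-trip check: assuming that clean=False is meant to be non-destructive, joining the segments should give back the input unchanged. The sketch below is only an illustration; is_lossless is a hypothetical helper, not part of pySBD, and the segment lists are copied from the outputs reported in this issue, so it runs without pySBD installed.

```python
def is_lossless(text, segments):
    """Return True if concatenating the segments reproduces the input exactly.

    Hypothetical helper for illustration; not part of the pySBD API.
    """
    return "".join(segments) == text


text = "Maissen se chargea du reste .. Logiquement,"

# Segment lists copied from the "Actual output" and "Expected output" sections.
actual = ['Maissen se chargea du reste .', '.', 'Logiquement,']
expected = ['Maissen se chargea du reste .', '. ', 'Logiquement,']

print(is_lossless(text, actual))    # False: the space after ".." is lost
print(is_lossless(text, expected))  # True
```

In a real test, the hard-coded lists would be replaced by the result of splitter.segment(text) from the reproduction snippet.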