
Conversation


@michnov (Contributor) commented on Feb 19, 2021

Use the parameter normalize_spaces=False to preserve all whitespace in the sentence in the UDPipe way, i.e. using the SpacesAfter and SpacesBefore attributes in the MISC field. This is backward compatible with the CoNLL-U v2 SpaceAfter=No feature: a missing space after a token is marked by SpaceAfter=No (even if it follows the last token of the sentence), and a single space after a token results in no whitespace-related markup at all.

Examples:

```
$> echo -e "Hello \t world " | udapy read.Sentences $'rstrip=\r\n' tokenize.OnWhitespace normalize_spaces=0 write.Conllu
# sent_id = 1
# text = Hello world
1	Hello	_	_	_	_	0	_	_	SpacesAfter=\s\t\s
2	world	_	_	_	_	0	_	_	_

$> echo -e "Hello \t world" | udapy read.Sentences $'rstrip=\r\n' tokenize.OnWhitespace normalize_spaces=0 write.Conllu
# sent_id = 1
# text = Hello world
1	Hello	_	_	_	_	0	_	_	SpacesAfter=\s\t\s
2	world	_	_	_	_	0	_	_	SpaceAfter=No

$> echo -e "\tHello \t world" | udapy read.Sentences $'rstrip=\r\n' tokenize.OnWhitespace normalize_spaces=0 write.Conllu
# sent_id = 1
# text = Hello world
1	Hello	_	_	_	_	0	_	_	SpacesAfter=\s\t\s|SpacesBefore=\t
2	world	_	_	_	_	0	_	_	SpaceAfter=No

$> echo -e "\tHello \t world " | udapy read.Sentences $'rstrip=\r\n' tokenize.OnWhitespace write.Conllu
# sent_id = 1
# text = Hello world
1	Hello	_	_	_	_	0	_	_	_
2	world	_	_	_	_	0	_	_	_
```

Note that if the text is loaded using read.Sentences and all whitespace needs to be preserved (in order to be able to reconstruct the original document), the read.Sentences block must be called with rstrip=\n or rstrip=\r\n to prevent stripping of the trailing whitespace.
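For illustration, the convention can be sketched in a few lines of Python. This is only a sketch; the helper name whitespace_misc is hypothetical and not part of udapi:

```python
def whitespace_misc(spaces_after):
    """Map the raw whitespace following a token to MISC markup.

    Hypothetical helper for illustration only (not the udapi API):
    single space -> no markup, no space -> SpaceAfter=No,
    anything else -> SpacesAfter with escaped whitespace.
    """
    if spaces_after == ' ':
        return None                  # the default: no markup needed
    if spaces_after == '':
        return 'SpaceAfter=No'       # applies even after the last token
    table = str.maketrans({' ': r'\s', '\t': r'\t', '\r': r'\r', '\n': r'\n'})
    return 'SpacesAfter=' + spaces_after.translate(table)

print(whitespace_misc(' \t '))       # prints: SpacesAfter=\s\t\s
```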

The basic whitespace tokenizer now keeps the extended information on whitespace. This is done the way UDPipe does it, i.e. using the following MISC attributes:
* SpaceAfter=No
* SpacesAfter='\t\s\n'
* SpacesBefore='\s\s\s'

Further commits:
* use `fill_spaces` to fill in the extra whitespace MISC features
* its usage and combination with `read.Sentences` documented
* the init parameter `tokenizer_params` was committed by mistake => reverting
* the parameter renamed to match the parameter in UDPipe
* fix: if normalize_spaces=True, SpaceAfter=No is never set for the last token in the sentence

```python
@staticmethod
def escape_whitespace(string):
    string = re.sub(r' ', r'\\s', string)
```
Contributor:

If I understand it correctly, it should be:
```python
string = re.sub(' ', r'\s', string)
```
which could also be written as
```python
string = re.sub(' ', '\\s', string)
```

Have you checked the code? Is there a test?

Contributor Author:

Yes, I didn't realize I don't need to escape \s in the replacement string. However, it seems to work the same with all three variants of the replacement string, which is why I didn't notice. I'll change it to the first variant you suggest.

Of course, I checked the code as well as the output on several real-world and made-up examples. But I didn't write a unit test for it.

Contributor:

I was wrong, sorry. I forgot that re.sub interprets the replacement parameter as well (e.g. for \1). You can use either
```python
string = re.sub('\t', r'\\t', string)
```
or
```python
string = re.sub(r'\t', r'\\t', string)
```
In the first case, re.sub gets a single-character string (a tab) as the pattern. In the second case, it gets the two-character string \t, but the regular expression engine interprets it as a tab.
If you use
```python
string = re.sub('\t', r'\t', string)
```
the escape decoding is applied to the replacement as well, so you would store a tab character in MISC, which is not what you want.
If you use (as I suggested)
```python
string = re.sub(' ', r'\s', string)
```
you will get re.error: bad escape \s at position 0, because \s cannot appear in the replacement.

That said, my final suggestion
```python
str.maketrans({' ': r'\s', '\t': r'\t', '\r': r'\r', '\n': r'\n'})
```
is correct because there are no regular expressions involved.
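For reference, here is the settled approach as a minimal runnable sketch; the constant and function names are illustrative, not necessarily the identifiers used in the merged code:

```python
# Illustrative sketch of the maketrans/translate approach from this thread.
ESCAPE_WHITESPACE_TABLE = str.maketrans(
    {' ': r'\s', '\t': r'\t', '\r': r'\r', '\n': r'\n'})

def escape_whitespace(string):
    """Encode raw whitespace as UDPipe-style escapes for the MISC field."""
    return string.translate(ESCAPE_WHITESPACE_TABLE)

assert escape_whitespace(' \t ') == r'\s\t\s'
```

Since str.translate performs plain character mapping, none of the regex escaping pitfalls above apply.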

Contributor Author:

Yes, the commands you suggested wouldn't work for '\t', '\r' or '\n'. I thought your comment concerned only '\s', because all three ways of writing it actually work in Python 3.6.3:

```python
In [1]: a = " "

In [2]: import re

In [3]: re.sub(' ', r'\s', a)
Out[3]: '\\s'

In [4]: re.sub(' ', r'\\s', a)
Out[4]: '\\s'

In [5]: re.sub(' ', '\\s', a)
Out[5]: '\\s'
```

Anyway, I changed it to the maketrans + translate solution.

Contributor Author:

Should I add a test?

Contributor:

`re.sub(' ', '\s', 'a b')` works in Python 3.6.9, but fails in Python 3.7.0.

If you have time and energy to add a test, it would be nice, but there are so many tests missing (my fault)...
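A test might look roughly like the following sketch. It exercises the escaping table directly, since the actual tokenizer API and the udapi test-suite layout are not shown in this thread:

```python
import unittest

class TestEscapeWhitespace(unittest.TestCase):
    """Sketch of a possible unit test for the whitespace escaping."""

    TABLE = str.maketrans({' ': r'\s', '\t': r'\t', '\r': r'\r', '\n': r'\n'})

    def test_escape(self):
        self.assertEqual(' \t\r\n'.translate(self.TABLE), r'\s\t\r\n')

    def test_plain_text_unchanged(self):
        self.assertEqual('Hello'.translate(self.TABLE), 'Hello')

if __name__ == '__main__':
    unittest.main()
```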

```python
node.misc["SpacesBefore"] = spaces_before.translate(self.escape_whitespace_table)
if not spaces_after:
    node.misc["SpaceAfter"] = 'No'
elif spaces_after != " ":
```
Contributor:

Alternatively, escape_whitespace_table could be a module-level constant, instead of class-level, but I have no strong preference here, so I will merge this now.
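The quoted hunk cuts off at the elif; presumably it continues by storing the escaped gap in SpacesAfter, mirroring the SpacesBefore assignment. A hedged sketch of that logic (a guess based on the convention described in this PR, not the merged code):

```python
def set_space_misc(node, spaces_before, spaces_after, table):
    """Hypothetical sketch of the whitespace-to-MISC logic above."""
    if spaces_before:
        node.misc["SpacesBefore"] = spaces_before.translate(table)
    if not spaces_after:
        node.misc["SpaceAfter"] = 'No'
    elif spaces_after != " ":
        # a single space is the default and needs no markup;
        # any other gap is stored escaped in SpacesAfter
        node.misc["SpacesAfter"] = spaces_after.translate(table)
```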

@martinpopel merged commit a286249 into master on Feb 19, 2021.
@martinpopel (Contributor):
Thanks a lot for all the effort.

@michnov deleted the ws_tokenizer_spaces_in_misc branch on Feb 21, 2021.