
Conversation


@michnov (Contributor) commented on Feb 19, 2021

Use the parameter normalize_spaces=False to preserve all whitespace in the sentence in the UDPipe way, i.e. using the SpacesAfter and SpacesBefore attributes in the MISC field. This is backward compatible with the CoNLL-U v2 SpaceAfter=No feature: a missing space after a token is marked by SpaceAfter=No (even if it follows the last token of the sentence), and a single space after a token results in no whitespace-related markup at all.

Examples:

```
$> echo -e "Hello \t world " | udapy read.Sentences $'rstrip=\r\n' tokenize.OnWhitespace normalize_spaces=0 write.Conllu
# sent_id = 1
# text = Hello world
1	Hello	_	_	_	_	0	_	_	SpacesAfter=\s\t\s
2	world	_	_	_	_	0	_	_	_

$> echo -e "Hello \t world" | udapy read.Sentences $'rstrip=\r\n' tokenize.OnWhitespace normalize_spaces=0 write.Conllu
# sent_id = 1
# text = Hello world
1	Hello	_	_	_	_	0	_	_	SpacesAfter=\s\t\s
2	world	_	_	_	_	0	_	_	SpaceAfter=No

$> echo -e "\tHello \t world" | udapy read.Sentences $'rstrip=\r\n' tokenize.OnWhitespace normalize_spaces=0 write.Conllu
# sent_id = 1
# text = Hello world
1	Hello	_	_	_	_	0	_	_	SpacesAfter=\s\t\s|SpacesBefore=\t
2	world	_	_	_	_	0	_	_	SpaceAfter=No

$> echo -e "\tHello \t world " | udapy read.Sentences $'rstrip=\r\n' tokenize.OnWhitespace write.Conllu
# sent_id = 1
# text = Hello world
1	Hello	_	_	_	_	0	_	_	_
2	world	_	_	_	_	0	_	_	_
```

Note that if the text is loaded using read.Sentences and all whitespace needs to be preserved (in order to be able to reconstruct the original document), the read.Sentences block must be called with rstrip=\n or rstrip=\r\n to prevent stripping of the trailing whitespace.
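For illustration, the convention can be sketched in a few lines of Python. This is only a sketch; the helper name whitespace_misc is hypothetical and not part of udapi:

```python
def whitespace_misc(spaces_after):
    """Map the raw whitespace following a token to MISC markup.

    Hypothetical helper for illustration only (not the udapi API):
    single space -> no markup, no space -> SpaceAfter=No,
    anything else -> SpacesAfter with escaped whitespace.
    """
    if spaces_after == ' ':
        return None                  # the default: no markup needed
    if spaces_after == '':
        return 'SpaceAfter=No'       # applies even after the last token
    table = str.maketrans({' ': r'\s', '\t': r'\t', '\r': r'\r', '\n': r'\n'})
    return 'SpacesAfter=' + spaces_after.translate(table)

print(whitespace_misc(' \t '))       # prints: SpacesAfter=\s\t\s
```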

The basic whitespace tokenizer now keeps the extended information on whitespace. This is done the way UDPipe does it, i.e. using the following MISC attributes:
* SpaceAfter=No
* SpacesAfter='\t\s\n'
* SpacesBefore='\s\s\s'

Further commits:
* use `fill_spaces` to fill in the extra whitespace MISC features
* its usage and combination with `read.Sentences` documented
* the init parameter `tokenizer_params` was committed by mistake => reverting
* the parameter renamed to match the parameter in UDPipe
* fix: if normalize_spaces=True, SpaceAfter=No is never set for the last token in the sentence

```python
@staticmethod
def escape_whitespace(string):
    string = re.sub(r' ', r'\\s', string)
```
Contributor:

If I understand it correctly, it should be:
```python
string = re.sub(' ', r'\s', string)
```
which could also be written as
```python
string = re.sub(' ', '\\s', string)
```

Have you checked the code? Is there a test?

Contributor Author:

Yes, I didn't realize I don't need to escape \s in the replacement string. However, it seems to work the same with all three variants of the replacement string, which is why I didn't notice. I'll change it to the first variant you suggest.

Of course, I checked the code as well as the output on several real-world and made-up examples. But I didn't write a unit test for it.

Contributor:

I was wrong, sorry. I forgot that re.sub interprets the replacement parameter as well (e.g. for \1). You can use either
```python
string = re.sub('\t', r'\\t', string)
```
or
```python
string = re.sub(r'\t', r'\\t', string)
```
In the first case, re.sub gets a single-character string (a tab) as the pattern. In the second case, it gets the two-character string \t, but the regular expression engine interprets it as a tab.
If you use
```python
string = re.sub('\t', r'\t', string)
```
the escape decoding is applied to the replacement as well, so you would store a tab character in MISC, which is not what you want.
If you use (as I suggested)
```python
string = re.sub(' ', r'\s', string)
```
you will get re.error: bad escape \s at position 0, because \s cannot appear in the replacement.

That said, my final suggestion
```python
str.maketrans({' ': r'\s', '\t': r'\t', '\r': r'\r', '\n': r'\n'})
```
is correct because there are no regular expressions involved.
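For reference, here is the settled approach as a minimal runnable sketch; the constant and function names are illustrative, not necessarily the identifiers used in the merged code:

```python
# Illustrative sketch of the maketrans/translate approach from this thread.
ESCAPE_WHITESPACE_TABLE = str.maketrans(
    {' ': r'\s', '\t': r'\t', '\r': r'\r', '\n': r'\n'})

def escape_whitespace(string):
    """Encode raw whitespace as UDPipe-style escapes for the MISC field."""
    return string.translate(ESCAPE_WHITESPACE_TABLE)

assert escape_whitespace(' \t ') == r'\s\t\s'
```

Since str.translate performs plain character mapping, none of the regex escaping pitfalls above apply.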

Contributor Author:

Yes, the commands you suggested wouldn't work for '\t', '\r' or '\n'. I thought your comment concerned only '\s', because all three ways of writing it actually work in Python 3.6.3:

```python
In [1]: a = " "

In [2]: import re

In [3]: re.sub(' ', r'\s', a)
Out[3]: '\\s'

In [4]: re.sub(' ', r'\\s', a)
Out[4]: '\\s'

In [5]: re.sub(' ', '\\s', a)
Out[5]: '\\s'
```

Anyway, I changed it to the maketrans + translate solution.

Contributor Author:

Should I add a test?

Contributor:

`re.sub(' ', '\s', 'a b')` works in Python 3.6.9, but fails in Python 3.7.0.

If you have time and energy to add a test, it would be nice, but there are so many tests missing (my fault)...
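A test might look roughly like the following sketch. It exercises the escaping table directly, since the actual tokenizer API and the udapi test-suite layout are not shown in this thread:

```python
import unittest

class TestEscapeWhitespace(unittest.TestCase):
    """Sketch of a possible unit test for the whitespace escaping."""

    TABLE = str.maketrans({' ': r'\s', '\t': r'\t', '\r': r'\r', '\n': r'\n'})

    def test_escape(self):
        self.assertEqual(' \t\r\n'.translate(self.TABLE), r'\s\t\r\n')

    def test_plain_text_unchanged(self):
        self.assertEqual('Hello'.translate(self.TABLE), 'Hello')

if __name__ == '__main__':
    unittest.main()
```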

```python
node.misc["SpacesBefore"] = spaces_before.translate(self.escape_whitespace_table)
if not spaces_after:
    node.misc["SpaceAfter"] = 'No'
elif spaces_after != " ":
```
Contributor:

Alternatively, escape_whitespace_table could be a module-level constant, instead of class-level, but I have no strong preference here, so I will merge this now.
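The quoted hunk cuts off at the elif; presumably it continues by storing the escaped gap in SpacesAfter, mirroring the SpacesBefore assignment. A hedged sketch of that logic (a guess based on the convention described in this PR, not the merged code):

```python
def set_space_misc(node, spaces_before, spaces_after, table):
    """Hypothetical sketch of the whitespace-to-MISC logic above."""
    if spaces_before:
        node.misc["SpacesBefore"] = spaces_before.translate(table)
    if not spaces_after:
        node.misc["SpaceAfter"] = 'No'
    elif spaces_after != " ":
        # a single space is the default and needs no markup;
        # any other gap is stored escaped in SpacesAfter
        node.misc["SpacesAfter"] = spaces_after.translate(table)
```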

@martinpopel merged commit a286249 into master on Feb 19, 2021.
@martinpopel (Contributor):
Thanks a lot for all the effort.

@michnov deleted the ws_tokenizer_spaces_in_misc branch on Feb 21, 2021.