-
Couldn't load subscription status.
- Fork 33
Whitespaces can be stored in the conllu file, so that the original text can be reconstructed back #71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The basic whitespace tokenizer keeps the extended information on whitespace from now on. This is done in the way as UDPipe does it, i.e. using in the following MISC attributes: - SpaceAfter=No - SpacesAfter='\t\s\n' - SpacesBefore='\s\s\s'
* use `fill_spaces` to fill in the extra whitespace MISC features * its usage and combination with `read.Sentences` documented * the init parameter `tokenizer_params` was commited by mistake => reverting
* the parameter renamed to match the parameter in UDPipe * fix: if normalize_spaces=True, SpaceAfter=No is never set for the last token in the sentence
udapi/block/tokenize/onwhitespace.py Outdated
| | ||
| @staticmethod | ||
| def escape_whitespace(string): | ||
| string = re.sub(r' ', r'\\s', string) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand it correctly, it should be:
string = re.sub(' ', r'\s', string)
which could be also written as
string = re.sub(' ', '\\s', string)
Have you checked the code? Is there a test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I didn't realize I don't need to escape \s in the replacement string. However, it seems to work the same in any of the three variants of the replacement string. That's why I didn't notice. I'll change it to the first variant you suggest.
Of course, I checked the code as well as the output on several real-world and made-up examples. But I didn't write a unit test for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wrong, sorry. I forgot re.sub interprets also the replacement parameter (e.g. for \1). You can use either
string = re.sub('\t', r'\\t', string)
or
string = re.sub(r'\t', r'\\t', string)
In the first case, re.sub gets a single-character string as the pattern. In the second case, a two-char string \t, but regular expressions interpret it as tab.
If you use
string = re.sub('\t', r'\t', string)
regular expression decoding is applied also on the replacement, so you would store a tab character in MISC, which is not what you want.
If you use (as I suggested)
string = re.sub(' ', r'\s', string)
you will get re.error: bad escape \s at position 0, because \s cannot appear in the replacement.
That said, my final suggestion
str.maketrans({' ':r'\s', '\t':r'\t', '\r':r'\r', '\n':r'\n'})
is correct because there are no regular expressions involved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the commands you suggested wouldn't work for '\t', '\r' or '\n'. I thought your comment concerns only '\s'. Because all three ways of writing it actually work in Python 3.6.3:
In [1]: a = " " In [2]: import re In [3]: re.sub(' ', r'\s', a) Out[3]: '\\s' In [4]: re.sub(' ', r'\\s', a) Out[4]: '\\s' In [5]: re.sub(' ', '\\s', a) Out[5]: '\\s' Anyway, I changed it to the maketrans + translate solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I add a test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re.sub(' ', '\s', 'a b') works in Python 3.6.9, but fails in Python 3.7.0.
If you have time and energy to add a test, it would be nice, but there are so many tests missing (my fault)...
| node.misc["SpacesBefore"] = spaces_before.translate(self.escape_whitespace_table) | ||
| if not spaces_after: | ||
| node.misc["SpaceAfter"] = 'No' | ||
| elif spaces_after != " ": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, escape_whitespace_table could be a module-level constant, instead of class-level, but I have no strong preference here, so I will merge this now.
| Thanks a lot for all the effort. |
Use the parameter
normalize_spaces=Falseto preserve all whitespaces in the sentence in the UDPipe way, i.e. using theSpacesAfterandSpacesBeforefeatures in the MISC field. It is backward compatible with CoNLL-U v2SpaceAfter=Nofeature. That is, a missing space after the token is marked bySpaceAfter=No, even if it follows the last token of the sentence, and a single space after the token results in no whitespace-related markup at all.Examples:
Note that if loading the text using
read.Sentencesand all whitespaces need to be preserved (in order to be able to reconstruct the original document), theread.Sentencesblock must be called withrstrip=\norrstrip=\r\nto prevent stripping the trailing whitespace.