Adjustments to correctly work with texts loaded with whitespace included #75

michnov · 2021-02-25T15:42:32Z

These changes affect only documents read with read.Sentences using the parameter rstrip='', rstrip='\n' or rstrip='\r\n'.

The most important change prevents write.Conllu from printing unescaped whitespace. When tokenize.OnWhitespace with keep_spaces=1 or udpipe.Base is used, whitespace is already escaped and preserved in the MISC features SpaceAfter, SpacesAfter and SpacesBefore. This PR addresses the Root.text attribute, which keeps all whitespace in unescaped form throughout processing. That whitespace has to be removed or escaped when writing the CoNLL-U format. We decided to remove every '\n' and '\r' from the Root.text attribute and to strip whitespace from its end. Multiple spaces or tabs between tokens are therefore kept and printed by write.Conllu.
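A minimal sketch of the cleanup described above, for illustration only. The helper name `sanitize_root_text` is hypothetical and is not the function used in udapi; it only mirrors the two stated rules (drop every '\r' and '\n', strip trailing whitespace).

```python
def sanitize_root_text(text):
    """Return text that is safe to print in a CoNLL-U '# text = ...' comment.

    Hypothetical helper, not the udapi implementation.
    """
    if text is None:
        return None
    # Rule 1: remove CR and LF anywhere in the text.
    text = text.replace('\r', '').replace('\n', '')
    # Rule 2: strip all whitespace from the end of the text.
    return text.rstrip()


# Spaces and tabs between tokens survive; line breaks and trailing blanks do not.
cleaned = sanitize_root_text("Hello  world\t!\r\n")
```

Note that removing a line break joins the text on both sides of it without inserting a space, which matches "remove" rather than "normalize".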
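A warning about using the rstrip='' parameter was added to its documentation.
The normalize_spaces parameter of tokenize.OnWhitespace was renamed to keep_spaces, with False as the default value (instead of the previous True). The meaning is inverted (normalize_spaces=False corresponds to keep_spaces=True), so the default behavior is unchanged.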
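To make the keep_spaces behavior concrete, here is a rough sketch of whitespace tokenization that records the original spacing in MISC-style features. The function `tokenize_on_whitespace` and its return shape are assumptions made for this example, not the tokenize.OnWhitespace code, and the escape scheme (`\s`, `\t`, `\r`, `\n`) is assumed to follow the UDPipe convention.

```python
import re

# Assumed escape scheme for whitespace stored in MISC (UDPipe-style).
_ESCAPES = {' ': r'\s', '\t': r'\t', '\r': r'\r', '\n': r'\n'}


def escape_spaces(whitespace):
    """Escape whitespace characters so they can be stored in MISC."""
    return ''.join(_ESCAPES.get(ch, ch) for ch in whitespace)


def tokenize_on_whitespace(text, keep_spaces=False):
    """Split text on whitespace; return a list of (form, misc) pairs.

    Hypothetical sketch: with keep_spaces=False the spacing is discarded,
    with keep_spaces=True it is preserved in escaped form.
    """
    if not keep_spaces:
        return [(form, {}) for form in text.split()]
    tokens = []
    leading = re.match(r'\s*', text).group()
    for match in re.finditer(r'(\S+)(\s*)', text):
        form, after = match.group(1), match.group(2)
        misc = {}
        if not tokens and leading:
            # Whitespace before the first token.
            misc['SpacesBefore'] = escape_spaces(leading)
        if after == '':
            misc['SpaceAfter'] = 'No'
        elif after != ' ':
            # A single plain space is the default and needs no feature.
            misc['SpacesAfter'] = escape_spaces(after)
        tokens.append((form, misc))
    return tokens


tokens = tokenize_on_whitespace(" a  b\tc\n", keep_spaces=True)
```

With keep_spaces=False the same input yields three tokens with empty MISC, which is why the unescaped Root.text still needs separate handling when the document is written.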

michnov added 4 commits February 25, 2021 11:48

preventing the text attribute from being invalid

5f08a47

* removing \r and \n from anywhere in the text attribute
* stripping all whitespace from the end of the text attribute

faster

8e9adbf

warning for using read.Sentences rstrip=''

f615108

normalize_spaces=False -> keep_spaces=True

a12435b

martinpopel reviewed Feb 25, 2021

udapi/block/tokenize/onwhitespace.py (outdated, resolved)

bugfix

b413b41

martinpopel merged commit e2fe7e0 into master Feb 25, 2021

martinpopel deleted the ws_text branch February 25, 2021 17:48
