@michnov michnov commented Feb 25, 2021

These changes affect only documents that are read using read.Sentences with the parameter rstrip='', rstrip='\n' or rstrip='\r\n'.

  1. The most important change prevents write.Conllu from printing unescaped whitespace. Whitespace is already escaped and preserved in the MISC features SpaceAfter, SpacesAfter and SpacesBefore when tokenize.OnWhitespace with keep_spaces=1 or udpipe.Base is used. This change addresses the Root.text attribute, which keeps all whitespace in unescaped form throughout processing; that whitespace has to be removed or escaped when writing the CoNLL-U format. We have decided to remove every '\n' and '\r' from the Root.text attribute and to strip whitespace from its end. Multiple spaces or tabs between tokens are thus kept and printed by write.Conllu.
  2. A warning about using the rstrip='' parameter was added to its documentation.
  3. The parameter normalize_spaces in tokenize.OnWhitespace was renamed to keep_spaces, with False as the default value (instead of the previous True).
* removing \r and \n from anywhere in the text attribute
* stripping all whitespace from the end of the text attribute
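
The two text-cleaning steps can be illustrated with a minimal sketch. The helper name `clean_root_text` is hypothetical and is not taken from the Udapi code base; it only assumes that the text is a plain Python string and that the newline characters are deleted rather than replaced by a space.

```python
def clean_root_text(text):
    """Hypothetical helper (not the actual Udapi code).

    Drops every carriage return and line feed anywhere in the string,
    then strips all trailing whitespace. Spaces and tabs between
    tokens are left untouched.
    """
    without_newlines = text.replace('\r', '').replace('\n', '')
    return without_newlines.rstrip()


# Inner double space and tab survive; the trailing " \t\r\n" is gone.
cleaned = clean_root_text("Hello  world\tagain \t\r\n")
```

Stripping only from the end matches the description above: whitespace inside the sentence is kept, so a writer can emit the text on a single line without altering the spacing between tokens.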
@martinpopel martinpopel merged commit e2fe7e0 into master Feb 25, 2021
@martinpopel martinpopel deleted the ws_text branch February 25, 2021 17:48