

@michnov
Contributor

@michnov michnov commented Feb 24, 2021

  1. Handle multiple consecutive spaces within input sentences, as well as a space at the start or end of a sentence
    => these no longer produce a sentence boundary
  2. Do not strip a possible trailing space from the sentence. When tokenize.Simple is applied afterwards with normalize_spaces=0, it correctly fills the whitespace-related MISC features => the text can be reconstructed in its original form
  3. If an abbreviated first name is the first word of a quoted segment, strip the opening quotation mark before checking whether the word consists of two characters, e.g. in "„A. Merkel ..."
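
Item 3 can be sketched roughly as follows. This is only an illustration under assumptions, not the code from this PR: the helper name `looks_like_name_initial` and the set `OPENING_QUOTES` are hypothetical, and the actual check in the segmenter may differ.

```python
# Hypothetical set of opening quotation marks that may precede the first word.
OPENING_QUOTES = '"\'„“‚‘»«'


def looks_like_name_initial(word: str) -> bool:
    """Return True if `word` is an uppercase letter followed by a period.

    One leading quotation mark is ignored, so that a quoted segment
    starting with an initial (e.g. „A.) is treated like a bare "A.".
    """
    if word and word[0] in OPENING_QUOTES:
        word = word[1:]  # drop the opening quotation mark before measuring length
    return len(word) == 2 and word[0].isupper() and word[1] == '.'
```

With this check, the period after the initial in „A. Merkel would not be taken as a sentence end, while a longer word such as „Ab. would not be treated as an initial.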
1. deal with multiple consecutive spaces in input sentences or the space at the start/end of the sentence => no sentence boundary
2. if an abbreviation of the first name is the first word of a quoted segment, delete the starting quotation mark to find out if the word consists of two chars
@michnov
Contributor Author

michnov commented Feb 24, 2021

The problem that makes (2) necessary could possibly be fixed in a better way. I may think about it later.

@michnov michnov closed this Feb 24, 2021
@martinpopel
Contributor

I would suggest: if not self.normalize_spaces: segments[-1] += ' ' (and set normalize_spaces=True by default). Then I would happily merge this PR (if you re-open it).
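
To make the suggestion concrete, here is a minimal sketch of how such a flag might behave. `ToySegmenter` and its splitting rule are hypothetical stand-ins, not the actual segmenter from this PR; only the line `segments[-1] += ' '` and the `normalize_spaces` default are taken from the comment above.

```python
import re


class ToySegmenter:
    """Hypothetical stand-in for a segmenter block (not the real implementation)."""

    def __init__(self, normalize_spaces=True):
        self.normalize_spaces = normalize_spaces

    def segment(self, text):
        # Toy rule: split after sentence-final punctuation followed by spaces.
        segments = re.split(r'(?<=[.!?]) +', text.strip())
        if not self.normalize_spaces:
            # The suggested line: keep a trailing space on the last segment.
            segments[-1] += ' '
        return segments


print(ToySegmenter().segment('Hi. Bye. '))
print(ToySegmenter(normalize_spaces=False).segment('Hi. Bye. '))
```

With the default, the trailing space is dropped; with `normalize_spaces=False`, the last segment keeps it, so a later tokenizer can record it.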

@michnov
Contributor Author

michnov commented Feb 24, 2021

Thanks. I closed it because I immediately discovered a related issue caused by using read.Sentences with rstrip=''. The tree.text then contains \n or \r characters, which cause problems in write.Conllu. For the time being, I fixed it locally in write.Conllu by stripping the trailing \n or \r. Now I see that it's not as closely related to issue (2) as I thought at 3am 😆 Anyway, I'll add it to this PR.
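
The local workaround described above might look roughly like this. The function `text_comment` is a hypothetical illustration and not the actual write.Conllu code; it only assumes the standard CoNLL-U `# text = ...` sentence comment.

```python
def text_comment(text: str) -> str:
    """Format a CoNLL-U '# text = ...' comment line.

    Trailing line breaks (\\n or \\r) are removed so the comment stays on
    one line; a trailing space is kept, since it is part of the sentence.
    """
    return '# text = ' + text.rstrip('\r\n')


print(text_comment('Hello world. \r\n'))
```

Stripping only `\r` and `\n` (rather than all whitespace) keeps the trailing space that item (2) of the PR deliberately preserves.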

@michnov michnov reopened this Feb 25, 2021
@michnov michnov changed the title from "Improvements in Simple segmenter" to "Improvements to Simple segmenter" Feb 25, 2021
@martinpopel martinpopel merged commit 7cb814e into master Feb 25, 2021
@martinpopel
Contributor

Thanks.

@martinpopel martinpopel deleted the segment_simple branch February 25, 2021 17:37