rnnlm dataprep: the only valid 2-col splitter is: ' ' (space) #2455
Conversation
scripts/rnnlm/get_unigram_probs.py Outdated
     for line in f:
    -    fields = line.split()
    +    fields = line.split(' ')
         assert len(fields) == 2
This was definitely a bug. The rule for things like words.txt in Kaldi is that they should never contain anything which, when interpreted as an ASCII character, is a space.
And in general, we don't even assume that things like text files are actually encoded in UTF-8 or a compatible encoding; we only require that the spaces between words be ASCII space (' ').
So I believe the correct fix here would be to change encoding="utf-8" to encoding="latin-1".
Would you mind testing whether that change works for your setup?
Make sure there are no similar issues in the rnnlm/ subdirectory.
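For concreteness, a minimal sketch of the fix being discussed (a hypothetical snippet, not the actual patch; counts_file is a stand-in name):

    # Reading as latin-1 cannot fail to decode: every byte maps to exactly one
    # code point, and the ASCII space 0x20 still decodes to ' '.
    counts_file = 'unigram_counts.txt'  # stand-in path
    with open(counts_file, 'r', encoding='latin-1') as f:
        for line in f:
            fields = line.rstrip('\n').split(' ')  # split on ASCII space only
            assert len(fields) == 2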
Hi, okay, so, if the only requirement is to use the ASCII space ' ' as the separator, that is also fine for texts in UTF-8 encoding: the code for the ASCII space is always 0x20.
If we changed encoding="utf-8" to encoding="latin-1" in the python scripts, further development of the scripts would become more difficult, as the printed output would become incorrect (hex escapes instead of UTF-8 symbols). So I'd keep the encoding filter as it is.
The problem was that line.split() splits the string on any whitespace character, while with line.split(' ') we narrow it down to split only on the ASCII space 0x20, and then it does not matter whether the line is a UTF-8 string or a byte string (UTF-8 is by design backward-compatible with ASCII: https://en.wikipedia.org/wiki/UTF-8).
So, I'll look for other split() commands in the directory and change them if necessary...
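For illustration, the difference in one line (a hypothetical example, runnable in any Python 3):

    line = 'hello\tworld foo\n'
    print(line.split())     # ['hello', 'world', 'foo']   -- any whitespace run splits
    print(line.split(' '))  # ['hello\tworld', 'foo\n']   -- only the ASCII space 0x20 splits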
(force-pushed: 3a580ed to d7379e1)
Done! The data preparation seems to work well. There is no error. It would be good to test it with one of your other recipes too... (maybe one of the JHU students could do it, before merging-in?)
Karel, there is another reason to use latin-1, and it explains why we are intending to standardize its use everywhere. It's that the scripts are supposed to support other non-UTF8 (but ASCII-compatible) encodings, such as GBK. So please change as I asked, by reading as latin-1. Dan
... and regarding printing: for programs like this, we should probably set the encoding for sys.stdout to latin-1 as well. The intention is that the output should look the same as the input, when interpreted as a bytestring, because programs like this have no need to break up words internally or even know what the encoding is, other than being able to split on whitespace. But we can leave that till later; I'm not sure of the correct invocation to do it.
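One possible invocation (a sketch, not something settled in this thread) is to re-wrap the underlying byte stream:

    import io
    import sys

    # After this, text printed to stdout is encoded as latin-1, so the output
    # bytes match input bytes that were read with encoding='latin-1'.
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='latin-1')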
Aha, well, still I have mixed feelings about using 'latin-1'. It seems to go against the current trend, in which UTF-8 is more and more widespread: https://en.wikipedia.org/wiki/UTF-8#/media/File:Utf8webgrowth.svg On the other hand, I checked that the ASCII space 0x20 cannot be the 2nd/3rd/4th byte of a UTF-8 character, nor the 2nd GBK byte. So I will do the change and test it, in the way you prefer it...
Hi Karel, the latin-1 encoding in this context in effect says 'I don't care, as long as space is 0x20', which should be true for encodings where the smallest unit is 8 bits, i.e. including UTF-8. It will not work with UCS encodings (smallest units 16 or 32 bits), but we don't really support those. Y.
Did you test it already?
Well, I tested it partially. The data-prep is running fine, while I have a problem with training the model (I'm getting the CUSPARSE error I described before in #2448). But it is true that I did not finish the 'testing by lattice rescoring'.
Hi, I found another issue: in rnnlm training there is multi-threaded sampling of targets, but the script was requesting just one queue slot for all the threads... See the fix: c6193c5 (it adds the --num-threads option shown in the train_rnnlm.sh diff below).
Hmm, still getting that "CUSPARSE_STATUS_EXECUTION_FAILED". I don't think I can finish the test on our cluster, it keeps on crashing every 3rd-5th iteration, this is really annoying :(
And the 'CUDA_CACHE_DISABLE=1' did not help me...
But the data preparation seems to be fine...
Karel, did you remember the "export", i.e. "export CUDA_CACHE_DISABLE=1"? It's in your path.sh?
Yes, I have it with 'export'; then I also tried the more explicit variant:
OK, looking back at my emails, it looks like others have encountered this problem before. I would really appreciate it if you could run the program in cuda-memcheck -- preferably repeatedly, on some kind of automated loop, on a machine where you previously had the error -- and wait till you get the error. I really don't understand what the issue is; it could be some subtle concurrency problem. In parallel, you could just go ahead and train your model with retry.pl. But obviously back up some files that are sufficient to reproduce the problem. Just for the record, can you also let us know the CUDA version, the driver version, and which GPU hardware types you get the failure on?
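Such an automated loop might look like this (a sketch; the failing training command is passed in as arguments, since the exact invocation isn't shown here):

    #!/usr/bin/env bash
    # Usage: ./memcheck_loop.sh <failing command> [args...]
    # Re-runs the given command under cuda-memcheck until an error is reported.
    i=0
    while cuda-memcheck --error-exitcode 1 "$@" > memcheck.$i.log 2>&1; do
      echo "run $i passed"
      i=$((i+1))
    done
    echo "failure on run $i; see memcheck.$i.log"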
Karel: (on c09) [error log elided] Other examples: (on c03) [error log elided] Can you see whether you can find the same on your system?
scripts/rnnlm/choose_features.py Outdated
     with open(vocab_file, 'r', encoding="latin-1") as f:
         for line in f:
    -        fields = line.split()
    +        fields = line.split(' ')
@vesis84, you can remove the ' ' here and elsewhere. If the file contained tabs, we do want to split on those as well, even though that's not how we expect people will write the file.
Hi, we could support only the spaces and tabs and nothing else by:
re.split("[ \t]", str)
Does that sound good? (it's more explicit and readable...)
K.
I'd prefer just split(), I think; it's easier to remember and duplicate.
Well, split() splits on any whitespace character (this can be a problem, for example, with \r characters). Are you sure you want to allow splitting on any "whitespace" character? (and sorry for explaining if you are already aware of it...)
Hi, I just found an answer to the stability problem. After enabling the 'process exclusive mode' and using GPUs with ...
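For reference, compute-exclusive mode is usually set per GPU with nvidia-smi (requires root privileges; a sketch):

    # Allow only one process per GPU (compute mode EXCLUSIVE_PROCESS) on GPU 0.
    sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS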
And I found a similar error message in ... I am not sure how to interpret the codes there... But I did not investigate it yet... It still might be something "normal"...
We are actually getting the CUSPARSE_STATUS_EXECUTION_FAILED on our grid and we do have exclusive mode. From our initial investigations it seems to be some subtle concurrency problem; @hainan-xv is working on it. (Right, Hainan?)
scripts/rnnlm/train_rnnlm.sh Outdated
     backstitch_training_interval=1 # backstitch training interval

     cmd=run.pl # you might want to set this to queue.pl
    +queue_gpu_opt="--gpu 1" # you may change the GPU opt externally,
@vesis84, please don't do this-- this is not really consistent with our normal way of working.
These kinds of things are configurable by changing queue.conf.
Yes, originally I did this with 'queue.conf'. But is it possible somehow to have two queues, one with all the GPUs and another with GPUs that have more than, say, 7G RAM? (the big GPUs are not always necessary...)
scripts/rnnlm/train_rnnlm.sh Outdated
     embedding_l2_regularize=$(perl -e "print ($embedding_l2/$this_num_jobs);")

    +# allocate queue-slots for threads doing sampling,
    +[ -f $dir/sampling.lm ] && queue_thread_opt="--num-threads $num_egs_threads" || queue_thread_opt=
Normally it won't actually use that much CPU on average because it's limited by GPU time. I don't know how many it uses though.
The num_egs_threads is more like an upper bound to make sure it is not limited by the sampling.
Aha, on our servers it was running at 600% CPU while num-threads was set to 10. So it was 'eating' the CPU time of other slots (a GPU process is supposed to consume only 100% CPU).
For the queue thing, you can have one command be e.g. "queue.pl --config conf/queue_big_gpu_mem.conf". For the slots thing -- yes, I guess it makes sense to add the requirement, but maybe as a middle ground, set it to half the requested num-threads? Too many reserved slots aren't good either, normally. Dan
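A sketch of what such a config could look like (hypothetical: the gpu_ram resource name is site-specific and must match whatever your SGE grid defines; the remaining rules mirror queue.pl's built-in defaults):

    # conf/queue_big_gpu_mem.conf -- select GPUs with large memory (sketch).
    command qsub -v PATH -cwd -S /bin/bash -j y
    option mem=* -l mem_free=$0,ram_free=$0
    option num_threads=* -pe smp $0
    option num_threads=1
    default gpu=0
    option gpu=* -l gpu=$0,gpu_ram=7G -q g.q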
We don't anticipate \r or \t in text lines, but if it's there, we'd want it to be treated as space. What we support is ASCII-compatible encodings, i.e. encodings where spaces have the same meaning as they do in ASCII.
(force-pushed: d308045 to 61f1e3c)
Good, I have incorporated the comments: (split(' ') -> split(), num threads is 2/3 of $num_egs_threads).
Now, in the test, I am getting again the original error I had before... It is in rnnlm/get_unigram_probs.py, line 133. A 'counts' file is read in 'latin-1' encoding, the line is:
w\xc3\xa0nazokumbana 1
and it is wrongly split as
['w\xc3', 'nazokumbana', '1']
This means that for some reason split() interprets the char \xa0 as whitespace. An obvious solution would be to use: split(' ') or re.split('[ \t]', line). What should be the next step?
OK, so that is unicode character U+00A0, encoded as the bytes "C3" "A0" (written as hex). The second byte, A0 (== 160), does map to non-breaking space in the latin-1 character set. We have to decide whether to allow the non-breaking space in words -- that will determine the course of action. Currently, validate_lang.pl actually does not allow it: you can see in validate_utf8_whitespaces. But there may actually be situations in, say, Arabic, where the non-breaking space and similar non-standard whitespaces might validly appear inside a word -- intended, for example, to alter the way letters join together. @hhadian and @jtrmal, do you have any comment? Yenda, you were the one who added the code to check for those types of whitespace in UTF text, so I assume you remember the original reason why we banned them.
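The behaviour is easy to reproduce (a hypothetical snippet, using the failing word from the 'counts' line above):

    raw = b'w\xc3\xa0nazokumbana 1'  # UTF-8 bytes; c3 a0 encodes a grave-accented 'a'
    line = raw.decode('latin-1')     # byte 0xa0 decodes to U+00A0, a no-break space
    print(line.split())              # ['wÃ', 'nazokumbana', '1'] -- U+00A0 counts as whitespace
    print(line.split(' '))           # ['wÃ\xa0nazokumbana', '1'] -- only 0x20 splits; the word stays intact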
OK, I had another look at how the Unicode format works (https://www.fileformat.info/info/unicode/utf8.htm), and it looks like the byte "A0" (i.e. byte value 160) could potentially appear in a lot of Unicode characters; for instance, c3a0 is a valid UTF-8 byte sequence, meaning a grave-accented "a". That means that whenever we split data that is encoded as latin-1, we can't do just split(), because it could spuriously split a lot of UTF characters. I think Karel's original plan of splitting on " " explicitly would actually make the most sense. I now see that sym2int.pl (which is the other program that mainly deals with splitting of text data) explicitly splits on " ", and keeping those two things consistent would probably make sense. The other way we sometimes split is implicitly using awk (always with LC_ALL=C exported). That probably splits on tabs and \r as well, but probably the easiest way to resolve this difference is to just say that tabs and linefeed characters are banned in text data such as in 'data/text'. We could change validate_lang.pl to enforce this. That still leaves open the question of whether the non-breaking space is banned in utf8-encoded text, and we do need to resolve that, but it makes it a separate issue.
I'm trying to remember, but I don't know. Those were the times we didn't have to lie about the encoding just to make python3 happy. It was certainly because people weren't cleaning up the data properly and were getting whitespaces into the lexicon and similar files (where, consequently, words got split). But I don't remember the concrete case that was the one that made us do it -- certainly, there was one. y.
I'd split on tabs and spaces, because we are pretty careless sometimes (for example in the lexicon stuff). Plus, speaking of the lexicon, those tabs were sometimes used to split on syllables, but I don't recall now. y.
In Farsi (and probably Arabic), zero-width space and zero-width non-joiner (&zwnj;) are commonly used in the middle of compound words; not sure about &nbsp; but I don't remember seeing it in Farsi texts (I guess it's mostly popular in HTML).
I think we used to recommend removing them or mapping them to something visible anyway, as they can cause troubles that are hard to debug. ZWJ does not have the unicode isWhitespace property set (https://www.fileformat.info/info/unicode/char/200d/index.htm). y.
OK, so I think we're good then, assuming Hossein thinks it's OK for Arabic. Our rule will be: we disallow any space characters that Unicode considers as space that are different from space, tab and newline. That permits zero-width non-joiner but not zero-width space. Hossein, do you think we can get by with that? E.g. can we map any zero-width spaces to something else, like regular space, without badly distorting something or losing meaning? That requires us to modify validate_text.pl and validate_lang.pl to explicitly disallow linefeed (they should probably print something that's specific to linefeed, so people don't think it's a UTF-8 issue). I realized that sym2int.pl is actually splitting on all whitespace including tabs, spaces and linefeeds, since in perl, split(" ", ...) has a special meaning. Karel, so you should be splitting on spaces or tabs; and if you feel like it, make those modifications to validate_text.pl and validate_lang.pl to reject linefeed (if not, someone else can do it).
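A sketch of the proposed rule as a check (hypothetical Python; the real change would go into the perl validation scripts named above, and unlike the latin-1 data-prep scripts, a validation check has to decode the text as UTF-8 to see real characters):

    import io
    import re
    import sys

    # Flag any Unicode whitespace other than ' ', '\t', '\n' (e.g. NBSP U+00A0, CR).
    # Zero-width space U+200B is not whitespace by Python's definition, so it is
    # listed explicitly; zero-width non-joiner U+200C remains allowed.
    BAD_WS = re.compile(r'[^\S \t\n]|\u200b')

    inp = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')  # assumes UTF-8 text
    for lineno, line in enumerate(inp, 1):
        if BAD_WS.search(line):
            print("disallowed whitespace on line %d" % lineno, file=sys.stderr)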
Yes, I think. Any zero-width space can be mapped to zero-width non-joiner (actually, I guess zero-width non-joiner is the standard way: https://en.wikipedia.org/wiki/Zero-width_non-joiner) or even space without losing meaning.
I remember running into these zero-width characters in Dari and Pashto. People would get mad at me when I would delete them, so I think they might be important.
Hm, OK. But Hossein is saying that it should be possible to replace zero-width space with zero-width joiner. What I am *hoping* is that for each character that has Unicode's "space" property, there is another usable equivalent that isn't considered a space. But I don't know this for sure. It will definitely make our lives easier if we can ensure that words don't contain spaces internally.
Yes. I think that works for zero width space. I was worried about ligatures. If you want a ligature, you replace zero width space with zero width joiner. If you do not want a ligature, you replace zero width space with zero width non-joiner.
Aha, okay, so, you mean... I should modify the ...
I am not sure how to make the change in ...
Thanks. Once you confirm that you've tested it (at least that it doesn't crash in the early stages), I'll merge. @jtrmal, do you have time to change the validation scripts to ban CR (\r)?
I'll do it. y.
Hi, for me the dataprep and RNNLM training run fine. A.t.m. waiting for the training to finish... K.