
Conversation

@KarelVesely84
Contributor

  • helps if the words in 'words.txt' contain special UTF whitespace characters,
    which otherwise lead to lines having more than 2 columns,
  • it makes the code a little more robust to 'dirty' data preparation,
 for line in f:
-    fields = line.split()
+    fields = line.split(' ')
     assert len(fields) == 2
Contributor

This was definitely a bug. The rule for things like words.txt in Kaldi is that they should never contain anything which, when interpreted as an ASCII character, is a space.
And in general, we don't even assume that things like text files are actually encoded in UTF-8 or a compatible encoding; we only require that the spaces between words be ASCII space (' ').
So I believe the correct fix here would be to change encoding="utf-8" to encoding="latin-1".
Would you mind testing whether that change works for your setup?
Make sure there are no similar issues in the rnnlm/ subdirectory.

Contributor Author

@KarelVesely84 KarelVesely84 May 28, 2018

Hi, okay: so if the only requirement is to use the ASCII space ' ' as the separator, this is also fine for texts in UTF-8 encoding. The byte code of the ASCII space is always 0x20.

If we changed the encoding in the python scripts from encoding="utf-8" to encoding="latin-1", further development of the python scripts would become more difficult, as the printed output would become incorrect (hex codes instead of UTF symbols). So I'd keep the encoding filter as it is.

The problem was that line.split() splits the string on any whitespace character, while with line.split(' ') we narrow the splitting to the ASCII space 0x20 only, and it does not matter whether the line is a UTF string or a byte string (UTF-8 is by design backward compatible with ASCII: https://en.wikipedia.org/wiki/UTF-8).
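For illustration, a hypothetical snippet (not from the PR) showing the difference: str.split() with no argument splits on ANY Unicode whitespace, while split(' ') splits only on the ASCII space 0x20.

line = 'foo\u3000bar 1'        # the "word" contains an ideographic space, U+3000
print(line.split())            # ['foo', 'bar', '1']   -> 3 fields, the assert fails
print(line.split(' '))         # ['foo\u3000bar', '1'] -> the intended 2 fields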

So, I'll look for other split() commands in the directory and change them if necessary...

@KarelVesely84
Contributor Author

Done! The data preparation seems to work well; there is no error. It would be good to test it with one of your other recipes too... (maybe one of the JHU students could do it before merging it in?)

@danpovey
Contributor

danpovey commented May 28, 2018 via email

@danpovey
Contributor

... and regarding printing: for programs like this, we should probably set the encoding for sys.stdout to latin-1 as well. The intention is that the output should look the same as the input, when interpreted as a bytestring, because programs like this have no need to break up words internally or even know what the encoding is, other than being able to split on whitespace. But we can leave that till later; I'm not sure of the correct invocation to do it.
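For reference, one candidate invocation (a sketch only, assuming Python 3; not something this PR adds) could be:

import io, sys

# Re-wrap stdout so that prints are encoded as latin-1; the output bytes
# then mirror the latin-1-decoded input byte-for-byte.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='latin-1')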

@danpovey danpovey closed this May 28, 2018
@danpovey danpovey reopened this May 28, 2018
@KarelVesely84
Contributor Author

Aha, well, I still have mixed feelings about using 'latin-1'. It seems to go against the current trend, in which UTF-8 is becoming more and more widespread: https://en.wikipedia.org/wiki/UTF-8#/media/File:Utf8webgrowth.svg

On the other hand, I checked that the ASCII space 0x20 cannot be the 2nd/3rd/4th byte of a UTF-8 character, nor the 2nd GBK byte. So I will make the change and test it, the way you prefer it...
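That property can be verified by brute force (illustration only, not part of the PR): in UTF-8, continuation bytes always lie in 0x80-0xBF, so the ASCII space 0x20 can never occur inside a multi-byte character.

for cp in range(0x80, 0x110000):
    if 0xD800 <= cp <= 0xDFFF:   # skip UTF-16 surrogates, not encodable in UTF-8
        continue
    assert 0x20 not in chr(cp).encode('utf-8')[1:]   # bytes after the first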

@jtrmal
Contributor

jtrmal commented May 29, 2018 via email

@danpovey
Contributor

Did you test it already?

@KarelVesely84
Contributor Author

Well, I tested it partially. The data prep is running fine, but I have a problem with training the model (I am getting the CUSPARSE error I described before in #2448):
ERROR (rnnlm-train[5.4.146~5-6b94e]:CopyFromSmat():cu-sparse-matrix.cc:395) cusparseStatus_t 6 : "CUSPARSE_STATUS_EXECUTION_FAILED" returned from 'cusparse_csr2csc(GetCusparseHandle(), smat.NumRows(), smat.NumCols(), smat.NumElements(), smat.CsrVal(), smat.CsrRowPtr(), smat.CsrColIdx(), CsrVal(), CsrColIdx(), CsrRowPtr(), CUSPARSE_ACTION_NUMERIC, CUSPARSE_INDEX_BASE_ZERO)'
I keep getting this much more often than I did last week. But AFAIK this is independent of the change in the data preparation for the RNNLM training.

But it is true that I did not finish the 'testing by lattice rescoring'.

@KarelVesely84
Contributor Author

KarelVesely84 commented May 30, 2018

Hi, I found another issue: in the rnnlm training there is multi-threaded sampling of targets, but the script was requesting just one queue slot for all the threads... See the fix: c6193c5 (it adds --num-threads N to the call of queue.pl). This is likely to improve the stability of the training jobs...
K.

@KarelVesely84
Contributor Author

Hmm, I am still getting the "CUSPARSE_STATUS_EXECUTION_FAILED" error. I don't think I can finish the test on our cluster; it keeps crashing every 3rd-5th iteration, which is really annoying :(

@KarelVesely84
Contributor Author

And 'CUDA_CACHE_DISABLE=1' did not help either...

@KarelVesely84
Contributor Author

But the data preparation seems to be fine...

@danpovey
Contributor

danpovey commented May 30, 2018 via email

@KarelVesely84
Contributor Author

Yes, I have it with 'export'; then I also tried the more explicit variant:
queue.pl ... xyz.log CUDA_CACHE_DISABLE=1 rnnlm-train ...
and it was no better either...

@danpovey
Contributor

OK, looking back at my emails, it looks like others have encountered this problem before.
However, we were never able to fix it because we could not replicate it here at Hopkins.

I would really appreciate it if you could run the program under cuda-memcheck -- preferably repeatedly, in some kind of automated loop, on a machine where you previously had the error -- and wait till you get the error. I really don't understand what the issue is. It could be some subtle concurrency problem.

In parallel, you could just go ahead and train your model with retry.pl. But obviously back up some files that are sufficient to reproduce the problem.

Just for the record, can you also let us know the CUDA version, the driver version, and which GPU hardware types you get the failure on?

@danpovey
Contributor

danpovey commented Jun 1, 2018

Karel:
We started getting failures like this at JHU. This is good news, as it means we can investigate the problem and find the cause. Previously, when I reported the problem to NVidia as a likely driver bug, they told me it was probably an error in our code. But now I think I have something to show them: I am seeing things in the system logs at the same time as these errors occur. This is from the output of dmesg -T:

(on c09)

[Fri Jun 1 15:35:11 2018] NVRM: GPU at PCI:0000:02:00: GPU-56dda1e1-ec05-17a6-ec14-d33d60198868
[Fri Jun 1 15:35:11 2018] NVRM: GPU Board Serial Number:
[Fri Jun 1 15:35:11 2018] NVRM: Xid (PCI:0000:02:00): 31, Ch 00000010, engmask 00000101, intr 10000000

Other examples:
(on c07)

[Fri Jun 1 15:46:50 2018] NVRM: Xid (PCI:0000:02:00): 31, Ch 00000010, engmask 00000101, intr 10000000 

(on c03)

[Fri Jun 1 14:50:51 2018] NVRM: Xid (PCI:0000:04:00): 31, Ch 00000010, engmask 00000101, intr 10000000
[Fri Jun 1 14:52:59 2018] NVRM: GPU at PCI:0000:02:00: GPU-ae2076d0-2e04-8a6b-907b-298de4f9b743
[Fri Jun 1 14:52:59 2018] NVRM: GPU Board Serial Number: 0321117092297
[Fri Jun 1 14:52:59 2018] NVRM: Xid (PCI:0000:02:00): 31, Ch 00000010, engmask 00000101, intr 10000000
[Fri Jun 1 15:34:06 2018] NVRM: Xid (PCI:0000:02:00): 31, Ch 00000010, engmask 00000101, intr 10000000

Can you see whether you can find the same on your system?
You may need to ask your sysadmins: not all Linux flavors allow users to see dmesg output.

 with open(vocab_file, 'r', encoding="latin-1") as f:
     for line in f:
-        fields = line.split()
+        fields = line.split(' ')
Contributor

@vesis84, you can remove the ' ' here and elsewhere. If the file contained tabs, we do want to split on those as well, even though that's not how we expect people will write the file.

Contributor Author

Hi, alternatively we could support only spaces and tabs, and nothing else, with:
re.split("[ \t]", str)
Does that sound good? (It's more explicit and readable...)
K.

Contributor

I'd prefer just split(), I think; it's easier to remember and duplicate.

Contributor Author

Well, split() splits on any whitespace character (this can be a problem, for example, with \r characters). Are you sure you want to allow splitting on any "whitespace" character? (And sorry for explaining this if you are already aware of it...)

@KarelVesely84
Contributor Author

Hi, I just found the answer to the stability problem.

  • The cause: a recent upgrade of the GPU driver to 384.130, which disabled the compute exclusive mode as a by-product.
  • The consequence: 2 processes sharing 1 GPU (part of the GPU RAM taken by the other process); the CUDA allocator in Kaldi already holds a lot of GPU RAM, so there is not enough GPU RAM left for the cusparse_csr2csc(.) call.
  • The outcome: the error message "CUSPARSE_STATUS_EXECUTION_FAILED", which here means "we just ran out of GPU memory..."

After enabling the 'process exclusive mode' again and using GPUs with >=8GB RAM, the training is stable, and the problem is solved...

@KarelVesely84
Contributor Author

KarelVesely84 commented Jun 4, 2018

And I found a similar error message in dmesg -t (it seems to be a bit old; it's from March 21st):

NVRM: loading NVIDIA UNIX x86_64 Kernel Module 384.130 Wed Mar 21 03:37:26 PDT 2018 (using threaded interrupts)
NVRM: GPU at PCI:0000:02:00: GPU-7fef535b-ed05-2a7d-3398-57aa66ebd4b9
NVRM: GPU Board Serial Number:
NVRM: Xid (PCI:0000:02:00): 31, Ch 00000020, engmask 00000101, intr 10000000
NVRM: GPU at PCI:0000:02:00: GPU-7fef535b-ed05-2a7d-3398-57aa66ebd4b9
NVRM: GPU Board Serial Number:
NVRM: Xid (PCI:0000:02:00): 31, Ch 00000020, engmask 00000101, intr 10000000

I am not sure how to interpret the codes there, and I have not investigated it yet... It still might be something "normal"...
K.

@danpovey
Contributor

danpovey commented Jun 4, 2018

We are actually getting the CUSPARSE_STATUS_EXECUTION_FAILED on our grid, and we do have exclusive mode. From our initial investigations it seems to be some subtle concurrency problem; @hainan-xv is working on it. (Right, Hainan?)

 backstitch_training_interval=1 # backstitch training interval

 cmd=run.pl # you might want to set this to queue.pl
+queue_gpu_opt="--gpu 1" # you may change the GPU opt externally,
Contributor

@vesis84, please don't do this-- this is not really consistent with our normal way of working.
These kinds of things are configurable by changing queue.conf.

Contributor Author

Yes, originally I did this with 'queue.conf'. But is it possible somehow to have 2 queues, one with all the GPUs and the other with GPUs that have more than, say, 7GB of RAM? (The big GPUs are not always necessary...)

 embedding_l2_regularize=$(perl -e "print ($embedding_l2/$this_num_jobs);")

+# allocate queue-slots for threads doing sampling,
+[ -f $dir/sampling.lm ] && queue_thread_opt="--num-threads $num_egs_threads" || queue_thread_opt=
Contributor

Normally it won't actually use that much CPU on average, because it's limited by GPU time; I don't know how many threads it uses, though.
The num_egs_threads is more of an upper bound, to make sure the training is not limited by the sampling.

Contributor Author

@KarelVesely84 KarelVesely84 Jun 4, 2018

Aha, on our servers it was running at 600% CPU while num-threads was set to 10, so it was 'eating' the CPU time of other slots (a GPU process is supposed to consume only 100% CPU).

@danpovey
Contributor

danpovey commented Jun 4, 2018 via email

@danpovey
Contributor

danpovey commented Jun 4, 2018 via email

- dataprep: switch i/o in python from 'utf-8' to 'latin-1',
- training: allocate 2/3 of num_egs_threads for the sampling of targets in the queue (i.e. 6 cores for num_egs_threads=10),
@KarelVesely84
Contributor Author

KarelVesely84 commented Jun 6, 2018

Good, I have incorporated the comments:
(split(' ') -> split(); $queue_thread_opt now uses 2/3 of $num_egs_threads; the configurable $queue_gpu_opt was removed).

Now, in the test, I am again getting the original error I had before...
It is in rnnlm/get_unigram_probs.py, line 133.

A 'counts' file is read in 'latin-1' encoding; the line is:
w\xc3\xa0nazokumbana 1
and it is wrongly split as
['w\xc3', 'nazokumbana', '1']

This means that split() interprets the character \xa0 (which latin-1 decodes to U+00A0, the no-break space) as whitespace.
An obvious solution would be to use split(' ') or re.split('[ \t]', line),
but you don't seem to welcome such a change.
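A minimal repro of the behavior (illustration only):

line = b'w\xc3\xa0nazokumbana 1'.decode('latin-1')
print(line.split())       # ['wÃ', 'nazokumbana', '1']  -- the word is torn apart
print(line.split(' '))    # ['wÃ\xa0nazokumbana', '1']  -- the intended 2 fields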

What should be the next step?

@danpovey
Contributor

danpovey commented Jun 6, 2018 via email

@danpovey
Contributor

danpovey commented Jun 6, 2018

OK, I had another look at how the Unicode format works
https://www.fileformat.info/info/unicode/utf8.htm
and it looks like the byte "A0" (i.e. byte value 160) can appear inside a lot of Unicode characters; for instance, c3a0 is a valid UTF-8 byte sequence, meaning a grave-accented "a". That means that whenever we split data that is encoded as latin-1, we can't do just split(), because it could spuriously split a lot of UTF characters.

I think Karel's original plan of splitting on " " explicitly would actually make the most sense.
I now see that sym2int.pl (which is the other program that mainly deals with splitting of text data) explicitly splits on " ", and keeping those two things consistent would probably make sense.

The other way we sometimes split is implicitly, using awk (always with LC_ALL=C exported). That probably splits on tabs and \r as well, but probably the easiest way to resolve this difference is to just say that tab and carriage-return characters are banned in text data such as in 'data/text'. We could change validate_lang.pl to enforce this.

That still leaves open the question of whether the no-break space (&nbsp;) is banned in utf8-encoded text, and we do need to resolve that, but it is a separate issue.

@jtrmal
Contributor

jtrmal commented Jun 6, 2018 via email

@jtrmal
Contributor

jtrmal commented Jun 6, 2018 via email

@hhadian
Contributor

hhadian commented Jun 6, 2018

In Farsi (and probably Arabic), the zero-width space and the zero-width non-joiner (&zwnj;) are commonly used in the middle of compound words. I am not sure about &nbsp;, but I don't remember seeing it in Farsi texts (I guess it's mostly popular in HTML).

@jtrmal
Contributor

jtrmal commented Jun 6, 2018 via email

@danpovey
Contributor

danpovey commented Jun 6, 2018 via email

@hhadian
Contributor

hhadian commented Jun 6, 2018

Yes, I think so. Any zero-width space can be mapped to the zero-width non-joiner (actually, I guess the zero-width non-joiner is the standard way), or even to a space, without losing meaning.

@johnjosephmorgan
Contributor

johnjosephmorgan commented Jun 6, 2018 via email

@danpovey
Contributor

danpovey commented Jun 6, 2018 via email

@johnjosephmorgan
Contributor

johnjosephmorgan commented Jun 7, 2018 via email

@KarelVesely84
Contributor Author

Aha, okay, so you mean I should modify validate_text.pl and validate_lang.pl to disallow the CR (not LF), both in 'words.txt' and in 'data/text'. Is that correct?

@KarelVesely84
Contributor Author

I am not sure how to make the change in validate_*.pl correctly, and I need to leave for today. Maybe someone else could do it? Or, if it can wait, I'll look into it tomorrow...

@danpovey
Contributor

danpovey commented Jun 7, 2018

Thanks. Once you confirm that you've tested it (at least that it doesn't crash in the early stages), I'll merge. @jtrmal, do you have time to change the validation scripts to ban CR (\r)?
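A minimal sketch of the kind of check intended (hypothetical; the actual change belongs in the Perl validation scripts):

import sys

def check_no_carriage_returns(path):
    # Read bytes, so the check is independent of the text encoding.
    with open(path, 'rb') as f:
        for lineno, line in enumerate(f, 1):
            if b'\r' in line:
                sys.exit("%s:%d: line contains a carriage-return (\\r)" % (path, lineno))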

@jtrmal
Contributor

jtrmal commented Jun 7, 2018 via email

@KarelVesely84
Contributor Author

Hi, for me the dataprep and the RNNLM training run fine. At the moment I am waiting for the training to finish... K.

@danpovey danpovey merged commit 5a6477b into kaldi-asr:master Jun 8, 2018
dpriver pushed a commit to dpriver/kaldi that referenced this pull request Sep 13, 2018
Skaiste pushed a commit to Skaiste/idlak that referenced this pull request Sep 26, 2018
@KarelVesely84 KarelVesely84 deleted the rnnlm_dataprep branch September 2, 2019 15:37