
Conversation

@KarelVesely84
Contributor

  • helps if the words in 'words.txt' contain special UTF whitespace characters,
    which otherwise lead to lines having more than 2 columns,
  • it makes the code a little more robust to 'dirty' data preparation,
 for line in f:
-    fields = line.split()
+    fields = line.split(' ')
     assert len(fields) == 2
Contributor

This was definitely a bug. The rule for things like words.txt in Kaldi is that they should never contain anything which, when interpreted as an ASCII character, is a space.
And in general, we don't even assume that things like text files are actually encoded in UTF-8 or a compatible encoding; we only require that the spaces between words be ASCII space (' ').
So I believe the correct fix here would be to change encoding="utf-8" to encoding="latin-1".
Would you mind testing whether that change works for your setup?
Make sure there are no similar issues in the rnnlm/ subdirectory.

Contributor Author

@KarelVesely84 KarelVesely84 May 28, 2018

Hi, okay: so if the only requirement is to use the ASCII space ' ' as the separator, this is also fine for texts in UTF-8 encoding. The byte code of the ASCII space is always 0x20.

If we changed the encoding in the python scripts from encoding="utf-8" to encoding="latin-1", further development of the python scripts would become more difficult, as the printed output would become incorrect (hex codes instead of UTF symbols). So I'd keep the encoding filter as it is.

The problem was that line.split() splits the string on any whitespace character, while with line.split(' ') we narrow the splitting to the ASCII space 0x20 only, and it does not matter whether the line is a UTF string or a byte string (UTF-8 is by design backward compatible with ASCII: https://en.wikipedia.org/wiki/UTF-8).
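For illustration, a hypothetical snippet (not from the PR) showing the difference: str.split() with no argument splits on ANY Unicode whitespace, while split(' ') splits only on the ASCII space 0x20.

line = 'foo\u3000bar 1'        # the "word" contains an ideographic space, U+3000
print(line.split())            # ['foo', 'bar', '1']   -> 3 fields, the assert fails
print(line.split(' '))         # ['foo\u3000bar', '1'] -> the intended 2 fields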

So, I'll look for other split() commands in the directory and change them if necessary...

@KarelVesely84
Contributor Author

Done! The data preparation seems to work well; there is no error. It would be good to test it with one of your other recipes too... (maybe one of the JHU students could do it before merging it in?)

@danpovey
Contributor

danpovey commented May 28, 2018 via email

@danpovey
Contributor

... and regarding printing: for programs like this, we should probably set the encoding for sys.stdout to latin-1 as well. The intention is that the output should look the same as the input, when interpreted as a bytestring, because programs like this have no need to break up words internally or even know what the encoding is, other than being able to split on whitespace. But we can leave that till later; I'm not sure of the correct invocation to do it.
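For reference, one candidate invocation (a sketch only, assuming Python 3; not something this PR adds) could be:

import io, sys

# Re-wrap stdout so that prints are encoded as latin-1; the output bytes
# then mirror the latin-1-decoded input byte-for-byte.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='latin-1')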

@danpovey danpovey closed this May 28, 2018
@danpovey danpovey reopened this May 28, 2018
@KarelVesely84
Contributor Author

Aha, well, I still have mixed feelings about using 'latin-1'. It seems to go against the current trend, in which UTF-8 is becoming more and more widespread: https://en.wikipedia.org/wiki/UTF-8#/media/File:Utf8webgrowth.svg

On the other hand, I checked that the ASCII space 0x20 cannot be the 2nd/3rd/4th byte of a UTF-8 character, nor the 2nd GBK byte. So I will make the change and test it, the way you prefer it...
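That property can be verified by brute force (illustration only, not part of the PR): in UTF-8, continuation bytes always lie in 0x80-0xBF, so the ASCII space 0x20 can never occur inside a multi-byte character.

for cp in range(0x80, 0x110000):
    if 0xD800 <= cp <= 0xDFFF:   # skip UTF-16 surrogates, not encodable in UTF-8
        continue
    assert 0x20 not in chr(cp).encode('utf-8')[1:]   # bytes after the first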

@jtrmal
Contributor

jtrmal commented May 29, 2018 via email

@danpovey
Contributor

Did you test it already?

@KarelVesely84
Contributor Author

Well, I tested it partially. The data prep is running fine, but I have a problem with training the model (I am getting the CUSPARSE error I described before in #2448):
ERROR (rnnlm-train[5.4.146~5-6b94e]:CopyFromSmat():cu-sparse-matrix.cc:395) cusparseStatus_t 6 : "CUSPARSE_STATUS_EXECUTION_FAILED" returned from 'cusparse_csr2csc(GetCusparseHandle(), smat.NumRows(), smat.NumCols(), smat.NumElements(), smat.CsrVal(), smat.CsrRowPtr(), smat.CsrColIdx(), CsrVal(), CsrColIdx(), CsrRowPtr(), CUSPARSE_ACTION_NUMERIC, CUSPARSE_INDEX_BASE_ZERO)'
I keep getting this much more often than I did last week. But AFAIK this is independent of the change in the data preparation for the RNNLM training.

But it is true that I did not finish the 'testing by lattice rescoring'.

@KarelVesely84
Contributor Author

KarelVesely84 commented May 30, 2018

Hi, I found another issue: in the rnnlm training there is multi-threaded sampling of targets, but the script was requesting just one queue slot for all the threads... See the fix: c6193c5 (it adds --num-threads N to the call of queue.pl). This is likely to improve the stability of the training jobs...
K.

@KarelVesely84
Contributor Author

Hmm, I am still getting the "CUSPARSE_STATUS_EXECUTION_FAILED" error. I don't think I can finish the test on our cluster; it keeps crashing every 3rd-5th iteration, which is really annoying :(

@KarelVesely84
Contributor Author

And 'CUDA_CACHE_DISABLE=1' did not help either...

@KarelVesely84
Contributor Author

But the data preparation seems to be fine...

@danpovey
Contributor

danpovey commented May 30, 2018 via email

@KarelVesely84
Contributor Author

Yes, I have it with 'export'; then I also tried the more explicit variant:
queue.pl ... xyz.log CUDA_CACHE_DISABLE=1 rnnlm-train ...
and it was no better either...

@danpovey
Contributor

OK, looking back at my emails, it looks like others have encountered this problem before.
However, we were never able to fix it because we could not replicate it here at Hopkins.

I would really appreciate it if you could run the program under cuda-memcheck -- preferably repeatedly, in some kind of automated loop, on a machine where you previously had the error -- and wait till you get the error. I really don't understand what the issue is. It could be some subtle concurrency problem.

In parallel, you could just go ahead and train your model with retry.pl. But obviously back up some files that are sufficient to reproduce the problem.

Just for the record, can you also let us know the CUDA version, the driver version, and which GPU hardware types you get the failure on?

@danpovey
Contributor

danpovey commented Jun 1, 2018

Karel:
We started getting failures like this at JHU. This is good news, as it means we can investigate the problem and find the cause. Previously, when I reported the problem to NVidia as a likely driver bug, they told me it was probably an error in our code. But now I think I have something to show them: I am seeing things in the system logs at the same time as these errors occur. This is from the output of dmesg -T:

(on c09)

[Fri Jun 1 15:35:11 2018] NVRM: GPU at PCI:0000:02:00: GPU-56dda1e1-ec05-17a6-ec14-d33d60198868
[Fri Jun 1 15:35:11 2018] NVRM: GPU Board Serial Number:
[Fri Jun 1 15:35:11 2018] NVRM: Xid (PCI:0000:02:00): 31, Ch 00000010, engmask 00000101, intr 10000000

Other examples:
(on c07)

[Fri Jun 1 15:46:50 2018] NVRM: Xid (PCI:0000:02:00): 31, Ch 00000010, engmask 00000101, intr 10000000 

(on c03)

[Fri Jun 1 14:50:51 2018] NVRM: Xid (PCI:0000:04:00): 31, Ch 00000010, engmask 00000101, intr 10000000
[Fri Jun 1 14:52:59 2018] NVRM: GPU at PCI:0000:02:00: GPU-ae2076d0-2e04-8a6b-907b-298de4f9b743
[Fri Jun 1 14:52:59 2018] NVRM: GPU Board Serial Number: 0321117092297
[Fri Jun 1 14:52:59 2018] NVRM: Xid (PCI:0000:02:00): 31, Ch 00000010, engmask 00000101, intr 10000000
[Fri Jun 1 15:34:06 2018] NVRM: Xid (PCI:0000:02:00): 31, Ch 00000010, engmask 00000101, intr 10000000

Can you see whether you can find the same on your system?
You may need to ask your sysadmins: not all Linux flavors allow users to see dmesg output.

 with open(vocab_file, 'r', encoding="latin-1") as f:
     for line in f:
-        fields = line.split()
+        fields = line.split(' ')
Contributor

@vesis84, you can remove the ' ' here and elsewhere. If the file contained tabs, we do want to split on those as well, even though that's not how we expect people will write the file.

Contributor Author

Hi, alternatively we could support only spaces and tabs, and nothing else, with:
re.split("[ \t]", str)
Does that sound good? (It's more explicit and readable...)
K.

Contributor

I'd prefer just split(), I think; it's easier to remember and duplicate.

Contributor Author

Well, split() splits on any whitespace character (this can be a problem, for example, with \r characters). Are you sure you want to allow splitting on any "whitespace" character? (And sorry for explaining this if you are already aware of it...)

@KarelVesely84
Contributor Author

Hi, I just found the answer to the stability problem.

  • The cause: a recent upgrade of the GPU driver to 384.130, which disabled the compute exclusive mode as a by-product.
  • The consequence: 2 processes sharing 1 GPU (part of the GPU RAM taken by the other process); the CUDA allocator in Kaldi already holds a lot of GPU RAM, so there is not enough GPU RAM left for the cusparse_csr2csc(.) call.
  • The outcome: the error message "CUSPARSE_STATUS_EXECUTION_FAILED", which here means "we just ran out of GPU memory..."

After enabling the 'process exclusive mode' again and using GPUs with >=8GB RAM, the training is stable, and the problem is solved...

@KarelVesely84
Contributor Author

KarelVesely84 commented Jun 4, 2018

And I found a similar error message in dmesg -t (it seems to be a bit old; it's from March 21st):

NVRM: loading NVIDIA UNIX x86_64 Kernel Module 384.130 Wed Mar 21 03:37:26 PDT 2018 (using threaded interrupts)
NVRM: GPU at PCI:0000:02:00: GPU-7fef535b-ed05-2a7d-3398-57aa66ebd4b9
NVRM: GPU Board Serial Number:
NVRM: Xid (PCI:0000:02:00): 31, Ch 00000020, engmask 00000101, intr 10000000
NVRM: GPU at PCI:0000:02:00: GPU-7fef535b-ed05-2a7d-3398-57aa66ebd4b9
NVRM: GPU Board Serial Number:
NVRM: Xid (PCI:0000:02:00): 31, Ch 00000020, engmask 00000101, intr 10000000

I am not sure how to interpret the codes there, and I have not investigated it yet... It still might be something "normal"...
K.

@danpovey
Contributor

danpovey commented Jun 4, 2018

We are actually getting the CUSPARSE_STATUS_EXECUTION_FAILED on our grid, and we do have exclusive mode. From our initial investigations it seems to be some subtle concurrency problem; @hainan-xv is working on it. (Right, Hainan?)

 backstitch_training_interval=1 # backstitch training interval

 cmd=run.pl # you might want to set this to queue.pl
+queue_gpu_opt="--gpu 1" # you may change the GPU opt externally,
Contributor

@vesis84, please don't do this-- this is not really consistent with our normal way of working.
These kinds of things are configurable by changing queue.conf.

Contributor Author

Yes, originally I did this with 'queue.conf'. But is it possible somehow to have 2 queues, one with all the GPUs and the other with GPUs that have more than, say, 7GB of RAM? (The big GPUs are not always necessary...)

 embedding_l2_regularize=$(perl -e "print ($embedding_l2/$this_num_jobs);")

+# allocate queue-slots for threads doing sampling,
+[ -f $dir/sampling.lm ] && queue_thread_opt="--num-threads $num_egs_threads" || queue_thread_opt=
Contributor

Normally it won't actually use that much CPU on average, because it's limited by GPU time; I don't know how many threads it uses, though.
The num_egs_threads is more of an upper bound, to make sure the training is not limited by the sampling.

Contributor Author

@KarelVesely84 KarelVesely84 Jun 4, 2018

Aha, on our servers it was running at 600% CPU while num-threads was set to 10, so it was 'eating' the CPU time of other slots (a GPU process is supposed to consume only 100% CPU).

@danpovey
Contributor

danpovey commented Jun 4, 2018 via email

@danpovey
Contributor

danpovey commented Jun 4, 2018 via email

- dataprep: switch i/o in python from 'utf-8' to 'latin-1',
- training: allocate 2/3 of num_egs_threads for the sampling of targets in the queue (i.e. 6 cores for num_egs_threads=10),
@KarelVesely84
Contributor Author

KarelVesely84 commented Jun 6, 2018

Good, I have incorporated the comments:
(split(' ') -> split(); $queue_thread_opt now uses 2/3 of $num_egs_threads; the configurable $queue_gpu_opt was removed).

Now, in the test, I am again getting the original error I had before...
It is in rnnlm/get_unigram_probs.py, line 133.

A 'counts' file is read in 'latin-1' encoding; the line is:
w\xc3\xa0nazokumbana 1
and it is wrongly split as
['w\xc3', 'nazokumbana', '1']

This means that split() interprets the character \xa0 (which latin-1 decodes to U+00A0, the no-break space) as whitespace.
An obvious solution would be to use split(' ') or re.split('[ \t]', line),
but you don't seem to welcome such a change.
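A minimal repro of the behavior (illustration only):

line = b'w\xc3\xa0nazokumbana 1'.decode('latin-1')
print(line.split())       # ['wÃ', 'nazokumbana', '1']  -- the word is torn apart
print(line.split(' '))    # ['wÃ\xa0nazokumbana', '1']  -- the intended 2 fields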

What should be the next step?

@danpovey
Contributor

danpovey commented Jun 6, 2018 via email

@danpovey
Contributor

danpovey commented Jun 6, 2018

OK, I had another look at how the Unicode format works
https://www.fileformat.info/info/unicode/utf8.htm
and it looks like the byte "A0" (i.e. byte value 160) can appear inside a lot of Unicode characters; for instance, c3a0 is a valid UTF-8 byte sequence, meaning a grave-accented "a". That means that whenever we split data that is encoded as latin-1, we can't do just split(), because it could spuriously split a lot of UTF characters.

I think Karel's original plan of splitting on " " explicitly would actually make the most sense.
I now see that sym2int.pl (which is the other program that mainly deals with splitting of text data) explicitly splits on " ", and keeping those two things consistent would probably make sense.

The other way we sometimes split is implicitly, using awk (always with LC_ALL=C exported). That probably splits on tabs and \r as well, but probably the easiest way to resolve this difference is to just say that tab and carriage-return characters are banned in text data such as in 'data/text'. We could change validate_lang.pl to enforce this.

That still leaves open the question of whether the no-break space (&nbsp;) is banned in utf8-encoded text, and we do need to resolve that, but it is a separate issue.

@jtrmal
Contributor

jtrmal commented Jun 6, 2018 via email

@jtrmal
Contributor

jtrmal commented Jun 6, 2018 via email

@hhadian
Contributor

hhadian commented Jun 6, 2018

In Farsi (and probably Arabic), the zero-width space and the zero-width non-joiner (&zwnj;) are commonly used in the middle of compound words. I am not sure about &nbsp;, but I don't remember seeing it in Farsi texts (I guess it's mostly popular in HTML).

@jtrmal
Contributor

jtrmal commented Jun 6, 2018 via email

@danpovey
Contributor

danpovey commented Jun 6, 2018 via email

@hhadian
Contributor

hhadian commented Jun 6, 2018

Yes, I think so. Any zero-width space can be mapped to the zero-width non-joiner (actually, I guess the zero-width non-joiner is the standard way), or even to a space, without losing meaning.

@johnjosephmorgan
Contributor

johnjosephmorgan commented Jun 6, 2018 via email

@danpovey
Contributor

danpovey commented Jun 6, 2018 via email

@johnjosephmorgan
Contributor

johnjosephmorgan commented Jun 7, 2018 via email

@KarelVesely84
Contributor Author

Aha, okay, so you mean I should modify validate_text.pl and validate_lang.pl to disallow the CR (not LF), both in 'words.txt' and in 'data/text'. Is that correct?

@KarelVesely84
Contributor Author

I am not sure how to make the change in validate_*.pl correctly, and I need to leave for today. Maybe someone else could do it? Or, if it can wait, I'll look into it tomorrow...

@danpovey
Contributor

danpovey commented Jun 7, 2018

Thanks. Once you confirm that you've tested it (at least that it doesn't crash in the early stages), I'll merge. @jtrmal, do you have time to change the validation scripts to ban CR (\r)?
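A minimal sketch of the kind of check intended (hypothetical; the actual change belongs in the Perl validation scripts):

import sys

def check_no_carriage_returns(path):
    # Read bytes, so the check is independent of the text encoding.
    with open(path, 'rb') as f:
        for lineno, line in enumerate(f, 1):
            if b'\r' in line:
                sys.exit("%s:%d: line contains a carriage-return (\\r)" % (path, lineno))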

@jtrmal
Contributor

jtrmal commented Jun 7, 2018 via email

@KarelVesely84
Contributor Author

Hi, for me the dataprep and the RNNLM training run fine. At the moment I am waiting for the training to finish... K.

@danpovey danpovey merged commit 5a6477b into kaldi-asr:master Jun 8, 2018
dpriver pushed a commit to dpriver/kaldi that referenced this pull request Sep 13, 2018
Skaiste pushed a commit to Skaiste/idlak that referenced this pull request Sep 26, 2018
@KarelVesely84 KarelVesely84 deleted the rnnlm_dataprep branch September 2, 2019 15:37