Adding chime5 baseline recipe #2262
Conversation
danpovey left a comment
A few comments.
@@ -0,0 +1,50 @@
#BeamformIt sample configuration file for AMI data (http://groups.inf.ed.ac.uk/ami/download/)
Can you rename it to beamformit_chime5.cfg to clarify that it relates to CHiME-5?
egs/chime5/s5/conf/decode.config Outdated
@@ -0,0 +1,2 @@
beam=11.0 # beam for decoding. Was 13.0 in the scripts.
You should probably delete this (also from the source of these scripts) unless it's being used. I believe it is not used unless an option like "--config conf/decode.config" is given to decoding scripts.
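For illustration (a hypothetical invocation, with placeholder directory names), such a config only takes effect when it is passed explicitly to a decoding script, e.g.:

  steps/decode.sh --config conf/decode.config --nj 8 --cmd "$decode_cmd" \
    exp/tri3/graph data/dev exp/tri3/decode_dev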
@@ -0,0 +1,283 @@
#!/bin/bash

# 1e is as 1d but instead of the --proportional-shrink option, using
If you don't have letters a through d in this PR, please rename it to 1a.
Done
num_targets=$(tree-info $tree_dir/tree |grep num-pdfs|awk '{print $2}')
learning_rate_factor=$(echo "print 0.5/$xent_regularize" | python)
opts="l2-regularize=0.05"
output_opts="l2-regularize=0.01"
You may find that adding "bottleneck-dim=320" (or maybe 256) to output_opts helps.
I just put "bottleneck-dim=320" in the current script for now (I used it in my old setup, but it was removed during some merge steps). It will be tuned later.
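For reference, a minimal sketch of the change being discussed, built from the lines quoted in this diff (the choice of 320 vs. 256 is still to be tuned):

  output_opts="l2-regularize=0.01 bottleneck-dim=320"   # 256 may also be worth trying
  # ... later, inside the xconfig block, output_opts is expanded into the output layers:
  output-layer name=output-xent $output_opts dim=$num_targets learning-rate-factor=$learning_rate_factor max-change=1.5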
adir=$1
jdir=$2
dir=$3
Please add some basic checks of the inputs, so that the messages are informative if the user gives the wrong inputs.
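Something along these lines would do it (a minimal sketch; the exact messages and checks are only a suggestion, assuming adir and jdir are the audio and JSON transcription directories):

  if [ $# -ne 3 ]; then
    echo "Usage: $0 <audio-dir> <json-dir> <output-dir>" >&2
    exit 1
  fi
  adir=$1
  jdir=$2
  dir=$3
  if [ ! -d "$adir" ]; then
    echo "$0: expected audio directory '$adir' to exist" >&2
    exit 1
  fi
  if [ ! -d "$jdir" ]; then
    echo "$0: expected transcription (JSON) directory '$jdir' to exist" >&2
    exit 1
  fi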
| echo "-------------------" | ||
| echo "Maxent 3grams" | ||
| echo "-------------------" | ||
| sed 's/'${oov_symbol}'/<unk>/g' $tgtdir/train.txt | \ |
Are these needed? I notice you're not treating it as an error if LIBLBFGS is not defined. It seems to me that either it's important (in which case it should be an error if not defined), or it's not (in which case this could be deleted).
If you want to have a script that tries a bunch of LMs automatically, like this, then IMO this shouldn't be in local/; it should be a generic script called from local/.
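For concreteness, the "treat it as an error" variant could look like this (a sketch only; it assumes the MaxEnt section is kept, and that LIBLBFGS points at the liblbfgs installation used to build SRILM with MaxEnt support):

  if [ -z "${LIBLBFGS}" ]; then
    echo "$0: LIBLBFGS is not set, but it is required for the SRILM MaxEnt LMs." >&2
    echo "$0: install liblbfgs and rebuild SRILM with MaxEnt support, or drop the MaxEnt section." >&2
    exit 1
  fi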
That script is fairly general because there was a lot of experimentation with the training-set size, and yet it's not general enough, as it contains some CHiME-5-specific filtering (it is probable that the training set contains duplicate transcriptions of the same utterance, via a different channel). IMO the MaxEnt models are usually the most robust way to get an ARPA-format LM -- you don't have to mess with discounting and cut-offs -- but they are not very widely accepted, so I wanted to provide a comparison against the KN and GT baselines.
@jtrmal, for these max-ent LMs, are they the ones that are actually chosen by the script?
I believe they are quite slow to estimate, and I'm not sure how they behave when pruned.
I'd prefer to get rid of the dependency if there isn't really a good reason to use them.
egs/chime5/s5/run.sh Outdated
# chime5 main directory path
# please change the path accordingly
chime5_corpus=/export/corpora4/CHiME5
json_dir=${chime5_corpus}/data/transcriptions
Maybe json_dir=${chime5_corpus}/transcriptions?
There is no data dir in the unzipped CHiME5 directory.
fixed
egs/chime5/s5/run.sh Outdated
# please change the path accordingly
chime5_corpus=/export/corpora4/CHiME5
json_dir=${chime5_corpus}/data/transcriptions
audio_dir=${chime5_corpus}/data/audio
ditto
fixed
@danpovey @ShigekiKarita @sw005320 I'm addressing the comments and will push today -- I want to run the data preparation pipeline to make sure it works with the new corpus location.
@jtrmal @ShigekiKarita, I just fixed the wrong-path issues. I have now confirmed that it works up to stage 16.
The unaddressed comments are related to the TDNN/chain script; Shinji is still waiting for his training to finish.
Don't you have to use cuda_cmd?
@kamo-naoyuki which part are you talking about?
fi

steps/nnet3/chain/train.py --stage=$train_stage \
  --cmd="$decode_cmd" \
Sorry, here. Should $decode_cmd be replaced with $cuda_cmd?
No, it shouldn't be cuda_cmd. Y.
cuda_cmd was used for training Karel's DNNs; now CUDA is used during training in a different way.
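For context, a sketch of the convention in the nnet3/chain recipes (option names as I recall them, not something this PR changes): the GPU is requested inside the training script rather than by switching the top-level wrapper to cuda_cmd.

  # train.py accepts --use-gpu and submits its training jobs with a "--gpu 1"
  # queue option, so queue.pl/slurm.pl allocate the GPU; the wrapper given via
  # --cmd only needs to be a generic submission command such as $decode_cmd.
  steps/nnet3/chain/train.py --stage=$train_stage \
    --cmd="$decode_cmd" --use-gpu=true ...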
ok, I understood.
Yes, those are typically the ones used. The gain on the WER for this dataset is around 0.2% absolute. Also, I'm using it fairly commonly (for Babel and others) and didn't run into any problems. But given that the CHiME organizers' plan is to get this merged as soon as possible, I'm OK with deleting it. y.
it's OK, you can keep it if it helps.
@jtrmal @danpovey I'm just reporting the current status. I removed stage 13 (the lexicon update) to strictly follow the challenge regulations, and also added a location tag for future scoring. Now I'm checking whether the recipe works from scratch; it is working up to the data cleaning stage, and I will move on to the chain model. I already confirmed it works in the previous setup, and I think the check will be finished smoothly over the weekend. (I hope. I believe.)
@danpovey we've finished the recipe check and updated the latest results. If there is no problem, please merge it.
Thanks!!!!
This is joint work with @sw005320 and me.
Dan, we are still working on the chain/nnet training, but we thought we would go ahead and create the PR now, so that at least the data preparation and GMM stages can go through review.