
Conversation

@danpovey
Contributor

still far from compiling.

@danpovey
Contributor Author

@hhadian, there's some stuff you can help with here:

  • Add attention.o to the Makefile and make sure it compiles.
  • Write a test for GetAttentionDotProducts() [you can do this with reference to the
    comment that explains what it does: write a simple reference version in the test
    file that you can compare against; see the first sketch after this list].
  • Write ApplyScalesToOutput() and ApplyScalesToInput(), and write testing code for them.
    The code will have a loop that goes up to context_dim, just like GetAttentionDotProducts(),
    and on each iteration it will call AddDiagVecMat. It will need to create a temporary
    transposed copy of the 'C' input (see the second sketch after this list).
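
A minimal sketch of the kind of naive reference version the test could compare against, assuming the usual Kaldi matrix headers and assuming the intended semantics are C(i, j) = alpha * dot(A.Row(i), B.Row(i + j * row_shift)), with row_shift deduced from the size difference between B and A; written against CPU matrices for simplicity:

// Sketch only: naive reference implementation for the unit test.
// Computes C(i, j) = alpha * <A.Row(i), B.Row(i + j * row_shift)>.
void GetAttentionDotProductsSimple(BaseFloat alpha,
                                   const MatrixBase<BaseFloat> &A,
                                   const MatrixBase<BaseFloat> &B,
                                   MatrixBase<BaseFloat> *C) {
  int32 num_output_rows = A.NumRows(),
      context_dim = C->NumCols(),
      num_extra_rows = B.NumRows() - num_output_rows;
  KALDI_ASSERT(context_dim > 1 && num_extra_rows % (context_dim - 1) == 0);
  int32 row_shift = num_extra_rows / (context_dim - 1);
  for (int32 i = 0; i < num_output_rows; i++)
    for (int32 j = 0; j < context_dim; j++)
      (*C)(i, j) = alpha * VecVec(A.Row(i), B.Row(i + j * row_shift));
}

And a sketch of how ApplyScalesToOutput() might be structured, per the description above; the exact signature is an assumption, and it relies on CuMatrixBase::AddDiagVecMat():

// Sketch only: does A->Row(i) += alpha * C(i, j) * B.Row(i + j * row_shift),
// summed over j.  Loops over the context dimension, calling AddDiagVecMat()
// once per context offset; C is transposed first so the scales for a given
// offset are contiguous and can be viewed as a vector.
void ApplyScalesToOutput(BaseFloat alpha,
                         const CuMatrixBase<BaseFloat> &B,
                         const CuMatrixBase<BaseFloat> &C,
                         CuMatrixBase<BaseFloat> *A) {
  int32 num_output_rows = A->NumRows(),
      input_num_cols = B.NumCols(),
      num_extra_rows = B.NumRows() - num_output_rows,
      context_dim = C.NumCols();
  KALDI_ASSERT(context_dim > 1 && num_extra_rows % (context_dim - 1) == 0);
  int32 row_shift = num_extra_rows / (context_dim - 1);
  CuMatrix<BaseFloat> C_trans(C, kTrans);  // temporary transposed copy of C.
  for (int32 j = 0; j < context_dim; j++) {
    CuSubVector<BaseFloat> c_j(C_trans, j);  // scales for context offset j.
    CuSubMatrix<BaseFloat> B_part(B, j * row_shift, num_output_rows,
                                  0, input_num_cols);
    A->AddDiagVecMat(alpha, c_j, B_part, kNoTrans, 1.0);
  }
}

(ApplyScalesToInput() would be analogous, accumulating into the input side instead.)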

Note: eventually, if this works, we may ask @kangshiyin to write CUDA versions of GetAttentionDotProducts(), ApplyScalesToOutput(), and ApplyScalesToInput(). But this probably won't take more than half the time, even with the naive implementation, so there's no need to do that just yet.

@hhadian
Contributor

hhadian commented Jun 30, 2017

Will do




void TimeHeightConvolutionComponent::Check() {
Contributor


Is TimeHeightConvolutionComponent supposed to be here? Almost all the functions relating to TimeHeightConvolutionComponent are duplicated here.

@danpovey
Contributor Author

danpovey commented Jun 30, 2017 via email

@hhadian
Contributor

hhadian commented Jun 30, 2017

OK, sure, I was just wondering.

@hhadian
Contributor

hhadian commented Jun 30, 2017

Do you want me to write the test for GetAttentionDotProducts in a new file attention-test.cc or in an existing tester file? Also, the function itself seems not to be implemented yet (I looked in attention.cc); should I implement it?

I read the docs, but I'm not sure what query, key, and value are going to be in ASR tasks. The values should be the frames of speech, but what are the keys and queries?

@danpovey
Contributor Author

danpovey commented Jun 30, 2017 via email

This function implements:
A->Row(i) += alpha * C(i, j) * B.Row(i + j * row_shift).
Contributor


This line has j on the right-hand side but not on the left-hand side; I'm not sure I'm reading it correctly.
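
For reference, one plausible reading (an assumption, consistent with the per-context-offset loop described earlier) is that the sum over j is implicit:

    A_{i,:} \mathrel{+}= \alpha \sum_{j=0}^{\text{context\_dim}-1} C(i, j)\, B_{i + j \cdot \text{row\_shift},\,:}

i.e. each output row accumulates a weighted combination of row-shifted input rows.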

@danpovey
Contributor Author

danpovey commented Jul 1, 2017 via email

CuSubMatrix<BaseFloat> output_values_part(
*output, 0, num_output_rows, 0, value_dim);

ApplyScalesToOutput(1.0, values, *c, &output_values_part);
Contributor


Since it is asserted (a few lines before) that values and c both have num_output_rows rows, this will set A, B, and C such that they all have the same number of rows, so row_shift will become 0.

@danpovey
Contributor Author

danpovey commented Jul 1, 2017 via email

@danpovey
Contributor Author

danpovey commented Jul 3, 2017

@hhadian, I think the component-level code is now working and tested.

Can you please work on the script-level changes required to test this?
I think the easiest way to do this will be to copy-and-modify BasicLayer, supporting layers of the
form affine + attention + [some kind of nonlinearity]; do this in a new file, attention.py. Let's not bother with the ResNet-like thing they were doing, at this point; I've found previously in speech tasks that that stuff was not helpful. Note: our "NormalizeLayer" is basically the same as the "layer normalization" from Hinton that they refer to in the paper.

E.g. we want someone to be able to write in a config line:

attention-renorm-layer num-heads=10 value-dim=50 key-dim=50 time-stride=3 num-left-inputs=5 num-right-inputs=2

or

attention-relu-renorm-layer num-heads=10 value-dim=50 key-dim=50 time-stride=3 num-left-inputs=5 num-right-inputs=2

You can intersperse these with regular relu-batchnorm-layers for initial experiments.

You can have num-left-inputs-required and num-right-inputs-required and key-dim all present but defaulting to -1, and output-context present and defaulting to true; time-stride can default to 1 and num-heads to 1, but require the user to specify value-dim, key-dim, num-left-inputs and num-right-inputs.

@hhadian
Contributor

hhadian commented Jul 3, 2017

Will do

@hhadian
Contributor

hhadian commented Jul 3, 2017

What values do you suggest for the first experiment? With num-heads=10 value-dim=50 key-dim=50 time-stride=3 num-left-inputs=5 num-right-inputs=2, the input dim of the attention block is 1580 and the output dim is 580.

@danpovey
Contributor Author

danpovey commented Jul 3, 2017 via email

@danpovey
Contributor Author

danpovey commented Jul 4, 2017

@hhadian, I notice that the stats are not being printed in the progress logs.
It's due to an oversight on my part: I should have implemented the functions Add() and Scale() in the component, which would add and scale the stats, and also ZeroStats(). Can you fix it? You can look at class NonlinearComponent (see nnet-component-itf.{h,cc}) for inspiration.
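
A rough sketch of what those methods might look like, modeled on NonlinearComponent; the class name and the stats members (entropy_stats_, posterior_stats_, stats_count_) are assumptions for illustration:

// Sketch only: stats-accumulation methods in the style of NonlinearComponent.
// Member names are hypothetical; assumes the stats of 'other' already have
// matching dimensions.
void RestrictedAttentionComponent::Scale(BaseFloat scale) {
  entropy_stats_.Scale(scale);
  posterior_stats_.Scale(scale);
  stats_count_ *= scale;
}

void RestrictedAttentionComponent::Add(BaseFloat alpha,
                                       const Component &other_in) {
  const RestrictedAttentionComponent *other =
      dynamic_cast<const RestrictedAttentionComponent*>(&other_in);
  KALDI_ASSERT(other != NULL);
  entropy_stats_.AddVec(alpha, other->entropy_stats_);
  posterior_stats_.AddMat(alpha, other->posterior_stats_);
  stats_count_ += alpha * other->stats_count_;
}

void RestrictedAttentionComponent::ZeroStats() {
  entropy_stats_.SetZero();
  posterior_stats_.SetZero();
  stats_count_ = 0.0;
}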

@hhadian
Contributor

hhadian commented Jul 4, 2017

Sure, will do.

void GetTList(const std::vector<Index> &indexes,
std::vector<int32> *t_values) {
// set of t values
std::unordered_set<int32> t_set;
Contributor


Optionally, use std::set, which is sorted. Might be marginally more efficient.

You may also use some STL magic to reduce the function to just three lines, arguably more readable:

std::set<int32> t_set;
std::remove_copy(indexes.begin(), indexes.end(),
                 std::inserter(t_set, t_set.begin()), kNoTime);
t_values->assign(t_set.begin(), t_set.end());

(include <algorithm> and <iterator>).

Contributor Author


In the normal case, there could be many (e.g. 128) copies of each 't' value, so in that case I think
the way we have it is more efficient. (Also Index is a struct, not an integer).
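
For context, a completed version of the quoted fragment above, as a sketch; it assumes Index has a 't' member and that kNoTime is the sentinel for "no time value", as elsewhere in nnet3:

// Sketch only: collect the distinct 't' values from 'indexes' (skipping the
// kNoTime sentinel), then output them sorted (assuming sorted order is wanted
// downstream).  Needs <unordered_set>, <vector> and <algorithm>.
void GetTList(const std::vector<Index> &indexes,
              std::vector<int32> *t_values) {
  std::unordered_set<int32> t_set;  // set of t values
  for (std::vector<Index>::const_iterator iter = indexes.begin();
       iter != indexes.end(); ++iter)
    if (iter->t != kNoTime)
      t_set.insert(iter->t);
  t_values->clear();
  t_values->insert(t_values->end(), t_set.begin(), t_set.end());
  std::sort(t_values->begin(), t_values->end());
}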

Contributor


I see, missed the iter->t part.

@hhadian
Contributor

hhadian commented Jul 15, 2017

The results of changing num_heads while keeping the output/input dimension constant.
[The input dim of the attention layer is num_heads * (3 * dim + C), assuming key_dim = value_dim = dim, where C is the context size.]
[The output dim is num_heads * (dim + C).]
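[As a check: with num_heads = 10, dim = 50 and context (5, 2), C = 5 + 2 + 1 = 8, so the input dim is 10 * (3 * 50 + 8) = 1580 and the output dim is 10 * (50 + 8) = 580, matching the dims quoted earlier.]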

In the following results, input_dim ~= 1580

# System                    tdnn_7k  head15_dim34  head10_dim50  head8_dim63
# WER on train_dev(tg)        13.93         14.33         14.26        14.10
# WER on train_dev(fg)        12.85         13.27         12.93        12.99
# WER on eval2000(tg)          16.7          16.9          16.6         16.6
# WER on eval2000(fg)          15.0          15.2          14.8         15.0
# Final train prob           -0.085        -0.079        -0.079       -0.080
# Final valid prob           -0.106        -0.103        -0.102       -0.101
# Final train prob (xent)    -1.260        -1.026        -1.024       -1.037
# Final valid prob (xent)   -1.3193       -1.1072       -1.1048      -1.1147

In the following results, input_dim ~= 2330 (i.e. ~50% bigger layers):

# System                    tdnn_7k  head20_dim37  head15_dim50  head10_dim75
# WER on train_dev(tg)        13.93         14.07         13.96         14.01
# WER on train_dev(fg)        12.85         12.98         12.90         12.81
# WER on eval2000(tg)          16.7          16.9          16.4          16.4
# WER on eval2000(fg)          15.0          15.3          14.8          14.9
# Final train prob           -0.085        -0.076        -0.078        -0.077
# Final valid prob           -0.106        -0.100        -0.101        -0.101
# Final train prob (xent)    -1.260        -0.999        -0.995        -1.003
# Final valid prob (xent)   -1.3193       -1.0964       -1.0946       -1.0926

In all these, there are 2 attention layers, one near the beginning and one near the end, and the context is (5, 2).

@danpovey
Contributor Author

danpovey commented Jul 15, 2017 via email

@hhadian
Contributor

hhadian commented Jul 15, 2017

Will do

@hhadian
Contributor

hhadian commented Jul 18, 2017

Results regarding the position of attention layer in the network:

# System                    tdnn_7k      L2      L5      L6      L7
# WER on train_dev(tg)        13.93   14.11   14.00   13.80   13.64
# WER on train_dev(fg)        12.85   12.94   12.85   12.74   12.55
# WER on eval2000(tg)          16.7    16.8    16.6    16.4    16.3
# WER on eval2000(fg)          15.0    15.2    15.0    15.0    14.8
# Final train prob           -0.085  -0.085  -0.080  -0.079  -0.077
# Final valid prob           -0.106  -0.104  -0.101  -0.101  -0.099
# Final train prob (xent)    -1.260  -1.150  -1.034  -1.030  -1.009
# Final valid prob (xent)   -1.3193 -1.2133 -1.1319 -1.1034 -1.0980

Li means that only layer i is an attention layer and the rest are TDNN layers, as in the baseline.
With the current config (i.e. context (5, 2), key/value dim 50, and num-heads 10), attention is not working well in the initial layers (I used time-stride=1 for L2 and time-stride=3 for the rest). I am trying a larger context for L2 to see if it helps.
I guess I should also try it on the pre-final-chain layer (i.e. layer 8).

@hhadian
Contributor

hhadian commented Jul 18, 2017

Since attention is good at layer 7, I tried bigger value dimensions with that:

# System                    tdnn_7k  L7_key50_val50  L7_key50_val100  L7_key40-val80
# WER on train_dev(tg)        13.93           13.64            13.64           13.76
# WER on train_dev(fg)        12.85           12.55            12.68           12.62
# WER on eval2000(tg)          16.7            16.3             16.3            16.2
# WER on eval2000(fg)          15.0            14.8             14.7            14.6
# Final train prob           -0.085          -0.077           -0.074          -0.076
# Final valid prob           -0.106          -0.099           -0.095          -0.098
# Final train prob (xent)    -1.260          -1.009           -0.984          -0.997
# Final valid prob (xent)   -1.3193         -1.0980          -1.0727         -1.0887

So the best result is currently L7_key40-val80, with 0.5% and 0.4% absolute improvement on eval2000 tg and fg.

@danpovey
Contributor Author

danpovey commented Jul 18, 2017 via email

@hhadian
Contributor

hhadian commented Jul 18, 2017

Will do.
Re train_dev: is it really just as important as eval2000? My impression was that train_dev has a lot of speaker overlap and should be considered less important when tuning.

@danpovey
Contributor Author

danpovey commented Jul 18, 2017 via email

@danpovey changed the title from "Preliminary work on attention model" to "Attention modeling, with example scripts" on Sep 15, 2017
@danpovey danpovey merged commit d1016d8 into kaldi-asr:master Sep 15, 2017
@danpovey
Contributor Author

@hhadian, sorry, your authorship seems to have been lost by git due to the squash (I don't like to merge, except between versions of Kaldi). Next time you merge stuff, it will be to master anyway.

kronos-cm added a commit to kronos-cm/kaldi that referenced this pull request Sep 16, 2017
Skaiste pushed a commit to Skaiste/idlak that referenced this pull request Sep 26, 2018
