Attention modeling, with example scripts #1731
Conversation
| @hhadian, there's some stuff you can help with here:
Note: eventually, if this works, we may ask @kangshiyin to write CUDA versions of GetAttentionDotProducts(), ApplyScalesToOutput(), and ApplyScalesToInput(). But this probably won't take more than half the time, even with the naive implementation, so there's no need to do that just yet. |
| Will do |
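For orientation while reading the diff, here is a plain-loop sketch of the semantics I infer for GetAttentionDotProducts() from the doc-comments quoted later in this thread; the parameter order and the row_shift derivation are assumptions, and the actual code operates on CuMatrixBase<BaseFloat>, not nested vectors:

```cpp
#include <cassert>
#include <vector>

// Hypothetical CPU reference:
//   C(i, j) = alpha * dot(A.row(i), B.row(i + j * row_shift))
// B has extra rows relative to A, which supply the temporal context.
void GetAttentionDotProductsRef(float alpha,
                                const std::vector<std::vector<float> > &A,
                                const std::vector<std::vector<float> > &B,
                                std::vector<std::vector<float> > *C) {
  int num_output_rows = A.size(), context_dim = (*C)[0].size();
  int num_extra_rows = B.size() - A.size();
  assert(context_dim > 1 && num_extra_rows % (context_dim - 1) == 0);
  int row_shift = num_extra_rows / (context_dim - 1);  // assumed derivation
  for (int i = 0; i < num_output_rows; i++) {
    for (int j = 0; j < context_dim; j++) {
      float sum = 0.0f;
      for (size_t d = 0; d < A[i].size(); d++)
        sum += A[i][d] * B[i + j * row_shift][d];
      (*C)[i][j] = alpha * sum;
    }
  }
}
```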
| void TimeHeightConvolutionComponent::Check() { |
Is TimeHeightConvolutionComponent supposed to be here? Almost all the functions relating to TimeHeightConvolutionComponent are duplicated here.
| I am copying-and-modifying that file. I'll remove all that stuff. For now, don't bother compiling that file; I'm working on it. |
| OK, sure, I was just wondering. |
| Do you want me to write the test for GetAttentionDotProducts in a new file attention-test.cc or in an existing tester file? Also, the function itself seems not to be implemented yet (I looked in attention.cc); should I implement it? I read the docs, but I'm not sure what query, key, and value are going to be in ASR tasks. The values should be the frames of speech, but what are the keys and queries? |
| A new file, attention-test.cc. It's implemented but misnamed as AttentionCoreForward() in the .cc file; fix the name. Query, key and value are all just different sub-ranges of the input to the layer, i.e. they're ranges of column indexes. |
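The diff hunk this review started from shows what that means in code: keys and values are just column ranges of the same input matrix (quoted from the PR as of this review; the exact offsets may still change):

```cpp
// 'keys' contains the keys; note, these are not extended with
// context information; that happens further in.
CuSubMatrix<BaseFloat> keys(in, 0, in.NumRows(), 0, key_dim_);

// 'values' contains the values which we will interpolate; context
// information will be added later if output_context_ == true.
CuSubMatrix<BaseFloat> values(in, 0, in.NumRows(), key_dim_, value_dim_);

AttentionForward(key_scale_, keys, queries, values, c, out);
```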
src/nnet3/attention.h Outdated
| This function implements: |
| A->Row(i) += alpha * C(i, j) * B.Row(i + j * row_shift). |
This line has j on the right-hand side but not on the left-hand side. I'm not sure I'm reading it right.
| Add a \sum_j on the right-hand side. |
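For reference, here is a minimal CPU sketch of the corrected semantics. The argument order (alpha, B, C, A) matches the ApplyScalesToOutput(1.0, values, *c, &output_values_part) call discussed just below; the row_shift derivation is my assumption from the dimension discussion, and the real code operates on CuMatrixBase<BaseFloat>, not nested vectors:

```cpp
#include <cassert>
#include <vector>

// A.row(i) += alpha * sum_j C(i, j) * B.row(i + j * row_shift),
// i.e. the doc-comment formula with the missing \sum_j made explicit.
void ApplyScalesToOutputRef(float alpha,
                            const std::vector<std::vector<float> > &B,
                            const std::vector<std::vector<float> > &C,
                            std::vector<std::vector<float> > *A) {
  int num_output_rows = A->size(), context_dim = C[0].size();
  int num_extra_rows = B.size() - A->size();
  assert(context_dim > 1 && num_extra_rows % (context_dim - 1) == 0);
  int row_shift = num_extra_rows / (context_dim - 1);  // assumed derivation
  for (int i = 0; i < num_output_rows; i++)
    for (int j = 0; j < context_dim; j++)   // the \sum_j
      for (size_t d = 0; d < (*A)[i].size(); d++)
        (*A)[i][d] += alpha * C[i][j] * B[i + j * row_shift][d];
}
```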
| CuSubMatrix<BaseFloat> output_values_part( |
| *output, 0, num_output_rows, 0, value_dim); |
| ApplyScalesToOutput(1.0, values, *c, &output_values_part); |
Since it is asserted (a few lines before) that values and c both have num_output_rows rows, this will set A, B, and C in a way that all have the same number of rows, so row_shift will become 0.
| 'values' should have 'num_input_rows' rows; the assert must have been a mistake. |
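To spell out the dimension relationship behind this point (my reading, not text from the PR): B carries the input rows and A and C the output rows, so

```latex
\mathrm{rows}(B) \;=\; \mathrm{rows}(A) + (\mathrm{context\_dim} - 1)\cdot \mathrm{row\_shift}
```

and asserting that 'values' (the B argument) has num_output_rows rows forces row_shift = 0, i.e. every context position would read the same row.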
ApplyScalesToInput not done yet
[WIP] [attention] add ApplyScalesToInput/Output functions and tests
Add ApplyScalesToInput + test
Add test template for AttentionForward/Backward
| @hhadian, I think the component-level code is now working and tested. Can you please work on the script-level changes required to test this? E.g. we want someone to be able to write in a config line: attention-renorm-layer num-heads=10 value-dim=50 key-dim=50 time-stride=3 num-left-inputs=5 num-right-inputs=2. You can intersperse these with regular relu-batchnorm-layers for initial experiments. You can have num-left-inputs-required and num-right-inputs-required and key-dim all present but defaulting to -1, and output-context present and defaulting to true; time-stride can default to 1 and num-heads to 1, but require the user to specify value-dim, key-dim, num-left-inputs and num-right-inputs. |
| Will do |
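A sketch of the kind of config fragment being requested (illustrative only; the layer names and the relu-batchnorm dims are placeholders I made up, not from the PR):

```
# hypothetical xconfig excerpt: an attention layer interspersed with
# regular relu-batchnorm-layers, per the spec above
relu-batchnorm-layer name=tdnn3 dim=512
attention-renorm-layer name=attention1 num-heads=10 value-dim=50 key-dim=50 time-stride=3 num-left-inputs=5 num-right-inputs=2
relu-batchnorm-layer name=tdnn4 dim=512
```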
| What values do you suggest for the first experiment? With num-heads=10 value-dim=50 key-dim=50 time-stride=3 num-left-inputs=5 num-right-inputs=2, the input dim of the attention block is 1580 and the output dim is 580. |
| That's similar to TDNNs with relu-dim=512, so I think it's a reasonable setting. |
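As a sanity check, those dims match the formulas Hossein gives later in the thread, where dim = key-dim = value-dim and the context size is C = num-left-inputs + num-right-inputs + 1:

```latex
C = 5 + 2 + 1 = 8,\qquad
\text{input dim} = \text{num\_heads}\,(3\,\text{dim} + C) = 10\,(3\cdot 50 + 8) = 1580,\qquad
\text{output dim} = \text{num\_heads}\,(\text{dim} + C) = 10\,(50 + 8) = 580.
```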
| @hhadian, I notice that the stats are not being printed in the progress logs. |
| Sure, will do. |
Add Scale/Add/ZeroStats + xconfig scripts for Attention
A fix in Add function
| void GetTList(const std::vector<Index> &indexes, |
| std::vector<int32> *t_values) { |
| // set of t values |
| std::unordered_set<int32> t_set; |
Optionally, use std::set, which is sorted; it might be marginally more efficient.
You may also use some STL magic to reduce the function to just three lines, arguably more readable:

    std::set<int32> t_set;
    std::remove_copy(indexes.begin(), indexes.end(),
                     std::inserter(t_set, t_set.begin()), kNoTime);
    t_values->assign(t_set.begin(), t_set.end());

(include <algorithm> and <iterator>).
In the normal case, there could be many (e.g. 128) copies of each 't' value, so in that case I think
the way we have it is more efficient. (Also Index is a struct, not an integer).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, missed the iter->t part.
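Putting the two observations together, a corrected form of the STL suggestion might look like this (a sketch; it assumes Kaldi's Index struct with an int32 t member and the kNoTime sentinel, as in the diff above, and uses std::set so the output comes back sorted):

```cpp
#include <set>
#include <vector>
// Assumes Kaldi's Index (with a .t member) and kNoTime from the nnet3 headers.

void GetTList(const std::vector<Index> &indexes,
              std::vector<int32> *t_values) {
  std::set<int32> t_set;  // sorted set of distinct t values
  for (std::vector<Index>::const_iterator iter = indexes.begin();
       iter != indexes.end(); ++iter)
    if (iter->t != kNoTime)   // skip the "no time" sentinel
      t_set.insert(iter->t);  // Index is a struct, so take iter->t
  t_values->assign(t_set.begin(), t_set.end());
}
```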
Update recipes + minor fix to allow for bigger contexts
The results of changing num_heads while keeping the output/input dimension constant. [The input dim of the attention layer is num_heads * (3 * dim + C), assuming key_dim = value_dim = dim, where C is the context size. The output dim is num_heads * (dim + C).]

In the following results, input_dim ~= 1580:

| System | tdnn_7k | head15_dim34 | head10_dim50 | head8_dim63 |
|---|---|---|---|---|
| WER on train_dev (tg) | 13.93 | 14.33 | 14.26 | 14.10 |
| WER on train_dev (fg) | 12.85 | 13.27 | 12.93 | 12.99 |
| WER on eval2000 (tg) | 16.7 | 16.9 | 16.6 | 16.6 |
| WER on eval2000 (fg) | 15.0 | 15.2 | 14.8 | 15.0 |
| Final train prob | -0.085 | -0.079 | -0.079 | -0.080 |
| Final valid prob | -0.106 | -0.103 | -0.102 | -0.101 |
| Final train prob (xent) | -1.260 | -1.026 | -1.024 | -1.037 |
| Final valid prob (xent) | -1.3193 | -1.1072 | -1.1048 | -1.1147 |

In the following results, input_dim ~= 2330 (i.e. ~50% bigger layers):

| System | tdnn_7k | head20_dim37 | head15_dim50 | head10_dim75 |
|---|---|---|---|---|
| WER on train_dev (tg) | 13.93 | 14.07 | 13.96 | 14.01 |
| WER on train_dev (fg) | 12.85 | 12.98 | 12.90 | 12.81 |
| WER on eval2000 (tg) | 16.7 | 16.9 | 16.4 | 16.4 |
| WER on eval2000 (fg) | 15.0 | 15.3 | 14.8 | 14.9 |
| Final train prob | -0.085 | -0.076 | -0.078 | -0.077 |
| Final valid prob | -0.106 | -0.100 | -0.101 | -0.101 |
| Final train prob (xent) | -1.260 | -0.999 | -0.995 | -1.003 |
| Final valid prob (xent) | -1.3193 | -1.0964 | -1.0946 | -1.0926 |

In all these, there are 2 attention layers, one near the beginning and one near the end, and the context is (5, 2).
| Hm. It does look like having the key/value dimension less than about 50 is harmful. Try setups where the value dimension is larger than the key dimension (e.g. twice as large), with slightly fewer heads. If you could do some of these experiments in a context where you replace more of the TDNN layers with attention layers, the results might be more different, even if we ultimately decide to use only a couple of attention layers. |
| Will do |
| Results regarding the position of the attention layer in the network: |
Since attention is good at layer 7, I tried bigger value dimensions with that:

| System | tdnn_7k | L7_key50_val50 | L7_key50_val100 | L7_key40-val80 |
|---|---|---|---|---|
| WER on train_dev (tg) | 13.93 | 13.64 | 13.64 | 13.76 |
| WER on train_dev (fg) | 12.85 | 12.55 | 12.68 | 12.62 |
| WER on eval2000 (tg) | 16.7 | 16.3 | 16.3 | 16.2 |
| WER on eval2000 (fg) | 15.0 | 14.8 | 14.7 | 14.6 |
| Final train prob | -0.085 | -0.077 | -0.074 | -0.076 |
| Final valid prob | -0.106 | -0.099 | -0.095 | -0.098 |
| Final train prob (xent) | -1.260 | -1.009 | -0.984 | -0.997 |
| Final valid prob (xent) | -1.3193 | -1.0980 | -1.0727 | -1.0887 |

So the best result is currently L7_key40-val80, with 0.5% and 0.4% absolute improvement on eval2000 tg and fg.
| Cool! Don't ignore the train_dev; those numbers are just as important as eval2000. Overall it's not clear to me that you're getting an improvement from a larger value-dim. Try a larger amount of left and right context for the attention layer, or two attention layers right at the end. |
| Will do. Re train_dev, is it really just as important as eval2000? My impression was that train_dev has a lot of speaker overlap and should be considered less important when tuning. |
| I think even the test sets have speaker overlap on Switchboard. The only time I'd be concerned about speaker overlap is in things that are specifically about speaker adaptation. These changes are orthogonal to that, and I'm more concerned about test-set noise due to limited size. |
Add/update attention recipes + minor xconfig update
| @hhadian, sorry, your authorship seems to have been lost by git due to the squash (I don't like to merge, except between versions of Kaldi). Next time you merge stuff, it will be to master anyway. |
* 'master' of https://github.com/kaldi-asr/kaldi: (43 commits)
  * [src,scripts,egs] Transfer learning for ASR with nnet3 (kaldi-asr#1633)
  * [src,scripts,egs] Attention modeling, with example scripts (kaldi-asr#1731)
  * [src] Fix bug in block matrix addition (thanks: Sidhi Adkoli).
  * [egs] Fix inconsequential input-checking bug in Swbd example script (kaldi-asr#1886)
  * [build] dependency-check: that python2.7 and python3 exist and 2.7 is default (kaldi-asr#1876)
  * [scripts] A cosmetic change to info messages in chain training (kaldi-asr#1880)
  * [doc] Keep tutorial code up to date (thanks: Luwei Yang)
  * [scripts] Bug-fix in long-utterance-segmentation script (thanks: Armin Oliya) (kaldi-asr#1877)
  * [egs] Fixed some issues in the multilingual BABEL example scripts (kaldi-asr#1850)
  * [build] Cosmetic fix in Makefile
  * Remove memory leaks and unused variables (when CUDA is not enabled) (kaldi-asr#1866)
  * [scripts] Fix default for egs.cmd in nnet3 training scripts (kaldi-asr#1865)
  * [doc] Fix to how documentation is built (thanks: David van Leeuwen)
  * [scripts] Add --decode-extra-opts in steps/decode.sh (required for speech activity detection scripts) (kaldi-asr#1859)
  * [src] Adding documentation for lattice discriminative training functions (kaldi-asr#1854)
  * [src] Typo fixes in documentation. (kaldi-asr#1857)
  * [egs] Update to score.sh in fisher_swbd setup, allow --iter option (kaldi-asr#1853)
  * [scripts] bug-fix in TFRNNLM rescoring script (no 'ark' needed for unk.probs file) (kaldi-asr#1851)
  * [src] Remove repeated parameter documentation. (kaldi-asr#1849)
  * [egs] Aspire example scripts: Update autoencoder example to xconfig (kaldi-asr#1847)
  * ...
still far from compiling.