Pre-allocate KV tensors and use inference mode #13
Fixes: #12 (Use bigcode-project/transformers#5)
| Variant | e2e (ms) | CUDA (ms) |
| --- | --- | --- |
| MQA, before | 1242 | 762 |
| MQA, pre-allocate | 1248 | 693 |
| MQA, inference mode | 1130 | 690 |
| MQA, inference mode, no pre-allocate | 1121 | 760 |
| MHA, before | 2241 | 2201 |
| MHA, pre-allocate | 1442 | 1048 |
| MHA, inference mode | 1249 | 1046 |

Note: pre-allocation gives no e2e speedup for MQA because that case is CPU-bottlenecked, but it will speed up cases that aren't.
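For context, the two optimizations can be sketched as follows. This is a minimal illustration, not the PR's actual code: the names (`kv_cache`, `decode_step`, the shape constants) are hypothetical, and the real change lives in the model's attention layers. The idea is to allocate the KV buffers once at their maximum size and write each step's keys/values in place, instead of growing the cache with `torch.cat` every step, and to run decoding under `torch.inference_mode()`, which skips autograd bookkeeping more aggressively than `torch.no_grad()`.

```python
import torch

# Illustrative sizes; MQA uses a single KV head shared across query heads.
n_layers, batch, max_seq_len, n_kv_heads, head_dim = 2, 1, 16, 1, 8

# Pre-allocate the full-size KV cache once, up front.
kv_cache = [
    (torch.empty(batch, n_kv_heads, max_seq_len, head_dim),
     torch.empty(batch, n_kv_heads, max_seq_len, head_dim))
    for _ in range(n_layers)
]

@torch.inference_mode()  # cheaper than no_grad: no version counters / view tracking
def decode_step(pos, new_k, new_v):
    """Write this step's K/V into the pre-allocated buffers at `pos`."""
    for k_buf, v_buf in kv_cache:
        # In-place write: no per-step allocation or torch.cat copy.
        k_buf[:, :, pos] = new_k
        v_buf[:, :, pos] = new_v
        # Attention for this step would read k_buf[:, :, :pos + 1].
    return pos + 1

pos = decode_step(0,
                  torch.randn(batch, n_kv_heads, head_dim),
                  torch.randn(batch, n_kv_heads, head_dim))
```

The in-place writes are what remove the `cat` copies from the CUDA timeline (the large CUDA-time drop in the MHA rows), while `inference_mode` mostly trims CPU-side overhead (the e2e drop in the MQA rows).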