Support cross attention kv cache #187

larryliu0820 · 2025-11-18T08:30:08Z

To avoid redundant computation, we want to support a KV cache for cross attention in Whisper.

Fundamentally, k_proj and v_proj only need to run once on the encoder output hidden states, at the first token generation. After that, we should keep key_states and value_states and reuse them for all subsequent tokens.

For whisper-large-v3-turbo, the decoder has 4 layers:

WhisperDecoder(
  (embed_tokens): Embedding(51866, 1280, padding_idx=50257)
  (embed_positions): WhisperPositionalEmbedding(448, 1280)
  (layers): ModuleList(
    (0-3): 4 x WhisperDecoderLayer(
      (self_attn): WhisperAttention(
        (k_proj): Linear(in_features=1280, out_features=1280, bias=False)
        (v_proj): Linear(in_features=1280, out_features=1280, bias=True)
        (q_proj): Linear(in_features=1280, out_features=1280, bias=True)
        (out_proj): Linear(in_features=1280, out_features=1280, bias=True)
      )
      (activation_fn): GELUActivation()
      (self_attn_layer_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
      (encoder_attn): WhisperAttention(
        (k_proj): Linear(in_features=1280, out_features=1280, bias=False)
        (v_proj): Linear(in_features=1280, out_features=1280, bias=True)
        (q_proj): Linear(in_features=1280, out_features=1280, bias=True)
        (out_proj): Linear(in_features=1280, out_features=1280, bias=True)
      )
      (encoder_attn_layer_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
      (fc1): Linear(in_features=1280, out_features=5120, bias=True)
      (fc2): Linear(in_features=5120, out_features=1280, bias=True)
      (final_layer_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
    )
  )
  (layer_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
)

Without a KV cache in encoder_attn, every generated token reruns two projections with 1280x1280 weights (k_proj and v_proj) per layer over the same encoder output, so 8 such matmuls per token across the 4 layers. This significantly hurts tokens/sec.
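To put a rough number on that cost, here is a back-of-the-envelope estimate. It is a sketch under stated assumptions, not a measurement: batch size 1, an encoder sequence length of 1500 (the max_source_positions value discussed in the review thread), bias adds ignored, and 2 FLOPs per multiply-accumulate. The helper name projection_flops is made up for this example.

```python
# Rough cost of recomputing the cross-attention K/V projections on every token.
D_MODEL = 1280      # hidden size, from the module printout
NUM_LAYERS = 4      # decoder layers in whisper-large-v3-turbo
ENC_LEN = 1500      # ASSUMPTION: encoder sequence length (max_source_positions)


def projection_flops(seq_len: int, d_in: int, d_out: int) -> int:
    """FLOPs for one bias-free Linear applied to seq_len rows (2 per multiply-accumulate)."""
    return 2 * seq_len * d_in * d_out


per_projection = projection_flops(ENC_LEN, D_MODEL, D_MODEL)
projections_per_token = 2 * NUM_LAYERS  # k_proj and v_proj in each decoder layer
redundant_flops_per_token = per_projection * projections_per_token

print(f"{projections_per_token} projections per token")
print(f"{redundant_flops_per_token / 1e9:.1f} GFLOPs of redundant work per token")
```

Under these assumptions the redundant work comes to about 39.3 GFLOPs per generated token, all of which a cross-attention KV cache avoids after the first step.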

This PR replaces encoder_attn with a WhisperCrossAttention class, where the Python if condition is replaced with torch.cond. The logic becomes:

If KV cache values are all zero:
- Compute KV projections
Otherwise:
- Clone from KV cache. Note that we can't return the KV cache directly, because the branches of torch.cond must not return tensors that alias their inputs.
After torch.cond:
- Write the values returned by whichever branch ran back to the KV cache

Note that this still costs 1 extra read and 1 extra write of the cache per step, but that should be much faster than the matmuls.
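The branch structure above can be modeled in plain Python. This is only a control-flow sketch, not the PR's code and not torch.cond itself: nested lists stand in for tensors, project is a made-up stand-in for k_proj / v_proj, and the stats dictionary exists only to count how often the projections run.

```python
import copy


def project(encoder_states, weight):
    """Hypothetical stand-in for k_proj / v_proj: scales every element by `weight`."""
    return [[x * weight for x in row] for row in encoder_states]


def is_all_zero(cache):
    return all(x == 0 for row in cache for x in row)


def cross_attn_kv(encoder_states, k_cache, v_cache, stats):
    def compute_branch():
        # First token: run both projections on the encoder output.
        stats["projections"] += 2
        return project(encoder_states, 2.0), project(encoder_states, 3.0)

    def reuse_branch():
        # Later tokens: return copies so the outputs never alias the cache.
        return copy.deepcopy(k_cache), copy.deepcopy(v_cache)

    # The predicate: an all-zero cache means it has not been filled yet.
    empty = is_all_zero(k_cache) and is_all_zero(v_cache)
    k, v = compute_branch() if empty else reuse_branch()

    # After the conditional: write the result back in place, whichever branch ran.
    for dst, src in ((k_cache, k), (v_cache, v)):
        for i, row in enumerate(src):
            dst[i][:] = row
    return k, v


encoder_states = [[1.0, 2.0], [3.0, 4.0]]
k_cache = [[0.0, 0.0], [0.0, 0.0]]
v_cache = [[0.0, 0.0], [0.0, 0.0]]
stats = {"projections": 0}

for _ in range(3):  # three decoding steps
    k, v = cross_attn_kv(encoder_states, k_cache, v_cache, stats)

print(stats["projections"])  # the projections ran on the first step only
```

The unconditional write-back after the conditional is the "1 extra write" mentioned above, and the copy in the reuse branch is the "1 extra read".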

jackzhxng · 2025-11-19T17:12:43Z

optimum/exporters/executorch/integrations.py

+ self.cross_attention_cache = StaticCache(
+ config=self.config,
+ max_batch_size=batch_size,
+ max_cache_len=getattr(self.config, "max_source_positions", max_static_cache_length), # This is fixed in whisper


Pull this outside into a var like the other arguments

jackzhxng · 2025-11-19T17:13:16Z

optimum/exporters/executorch/integrations.py

+ self.cross_attention_cache = StaticCache(
+ config=self.config,
+ max_batch_size=batch_size,
+ max_cache_len=getattr(self.config, "max_source_positions", max_static_cache_length), # This is fixed in whisper


Also what do you mean this is fixed in whisper? Will this work for t5?

larryliu0820 · 2025-11-19

Whisper models always have max_source_positions = 1500, which corresponds to 30 seconds of audio, so we should use that as the cache length. I don't know about T5, which is why I named this class WhisperCrossAttention.
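The link between 1500 positions and 30 seconds can be checked with a little arithmetic. The constants below are assumptions taken from Whisper's standard audio preprocessing as commonly documented (16 kHz audio, a 160-sample hop between mel frames, and a stride-2 convolution in the encoder); they are not stated in this PR, and encoder_positions is a name made up for this sketch.

```python
SAMPLE_RATE = 16_000   # ASSUMPTION: samples per second
HOP_LENGTH = 160       # ASSUMPTION: samples between consecutive mel frames (10 ms)
CONV_STRIDE = 2        # ASSUMPTION: downsampling factor of the encoder's strided conv
CHUNK_SECONDS = 30     # audio window length mentioned in the reply above


def encoder_positions(seconds: int) -> int:
    """Number of encoder output positions for `seconds` of audio."""
    mel_frames = seconds * SAMPLE_RATE // HOP_LENGTH
    return mel_frames // CONV_STRIDE


positions = encoder_positions(CHUNK_SECONDS)
print(positions)
```

With these values the result matches max_source_positions, which is why a fixed cross-attention cache length is reasonable for Whisper but may not carry over to other encoder-decoder models.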

optimum/exporters/executorch/integrations.py

optimum/exporters/executorch/whisper_attention.py

jackzhxng · 2025-11-19

Oh also run make style for formatting

larryliu0820 · 2025-11-24T23:29:22Z

This works and gives correct output, but we still need to copy data from GPU to CPU just to evaluate the predicate, and there is no way to work around that.

For whisper-large-v3-turbo, there are 4 decoder layers, so we see 4 cudaMemcpyAsync blocks in each token generation.

This is too expensive to be a good solution.
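The later commit titles ("Try adding a flag on CPU", "Use bool tensor") hint at moving the predicate away from the cache contents. The toy model below only illustrates why that could help; it is not the PR's implementation. CrossAttnCacheModel, device_reads, and decode are all hypothetical names, and a "device read" here is just a counter standing in for a GPU-to-CPU copy.

```python
class CrossAttnCacheModel:
    """Toy per-layer cache; `device_reads` counts predicate checks that inspect cache contents."""

    def __init__(self, rows: int, cols: int):
        self.k_cache = [[0.0] * cols for _ in range(rows)]
        self.initialized = False  # hypothetical host-side flag
        self.device_reads = 0

    def predicate_by_scan(self) -> bool:
        # Models the all-zero check: the cache contents must reach the host.
        self.device_reads += 1
        return all(x == 0 for row in self.k_cache for x in row)

    def predicate_by_flag(self) -> bool:
        # Models a host-side flag: no cache contents are inspected.
        return not self.initialized

    def fill(self, value: float) -> None:
        for row in self.k_cache:
            row[:] = [value] * len(row)
        self.initialized = True


def decode(num_tokens: int, num_layers: int, use_flag: bool) -> int:
    """Return the total number of modeled device reads over a decoding run."""
    layers = [CrossAttnCacheModel(rows=2, cols=2) for _ in range(num_layers)]
    for _ in range(num_tokens):
        for layer in layers:
            empty = layer.predicate_by_flag() if use_flag else layer.predicate_by_scan()
            if empty:
                layer.fill(1.0)  # stand-in for computing the K/V projections
    return sum(layer.device_reads for layer in layers)


print(decode(num_tokens=5, num_layers=4, use_flag=False))
print(decode(num_tokens=5, num_layers=4, use_flag=True))
```

In this model the scan-based predicate costs one device read per layer per token (4 per token for 4 layers), while the flag-based predicate costs none. Whether a real exported graph can consume such a flag without its own transfer depends on the backend.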

larryliu0820 requested a review from jackzhxng November 18, 2025 08:30

jackzhxng reviewed Nov 19, 2025


larryliu0820 added 3 commits November 21, 2025 01:24

Support cross attention kv cache

d74b934

Make it work

8c7ad34

Fix max_source_positions

6ca7dd0

larryliu0820 force-pushed the whisper_cond branch from b6e172d to 6ca7dd0 Compare November 21, 2025 09:29

Address comments

cded66e

Try adding a flag on CPU

128b100

larryliu0820 force-pushed the whisper_cond branch from 14a9a0e to 128b100 Compare November 25, 2025 22:11

larryliu0820 and others added 5 commits November 25, 2025 15:08

Change fill_ to index update

7ef3ba4

Use bool tensor

7268245

More fix

fe81adf

Final fix

a3f4f20

Stop memory planning input so that we can avoid copies

d10725b
