
Conversation

@bythew3i (Contributor) commented Mar 5, 2025

Tested:

```
python test/test_pallas.py -v -k PallasTest.test_ragged_paged_attention_wrapper
```

Please Read

This PR adds validation of the ragged attention inputs to torch.ops.xla.ragged_paged_attention and expects it to run at runtime. Please move the validation code out if we ever have to compile something like the snippet below (or just avoid compiling it):

```python
def ragged_paged_attention_wrapper(...):
    ...
    return torch.ops.xla.ragged_paged_attention(...)

compiled_paged_attention = torch.compile(
    ragged_paged_attention_wrapper, backend="openxla")
```
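As a hedged sketch of one way to honor that request, the wrapper below keeps validation eager so `torch.compile` only traces the kernel call. `validate_ragged_attn_inputs` is a hypothetical helper introduced here for illustration, not part of this PR; the real checks live inside `torch.ops.xla.ragged_paged_attention` and run at runtime.

```python
import torch

def validate_ragged_attn_inputs(*args, **kwargs):
    # Hypothetical placeholder: the actual validation added by this PR lives
    # inside torch.ops.xla.ragged_paged_attention and runs at runtime.
    pass

def ragged_paged_attention_wrapper(*args, **kwargs):
    return torch.ops.xla.ragged_paged_attention(*args, **kwargs)

compiled_paged_attention = torch.compile(
    ragged_paged_attention_wrapper, backend="openxla")

def paged_attention(*args, **kwargs):
    validate_ragged_attn_inputs(*args, **kwargs)  # runs eagerly, never traced
    return compiled_paged_attention(*args, **kwargs)
```

This keeps the runtime checks intact while the compiled graph stays free of data-dependent assertions.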

Key Features in Ragged Paged Attention V2

  • Supports mixed prefill and decode to increase inference throughput (e.g., a 5x speedup over the padded Multi-Queries Paged Attention implementation for llama-3-8b).
  • No explicit swapaxes of seq_len and num_head before or after the kernel. The kernel takes num_head in the second-minor dimension, as it naturally is; the swapaxes is folded into strided loads/stores inside the kernel, applying the transpose on the fly.
  • No GMM (Grouped Matmul) metadata required: the metadata is computed on the fly inside the kernel, which alone yields about a 10% speedup.
  • Increases MXU utilization up to 8x for GQA in decode by grouping the q heads that share a KV head into a single matmul (see the sketch after this list).
  • Minimizes recompilation: the only factors that can trigger recompilation are the model specs, max_num_batched_tokens, and max_num_seqs in the mixed-engine setting.
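
To make the GQA grouping concrete, here is a minimal plain-PyTorch sketch of the idea (illustrative shapes, not the Pallas kernel itself): query heads that share one KV head are batched into a single matmul, so the MXU sees a [group, head_dim] left operand instead of `group` separate [1, head_dim] rows.

```python
import torch

num_q_heads, num_kv_heads, head_dim, kv_len = 32, 8, 128, 1024
group = num_q_heads // num_kv_heads  # q heads sharing each KV head

q = torch.randn(num_q_heads, head_dim)           # one decode token
k = torch.randn(num_kv_heads, kv_len, head_dim)  # cached keys per KV head

# Grouped: one [group, head_dim] x [head_dim, kv_len] matmul per KV head.
qg = q.reshape(num_kv_heads, group, head_dim)
scores = torch.einsum("ghd,gkd->ghk", qg, k)     # [num_kv_heads, group, kv_len]

# Sanity check: matches the naive per-q-head computation.
k_rep = k.repeat_interleave(group, dim=0)        # [num_q_heads, kv_len, head_dim]
naive = torch.einsum("hd,hkd->hk", q, k_rep)
assert torch.allclose(scores.reshape(num_q_heads, kv_len), naive, atol=1e-5)
```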

Note: this PR does not include tests for the Ragged Paged Attention kernel itself, because it is already tested in jax-ml/jax#26920, and in the future we will import that source directly instead of keeping duplicated implementations.

@yaochengji yaochengji self-requested a review March 5, 2025 17:49
@yaochengji (Collaborator) left a comment


LGTM, thanks!

@yaochengji yaochengji enabled auto-merge (squash) March 5, 2025 17:50
@yaochengji yaochengji merged commit 5644f44 into pytorch:master Mar 5, 2025
22 of 23 checks passed
pgmoka pushed a commit that referenced this pull request Mar 5, 2025
@zpcore (Member) commented Mar 5, 2025

The test `test_ragged_paged_attention_wrapper_with_padding_with_dynamo2` is failing. Can someone help fix it? Thanks.

@yaochengji (Collaborator) commented

@zpcore, thanks, it is fixed in #8797.

