Enable PagedAttention through Pallas #6912
Conversation
cc @WoosukKwon to take a look
Locally, the tests are succeeding on my v4:
I also just triggered the TPU CI on this PR.
The CPU CI is failing with an unrelated test:
The CI, including the TPU CI, is passing, so this PR should be ready for review. Thanks!
test/test_pallas.py (outdated)
    torch.allclose(
        output.cpu()[seq_lens > 0],
        expected_output.cpu()[seq_lens > 0],
        atol=1e-1,
wdyt we use a tighter bound for atol and rtol? e.g. 1e-3
Sg, updated to 1e-5 for both tests.
Thanks @wonjoolee95 - left a comment for you to eval and address - approving to unblock you
In general, it looks good to me. Left a few comments.
@@ -331,6 +331,51 @@ def flash_attention(
  return FlashAttention.apply(q, k, v, causal)


def paged_attention(q, k_pages, v_pages, lengths, page_indices,
The original kernel has this thing called q_dtype_for_kernel_launch. What does it do? Should we copy that as well?
In the original kernel, the q_dtype_for_kernel_launch is always either jnp.float32 or q's dtype. In our case, I'm expecting the passed-in q's dtype to be torch.float32, so the q_dtype_for_kernel_launch will always be float32.
No, I don't think that will be the case for actual workloads. It could be bf16 or even int8, etc.
I see, makes sense. Just updated to handle q_dtype_for_kernel_launch, following JAX's kernel -- https://github.com/google/jax/blob/main/jax/experimental/pallas/ops/tpu/paged_attention/paged_attention_kernel.py#L393. I can follow up in another PR to add some more unit tests for different dtypes for q.
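To make that rule concrete, here is a hedged sketch of the dtype selection on the torch side, mirroring the referenced JAX kernel. The helper name _q_dtype_for_kernel_launch is made up for illustration and is not the PR's exact code: the kernel is launched with float32 queries when num_heads // num_kv_heads is not a multiple of 8, and otherwise keeps q's own dtype.
import torch

def _q_dtype_for_kernel_launch(q: torch.Tensor, num_heads: int,
                               num_kv_heads: int) -> torch.dtype:
  # Hypothetical helper sketch: mirror the JAX Pallas kernel's rule of launching
  # in float32 when the head ratio is not a multiple of 8 (lane padding), and
  # otherwise keeping the query's own dtype (e.g. bf16).
  if (num_heads // num_kv_heads) % 8 != 0:
    return torch.float32
  return q.dtype

# Example: a bf16 query with num_heads == num_kv_heads launches in float32,
# while a bf16 query with an 8x head ratio keeps bf16.
q = torch.randn(4, 8, 128, dtype=torch.bfloat16)
print(_q_dtype_for_kernel_launch(q, num_heads=8, num_kv_heads=8))  # torch.float32
print(_q_dtype_for_kernel_launch(q, num_heads=8, num_kv_heads=1))  # torch.bfloat16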
                    pages_per_compute_block: int):
  # This will be called when dynamo uses fake tensors to construct the fake output.
  # We need to make sure the output tensor's shape is correct.
  if k.device != torch.device("meta"):
It feels like this part can be consolidated with the flash attention one.
Sg, refactored these into a helper function.
test/test_pallas.py (outdated)
  @unittest.skipIf(xr.device_type() != 'TPU' or tpu.version() < 4,
                   "This test only works on TPUv4+.")
  def test_paged_attention_wrapper(self):
    jax.config.update('jax_default_matmul_precision', jax.lax.Precision.HIGHEST)
It's interesting that you use JAX as the reference. I guess that works too. Wondering if we can just use the eager attention helper in the class instead? Or does that not work? Anyway, if you are using JAX as the reference, you can drop this.
Sg, yeah I saw that we're dependent on JAX Pallas anyway, so I thought it may be easier to just test against JAX's outputs.
Ah, makes sense. Just removed the jax.config updates.
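To make the testing approach concrete, here is a hedged sketch of the comparison the test performs. The shapes, variable names, and the JAX import path are assumptions (the reference kernel has moved between JAX versions), and it needs a v4+ TPU to actually run: build one set of inputs, run both the torch_xla wrapper and the JAX Pallas kernel, and compare.
import jax.numpy as jnp
import numpy as np
import torch
from torch_xla.experimental.custom_kernel import paged_attention
# Assumed import path for the reference kernel; may differ across JAX versions.
from jax.experimental.pallas.ops.tpu.paged_attention.paged_attention_kernel import (
    paged_attention as jax_paged_attention)

batch_size, num_heads, num_kv_heads, head_dim = 4, 8, 8, 128
total_pages, page_size, pages_per_seq = 32, 16, 8

q = torch.randn(batch_size, num_heads, head_dim, dtype=torch.float32)
k_pages = torch.randn(num_kv_heads, total_pages, page_size, head_dim)
v_pages = torch.randn(num_kv_heads, total_pages, page_size, head_dim)
seq_lens = torch.tensor([0, 3, 64, 128], dtype=torch.int32)
page_indices = torch.randint(
    0, total_pages, (batch_size, pages_per_seq), dtype=torch.int32)

# torch_xla wrapper, running the Pallas kernel on the TPU.
output = paged_attention(
    q.to("xla"), k_pages.to("xla"), v_pages.to("xla"), seq_lens.to("xla"),
    page_indices.to("xla"), pages_per_compute_block=4)

# JAX Pallas kernel on the same host inputs, used as the reference.
expected = jax_paged_attention(
    jnp.asarray(q.numpy()), jnp.asarray(k_pages.numpy()),
    jnp.asarray(v_pages.numpy()), jnp.asarray(seq_lens.numpy()),
    jnp.asarray(page_indices.numpy()), pages_per_compute_block=4)
expected = torch.from_numpy(np.asarray(expected))

# Only compare non-empty sequences, as in the test snippet above.
assert torch.allclose(
    output.cpu()[seq_lens > 0], expected[seq_lens > 0], atol=1e-5, rtol=1e-5)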
        q_xla,
        k_pages_xla,
        v_pages_xla,
        seq_lens_xla,
Can you explain what these seq_lens are? Are these the previous tokens for each batch in k, v?
Yep, that is my understanding -- the seq_lens here equal the number of tokens that have already been processed for each sequence in the batch. Reference: https://docs.vllm.ai/en/latest/dev/kernel/paged_attention.html#concepts.
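A small illustrative sketch of that relationship (the numbers are made up, not taken from the test): seq_lens holds the per-sequence token counts, and together with the page size it determines how many entries of page_indices are actually read for each sequence.
import torch

page_size = 16
# Tokens already processed for each of the 3 sequences in the batch.
seq_lens = torch.tensor([40, 0, 17], dtype=torch.int32)
# Number of KV-cache pages each sequence occupies (ceiling division).
pages_used = (seq_lens + page_size - 1) // page_size
print(pages_used)  # -> [3, 0, 2] pages per sequence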
  # We need to make sure output tensor's shape is correct.
  if k.device != torch.device("meta"):
    warnings.warn(
        'XLA flash attention should only be applied to tensors on XLA device')
nit: paged attention instead of flash attention
Actually, it is not even specific to paged attention; you can just make this warning message more general.
Good catch, updated to use an f-string.
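Putting the last two threads together, here is a hedged sketch of what the shared helper could look like. The helper name and its arguments are hypothetical, not the PR's exact code: it warns with an f-string so flash attention and paged attention can reuse the same message, and it hands back a correctly shaped placeholder for dynamo's fake-tensor tracing.
import warnings
import torch

def _non_xla_attention_fallback(t: torch.Tensor, output_shape, kernel_name: str):
  # Hypothetical shared helper sketch, not the PR's exact code.
  # Warn only for real (non-meta) tensors that are not on the XLA device;
  # meta tensors are expected when dynamo constructs the fake output.
  if t.device != torch.device("meta"):
    warnings.warn(
        f'XLA {kernel_name} should only be applied to tensors on XLA device')
  # Return an empty tensor of the right shape so tracing can continue.
  return torch.empty(output_shape, dtype=t.dtype, device=t.device)

# Usage sketch inside the non-XLA branch of paged_attention:
# return _non_xla_attention_fallback(k, q.shape, 'paged attention')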
@@ -400,6 +400,9 @@ def paged_attention(q, k_pages, v_pages, lengths, page_indices,
  buffer_index = torch.zeros((1,), dtype=torch.int32).to("xla")
  step = torch.zeros((1,), dtype=torch.int32).to("xla")
  output_shape = torch.Size(list(q.shape[:-1]) + [1])
  q_output_dtype = torch.float32
  if (num_heads // num_kv_heads) % 8 != 0:
I guess you can combine this with the above L396 code.
Good catch! Updated.
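A hedged sketch of the combined version this suggestion points at (variable names follow the snippet above; this is not the exact merged code): a single branch on the head ratio decides both the dtype the kernel is launched with and the dtype torch_xla expects back for the first output.
# Single branch deciding both dtypes, instead of repeating the head-ratio check.
if (num_heads // num_kv_heads) % 8 != 0:
  q_dtype_for_kernel_launch = torch.float32
  q_output_dtype = torch.float32
else:
  q_dtype_for_kernel_launch = q.dtype
  q_output_dtype = q.dtype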
Thanks all for the reviews. After addressing all the comments, the two unit tests are still passing locally on my v4. I'll let the TPU CI verify one more time before merging.
      ], payload, [q.shape, output_shape, output_shape],
      [q_output_dtype, torch.float32, torch.float32])

  return output.reshape(batch_size, num_heads, head_dim)
You probably want to use .to to cast the output back to the original dtype here.
Updated.
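A hedged sketch of the resulting return, following the snippet above rather than the exact merged code: the kernel output, which may have been computed in float32, is cast back to the query's original dtype before being returned.
  # Cast back to q's original dtype in case the kernel ran in float32.
  return output.reshape(batch_size, num_heads, head_dim).to(q.dtype)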
Merging as all CI is green.
Enable PagedAttention through Pallas
Test plan:
Todo as follow-ups: