Replies: 1 comment
It occurred to me that the paged KV cache in vLLM is likely not that useful for training, since the cached keys and values go stale while the model's weights are still changing. However, other pieces of vLLM could still be useful. For instance, there's a cos/sin cache for computing rotary embeddings (https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/attention.py#L261-L275), and a custom CUDA kernel that takes that cache and applies the rotary embeddings: https://github.com/vllm-project/vllm/blob/main/csrc/pos_encoding_kernels.cu
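To illustrate why the cos/sin cache carries over to training: the cached values depend only on the token positions and the head dimension, not on the model's weights, so they can be precomputed once and reused every step. Here's a minimal PyTorch sketch of that idea, not vLLM's actual implementation (which fuses the rotation into the pos_encoding_kernels.cu kernel); the interleaved channel-pair layout and the function names below are assumptions for illustration:

```python
# Minimal sketch of a cos/sin cache for rotary embeddings (illustrative, not vLLM's code).
import torch


def build_rotary_cache(head_dim: int, max_position: int, base: float = 10000.0):
    """Precompute cos/sin for every (position, rotary frequency) pair once."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_position).float()
    angles = torch.outer(positions, inv_freq)        # [max_position, head_dim // 2]
    return angles.cos(), angles.sin()


def apply_rotary(x, cos, sin, positions):
    """Rotate interleaved channel pairs of x ([..., seq, head_dim]) by the cached angles."""
    cos, sin = cos[positions], sin[positions]        # [seq, head_dim // 2]
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out


# The cache depends only on positions and head_dim, never on model weights,
# so it can be built once and shared across every training step.
cos, sin = build_rotary_cache(head_dim=64, max_position=2048)
q = torch.randn(2, 8, 128, 64)                       # [batch, heads, seq, head_dim]
q_rot = apply_rotary(q, cos, sin, torch.arange(128))
```

Because the cache is weight-independent, this part (and the fused kernel that consumes it) is the kind of piece that could be reused in a training loop, even though the KV cache itself is not.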
I'm curious whether there are parts of PagedAttention that could be useful for speeding up training.
I know vLLM is experimenting with both xformers and flash-attention for attention, and many training frameworks also rely on those. But are there other parts of PagedAttention that could be reused during training for additional performance improvements?