Replies: 1 comment
It occurred to me that the paged KV cache in vLLM is likely not that useful for training, since the cached keys and values go stale while the model's weights are still changing. However, other pieces of vLLM could still be useful. For instance, there's a cos/sin cache for computing rotary embeddings (https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/attention.py#L261-L275), and a custom CUDA kernel that takes that cache and applies the rotary embeddings: https://github.com/vllm-project/vllm/blob/main/csrc/pos_encoding_kernels.cu
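To illustrate why the cos/sin cache carries over to training: the cached values depend only on the token positions and the head dimension, not on the model's weights, so they can be precomputed once and reused every step. Here's a minimal PyTorch sketch of that idea, not vLLM's actual implementation (which fuses the rotation into the pos_encoding_kernels.cu kernel); the interleaved channel-pair layout and the function names below are assumptions for illustration:

```python
# Minimal sketch of a cos/sin cache for rotary embeddings (illustrative, not vLLM's code).
import torch


def build_rotary_cache(head_dim: int, max_position: int, base: float = 10000.0):
    """Precompute cos/sin for every (position, rotary frequency) pair once."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_position).float()
    angles = torch.outer(positions, inv_freq)        # [max_position, head_dim // 2]
    return angles.cos(), angles.sin()


def apply_rotary(x, cos, sin, positions):
    """Rotate interleaved channel pairs of x ([..., seq, head_dim]) by the cached angles."""
    cos, sin = cos[positions], sin[positions]        # [seq, head_dim // 2]
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out


# The cache depends only on positions and head_dim, never on model weights,
# so it can be built once and shared across every training step.
cos, sin = build_rotary_cache(head_dim=64, max_position=2048)
q = torch.randn(2, 8, 128, 64)                       # [batch, heads, seq, head_dim]
q_rot = apply_rotary(q, cos, sin, torch.arange(128))
```

Because the cache is weight-independent, this part (and the fused kernel that consumes it) is the kind of piece that could be reused in a training loop, even though the KV cache itself is not.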
I'm curious whether there are parts of PagedAttention that could be useful for speeding up training.
I know vLLM is experimenting with both xformers and flash-attention for attention, and many training frameworks also rely on those. But are there other parts of PagedAttention that could be reused during training for additional performance improvements?