0.0.8 (2024-07-03)
- fix prefill/append kernel behavior for empty kv-cache (#353) (7adc8c)
- fix decode attention kernel with logits cap (#350) (f5f7a2)
0.0.7 (2024-06-28)
- batch_decode_with_padded_kv_cache was removed; we encourage users to use BatchDecodeWithPagedKVCacheWrapper instead. (#343)
- fix the forward_return_lse function in the BatchPrefillWithRaggedKVCache class (#337)
- fix the scheduler behavior of large page size (#333)
- change minimal kv_chunk_size back to 128 (#329) (f237f5f)
- more options for kv tile size (#336) (bf2a6c7)
0.0.6 (2024-06-21)
Fix some bugs in v0.0.5 that might lead to crashes and unstable performance.
0.0.5 (2024-06-20)
- Support any GQA group size for tensor-cores kernels.
- Support any page size for tensor-cores kernels.
- Support CUDA-Graph for prefill/decode APIs.
- Add an option to accelerate decode kernels with Tensor Cores.
- Support custom attention mask. (https://docs.flashinfer.ai/tutorials/kv_layout.html#mask-layout-2d-ragged-tensor)
- Support logits cap in Grok-1 models.
- Fused GPU-sampling kernels: top-p, top-k, speculative verification. (https://docs.flashinfer.ai/api/python/sampling.html)
- PyTorch wrapper of group-gemm cutlass kernels. (https://docs.flashinfer.ai/api/python/group_gemm.html)
We thank @ibsidorenko, @LiuXiaoxuanPKU, @Yard1, @AgrawalAmey, @xuzhenqi, @mgerstgrasser, @esmeetu, @yz-tang, @HSQ79815, @Qubitium, @shreygupta2809, @sighingnow, @vinx13, @tqchen, @merrymercy, @comaniac and many others for their contributions and helpful discussions for the 0.0.5 release.
- support any GQA group size for tensor-cores kernels (#301) (c111ca)
- support any page size for tensor-cores kernels (#306) (82fd8c)
- add use_tensor_cores option to decode kernels to accelerate GQA (#317) (3b50dd5)
- add group gemm operators (#282) (e08ba42)
- initial support of distributed operators (#289) (03553da)
- initial support of logits hook (#298) (ab1e2ad)
- Separate Q and KV dtypes for decode (#286) (5602659)
- support cuda graph for batched multi-query(prefill/append) attention (#275) (83ceb67)
- support cuda graph for batched multi-query(prefill/append) attention (#277) (24cc583)
- support custom attention mask in prefill/append attention kernels (#266) (7304282)
- fused speculative sampling kernels (#259) (cea2bb)
- expose sampling APIs in pytorch (#238) (092902)
- initial cuda graph support (#256) (7e9cc7f)
- split kv-cache for prefill/append kernels (#310) (f0bb0a3)
- use packed bit array for attention mask (#308) (3d43dc9)
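For readers unfamiliar with the "packed bit array" idea in the last entry, the following is a minimal NumPy sketch of packing a boolean attention mask into bytes, 8 entries per byte. It is an illustration only, not the library's code: the helper names `pack_mask` and `mask_bit` are hypothetical, and the little-endian bit order is an assumption about the layout.

```python
import numpy as np

def pack_mask(mask):
    """Pack a boolean mask into uint8 bytes, 8 mask entries per byte."""
    flat = np.asarray(mask, dtype=bool).ravel()
    # Assumption: little-endian bit order, so entry i is stored in
    # bit (i % 8) of byte (i // 8). Trailing bits are zero-padded.
    return np.packbits(flat, bitorder="little")

def mask_bit(packed, i):
    """Read entry i back out of the packed array."""
    return bool((int(packed[i // 8]) >> (i % 8)) & 1)

# Causal mask for 3 queries x 3 keys: 9 booleans fit in 2 bytes.
causal = np.tril(np.ones((3, 3), dtype=bool))
packed = pack_mask(causal)
```

Storing one bit per entry instead of one byte (or one float) per entry shrinks the mask by 8x or more, which is the usual motivation for this layout.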
0.0.4 (2024-05-01)
- pytorch 2.3 support
- gpu sampling kernels (top-p, top-k)
- more gqa group sizes
- add mma instructions for fp8 (#179) (d305798)
- mma rowsum for fp8 (#180) (5af935c)
- support any num_heads for get_alibi_slope (#200) (b217a6f)
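As background for the get_alibi_slope entry above, here is a sketch of the commonly used recipe for ALiBi slopes when the number of heads is not a power of two. It is not taken from this project's source: the function name `alibi_slopes` is hypothetical, and the interleaving rule for the extra heads is an assumption based on the widely circulated reference recipe.

```python
def alibi_slopes(num_heads):
    """Return one ALiBi slope per head for any positive num_heads."""
    def power_of_two_slopes(n):
        # Geometric sequence 2^(-8/n), 2^(-16/n), ..., 2^(-8).
        start = 2.0 ** (-8.0 / n)
        return [start ** (i + 1) for i in range(n)]

    # Largest power of two that does not exceed num_heads.
    n = 1 << (num_heads.bit_length() - 1)
    slopes = power_of_two_slopes(n)
    if n < num_heads:
        # Assumption: remaining heads take every other slope from the
        # sequence for 2n heads, as in the common reference recipe.
        extra = power_of_two_slopes(2 * n)[0::2]
        slopes += extra[: num_heads - n]
    return slopes
```

For example, `alibi_slopes(8)` yields the eight power-of-two slopes starting at 0.5, while `alibi_slopes(12)` yields those eight followed by four slopes drawn from the 16-head sequence.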
0.0.3 (2024-03-08)
- adding sm_scale field for all attention APIs (#145) (85d4018)
- enable head_dim=256 for attention kernels (#132) (0372acc)
- pytorch api of fp8 kv-cache (#156) (66ee066)
- support ALiBi (#146) (383518b)
- bugfix to pr 135 (#136) (3d55c71)
- fix bugs introduced in #132 (#135) (9b7b0b9)
- fix FindThrust.cmake (#161) (30fa584)