- add basic WGPU tensor support
- factor out TensorArithmetic
- revise the dot product attention primitive
- reference mlx: https://github.com/simonw/llm-mlx-llama/blob/main/llm_mlx_llama.py#L81
- learn the KV cache layout
- store the meta buffers in a uniform buffer
- implement a better GEMV
- refactor the buf code
- q8_0 dot product (a scalar sketch follows this list)
- compare the q8_0 matmul FLOPS between ggml and crabml
- align the dot product performance: try manual NEON instructions (a sketch follows this list)
- find the source of the performance difference between ggml and crabml
- add per-token performance metrics
- add a b_offset parameter to the dot product (covered in the q8_0 sketch below)
- ggml matmul: ~65 ms per token, crabml: ~75 ms per token, about 10 ms slower
- add GEMV benchmark on 3200 x 8640, then 8640 x 3200
- tiling seems to have no effect at all; maybe the performance gap comes from memory reuse?
- setting the thread count to 2 gives the best performance, which is strange
- re-arrange the KV cache memory layout to leverage dense dot products (a layout sketch follows this list)
- benchmark rayon against a vanilla thread pool on GEMV (a benchmark sketch follows this list)
- q8 quantization on WebGPU
- add dequantize support to CpuTensor
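
A minimal scalar sketch of the q8_0 dot product with a b_offset, assuming a hypothetical `BlockQ8_0` type (32 signed 8-bit quants sharing one scale; ggml stores the scale as f16, f32 is used here to keep the sketch dependency-free). The names and signature are illustrative, not crabml's actual API.

```rust
/// Hypothetical q8_0 block: 32 signed 8-bit quants sharing one per-block scale.
const Q8_0_BLOCK_SIZE: usize = 32;

#[derive(Clone, Copy)]
pub struct BlockQ8_0 {
    pub d: f32,                    // per-block scale (f16 in ggml)
    pub qs: [i8; Q8_0_BLOCK_SIZE], // quantized values
}

/// Scalar q8_0 dot product. `b_offset` is a starting block index into `b`, so a
/// row of a larger quantized matrix can be addressed without slicing or copying.
pub fn vec_dot_q8_0(a: &[BlockQ8_0], b: &[BlockQ8_0], b_offset: usize) -> f32 {
    let mut sum = 0.0f32;
    for (i, ba) in a.iter().enumerate() {
        let bb = &b[b_offset + i];
        // accumulate the integer products in i32, apply both scales once per block
        let mut acc = 0i32;
        for j in 0..Q8_0_BLOCK_SIZE {
            acc += ba.qs[j] as i32 * bb.qs[j] as i32;
        }
        sum += ba.d * bb.d * acc as f32;
    }
    sum
}
```

Calling it with `b_offset = row * blocks_per_row` addresses one row of a quantized weight matrix directly, which is the motivation for the b_offset item above.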
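
A sketch of what "manual NEON instructions" could look like for the same kernel on aarch64, reusing the hypothetical `BlockQ8_0` above and only `std::arch::aarch64` intrinsics; the real kernel may be organized differently.

```rust
/// NEON variant of the q8_0 block dot product (aarch64 only).
#[cfg(target_arch = "aarch64")]
pub fn vec_dot_q8_0_neon(a: &[BlockQ8_0], b: &[BlockQ8_0]) -> f32 {
    use std::arch::aarch64::*;
    let mut sum = 0.0f32;
    for (ba, bb) in a.iter().zip(b) {
        // SAFETY: NEON is mandatory on aarch64 and each qs array holds exactly 32 bytes.
        let block_dot = unsafe {
            let a0 = vld1q_s8(ba.qs.as_ptr());
            let a1 = vld1q_s8(ba.qs.as_ptr().add(16));
            let b0 = vld1q_s8(bb.qs.as_ptr());
            let b1 = vld1q_s8(bb.qs.as_ptr().add(16));
            // widening i8 x i8 -> i16 multiplies, pairwise-accumulated into i32 lanes
            let mut acc = vdupq_n_s32(0);
            acc = vpadalq_s16(acc, vmull_s8(vget_low_s8(a0), vget_low_s8(b0)));
            acc = vpadalq_s16(acc, vmull_high_s8(a0, b0));
            acc = vpadalq_s16(acc, vmull_s8(vget_low_s8(a1), vget_low_s8(b1)));
            acc = vpadalq_s16(acc, vmull_high_s8(a1, b1));
            // horizontal sum of the four i32 lanes
            vaddvq_s32(acc)
        };
        sum += ba.d * bb.d * block_dot as f32;
    }
    sum
}
```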
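
For the KV cache rearrangement item, a sketch of one possible head-major layout: each head's key history becomes one contiguous strip, so attention scores for a head are plain dense dot products over contiguous rows instead of strided gathers. The struct and method names are hypothetical.

```rust
/// Hypothetical head-major key cache with layout [n_head][max_seq][head_dim].
pub struct HeadMajorKeyCache {
    n_head: usize,
    head_dim: usize,
    max_seq: usize,
    seq_len: usize,
    k: Vec<f32>,
}

impl HeadMajorKeyCache {
    pub fn new(n_head: usize, head_dim: usize, max_seq: usize) -> Self {
        let k = vec![0.0; n_head * max_seq * head_dim];
        Self { n_head, head_dim, max_seq, seq_len: 0, k }
    }

    /// Append one token's keys, given token-major as [n_head * head_dim].
    pub fn append(&mut self, k_token: &[f32]) {
        assert_eq!(k_token.len(), self.n_head * self.head_dim);
        for h in 0..self.n_head {
            let dst = (h * self.max_seq + self.seq_len) * self.head_dim;
            let src = h * self.head_dim;
            self.k[dst..dst + self.head_dim]
                .copy_from_slice(&k_token[src..src + self.head_dim]);
        }
        self.seq_len += 1;
    }

    /// Attention scores for one head: `q_h` is [head_dim], output is [seq_len].
    /// Each key row is contiguous, so this is a dense dot product per position.
    pub fn scores(&self, h: usize, q_h: &[f32]) -> Vec<f32> {
        let base = h * self.max_seq * self.head_dim;
        (0..self.seq_len)
            .map(|t| {
                let row = &self.k[base + t * self.head_dim..base + (t + 1) * self.head_dim];
                row.iter().zip(q_h).map(|(k, q)| k * q).sum()
            })
            .collect()
    }
}
```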
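
For the rayon vs. vanilla thread pool item, a rough benchmark sketch on the 3200 x 8640 GEMV shape mentioned above. It assumes `rayon` as a dependency, uses naive wall-clock timing (the first rayon call also pays pool start-up; a harness like criterion would give steadier numbers), and pins the manual version to 2 threads to match the observation above.

```rust
use rayon::prelude::*; // assumed dependency: rayon = "1"
use std::time::Instant;

/// Dot product of one row of a row-major matrix with x.
fn gemv_row(a_row: &[f32], x: &[f32]) -> f32 {
    a_row.iter().zip(x).map(|(a, b)| a * b).sum()
}

/// y = A * x using rayon's work-stealing pool, one task per output row.
fn gemv_rayon(a: &[f32], x: &[f32], y: &mut [f32], n: usize) {
    y.par_iter_mut().enumerate().for_each(|(i, yi)| {
        *yi = gemv_row(&a[i * n..(i + 1) * n], x);
    });
}

/// y = A * x using plain scoped threads, one contiguous row chunk per thread.
fn gemv_threads(a: &[f32], x: &[f32], y: &mut [f32], n: usize, n_threads: usize) {
    let rows_per_thread = (y.len() + n_threads - 1) / n_threads;
    std::thread::scope(|s| {
        for (chunk_idx, y_chunk) in y.chunks_mut(rows_per_thread).enumerate() {
            let start = chunk_idx * rows_per_thread;
            s.spawn(move || {
                for (j, yi) in y_chunk.iter_mut().enumerate() {
                    let i = start + j;
                    *yi = gemv_row(&a[i * n..(i + 1) * n], x);
                }
            });
        }
    });
}

fn main() {
    let (m, n) = (3200usize, 8640usize); // the shape from the benchmark item above
    let a = vec![1.0f32; m * n];
    let x = vec![1.0f32; n];
    let mut y = vec![0.0f32; m];

    let t = Instant::now();
    gemv_rayon(&a, &x, &mut y, n);
    println!("rayon:          {:?}", t.elapsed());

    let t = Instant::now();
    gemv_threads(&a, &x, &mut y, n, 2);
    println!("scoped threads: {:?}", t.elapsed());
}
```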