- add basic WGPU tensor support
- factor out TensorArithmetic
- revise the dot product attention primitive
- reference mlx: https://github.com/simonw/llm-mlx-llama/blob/main/llm_mlx_llama.py#L81
- learn the KV cache layout
- store the meta buffers in a uniform buffer
- implement a better GEMV
- refactor the buf code
- q8_0 dot product (a scalar sketch follows this list)
- compare the q8_0 matmul FLOPS between ggml and crabml
- align the dot product performance: try manual NEON instructions (a sketch follows this list)
- find the source of the performance difference between ggml and crabml
- add per-token performance metrics
- add a b_offset parameter to the dot product (covered in the q8_0 sketch below)
- ggml matmul: ~65 ms per token, crabml: ~75 ms per token, about 10 ms slower
- add GEMV benchmark on 3200 x 8640, then 8640 x 3200
- tiling seems to have no effect at all; maybe the performance gap comes from memory reuse?
- setting the thread count to 2 gives the best performance, which is strange
- re-arrange the KV cache memory layout to leverage dense dot products (a layout sketch follows this list)
- benchmark rayon against a vanilla thread pool on GEMV (a benchmark sketch follows this list)
- q8 quantization on WebGPU
- add dequantize support to CpuTensor
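
A minimal scalar sketch of the q8_0 dot product with a b_offset, assuming a hypothetical `BlockQ8_0` type (32 signed 8-bit quants sharing one scale; ggml stores the scale as f16, f32 is used here to keep the sketch dependency-free). The names and signature are illustrative, not crabml's actual API.

```rust
/// Hypothetical q8_0 block: 32 signed 8-bit quants sharing one per-block scale.
const Q8_0_BLOCK_SIZE: usize = 32;

#[derive(Clone, Copy)]
pub struct BlockQ8_0 {
    pub d: f32,                    // per-block scale (f16 in ggml)
    pub qs: [i8; Q8_0_BLOCK_SIZE], // quantized values
}

/// Scalar q8_0 dot product. `b_offset` is a starting block index into `b`, so a
/// row of a larger quantized matrix can be addressed without slicing or copying.
pub fn vec_dot_q8_0(a: &[BlockQ8_0], b: &[BlockQ8_0], b_offset: usize) -> f32 {
    let mut sum = 0.0f32;
    for (i, ba) in a.iter().enumerate() {
        let bb = &b[b_offset + i];
        // accumulate the integer products in i32, apply both scales once per block
        let mut acc = 0i32;
        for j in 0..Q8_0_BLOCK_SIZE {
            acc += ba.qs[j] as i32 * bb.qs[j] as i32;
        }
        sum += ba.d * bb.d * acc as f32;
    }
    sum
}
```

Calling it with `b_offset = row * blocks_per_row` addresses one row of a quantized weight matrix directly, which is the motivation for the b_offset item above.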
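
A sketch of what "manual NEON instructions" could look like for the same kernel on aarch64, reusing the hypothetical `BlockQ8_0` above and only `std::arch::aarch64` intrinsics; the real kernel may be organized differently.

```rust
/// NEON variant of the q8_0 block dot product (aarch64 only).
#[cfg(target_arch = "aarch64")]
pub fn vec_dot_q8_0_neon(a: &[BlockQ8_0], b: &[BlockQ8_0]) -> f32 {
    use std::arch::aarch64::*;
    let mut sum = 0.0f32;
    for (ba, bb) in a.iter().zip(b) {
        // SAFETY: NEON is mandatory on aarch64 and each qs array holds exactly 32 bytes.
        let block_dot = unsafe {
            let a0 = vld1q_s8(ba.qs.as_ptr());
            let a1 = vld1q_s8(ba.qs.as_ptr().add(16));
            let b0 = vld1q_s8(bb.qs.as_ptr());
            let b1 = vld1q_s8(bb.qs.as_ptr().add(16));
            // widening i8 x i8 -> i16 multiplies, pairwise-accumulated into i32 lanes
            let mut acc = vdupq_n_s32(0);
            acc = vpadalq_s16(acc, vmull_s8(vget_low_s8(a0), vget_low_s8(b0)));
            acc = vpadalq_s16(acc, vmull_high_s8(a0, b0));
            acc = vpadalq_s16(acc, vmull_s8(vget_low_s8(a1), vget_low_s8(b1)));
            acc = vpadalq_s16(acc, vmull_high_s8(a1, b1));
            // horizontal sum of the four i32 lanes
            vaddvq_s32(acc)
        };
        sum += ba.d * bb.d * block_dot as f32;
    }
    sum
}
```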
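
For the KV cache rearrangement item, a sketch of one possible head-major layout: each head's key history becomes one contiguous strip, so attention scores for a head are plain dense dot products over contiguous rows instead of strided gathers. The struct and method names are hypothetical.

```rust
/// Hypothetical head-major key cache with layout [n_head][max_seq][head_dim].
pub struct HeadMajorKeyCache {
    n_head: usize,
    head_dim: usize,
    max_seq: usize,
    seq_len: usize,
    k: Vec<f32>,
}

impl HeadMajorKeyCache {
    pub fn new(n_head: usize, head_dim: usize, max_seq: usize) -> Self {
        let k = vec![0.0; n_head * max_seq * head_dim];
        Self { n_head, head_dim, max_seq, seq_len: 0, k }
    }

    /// Append one token's keys, given token-major as [n_head * head_dim].
    pub fn append(&mut self, k_token: &[f32]) {
        assert_eq!(k_token.len(), self.n_head * self.head_dim);
        for h in 0..self.n_head {
            let dst = (h * self.max_seq + self.seq_len) * self.head_dim;
            let src = h * self.head_dim;
            self.k[dst..dst + self.head_dim]
                .copy_from_slice(&k_token[src..src + self.head_dim]);
        }
        self.seq_len += 1;
    }

    /// Attention scores for one head: `q_h` is [head_dim], output is [seq_len].
    /// Each key row is contiguous, so this is a dense dot product per position.
    pub fn scores(&self, h: usize, q_h: &[f32]) -> Vec<f32> {
        let base = h * self.max_seq * self.head_dim;
        (0..self.seq_len)
            .map(|t| {
                let row = &self.k[base + t * self.head_dim..base + (t + 1) * self.head_dim];
                row.iter().zip(q_h).map(|(k, q)| k * q).sum()
            })
            .collect()
    }
}
```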
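
For the rayon vs. vanilla thread pool item, a rough benchmark sketch on the 3200 x 8640 GEMV shape mentioned above. It assumes `rayon` as a dependency, uses naive wall-clock timing (the first rayon call also pays pool start-up; a harness like criterion would give steadier numbers), and pins the manual version to 2 threads to match the observation above.

```rust
use rayon::prelude::*; // assumed dependency: rayon = "1"
use std::time::Instant;

/// Dot product of one row of a row-major matrix with x.
fn gemv_row(a_row: &[f32], x: &[f32]) -> f32 {
    a_row.iter().zip(x).map(|(a, b)| a * b).sum()
}

/// y = A * x using rayon's work-stealing pool, one task per output row.
fn gemv_rayon(a: &[f32], x: &[f32], y: &mut [f32], n: usize) {
    y.par_iter_mut().enumerate().for_each(|(i, yi)| {
        *yi = gemv_row(&a[i * n..(i + 1) * n], x);
    });
}

/// y = A * x using plain scoped threads, one contiguous row chunk per thread.
fn gemv_threads(a: &[f32], x: &[f32], y: &mut [f32], n: usize, n_threads: usize) {
    let rows_per_thread = (y.len() + n_threads - 1) / n_threads;
    std::thread::scope(|s| {
        for (chunk_idx, y_chunk) in y.chunks_mut(rows_per_thread).enumerate() {
            let start = chunk_idx * rows_per_thread;
            s.spawn(move || {
                for (j, yi) in y_chunk.iter_mut().enumerate() {
                    let i = start + j;
                    *yi = gemv_row(&a[i * n..(i + 1) * n], x);
                }
            });
        }
    });
}

fn main() {
    let (m, n) = (3200usize, 8640usize); // the shape from the benchmark item above
    let a = vec![1.0f32; m * n];
    let x = vec![1.0f32; n];
    let mut y = vec![0.0f32; m];

    let t = Instant::now();
    gemv_rayon(&a, &x, &mut y, n);
    println!("rayon:          {:?}", t.elapsed());

    let t = Instant::now();
    gemv_threads(&a, &x, &mut y, n, 2);
    println!("scoped threads: {:?}", t.elapsed());
}
```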