Faster CUDA prompt speeds #925

EricLBuehler · 2024-11-21T20:10:19Z

I measure a +4% prompt processing (PP) speedup for Llama 3.2 3B (807 T/s -> 840 T/s, 42 prompt tokens) without quantization.
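As a quick sanity check of the numbers above, here is a minimal sketch of the arithmetic. The helper names `relative_speedup` and `prompt_time_ms` are illustrative and are not part of this PR; the only inputs taken from the comment are 807 T/s, 840 T/s, and 42 prompt tokens.

```python
def relative_speedup(before_tps: float, after_tps: float) -> float:
    """Fractional throughput gain going from before_tps to after_tps."""
    return (after_tps - before_tps) / before_tps


def prompt_time_ms(prompt_tokens: int, tps: float) -> float:
    """Milliseconds needed to process prompt_tokens at tps tokens per second."""
    return 1000.0 * prompt_tokens / tps


# Figures reported in the comment above.
before_tps, after_tps, prompt_tokens = 807.0, 840.0, 42

gain = relative_speedup(before_tps, after_tps)
print(f"throughput gain: {gain:.1%}")
print(f"prompt time before: {prompt_time_ms(prompt_tokens, before_tps):.2f} ms")
print(f"prompt time after:  {prompt_time_ms(prompt_tokens, after_tps):.2f} ms")
```

With a prompt this short, the absolute difference is only a couple of milliseconds, so the relative figure is the more informative one.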

github-actions · 2024-11-21T20:27:51Z

Code Metrics Report

===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                2           35           28            0            7
 Dockerfile              1           41           22           10            9
 Happy                   1          442          369            0           73
 JSON                   12          105          104            0            1
 Python                 53         2274         1949           63          262
 Shell                   1           57           22           18           17
 TOML                   18          583          520            2           61
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       4            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          196          169            1           26
 (Total)                            273          201           32           40
-------------------------------------------------------------------------------
 Markdown               40         3009            0         2286          723
 |- BASH                 6          101           98            0            3
 |- JSON                 1           12           12            0            0
 |- Python               6          114          102            0           12
 |- Rust                10          361          306            0           55
 |- TOML                 2           75           63            0           12
 (Total)                           3672          581         2286          805
-------------------------------------------------------------------------------
 Rust                  280        84887        76163         1764         6960
 |- Markdown           136         1435           25         1306          104
 (Total)                          86322        76188         3070         7064
===============================================================================
 Total                 415        91454        79196         4145         8113
===============================================================================

EricLBuehler added 3 commits November 21, 2024 14:37

Faster cuda attnmask impl

d93ff73

Faster cuda pagedattn pp speeds

1fdc4d8

Clippy and fmt

bbc886e

EricLBuehler added the optimization label Nov 21, 2024

EricLBuehler merged commit 2b20951 into master Nov 21, 2024
12 checks passed

EricLBuehler deleted the faster_cuda_attnmask branch November 21, 2024 23:21
