Will there be newer technologies in upcoming series? #55
Comments
Hi @jdgh000,

I see your point: this field evolves pretty fast, and new tricks and tweaks are constantly being released. I wouldn't go as far as saying that attention is outdated, though. The concept of attention remains as relevant as ever; what FlashAttention brings to the table is "just" doing the same thing in a more memory-efficient way. At its core, FlashAttention works by implementing an online (or tiled) softmax (the bottleneck of regular attention), allocating and populating a single variable/structure in memory. I hadn't heard of splitK before but, from what I saw, it seems pretty low-level.

Right now, I'm working on a new, short book focused on the engineering topics one must know in order to fine-tune LLMs: quantization, low-rank adapters, and, of course, Flash Attention. You can learn more about this upcoming book here: https://leanpub.com/finetuning. I believe you'll like its TOC, as it goes along the lines of your suggestion :-) You can also sign up there to get notified when it's published (and get a coupon too!)

Best,
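To make the online (tiled) softmax idea concrete, here is a minimal NumPy sketch for a single query vector. The function name, block size, and shapes are illustrative assumptions, not code from the book or from the FlashAttention repo; the real kernel does this per tile on-chip, but the running-statistics trick is the same:

```python
import numpy as np

def online_softmax_attention(q, K, V, block_size=64):
    """Attention for one query vector q against keys K and values V,
    processing one block of keys at a time (online / tiled softmax).
    Only running statistics (max, normalizer, weighted sum) are kept,
    never the full score vector."""
    d = q.shape[-1]
    m = -np.inf                                   # running max of scores seen so far
    l = 0.0                                       # running sum of exp(score - m)
    acc = np.zeros(V.shape[-1], dtype=np.float64) # running weighted sum of values

    for start in range(0, K.shape[0], block_size):
        K_blk = K[start:start + block_size]
        V_blk = V[start:start + block_size]
        s = K_blk @ q / np.sqrt(d)      # scores for this block only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)       # rescale old statistics to the new max
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V_blk
        m = m_new

    return acc / l
```

On a small random example this matches the naive computation, softmax of `K @ q / sqrt(d)` followed by a weighted sum of `V`, up to floating-point error.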
Yeah, those are all good points. I don't think attention is outdated either; it's more that it's the foundation that all the newer techniques are built upon.
I see SBS has 3 volumes (which I came to know by searching through Amazon). So far I like the coverage of attention and transformers, but we are increasingly working with even newer techniques:
- flash attention - increasingly used to speed up inference/training by reducing memory traffic and fusing various GPU tasks - https://github.com/Dao-AILab/flash-attention. In that sense, plain attention is already getting outdated, as flash attention is much more performant under the same load.
- splitK - I have a more obscure understanding of this one; I believe it is sectioning the keys across GPU units to parallelize the computation further (a rough sketch of the idea follows below).
While those projects provide some demo code to showcase the benefit, it would be nice if the SBS examples also adopted these newer trends for continuity, if newer volumes are planned.
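A rough sketch of the split-K idea as described above, in NumPy: the keys/values are split into chunks, partial attention statistics are computed independently per chunk (a sequential loop here stands in for the parallel GPU units), and the partials are then reduced into the exact result. The helper names and the number of splits are hypothetical, not taken from the flash-attention repo:

```python
import numpy as np

def partial_attention(q, K_chunk, V_chunk):
    """Attention statistics for one chunk of keys/values.
    Each call is independent, so chunks could run on different GPU units."""
    d = q.shape[-1]
    s = K_chunk @ q / np.sqrt(d)
    m = s.max()                      # chunk-local max for numerical stability
    p = np.exp(s - m)
    return m, p.sum(), p @ V_chunk   # (max, normalizer, weighted value sum)

def splitk_attention(q, K, V, num_splits=4):
    """Split the keys into num_splits chunks, compute partials independently,
    then reduce them into the exact softmax-weighted result."""
    chunks = zip(np.array_split(K, num_splits), np.array_split(V, num_splits))
    partials = [partial_attention(q, Kc, Vc) for Kc, Vc in chunks]

    m_global = max(m for m, _, _ in partials)
    l = sum(l_c * np.exp(m - m_global) for m, l_c, _ in partials)
    acc = sum(acc_c * np.exp(m - m_global) for m, _, acc_c in partials)
    return acc / l
```

The reduction step rescales each chunk's partial normalizer and weighted sum to a common maximum, which is what makes the chunked computation numerically equivalent to doing the softmax over all keys at once.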