will there be newer technologies in upcoming series? #55

Open
jdgh000 opened this issue Sep 26, 2024 · 2 comments

Comments

jdgh000 commented Sep 26, 2024

I found out by searching Amazon that SBS has three volumes. So far I like how attention and transformers are covered, but we are increasingly working with even newer techniques, e.g.:
- Flash attention: increasingly used to speed up inference and training by reducing memory traffic and fusing various GPU operations (https://github.com/Dao-AILab/flash-attention). Because of that, plain attention is already starting to look outdated, as flash attention performs much better on the same workload.
- splitK: my understanding of this one is hazier; I believe it partitions the keys across GPU units to parallelize the computation further.

Those projects provide some demo code to showcase the benefits, but it would be nice if the SBS examples also adopted these newer trends for continuity, if newer volumes are planned.

Owner

dvgodoy commented Sep 26, 2024

Hi @jdgh000

I see your point, this field evolves pretty fast, and there are new tricks and tweaks being constantly released.

I wouldn't go as far as saying that attention is outdated, though. The concept of attention remains as relevant as ever; what FlashAttention brings to the table is "just" doing the same thing in a more memory-efficient way. At its core, FlashAttention works by implementing online (or tiled) softmax, so the scores are processed block by block and accumulated into a single output structure in memory, instead of running softmax over a fully materialized score matrix (the bottleneck of regular attention).
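To make the "online (or tiled) softmax" idea concrete, here is a minimal pure-Python sketch for a single query vector. It is not FlashAttention's actual kernel (which is fused GPU code); the function names and `tile_size` parameter are hypothetical and only illustrate how a running maximum, a running sum, and a running output accumulator give the same result as regular attention without ever holding all the scores at once.

```python
import math

def attention_reference(q, keys, values):
    """Regular attention for one query: all scores are computed up front."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def attention_online(q, keys, values, tile_size=2):
    """Same result, visiting keys/values one tile at a time (illustrative only)."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # softmax denominator, relative to running_max
    acc = [0.0] * dim         # unnormalized output, relative to running_max
    for start in range(0, len(keys), tile_size):
        k_tile = keys[start:start + tile_size]
        v_tile = values[start:start + tile_size]
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum
        # (exp(-inf) == 0.0, so the first tile starts from zero).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]

# Made-up toy inputs: 5 keys/values of dimension 2.
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [-1.0, 0.5], [0.3, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
out_ref = attention_reference(q, keys, values)
out_online = attention_online(q, keys, values, tile_size=2)
```

The two outputs agree up to floating-point rounding, which is the sense in which the tiled version is "the same thing", only computed with less intermediate memory.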
Apart from that, there are other techniques, such as KV caching and Multi-Query Attention (MQA), that also decrease memory usage and computation by using fewer, pre-computed keys and values, so the bulk of the attention work relies mostly on the queries. There are also sliding-window attention and sparse attention, which reduce the number of tokens over which attention is computed, thus allowing for longer contexts.
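Two of these ideas can be illustrated with simple arithmetic. The sketch below is only an illustration: the helper names and the model dimensions (32 layers, 32 heads, head size 128, 4096 tokens, 2 bytes per value) are hypothetical and not taken from any particular model. It estimates the size of a KV cache with one key/value head per query head versus a single shared key/value head (MQA), and counts how many query-key pairs a causal sliding window keeps.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys AND values (hence the factor 2) for one sequence."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j:
    causal (j <= i) and within the last `window` tokens, itself included."""
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]

# Hypothetical model dimensions, chosen only to make the numbers round.
mha_bytes = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096)
mqa_bytes = kv_cache_bytes(layers=32, kv_heads=1, head_dim=128, seq_len=4096)
print(mha_bytes // 2**20, "MiB with one KV head per query head")  # 2048 MiB
print(mqa_bytes // 2**20, "MiB with a single shared KV head")     # 64 MiB

mask = sliding_window_mask(seq_len=6, window=3)
causal_pairs = 6 * 7 // 2                         # full causal attention: 21 pairs
windowed_pairs = sum(sum(row) for row in mask)    # sliding window: 15 pairs
print(causal_pairs, windowed_pairs)
```

With full causal attention the number of pairs grows quadratically with sequence length, while with a fixed window it grows linearly, which is why windowing helps with longer contexts.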
All these things, however, are just tweaks to how attention is computed or to which tokens it is applied; the fundamental concept remains unchanged.

I haven't heard of splitK before but, from what I saw, it seems pretty low-level.
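For context, and assuming the term refers to the split-K scheme used in GPU matrix-multiplication kernels (an assumption; this thread does not define it), the idea is to partition the shared inner (reduction) dimension of a matrix product into chunks, compute a partial product per chunk independently, and then sum the partial results. A minimal pure-Python sketch with hypothetical function names, where each chunk stands in for the work one GPU unit would do in parallel:

```python
def matmul(a, b):
    """Plain matrix product: a is rows x inner, b is inner x cols."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def matmul_split_k(a, b, num_splits=2):
    """Split the inner (K) dimension into chunks, then reduce the partial products."""
    rows, inner, cols = len(a), len(b), len(b[0])
    chunk = -(-inner // num_splits)  # ceiling division
    partials = []
    for start in range(0, inner, chunk):
        stop = min(start + chunk, inner)
        # Each partial product only touches inner indices start..stop-1,
        # so the chunks are independent of one another.
        partials.append([[sum(a[i][k] * b[k][j] for k in range(start, stop))
                          for j in range(cols)]
                         for i in range(rows)])
    # Final reduction: element-wise sum of the partial products.
    return [[sum(p[i][j] for p in partials) for j in range(cols)]
            for i in range(rows)]

# Made-up integer inputs so the comparison is exact.
a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [2, 2], [3, -1]]
full = matmul(a, b)
split = matmul_split_k(a, b, num_splits=2)
```

This tends to matter when the output matrix is small but the inner dimension is long, since splitting the reduction exposes more parallel work.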

Right now, I'm working on a new, short book focused on the engineering topics one must know in order to fine-tune LLMs: quantization, low-rank adapters, and, of course, Flash Attention. You can learn more about this upcoming book here: https://leanpub.com/finetuning

I believe you'll like its TOC, as it goes along the lines of your suggestion :-)

You can also sign up there to get notified when it's published (and get a coupon too!)

Best,
Daniel

Author

jdgh000 commented Oct 12, 2024

Yeah, those are all good points. I don't think attention is outdated; it is more like the fundamentals that all the newer techniques are built upon.