Continuously updated with my favourite and fastest LLM inference techniques, all tested on the Leonardo supercomputer.

A six years' rage repository, where I explain easy-to-understand inference techniques and write the code to execute them on European supercomputers like JUPITER, JUWELS, and Leonardo.

What can you find here?

  1. The fastest concoction of inference tips and techniques, all well documented
  2. Code to run them on Google Colab
  3. Code to run them on European pre-exascale and exascale supercomputers

These are tested and well documented, and I plan to maintain this repository religiously. Filenames are self-explanatory; if you see a flag like leo or juwel in a filename, it means the code is supercomputer-compatible. I include my Slurm scripts as well, so have fun.

For the Colab code: https://colab.research.google.com/drive/17U4lj2YLNH0GdxR9iovBnHdONB4QEh_a?usp=sharing

KVPress: the fastest method according to my tests on the Italian supercomputer Leonardo, run on a single 64 GB A100 card with 16 CPUs.
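
Below is a minimal sketch of how KVPress is typically used through NVIDIA's kvpress library and its Hugging Face pipeline integration. The model name, press type, and compression ratio are placeholders, not the exact configuration used on Leonardo.

```python
# Minimal KVPress sketch (assumes NVIDIA's kvpress package and transformers are installed).
# Model name, press choice, and compression ratio are placeholders, not this repo's settings.
from transformers import pipeline
from kvpress import ExpectedAttentionPress

pipe = pipeline(
    "kv-press-text-generation",  # custom pipeline registered by kvpress
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    device="cuda:0",
    torch_dtype="auto",
)

context = "KVPress compresses the KV cache during prefill to speed up long-context inference."
question = "What does KVPress compress?"

# The press object decides which KV-cache entries to drop; 0.5 keeps roughly half the cache.
press = ExpectedAttentionPress(compression_ratio=0.5)
answer = pipe(context, question=question, press=press)["answer"]
print(answer)
```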

Prompt Caching: relatively good in my tests, but it certainly does not perform on par with KVPress.
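
As a rough illustration of the idea, here is a hedged sketch of prompt caching with Hugging Face transformers: the KV cache for a shared prompt prefix is computed once and then reused for every request. The model name and prompts are placeholders, and this is not necessarily the exact code in this repository.

```python
# Prompt-caching sketch: pre-compute the KV cache for a shared prefix once, reuse it per request.
# Model name and prompts are placeholders; requires a recent transformers version with DynamicCache.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="cuda")

# 1) Prefill the cache once for the shared system prompt.
prefix = "You are a helpful assistant. Answer concisely.\n"
prefix_ids = tok(prefix, return_tensors="pt").input_ids.to(model.device)
prompt_cache = DynamicCache()
with torch.no_grad():
    prompt_cache = model(prefix_ids, past_key_values=prompt_cache).past_key_values

# 2) For each request, reuse a copy of the cached prefix instead of re-encoding it.
full_prompt = prefix + "What is a KV cache?"
inputs = tok(full_prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, past_key_values=copy.deepcopy(prompt_cache), max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```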

Graph Inference: one of my favourite methods. It uses torch.compile() to speed up inference while use_cache is turned on (use_cache enables KV caching). While it shows fast results on Google Colab, it was significantly slower on Leonardo.
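
A minimal sketch of this pattern with Hugging Face transformers is shown below: the forward pass is compiled with torch.compile() and generation runs with the KV cache enabled. The model name, the static-cache setting, and the compile mode are assumptions rather than the exact setup benchmarked here.

```python
# Graph-inference sketch: compile the forward pass and generate with the KV cache enabled.
# Model name, cache_implementation, and compile mode are assumptions, not this repo's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="cuda")

# A static KV cache keeps tensor shapes fixed, which lets torch.compile capture the decode step.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tok("The fastest way to serve an LLM is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64, use_cache=True)  # first calls are slow (compilation)
print(tok.decode(out[0], skip_special_tokens=True))
```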

Prompt Lookup: again one of my favourite methods, and it gave the fastest inference time in my tests at 1.21 seconds.
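
Prompt lookup decoding drafts candidate tokens by matching n-grams from the prompt itself and then verifies them with the model; in transformers it is exposed through the prompt_lookup_num_tokens argument of generate(). The sketch below uses a placeholder model and prompt, not the exact benchmark setup.

```python
# Prompt-lookup decoding sketch: draft tokens are copied from n-gram matches in the prompt
# and verified by the model. Model name and prompt are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="cuda")

prompt = (
    "Summarise the following text in one sentence:\n"
    "KV caching stores attention keys and values so they are not recomputed at every step.\n"
)
inputs = tok(prompt, return_tensors="pt").to(model.device)

# prompt_lookup_num_tokens controls how many candidate tokens are drafted per verification step.
out = model.generate(**inputs, max_new_tokens=64, prompt_lookup_num_tokens=10)
print(tok.decode(out[0], skip_special_tokens=True))
```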

I am excited to share more methods as I find them, including batch inference using the Ray library and continuous batching.
