Continuously updated with my favourite and fastest LLM inference techniques, all tested on the Leonardo supercomputer.

A six years' rage repository, where I explain easy-to-understand inference techniques and write the code to execute them on European supercomputers like JUPITER, JUWELS, and Leonardo.

What can you find here?

  1. The fastest concoction of inference tips and techniques, all well documented
  2. Code to run them on Google Colab
  3. Code to run them on European pre-exascale and exascale supercomputers

These are tested and well documented, and I plan to maintain this repository religiously. Filenames are self-explanatory; if you see a flag like leo or juwel in a filename, it means the code is supercomputer-compatible. I include my Slurm scripts as well, so have fun.

For the Colab code: https://colab.research.google.com/drive/17U4lj2YLNH0GdxR9iovBnHdONB4QEh_a?usp=sharing

KVPress: the fastest method according to my tests on the Italian supercomputer Leonardo, run on a single 64 GB A100 card with 16 CPUs.
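
Below is a minimal sketch of how KVPress is typically used through NVIDIA's kvpress library and its Hugging Face pipeline integration. The model name, press type, and compression ratio are placeholders, not the exact configuration used on Leonardo.

```python
# Minimal KVPress sketch (assumes NVIDIA's kvpress package and transformers are installed).
# Model name, press choice, and compression ratio are placeholders, not this repo's settings.
from transformers import pipeline
from kvpress import ExpectedAttentionPress

pipe = pipeline(
    "kv-press-text-generation",  # custom pipeline registered by kvpress
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    device="cuda:0",
    torch_dtype="auto",
)

context = "KVPress compresses the KV cache during prefill to speed up long-context inference."
question = "What does KVPress compress?"

# The press object decides which KV-cache entries to drop; 0.5 keeps roughly half the cache.
press = ExpectedAttentionPress(compression_ratio=0.5)
answer = pipe(context, question=question, press=press)["answer"]
print(answer)
```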

Prompt Caching: relatively good in my tests, but it certainly does not perform on par with KVPress.
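
As a rough illustration of the idea, here is a hedged sketch of prompt caching with Hugging Face transformers: the KV cache for a shared prompt prefix is computed once and then reused for every request. The model name and prompts are placeholders, and this is not necessarily the exact code in this repository.

```python
# Prompt-caching sketch: pre-compute the KV cache for a shared prefix once, reuse it per request.
# Model name and prompts are placeholders; requires a recent transformers version with DynamicCache.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="cuda")

# 1) Prefill the cache once for the shared system prompt.
prefix = "You are a helpful assistant. Answer concisely.\n"
prefix_ids = tok(prefix, return_tensors="pt").input_ids.to(model.device)
prompt_cache = DynamicCache()
with torch.no_grad():
    prompt_cache = model(prefix_ids, past_key_values=prompt_cache).past_key_values

# 2) For each request, reuse a copy of the cached prefix instead of re-encoding it.
full_prompt = prefix + "What is a KV cache?"
inputs = tok(full_prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, past_key_values=copy.deepcopy(prompt_cache), max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```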

Graph Inference: one of my favourite methods. It uses torch.compile() to speed up inference while use_cache is turned on (use_cache enables KV caching). While it shows fast results on Google Colab, it was significantly slower on Leonardo.
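
A minimal sketch of this pattern with Hugging Face transformers is shown below: the forward pass is compiled with torch.compile() and generation runs with the KV cache enabled. The model name, the static-cache setting, and the compile mode are assumptions rather than the exact setup benchmarked here.

```python
# Graph-inference sketch: compile the forward pass and generate with the KV cache enabled.
# Model name, cache_implementation, and compile mode are assumptions, not this repo's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="cuda")

# A static KV cache keeps tensor shapes fixed, which lets torch.compile capture the decode step.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tok("The fastest way to serve an LLM is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64, use_cache=True)  # first calls are slow (compilation)
print(tok.decode(out[0], skip_special_tokens=True))
```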

Prompt Lookup: again one of my favourite methods, and it gave the fastest inference time in my tests at 1.21 seconds.
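
Prompt lookup decoding drafts candidate tokens by matching n-grams from the prompt itself and then verifies them with the model; in transformers it is exposed through the prompt_lookup_num_tokens argument of generate(). The sketch below uses a placeholder model and prompt, not the exact benchmark setup.

```python
# Prompt-lookup decoding sketch: draft tokens are copied from n-gram matches in the prompt
# and verified by the model. Model name and prompt are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="cuda")

prompt = (
    "Summarise the following text in one sentence:\n"
    "KV caching stores attention keys and values so they are not recomputed at every step.\n"
)
inputs = tok(prompt, return_tensors="pt").to(model.device)

# prompt_lookup_num_tokens controls how many candidate tokens are drafted per verification step.
out = model.generate(**inputs, max_new_tokens=64, prompt_lookup_num_tokens=10)
print(tok.decode(out[0], skip_special_tokens=True))
```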

I am excited to share more methods as I find them, including batch inference using the Ray library and continuous batching.
