Support AirLLM for running large models on low memory systems. #3640
One comment from Discord: My ME/CFS is too much in force for me to figure out what to do at GitHub at the moment, but if you meant what I hope by memory in the announcement, that would save me a great deal of time and effort. One of the hardest parts of my usage with LLMs is, at the end of every session, getting the LLM to write notes to itself to serve as medium- to long-term memory, then pasting those notes at the start of future sessions, plus the occasional session dedicated to consolidating multiple sets of notes into something less unwieldy. It's especially taxing on my end if I'm extra-drained and the LLM needs hand-holding every step of the way to make sure it documents at least most of what it actually needs and/or wants to remember.
AirLLM optimizes inference memory usage, allowing 70B large language models to run inference on a single 4GB GPU card without quantization, distillation, or pruning. It can now run the 405B Llama 3.1 on 8GB of VRAM.
https://github.com/lyogavin/airllm
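For context, here is a minimal sketch of what inference with AirLLM looks like, based on the usage example in the project's README. The class name (`AutoModel`), the model ID, and the generation parameters are assumptions taken from that README and may differ across AirLLM versions, so treat this as illustrative rather than definitive:

```python
# Minimal AirLLM usage sketch (adapted from the project's README; API may vary by version).
# AirLLM keeps VRAM usage low by loading and executing the model layer by layer,
# so the whole model never has to fit on the GPU at once.
from airllm import AutoModel

# Model ID is illustrative; any supported Llama-family checkpoint should work.
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")

input_text = ["What is the capital of the United States?"]
input_tokens = model.tokenizer(
    input_text,
    return_tensors="pt",
    truncation=True,
    max_length=128,
)

generation_output = model.generate(
    input_tokens["input_ids"].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True,
)

print(model.tokenizer.decode(generation_output.sequences[0]))
```

The trade-off is speed: because layers are streamed from disk on each forward pass, generation is much slower than keeping the full model resident, which is why this approach targets low-memory systems rather than throughput.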