This project collects and curates research papers on Large Language Model (LLM) inference acceleration. Through it, I hope to summarize and share the knowledge I gain during my self-learning journey.
The list is updated regularly. Contributions are welcome, so feel free to star the repo or submit a pull request.
Design Rules:
- Keywords: frequent academic terms, to help with paper search.
- Paper intro: a quick-start summary of each paper's core idea (with key figures) and useful extensions. See [intro of SmoothQuant] for an example.
- Citation: citation counts are sourced from Google Scholar.
Citation counts were last updated on July 28, 2024.
Keywords | Title | Paper | Affiliation | Date | Citation |
---|---|---|---|---|---|
LLM Inference | Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems | [pdf] [intro] | CMU | 2023.12 | 36 |
LLM Inference | A Survey on Efficient Inference for Large Language Models | [pdf] [intro] | THU, Infinigence-AI, SJTU, etc. | 2024.04 | 14 |
Citation counts were last updated on October 20, 2024.
Upcoming topics: FlashDecoding, QLoRA.
Keywords | Title | Paper | Affiliation | Date | Citation |
---|---|---|---|---|---|
MQA | Fast Transformer Decoding: One Write-Head is All You Need | [pdf] | Google | 2019.11 | 252 |
GQA | GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints [EMNLP 2023] | [pdf] [intro] | Google | 2023.05 | 335 |
ALiBi | Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation [ICLR 2022] | [pdf] [intro] | UW, Facebook, Allen Institute | 2021.08 | 523 |
RoPE | RoFormer: Enhanced Transformer with Rotary Position Embedding [Neurocomputing 2024] | [pdf] [intro] (sketch below) | Zhuiyi Technology | 2021.04 | 1337 |
CoPE | Contextual Position Encoding: Learning to Count What's Important | [pdf] [intro] | Meta | 2024.05 | 12 |
FlashAttention | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness [NeurIPS 2022] | [pdf] [intro] | Stanford, University at Buffalo | 2022.05 | 1271 |
FlashAttention2 | FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | [pdf] [intro] | Princeton, Stanford | 2023.07 | 475 |
Longformer | Longformer: The Long-Document Transformer | [pdf] [intro] | Allen Institute | 2020.04 | 4120 |
Mistral | Mistral 7B | [pdf] [intro] | Mistral | 2023.10 | 726 |
StreamingLLM, Attention Sinks | Efficient Streaming Language Models with Attention Sinks [ICLR 2024] | [pdf] [intro] | MIT, Meta, CMU, etc. | 2023.09 | 249 |
LoRA | LoRA: Low-Rank Adaptation of Large Language Models [ICLR 2022] | [pdf] [intro] | Microsoft | 2021.06 | 7407 |
LISA | LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning | [pdf] [intro] | HKUST, UIUC | 2024.03 | 5 |
HydraLoRA | HydraLoRA: An Asymmetric LoRA Architecture for Efficient Fine-Tuning [NeurIPS 2024] | [pdf] [intro] | University of Macau, University of Texas at Austin, Cambridge | 2024.04 | 2 |
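To make the RoPE row above concrete, here is a minimal NumPy sketch of rotary position embedding: each (even, odd) feature pair of a query or key vector is rotated by an angle proportional to its position, so attention scores depend only on relative position. The function name, shapes, and the base of 10000 follow the paper's convention; everything else is illustrative, not taken from any particular library.

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate each (even, odd) feature pair of x by a position-dependent angle.

    x: (seq_len, head_dim) query or key vectors; head_dim must be even.
    """
    seq_len, dim = x.shape
    # One frequency per feature pair: theta_i = base^(-2i/dim).
    freqs = base ** (-np.arange(0, dim, 2) / dim)      # (dim/2,)
    angles = np.outer(np.arange(seq_len), freqs)       # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin          # 2-D rotation per pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# After rotation, <rope(q)[m], rope(k)[n]> depends only on the offset m - n.
q = rope(np.random.randn(8, 64))
```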
Citation counts were last updated on July 5, 2024.
Upcoming topics: QLoRA.
Keywords | Title | Paper | Affiliation | Date | Citation |
---|---|---|---|---|---|
LLM.int8 | LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale [NeurIPS 2022] | [pdf] [intro] | UW, Facebook, Hugging Face, etc. | 2022.08 | 554 |
SmoothQuant | SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models [ICML 2023] | [pdf] [intro] (sketch below) | MIT, NVIDIA | 2022.11 | 382 |
AWQ | AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration [MLSys 2024] | [pdf] [intro] | MIT, SJTU, NVIDIA, etc. | 2023.06 | 235 |
OneBit | OneBit: Towards Extremely Low-bit Large Language Models | [pdf] [intro] | THU, HIT | 2024.02 | 4 |
BitNet b1.58 | The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits | [pdf] [intro] | Microsoft, UCAS | 2024.02 | 39 |
Citation counts were last updated on July 25, 2024.
Keywords | Title | Paper | Affiliation | Date | Citation |
---|---|---|---|---|---|
vLLM, PagedAttention | Efficient Memory Management for Large Language Model Serving with PagedAttention [SOSP 2023] | [pdf] [intro] | UC Berkeley, Stanford, UC San Diego | 2023.09 | 496 |
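A toy sketch of the PagedAttention row above: KV-cache entries live in fixed-size blocks drawn from a shared pool, and each sequence keeps a block table mapping its logical blocks to physical ones, so memory is allocated on demand rather than reserved contiguously. This is a simplified model with made-up class names, not vLLM's actual API.

```python
class KVBlockPool:
    """A shared pool of fixed-size physical KV-cache blocks."""
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # physical block ids

    def alloc(self) -> int:
        return self.free.pop()                # raises IndexError when out of memory

    def release(self, blocks: list[int]) -> None:
        self.free.extend(blocks)

class Sequence:
    """One request; its block table maps logical blocks to physical ones."""
    def __init__(self, pool: KVBlockPool):
        self.pool, self.block_table, self.num_tokens = pool, [], 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the last one is full.
        if self.num_tokens % self.pool.block_size == 0:
            self.block_table.append(self.pool.alloc())
        self.num_tokens += 1

pool = KVBlockPool(num_blocks=8, block_size=4)
seq = Sequence(pool)
for _ in range(6):
    seq.append_token()
print(seq.block_table)   # six tokens occupy only two physical blocks
```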
Citation counts were last updated on August 5, 2024.
Keywords | Title | Paper | Affiliation | Date | Citation |
---|---|---|---|---|---|
Cellular Batching | Low Latency RNN Inference with Cellular Batching [EuroSys 2018] | [pdf] | THU, NYU | 2018.04 | 97 |
ORCA | Orca: A Distributed Serving System for Transformer-Based Generative Models [OSDI 2022] | [pdf] [intro] (sketch below) | SNU, FriendliAI | 2022.07 | 195 |
SARATHI | SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills | [pdf] [intro] | Microsoft, Georgia Tech | 2023.08 | 36 |
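The sketch below illustrates the iteration-level (continuous) batching idea from the ORCA row: the scheduler re-forms the batch after every decoding step, so finished requests leave and queued requests join immediately instead of waiting for the whole batch to drain. All names and the request format are illustrative; `step` stands in for one forward pass of the model.

```python
from collections import deque
import random

def step(batch: list[dict]) -> None:
    """Stand-in for one model forward pass: each request decodes one token."""
    for req in batch:
        req["generated"] += 1

def serve(requests: list[dict], max_batch: int = 4) -> None:
    waiting, running = deque(requests), []
    while waiting or running:
        # Admit queued requests up to the batch budget (iteration-level scheduling).
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        step(running)
        # Retire finished requests right away; the rest keep decoding.
        running = [r for r in running if r["generated"] < r["target_len"]]

serve([{"generated": 0, "target_len": random.randint(1, 8)} for _ in range(10)])
```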
Citation counts were last updated on August 8, 2024.
Keywords | Title | Paper | Affiliation | Date | Citation |
---|---|---|---|---|---|
Block Transformer | Block Transformer: Global-to-Local Language Modeling for Fast Inference | [pdf] [intro] | KAIST, LG, Google | 2024.06 | 2 |
TTT | Learning to (Learn at Test Time): RNNs with Expressive Hidden States | [pdf] [intro] | Stanford, UC San Diego, UC Berkeley, etc. | 2024.07 | 3 |
LazyLLM | LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference | [pdf] [intro] | Apple, Meta | 2024.07 | |
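A toy sketch of the core idea in the LazyLLM row above: during long-context prefill, keep only the context tokens most attended to at the current step and defer the rest to later steps. The function name, the score input, and the keep ratio are all illustrative assumptions, not the paper's actual pruning criterion.

```python
import numpy as np

def prune_tokens(attn_to_last: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Return indices of context tokens to keep, sorted by position.

    attn_to_last: attention weight of each context token w.r.t. the current token.
    """
    k = max(1, int(len(attn_to_last) * keep_ratio))
    keep = np.argsort(attn_to_last)[-k:]          # top-k most attended tokens
    return np.sort(keep)                           # restore positional order

scores = np.random.rand(12)
print(prune_tokens(scores, keep_ratio=0.25))       # positions of the 3 kept tokens
```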