LLM-Inference-Acceleration

Table of Contents

  • About This Project
  • Listing of Papers By Topics
      • Review
      • Attention Mechanism
      • Quantization
      • KV Cache
      • Continuous Batching
      • HW/SW Co-design
      • Framework/System/Architecture
      • More

About This Project

This project is dedicated to collecting and curating research papers focused on Large Language Model (LLM) inference acceleration. I hope to summarize and share the knowledge I gain during my self-learning journey.

The list is updated regularly. Contributions are welcome: feel free to star the repo or submit a pull request.

Listing of Papers By Topics

Design Rules:

  • Keywords: common academic terms that help with paper search.
  • Paper intro: the core idea and key figures, along with useful extensions, are included for a quick start. See the example [intro of SmoothQuant].
  • Citation: citation counts are sourced from Google Scholar.

Review

Citation counts were last updated on 28 July 2024.

| Keywords | Title | Paper | Affiliation | Date | Citation |
| --- | --- | --- | --- | --- | --- |
| LLM Inference | Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems | [pdf] [intro] | CMU | 2023.12 | 36 |
| LLM Inference | A Survey on Efficient Inference for Large Language Models | [pdf] [intro] | THU, Infinigence-AI, SJTU etc. | 2024.04 | 14 |

Attention Mechanism

Citation counts were last updated on 20 October 2024.

Upcoming topics include FlashDecoding and QLoRA.

| Keywords | Title | Paper | Affiliation | Date | Citation |
| --- | --- | --- | --- | --- | --- |
| MQA | Fast Transformer Decoding: One Write-Head is All You Need | [pdf] | Google | 2019.11 | 252 |
| GQA | GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints [EMNLP 2023] | [pdf] [intro] | Google | 2023.05 | 335 |
| ALiBi | Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation [ICLR 2022] | [pdf] [intro] | UW, Facebook, Allen Institute | 2021.08 | 523 |
| RoPE | RoFormer: Enhanced Transformer with Rotary Position Embedding [Neurocomputing 2024] | [pdf] [intro] | Zhuiyi Technology | 2021.04 | 1337 |
| CoPE | Contextual Position Encoding: Learning to Count What's Important | [pdf] [intro] | Meta | 2024.05 | 12 |
| FlashAttention | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness [NeurIPS 2022] | [pdf] [intro] | Stanford, University at Buffalo | 2022.05 | 1271 |
| FlashAttention-2 | FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | [pdf] [intro] | Princeton, Stanford | 2023.07 | 475 |
| Longformer | Longformer: The Long-Document Transformer | [pdf] [intro] | Allen Institute | 2020.04 | 4120 |
| Mistral | Mistral 7B | [pdf] [intro] | Mistral | 2023.10 | 726 |
| StreamingLLM, Attention Sinks | Efficient Streaming Language Models with Attention Sinks [ICLR 2024] | [pdf] [intro] | MIT, Meta, CMU etc. | 2023.09 | 249 |
| LoRA | LoRA: Low-Rank Adaptation of Large Language Models [ICLR 2022] | [pdf] [intro] | Microsoft | 2021.06 | 7407 |
| LISA | LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning | [pdf] [intro] | HKUST, UIUC | 2024.03 | 5 |
| HydraLoRA | HydraLoRA: An Asymmetric LoRA Architecture for Efficient Fine-Tuning [NeurIPS 2024] | [pdf] [intro] | University of Macau, University of Texas at Austin, Cambridge | 2024.04 | 2 |
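
To make one of the entries above concrete, here is a minimal NumPy sketch of rotary position embedding (RoPE) from RoFormer. It is a simplification under stated assumptions: it uses the "half-split" channel pairing common in GPT-NeoX-style implementations rather than the paper's interleaved layout, and `rotary_embedding` is an illustrative name, not any library's API.

```python
import numpy as np

def rotary_embedding(x, base=10000.0):
    """Apply RoPE to x of shape (seq_len, dim), dim even.

    Channel pairs are rotated by an angle proportional to the token
    position, so the dot product between a rotated query and key
    depends on their relative offset, not their absolute positions.
    """
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-2.0 * np.arange(half) / dim)  # theta_i = base^(-2i/d)
    angles = np.outer(np.arange(seq_len), freqs)    # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]               # half-split pairing
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

q = rotary_embedding(np.random.randn(8, 64))   # rotate queries
k = rotary_embedding(np.random.randn(8, 64))   # rotate keys the same way
```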

Quantization

Citation counts were last updated on 5 July 2024.

Upcoming topics include QLoRA.

| Keywords | Title | Paper | Affiliation | Date | Citation |
| --- | --- | --- | --- | --- | --- |
| LLM.int8() | LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale [NeurIPS 2022] | [pdf] [intro] | UW, Facebook, Hugging Face etc. | 2022.08 | 554 |
| SmoothQuant | SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models [ICML 2023] | [pdf] [intro] | MIT, NVIDIA | 2022.11 | 382 |
| AWQ | AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration [MLSys 2024] | [pdf] [intro] | MIT, SJTU, NVIDIA etc. | 2023.06 | 235 |
| OneBit | OneBit: Towards Extremely Low-bit Large Language Models | [pdf] [intro] | THU, HIT | 2024.02 | 4 |
| BitNet b1.58 | The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits | [pdf] [intro] | Microsoft, UCAS | 2024.02 | 39 |
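
As a quick reference for the SmoothQuant row above, the toy sketch below shows the paper's core smoothing identity: per-channel scales s_j = max|X_j|^alpha / max|W_j|^(1-alpha) migrate quantization difficulty from activations to weights while leaving the matmul result unchanged. It omits calibration and the actual INT8 quantization step, and the names are illustrative.

```python
import numpy as np

def smooth_scales(act_absmax, w_absmax, alpha=0.5):
    # s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), per input channel
    return act_absmax ** alpha / w_absmax ** (1.0 - alpha)

X = np.random.randn(16, 4) * np.array([1.0, 50.0, 0.5, 8.0])  # one outlier channel
W = np.random.randn(4, 4)

s = smooth_scales(np.abs(X).max(axis=0), np.abs(W).max(axis=1))
X_s = X / s            # activations become easier to quantize ...
W_s = W * s[:, None]   # ... while weights absorb the difficulty

assert np.allclose(X @ W, X_s @ W_s)  # mathematically equivalent matmul
```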

KV Cache

Citation counts were last updated on 25 July 2024.

| Keywords | Title | Paper | Affiliation | Date | Citation |
| --- | --- | --- | --- | --- | --- |
| vLLM, PagedAttention | Efficient Memory Management for Large Language Model Serving with PagedAttention [SOSP 2023] | [pdf] [intro] | UC Berkeley, Stanford, UC San Diego | 2023.09 | 496 |
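
The sketch below illustrates the block-table idea behind PagedAttention: KV-cache memory is carved into fixed-size physical blocks and mapped to each sequence on demand, so no contiguous per-sequence reservation is needed. This is a toy bookkeeping model, not vLLM's actual API; in the real system the blocks hold K/V tensors and the attention kernel gathers through the block table.

```python
BLOCK_SIZE = 16  # tokens per physical KV block

class PagedKVCache:
    """Toy block-table bookkeeping in the spirit of PagedAttention."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # physical block pool
        self.block_tables = {}                      # seq_id -> [block ids]
        self.seq_lens = {}                          # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token's K/V entries."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.seq_lens.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:                 # current block full (or none yet)
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=64)
for _ in range(40):
    cache.append_token(seq_id=0)
print(cache.block_tables[0])  # 40 tokens -> ceil(40/16) = 3 blocks
```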

Continuous Batching

Citation counts were last updated on 5 August 2024.

| Keywords | Title | Paper | Affiliation | Date | Citation |
| --- | --- | --- | --- | --- | --- |
| Cellular Batching | Low Latency RNN Inference with Cellular Batching [EuroSys 2018] | [pdf] | THU, NYU | 2018.04 | 97 |
| Orca | Orca: A Distributed Serving System for Transformer-Based Generative Models [OSDI 2022] | [pdf] [intro] | SNU, FriendliAI | 2022.07 | 195 |
| SARATHI | SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills | [pdf] [intro] | Microsoft, GIT | 2023.08 | 36 |
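
To show what iteration-level ("continuous") batching buys over static batching, here is a minimal scheduler sketch in the spirit of Orca: the batch is re-formed at every decoding step, so a finished request frees its slot immediately instead of waiting for the longest sequence in the batch. `decode_step` is a stand-in for a real forward pass, and all names are illustrative.

```python
import random
from collections import deque

MAX_BATCH = 4

def decode_step(batch):
    """Stand-in for one model forward pass; returns ids that emitted EOS."""
    return [rid for rid in batch if random.random() < 0.2]

waiting = deque(range(10))  # queued request ids
running = []                # requests currently in the batch
steps = 0

while waiting or running:
    # Iteration-level scheduling: admit waiting requests into free slots.
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())
    for rid in decode_step(running):
        running.remove(rid)  # the slot is reusable at the very next step
    steps += 1

print(f"served 10 requests in {steps} decoding steps")
```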

HW/SW Co-design

Framework/System/Architecture

More

Citation counts were last updated on 8 August 2024.

| Keywords | Title | Paper | Affiliation | Date | Citation |
| --- | --- | --- | --- | --- | --- |
| Block Transformer | Block Transformer: Global-to-Local Language Modeling for Fast Inference | [pdf] [intro] | KAIST, LG, Google | 2024.06 | 2 |
| TTT | Learning to (Learn at Test Time): RNNs with Expressive Hidden States | [pdf] [intro] | Stanford, UC San Diego, UC Berkeley etc. | 2024.07 | 3 |
| LazyLLM | LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference | [pdf] [intro] | Apple, Meta | 2024.07 | |
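
For the LazyLLM row above, the sketch below shows the general shape of dynamic token pruning: context tokens that receive little attention are dropped so later layers and steps process a shorter sequence. The scoring rule here is a deliberate simplification of the paper's method, and the names are illustrative.

```python
import numpy as np

def prune_context(keys, values, attn_scores, keep_ratio=0.25):
    """Keep only the context tokens with the highest attention mass."""
    k = max(1, int(len(attn_scores) * keep_ratio))
    keep = np.sort(np.argsort(attn_scores)[-k:])  # top-k, in original order
    return keys[keep], values[keep], keep

keys = np.random.randn(128, 64)   # toy KV entries for 128 context tokens
values = np.random.randn(128, 64)
scores = np.random.rand(128)      # e.g. attention paid to each token

keys, values, kept = prune_context(keys, values, scores)
print(len(kept))  # 32 tokens survive to the next layer / step
```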
