LLM-Inference-Acceleration

Table of Contents

  • About This Project
  • Listing of Papers By Topics
      • Review
      • Attention Mechanism
      • Quantization
      • KV Cache
      • Continuous Batching
      • HW/SW Co-design
      • Framework/System/Architecture
      • More

About This Project

This project is dedicated to collecting and curating research papers focused on Large Language Model (LLM) inference acceleration. I hope to summarize and share the knowledge I gain during my self-learning journey.

The list is updated regularly. Contributions are welcome: feel free to star the repo or submit a pull request.

Listing of Papers By Topics

Design Rules:

  • Keywords: common academic terms that help with paper search.
  • Paper intro: the core idea and key figures, along with useful extensions, are included for a quick start. See the example [intro of SmoothQuant].
  • Citation: citation counts are sourced from Google Scholar.

Review

Citation counts were last updated on 28 July 2024.

| Keywords | Title | Paper | Affiliation | Date | Citation |
| --- | --- | --- | --- | --- | --- |
| LLM Inference | Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems | [pdf] [intro] | CMU | 2023.12 | 36 |
| LLM Inference | A Survey on Efficient Inference for Large Language Models | [pdf] [intro] | THU, Infinigence-AI, SJTU etc. | 2024.04 | 14 |

Attention Mechanism

Citation counts were last updated on 20 October 2024.

Upcoming topics include FlashDecoding and QLoRA.

| Keywords | Title | Paper | Affiliation | Date | Citation |
| --- | --- | --- | --- | --- | --- |
| MQA | Fast Transformer Decoding: One Write-Head is All You Need | [pdf] | Google | 2019.11 | 252 |
| GQA | GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints [EMNLP 2023] | [pdf] [intro] | Google | 2023.05 | 335 |
| ALiBi | Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation [ICLR 2022] | [pdf] [intro] | UW, Facebook, Allen Institute | 2021.08 | 523 |
| RoPE | RoFormer: Enhanced Transformer with Rotary Position Embedding [Neurocomputing 2024] | [pdf] [intro] | Zhuiyi Technology | 2021.04 | 1337 |
| CoPE | Contextual Position Encoding: Learning to Count What's Important | [pdf] [intro] | Meta | 2024.05 | 12 |
| FlashAttention | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness [NeurIPS 2022] | [pdf] [intro] | Stanford, University at Buffalo | 2022.05 | 1271 |
| FlashAttention-2 | FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | [pdf] [intro] | Princeton, Stanford | 2023.07 | 475 |
| Longformer | Longformer: The Long-Document Transformer | [pdf] [intro] | Allen Institute | 2020.04 | 4120 |
| Mistral | Mistral 7B | [pdf] [intro] | Mistral | 2023.10 | 726 |
| StreamingLLM, Attention Sinks | Efficient Streaming Language Models with Attention Sinks [ICLR 2024] | [pdf] [intro] | MIT, Meta, CMU etc. | 2023.09 | 249 |
| LoRA | LoRA: Low-Rank Adaptation of Large Language Models [ICLR 2022] | [pdf] [intro] | Microsoft | 2021.06 | 7407 |
| LISA | LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning | [pdf] [intro] | HKUST, UIUC | 2024.03 | 5 |
| HydraLoRA | HydraLoRA: An Asymmetric LoRA Architecture for Efficient Fine-Tuning [NeurIPS 2024] | [pdf] [intro] | University of Macau, University of Texas at Austin, Cambridge | 2024.04 | 2 |
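
To make one of the entries above concrete, here is a minimal NumPy sketch of rotary position embedding (RoPE) from RoFormer. It is a simplification under stated assumptions: it uses the "half-split" channel pairing common in GPT-NeoX-style implementations rather than the paper's interleaved layout, and `rotary_embedding` is an illustrative name, not any library's API.

```python
import numpy as np

def rotary_embedding(x, base=10000.0):
    """Apply RoPE to x of shape (seq_len, dim), dim even.

    Channel pairs are rotated by an angle proportional to the token
    position, so the dot product between a rotated query and key
    depends on their relative offset, not their absolute positions.
    """
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-2.0 * np.arange(half) / dim)  # theta_i = base^(-2i/d)
    angles = np.outer(np.arange(seq_len), freqs)    # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]               # half-split pairing
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

q = rotary_embedding(np.random.randn(8, 64))   # rotate queries
k = rotary_embedding(np.random.randn(8, 64))   # rotate keys the same way
```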

Quantization

Citation counts were last updated on 5 July 2024.

Upcoming topics include QLoRA.

| Keywords | Title | Paper | Affiliation | Date | Citation |
| --- | --- | --- | --- | --- | --- |
| LLM.int8() | LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale [NeurIPS 2022] | [pdf] [intro] | UW, Facebook, Hugging Face etc. | 2022.08 | 554 |
| SmoothQuant | SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models [ICML 2023] | [pdf] [intro] | MIT, NVIDIA | 2022.11 | 382 |
| AWQ | AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration [MLSys 2024] | [pdf] [intro] | MIT, SJTU, NVIDIA etc. | 2023.06 | 235 |
| OneBit | OneBit: Towards Extremely Low-bit Large Language Models | [pdf] [intro] | THU, HIT | 2024.02 | 4 |
| BitNet b1.58 | The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits | [pdf] [intro] | Microsoft, UCAS | 2024.02 | 39 |
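
As a quick reference for the SmoothQuant row above, the toy sketch below shows the paper's core smoothing identity: per-channel scales s_j = max|X_j|^alpha / max|W_j|^(1-alpha) migrate quantization difficulty from activations to weights while leaving the matmul result unchanged. It omits calibration and the actual INT8 quantization step, and the names are illustrative.

```python
import numpy as np

def smooth_scales(act_absmax, w_absmax, alpha=0.5):
    # s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), per input channel
    return act_absmax ** alpha / w_absmax ** (1.0 - alpha)

X = np.random.randn(16, 4) * np.array([1.0, 50.0, 0.5, 8.0])  # one outlier channel
W = np.random.randn(4, 4)

s = smooth_scales(np.abs(X).max(axis=0), np.abs(W).max(axis=1))
X_s = X / s            # activations become easier to quantize ...
W_s = W * s[:, None]   # ... while weights absorb the difficulty

assert np.allclose(X @ W, X_s @ W_s)  # mathematically equivalent matmul
```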

KV Cache

Citation counts were last updated on 25 July 2024.

| Keywords | Title | Paper | Affiliation | Date | Citation |
| --- | --- | --- | --- | --- | --- |
| vLLM, PagedAttention | Efficient Memory Management for Large Language Model Serving with PagedAttention [SOSP 2023] | [pdf] [intro] | UC Berkeley, Stanford, UC San Diego | 2023.09 | 496 |
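
The sketch below illustrates the block-table idea behind PagedAttention: KV-cache memory is carved into fixed-size physical blocks and mapped to each sequence on demand, so no contiguous per-sequence reservation is needed. This is a toy bookkeeping model, not vLLM's actual API; in the real system the blocks hold K/V tensors and the attention kernel gathers through the block table.

```python
BLOCK_SIZE = 16  # tokens per physical KV block

class PagedKVCache:
    """Toy block-table bookkeeping in the spirit of PagedAttention."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # physical block pool
        self.block_tables = {}                      # seq_id -> [block ids]
        self.seq_lens = {}                          # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token's K/V entries."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.seq_lens.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:                 # current block full (or none yet)
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=64)
for _ in range(40):
    cache.append_token(seq_id=0)
print(cache.block_tables[0])  # 40 tokens -> ceil(40/16) = 3 blocks
```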

Continuous Batching

Citation counts were last updated on 5 August 2024.

| Keywords | Title | Paper | Affiliation | Date | Citation |
| --- | --- | --- | --- | --- | --- |
| Cellular Batching | Low Latency RNN Inference with Cellular Batching [EuroSys 2018] | [pdf] | THU, NYU | 2018.04 | 97 |
| Orca | Orca: A Distributed Serving System for Transformer-Based Generative Models [OSDI 2022] | [pdf] [intro] | SNU, FriendliAI | 2022.07 | 195 |
| SARATHI | SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills | [pdf] [intro] | Microsoft, GIT | 2023.08 | 36 |
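
To show what iteration-level ("continuous") batching buys over static batching, here is a minimal scheduler sketch in the spirit of Orca: the batch is re-formed at every decoding step, so a finished request frees its slot immediately instead of waiting for the longest sequence in the batch. `decode_step` is a stand-in for a real forward pass, and all names are illustrative.

```python
import random
from collections import deque

MAX_BATCH = 4

def decode_step(batch):
    """Stand-in for one model forward pass; returns ids that emitted EOS."""
    return [rid for rid in batch if random.random() < 0.2]

waiting = deque(range(10))  # queued request ids
running = []                # requests currently in the batch
steps = 0

while waiting or running:
    # Iteration-level scheduling: admit waiting requests into free slots.
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())
    for rid in decode_step(running):
        running.remove(rid)  # the slot is reusable at the very next step
    steps += 1

print(f"served 10 requests in {steps} decoding steps")
```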

HW/SW Co-design

Framework/System/Architecture

More

Citation counts were last updated on 8 August 2024.

| Keywords | Title | Paper | Affiliation | Date | Citation |
| --- | --- | --- | --- | --- | --- |
| Block Transformer | Block Transformer: Global-to-Local Language Modeling for Fast Inference | [pdf] [intro] | KAIST, LG, Google | 2024.06 | 2 |
| TTT | Learning to (Learn at Test Time): RNNs with Expressive Hidden States | [pdf] [intro] | Stanford, UC San Diego, UC Berkeley etc. | 2024.07 | 3 |
| LazyLLM | LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference | [pdf] [intro] | Apple, Meta | 2024.07 | |
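
For the LazyLLM row above, the sketch below shows the general shape of dynamic token pruning: context tokens that receive little attention are dropped so later layers and steps process a shorter sequence. The scoring rule here is a deliberate simplification of the paper's method, and the names are illustrative.

```python
import numpy as np

def prune_context(keys, values, attn_scores, keep_ratio=0.25):
    """Keep only the context tokens with the highest attention mass."""
    k = max(1, int(len(attn_scores) * keep_ratio))
    keep = np.sort(np.argsort(attn_scores)[-k:])  # top-k, in original order
    return keys[keep], values[keep], keep

keys = np.random.randn(128, 64)   # toy KV entries for 128 context tokens
values = np.random.randn(128, 64)
scores = np.random.rand(128)      # e.g. attention paid to each token

keys, values, kept = prune_context(keys, values, scores)
print(len(kept))  # 32 tokens survive to the next layer / step
```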
