Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. The core of Mooncake, Transfer Engine, is now open source! This repository also hosts the technical report and the open-sourced traces.
- Nov 28, 2024: We open sourced the Transfer Engine, the central component of Mooncake. We also provide two demonstrations of Transfer Engine: a P2P Store and vLLM integration.
- July 9, 2024: We open sourced the trace as a jsonl file!
- June 27, 2024: We present a series of Chinese blogs with more discussions on zhihu 1, 2, 3, 4.
- June 26, 2024: Initial technical report release.
Mooncake features a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters. It also leverages the underutilized CPU, DRAM, and SSD resources of the GPU cluster to implement a disaggregated cache of KVCache.
The core of Mooncake is its KVCache-centric scheduler, which maximizes overall effective throughput while meeting latency-related Service Level Objectives (SLOs). Unlike traditional studies that assume all requests will be processed, Mooncake must handle highly overloaded scenarios. To mitigate overload, we developed a prediction-based early rejection policy. Experiments show that Mooncake excels in long-context scenarios. Compared to the baseline method, Mooncake can achieve up to a 525% increase in throughput in certain simulated scenarios while adhering to SLOs. Under real workloads, Mooncake's innovative architecture enables Kimi to handle 75% more requests.
- The bottom layer of Mooncake is Transfer Engine, which supports rapid, reliable and flexible data transfer over TCP, RDMA, NVIDIA GPUDirect-based RDMA and NVMe over Fabric (NVMe-oF) protocols. Compared with gloo (used by Distributed PyTorch) and TCP, Mooncake Transfer Engine has the lowest I/O latency.
- Based on Transfer Engine, we implemented the P2P Store library, which supports sharing temporary objects (e.g., checkpoint files) among nodes in a cluster. It avoids bandwidth saturation on a single machine.
- Additionally, we modified vLLM so that Transfer Engine is integrated. It makes prefill-decode disaggregation more efficient by utilizing RDMA devices.
- In the future, we plan to build Mooncake Store on the basis of Transfer Engine, which supports pooled KVCache for more flexible P/D disaggregation.
Use Transfer Engine Standalone (Guide)
Transfer Engine is a high-performance data transfer framework. It provides a unified interface to transfer data from DRAM, VRAM or NVMe, while hiding the hardware-related technical details. Transfer Engine supports TCP, RDMA (InfiniBand/RoCEv2/eRDMA/NVIDIA GPUDirect) and NVMe over Fabric (NVMe-oF) protocols.
- Efficient use of multiple RDMA NIC devices. Transfer Engine supports the use of multiple RDMA NIC devices to aggregate transfer bandwidth.
- Topology-aware path selection. Transfer Engine can select optimal devices based on the location (NUMA affinity, etc.) of both source and destination.
- More robust against temporary network errors. Once a transmission fails, Transfer Engine automatically tries alternative paths for data delivery.
With 40 GB of data (equivalent to the size of the KVCache generated by 128k tokens in the LLaMA3-70B model), Mooncake Transfer Engine delivers up to 87 GB/s and 190 GB/s of bandwidth in 4×200 Gbps and 8×400 Gbps RoCE networks respectively, which are about 2.4x and 4.6x faster than the TCP protocol.
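The figures above can be sanity-checked with a few lines of arithmetic. The following is a minimal sketch, not part of Mooncake: the LLaMA3-70B attention shape (80 layers, 8 KV heads, head dimension 128) and the 2-byte FP16/BF16 value size are assumptions taken from the public model configuration rather than from this README, and the helper names are hypothetical.

```python
# Assumed LLaMA3-70B attention shape (public model configuration, not stated
# in this README): 80 layers, 8 KV heads (grouped-query attention), head dim 128.
NUM_LAYERS = 80
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # assumption: KVCache stored in FP16/BF16


def kvcache_bytes(num_tokens: int) -> int:
    """KVCache size in bytes: one key and one value vector per layer per token."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens


def link_capacity_gb_per_s(num_nics: int, gbps_per_nic: float) -> float:
    """Aggregate line rate of several NICs, converted from Gbit/s to GB/s."""
    return num_nics * gbps_per_nic / 8


if __name__ == "__main__":
    size_gib = kvcache_bytes(128 * 1024) / 2**30
    print(f"KVCache for 128k tokens: {size_gib:.0f} GiB")

    # (NICs, Gbps per NIC, reported GB/s, reported speedup over TCP)
    for nics, gbps, reported, speedup in [(4, 200, 87.0, 2.4), (8, 400, 190.0, 4.6)]:
        capacity = link_capacity_gb_per_s(nics, gbps)
        print(
            f"{nics}x{gbps} Gbps: line rate {capacity:.0f} GB/s, "
            f"utilization {reported / capacity:.1%}, "
            f"40 GB in {40 / reported:.2f} s, "
            f"implied TCP bandwidth {reported / speedup:.1f} GB/s"
        )
```

Under these assumptions the KVCache for 128k tokens comes to 40 GiB, and the reported bandwidths correspond to about 87% and 47.5% of the aggregate line rate of the two network configurations.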
P2P Store (Guide)
P2P Store is built on the Transfer Engine and supports sharing temporary objects between peer nodes in a cluster. P2P Store is ideal for scenarios like checkpoint transfer, where data needs to be rapidly and efficiently shared across a cluster. P2P Store has been used in the checkpoint transfer service of Moonshot AI.
- Decentralized architecture. P2P Store leverages a pure client-side architecture with global metadata managed by the etcd service.
- Efficient data distribution. Designed to enhance the efficiency of large-scale data distribution, P2P Store avoids bandwidth saturation issues by allowing replicated nodes to share data directly. This reduces the CPU/RDMA NIC pressures of data providers (e.g., trainers).
Thanks to the high performance of Transfer Engine, P2P Store can also distribute objects with full utilization of the hardware's incoming bandwidth (e.g., with the 25 Gbps NIC used in the following figure, the throughput of get replica is about 3.1 GB/s).
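To see why 3.1 GB/s counts as full utilization, the NIC line rate can be converted from Gbit/s to GB/s. This is a small illustrative calculation with a hypothetical helper name; it ignores protocol overhead.

```python
def nic_line_rate_gb_per_s(gbps: float) -> float:
    """Convert a NIC line rate from Gbit/s to GB/s (8 bits per byte)."""
    return gbps / 8


line_rate = nic_line_rate_gb_per_s(25)  # 25 Gbps NIC from the example above
observed = 3.1                          # reported get-replica throughput, GB/s
utilization = observed / line_rate

if __name__ == "__main__":
    print(f"line rate {line_rate} GB/s, utilization {utilization:.1%}")
```

A 25 Gbps NIC tops out at 3.125 GB/s, so the reported throughput is roughly 99% of the line rate.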
vLLM Integration (Guide v0.1, v0.2-Nightly)
To optimize LLM inference, the vLLM community is working on supporting disaggregated prefilling (PR 8498). This feature allows separating the prefill phase from the decode phase in different processes. vLLM uses `nccl` and `gloo` as the transport layer by default, but currently it cannot efficiently decouple the two phases across different machines.
We have implemented a vLLM integration that uses Transfer Engine as the network layer instead of `nccl` and `gloo` to support inter-node KVCache transfer. Transfer Engine provides a simpler interface and more efficient use of RDMA devices. In the future, we plan to build Mooncake Store on the basis of Transfer Engine, which supports pooled prefill/decode disaggregation.
Update[Dec 4, 2024]: Here is the nightly vLLM Integration (Guide v0.2-Nightly) that is based on vLLM's main branch.
By supporting Topology Aware Path Selection and multi-card bandwidth aggregation, TTFT of vLLM with Transfer Engine is up to 33% lower than traditional TCP-based transports. In the future, we will further improve TTFT through GPUDirect RDMA and zero-copy.
Backend/Setting | Output Token Throughput (tok/s) | Total Token Throughput (tok/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) |
---|---|---|---|---|---|
Transfer Engine (RDMA) | 12.07 | 2046.78 | 1165.25 | 678.74 | 4576.57 |
TCP | 12.06 | 2045.51 | 1925.52 | 1011.58 | 8149.52 |
- Click here to access detailed benchmark results.
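The relative TTFT improvement can be recomputed directly from the table. The sketch below only restates the table values; the dictionary layout and function name are illustrative.

```python
# TTFT values in milliseconds, copied from the table above.
TTFT_MS = {
    "Transfer Engine (RDMA)": {"mean": 1165.25, "median": 678.74, "p99": 4576.57},
    "TCP": {"mean": 1925.52, "median": 1011.58, "p99": 8149.52},
}


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


reductions = {
    metric: relative_reduction(
        TTFT_MS["TCP"][metric], TTFT_MS["Transfer Engine (RDMA)"][metric]
    )
    for metric in ("mean", "median", "p99")
}

if __name__ == "__main__":
    for metric, value in reductions.items():
        print(f"{metric} TTFT reduction vs TCP: {value:.1%}")
```

For this table the median TTFT is about 33% lower than TCP, while the mean and P99 reductions are somewhat larger. Throughput is essentially unchanged between the two backends.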
More advanced features are coming soon, so stay tuned!
In order to install and use Mooncake, some preparation is required.
- RDMA Driver & SDK (e.g., Mellanox OFED).
- Linux-x86_64 with gcc, g++ (9.4+) and cmake (3.16+).
- Python (3.10 or above)
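Before building, it can help to confirm that the toolchain meets the minimum versions listed above. The following is a minimal sketch and not a script shipped with Mooncake: the function names are hypothetical, it relies on `gcc -dumpfullversion` (available in GCC 7 and later) and `cmake --version`, and it does not check the RDMA driver or SDK.

```python
import re
import shutil
import subprocess
import sys

# Minimum versions taken from the prerequisite list above.
MIN_VERSIONS = {"gcc": (9, 4), "g++": (9, 4), "cmake": (3, 16)}
MIN_PYTHON = (3, 10)

# Commands assumed to print a version string for each tool.
VERSION_COMMANDS = {
    "gcc": ["gcc", "-dumpfullversion"],
    "g++": ["g++", "-dumpfullversion"],
    "cmake": ["cmake", "--version"],
}


def parse_version(text: str) -> tuple:
    """Extract the first dotted version number found in `text`."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(version: tuple, minimum: tuple) -> bool:
    """Compare version tuples lexicographically, e.g. (9, 4, 0) >= (9, 4)."""
    return version >= minimum


def check_tool(name: str) -> str:
    """Return a one-line status report for a single build tool."""
    if shutil.which(name) is None:
        return f"{name}: not found"
    result = subprocess.run(
        VERSION_COMMANDS[name], capture_output=True, text=True, check=False
    )
    try:
        version = parse_version(result.stdout)
    except ValueError:
        return f"{name}: could not determine version"
    status = "ok" if meets_minimum(version, MIN_VERSIONS[name]) else "too old"
    return f"{name}: {'.'.join(map(str, version))} ({status})"


if __name__ == "__main__":
    python_ok = meets_minimum(tuple(sys.version_info[:3]), MIN_PYTHON)
    print(f"python: {sys.version.split()[0]} ({'ok' if python_ok else 'too old'})")
    for tool in MIN_VERSIONS:
        print(check_tool(tool))
```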
In addition, to support more features of Mooncake Transfer Engine, we recommend installing the following components:
- CUDA 12.1 and above, including NVIDIA GPUDirect Storage Support, if you want to build with `-DUSE_CUDA`. You may install them from here.
  ```bash
  # Adding CUDA to PATH
  export PATH=/usr/local/cuda/bin:$PATH
  export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
  export CUDA_PATH=/usr/local/cuda
  ```
- Go 1.20+, if you want to build with `-DWITH_P2P_STORE`. You may download it from here.
- Rust Toolchain, if you want to build with `-DWITH_RUST_EXAMPLE`.
- `hiredis`, if you want to build with `-DWITH_REDIS`, so that you can use Redis instead of etcd as the metadata server.
- Init source code
  ```bash
  git clone https://github.com/kvcache-ai/Mooncake.git
  cd Mooncake
  ```
- Install dependencies
  ```bash
  bash dependencies.sh
  ```
- Compile Mooncake and examples
  ```bash
  mkdir build
  cd build
  cmake .. # (optional) Specify build options like -D
  make -j
  ```
- First release of Mooncake and integrate with latest vLLM
- Share KV caches across multiple serving engines
- User and developer documentation
{
"timestamp": 27482,
"input_length": 6955,
"output_length": 52,
"hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2353, 2354]
}
{
"timestamp": 30535,
"input_length": 6472,
"output_length": 26,
"hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2366]
}
The above presents two samples from our trace dataset. The trace includes the timing of request arrivals, the number of input tokens, the number of output tokens, and the remapped block hash. To protect our customers' privacy, we applied several mechanisms to remove user-related information while preserving the dataset's utility for simulated evaluation. More descriptions of the trace (e.g., up to 50% cache hit ratio) can be found in Section 4 of the paper's Version 3.
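As an illustration of how the trace can be used in a simulated evaluation, the sketch below replays the two samples above and counts how many leading blocks of each request were already seen in earlier requests. It is a simplified model under stated assumptions, not the simulator from the paper: it assumes that equal `hash_ids` denote reusable KVCache blocks, that reuse is only possible for a contiguous prefix, and that the cache is unbounded with no eviction. The function names are hypothetical.

```python
import json

# The two sample records shown above, one JSON object per line (jsonl).
SAMPLES = """\
{"timestamp": 27482, "input_length": 6955, "output_length": 52, "hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2353, 2354]}
{"timestamp": 30535, "input_length": 6472, "output_length": 26, "hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2366]}
"""


def cached_prefix_blocks(hash_ids, seen):
    """Count leading blocks already in the cache, stopping at the first miss."""
    hits = 0
    for block_hash in hash_ids:
        if block_hash not in seen:
            break
        hits += 1
    return hits


def replay(lines):
    """Replay requests in order; return (timestamp, hit_blocks, total_blocks)."""
    seen = set()  # assumption: unbounded cache, nothing is ever evicted
    results = []
    for line in lines:
        if not line.strip():
            continue
        request = json.loads(line)
        hash_ids = request["hash_ids"]
        hits = cached_prefix_blocks(hash_ids, seen)
        results.append((request["timestamp"], hits, len(hash_ids)))
        seen.update(hash_ids)
    return results


if __name__ == "__main__":
    for timestamp, hits, total in replay(SAMPLES.splitlines()):
        print(f"t={timestamp}: {hits}/{total} prefix blocks cached ({hits / total:.0%})")
```

In this toy replay the first request finds nothing cached, and the second request reuses 12 of its 13 blocks because it shares the leading `hash_ids` 46 through 57 with the first. To run it on the full trace, pass the lines of the jsonl file to `replay` instead of `SAMPLES`.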
Please kindly cite our paper if you find the paper or the trace useful:

```bibtex
@article{qin2024mooncake,
  title  = {Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving},
  author = {Ruoyu Qin and Zheming Li and Weiran He and Mingxing Zhang and Yongwei Wu and Weimin Zheng and Xinran Xu},
  year   = {2024},
  url    = {https://arxiv.org/abs/2407.00079}
}
```