- Linux
- macOS
- Clone the llama.cpp repository with Git:

  ```bash
  git clone https://github.com/ggerganov/llama.cpp
  ```
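  The conversion script and build targets change across llama.cpp releases, so if a later step does not match your checkout, pinning an older revision may help. A sketch, where the revision is a placeholder you would fill in yourself:

  ```bash
  # Pin a known-good revision (placeholder: substitute a real tag or commit).
  cd llama.cpp
  git checkout <known-good-tag-or-commit>
  ```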
- Enter the llama.cpp directory and build:

  ```bash
  cd llama.cpp
  make
  ```
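  Optionally, parallelize the build to speed it up:

  ```bash
  # -j builds with multiple cores; nproc is Linux-specific, use
  # "sysctl -n hw.ncpu" on macOS instead.
  make -j"$(nproc)"
  ```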
- Create a directory to store the model:

  ```bash
  cd llama.cpp/models
  mkdir Minicpm
  ```
- Download the MiniCPM PyTorch model: download all of the MiniCPM PyTorch model files and save them to the `llama.cpp/models/Minicpm` directory.
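  One concrete way to fetch the files is the Hugging Face CLI; a minimal sketch, assuming `openbmb/MiniCPM-2B-sft-bf16` is the checkpoint you want (substitute the repository that matches your model):

  ```bash
  # Sketch: download a model repository into the target directory
  # (assumes the huggingface_hub package provides the huggingface-cli tool).
  pip install -U huggingface_hub
  huggingface-cli download openbmb/MiniCPM-2B-sft-bf16 --local-dir models/Minicpm
  ```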
- Modify the conversion script: check the `_reverse_hf_permute` function in `llama.cpp/convert-hf-to-gguf.py`. If you find the following code:

  ```python
  def _reverse_hf_permute(self, weights: Tensor, n_head: int, n_kv_head: int | None = None) -> Tensor:
      if n_kv_head is not None and n_head != n_kv_head:
          n_head //= n_kv_head
      return (weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
              .swapaxes(1, 2)
              .reshape(weights.shape))
  ```
  replace it with:

  ```python
  @staticmethod
  def permute(weights: Tensor, n_head: int, n_head_kv: int | None):
      if n_head_kv is not None and n_head != n_head_kv:
          n_head = n_head_kv
      return (weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
              .swapaxes(1, 2)
              .reshape(weights.shape))
  ```
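  A quick way to confirm the edit landed, run from the `llama.cpp` directory:

  ```bash
  # List matches for the new staticmethod in the conversion script.
  grep -n "def permute" convert-hf-to-gguf.py
  ```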
- Install the dependencies and convert the model:

  ```bash
  python3 -m pip install -r requirements.txt
  python3 convert-hf-to-gguf.py models/Minicpm/
  ```
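  If you want to pin the output precision explicitly, the script accepts an output-type flag in many versions; a sketch, assuming your checkout supports `--outtype`:

  ```bash
  # Request f16 output explicitly (assumes the --outtype flag exists in
  # your version of convert-hf-to-gguf.py).
  python3 convert-hf-to-gguf.py models/Minicpm/ --outtype f16
  ```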
  After completing the steps above, the `llama.cpp/models/Minicpm` directory will contain a model file named `ggml-model-f16.gguf`.
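  A quick sanity check that the conversion produced a file of plausible size:

  ```bash
  # The f16 GGUF should be roughly the size of the original fp16 weights.
  ls -lh models/Minicpm/ggml-model-f16.gguf
  ```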
- Quantize the model (skip this step if the model you downloaded is already in a quantized format):

  ```bash
  ./llama-quantize ./models/Minicpm/ggml-model-f16.gguf ./models/Minicpm/ggml-model-Q4_K_M.gguf Q4_K_M
  ```
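  `Q4_K_M` is only one of several quantization types; your build can print the full list it supports:

  ```bash
  # Invoking llama-quantize without valid arguments prints usage, including
  # the table of supported quantization types.
  ./llama-quantize --help
  ```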
  If `llama-quantize` cannot be found, try rebuilding it:

  ```bash
  cd llama.cpp
  make llama-quantize
  ```
- Run inference with the quantized model:

  ```bash
  ./llama-cli -m ./models/Minicpm/ggml-model-Q4_K_M.gguf -n 128 --prompt "<用户>你知道openbmb么<AI>"
  ```
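  Beyond a one-shot prompt, the same file can be served over HTTP; a sketch, assuming your build also produced the `llama-server` binary:

  ```bash
  # Serve the quantized model with llama.cpp's HTTP server (assumption:
  # llama-server was built; run "make llama-server" first if it is missing).
  ./llama-server -m ./models/Minicpm/ggml-model-Q4_K_M.gguf --port 8080
  ```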