Deploying DeepSeek-R1-Q4_K_M with KTransformers

1. References and Notes

KTransformers is a project initiated by Tsinghua University. It exploits the MoE architecture of the DeepSeek models by loading the expert weights into system memory and assigning the corresponding computation to the CPU, while keeping the MLA attention and KV cache on the GPU, which enables hybrid CPU+GPU inference. This approach minimizes GPU memory usage while still maintaining reasonable inference speed. The goal of the KTransformers project is to make local deployment of large models practical on limited hardware, so that more people can run models on their own machines that were previously out of reach.

1.1 References

  • Reference articles:

    • KTransformers GitHub repository: https://github.com/kvcache-ai/ktransformers/tree/v0.2.3post2

    • Installation guide: the official KTransformers install documentation (https://kvcache-ai.github.io/ktransformers/en/install.html)

    • https://mp.weixin.qq.com/s/1keAGOQlkTf_dKrzWmCRZQ

    • https://mp.weixin.qq.com/s/C4aTsxzYGV7bFrKyx6juug

    • https://kq4b3vgg5b.feishu.cn/wiki/QJ5ywpjnvieTKZk5kPHcG3sLnkd

  • Model download page:

    • https://modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/files

  • Model file download link:

    • https://modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/resolve/master/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf

Only the DeepSeek-R1-Q4_K_M quantized version needs to be downloaded: https://www.modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/feedback/issueDetail/23220

1.2 Notes

The first deployment attempt was on a CentOS 8 x86_64 physical server and kept hitting errors; the errors are described in Chapter 2. That machine belongs to another colleague, who was later advised to create an Ubuntu 22.04 container on the physical server and work inside it instead.

The second attempt was done on my own Ubuntu 20.04 LTS x86_64 physical server and succeeded.

2. Troubleshooting on CentOS 8

2.1 With torch 2.6.0, running "pip install ." in the KT repository root fails with "Read timed out"

[Screenshot: "Read timed out" error during pip install]

#Switch to torch 2.4.1 instead

pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu124

#Then point pip at the Tsinghua mirror (KTransformers is a Tsinghua-led open-source project):

sh install.sh
##pip install . -i https://pypi.tuna.tsinghua

But the following errors appeared (at that point the installed CUDA version was cuda_12.6.r12.6/compiler.34431801_0 and the torch version was 2.4.1+cu124):

The first message says the detected CUDA 12.6 toolkit does not match the CUDA version that the installed PyTorch 2.4.1 was built against (which should be 12.4). In most cases this is not a serious problem, so it is only a warning.

The second line says there is no g++ version bound defined for CUDA 12.6.

[Screenshots: CUDA/PyTorch version mismatch warnings]
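As a quick sanity check (my own addition, not from the original log), the system CUDA toolkit and the CUDA build of the installed PyTorch can be compared like this:

#CUDA toolkit version seen by nvcc
nvcc --version | grep release
#CUDA version the installed torch wheel was built against, plus whether the GPU is visible
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"

If the two differ only in the minor release (e.g. 12.6 vs 12.4), the extension usually still builds; a larger mismatch is worth fixing first.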

Next, CUDA 12.4 + torch 2.4.1 were installed and sh install.sh was run again; g++ version and cmake related errors still appeared.

[Screenshot: g++ and cmake related errors]

For the g++ version errors, one option is to upgrade gcc and g++, then rerun sh install.sh and check whether the errors persist; a possible approach is sketched below.
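On CentOS 8, one common way to get a newer gcc/g++ is the gcc-toolset AppStream packages. This is only a sketch of that idea and was not verified in this deployment; the toolset version number is an assumption and depends on what the repositories actually provide.

#Install and temporarily activate a newer toolchain (adjust the version to what `dnf list "gcc-toolset-*"` offers)
dnf install -y gcc-toolset-11
scl enable gcc-toolset-11 bash
#Confirm the active compiler versions before rerunning sh install.sh
gcc --version && g++ --version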

3. Deployment on Ubuntu 20.04

3.0 Upgrade cmake

#Upgrade cmake (from 3.16.3 to 3.23.0)
#(Note: the bin/cmake symlink below assumes the downloaded tarball contains a usable cmake binary; the plain source tarball normally has to be configured and built first, as in the commented-out build-from-source block later in this document.)
root@ksp-registry:/opt/installPkgs# wget https://cmake.org/files/v3.23/cmake-3.23.0.tar.gz
root@ksp-registry:/opt/installPkgs# tar -zxvf cmake-3.23.0.tar.gz
root@ksp-registry:/opt/installPkgs# cp -rp cmake-3.23.0 /usr/share/cmake-3.23.0
root@ksp-registry:/opt/installPkgs# ln -sf /usr/share/cmake-3.23.0/bin/cmake /usr/bin/cmake
root@ksp-registry:/opt/installPkgs# cmake --version
cmake version 3.23.0

CMake suite maintained and supported by Kitware (kitware.com/cmake).
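A lower-risk alternative (not used here) is to install a recent cmake into the active Python environment with pip, which leaves the system cmake 3.16.3 untouched:

#The PyPI "cmake" package ships a prebuilt cmake binary inside the environment
pip install "cmake>=3.23"
#`which cmake` should now point into the environment
cmake --version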

3.1 Download DeepSeek-R1-Q4_K_M

Download the GGUF files

###Download method 1 (recommended)
#Create a Python virtual environment with conda
root@ksp-registry:/opt/code_repos/AI_models# conda create -n self-llm python=3.12
root@ksp-registry:/opt/code_repos/AI_models# conda activate self-llm

(self-llm) root@ksp-registry:/opt/code_repos/AI_models# mkdir DeepSeek-R1-Q4_K_M
(self-llm) root@ksp-registry:/opt/code_repos/AI_models# cd DeepSeek-R1-Q4_K_M
(self-llm) root@ksp-registry:/opt/code_repos/AI_models/DeepSeek-R1-Q4_K_M# vi download-DeepSeek-R1-Q4_K_M.sh
#!/bin/bash

for i in $(seq 1 9); do
aria2c -s 16 -x 16 https://modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/resolve/master/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-0000${i}-of-00009.gguf
done

#After all 9 shards finish downloading, they take roughly 400 GB in total (each shard is about 45 GB)
(self-llm) root@ksp-registry:/opt/code_repos/AI_models/DeepSeek-R1-Q4_K_M# apt-get update && apt -qy install aria2
(self-llm) root@ksp-registry:/opt/code_repos/AI_models/DeepSeek-R1-Q4_K_M# bash download-DeepSeek-R1-Q4_K_M.sh
###Download method 2 (this also downloads the DeepSeek-R1-Zero-Q4_K_M-xxx files, which are not needed)
(self-llm) root@ksp-registry:/opt/code_repos/AI_models# pip install modelscope
(self-llm) root@ksp-registry:/opt/code_repos/AI_models# vi download.py
from modelscope import snapshot_download
snapshot_download(
    repo_id = "unsloth/DeepSeek-R1-GGUF",
    local_dir = "DeepSeek-R1-GGUF",
    allow_patterns = ["*Q4_K_M*"], # Select quant type Q4_K_M
)

###This Python script downloads all 18 files under the "https://www.modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/files/DeepSeek-R1-Q4_K_M" directory, i.e. not only "DeepSeek-R1-Q4_K_M-0000X-of-00009.gguf" but also "DeepSeek-R1-Zero-Q4_K_M-0000X-of-00009.gguf"
(self-llm) root@ksp-registry:/opt/code_repos/AI_models# python download.py
Downloading Model to directory: /opt/code_repos/AI_models/DeepSeek-R1-GGUF
2025-03-17 10:50:41,631 - modelscope - INFO - Got 18 files, start to download ...
Processing 18 items: 0%| | 0.00/18.0 [00:00<?, ?it/s]
Downloading [DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf]: 1%|▎ | 322M/45.0G [01:05<2:22:31, 5.61MB/s]

Download the config files

root@ksp-registry:/opt/code_repos/AI_models/DeepSeek-R1-GGUF# wget https://modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/resolve/master/config.json

root@ksp-registry:/opt/code_repos/AI_models/DeepSeek-R1-GGUF# wget https://modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/resolve/master/.gitattributes

root@ksp-registry:/opt/code_repos/AI_models/DeepSeek-R1-GGUF# wget https://modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/resolve/master/configuration.json

root@ksp-registry:/opt/code_repos/AI_models/DeepSeek-R1-GGUF# wget https://modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/resolve/master/README.md

Run the model with llama.cpp (optional)

Reference: https://www.modelscope.cn/models/unsloth/DeepSeek-R1-GGUF

####The following approach is risky (it may overwrite the existing cmake 3.16.3), so it was not executed
#root@ksp-registry:/opt/installPkgs# cd cmake-3.23.0/
#Check the build environment and generate the Makefile (--prefix=/path sets the install path, default /usr/local)
#root@ksp-registry:/opt/installPkgs/cmake-3.23.0# ./configure
#Build
#root@ksp-registry:/opt/installPkgs/cmake-3.23.0# make -j8
#Install
#root@ksp-registry:/opt/installPkgs/cmake-3.23.0# make install
##Register the newly installed cmake via update-alternatives
#update-alternatives --install /usr/bin/cmake cmake /usr/local/bin/cmake 1 --force
apt-get update
apt-get install build-essential curl libcurl4-openssl-dev -y

###The steps below are optional
(self-llm) root@ksp-registry:~# cd /opt/code_repos/
(self-llm) root@ksp-registry:~# git clone https://github.com/ggerganov/llama.cpp

cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split

root@ksp-registry:/opt/code_repos# ll llama.cpp/build/bin/llama-*
-rwxr-xr-x 1 root root 451135880 Mar 17 11:13 llama.cpp/build/bin/llama-cli*
-rwxr-xr-x 1 root root 449090696 Mar 17 11:13 llama.cpp/build/bin/llama-gguf-split*
-rwxr-xr-x 1 root root 449629880 Mar 17 11:12 llama.cpp/build/bin/llama-quantize*
#Copy the three built binaries into the llama.cpp directory
cp llama.cpp/build/bin/llama-* llama.cpp
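The block above only builds the binaries; actually loading the model would look roughly like the following. This is a sketch only: the -ngl value is an assumption to be tuned to the available GPU memory, and llama.cpp picks up the remaining shards automatically when pointed at the first one (the path assumes the layout used later in this document).

#Run a single prompt against the Q4_K_M weights
./llama.cpp/llama-cli \
  -m /opt/code_repos/AI_models/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf \
  -ngl 8 \
  -p "Hello, please introduce yourself."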

3.2 Install base components and dependencies

3.2.1 NVIDIA driver and CUDA

#The NVIDIA driver is already installed
(self-llm) root@ksp-registry:/opt/code_repos# nvidia-smi
Mon Mar 17 14:53:50 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A40 Off | 00000000:C1:00.0 Off | 0 |
| 0% 33C P8 21W / 300W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
#Install CUDA 12.4
(self-llm) root@ksp-registry:/opt/code_repos# ll /opt/nvidia-driver-cuda-for-A40/
total 4650004
drwxr-xr-x 2 root root 4096 Feb 14 14:31 ./
drwxr-xr-x 8 root root 4096 Mar 17 10:25 ../
-rw-r--r-- 1 root root 4454730420 Mar 29 2024 cuda_12.4.1_550.54.15_linux.run
-rwxrwxrwx 1 root root 306858135 May 17 2024 NVIDIA-Linux-x86_64-550.54.15.run*
#CUDA is already installed as well
(self-llm) root@ksp-registry:/opt/code_repos# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

#Edit /root/.bashrc and append the following
(self-llm) root@ksp-registry:/opt/code_repos# vi /root/.bashrc
export PATH=/usr/local/cuda-12.4/bin/:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH
export CUDA_PATH=/usr/local/cuda
(self-llm) root@ksp-registry:/opt/code_repos# source /root/.bashrc

3.2.2 Install build tools

apt-get update
apt-get install gcc g++ ninja-build

#Check the gcc version
(self-llm) root@ksp-registry:/opt/code_repos# gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#Check the g++ version
(self-llm) root@ksp-registry:/opt/code_repos# g++ --version
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#Check the cmake version
(self-llm) root@ksp-registry:/opt/code_repos# cmake --version
cmake version 3.23.0

CMake suite maintained and supported by Kitware (kitware.com/cmake).

#Check the ninja version
(self-llm) root@ksp-registry:/opt/code_repos# ninja --version
1.10.0
#Install base packages
(self-llm) root@ksp-registry:/opt/code_repos# apt install build-essential curl libcurl4-openssl-dev -y

3.3 Create a dedicated Python virtual environment for KTransformers

conda create --name kt python=3.11
conda activate kt

3.4 Install KTransformers

KTransformers GitHub repository: https://github.com/kvcache-ai/ktransformers/tree/v0.2.3post2

Installation guide: the official KTransformers install documentation (https://kvcache-ai.github.io/ktransformers/en/install.html)

Pre-installation preparation

(kt) root@ksp-registry:/opt/code_repos# git clone -b v0.2.3post2 https://gitee.com/sy-jiang/ktransformers.git

#Make sure the GNU C++ standard library used by this Python virtual environment provides GLIBCXX_3.4.32
#conda provides a package named libstdcxx-ng that ships a newer libstdc++; it can be installed from conda-forge
(kt) root@ksp-registry:/opt/code_repos# conda install -c conda-forge libstdcxx-ng
(kt) root@ksp-registry:/opt/code_repos# strings ~/anaconda3/envs/kt/lib/libstdc++.so.6 | grep GLIBCXX

#Install PyTorch, packaging, ninja, etc.
(kt) root@ksp-registry:/opt/code_repos# pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu124
(kt) root@ksp-registry:/opt/code_repos# pip install packaging ninja cpufeature numpy flash-attn

Initialize the source code

#init source code
cd ktransformers
(kt) root@ksp-registry:/opt/code_repos/ktransformers# git submodule init
(kt) root@ksp-registry:/opt/code_repos/ktransformers# git submodule update
#The llama.cpp and pybind11 directories below were just populated by the submodule update
(kt) root@ksp-registry:/opt/code_repos/ktransformers# ll third_party/
total 20
drwxr-xr-x 5 root root 4096 Mar 18 10:41 ./
drwxr-xr-x 9 root root 4096 Mar 18 10:50 ../
drwxr-xr-x 24 root root 4096 Mar 18 10:58 llama.cpp/
drwxr-xr-x 2 root root 4096 Mar 18 10:41 llamafile/
drwxr-xr-x 8 root root 4096 Mar 18 10:58 pybind11/

Build kt-website

Reference: https://kvcache-ai.github.io/ktransformers/en/api/server/website.html

#Node.js >= 18.3 is required
#If nodejs was already installed from the default Ubuntu 20.04 repositories, its version is too old and it must be removed first
#apt-get remove nodejs npm -y && sudo apt-get autoremove -y
apt-get update -y && apt-get install -y apt-transport-https ca-certificates curl gnupg

curl -fsSL https://deb.nodesource.com/gpgkey/nodesource-repo.gpg.key | gpg --dearmor -o /usr/share/keyrings/nodesource.gpg

chmod 644 /usr/share/keyrings/nodesource.gpg

echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/nodesource.gpg] https://deb.nodesource.com/node_23.x nodistro main" | sudo tee /etc/apt/sources.list.d/nodesource.list

apt-get update -y
apt-get install nodejs -y

#Check the nodejs and npm versions
(kt) root@ksp-registry:/opt/code_repos/ktransformers# node -v
v23.10.0
(kt) root@ksp-registry:/opt/code_repos/ktransformers# npm -v
10.9.2

#Install the Vue CLI and build the website
(kt) root@ksp-registry:/opt/code_repos/ktransformers/ktransformers/website# npm install @vue/cli
(kt) root@ksp-registry:/opt/code_repos/ktransformers/ktransformers/website# npm run build

#The following can be skipped for now; a later step performs the same installation
#Build and install ktransformers with the website included
(kt) root@ksp-registry:/opt/code_repos/ktransformers/ktransformers/website# cd ../../
(kt) root@ksp-registry:/opt/code_repos/ktransformers# pip install .
#Inspect the installed ktransformers package
(kt) root@ksp-registry:/opt/code_repos/ktransformers# pip show ktransformers
Name: ktransformers
Version: 0.2.3.post2
Summary: KTransformers, pronounced as Quick Transformers, is designed to enhance your Transformers experience with advanced kernel optimizations and placement/parallelism strategies.
Home-page: https://kvcache.ai
Author:
Author-email: "KVCache.AI" <zhang.mingxing@outlook.com>
License: Apache License
...

Install KT

#1) For servers with a dual-socket CPU and RAM larger than twice the size of the model files
(kt) root@ksp-registry:/opt/code_repos/ktransformers# apt install libnuma-dev
###(kt) root@ksp-registry:/opt/code_repos/ktransformers# export USE_NUMA=1 #Do not run this; after I set it, memory was exhausted when the model service was started later
(kt) root@ksp-registry:/opt/code_repos/ktransformers# bash install.sh # or #make dev_install

#2) Otherwise, simply run the following
(kt) root@ksp-registry:/opt/code_repos/ktransformers# bash install.sh
[Screenshot: install.sh output]

Check that KTransformers installed successfully: pip show ktransformers

[Screenshot: pip show ktransformers output]

3.5 Local Chat

Reference: https://github.com/kvcache-ai/ktransformers/blob/v0.2.3post2/doc/zh/DeepseekR1_V3_tutorial_zh.md#v02-%E5%B1%95%E7%A4%BA

[Screenshot: demo from the official v0.2 tutorial]

3.5.1 Start the local chat

(kt) root@ksp-registry:/opt/code_repos/ktransformers# cp ./ktransformers/models/configuration_deepseek.py /opt/code_repos/AI_models/DeepSeek-R1-GGUF
(kt) root@ksp-registry:/opt/code_repos/ktransformers# cp ./ktransformers/models/configuration_deepseek_v3.py /opt/code_repos/AI_models/DeepSeek-R1-GGUF

##Run the following command from the KT repository root
(kt) root@ksp-registry:/opt/code_repos/ktransformers# python ./ktransformers/local_chat.py --model_path deepseek-ai/DeepSeek-R1 \
--gguf_path /opt/code_repos/AI_models/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M \
--cpu_infer 36 --max_new_tokens 8192 --port 10002 --web True
#During startup, the process can be seen using the GPU
[Screenshot: GPU utilization during startup]

Finally, the chat window appears:

[Screenshot: the chat window]

During loading and after it completes, RAM usage stays low:

[Screenshot: low memory usage]

GPU memory usage is comparatively high, but still far from full:

[Screenshot: GPU memory usage]

3.5.2 Errors and fixes

3.5.2.1 couldn't connect to 'https://huggingface.co'

The first time, I ran python ./ktransformers/local_chat.py --model_path unsloth/DeepSeek-R1-GGUF --gguf_path /opt/code_repos/AI_models/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M --cpu_infer 36 --max_new_tokens 8192, which produced the following error:

“OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like unsloth/DeepSeek-R1-GGUF is not the path to a directory containing a file named config.json.

Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.”

Details:

[Screenshot: the connection error]

Solution:

#See https://github.com/huggingface/diffusers/issues/6223
#Set the mirror endpoint
(kt) root@ksp-registry:/opt/code_repos/ktransformers# export HF_ENDPOINT=https://hf-mirror.com
#Run again
(kt) root@ksp-registry:/opt/code_repos/ktransformers# python ./ktransformers/local_chat.py --model_path unsloth/DeepSeek-R1-GGUF --gguf_path /opt/code_repos/AI_models/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M --cpu_infer 36 --max_new_tokens 8192
#It still fails, but with a different error: "OSError: unsloth/DeepSeek-R1-GGUF does not appear to have a file named configuration_deepseek.py. Checkout 'https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main' for available files."
[Screenshot: configuration_deepseek.py not found]
#The "DeepSeek-R1-Q4_K_M" files I downloaded earlier from "https://modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/files" do not include the configuration_deepseek.py file mentioned in the error (it is needed while starting the chat). Following the KT official examples, "--model_path deepseek-ai/DeepSeek-R1" has to be specified instead; with that, configuration_deepseek.py is downloaded automatically and execution continues
(kt) root@ksp-registry:/opt/code_repos/ktransformers# python ./ktransformers/local_chat.py --model_path deepseek-ai/DeepSeek-R1 \
--gguf_path /opt/code_repos/AI_models/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M \
--cpu_infer 36 --max_new_tokens 8192
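Optionally, the small config and tokenizer files that --model_path deepseek-ai/DeepSeek-R1 pulls in can be pre-fetched so that startup no longer depends on network access. This is a hypothetical helper using huggingface_hub (with HF_ENDPOINT pointing at the mirror as set above), not one of the original steps:

#Download only the *.json / *.py files of deepseek-ai/DeepSeek-R1, not the weights
python -c "from huggingface_hub import snapshot_download; snapshot_download('deepseek-ai/DeepSeek-R1', allow_patterns=['*.json', '*.py'])"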

3.5.2.2 Out of memory, process killed

(kt) root@ksp-registry:/opt/code_repos/ktransformers# export USE_NUMA=1
(kt) root@ksp-registry:/opt/code_repos/ktransformers# bash install.sh # or #make dev_install
(kt) root@ksp-registry:/opt/code_repos/ktransformers# python ./ktransformers/local_chat.py --model_path deepseek-ai/DeepSeek-R1 \
--gguf_path /opt/code_repos/AI_models_/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M \
--cpu_infer 32 --max_new_tokens 8192

Meanwhile, the available memory can be seen dropping rapidly.

This server has only 503 GB of RAM in total; while loading the weights, the process was eventually killed due to insufficient memory:

[Screenshot: process killed due to insufficient memory]
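To watch the memory being consumed while the weights load, something like the following can be run in a second terminal (my own addition, not part of the original log):

#Refresh physical and swap memory usage every 2 seconds
watch -n 2 free -h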

# Create an 800 GB swap file (the OS system disk on this server is an SSD, so the space is carved directly out of the system disk)
(kt) root@ksp-registry:/# fallocate -l 800G /opt/code_repos/test_swap/swapfile
(kt) root@ksp-registry:/# chmod 600 /opt/code_repos/test_swap/swapfile
#Initialize the file as swap space
(kt) root@ksp-registry:/# mkswap /opt/code_repos/test_swap/swapfile
#Enable the swap file
(kt) root@ksp-registry:/# swapon /opt/code_repos/test_swap/swapfile
#Check physical and swap memory
root@ksp-registry:~# free -h
total used free shared buff/cache available
Mem: 503Gi 4.7Gi 226Gi 8.0Mi 272Gi 496Gi
Swap: 799Gi 0B 799Gi

#To remove the swap space later:
#swapoff /opt/code_repos/test_swap/swapfile
#rm /opt/code_repos/test_swap/swapfile
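If the swap file should survive a reboot, it can also be registered in /etc/fstab (optional; assumes the path used above):

#Make the swap file persistent across reboots
echo '/opt/code_repos/test_swap/swapfile none swap sw 0 0' >> /etc/fstab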
##Run the following from the KT repository root; with "--web True" the website is started as well, and "--port 10002" sets the web access port
(kt) root@ksp-registry:/opt/code_repos/ktransformers# python ./ktransformers/local_chat.py --model_path deepseek-ai/DeepSeek-R1 \
--gguf_path /opt/code_repos/AI_models_/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M \
--cpu_infer 32 --max_new_tokens 8192 --port 10002 --web True
###In my experiment this still failed: loading reached layer 28 and was then killed for lack of memory

[Screenshot: loading killed at layer 28]

Solution:

#This step is crucial
(kt) root@ksp-registry:/opt/code_repos/ktransformers# unset USE_NUMA
#Reinstall KTransformers
(kt) root@ksp-registry:/opt/code_repos/ktransformers# bash install.sh # or #make dev_install
[Screenshot: the chat window appears after reinstalling]

3.5.2.3 Install flashinfer (optional)

Every time the model service starts, it prints: flashinfer not found, use triton for linux using custom modeling_xxx.py.

flashinfer is a kernel library for accelerating LLM serving. By providing efficient, memory-bandwidth-friendly shared-prefix batched decoding, it significantly improves the performance of self-attention. FlashInfer supports several GPU architectures, including sm80, sm86, sm89, and sm90, with support for sm75 and sm70 under development.

GitHub repository: https://github.com/flashinfer-ai/flashinfer.git

Official installation guide: https://docs.flashinfer.ai/installation.html

Install via pip
pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4/
#Alternatively, download the wheel and install it locally
wget https://github.com/flashinfer-ai/flashinfer/releases/download/v0.2.0.post1/flashinfer-0.2.0.post1+cu124torch2.4-cp311-cp311-linux_x86_64.whl
pip install flashinfer-0.2.0.post1+cu124torch2.4-cp311-cp311-linux_x86_64.whl

###But then it complains that flashinfer does not have attribute mla; the following fix was found (https://github.com/kvcache-ai/ktransformers/issues/792):
# replace pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3 with:
# install JIT version:
pip install flashinfer-python
conda install cuda-nvcc -c nvidia
export CUDA_HOME=$CONDA_PREFIX
export TORCH_CUDA_ARCH_LIST="8.0+PTX"
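After either install route, a quick sanity check (my own addition, assuming flashinfer exposes __version__) is to import the package from the kt environment:

#Confirm flashinfer can be imported and report its version
python -c "import flashinfer; print(flashinfer.__version__)"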
Install from source

See the official documentation: https://docs.flashinfer.ai/installation.html#install-from-source

git clone -b v0.2.1.post1 https://github.com/flashinfer-ai/flashinfer.git --recursive
