Deploying DeepSeek-R1-Q4_K_M with KTransformers

1. References and Notes

KTransformers is a project initiated by Tsinghua University. It exploits the MoE architecture of the DeepSeek models by loading the expert weights into system memory and assigning the corresponding computation to the CPU, while keeping the MLA attention and KV cache on the GPU, which enables hybrid CPU+GPU inference. This approach minimizes GPU memory usage while still maintaining reasonable inference speed. The goal of the KTransformers project is to make local deployment of large models practical on limited hardware, so that more people can run models on their own machines that were previously out of reach.

1.1 References

  • Reference articles:

    • KTransformers GitHub repository: https://github.com/kvcache-ai/ktransformers/tree/v0.2.3post2

    • Installation guide: the official KTransformers install documentation (https://kvcache-ai.github.io/ktransformers/en/install.html)

    • https://mp.weixin.qq.com/s/1keAGOQlkTf_dKrzWmCRZQ

    • https://mp.weixin.qq.com/s/C4aTsxzYGV7bFrKyx6juug

    • https://kq4b3vgg5b.feishu.cn/wiki/QJ5ywpjnvieTKZk5kPHcG3sLnkd

  • Model download page:

    • https://modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/files

  • Model file download link:

    • https://modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/resolve/master/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf

Only the DeepSeek-R1-Q4_K_M quantized version needs to be downloaded: https://www.modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/feedback/issueDetail/23220

1.2 Notes

The first deployment attempt was on a CentOS 8 x86_64 physical server and kept hitting errors; the errors are described in Chapter 2. That machine belongs to another colleague, who was later advised to create an Ubuntu 22.04 container on the physical server and work inside it instead.

The second attempt was done on my own Ubuntu 20.04 LTS x86_64 physical server and succeeded.

2. Troubleshooting on CentOS 8

2.1 With torch 2.6.0, running "pip install ." in the KT repository root fails with "Read timed out"

[Screenshot: "Read timed out" error during pip install]

#Switch to torch 2.4.1 instead

pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu124

#Then point pip at the Tsinghua mirror (KTransformers is a Tsinghua-led open-source project):

sh install.sh
##pip install . -i https://pypi.tuna.tsinghua

But the following errors appeared (at that point the installed CUDA version was cuda_12.6.r12.6/compiler.34431801_0 and the torch version was 2.4.1+cu124):

The first message says the detected CUDA 12.6 toolkit does not match the CUDA version that the installed PyTorch 2.4.1 was built against (which should be 12.4). In most cases this is not a serious problem, so it is only a warning.

The second line says there is no g++ version bound defined for CUDA 12.6.

[Screenshots: CUDA/PyTorch version mismatch warnings]
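As a quick sanity check (my own addition, not from the original log), the system CUDA toolkit and the CUDA build of the installed PyTorch can be compared like this:

#CUDA toolkit version seen by nvcc
nvcc --version | grep release
#CUDA version the installed torch wheel was built against, plus whether the GPU is visible
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"

If the two differ only in the minor release (e.g. 12.6 vs 12.4), the extension usually still builds; a larger mismatch is worth fixing first.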

Next, CUDA 12.4 + torch 2.4.1 were installed and sh install.sh was run again; g++ version and cmake related errors still appeared.

[Screenshot: g++ and cmake related errors]

For the g++ version errors, one option is to upgrade gcc and g++, then rerun sh install.sh and check whether the errors persist; a possible approach is sketched below.
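On CentOS 8, one common way to get a newer gcc/g++ is the gcc-toolset AppStream packages. This is only a sketch of that idea and was not verified in this deployment; the toolset version number is an assumption and depends on what the repositories actually provide.

#Install and temporarily activate a newer toolchain (adjust the version to what `dnf list "gcc-toolset-*"` offers)
dnf install -y gcc-toolset-11
scl enable gcc-toolset-11 bash
#Confirm the active compiler versions before rerunning sh install.sh
gcc --version && g++ --version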

3. Deployment on Ubuntu 20.04

3.0 Upgrade cmake

#Upgrade cmake (from 3.16.3 to 3.23.0)
#(Note: the bin/cmake symlink below assumes the downloaded tarball contains a usable cmake binary; the plain source tarball normally has to be configured and built first, as in the commented-out build-from-source block later in this document.)
root@ksp-registry:/opt/installPkgs# wget https://cmake.org/files/v3.23/cmake-3.23.0.tar.gz
root@ksp-registry:/opt/installPkgs# tar -zxvf cmake-3.23.0.tar.gz
root@ksp-registry:/opt/installPkgs# cp -rp cmake-3.23.0 /usr/share/cmake-3.23.0
root@ksp-registry:/opt/installPkgs# ln -sf /usr/share/cmake-3.23.0/bin/cmake /usr/bin/cmake
root@ksp-registry:/opt/installPkgs# cmake --version
cmake version 3.23.0

CMake suite maintained and supported by Kitware (kitware.com/cmake).
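A lower-risk alternative (not used here) is to install a recent cmake into the active Python environment with pip, which leaves the system cmake 3.16.3 untouched:

#The PyPI "cmake" package ships a prebuilt cmake binary inside the environment
pip install "cmake>=3.23"
#`which cmake` should now point into the environment
cmake --version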

3.1 Download DeepSeek-R1-Q4_K_M

Download the GGUF files

###Download method 1 (recommended)
#Create a Python virtual environment with conda
root@ksp-registry:/opt/code_repos/AI_models# conda create -n self-llm python=3.12
root@ksp-registry:/opt/code_repos/AI_models# conda activate self-llm

(self-llm) root@ksp-registry:/opt/code_repos/AI_models# mkdir DeepSeek-R1-Q4_K_M
(self-llm) root@ksp-registry:/opt/code_repos/AI_models# cd DeepSeek-R1-Q4_K_M
(self-llm) root@ksp-registry:/opt/code_repos/AI_models/DeepSeek-R1-Q4_K_M# vi download-DeepSeek-R1-Q4_K_M.sh
#!/bin/bash

for i in $(seq 1 9); do
aria2c -s 16 -x 16 https://modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/resolve/master/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-0000${i}-of-00009.gguf
done

#After all 9 shards finish downloading, they take roughly 400 GB in total (each shard is about 45 GB)
(self-llm) root@ksp-registry:/opt/code_repos/AI_models/DeepSeek-R1-Q4_K_M# apt-get update && apt -qy install aria2
(self-llm) root@ksp-registry:/opt/code_repos/AI_models/DeepSeek-R1-Q4_K_M# bash download-DeepSeek-R1-Q4_K_M.sh
###Download method 2 (this also downloads the DeepSeek-R1-Zero-Q4_K_M-xxx files, which are not needed)
(self-llm) root@ksp-registry:/opt/code_repos/AI_models# pip install modelscope
(self-llm) root@ksp-registry:/opt/code_repos/AI_models# vi download.py
from modelscope import snapshot_download
snapshot_download(
    repo_id = "unsloth/DeepSeek-R1-GGUF",
    local_dir = "DeepSeek-R1-GGUF",
    allow_patterns = ["*Q4_K_M*"], # Select quant type Q4_K_M
)

###This Python script downloads all 18 files under the "https://www.modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/files/DeepSeek-R1-Q4_K_M" directory, i.e. not only "DeepSeek-R1-Q4_K_M-0000X-of-00009.gguf" but also "DeepSeek-R1-Zero-Q4_K_M-0000X-of-00009.gguf"
(self-llm) root@ksp-registry:/opt/code_repos/AI_models# python download.py
Downloading Model to directory: /opt/code_repos/AI_models/DeepSeek-R1-GGUF
2025-03-17 10:50:41,631 - modelscope - INFO - Got 18 files, start to download ...
Processing 18 items: 0%| | 0.00/18.0 [00:00<?, ?it/s]
Downloading [DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf]: 1%|▎ | 322M/45.0G [01:05<2:22:31, 5.61MB/s]

Download the config files

root@ksp-registry:/opt/code_repos/AI_models/DeepSeek-R1-GGUF# wget https://modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/resolve/master/config.json

root@ksp-registry:/opt/code_repos/AI_models/DeepSeek-R1-GGUF# wget https://modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/resolve/master/.gitattributes

root@ksp-registry:/opt/code_repos/AI_models/DeepSeek-R1-GGUF# wget https://modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/resolve/master/configuration.json

root@ksp-registry:/opt/code_repos/AI_models/DeepSeek-R1-GGUF# wget https://modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/resolve/master/README.md

Run the model with llama.cpp (optional)

Reference: https://www.modelscope.cn/models/unsloth/DeepSeek-R1-GGUF

####The following approach is risky (it may overwrite the existing cmake 3.16.3), so it was not executed
#root@ksp-registry:/opt/installPkgs# cd cmake-3.23.0/
#Check the build environment and generate the Makefile (--prefix=/path sets the install path, default /usr/local)
#root@ksp-registry:/opt/installPkgs/cmake-3.23.0# ./configure
#Build
#root@ksp-registry:/opt/installPkgs/cmake-3.23.0# make -j8
#Install
#root@ksp-registry:/opt/installPkgs/cmake-3.23.0# make install
##Register the newly installed cmake via update-alternatives
#update-alternatives --install /usr/bin/cmake cmake /usr/local/bin/cmake 1 --force
apt-get update
apt-get install build-essential curl libcurl4-openssl-dev -y

###The steps below are optional
(self-llm) root@ksp-registry:~# cd /opt/code_repos/
(self-llm) root@ksp-registry:~# git clone https://github.com/ggerganov/llama.cpp

cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split

root@ksp-registry:/opt/code_repos# ll llama.cpp/build/bin/llama-*
-rwxr-xr-x 1 root root 451135880 Mar 17 11:13 llama.cpp/build/bin/llama-cli*
-rwxr-xr-x 1 root root 449090696 Mar 17 11:13 llama.cpp/build/bin/llama-gguf-split*
-rwxr-xr-x 1 root root 449629880 Mar 17 11:12 llama.cpp/build/bin/llama-quantize*
#Copy the three built binaries into the llama.cpp directory
cp llama.cpp/build/bin/llama-* llama.cpp
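The block above only builds the binaries; actually loading the model would look roughly like the following. This is a sketch only: the -ngl value is an assumption to be tuned to the available GPU memory, and llama.cpp picks up the remaining shards automatically when pointed at the first one (the path assumes the layout used later in this document).

#Run a single prompt against the Q4_K_M weights
./llama.cpp/llama-cli \
  -m /opt/code_repos/AI_models/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf \
  -ngl 8 \
  -p "Hello, please introduce yourself."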

3.2 Install base components and dependencies

3.2.1 NVIDIA driver and CUDA

#The NVIDIA driver is already installed
(self-llm) root@ksp-registry:/opt/code_repos# nvidia-smi
Mon Mar 17 14:53:50 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A40 Off | 00000000:C1:00.0 Off | 0 |
| 0% 33C P8 21W / 300W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
#Install CUDA 12.4
(self-llm) root@ksp-registry:/opt/code_repos# ll /opt/nvidia-driver-cuda-for-A40/
total 4650004
drwxr-xr-x 2 root root 4096 Feb 14 14:31 ./
drwxr-xr-x 8 root root 4096 Mar 17 10:25 ../
-rw-r--r-- 1 root root 4454730420 Mar 29 2024 cuda_12.4.1_550.54.15_linux.run
-rwxrwxrwx 1 root root 306858135 May 17 2024 NVIDIA-Linux-x86_64-550.54.15.run*
#CUDA is already installed as well
(self-llm) root@ksp-registry:/opt/code_repos# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

#Edit /root/.bashrc and append the following
(self-llm) root@ksp-registry:/opt/code_repos# vi /root/.bashrc
export PATH=/usr/local/cuda-12.4/bin/:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH
export CUDA_PATH=/usr/local/cuda
(self-llm) root@ksp-registry:/opt/code_repos# source /root/.bashrc

3.2.2 Install build tools

apt-get update
apt-get install gcc g++ ninja-build

#Check the gcc version
(self-llm) root@ksp-registry:/opt/code_repos# gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#Check the g++ version
(self-llm) root@ksp-registry:/opt/code_repos# g++ --version
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#Check the cmake version
(self-llm) root@ksp-registry:/opt/code_repos# cmake --version
cmake version 3.23.0

CMake suite maintained and supported by Kitware (kitware.com/cmake).

#Check the ninja version
(self-llm) root@ksp-registry:/opt/code_repos# ninja --version
1.10.0
#Install base packages
(self-llm) root@ksp-registry:/opt/code_repos# apt install build-essential curl libcurl4-openssl-dev -y

3.3 Create a dedicated Python virtual environment for KTransformers

conda create --name kt python=3.11
conda activate kt

3.4 Install KTransformers

KTransformers GitHub repository: https://github.com/kvcache-ai/ktransformers/tree/v0.2.3post2

Installation guide: the official KTransformers install documentation (https://kvcache-ai.github.io/ktransformers/en/install.html)

Pre-installation preparation

(kt) root@ksp-registry:/opt/code_repos# git clone -b v0.2.3post2 https://gitee.com/sy-jiang/ktransformers.git

#Make sure the GNU C++ standard library used by this Python virtual environment provides GLIBCXX_3.4.32
#conda provides a package named libstdcxx-ng that ships a newer libstdc++; it can be installed from conda-forge
(kt) root@ksp-registry:/opt/code_repos# conda install -c conda-forge libstdcxx-ng
(kt) root@ksp-registry:/opt/code_repos# strings ~/anaconda3/envs/kt/lib/libstdc++.so.6 | grep GLIBCXX

#Install PyTorch, packaging, ninja, etc.
(kt) root@ksp-registry:/opt/code_repos# pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu124
(kt) root@ksp-registry:/opt/code_repos# pip install packaging ninja cpufeature numpy flash-attn

Initialize the source code

#init source code
cd ktransformers
(kt) root@ksp-registry:/opt/code_repos/ktransformers# git submodule init
(kt) root@ksp-registry:/opt/code_repos/ktransformers# git submodule update
#The llama.cpp and pybind11 directories below were just populated by the submodule update
(kt) root@ksp-registry:/opt/code_repos/ktransformers# ll third_party/
total 20
drwxr-xr-x 5 root root 4096 Mar 18 10:41 ./
drwxr-xr-x 9 root root 4096 Mar 18 10:50 ../
drwxr-xr-x 24 root root 4096 Mar 18 10:58 llama.cpp/
drwxr-xr-x 2 root root 4096 Mar 18 10:41 llamafile/
drwxr-xr-x 8 root root 4096 Mar 18 10:58 pybind11/

Build kt-website

Reference: https://kvcache-ai.github.io/ktransformers/en/api/server/website.html

#Node.js >= 18.3 is required
#If nodejs was already installed from the default Ubuntu 20.04 repositories, its version is too old and it must be removed first
#apt-get remove nodejs npm -y && sudo apt-get autoremove -y
apt-get update -y && apt-get install -y apt-transport-https ca-certificates curl gnupg

curl -fsSL https://deb.nodesource.com/gpgkey/nodesource-repo.gpg.key | gpg --dearmor -o /usr/share/keyrings/nodesource.gpg

chmod 644 /usr/share/keyrings/nodesource.gpg

echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/nodesource.gpg] https://deb.nodesource.com/node_23.x nodistro main" | sudo tee /etc/apt/sources.list.d/nodesource.list

apt-get update -y
apt-get install nodejs -y

#Check the nodejs and npm versions
(kt) root@ksp-registry:/opt/code_repos/ktransformers# node -v
v23.10.0
(kt) root@ksp-registry:/opt/code_repos/ktransformers# npm -v
10.9.2

#Install the Vue CLI and build the website
(kt) root@ksp-registry:/opt/code_repos/ktransformers/ktransformers/website# npm install @vue/cli
(kt) root@ksp-registry:/opt/code_repos/ktransformers/ktransformers/website# npm run build

#The following can be skipped for now; a later step performs the same installation
#Build and install ktransformers with the website included
(kt) root@ksp-registry:/opt/code_repos/ktransformers/ktransformers/website# cd ../../
(kt) root@ksp-registry:/opt/code_repos/ktransformers# pip install .
#Inspect the installed ktransformers package
(kt) root@ksp-registry:/opt/code_repos/ktransformers# pip show ktransformers
Name: ktransformers
Version: 0.2.3.post2
Summary: KTransformers, pronounced as Quick Transformers, is designed to enhance your Transformers experience with advanced kernel optimizations and placement/parallelism strategies.
Home-page: https://kvcache.ai
Author:
Author-email: "KVCache.AI" <zhang.mingxing@outlook.com>
License: Apache License
...

Install KT

#1) For servers with a dual-socket CPU and RAM larger than twice the size of the model files
(kt) root@ksp-registry:/opt/code_repos/ktransformers# apt install libnuma-dev
###(kt) root@ksp-registry:/opt/code_repos/ktransformers# export USE_NUMA=1 #Do not run this; after I set it, memory was exhausted when the model service was started later
(kt) root@ksp-registry:/opt/code_repos/ktransformers# bash install.sh # or #make dev_install

#2) Otherwise, simply run the following
(kt) root@ksp-registry:/opt/code_repos/ktransformers# bash install.sh
[Screenshot: install.sh output]

Check that KTransformers installed successfully: pip show ktransformers

[Screenshot: pip show ktransformers output]

3.5 Local Chat

Reference: https://github.com/kvcache-ai/ktransformers/blob/v0.2.3post2/doc/zh/DeepseekR1_V3_tutorial_zh.md#v02-%E5%B1%95%E7%A4%BA

[Screenshot: demo from the official v0.2 tutorial]

3.5.1 Start the local chat

(kt) root@ksp-registry:/opt/code_repos/ktransformers# cp ./ktransformers/models/configuration_deepseek.py /opt/code_repos/AI_models/DeepSeek-R1-GGUF
(kt) root@ksp-registry:/opt/code_repos/ktransformers# cp ./ktransformers/models/configuration_deepseek_v3.py /opt/code_repos/AI_models/DeepSeek-R1-GGUF

##Run the following command from the KT repository root
(kt) root@ksp-registry:/opt/code_repos/ktransformers# python ./ktransformers/local_chat.py --model_path deepseek-ai/DeepSeek-R1 \
--gguf_path /opt/code_repos/AI_models/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M \
--cpu_infer 36 --max_new_tokens 8192 --port 10002 --web True
#During startup, the process can be seen using the GPU
[Screenshot: GPU utilization during startup]

Finally, the chat window appears:

[Screenshot: the chat window]

During loading and after it completes, RAM usage stays low:

[Screenshot: low memory usage]

GPU memory usage is comparatively high, but still far from full:

[Screenshot: GPU memory usage]

3.5.2 Errors and fixes

3.5.2.1 couldn't connect to 'https://huggingface.co'

The first time, I ran python ./ktransformers/local_chat.py --model_path unsloth/DeepSeek-R1-GGUF --gguf_path /opt/code_repos/AI_models/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M --cpu_infer 36 --max_new_tokens 8192, which produced the following error:

“OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like unsloth/DeepSeek-R1-GGUF is not the path to a directory containing a file named config.json.

Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.”

Details:

[Screenshot: the connection error]

Solution:

#See https://github.com/huggingface/diffusers/issues/6223
#Set the mirror endpoint
(kt) root@ksp-registry:/opt/code_repos/ktransformers# export HF_ENDPOINT=https://hf-mirror.com
#Run again
(kt) root@ksp-registry:/opt/code_repos/ktransformers# python ./ktransformers/local_chat.py --model_path unsloth/DeepSeek-R1-GGUF --gguf_path /opt/code_repos/AI_models/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M --cpu_infer 36 --max_new_tokens 8192
#It still fails, but with a different error: "OSError: unsloth/DeepSeek-R1-GGUF does not appear to have a file named configuration_deepseek.py. Checkout 'https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main' for available files."
[Screenshot: configuration_deepseek.py not found]
#The "DeepSeek-R1-Q4_K_M" files I downloaded earlier from "https://modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/files" do not include the configuration_deepseek.py file mentioned in the error (it is needed while starting the chat). Following the KT official examples, "--model_path deepseek-ai/DeepSeek-R1" has to be specified instead; with that, configuration_deepseek.py is downloaded automatically and execution continues
(kt) root@ksp-registry:/opt/code_repos/ktransformers# python ./ktransformers/local_chat.py --model_path deepseek-ai/DeepSeek-R1 \
--gguf_path /opt/code_repos/AI_models/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M \
--cpu_infer 36 --max_new_tokens 8192
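Optionally, the small config and tokenizer files that --model_path deepseek-ai/DeepSeek-R1 pulls in can be pre-fetched so that startup no longer depends on network access. This is a hypothetical helper using huggingface_hub (with HF_ENDPOINT pointing at the mirror as set above), not one of the original steps:

#Download only the *.json / *.py files of deepseek-ai/DeepSeek-R1, not the weights
python -c "from huggingface_hub import snapshot_download; snapshot_download('deepseek-ai/DeepSeek-R1', allow_patterns=['*.json', '*.py'])"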

3.5.2.2 Out of memory, process killed

(kt) root@ksp-registry:/opt/code_repos/ktransformers# export USE_NUMA=1
(kt) root@ksp-registry:/opt/code_repos/ktransformers# bash install.sh # or #make dev_install
(kt) root@ksp-registry:/opt/code_repos/ktransformers# python ./ktransformers/local_chat.py --model_path deepseek-ai/DeepSeek-R1 \
--gguf_path /opt/code_repos/AI_models_/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M \
--cpu_infer 32 --max_new_tokens 8192

Meanwhile, the available memory can be seen dropping rapidly.

This server has only 503 GB of RAM in total; while loading the weights, the process was eventually killed due to insufficient memory:

[Screenshot: process killed due to insufficient memory]
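To watch the memory being consumed while the weights load, something like the following can be run in a second terminal (my own addition, not part of the original log):

#Refresh physical and swap memory usage every 2 seconds
watch -n 2 free -h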

# Create an 800 GB swap file (the OS system disk on this server is an SSD, so the space is carved directly out of the system disk)
(kt) root@ksp-registry:/# fallocate -l 800G /opt/code_repos/test_swap/swapfile
(kt) root@ksp-registry:/# chmod 600 /opt/code_repos/test_swap/swapfile
#Initialize the file as swap space
(kt) root@ksp-registry:/# mkswap /opt/code_repos/test_swap/swapfile
#Enable the swap file
(kt) root@ksp-registry:/# swapon /opt/code_repos/test_swap/swapfile
#Check physical and swap memory
root@ksp-registry:~# free -h
total used free shared buff/cache available
Mem: 503Gi 4.7Gi 226Gi 8.0Mi 272Gi 496Gi
Swap: 799Gi 0B 799Gi

#To remove the swap space later:
#swapoff /opt/code_repos/test_swap/swapfile
#rm /opt/code_repos/test_swap/swapfile
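If the swap file should survive a reboot, it can also be registered in /etc/fstab (optional; assumes the path used above):

#Make the swap file persistent across reboots
echo '/opt/code_repos/test_swap/swapfile none swap sw 0 0' >> /etc/fstab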
##Run the following from the KT repository root; with "--web True" the website is started as well, and "--port 10002" sets the web access port
(kt) root@ksp-registry:/opt/code_repos/ktransformers# python ./ktransformers/local_chat.py --model_path deepseek-ai/DeepSeek-R1 \
--gguf_path /opt/code_repos/AI_models_/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M \
--cpu_infer 32 --max_new_tokens 8192 --port 10002 --web True
###In my experiment this still failed: loading reached layer 28 and was then killed for lack of memory

[Screenshot: loading killed at layer 28]

Solution:

#This step is crucial
(kt) root@ksp-registry:/opt/code_repos/ktransformers# unset USE_NUMA
#Reinstall KTransformers
(kt) root@ksp-registry:/opt/code_repos/ktransformers# bash install.sh # or #make dev_install
[Screenshot: the chat window appears after reinstalling]

3.5.2.3 Install flashinfer (optional)

Every time the model service starts, it prints: flashinfer not found, use triton for linux using custom modeling_xxx.py.

flashinfer is a kernel library for accelerating LLM serving. By providing efficient, memory-bandwidth-friendly shared-prefix batched decoding, it significantly improves the performance of self-attention. FlashInfer supports several GPU architectures, including sm80, sm86, sm89, and sm90, with support for sm75 and sm70 under development.

GitHub repository: https://github.com/flashinfer-ai/flashinfer.git

Official installation guide: https://docs.flashinfer.ai/installation.html

Install via pip
pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4/
#Alternatively, download the wheel and install it locally
wget https://github.com/flashinfer-ai/flashinfer/releases/download/v0.2.0.post1/flashinfer-0.2.0.post1+cu124torch2.4-cp311-cp311-linux_x86_64.whl
pip install flashinfer-0.2.0.post1+cu124torch2.4-cp311-cp311-linux_x86_64.whl

###But then it complains that flashinfer does not have attribute mla; the following fix was found (https://github.com/kvcache-ai/ktransformers/issues/792):
# replace pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3 with:
# install JIT version:
pip install flashinfer-python
conda install cuda-nvcc -c nvidia
export CUDA_HOME=$CONDA_PREFIX
export TORCH_CUDA_ARCH_LIST="8.0+PTX"
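After either install route, a quick sanity check (my own addition, assuming flashinfer exposes __version__) is to import the package from the kt environment:

#Confirm flashinfer can be imported and report its version
python -c "import flashinfer; print(flashinfer.__version__)"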
Install from source

See the official documentation: https://docs.flashinfer.ai/installation.html#install-from-source

git clone -b v0.2.1.post1 https://github.com/flashinfer-ai/flashinfer.git --recursive
