1. TensorFlow and CUDA compatibility table
https://tensorflow.google.cn/install/source?hl=en#tested_build_configurations
References: https://blog.csdn.net/FL1768317420/article/details/134840200 ,
https://www.weiyeji.com/2025/wsl-tensorflow-gpu/
| TensorFlow Version | Python Version | Compiler     | Build Tools | cuDNN | CUDA |
|--------------------|----------------|--------------|-------------|-------|------|
| 2.15.0             | 3.9-3.11       | Clang 16.0.0 | Bazel 6.1.0 | 8.8   | 12.2 |
| 2.14.0             | 3.9-3.11       | Clang 16.0.0 | Bazel 6.1.0 | 8.7   | 11.8 |
| 2.13.0             | 3.8-3.11       | Clang 16.0.0 | Bazel 5.3.0 | 8.6   | 11.8 |
| 2.12.0             | 3.8-3.11       | GCC 9.3.1    | Bazel 5.3.0 | 8.6   | 11.8 |
| 2.11.0             | 3.7-3.10       | GCC 9.3.1    | Bazel 5.3.0 | 8.1   | 11.2 |
| 2.10.0             | 3.7-3.10       | GCC 9.3.1    | Bazel 5.1.1 | 8.1   | 11.2 |
| 2.9.0              | 3.7-3.10       | GCC 9.3.1    | Bazel 5.0.0 | 8.1   | 11.2 |
| 2.8.0              | 3.7-3.10       | GCC 7.3.1    | Bazel 4.2.1 | 8.1   | 11.2 |
| 2.7.0              | 3.7-3.9        | GCC 7.3.1    | Bazel 3.7.2 | 8.1   | 11.2 |
| 2.6.0              | 3.6-3.9        | GCC 7.3.1    | Bazel 3.7.2 | 8.1   | 11.2 |
| 2.5.0              | 3.6-3.9        | GCC 7.3.1    | Bazel 3.7.2 | 8.1   | 11.2 |
| 2.4.0              | 3.6-3.8        | GCC 7.3.1    | Bazel 3.1.0 | 8.0   | 11.0 |
| 2.3.0              | 3.5-3.8        | GCC 7.3.1    | Bazel 3.1.0 | 7.6   | 10.1 |
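The table can also be consulted programmatically before picking versions. The following is a minimal sketch, not part of the original setup: it copies only the five newest rows of the table above, and the names `TESTED_CONFIGS`, `python_supported`, and `tf_versions_for_cuda` are hypothetical helpers introduced for illustration.

```python
# Subset of the tested-build table above:
# TensorFlow version -> (Python range, cuDNN, CUDA). Values copied from the table.
TESTED_CONFIGS = {
    "2.15.0": ("3.9-3.11", "8.8", "12.2"),
    "2.14.0": ("3.9-3.11", "8.7", "11.8"),
    "2.13.0": ("3.8-3.11", "8.6", "11.8"),
    "2.12.0": ("3.8-3.11", "8.6", "11.8"),
    "2.11.0": ("3.7-3.10", "8.1", "11.2"),
}


def _major_minor(version: str) -> tuple[int, int]:
    """Turn '3.9' or '3.9.19' into (3, 9) so versions compare numerically."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def python_supported(tf_version: str, python_version: str) -> bool:
    """Return True if python_version falls inside the tested range for tf_version."""
    low, high = TESTED_CONFIGS[tf_version][0].split("-")
    return _major_minor(low) <= _major_minor(python_version) <= _major_minor(high)


def tf_versions_for_cuda(cuda_version: str) -> list[str]:
    """List the TensorFlow versions in the table that were tested with cuda_version."""
    return [tf for tf, (_, _, cuda) in TESTED_CONFIGS.items() if cuda == cuda_version]


if __name__ == "__main__":
    print(python_supported("2.14.0", "3.9.19"))  # True
    print(tf_versions_for_cuda("11.8"))          # ['2.14.0', '2.13.0', '2.12.0']
```

For the combination used later in this post (TensorFlow 2.14.0, Python 3.9.19, CUDA 11.8) both checks pass.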
2. Modifying a container started from the original image
Starting from the image 175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example:v3, perform the following steps.
    # Create the container test01
    docker run --name=test01 \
      --gpus all --runtime=nvidia \
      -it \
      --entrypoint=/bin/bash \
      175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example:v3

    # Now inside the container
    root@59cbbed5325b:~#

    root@controller01:~# nvidia-smi
    Mon Jun 30 06:22:35 2025
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA A40                     Off |   00000000:41:00.0 Off |                    0 |
    |  0%   39C    P8             14W /  300W |       3MiB /  46068MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+

    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |  No running processes found                                                             |
    +-----------------------------------------------------------------------------------------+
All of the following steps are performed inside container test01:
    # Install pip3
    root@59cbbed5325b:/# apt-get update
    root@59cbbed5325b:/# apt install python3-pip
    root@59cbbed5325b:/# pip3 install pip --upgrade
    # Uninstall any old TensorFlow version (if present)
    root@59cbbed5325b:/# pip uninstall tensorflow tensorflow-gpu
    # Install CUDA 11.8
    root@59cbbed5325b:/# bash cuda_11.8.0_520.61.05_linux.run
    root@59cbbed5325b:/# vi /root/.bashrc
    # Add the following lines
    export CUDA_HOME=/usr/local/cuda
    export PATH=/usr/local/cuda-11.8/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64

    # Reload /root/.bashrc
    root@59cbbed5325b:/# source /root/.bashrc

    # Verify the CUDA version
    root@59cbbed5325b:/# nvcc -V
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2022 NVIDIA Corporation
    Built on Wed_Sep_21_10:33:58_PDT_2022
    Cuda compilation tools, release 11.8, V11.8.89
    Build cuda_11.8.r11.8/compiler.31833905_0
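When this check is scripted (for example in an image build), the release number can be extracted from the `nvcc -V` text instead of being read by eye. A minimal sketch, assuming the output keeps the `release X.Y` wording shown above; `parse_nvcc_release` is a hypothetical helper name, and `SAMPLE` is the output captured in this post.

```python
import re

# Output of `nvcc -V` as captured above.
SAMPLE = """nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
"""


def parse_nvcc_release(nvcc_output: str) -> str:
    """Return the 'X.Y' toolkit release found in the text printed by `nvcc -V`."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' field found in nvcc output")
    return match.group(1)


if __name__ == "__main__":
    release = parse_nvcc_release(SAMPLE)
    print(release)  # 11.8
    # TensorFlow 2.14.0 is listed against CUDA 11.8 in the table in section 1.
    print("matches TF 2.14.0:", release == "11.8")
```

In a real script the text would come from something like `subprocess.run(["nvcc", "-V"], capture_output=True, text=True).stdout`.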
    # Use Anaconda to create a Python 3.9 virtual environment
    bash Anaconda3-2024.06-1-Linux-x86_64.sh
    conda create -n py3.9 python==3.9.19
    conda activate py3.9
    # Install TensorFlow 2.14.0 (supports CUDA 11.8)
    (py3.9) root@59cbbed5325b:/# pip install tensorflow==2.14.0
    # Install libcudnn
    (py3.9) root@59cbbed5325b:/# pip install nvidia-cudnn-cu11

    # Install cuDNN
    # Extract the downloaded archive (download page: https://developer.nvidia.com/cudnn-archive)
    # Specifically, the entry "cuDNN 8.x - 1.x (December 2023 - August 2014)"
    (py3.9) root@59cbbed5325b:/opt# tar -xJf cudnn-linux-x86_64-8.7.0.84_cuda11-archive.tar.xz

    # Copy the files into the CUDA installation directory
    (py3.9) root@59cbbed5325b:/opt# mv cudnn-linux-x86_64-8.7.0.84_cuda11-archive/ cudnn-linux-x86
    (py3.9) root@59cbbed5325b:/opt/install-cudnn# ll cudnn-linux-x86
    total 48
    drwxr-xr-x 4 25503 2174  4096 Nov 21  2022 ./
    drwxr-xr-x 3 root  root  4096 Jun 30 03:25 ../
    -rw-r--r-- 1 25503 2174 28994 Nov 21  2022 LICENSE
    drwxr-xr-x 2 25503 2174  4096 Nov 21  2022 include/
    drwxr-xr-x 2 25503 2174  4096 Nov 21  2022 lib/
    (py3.9) root@59cbbed5325b:/opt/install-cudnn# cp cudnn-linux-x86/include/cudnn*.h /usr/local/cuda/include
    (py3.9) root@59cbbed5325b:/opt/install-cudnn# cp -P cudnn-linux-x86/lib/libcudnn* /usr/local/cuda/lib64
    (py3.9) root@59cbbed5325b:/opt/install-cudnn# chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*

    # Check the cuDNN version
    (py3.9) root@59cbbed5325b:/opt/install-cudnn# cat /usr/local/cuda/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
    #define CUDNN_MAJOR 8
    #define CUDNN_MINOR 7
    #define CUDNN_PATCHLEVEL 0
    --
    #define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
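The last line of the header shows how cuDNN 8.x combines the three fields into a single number. A minimal sketch of the same arithmetic, assuming the header keeps the `#define CUDNN_*` layout shown above; `parse_cudnn_version` and `cudnn_version_number` are hypothetical helper names, and `HEADER_SNIPPET` is the grep output from this post.

```python
import re

# The lines printed by `grep CUDNN_MAJOR -A 2` above.
HEADER_SNIPPET = """#define CUDNN_MAJOR 8
#define CUDNN_MINOR 7
#define CUDNN_PATCHLEVEL 0
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
"""


def parse_cudnn_version(header_text: str) -> tuple[int, int, int]:
    """Extract (major, minor, patchlevel) from cudnn_version.h text."""
    values = []
    for field in ("MAJOR", "MINOR", "PATCHLEVEL"):
        # Tolerate optional whitespace between '#' and 'define'.
        match = re.search(rf"#\s*define\s+CUDNN_{field}\s+(\d+)", header_text)
        if match is None:
            raise ValueError(f"CUDNN_{field} not found")
        values.append(int(match.group(1)))
    return values[0], values[1], values[2]


def cudnn_version_number(major: int, minor: int, patch: int) -> int:
    """Same formula as the CUDNN_VERSION macro shown in the header."""
    return major * 1000 + minor * 100 + patch


if __name__ == "__main__":
    version = parse_cudnn_version(HEADER_SNIPPET)
    print(version)                         # (8, 7, 0)
    print(cudnn_version_number(*version))  # 8700
```

The result 8.7.0 agrees with the cuDNN 8.7 listed for TensorFlow 2.14.0 in section 1.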
    # Verify CUDA and cuDNN compatibility
    (py3.9) root@59cbbed5325b:/# python -c "
    import tensorflow as tf
    print('TensorFlow version:', tf.__version__)
    print('GPU is available?:', tf.config.list_physical_devices('GPU'))
    print('CUDA is supported?:', tf.test.is_built_with_cuda())
    "
    2025-06-30 03:33:34.678335: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
    2025-06-30 03:33:34.678411: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
    2025-06-30 03:33:34.678443: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
    2025-06-30 03:33:34.687982: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
    To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
    2025-06-30 03:33:36.040071: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
    TensorFlow version: 2.14.0
    GPU is available?: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
    CUDA is supported?: True
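3. Verification and testing inside a single container
The final script is gpu_dist_mnist-v6.py, which runs TensorFlow distributed training.
In container test01, start the PS (parameter server):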
    (py3.9) root@59cbbed5325b:/opt/tensorflow2.14.0# export TF_CONFIG='{"cluster": {"ps": ["localhost:2222"], "worker": ["localhost:2223"]}, "task": {"type": "ps", "index": 0}}'
    (py3.9) root@59cbbed5325b:/opt/tensorflow2.14.0# python gpu_dist_mnist-v6.py

    # Last few lines of the output (script messages translated from Chinese):
    ==================================================
    GPU and CUDA configuration check:
    ==================================================
    TensorFlow version: 2.14.0
    TensorFlow supports GPU: True
    CUDA available: True
    Number of available GPU devices: 1
    GPU 0: PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
     - Details: {'compute_capability': (8, 6), 'device_name': 'NVIDIA A40'}
    GPU memory growth mode enabled
    2025-06-30 07:03:14.830558: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 43598 MB memory: -> device: 0, name: NVIDIA A40, pci bus id: 0000:41:00.0, compute capability: 8.6
    All available devices:
     - /job:localhost/replica:0/task:0/device:CPU:0: CPU
     - /job:localhost/replica:0/task:0/device:GPU:0: GPU
    ==================================================
    tf_config: {'cluster': {'ps': ['localhost:2222'], 'worker': ['localhost:2223']}, 'task': {'type': 'ps', 'index': 0}}
    job name = ps
    task index = 0
    2025-06-30 07:03:15.331182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:ps/replica:0/task:0/device:GPU:0 with 43598 MB memory: -> device: 0, name: NVIDIA A40, pci bus id: 0000:41:00.0, compute capability: 8.6
    2025-06-30 07:03:15.346094: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:457] Started server with target: grpc://localhost:2222
At the same time, start the worker in another terminal window of container test01:
    (py3.9) root@59cbbed5325b:/opt/tensorflow2.14.0# export TF_CONFIG='{"cluster": {"ps": ["localhost:2222"], "worker": ["localhost:2223"]}, "task": {"type": "worker", "index": 0}}'
    (py3.9) root@59cbbed5325b:/opt/tensorflow2.14.0# python gpu_dist_mnist-v6.py --num_gpus=1 --train_steps=5000

    # Last few lines of the output:
    1751267105.262005: Worker 0: training step 4995 done (global step: 4994)
    1751267105.266335: Worker 0: training step 4996 done (global step: 4995)
    1751267105.270457: Worker 0: training step 4997 done (global step: 4996)
    1751267105.274166: Worker 0: training step 4998 done (global step: 4997)
    1751267105.278662: Worker 0: training step 4999 done (global step: 4998)
    1751267105.283150: Worker 0: training step 5000 done (global step: 4999)
    1751267105.287258: Worker 0: training step 5001 done (global step: 5000)
    Training ends @ 1751267105.287361
    Training elapsed time: 28.788610 s
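The PS and the worker run the same script and differ only in the `task` part of `TF_CONFIG`. The sketch below shows one way such a value can be built and read back with the standard `json` module. It is an illustration only and is not taken from gpu_dist_mnist-v6.py; `make_tf_config` and `describe_task` are hypothetical helper names.

```python
import json


def make_tf_config(ps: list[str], workers: list[str], task_type: str, index: int) -> str:
    """Build a TF_CONFIG JSON string with the same keys as the exports above."""
    return json.dumps({
        "cluster": {"ps": ps, "worker": workers},
        "task": {"type": task_type, "index": index},
    })


def describe_task(tf_config_json: str) -> tuple[str, int, str]:
    """Return (job name, task index, own address) for the process described by TF_CONFIG."""
    config = json.loads(tf_config_json)
    task_type = config["task"]["type"]
    index = int(config["task"]["index"])
    # The task's own address is the index-th entry of its job in the cluster spec.
    address = config["cluster"][task_type][index]
    return task_type, index, address


if __name__ == "__main__":
    ps_hosts = ["localhost:2222"]
    worker_hosts = ["localhost:2223"]
    for role in ("ps", "worker"):
        tf_config = make_tf_config(ps_hosts, worker_hosts, role, 0)
        print(describe_task(tf_config))
    # ('ps', 0, 'localhost:2222')
    # ('worker', 0, 'localhost:2223')
```

In a real process the string would be read from `os.environ["TF_CONFIG"]`; the PS output above ("job name = ps", "task index = 0", server on localhost:2222) corresponds to the first tuple.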
While the job is running, NVIDIA GPU usage can be checked from the host or from inside container test01. The following was captured on the host:
    # Two processes are using the NVIDIA GPU
    (base) root@controller01:/# nvidia-smi
    Mon Jun 30 15:04:51 2025
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA A40                     Off |   00000000:41:00.0 Off |                    0 |
    |  0%   72C    P0             94W /  300W |     547MiB /  46068MiB |      4%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+

    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |    0   N/A  N/A     59697      C   python                                        260MiB |
    |    0   N/A  N/A     63680      C   python                                        274MiB |
    +-----------------------------------------------------------------------------------------+
4. Saving the container as an image
    # Save container test01 as the image 175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example:v4
    root@controller01:~# docker commit -m "1)using python 3.9.19 in conda rather than py3.6.9 2)NVIDIA Driver Version: 550.54.15 3)cuda11.8 rather than 12.4 4)cudnn-linux-x86_64-8.7.0.84_cuda11 5)using tensorflow2.14.0 rather than tensorflow1.18.0" test01 175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example:v4

    # Note: the -m message above contains a typo; the upgrade was actually from tensorflow1.8.0 to tensorflow2.14.0
    # The v4 image is much larger than v3. Most of the added files are under /usr and /root/anaconda3.
    # The former holds the CUDA and cuDNN files; the latter mainly comes from the Python 3.9 virtual environment and tensorflow2.14.0
    root@controller01:~# docker images | grep "175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example"
    175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example   v4   3c83959b2e82   3 minutes ago   23.4GB
    175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example   v3   59b8c787e86a   12 months ago   1.53GB

    # Push to your own Harbor registry (requires the Harbor username and password)
    root@controller01:~# docker push 175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example:v4

    # Save to a local file
    root@controller01:/opt/172.20.0.21_backup/installPkgs/install-volcano/volcano-1.12/tensorflow2.14.0# docker save 175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example:v4 -o ubuntu18-dist-mnist-tf-example-v4.tar

    # List the files
    root@controller01:/opt/172.20.0.21_backup/installPkgs/install-volcano/volcano-1.12/tensorflow2.14.0# ls -alh
    total 23G
    drwxr-xr-x 2 root root 4.0K Jun 30 15:51 .
    drwxr-xr-x 5 root root 4.0K Jun 30 14:29 ..
    -rw-r--r-- 1 root root 2.2K Jun 30 10:43 gpu_diagnostic_script.sh
    -rw-r--r-- 1 root root  19K Jun 29 22:34 gpu_dist_mnist-v5.py
    -rw-r--r-- 1 root root  20K Jun 30 15:00 gpu_dist_mnist-v6-ok.py
    -rw-r--r-- 1 root root  20K Jun 30 15:29 gpu_dist_mnist-v6.py
    -rw-r--r-- 1 root root  134 Jun 30 14:33 README.txt
    -rw------- 1 root root  23G Jun 30 15:51 ubuntu18-dist-mnist-tf-example-v4.tar
5. Scheduling jobs with Volcano 1.10.0
5.1 CPU only
tf-dist-mnist-example-cpu-v2-2000times-4presentation.yaml:
    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    metadata:
      name: tensorflow-dist-mnist-cpu
    spec:
      minAvailable: 2
      schedulerName: volcano
      plugins:
        env: []
        svc: []
      policies:
        - event: PodEvicted
          action: RestartJob
      tasks:
        - replicas: 1
          name: ps
          template:
            spec:
              containers:
                - command:
                    - sh
                    - -c
                    - |
                      PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                      WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                      export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"ps\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};
                      PYTHONUNBUFFERED=1 python /var/tf_dist_mnist/dist_mnist.py --num_gpus=0
                  image: 175.6.40.93:8196/volcanosh/volcanosh/dist-mnist-tf-example:0.0.5
                  name: tensorflow
                  ports:
                    - containerPort: 2222
                      name: tfjob-port
                  resources: {}
              restartPolicy: Never
        - replicas: 1
          name: worker
          policies:
            - event: TaskCompleted
              action: CompleteJob
          template:
            spec:
              containers:
                - command:
                    - sh
                    - -c
                    - |
                      PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                      WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                      export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"worker\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};
                      python /var/tf_dist_mnist/dist_mnist.py --num_gpus=0 --train_steps=2000 --batch_size=10000
                  image: 175.6.40.93:8196/volcanosh/volcanosh/dist-mnist-tf-example:0.0.5
                  name: tensorflow
                  ports:
                    - containerPort: 2222
                      name: tfjob-port
                  resources: {}
              restartPolicy: Never
    # Submit the job:
    root@controller01:/opt/172.20.0.21_backup/installPkgs/install-volcano/volcano-1.10.0/example# kubectl create -f tf-dist-mnist-example-cpu-v2-2000times-4presentation.yaml
    job.batch.volcano.sh/tensorflow-dist-mnist-cpu created

    # List the pods
    root@controller01:/opt/172.20.0.21_backup/installPkgs/install-volcano/volcano-1.10.0/example# kubectl get pods -o wide
    NAME                                 READY   STATUS    RESTARTS   AGE   IP            NODE           NOMINATED NODE   READINESS GATES
    tensorflow-dist-mnist-cpu-ps-0       1/1     Running   0          8s    10.250.0.17   controller01   <none>           <none>
    tensorflow-dist-mnist-cpu-worker-0   1/1     Running   0          8s    10.250.1.27   ksp-registry   <none>           <none>
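The `sed`/`tr` pipeline in the job spec turns the host lists in /etc/volcano/ps.host and /etc/volcano/worker.host into quoted `host:2222` entries joined by commas, then pastes them into `TF_CONFIG`. The same transformation can be sketched in Python as below. This is an illustration, not what the job actually runs: `hosts_with_port` and `build_tf_config` are hypothetical helpers, and the host names in the example are made up. One difference worth noting is that `tr "\n" ","` would leave a trailing comma if a host file ended with a newline, whereas this version skips empty lines.

```python
import json


def hosts_with_port(host_file_text: str, port: int = 2222) -> list[str]:
    """Turn one-host-per-line text into ['host:port', ...], ignoring blank lines."""
    return [f"{line.strip()}:{port}" for line in host_file_text.splitlines() if line.strip()]


def build_tf_config(ps_text: str, worker_text: str, task_type: str, task_index: int) -> str:
    """Assemble the TF_CONFIG JSON that the shell snippet builds by string pasting."""
    return json.dumps({
        "cluster": {
            "ps": hosts_with_port(ps_text),
            "worker": hosts_with_port(worker_text),
        },
        "task": {"type": task_type, "index": task_index},
        "environment": "cloud",
    })


if __name__ == "__main__":
    # Hypothetical file contents; in the pod they would be read from
    # /etc/volcano/ps.host and /etc/volcano/worker.host.
    ps_text = "example-job-ps-0.example-job\n"
    worker_text = "example-job-worker-0.example-job\nexample-job-worker-1.example-job\n"
    # In the job spec the index comes from the VK_TASK_INDEX environment variable.
    print(build_tf_config(ps_text, worker_text, "worker", 1))
```

Building the value with `json.dumps` also avoids the layered shell escaping (`\"`) that the inline version needs.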
5.2 Using an NVIDIA GPU
The image used in the following file is large (23 GB).
tf-dist-mnist-example-vgpu-v2-2000times-4presentation.yaml:
    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    metadata:
      name: tensorflow-dist-mnist-vgpu-v2
    spec:
      minAvailable: 2
      schedulerName: volcano
      plugins:
        env: []
        svc: []
      policies:
        - event: PodEvicted
          action: RestartJob
      tasks:
        - replicas: 1
          name: ps
          template:
            spec:
              containers:
                - command:
                    - bash
                    - -c
                    - |
                      source /root/.bashrc && __conda_setup="$('/root/anaconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)" && eval "$__conda_setup" && conda activate py3.9;
                      export PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                      export WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                      export TF_CONFIG="{\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"ps\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"}";
                      echo $PS_HOST;
                      echo $WORKER_HOST;
                      echo $TF_CONFIG;
                      python /opt/tensorflow2.14.0/gpu_dist_mnist-v6.py
                  image: 175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example:v4
                  imagePullPolicy: IfNotPresent
                  name: tensorflow
                  ports:
                    - containerPort: 2222
                      name: tfjob-port
                  resources:
                    limits:
                      volcano.sh/gpu-number: 1
                      volcano.sh/gpu-memory: 3072
              restartPolicy: Never
              tolerations:
                - key: "node.kubernetes.io/disk-pressure"
                  operator: "Exists"
                  effect: "NoSchedule"
        - replicas: 1
          name: worker
          policies:
            - event: TaskCompleted
              action: CompleteJob
          template:
            spec:
              containers:
                - command:
                    - bash
                    - -c
                    - |
                      source /root/.bashrc && __conda_setup="$('/root/anaconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)" && eval "$__conda_setup" && conda activate py3.9;
                      export PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                      export WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                      export TF_CONFIG="{\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"worker\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"}";
                      echo $PS_HOST;
                      echo $WORKER_HOST;
                      echo $TF_CONFIG;
                      python /opt/tensorflow2.14.0/gpu_dist_mnist-v6.py --train_steps=2000 --batch_size=10000
                  image: 175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example:v4
                  imagePullPolicy: IfNotPresent
                  name: tensorflow
                  ports:
                    - containerPort: 2222
                      name: tfjob-port
                  resources:
                    limits:
                      volcano.sh/gpu-number: 1
                      volcano.sh/gpu-memory: 3072
              restartPolicy: Never
              tolerations:
                - key: "node.kubernetes.io/disk-pressure"
                  operator: "Exists"
                  effect: "NoSchedule"
    # Submit the job:
    root@controller01:/opt/172.20.0.21_backup/installPkgs/install-volcano/volcano-1.10.0/example# kubectl create -f tf-dist-mnist-example-vgpu-v2-2000times-4presentation.yaml
    job.batch.volcano.sh/tensorflow-dist-mnist-vgpu-v2 created

    # List the pods
    root@controller01:/opt/172.20.0.21_backup/installPkgs/install-volcano/volcano-1.10.0/example# kubectl get pods -o wide
    NAME                                     READY   STATUS        RESTARTS   AGE     IP            NODE           NOMINATED NODE   READINESS GATES
    tensorflow-dist-mnist-cpu-ps-0           1/1     Terminating   0          3m56s   10.250.0.17   controller01   <none>           <none>
    tensorflow-dist-mnist-cpu-worker-0       0/1     Completed     0          3m56s   10.250.1.27   ksp-registry   <none>           <none>
    tensorflow-dist-mnist-vgpu-v2-ps-0       1/1     Running       0          23s     10.250.0.18   controller01   <none>           <none>
    tensorflow-dist-mnist-vgpu-v2-worker-0   1/1     Running       0          23s     10.250.1.28   ksp-registry   <none>           <none>

    # On each k8s node running one of the pods, one process using the NVIDIA GPU is visible
    root@controller01:/opt/172.20.0.21_backup/installPkgs/install-volcano/volcano-1.10.0/example# nvidia-smi
    Tue Jul 1 15:28:15 2025
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA A40                     Off |   00000000:41:00.0 Off |                    0 |
    |  0%   72C    P0             94W /  300W |     268MiB /  46068MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+

    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |    0   N/A  N/A     30877      C   python                                        260MiB |
    +-----------------------------------------------------------------------------------------+