Upgrading from TensorFlow 1.8.0 to 2.14.0 with NVIDIA GPU support for distributed MNIST training

1. TensorFlow and CUDA compatibility table

https://tensorflow.google.cn/install/source?hl=en#tested_build_configurations

References: https://blog.csdn.net/FL1768317420/article/details/134840200 , https://www.weiyeji.com/2025/wsl-tensorflow-gpu/

TensorFlow Version	Python Version	Compiler	Build Tools	cuDNN	CUDA
2.15.0	3.9-3.11	Clang 16.0.0	Bazel 6.1.0	8.9	12.2
2.14.0	3.9-3.11	Clang 16.0.0	Bazel 6.1.0	8.7	11.8
2.13.0	3.8-3.11	Clang 16.0.0	Bazel 5.3.0	8.6	11.8
2.12.0	3.8-3.11	GCC 9.3.1	Bazel 5.3.0	8.6	11.8
2.11.0	3.7-3.10	GCC 9.3.1	Bazel 5.3.0	8.1	11.2
2.10.0	3.7-3.10	GCC 9.3.1	Bazel 5.1.1	8.1	11.2
2.9.0	3.7-3.10	GCC 9.3.1	Bazel 5.0.0	8.1	11.2
2.8.0	3.7-3.10	GCC 7.3.1	Bazel 4.2.1	8.1	11.2
2.7.0	3.7-3.9	GCC 7.3.1	Bazel 3.7.2	8.1	11.2
2.6.0	3.6-3.9	GCC 7.3.1	Bazel 3.7.2	8.1	11.2
2.5.0	3.6-3.9	GCC 7.3.1	Bazel 3.7.2	8.1	11.2
2.4.0	3.6-3.8	GCC 7.3.1	Bazel 3.1.0	8.0	11.0
2.3.0	3.5-3.8	GCC 7.3.1	Bazel 3.1.0	7.6	10.1
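
The table can also be consulted from a script. The following is a minimal sketch, not an official API: the dictionary `TESTED_CONFIGS` and the helper `required_cuda_stack` are hypothetical names introduced here, and only a few rows of the table are copied in.

```python
# Hypothetical lookup built from a few rows of the table above.
TESTED_CONFIGS = {
    # TensorFlow version: (Python range, cuDNN, CUDA)
    "2.14.0": ("3.9-3.11", "8.7", "11.8"),
    "2.13.0": ("3.8-3.11", "8.6", "11.8"),
    "2.12.0": ("3.8-3.11", "8.6", "11.8"),
    "2.11.0": ("3.7-3.10", "8.1", "11.2"),
    "2.4.0": ("3.6-3.8", "8.0", "11.0"),
    "2.3.0": ("3.5-3.8", "7.6", "10.1"),
}


def required_cuda_stack(tf_version):
    """Return (cuDNN, CUDA) for a TensorFlow version, or None if not listed."""
    row = TESTED_CONFIGS.get(tf_version)
    if row is None:
        return None
    _python_range, cudnn, cuda = row
    return cudnn, cuda


print(required_cuda_stack("2.14.0"))
```

A version that is missing from the dictionary returns `None`, which signals that the official table should be checked instead.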

2. Modifying a container based on the original image

Starting from the image 175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example:v3, perform the following steps.

# Create container test01 (--entrypoint must come before the image name)
docker run --name=test01 \
        --gpus all \
        --runtime=nvidia \
        -it \
        --entrypoint=/bin/bash \
        175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example:v3

# Inside the container
root@59cbbed5325b:~# nvidia-smi
Mon Jun 30 06:22:35 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A40                     Off |   00000000:41:00.0 Off |                    0 |
|  0%   39C    P8             14W /  300W |       3MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

All of the following steps are performed inside container test01:

# Install pip3
root@59cbbed5325b:/# apt-get update
root@59cbbed5325b:/# apt install python3-pip
root@59cbbed5325b:/# pip3 install pip --upgrade

# Uninstall any old TensorFlow version (if present)
root@59cbbed5325b:/# pip uninstall tensorflow tensorflow-gpu

# Install CUDA 11.8
root@59cbbed5325b:/# bash cuda_11.8.0_520.61.05_linux.run

root@59cbbed5325b:/# vi /root/.bashrc   # add the following lines
export CUDA_HOME=/usr/local/cuda
export PATH=/usr/local/cuda-11.8/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64:$LD_LIBRARY_PATH

# Reload /root/.bashrc
root@59cbbed5325b:/# source /root/.bashrc

# Verify the CUDA version
root@59cbbed5325b:/# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

# Create a Python 3.9 virtual environment with Anaconda
bash Anaconda3-2024.06-1-Linux-x86_64.sh
conda create -n py3.9 python==3.9.19
conda activate py3.9

# Install TensorFlow 2.14.0 (supports CUDA 11.8)
(py3.9) root@59cbbed5325b:/# pip install tensorflow==2.14.0

# Install libcudnn
(py3.9) root@59cbbed5325b:/# pip install nvidia-cudnn-cu11

# Install cuDNN
# Extract the downloaded archive (download page: https://developer.nvidia.com/cudnn-archive)
# It is listed under "cuDNN 8.x - 1.x (December 2023 - August 2014)"
(py3.9) root@59cbbed5325b:/opt# tar -xJf cudnn-linux-x86_64-8.7.0.84_cuda11-archive.tar.xz

# Copy the files into the CUDA installation directory
(py3.9) root@59cbbed5325b:/opt# mv cudnn-linux-x86_64-8.7.0.84_cuda11-archive/ cudnn-linux-x86
(py3.9) root@59cbbed5325b:/opt/install-cudnn# ll cudnn-linux-x86
total 48
drwxr-xr-x 4 25503 2174  4096 Nov 21  2022 ./
drwxr-xr-x 3 root  root  4096 Jun 30 03:25 ../
-rw-r--r-- 1 25503 2174 28994 Nov 21  2022 LICENSE
drwxr-xr-x 2 25503 2174  4096 Nov 21  2022 include/
drwxr-xr-x 2 25503 2174  4096 Nov 21  2022 lib/

(py3.9) root@59cbbed5325b:/opt/install-cudnn# cp cudnn-linux-x86/include/cudnn*.h /usr/local/cuda/include
(py3.9) root@59cbbed5325b:/opt/install-cudnn# cp -P cudnn-linux-x86/lib/libcudnn* /usr/local/cuda/lib64
(py3.9) root@59cbbed5325b:/opt/install-cudnn# chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*


# Check the cuDNN version
(py3.9) root@59cbbed5325b:/opt/install-cudnn# cat /usr/local/cuda/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 7
#define CUDNN_PATCHLEVEL 0
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
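
The same check can be scripted. The sketch below parses the three `#define` lines shown in the grep output; `parse_cudnn_version` is a hypothetical helper, and the sample text is copied from that output rather than read from disk. On the real system the text would come from /usr/local/cuda/include/cudnn_version.h.

```python
import re


def parse_cudnn_version(header_text):
    """Extract (major, minor, patch) from the text of cudnn_version.h."""
    fields = {}
    for name in ("MAJOR", "MINOR", "PATCHLEVEL"):
        # Match lines such as "#define CUDNN_MAJOR 8" at the start of a line.
        match = re.search(
            r"^#define\s+CUDNN_%s\s+(\d+)" % name, header_text, re.MULTILINE
        )
        if match is None:
            raise ValueError("CUDNN_%s not found" % name)
        fields[name] = int(match.group(1))
    return fields["MAJOR"], fields["MINOR"], fields["PATCHLEVEL"]


# Sample text copied from the grep output above.
sample_header = """#define CUDNN_MAJOR 8
#define CUDNN_MINOR 7
#define CUDNN_PATCHLEVEL 0
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
"""
print(parse_cudnn_version(sample_header))
```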

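# Verify that CUDA and cuDNN work together
## nvidia-smi reports CUDA 12.4, which is the highest version the driver supports; TensorFlow 2.14.0 uses the CUDA 11.8 toolkit and cuDNN 8.7 installed above. Check that they work together: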
(py3.9) root@59cbbed5325b:/# python -c "
import tensorflow as tf
print('TensorFlow version:', tf.__version__)
print('GPU is available?:', tf.config.list_physical_devices('GPU'))
print('CUDA is supported?:', tf.test.is_built_with_cuda())
"

### Output:
2025-06-30 03:33:34.678335: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-06-30 03:33:34.678411: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-06-30 03:33:34.678443: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-06-30 03:33:34.687982: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-06-30 03:33:36.040071: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
TensorFlow version: 2.14.0
GPU is available?: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
CUDA is supported?: True
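
To see which CUDA and cuDNN versions the installed wheel was built against, `tf.sysconfig.get_build_info()` can be queried in addition to the checks above. This is a minimal sketch: `describe_build` is a hypothetical helper, the exact keys present in the build info may differ between builds (hence the `.get` defaults), and the import is guarded so the snippet also runs where TensorFlow is absent.

```python
def describe_build(build_info):
    """Summarize the CUDA/cuDNN versions a TensorFlow binary was built against."""
    return "CUDA %s / cuDNN %s" % (
        build_info.get("cuda_version", "unknown"),
        build_info.get("cudnn_version", "unknown"),
    )


try:
    import tensorflow as tf
except ImportError:
    tf = None  # lets the sketch run on machines without TensorFlow

if tf is not None:
    print(describe_build(tf.sysconfig.get_build_info()))
    print("GPUs:", tf.config.list_physical_devices("GPU"))
else:
    print("TensorFlow is not installed in this environment")
```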

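3. Verification and testing inside a single container

The final script is gpu_dist_mnist-v6.py. To run TensorFlow distributed training in container test01, first start the PS server: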

(py3.9) root@59cbbed5325b:/opt/tensorflow2.14.0# export TF_CONFIG='{"cluster": {"ps": ["localhost:2222"], "worker": ["localhost:2223"]}, "task": {"type": "ps", "index": 0}}' 
(py3.9) root@59cbbed5325b:/opt/tensorflow2.14.0# python gpu_dist_mnist-v6.py


# Last few lines of the output (the script's Chinese messages are translated here):
==================================================
GPU and CUDA configuration check:
==================================================
TensorFlow version: 2.14.0
TensorFlow built with GPU support: True
CUDA available: True
Number of available GPU devices: 1
GPU 0: PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
  - Details: {'compute_capability': (8, 6), 'device_name': 'NVIDIA A40'}
GPU memory growth enabled
2025-06-30 07:03:14.830558: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 43598 MB memory:  -> device: 0, name: NVIDIA A40, pci bus id: 0000:41:00.0, compute capability: 8.6

All available devices:
  - /job:localhost/replica:0/task:0/device:CPU:0: CPU
  - /job:localhost/replica:0/task:0/device:GPU:0: GPU
==================================================
tf_config: {'cluster': {'ps': ['localhost:2222'], 'worker': ['localhost:2223']}, 'task': {'type': 'ps', 'index': 0}}
job name = ps
task index = 0
2025-06-30 07:03:15.331182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:ps/replica:0/task:0/device:GPU:0 with 43598 MB memory:  -> device: 0, name: NVIDIA A40, pci bus id: 0000:41:00.0, compute capability: 8.6
2025-06-30 07:03:15.346094: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:457] Started server with target: grpc://localhost:2222
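
The `job name` and `task index` lines in the output come from the `TF_CONFIG` environment variable. The source of gpu_dist_mnist-v6.py is not shown here, so the following is only a sketch of how such a script might read that variable; `read_task` is a hypothetical helper.

```python
import json
import os


def read_task(environ=os.environ):
    """Return (cluster, job_name, task_index) parsed from TF_CONFIG."""
    tf_config = json.loads(environ.get("TF_CONFIG", "{}"))
    cluster = tf_config.get("cluster", {})
    task = tf_config.get("task", {})
    return cluster, task.get("type"), task.get("index")


# Same value as the export command used for the PS server above.
example_env = {
    "TF_CONFIG": '{"cluster": {"ps": ["localhost:2222"], '
                 '"worker": ["localhost:2223"]}, '
                 '"task": {"type": "ps", "index": 0}}'
}
cluster, job_name, task_index = read_task(example_env)
print("job name =", job_name)
print("task index =", task_index)
```

Passing the environment as an argument keeps the function testable without touching the real process environment.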

Meanwhile, start the worker in another terminal window of container test01:

(py3.9) root@59cbbed5325b:/opt/tensorflow2.14.0# export TF_CONFIG='{"cluster": {"ps": ["localhost:2222"], "worker": ["localhost:2223"]}, "task": {"type": "worker", "index": 0}}' 
(py3.9) root@59cbbed5325b:/opt/tensorflow2.14.0# python gpu_dist_mnist-v6.py --num_gpus=1 --train_steps=5000  



# Last few lines of the output:
1751267105.262005: Worker 0: training step 4995 done (global step: 4994)
1751267105.266335: Worker 0: training step 4996 done (global step: 4995)
1751267105.270457: Worker 0: training step 4997 done (global step: 4996)
1751267105.274166: Worker 0: training step 4998 done (global step: 4997)
1751267105.278662: Worker 0: training step 4999 done (global step: 4998)
1751267105.283150: Worker 0: training step 5000 done (global step: 4999)
1751267105.287258: Worker 0: training step 5001 done (global step: 5000)
Training ends @ 1751267105.287361
Training elapsed time: 28.788610 s
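
The elapsed time can be turned into an average throughput figure. The sketch below assumes the `--train_steps=5000` value from the command above as the step count; `steps_per_second` is a hypothetical helper, and the sample log lines are copied from the output.

```python
import re


def steps_per_second(log_text, train_steps):
    """Compute average throughput from a 'Training elapsed time: N s' log line."""
    match = re.search(r"Training elapsed time: ([0-9.]+) s", log_text)
    if match is None:
        raise ValueError("elapsed-time line not found")
    return train_steps / float(match.group(1))


sample_log = (
    "Training ends @ 1751267105.287361\n"
    "Training elapsed time: 28.788610 s\n"
)
print("%.1f steps/s" % steps_per_second(sample_log, 5000))
```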

While training runs, NVIDIA GPU usage can be checked from either the host or container test01. The following was captured on the host:

# Two processes are using the NVIDIA GPU
(base) root@controller01:/# nvidia-smi 
Mon Jun 30 15:04:51 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A40                     Off |   00000000:41:00.0 Off |                    0 |
|  0%   72C    P0             94W /  300W |     547MiB /  46068MiB |      4%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     59697      C   python                                        260MiB |
|    0   N/A  N/A     63680      C   python                                        274MiB |
+-----------------------------------------------------------------------------------------+
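
The process list can also be collected in machine-readable form with `nvidia-smi --query-compute-apps`. The sketch below is hypothetical helper code: `parse_compute_apps` assumes every row has a numeric memory value and a process name without commas, and `list_gpu_processes` is only defined, not called, because it needs an NVIDIA driver.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse 'pid, name, MiB' rows into a list of (pid, name, used_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        pid, name, used = [field.strip() for field in line.split(",")]
        rows.append((int(pid), name, int(used)))
    return rows


def list_gpu_processes():
    """Run nvidia-smi and return its compute processes (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)


# Sample rows mirroring the two python processes in the table above.
sample = "59697, python, 260\n63680, python, 274\n"
print(parse_compute_apps(sample))
```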

4. Saving the container as an image

# Save container test01 as image 175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example:v4
root@controller01:~# docker commit -m "1)using python 3.9.19 in conda rather than py3.6.9  2)NVIDIA Driver Version: 550.54.15 3)cuda11.8 rather than 12.4  4)cudnn-linux-x86_64-8.7.0.84_cuda11   5)using tensorflow2.14.0 rather than tensorflow1.18.0" test01 175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example:v4
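# The -m value in the command above contains a typo: the upgrade is actually from tensorflow1.8.0 to tensorflow2.14.0

# The v4 image is much larger than v3. Most of the added files are under /usr and /root/anaconda3: the former holds the CUDA and cuDNN files, the latter mainly the Python 3.9 virtual environment and tensorflow2.14.0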
root@controller01:~# docker images | grep "175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example"
175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example     v4      3c83959b2e82   3 minutes ago   23.4GB
175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example     v3      59b8c787e86a   12 months ago   1.53GB
# Push to your own Harbor registry (requires the Harbor username and password)
root@controller01:~# docker push 175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example:v4

# Save the image to a local tar file
root@controller01:/opt/172.20.0.21_backup/installPkgs/install-volcano/volcano-1.12/tensorflow2.14.0# docker save 175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example:v4 -o ubuntu18-dist-mnist-tf-example-v4.tar
# List the files
root@controller01:/opt/172.20.0.21_backup/installPkgs/install-volcano/volcano-1.12/tensorflow2.14.0# ls -alh
total 23G
drwxr-xr-x 2 root root 4.0K Jun 30 15:51 .
drwxr-xr-x 5 root root 4.0K Jun 30 14:29 ..
-rw-r--r-- 1 root root 2.2K Jun 30 10:43 gpu_diagnostic_script.sh
-rw-r--r-- 1 root root  19K Jun 29 22:34 gpu_dist_mnist-v5.py
-rw-r--r-- 1 root root  20K Jun 30 15:00 gpu_dist_mnist-v6-ok.py
-rw-r--r-- 1 root root  20K Jun 30 15:29 gpu_dist_mnist-v6.py
-rw-r--r-- 1 root root  134 Jun 30 14:33 README.txt
-rw------- 1 root root  23G Jun 30 15:51 ubuntu18-dist-mnist-tf-example-v4.tar

5. Scheduling jobs with Volcano 1.10.0

5.1 CPU only

tf-dist-mnist-example-cpu-v2-2000times-4presentation.yaml：

# Using TensorFlow as an example, create a workload with 1 ps and 1 worker
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tensorflow-dist-mnist-cpu
spec:
  minAvailable: 2   # both pods of this job (1 ps + 1 worker) must be available
  schedulerName: volcano    # use volcano as the scheduler
  plugins:
    env: []
    svc: []
  policies:
    - event: PodEvicted # restart the job when a pod is evicted
      action: RestartJob
  tasks:
    - replicas: 1   # 1 ps pod
      name: ps
      template: # definition of the ps pod
        spec:
          containers:
            - command:
                - sh
                - -c
                - |
                  PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"ps\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};
                  PYTHONUNBUFFERED=1 python /var/tf_dist_mnist/dist_mnist.py --num_gpus=0
              image: 175.6.40.93:8196/volcanosh/volcanosh/dist-mnist-tf-example:0.0.5
              #image: 175.6.40.93:8196/volcanosh/volcanosh/dist-mnist-tf-example:0.0.1
              name: tensorflow
              ports:
                - containerPort: 2222
                  name: tfjob-port
              resources: {}
          restartPolicy: Never
    - replicas: 1   # 1 worker pod
      name: worker
      policies:
        - event: TaskCompleted  # the job is considered complete once the worker task completes
          action: CompleteJob
      template: # definition of the worker pod
        spec:
          containers:
            - command:
                - sh
                - -c
                - |
                  PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"worker\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};
                  python /var/tf_dist_mnist/dist_mnist.py --num_gpus=0 --train_steps=2000 --batch_size=10000
              image: 175.6.40.93:8196/volcanosh/volcanosh/dist-mnist-tf-example:0.0.5
              #image: 175.6.40.93:8196/volcanosh/volcanosh/dist-mnist-tf-example:0.0.1
              name: tensorflow
              ports:
                - containerPort: 2222
                  name: tfjob-port
              resources: {}
          restartPolicy: Never

# Run the job:
root@controller01:/opt/172.20.0.21_backup/installPkgs/install-volcano/volcano-1.10.0/example# kubectl create -f tf-dist-mnist-example-cpu-v2-2000times-4presentation.yaml                             
job.batch.volcano.sh/tensorflow-dist-mnist-cpu created

# List the pods
root@controller01:/opt/172.20.0.21_backup/installPkgs/install-volcano/volcano-1.10.0/example# kubectl get pods -o wide
NAME                                 READY   STATUS    RESTARTS   AGE   IP            NODE           NOMINATED NODE   READINESS GATES
tensorflow-dist-mnist-cpu-ps-0       1/1     Running   0          8s    10.250.0.17   controller01   <none>           <none>
tensorflow-dist-mnist-cpu-worker-0   1/1     Running   0          8s    10.250.1.27   ksp-registry   <none>           <none>
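
The shell pipeline in each container command turns the host files under /etc/volcano into the `TF_CONFIG` JSON string. The same transformation can be sketched in Python, which may be easier to follow. The helper names `build_tf_config` and `read_hosts` and the host names in the usage line are hypothetical; the sketch assumes one host name per line in the host files.

```python
import json


def build_tf_config(ps_hosts, worker_hosts, task_type, task_index, port=2222):
    """Build the TF_CONFIG JSON string from lists of host names."""
    def with_port(hosts):
        # Append ":2222" to every non-empty host, like the sed step does.
        return ["%s:%d" % (host, port) for host in hosts if host]

    return json.dumps({
        "cluster": {"ps": with_port(ps_hosts), "worker": with_port(worker_hosts)},
        "task": {"type": task_type, "index": task_index},
        "environment": "cloud",
    })


def read_hosts(path):
    """Read one host name per line, e.g. from /etc/volcano/ps.host."""
    with open(path) as handle:
        return handle.read().split()


# Hypothetical host names, for illustration only.
print(build_tf_config(["example-ps-0"], ["example-worker-0"], "worker", 0))
```

Inside a pod, the lists would come from `read_hosts("/etc/volcano/ps.host")` and `read_hosts("/etc/volcano/worker.host")`, and the index from the `VK_TASK_INDEX` environment variable used in the YAML above.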

5.2 Using the NVIDIA GPU

The image used in the following file is large: about 23 GB.

tf-dist-mnist-example-vgpu-v2-2000times-4presentation.yaml：

# Using TensorFlow as an example, create a workload with 1 ps and 1 worker
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tensorflow-dist-mnist-vgpu-v2
  #  annotations:
  #    volcano.sh/vgpu-mode: "hami-core"
spec:
  minAvailable: 2   # both pods of this job (1 ps + 1 worker) must be running or finished
  schedulerName: volcano    # use volcano as the scheduler
  plugins:
    env: []
    svc: []
  policies:
    - event: PodEvicted # restart the job when a pod is evicted
      action: RestartJob
  tasks:
    - replicas: 1   # 1 ps pod
      name: ps
      template: # definition of the ps pod
        spec:
          containers:
            - command:
                - bash
                - -c
                - |
                  source /root/.bashrc && __conda_setup="$('/root/anaconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)" && eval "$__conda_setup" && conda activate py3.9;
                  export PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  export WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  export TF_CONFIG="{\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"ps\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"}";
                  echo $PS_HOST;
                  echo $WORKER_HOST;
                  echo $TF_CONFIG;
                  python /opt/tensorflow2.14.0/gpu_dist_mnist-v6.py
              image: 175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example:v4
              #imagePullPolicy: Always
              imagePullPolicy: IfNotPresent
              name: tensorflow
              ports:
                - containerPort: 2222
                  name: tfjob-port
              resources:
                limits:
                  volcano.sh/gpu-number: 1 # requesting 1 GPU
                  volcano.sh/gpu-memory: 3072 # requesting 3072MB GPU memory
          restartPolicy: Never
          tolerations:
            - key: "node.kubernetes.io/disk-pressure"
              operator: "Exists"  # tolerate every disk-pressure taint (no value needed)
              effect: "NoSchedule"
    - replicas: 1  # 1 worker pod
      name: worker
      policies:
        - event: TaskCompleted  # the job is considered complete once the worker task completes
          action: CompleteJob
      template: # definition of the worker pod
        spec:
          containers:
            - command:
                - bash
                - -c
                - |
                  source /root/.bashrc && __conda_setup="$('/root/anaconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)" && eval "$__conda_setup" && conda activate py3.9;
                  export PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  export WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  export TF_CONFIG="{\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"worker\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"}";
                  echo $PS_HOST;
                  echo $WORKER_HOST;
                  echo $TF_CONFIG;
                  python /opt/tensorflow2.14.0/gpu_dist_mnist-v6.py --train_steps=2000 --batch_size=10000
              image: 175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example:v4
              #imagePullPolicy: Always
              imagePullPolicy: IfNotPresent
              name: tensorflow
              ports:
                - containerPort: 2222
                  name: tfjob-port
              resources:
                limits:
                  volcano.sh/gpu-number: 1 # requesting 1 GPU
                  volcano.sh/gpu-memory: 3072 # requesting 3072MB GPU memory
          restartPolicy: Never
          tolerations:
            - key: "node.kubernetes.io/disk-pressure"
              operator: "Exists"  # tolerate every disk-pressure taint (no value needed)
              effect: "NoSchedule"

# Run the job:
root@controller01:/opt/172.20.0.21_backup/installPkgs/install-volcano/volcano-1.10.0/example# kubectl create -f tf-dist-mnist-example-vgpu-v2-2000times-4presentation.yaml                             
job.batch.volcano.sh/tensorflow-dist-mnist-vgpu-v2 created

# List the pods
root@controller01:/opt/172.20.0.21_backup/installPkgs/install-volcano/volcano-1.10.0/example# kubectl get pods -o wide
NAME                                     READY   STATUS        RESTARTS   AGE     IP            NODE           NOMINATED NODE   READINESS GATES
tensorflow-dist-mnist-cpu-ps-0           1/1     Terminating   0          3m56s   10.250.0.17   controller01   <none>           <none>
tensorflow-dist-mnist-cpu-worker-0       0/1     Completed     0          3m56s   10.250.1.27   ksp-registry   <none>           <none>
tensorflow-dist-mnist-vgpu-v2-ps-0       1/1     Running       0          23s     10.250.0.18   controller01   <none>           <none>
tensorflow-dist-mnist-vgpu-v2-worker-0   1/1     Running       0          23s     10.250.1.28   ksp-registry   <none>           <none>

# On each k8s node where one of the pods runs, one process using the NVIDIA GPU is visible
root@controller01:/opt/172.20.0.21_backup/installPkgs/install-volcano/volcano-1.10.0/example# nvidia-smi 
Tue Jul  1 15:28:15 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A40                     Off |   00000000:41:00.0 Off |                    0 |
|  0%   72C    P0             94W /  300W |     268MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     30877      C   python                                        260MiB |
+-----------------------------------------------------------------------------------------+
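
Each container in the job requests 3072 MB through `volcano.sh/gpu-memory`, while the A40 reports 46068 MiB in total. As a rough illustration, the sketch below computes how many such requests fit on one card when memory is the only constraint. This is an assumption for illustration: `max_pods_by_memory` is a hypothetical helper, and any per-device sharing cap imposed by the device plugin is ignored.

```python
def max_pods_by_memory(total_mib, request_mib):
    """Upper bound on pods per GPU when memory requests are the only limit."""
    if request_mib <= 0:
        raise ValueError("request_mib must be positive")
    return total_mib // request_mib


# 46068 MiB is the A40 total from nvidia-smi; 3072 is the per-pod request.
print(max_pods_by_memory(46068, 3072))
```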

Original post: https://jiangsanyin.github.io/2025/06/29/从tensorflow1.8.0升级到2.14.0并支持使用NVIDIA GPU做mnist分布式训练/

Author: sanyinjiang

Published: June 29, 2025
