Upgrading from TensorFlow 1.8.0 to 2.14.0 with NVIDIA GPU support for distributed MNIST training

1. TensorFlow and CUDA version compatibility table

https://tensorflow.google.cn/install/source?hl=en#tested_build_configurations

References: https://blog.csdn.net/FL1768317420/article/details/134840200 and https://www.weiyeji.com/2025/wsl-tensorflow-gpu/

TensorFlow Version  Python Version  Compiler      Build Tools  cuDNN  CUDA
2.15.0              3.9-3.11        Clang 16.0.0  Bazel 6.1.0  8.8    12.2
2.14.0              3.9-3.11        Clang 16.0.0  Bazel 6.1.0  8.7    11.8
2.13.0              3.8-3.11        Clang 16.0.0  Bazel 5.3.0  8.6    11.8
2.12.0              3.8-3.11        GCC 9.3.1     Bazel 5.3.0  8.6    11.8
2.11.0              3.7-3.10        GCC 9.3.1     Bazel 5.3.0  8.1    11.2
2.10.0              3.7-3.10        GCC 9.3.1     Bazel 5.1.1  8.1    11.2
2.9.0               3.7-3.10        GCC 9.3.1     Bazel 5.0.0  8.1    11.2
2.8.0               3.7-3.10        GCC 7.3.1     Bazel 4.2.1  8.1    11.2
2.7.0               3.7-3.9         GCC 7.3.1     Bazel 3.7.2  8.1    11.2
2.6.0               3.6-3.9         GCC 7.3.1     Bazel 3.7.2  8.1    11.2
2.5.0               3.6-3.9         GCC 7.3.1     Bazel 3.7.2  8.1    11.2
2.4.0               3.6-3.8         GCC 7.3.1     Bazel 3.1.0  8.0    11.0
2.3.0               3.5-3.8         GCC 7.3.1     Bazel 3.1.0  7.6    10.1
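
The lookup this table is used for can be sketched in a few lines of Python. The dictionary copies only some rows from the table above, and the helper name `required_gpu_stack` is hypothetical (it is not part of TensorFlow).

```python
# A few rows copied from the table above: TensorFlow version -> (cuDNN, CUDA).
TESTED_CONFIGS = {
    "2.15.0": ("8.8", "12.2"),
    "2.14.0": ("8.7", "11.8"),
    "2.13.0": ("8.6", "11.8"),
    "2.12.0": ("8.6", "11.8"),
    "2.11.0": ("8.1", "11.2"),
}


def required_gpu_stack(tf_version):
    """Return the (cuDNN, CUDA) pair listed for tf_version (hypothetical helper)."""
    try:
        return TESTED_CONFIGS[tf_version]
    except KeyError:
        raise ValueError(f"{tf_version} is not in the copied subset of the table") from None


cudnn, cuda = required_gpu_stack("2.14.0")
print(f"TensorFlow 2.14.0 -> cuDNN {cudnn}, CUDA {cuda}")
```

For the target version of this post, the lookup gives cuDNN 8.7 and CUDA 11.8, which are the versions installed in the steps that follow.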

2. Modifying a container started from the original image

Starting from the image 175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example:v3, perform the following steps.

# Create container test01 (--entrypoint must come before the image name)
docker run --name=test01 \
--gpus all \
--runtime=nvidia \
-it \
--entrypoint=/bin/bash \
175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example:v3

# Inside the container, check that the GPU is visible
root@59cbbed5325b:~# nvidia-smi
Mon Jun 30 06:22:35 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A40 Off | 00000000:41:00.0 Off | 0 |
| 0% 39C P8 14W / 300W | 3MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+

All of the following steps are performed inside container test01:

# Install pip3
root@59cbbed5325b:/# apt-get update
root@59cbbed5325b:/# apt install python3-pip
root@59cbbed5325b:/# pip3 install pip --upgrade
# Uninstall the old TensorFlow version (if present)
root@59cbbed5325b:/# pip uninstall tensorflow tensorflow-gpu
# Install CUDA 11.8
root@59cbbed5325b:/# bash cuda_11.8.0_520.61.05_linux.run

root@59cbbed5325b:/# vi /root/.bashrc # add the following lines
export CUDA_HOME=/usr/local/cuda
export PATH=/usr/local/cuda-11.8/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64

# Reload /root/.bashrc
root@59cbbed5325b:/# source /root/.bashrc

# Verify the CUDA version
root@59cbbed5325b:/# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
# Use Anaconda to create a Python 3.9 virtual environment
bash Anaconda3-2024.06-1-Linux-x86_64.sh
conda create -n py3.9 python==3.9.19
conda activate py3.9
# Install TensorFlow 2.14.0 (supports CUDA 11.8)
(py3.9) root@59cbbed5325b:/# pip install tensorflow==2.14.0
# Install libcudnn from pip
(py3.9) root@59cbbed5325b:/# pip install nvidia-cudnn-cu11

# Install cuDNN from the archive
# Extract the downloaded file (download page: https://developer.nvidia.com/cudnn-archive)
# It is listed under "cuDNN 8.x - 1.x (December 2023 - August 2014)"
(py3.9) root@59cbbed5325b:/opt# tar -xJf cudnn-linux-x86_64-8.7.0.84_cuda11-archive.tar.xz

# Rename the extracted directory, then copy its files into the CUDA install directory
(py3.9) root@59cbbed5325b:/opt# mv cudnn-linux-x86_64-8.7.0.84_cuda11-archive/ cudnn-linux-x86
(py3.9) root@59cbbed5325b:/opt/install-cudnn# ll cudnn-linux-x86
total 48
drwxr-xr-x 4 25503 2174 4096 Nov 21 2022 ./
drwxr-xr-x 3 root root 4096 Jun 30 03:25 ../
-rw-r--r-- 1 25503 2174 28994 Nov 21 2022 LICENSE
drwxr-xr-x 2 25503 2174 4096 Nov 21 2022 include/
drwxr-xr-x 2 25503 2174 4096 Nov 21 2022 lib/

(py3.9) root@59cbbed5325b:/opt/install-cudnn# cp cudnn-linux-x86/include/cudnn*.h /usr/local/cuda/include
(py3.9) root@59cbbed5325b:/opt/install-cudnn# cp -P cudnn-linux-x86/lib/libcudnn* /usr/local/cuda/lib64
(py3.9) root@59cbbed5325b:/opt/install-cudnn# chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*


# Check the cuDNN version
(py3.9) root@59cbbed5325b:/opt/install-cudnn# cat /usr/local/cuda/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 7
#define CUDNN_PATCHLEVEL 0
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
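
The same check can be scripted. The sketch below parses the three `#define` lines shown above and applies the formula from the `CUDNN_VERSION` macro. The function name `parse_cudnn_version` is hypothetical, and the header text is embedded as a string so the example runs without a CUDA installation. In real use it would be read from /usr/local/cuda/include/cudnn_version.h.

```python
import re

# Copied from the grep output above.
HEADER_SNIPPET = """\
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 7
#define CUDNN_PATCHLEVEL 0
"""


def parse_cudnn_version(header_text):
    """Return (major, minor, patchlevel) parsed from cudnn_version.h text."""
    fields = {}
    for name in ("MAJOR", "MINOR", "PATCHLEVEL"):
        match = re.search(
            rf"^#define\s+CUDNN_{name}\s+(\d+)", header_text, re.MULTILINE
        )
        if match is None:
            raise ValueError(f"CUDNN_{name} not found in header text")
        fields[name] = int(match.group(1))
    return fields["MAJOR"], fields["MINOR"], fields["PATCHLEVEL"]


major, minor, patch = parse_cudnn_version(HEADER_SNIPPET)
# Same formula as the CUDNN_VERSION macro in the header.
print(major * 1000 + minor * 100 + patch)  # 8700
```

A result of 8700 corresponds to cuDNN 8.7.0, the version the compatibility table lists for TensorFlow 2.14.0.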
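# Verify that CUDA and cuDNN work with TensorFlow
## nvidia-smi reports CUDA 12.4, which is the highest CUDA version the driver supports; TensorFlow 2.14.0 uses the CUDA 11.8 toolkit and cuDNN 8.7 installed above. Check that TensorFlow can see the GPU: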
(py3.9) root@59cbbed5325b:/# python -c "
import tensorflow as tf
print('TensorFlow version:', tf.__version__)
print('GPU is available?:', tf.config.list_physical_devices('GPU'))
print('CUDA is supported?:', tf.test.is_built_with_cuda())
"

### Output:
2025-06-30 03:33:34.678335: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-06-30 03:33:34.678411: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-06-30 03:33:34.678443: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-06-30 03:33:34.687982: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-06-30 03:33:36.040071: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
TensorFlow version: 2.14.0
GPU is available?: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
CUDA is supported?: True

3. Verification and testing inside a single container

The final script is gpu_dist_mnist-v6.py, which runs TensorFlow distributed training. In container test01, start the PS (parameter server) first:

(py3.9) root@59cbbed5325b:/opt/tensorflow2.14.0# export TF_CONFIG='{"cluster": {"ps": ["localhost:2222"], "worker": ["localhost:2223"]}, "task": {"type": "ps", "index": 0}}' 
(py3.9) root@59cbbed5325b:/opt/tensorflow2.14.0# python gpu_dist_mnist-v6.py


# Last lines of the output:
==================================================
GPU和CUDA配置检查:
==================================================
TensorFlow版本: 2.14.0
TensorFlow是否支持GPU: True
CUDA是否可用: True
可用GPU设备数量: 1
GPU 0: PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
- 详细信息: {'compute_capability': (8, 6), 'device_name': 'NVIDIA A40'}
GPU内存增长模式已启用
2025-06-30 07:03:14.830558: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 43598 MB memory: -> device: 0, name: NVIDIA A40, pci bus id: 0000:41:00.0, compute capability: 8.6

所有可用设备:
- /job:localhost/replica:0/task:0/device:CPU:0: CPU
- /job:localhost/replica:0/task:0/device:GPU:0: GPU
==================================================
tf_config: {'cluster': {'ps': ['localhost:2222'], 'worker': ['localhost:2223']}, 'task': {'type': 'ps', 'index': 0}}
job name = ps
task index = 0
2025-06-30 07:03:15.331182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:ps/replica:0/task:0/device:GPU:0 with 43598 MB memory: -> device: 0, name: NVIDIA A40, pci bus id: 0000:41:00.0, compute capability: 8.6
2025-06-30 07:03:15.346094: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:457] Started server with target: grpc://localhost:2222
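
The `job name` and `task index` lines in the log above come from the `TF_CONFIG` value exported before the run. The post does not show the code of gpu_dist_mnist-v6.py, so the following is only a minimal sketch of how such a value can be read. The function name `read_task` is hypothetical.

```python
import json

# Same JSON as in the export command above.
EXAMPLE_TF_CONFIG = (
    '{"cluster": {"ps": ["localhost:2222"], "worker": ["localhost:2223"]}, '
    '"task": {"type": "ps", "index": 0}}'
)


def read_task(env):
    """Return (cluster, job_name, task_index) from a mapping holding TF_CONFIG.

    `env` is any mapping; in a real script it would be os.environ.
    """
    tf_config = json.loads(env.get("TF_CONFIG", "{}"))
    task = tf_config.get("task", {})
    return tf_config.get("cluster", {}), task.get("type"), task.get("index")


cluster, job_name, task_index = read_task({"TF_CONFIG": EXAMPLE_TF_CONFIG})
print("job name =", job_name)      # job name = ps
print("task index =", task_index)  # task index = 0
```

The PS and the worker share the same `cluster` section and differ only in the `task` section, which is why the two export commands in this section are almost identical.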

At the same time, start the worker in another terminal window of container test01:

(py3.9) root@59cbbed5325b:/opt/tensorflow2.14.0# export TF_CONFIG='{"cluster": {"ps": ["localhost:2222"], "worker": ["localhost:2223"]}, "task": {"type": "worker", "index": 0}}' 
(py3.9) root@59cbbed5325b:/opt/tensorflow2.14.0# python gpu_dist_mnist-v6.py --num_gpus=1 --train_steps=5000



# Last lines of the output:
1751267105.262005: Worker 0: training step 4995 done (global step: 4994)
1751267105.266335: Worker 0: training step 4996 done (global step: 4995)
1751267105.270457: Worker 0: training step 4997 done (global step: 4996)
1751267105.274166: Worker 0: training step 4998 done (global step: 4997)
1751267105.278662: Worker 0: training step 4999 done (global step: 4998)
1751267105.283150: Worker 0: training step 5000 done (global step: 4999)
1751267105.287258: Worker 0: training step 5001 done (global step: 5000)
Training ends @ 1751267105.287361
Training elapsed time: 28.788610 s
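
As a rough worked example, the elapsed time above can be turned into a throughput figure. This treats the run as the 5000 steps requested with `--train_steps` and ignores startup cost, so it is an estimate rather than a benchmark.

```python
# Values copied from the worker command and log above.
TRAIN_STEPS = 5000
ELAPSED_SECONDS = 28.788610

steps_per_second = TRAIN_STEPS / ELAPSED_SECONDS
ms_per_step = 1000.0 * ELAPSED_SECONDS / TRAIN_STEPS

print(f"{steps_per_second:.1f} steps/s, {ms_per_step:.2f} ms/step")
# 173.7 steps/s, 5.76 ms/step
```

The average of roughly 5.76 ms per step is in the same range as the 4 to 5 ms gaps between the per-step timestamps in the log.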

While training is running, NVIDIA GPU usage can be checked on the host or inside container test01. The following was captured on the host:

# Two processes (the PS and the worker) are using the NVIDIA GPU
(base) root@controller01:/# nvidia-smi
Mon Jun 30 15:04:51 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A40 Off | 00000000:41:00.0 Off | 0 |
| 0% 72C P0 94W / 300W | 547MiB / 46068MiB | 4% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 59697 C python 260MiB |
| 0 N/A N/A 63680 C python 274MiB |
+-----------------------------------------------------------------------------------------+
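
For scripted monitoring, the process table is easier to consume in CSV form. The sketch below assumes the text was produced by `nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader,nounits`. The sample string is built by hand from the PIDs and sizes in the table above, not captured from that command, and `parse_compute_apps` is a hypothetical helper.

```python
import csv
import io

# Hand-written sample in the assumed CSV shape: pid, process name, used MiB.
# The numbers are copied from the process table above.
SAMPLE = "59697, python, 260\n63680, python, 274\n"


def parse_compute_apps(text):
    """Return a list of (pid, name, used_mib) tuples from the CSV text."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        pid, name, used_mib = record
        rows.append((int(pid), name, int(used_mib)))
    return rows


apps = parse_compute_apps(SAMPLE)
print(len(apps), sum(used for _, _, used in apps))  # 2 534
```

The two processes together account for 534 MiB. The 547 MiB shown in the Memory-Usage column also includes memory that is not attributed to a process.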

4. Saving the container as an image

# Save container test01 as image 175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example:v4
root@controller01:~# docker commit -m "1)using python 3.9.19 in conda rather than py3.6.9 2)NVIDIA Driver Version: 550.54.15 3)cuda11.8 rather than 12.4 4)cudnn-linux-x86_64-8.7.0.84_cuda11 5)using tensorflow2.14.0 rather than tensorflow1.18.0" test01 175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example:v4
# Note: the -m message in the command above contains a typo; the upgrade is actually from tensorflow1.8.0 to tensorflow2.14.0

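# The v4 image is much larger than v3. Most of the added files are under /usr and /root/anaconda3: the former holds the CUDA and cuDNN files, the latter mainly the files added by the Python 3.9 virtual environment and tensorflow2.14.0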
root@controller01:~# docker images | grep "175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example"
175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example v4 3c83959b2e82 3 minutes ago 23.4GB
175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example v3 59b8c787e86a 12 months ago 1.53GB
# Push to the private Harbor registry (requires the Harbor username and password)
root@controller01:~# docker push 175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example:v4

# Save the image to a local file
root@controller01:/opt/172.20.0.21_backup/installPkgs/install-volcano/volcano-1.12/tensorflow2.14.0# docker save 175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example:v4 -o ubuntu18-dist-mnist-tf-example-v4.tar
# List the files
root@controller01:/opt/172.20.0.21_backup/installPkgs/install-volcano/volcano-1.12/tensorflow2.14.0# ls -alh
total 23G
drwxr-xr-x 2 root root 4.0K Jun 30 15:51 .
drwxr-xr-x 5 root root 4.0K Jun 30 14:29 ..
-rw-r--r-- 1 root root 2.2K Jun 30 10:43 gpu_diagnostic_script.sh
-rw-r--r-- 1 root root 19K Jun 29 22:34 gpu_dist_mnist-v5.py
-rw-r--r-- 1 root root 20K Jun 30 15:00 gpu_dist_mnist-v6-ok.py
-rw-r--r-- 1 root root 20K Jun 30 15:29 gpu_dist_mnist-v6.py
-rw-r--r-- 1 root root 134 Jun 30 14:33 README.txt
-rw------- 1 root root 23G Jun 30 15:51 ubuntu18-dist-mnist-tf-example-v4.tar

5. Scheduling jobs with Volcano 1.10.0

5.1 CPU only

tf-dist-mnist-example-cpu-v2-2000times-4presentation.yaml:

# Using TensorFlow as an example, create a workload with 1 ps and 1 worker
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tensorflow-dist-mnist-cpu
spec:
  minAvailable: 2 # both pods of this job (1 ps + 1 worker) must be available
  schedulerName: volcano # use volcano as the scheduler
  plugins:
    env: []
    svc: []
  policies:
    - event: PodEvicted # restart the job when a pod is evicted
      action: RestartJob
  tasks:
    - replicas: 1 # 1 ps pod
      name: ps
      template: # definition of the ps pod
        spec:
          containers:
            - command:
                - sh
                - -c
                - |
                  PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"ps\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};
                  PYTHONUNBUFFERED=1 python /var/tf_dist_mnist/dist_mnist.py --num_gpus=0
              image: 175.6.40.93:8196/volcanosh/volcanosh/dist-mnist-tf-example:0.0.5
              #image: 175.6.40.93:8196/volcanosh/volcanosh/dist-mnist-tf-example:0.0.1
              name: tensorflow
              ports:
                - containerPort: 2222
                  name: tfjob-port
              resources: {}
          restartPolicy: Never
    - replicas: 1 # 1 worker pod
      name: worker
      policies:
        - event: TaskCompleted # the job is considered complete when the worker task completes
          action: CompleteJob
      template: # definition of the worker pod
        spec:
          containers:
            - command:
                - sh
                - -c
                - |
                  PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"worker\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};
                  python /var/tf_dist_mnist/dist_mnist.py --num_gpus=0 --train_steps=2000 --batch_size=10000
              image: 175.6.40.93:8196/volcanosh/volcanosh/dist-mnist-tf-example:0.0.5
              #image: 175.6.40.93:8196/volcanosh/volcanosh/dist-mnist-tf-example:0.0.1
              name: tensorflow
              ports:
                - containerPort: 2222
                  name: tfjob-port
              resources: {}
          restartPolicy: Never
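
The shell pipeline in the `command` blocks appends `:2222` to every host, wraps it in quotes, joins the hosts with commas, and pastes the result into a JSON template. The Python sketch below builds the same structure with `json.dumps`. It assumes that /etc/volcano/ps.host and /etc/volcano/worker.host contain one hostname per line. The hostnames and both function names are hypothetical placeholders.

```python
import json


def hosts_from_file_text(text, port=2222):
    """Turn newline-separated hostnames into 'host:port' strings, skipping blanks."""
    return [f"{line.strip()}:{port}" for line in text.splitlines() if line.strip()]


def tf_config_from_host_files(ps_text, worker_text, task_type, task_index):
    """Build the TF_CONFIG JSON string from the contents of the two host files."""
    config = {
        "cluster": {
            "ps": hosts_from_file_text(ps_text),
            "worker": hosts_from_file_text(worker_text),
        },
        "task": {"type": task_type, "index": task_index},
        "environment": "cloud",
    }
    return json.dumps(config)


# Hypothetical file contents; real hostnames come from /etc/volcano/*.host.
ps_text = "example-ps-0.example\n"
worker_text = "example-worker-0.example\n"
print(tf_config_from_host_files(ps_text, worker_text, "worker", 0))
```

Building the value with `json.dumps` avoids the manual escaping in the shell version and cannot leave a trailing comma in the host lists. The trade-off is that it needs a Python entry point instead of a plain shell one-liner.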
# Submit the job:
root@controller01:/opt/172.20.0.21_backup/installPkgs/install-volcano/volcano-1.10.0/example# kubectl create -f tf-dist-mnist-example-cpu-v2-2000times-4presentation.yaml
job.batch.volcano.sh/tensorflow-dist-mnist-cpu created

# List the pods
root@controller01:/opt/172.20.0.21_backup/installPkgs/install-volcano/volcano-1.10.0/example# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
tensorflow-dist-mnist-cpu-ps-0 1/1 Running 0 8s 10.250.0.17 controller01 <none> <none>
tensorflow-dist-mnist-cpu-worker-0 1/1 Running 0 8s 10.250.1.27 ksp-registry <none> <none>

5.2 Using an NVIDIA GPU

The image used in the following file is large, about 23 GB.

tf-dist-mnist-example-vgpu-v2-2000times-4presentation.yaml:

# Using TensorFlow as an example, create a workload with 1 ps and 1 worker
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tensorflow-dist-mnist-vgpu-v2
  # annotations:
  #   volcano.sh/vgpu-mode: "hami-core"
spec:
  minAvailable: 2 # both pods of this job (1 ps + 1 worker) must be running or finished
  schedulerName: volcano # use volcano as the scheduler
  plugins:
    env: []
    svc: []
  policies:
    - event: PodEvicted # restart the job when a pod is evicted
      action: RestartJob
  tasks:
    - replicas: 1 # 1 ps pod
      name: ps
      template: # definition of the ps pod
        spec:
          containers:
            - command:
                - bash
                - -c
                - |
                  source /root/.bashrc && __conda_setup="$('/root/anaconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)" && eval "$__conda_setup" && conda activate py3.9;
                  export PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  export WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  export TF_CONFIG="{\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"ps\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"}";
                  echo $PS_HOST;
                  echo $WORKER_HOST;
                  echo $TF_CONFIG;
                  python /opt/tensorflow2.14.0/gpu_dist_mnist-v6.py
              image: 175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example:v4
              #imagePullPolicy: Always
              imagePullPolicy: IfNotPresent
              name: tensorflow
              ports:
                - containerPort: 2222
                  name: tfjob-port
              resources:
                limits:
                  volcano.sh/gpu-number: 1 # request 1 GPU
                  volcano.sh/gpu-memory: 3072 # request 3072 MB of GPU memory
          restartPolicy: Never
          tolerations:
            - key: "node.kubernetes.io/disk-pressure"
              operator: "Exists" # tolerate all disk-pressure taints (no value needed)
              effect: "NoSchedule"
    - replicas: 1 # 1 worker pod
      name: worker
      policies:
        - event: TaskCompleted # the job is considered complete when the worker task completes
          action: CompleteJob
      template: # definition of the worker pod
        spec:
          containers:
            - command:
                - bash
                - -c
                - |
                  source /root/.bashrc && __conda_setup="$('/root/anaconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)" && eval "$__conda_setup" && conda activate py3.9;
                  export PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  export WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  export TF_CONFIG="{\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"worker\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"}";
                  echo $PS_HOST;
                  echo $WORKER_HOST;
                  echo $TF_CONFIG;
                  python /opt/tensorflow2.14.0/gpu_dist_mnist-v6.py --train_steps=2000 --batch_size=10000
              image: 175.6.40.93:8196/volcanosh/ubuntu18-dist-mnist-tf-example:v4
              #imagePullPolicy: Always
              imagePullPolicy: IfNotPresent
              name: tensorflow
              ports:
                - containerPort: 2222
                  name: tfjob-port
              resources:
                limits:
                  volcano.sh/gpu-number: 1 # request 1 GPU
                  volcano.sh/gpu-memory: 3072 # request 3072 MB of GPU memory
          restartPolicy: Never
          tolerations:
            - key: "node.kubernetes.io/disk-pressure"
              operator: "Exists" # tolerate all disk-pressure taints (no value needed)
              effect: "NoSchedule"

# Submit the job:
root@controller01:/opt/172.20.0.21_backup/installPkgs/install-volcano/volcano-1.10.0/example# kubectl create -f tf-dist-mnist-example-vgpu-v2-2000times-4presentation.yaml
job.batch.volcano.sh/tensorflow-dist-mnist-vgpu-v2 created

# List the pods
root@controller01:/opt/172.20.0.21_backup/installPkgs/install-volcano/volcano-1.10.0/example# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
tensorflow-dist-mnist-cpu-ps-0 1/1 Terminating 0 3m56s 10.250.0.17 controller01 <none> <none>
tensorflow-dist-mnist-cpu-worker-0 0/1 Completed 0 3m56s 10.250.1.27 ksp-registry <none> <none>
tensorflow-dist-mnist-vgpu-v2-ps-0 1/1 Running 0 23s 10.250.0.18 controller01 <none> <none>
tensorflow-dist-mnist-vgpu-v2-worker-0 1/1 Running 0 23s 10.250.1.28 ksp-registry <none> <none>

# On each Kubernetes node where one of the pods is running, one process using the NVIDIA GPU is visible
root@controller01:/opt/172.20.0.21_backup/installPkgs/install-volcano/volcano-1.10.0/example# nvidia-smi
Tue Jul 1 15:28:15 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A40 Off | 00000000:41:00.0 Off | 0 |
| 0% 72C P0 94W / 300W | 268MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 30877 C python 260MiB |
+-----------------------------------------------------------------------------------------+

https://jiangsanyin.github.io/2025/06/29/从tensorflow1.8.0升级到2.14.0并支持使用NVIDIA GPU做mnist分布式训练/
Author: sanyinjiang
Published: June 29, 2025