1. Introduction
HAMi-WebUI project site: https://github.com/Project-HAMi/HAMi-WebUI
According to the project description, HAMi-WebUI is an open-source project built on top of the open-source HAMi project, mainly for managing and monitoring GPUs and other accelerator resources. It extends HAMi's functionality and usability with a visual web interface through which GPU resource allocation and usage can be viewed and managed across nodes. HAMi-WebUI supports viewing task-level and per-GPU usage details, so team members can efficiently monitor resource usage and consumption, collaborate more smoothly, and make fuller use of the hardware. Concretely, it provides the following four features:
Resource overview:
Provides a comprehensive view of all resources, including resource usage of nodes and GPUs, for quickly assessing the status of every node and GPU.
Node management:
Browse detailed node information, including node status and resource usage.
GPU management:
Visualize GPU usage on each node, with a detailed breakdown of how compute and GPU memory are allocated and used.
Task management:
Track tasks and their resource consumption; view task creation time, status, GPU assignment, and more.
2. Deployment
Deployment documentation: https://github.com/Project-HAMi/HAMi-WebUI/blob/main/docs/installation/helm/index.md
The following prerequisites must be met before deploying HAMi-WebUI (see the quick checks right after this list):
A Kubernetes cluster is already deployed (the required Kubernetes version depends on the HAMi version; see the official HAMi installation documentation)
HAMi is already deployed, with version >= 2.4.0
Prometheus is deployed, with version > 2.8.0
Helm is deployed, with version > 3.0
(At present the official documentation only describes installing HAMi-WebUI via Helm.)
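A few quick sanity checks for these prerequisites; the HAMi release name and namespace in the last command are assumptions, so adjust them to your environment:

# Kubernetes and Helm versions
root@controller01:~# kubectl version
root@controller01:~# helm version
# HAMi chart version (assumes HAMi was installed via Helm into kube-system with a release name containing "hami")
root@controller01:~# helm -n kube-system list | grep -i hami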
2.1 Deploying the HAMi-WebUI Helm chart
# 1. Add the HAMi-WebUI repository
helm repo add hami-webui https://project-hami.github.io/HAMi-WebUI

# 2. Deploy HAMi-WebUI with Helm (set externalPrometheus.address to the actual
#    Prometheus service name in your cluster; the port is usually 9090)
helm install my-hami-webui hami-webui/hami-webui --set externalPrometheus.enabled=true --set externalPrometheus.address="http://prometheus-k8s.monitoring.svc.cluster.local:9090" -n kube-system

# The command above produces output like the following:
# NAME: my-hami-webui
# LAST DEPLOYED: Wed Jan 22 14:46:54 2025
# NAMESPACE: kube-system
# STATUS: deployed
# REVISION: 1
# TEST SUITE: None
# NOTES:
# 1. Get the application URL by running these commands:
#   export POD_NAME=$(kubectl get pods --namespace kube-system -l "app.kubernetes.io/name=hami-webui,app.kubernetes.io/instance=my-hami-webui" -o
#   export CONTAINER_PORT=$(kubectl get pod --namespace kube-system $POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
#   echo "Visit http://127.0.0.1:3000 to use your application"
#   kubectl --namespace kube-system port-forward $POD_NAME 3000:$CONTAINER_PORT

# The following confirmation commands were run before "helm install ...".
# Check the actual Prometheus service name and port in the cluster:
root@controller01:~# kubectl -n monitoring get svc | egrep "NAME|prometheus"
NAME                  TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                         AGE
prometheus-adapter    ClusterIP   10.68.33.74     <none>        443/TCP                         11d
prometheus-k8s        NodePort    10.68.168.213   <none>        9090:31819/TCP,8080:31903/TCP   11d
prometheus-operated   ClusterIP   None            <none>        9090/TCP                        11d
prometheus-operator   ClusterIP   None            <none>        8443/TCP                        11d

# Confirm that the service name prometheus-k8s.monitoring.svc.cluster.local resolves:
root@controller01:~# kubectl run -it --restart=Never --image=swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io/infoblox/dnstools:v3 dnstools --rm
If you don't see a command prompt, try pressing enter.
dnstools# nslookup prometheus-k8s.monitoring.svc.cluster.local
Server:         169.254.20.10
Address:        169.254.20.10#53

Name:   prometheus-k8s.monitoring.svc.cluster.local
Address: 10.68.168.213
dnstools#
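The same settings can also be kept in a values file instead of inline --set flags; a minimal sketch, where the keys simply mirror the flags used above:

root@controller01:~# cat > my-values.yaml <<'EOF'
externalPrometheus:
  enabled: true
  address: "http://prometheus-k8s.monitoring.svc.cluster.local:9090"
EOF
root@controller01:~# helm install my-hami-webui hami-webui/hami-webui -f my-values.yaml -n kube-system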
# 3. Confirm the installation result
root@controller01:~# kubectl get pods -n kube-system | grep webui
my-hami-webui-c6f4b6c98-mpgd7       2/2     Running   0          17h
my-hami-webui-dcgm-exporter-v6gnx   1/1     Running   0          16m

# Note: the image "nvcr.io/nvidia/k8s/dcgm-exporter:3.3.7-3.5.0-ubuntu22.04" used here is
# quite likely to fail to pull.
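If that image cannot be pulled directly, one workaround is to pull it through a reachable mirror registry and retag it on each GPU node; a sketch, where <mirror-registry> is only a hypothetical placeholder:

# <mirror-registry> is hypothetical; substitute a registry you can actually reach
root@controller01:~# docker pull <mirror-registry>/nvcr.io/nvidia/k8s/dcgm-exporter:3.3.7-3.5.0-ubuntu22.04
root@controller01:~# docker tag <mirror-registry>/nvcr.io/nvidia/k8s/dcgm-exporter:3.3.7-3.5.0-ubuntu22.04 nvcr.io/nvidia/k8s/dcgm-exporter:3.3.7-3.5.0-ubuntu22.04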
# 4. Check the services and change the type of svc/my-hami-webui to NodePort
root@controller01:~# kubectl -n kube-system get svc
NAME                          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                         AGE
dashboard-metrics-scraper     ClusterIP   10.68.5.223     <none>        8000/TCP                        12d
hami-device-plugin-monitor    NodePort    10.68.183.209   <none>        31992:31992/TCP                 8d
hami-scheduler                NodePort    10.68.83.62     <none>        443:31998/TCP,31993:31993/TCP   8d
kube-dns                      ClusterIP   10.68.0.2       <none>        53/UDP,53/TCP,9153/TCP          12d
kube-dns-upstream             ClusterIP   10.68.48.155    <none>        53/UDP,53/TCP                   12d
kubelet                       ClusterIP   None            <none>        10250/TCP,10255/TCP,4194/TCP    12d
kubernetes-dashboard          NodePort    10.68.176.76    <none>        443:30119/TCP                   12d
metrics-server                ClusterIP   10.68.24.142    <none>        443/TCP                         12d
my-hami-webui                 ClusterIP   10.68.74.135    <none>        3000/TCP,8000/TCP               17h
my-hami-webui-dcgm-exporter   ClusterIP   10.68.174.215   <none>        9400/TCP                        17h
node-local-dns                ClusterIP   None            <none>        9253/TCP                        12d

# After changing the type of svc/my-hami-webui from ClusterIP to NodePort:
root@controller01:~# kubectl -n kube-system get svc | grep webui
my-hami-webui                 NodePort    10.68.74.135    <none>        3000:31254/TCP,8000:32691/TCP   17h
my-hami-webui-dcgm-exporter   ClusterIP   10.68.174.215   <none>        9400/TCP                        17h
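Instead of editing the service interactively, the type switch can be done with a one-line patch; a minimal sketch:

root@controller01:~# kubectl -n kube-system patch svc my-hami-webui -p '{"spec":{"type":"NodePort"}}'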
2.2 Accessing HAMi-WebUI
The URL is http://{any k8s node IP}:{NodePort of svc/my-hami-webui on the k8s nodes}. From the command output above, the NodePort for svc/my-hami-webui is 31254.
If one of the cluster's control-plane nodes has IP 172.20.0.21, the HAMi-WebUI URL is http://172.20.0.21:31254:
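The NodePort can also be read straight from the service rather than from the table above; a small sketch:

root@controller01:~# kubectl -n kube-system get svc my-hami-webui -o jsonpath='{.spec.ports[?(@.port==3000)].nodePort}'
31254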
[Screenshot: HAMi-WebUI opened in a browser at http://172.20.0.21:31254]
2.3 Uninstalling HAMi-WebUI
root@controller01:~# helm -n kube-system uninstall my-hami-webui
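To confirm that the release and its pods are gone:

root@controller01:~# helm -n kube-system list | grep webui
root@controller01:~# kubectl -n kube-system get pods | grep webui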
2.4 Errors and fixes
2.4.1 pod/my-hami-webui-dcgm-exporter reports "Failed to watch metrics: Error watching fields: The third-party Profiling module returned an unrecoverable error"
The full problem report is at: https://github.com/Project-HAMi/HAMi-WebUI/issues/20
[Screenshot: dcgm-exporter pod log showing the Profiling module error]
Resolution:
January 22, 2025: No idea how to fix this yet; shelved for now. Asked in the community and in the HAMi WeChat group, but so far (January 22, 2025) no one has replied; perhaps everyone is already off for the Chinese New Year.
January 23, 2025:
# Edit daemonsets/my-hami-webui-dcgm-exporter and remove the argument
# "args: -f /etc/dcgm-exporter/dcp-metrics-included.csv"
root@controller01:~# kubectl -n kube-system edit daemonsets.apps my-hami-webui-dcgm-exporter
[Screenshots: the daemonset manifest before and after removing the dcp-metrics args line]
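The same change can be made non-interactively with a JSON patch; a sketch that assumes the dcgm-exporter container is the first container in the pod template and that dropping its args entirely is acceptable:

root@controller01:~# kubectl -n kube-system patch daemonset my-hami-webui-dcgm-exporter --type=json -p='[{"op":"remove","path":"/spec/template/spec/containers/0/args"}]'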
# Afterwards, pod/my-hami-webui-dcgm-exporter-v6gnx runs normally
root@controller01:~# kubectl -n kube-system get pods
[Screenshot: kubectl get pods output with the dcgm-exporter pod in Running state]
3. Usage
Creating tasks
root@controller01:/opt/installPkgs/k8s-vgpu-basedon-HAMi# cat gpu-test4.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test4
spec:
  restartPolicy: OnFailure
  # nodeName: controller01
  containers:
    - name: gpu-test4-01
      image: xxx/ubuntu2004:pytorch2.2.2-classification-example
      command:
        - python3
        - /opt/classification/train.py
      resources:
        limits:
          nvidia.com/vgpu: 2        # requesting 2 vGPUs
          nvidia.com/gpumem: 3000   # Each vGPU contains 3000m device memory (Optional, Integer)
          nvidia.com/gpucores: 10   # Each vGPU uses 10% of the entire GPU (Optional, Integer)
root@controller01:/opt/installPkgs/k8s-vgpu-basedon-HAMi# cat gpu-test6.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test6
spec:
  restartPolicy: OnFailure
  # nodeName: controller01
  containers:
    - name: gpu-test6-01
      image: 175.6.40.93:8196/k8s-kubekey/ubuntu2004:pytorch2.2.2-classification-example
      command:
        - python3
        - /opt/classification/train.py
      resources:
        limits:
          nvidia.com/vgpu: 1        # requesting 1 vGPU
          nvidia.com/gpumem: 3000   # Each vGPU contains 3000m device memory (Optional, Integer)
          #nvidia.com/gpumem-percentage: 10   # Each vGPU contains 10% device memory of that GPU. Cannot be used with nvidia.com/gpumem
          nvidia.com/gpucores: 10   # Utilization limit of this vGPU is set to 10% of total GPU utilization
          #nvidia.com/priority: 0   # There are only two priority classes, 0 (high) and 1 (low); default: 1
          # The utilization of a high-priority task won't be limited to resourceCores unless it shares a GPU node with other high-priority tasks.
          # The utilization of a low-priority task won't be limited to resourceCores if no other task shares its GPU.
root@controller01:/opt/installPkgs/k8s-vgpu-basedon-HAMi# kubectl apply -f gpu-test4.yaml -f gpu-test6.yaml
root@controller01:/opt/installPkgs/k8s-vgpu-basedon-HAMi# kubectl get pods
NAME        READY   STATUS    RESTARTS   AGE
dnstools    1/1     Running   0          23h
gpu-test4   1/1     Running   0          2m17s
gpu-test6   1/1     Running   0          2m17s
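To double-check that the vGPU limits actually landed in the pod spec, a quick jsonpath query (the output shown is illustrative):

root@controller01:/opt/installPkgs/k8s-vgpu-basedon-HAMi# kubectl get pod gpu-test4 -o jsonpath='{.spec.containers[0].resources.limits}'
# Expected output along the lines of:
# {"nvidia.com/gpucores":"10","nvidia.com/gpumem":"3000","nvidia.com/vgpu":"2"}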
Resource overview
[Screenshot: HAMi-WebUI resource overview dashboard]
Task management
View resource usage across all tasks, and detailed resource usage for a single task.
[Screenshots: the task list and a single task's resource usage details in HAMi-WebUI]
More detailed analysis to follow.
4. Contributing
https://github.com/Project-HAMi/HAMi-WebUI/blob/main/README_ZH.md#%E5%8F%82%E4%B8%8E%E8%B4%A1%E7%8C%AE