Understanding the Output of nvidia-smi Commands

1. References

Official nvidia-smi documentation on docs.nvidia.com: https://docs.nvidia.com/deploy/nvidia-smi/

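2. Introduction

nvidia-smi, short for NVIDIA System Management Interface (NVSMI), is a tool that provides monitoring and management capabilities (such as enabling or disabling a GPU's ECC mode) for Tesla, Quadro, GRID, and GeForce devices from the Fermi architecture onward. GeForce Titan series devices support most functions, while other GeForce-branded devices expose only very limited information. NVSMI is a cross-platform tool that supports all Linux distributions supported by the standard NVIDIA driver, as well as 64-bit versions of Windows starting with Windows Server 2008 R2. Metrics can be read directly from stdout, or written to a file in CSV or XML format for use by scripts.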

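Most NVSMI functionality is provided by the underlying C-based NVML library. For more information about NVML, see the NVIDIA developer website (http://developer.nvidia.com/nvidia-management-library-nvml/). Python bindings for NVML are also available (http://pypi.python.org/pypi/nvidia-ml-py/).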

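The output of NVSMI is not guaranteed to be backward compatible. NVML and its Python bindings, however, are backward compatible, so they are the preferred choice when writing any tool that must be maintained across multiple NVIDIA driver releases.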
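As a rough illustration of the bindings route, the sketch below lists each GPU with its memory usage through NVML instead of parsing nvidia-smi text. It assumes the nvidia-ml-py package is installed (it provides the `pynvml` module); the helper name `list_gpus` is made up for this example, and the function returns None when the bindings or the driver are missing.

```python
def list_gpus():
    """Return [(index, name, used_MiB, total_MiB), ...], or None if NVML is unavailable."""
    try:
        import pynvml  # module shipped by the nvidia-ml-py package
    except ImportError:
        return None
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return None  # no NVIDIA driver / NVML library on this machine
    try:
        gpus = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):  # older binding versions return bytes
                name = name.decode()
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)  # values are in bytes
            gpus.append((index, name, mem.used // 2**20, mem.total // 2**20))
        return gpus
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    print(list_gpus())
```

On a machine without a GPU or without the package the call simply yields None, so a script can fall back to other means.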

3. Using the nvidia-smi Command

The options of the nvidia-smi command fall into several categories, such as "LIST OPTIONS" (-L, -B), "SUMMARY OPTIONS" (-i and others), "QUERY OPTIONS" (-q and others), "SELECTIVE QUERY OPTIONS" (every option that starts with "--query", such as --query-gpu), and "DEVICE MODIFICATION OPTIONS". To list them, run: nvidia-smi --help

3.1 nvidia-smi

# View the help text
nvidia-smi --help

(kt) root@ksp-registry:~# nvidia-smi 
Thu Mar 20 11:08:24 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A40                     Off |   00000000:C1:00.0 Off |                    0 |
|  0%   46C    P0             77W /  300W |   10376MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     28419      C   python                                      10368MiB |
+-----------------------------------------------------------------------------------------+

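Thu Mar 20 11:08:24 2025    Date and time of the query
NVIDIA-SMI 550.54.15        Version of the nvidia-smi tool, which ships with the driver; format: major.minor.revision
Driver Version: 550.54.15   Version of the installed NVIDIA driver; here it matches the value above
CUDA Version: 12.4          Highest CUDA version supported by this driver (not necessarily the installed CUDA toolkit); format: major.minor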
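A script that needs these three version strings can pull them out of the banner line with a regular expression. The snippet below is only a sketch: the banner is copied from the sample output above, the helper names `parse_header` and `version_tuple` are invented for illustration, and the pattern assumes all three versions are numeric.

```python
import re

# Banner line copied from the sample output above.
HEADER = "| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |"


def parse_header(line):
    """Extract the tool, driver and CUDA versions from the banner line."""
    pattern = (
        r"NVIDIA-SMI\s+(?P<smi>[\d.]+)\s+"
        r"Driver Version:\s+(?P<driver>[\d.]+)\s+"
        r"CUDA Version:\s+(?P<cuda>[\d.]+)"
    )
    match = re.search(pattern, line)
    if match is None:
        raise ValueError("not an nvidia-smi banner line")
    return match.groupdict()


def version_tuple(text):
    """Turn '550.54.15' into (550, 54, 15) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


info = parse_header(HEADER)
print(info, version_tuple(info["driver"]))
```

Comparing tuples rather than strings avoids mistakes such as "12.10" sorting before "12.4".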

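GPU:            Integer index of the GPU on this server, starting from 0
Name:           GPU name, i.e. the model (e.g. NVIDIA A40, Tesla V100-PCIE...)
Persistence-M:  Persistence mode state. A change takes effect immediately and reverts to the default (Disabled, shown as Off) after a reboot. Linux only.
                Persistence mode is an NVIDIA GPU feature that controls the GPU's power-management behavior. When it is enabled, the GPU stays initialized even when no GPU task is running. When it is disabled, the GPU enters a low-power state when idle and must be reinitialized before the next task starts, which adds latency. Enabling persistence mode avoids that latency.
Fan:            Fan speed (0% to 100%); N/A means the GPU has no fan
Temp:           GPU temperature (an overheated GPU lowers its clock frequency)
Perf:           Performance state, from P0 (maximum performance) to P12 (minimum performance). It is adjusted automatically according to GPU load, power limits, and temperature.
Pwr:Usage/Cap   Current GPU power draw and the power cap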

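Bus-Id:         PCI bus address of the GPU, in the form domain:bus:device.function
Disp.A:         Display Active: whether the GPU is currently driving a display, e.g. whether a monitor is connected
Memory-Usage:   GPU memory in use / total GPU memory, in MiB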

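Volatile Uncorr. ECC:   Number of uncorrected ECC errors counted since the driver was last loaded (the volatile counter). N/A means the GPU does not support ECC or ECC is not enabled.
GPU-Util:               GPU utilization; 0% means the GPU is currently idle.
Compute M.:             Compute mode: 0/DEFAULT, 1/EXCLUSIVE_THREAD (deprecated), 2/PROHIBITED, 3/EXCLUSIVE_PROCESS. In Default mode several processes can run compute tasks on the GPU at the same time; in Exclusive Process mode only one process at a time can; in Prohibited mode no compute task can be started.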
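The Bus-Id layout domain:bus:device.function described above can be split into its four hexadecimal parts as follows. This is a minimal sketch; `PciAddress` and `parse_bus_id` are names invented for the example, and the input is the Bus-Id from the sample output.

```python
from typing import NamedTuple


class PciAddress(NamedTuple):
    domain: int
    bus: int
    device: int
    function: int


def parse_bus_id(bus_id):
    """Split 'domain:bus:device.function' (hexadecimal fields) into integers."""
    domain, bus, rest = bus_id.split(":")
    device, function = rest.split(".")
    return PciAddress(int(domain, 16), int(bus, 16), int(device, 16), int(function, 16))


addr = parse_bus_id("00000000:C1:00.0")
print(addr)
```

The bus field C1 is hexadecimal, so it corresponds to bus number 193 in decimal.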

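## The Processes section lists the processes currently using the GPU and how much GPU memory each one occupies
GPU                 Index of the GPU the process is running on (indices start from 0 when there are multiple GPUs)
GI                  GPU Instance ID, used with MIG (Multi-Instance GPU); N/A when MIG is not in use
CI                  Compute Instance ID, used with MIG; N/A when MIG is not in use
PID                 Operating-system process ID of the process using the GPU
Type                Process type:
                    - C: Compute process
                    - G: Graphics process
                    - C+G: process that uses both compute and graphics
Process name        Name or path of the process using the GPU. It may be cut off at the first space in the command line
GPU Memory Usage    GPU memory used by the process, in MiB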

3.2 QUERY OPTIONS

3.2.1 nvidia-smi -q -xxx

# With no further options, the following command queries every detail of all GPUs and prints it to the console
nvidia-smi -q

# View the usage of the nvidia-smi command
(kt) root@ksp-registry:/opt# nvidia-smi -h | more                
NVIDIA System Management Interface -- v550.54.15

NVSMI provides monitoring information for Tesla and select Quadro devices.
The data is presented in either a plain text or an XML format, via stdout or a file.
NVSMI also provides several management operations for changing the device state.

Note that the functionality of NVSMI is exposed through the NVML C-based
library. See the NVIDIA developer website for more information about NVML.
Python wrappers to NVML are also available.  The output of NVSMI is
not guaranteed to be backwards compatible; NVML and the bindings are backwards
compatible.

http://developer.nvidia.com/nvidia-management-library-nvml/
http://pypi.python.org/pypi/nvidia-ml-py/
Supported products:
- Full Support
    - All Tesla products, starting with the Kepler architecture
    - All Quadro products, starting with the Kepler architecture
    - All GRID products, starting with the Kepler architecture
    - GeForce Titan products, starting with the Kepler architecture
- Limited Support
    - All Geforce products, starting with the Kepler architecture
nvidia-smi [OPTION1 [ARG1]] [OPTION2 [ARG2]] ...

    -h,   --help                Print usage information and exit.

          --version             Print version information and exit.

  LIST OPTIONS:

    -L,   --list-gpus           Display a list of GPUs connected to the system.

    -B,   --list-excluded-gpus  Display a list of excluded GPUs in the system.

  SUMMARY OPTIONS:

    <no arguments>              Show a summary of GPUs connected to the system.

    [plus any of]

    -i,   --id=                 Target a specific GPU.
    -f,   --filename=           Log to a specified file, rather than to stdout.
    -l,   --loop=               Probe until Ctrl+C at specified second interval.

  QUERY OPTIONS:

    -q,   --query               Display GPU or Unit info.

    [plus any of]

    -u,   --unit                Show unit, rather than GPU, attributes.
    -i,   --id=                 Target a specific GPU or Unit.
    -f,   --filename=           Log to a specified file, rather than to stdout.
    -x,   --xml-format          Produce XML output.
          --dtd                 When showing xml output, embed DTD.
    -d,   --display=            Display only selected information: MEMORY,
                                    UTILIZATION, ECC, TEMPERATURE, POWER, CLOCK,
                                    COMPUTE, PIDS, PERFORMANCE, SUPPORTED_CLOCKS,
                                    PAGE_RETIREMENT, ACCOUNTING, ENCODER_STATS,
                                    SUPPORTED_GPU_TARGET_TEMP, VOLTAGE, FBC_STATS
                                    ROW_REMAPPER, RESET_STATUS, GSP_FIRMWARE_VERSION
                                Flags can be combined with comma e.g. ECC,POWER.
                                Sampling data with max/min/avg is also returned 
                                for POWER, UTILIZATION and CLOCK display types.
                                Doesn't work with -u or -x flags.
    -l,   --loop=               Probe until Ctrl+C at specified second interval.

    -lms, --loop-ms=            Probe until Ctrl+C at specified millisecond interval.
...

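# -d selects which GPU information to print, here ECC and POWER; separate multiple items with ASCII commas
# -f writes the query result to the given file instead of standard output (the console)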
(kt) root@ksp-registry:/opt# nvidia-smi -q -d ECC,POWER -f test01
(kt) root@ksp-registry:/opt# cat test01 

==============NVSMI LOG==============

Timestamp                                 : Thu Mar 20 15:28:46 2025
Driver Version                            : 550.54.15
CUDA Version                              : 12.4

Attached GPUs                             : 1
GPU 00000000:C1:00.0
    ECC Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable Parity     : 0
            SRAM Uncorrectable SEC-DED    : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable Parity     : 0
            SRAM Uncorrectable SEC-DED    : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
            SRAM Threshold Exceeded       : No
        Aggregate Uncorrectable SRAM Sources
            SRAM L2                       : 0
            SRAM SM                       : 0
            SRAM Microcontroller          : 0
            SRAM PCIE                     : 0
            SRAM Other                    : 0
    GPU Power Readings
        Power Draw                        : 21.68 W
        Current Power Limit               : 300.00 W
        Requested Power Limit             : 300.00 W
        Default Power Limit               : 300.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 300.00 W
    Power Samples
        Duration                          : 117.96 sec
        Number of Samples                 : 119
        Max                               : 26.65 W
        Min                               : 21.49 W
        Avg                               : 21.75 W
    GPU Memory Power Readings 
        Power Draw                        : N/A
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
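The `nvidia-smi -q` report is an indented "key : value" listing, so it can be turned into nested dictionaries by tracking indentation. The sketch below does that for an excerpt of the log above; `parse_query_log` is a hypothetical helper, and because NVSMI output is not guaranteed to be stable across driver releases, this kind of parsing should be treated as best effort.

```python
# Excerpt of the report shown above.
SAMPLE = """\
==============NVSMI LOG==============

Timestamp                                 : Thu Mar 20 15:28:46 2025
Driver Version                            : 550.54.15
CUDA Version                              : 12.4

Attached GPUs                             : 1
GPU 00000000:C1:00.0
    ECC Mode
        Current                           : Enabled
        Pending                           : Enabled
    GPU Power Readings
        Power Draw                        : 21.68 W
        Current Power Limit               : 300.00 W
"""


def parse_query_log(text):
    """Build nested dicts from the indentation of an `nvidia-smi -q` report."""
    root = {}
    stack = [(-1, root)]  # (indentation, dict collecting deeper entries)
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line.strip() or line.startswith("="):
            continue  # skip blank lines and the banner
        indent = len(line) - len(line.lstrip())
        while stack[-1][0] >= indent:
            stack.pop()  # leave sections that are at the same or deeper level
        parent = stack[-1][1]
        key, sep, value = line.partition(" : ")
        if sep:  # "key : value" entry
            parent[key.strip()] = value.strip()
        else:  # section header such as "ECC Mode" or "GPU 00000000:C1:00.0"
            section = {}
            parent[line.strip()] = section
            stack.append((indent, section))
    return root


report = parse_query_log(SAMPLE)
print(report["GPU 00000000:C1:00.0"]["GPU Power Readings"]["Power Draw"])
```

Splitting on " : " (with surrounding spaces) keeps values that contain colons, such as the timestamp and the GPU bus address, intact.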
        
# The query result can also be written as XML. Note that -d cannot be combined with -x or -u (the help text says "Doesn't work with -u or -x flags")
nvidia-smi -q -x -f test01

### Some common commands
# 1) Print the details of GPU 0
nvidia-smi -q -i 0


# 2) Show unit attributes rather than GPU (device) attributes
(kt) root@ksp-registry:/opt# nvidia-smi -q -u
==============NVSMI LOG==============

Timestamp                                 : Thu Mar 20 15:46:29 2025
Driver Version                            : 550.54.15
CUDA Version                              : 12.4

HIC Info                                  : N/A
Attached Units                            : 0


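# 3) Query the GPU repeatedly at a fixed interval, here every 5 (the default unit is seconds)
nvidia-smi -q -i 0 -l 5
nvidia-smi -q -i 0 -lms 500   # every 500 milliseconds
# -l can also be combined with -f to poll a GPU periodically and write the results to a file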
nvidia-smi -q -i 0 -l 5 -f ./gpu0.log

# 4) Query only selected metrics (list the supported ones with: nvidia-smi --help | egrep -w "\-d" -A 10)
nvidia-smi -q -d ECC,POWER -f test01
nvidia-smi -q -i 0 -l 5 -d PIDS,MEMORY,UTILIZATION -f ./gpu0.log


3.3 SELECTIVE QUERY OPTIONS

3.3.1 nvidia-smi --query-gpu


# These are the SELECTIVE QUERY OPTIONS, which let you name exactly the metrics and properties to query
# View the help text for nvidia-smi --query-gpu
(kt) root@ksp-registry:~# nvidia-smi --help-query-gpu

# The values accepted by --query-gpu are listed by "nvidia-smi --help-query-gpu"
nvidia-smi --query-gpu=gpu_name,driver_version,memory.total,memory.used --format=csv
# The value of "--format" must include csv; noheader and nounits can be added as well
# noheader suppresses the header row, nounits suppresses the units after the values
nvidia-smi --query-gpu=gpu_name,driver_version,memory.total,memory.used --format=csv,noheader
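When a script issues these queries, it helps to assemble the argument list from a field list and the two format switches. The following is a sketch under stated assumptions: `build_query_gpu_cmd` and `query_gpu` are made-up helpers, and `query_gpu` only works on a machine where nvidia-smi is on the PATH, so it is defined here but not called.

```python
import csv
import io
import subprocess


def build_query_gpu_cmd(fields, header=True, units=True):
    """Assemble the argv list for `nvidia-smi --query-gpu=... --format=csv[,...]`."""
    fmt = ["csv"]  # csv is mandatory in --format
    if not header:
        fmt.append("noheader")
    if not units:
        fmt.append("nounits")
    return ["nvidia-smi", "--query-gpu=" + ",".join(fields), "--format=" + ",".join(fmt)]


def query_gpu(fields):
    """Run the query and return one dict per GPU (requires nvidia-smi on PATH)."""
    cmd = build_query_gpu_cmd(fields, header=False, units=False)
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    reader = csv.reader(io.StringIO(out), skipinitialspace=True)
    return [dict(zip(fields, row)) for row in reader if row]


cmd = build_query_gpu_cmd(
    ["gpu_name", "driver_version", "memory.total", "memory.used"], header=False
)
print(" ".join(cmd))
```

Passing the command as a list avoids shell quoting issues, and dropping header and units makes the values easier to convert to numbers.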

3.3.2 nvidia-smi --query-compute-apps

# Query the applications that are currently running compute tasks on the GPU
# View the help text
nvidia-smi --help-query-compute-apps

(kt) root@ksp-registry:~# nvidia-smi --query-compute-apps=timestamp,gpu_name,gpu_bus_id,gpu_serial,gpu_uuid,pid,process_name,used_gpu_memory --format=csv
timestamp, gpu_name, gpu_bus_id, gpu_serial, gpu_uuid, pid, process_name, used_gpu_memory [MiB]
2025/03/22 18:13:47.456, NVIDIA A40, 00000000:C1:00.0, 1320922048526, GPU-c10058bc-1eae-1b32-ba0d-85d26c9ed9ff, 9507, python, 11110 MiB
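CSV output like the above is straightforward to load with Python's csv module. The sketch below parses the two lines shown above; `parse_csv_report` and `mib` are hypothetical helper names, and the code assumes every row has as many fields as the header and that no field contains a comma.

```python
import csv
import io

# Output copied from the command above.
SAMPLE = """\
timestamp, gpu_name, gpu_bus_id, gpu_serial, gpu_uuid, pid, process_name, used_gpu_memory [MiB]
2025/03/22 18:13:47.456, NVIDIA A40, 00000000:C1:00.0, 1320922048526, GPU-c10058bc-1eae-1b32-ba0d-85d26c9ed9ff, 9507, python, 11110 MiB
"""


def parse_csv_report(text):
    """Return one dict per data row, keyed by the header names."""
    # skipinitialspace drops the blank that nvidia-smi prints after each comma
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [{key.strip(): value.strip() for key, value in row.items()} for row in reader]


def mib(value):
    """Convert a value such as '11110 MiB' to the integer 11110."""
    return int(value.split()[0])


rows = parse_csv_report(SAMPLE)
for row in rows:
    print(row["pid"], row["process_name"], mib(row["used_gpu_memory [MiB]"]))
```

Note that the unit is part of the header name ("used_gpu_memory [MiB]") and of each value unless nounits is added to --format.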

3.3.3 nvidia-smi --query-supported-clocks

# Query the clock frequencies supported by the GPU and their combinations
# View the help text
nvidia-smi --help-query-supported-clocks

(kt) root@ksp-registry:~# nvidia-smi --query-supported-clocks=timestamp,gpu_name,gpu_bus_id,gpu_serial,gpu_uuid,memory,graphics --format=csv

3.3.4 nvidia-smi --query-accounted-apps

# On a system with GPU Accounting enabled, query how much GPU resource each application has used
# View the help text
nvidia-smi --help-query-accounted-apps

# Requires a GPU model and driver version that support accounting, accounting mode enabled on the GPU, and a process using the GPU
(self-llm) root@controller01:~# nvidia-smi --query-accounted-apps=timestamp,gpu_name,gpu_bus_id,gpu_serial,gpu_uuid,pid --format=csv
timestamp, gpu_name, gpu_bus_id, gpu_serial, gpu_uuid, pid
2025/03/22 18:12:36.487, NVIDIA A40, 00000000:41:00.0, 1320922048169, GPU-cfca7d85-be74-7c22-4385-6fd15d698cb4, 41337

3.3.5 nvidia-smi --query-retired-pages

# Query the list of retired GPU device memory pages
# View the help text
nvidia-smi --help-query-retired-pages

(kt) root@ksp-registry:~# nvidia-smi --query-retired-pages=gpu_name,retired_pages.timestamp,retired_pages.cause --format=csv
gpu_name, retired_pages.timestamp, retired_pages.cause
NVIDIA A40, [N/A], Single Bit ECC
NVIDIA A40, [N/A], Double Bit ECC

3.3.6 nvidia-smi --query-remapped-rows

# Query the row-remapping information of the GPU memory
# View the help text
nvidia-smi --help-query-remapped-rows

(kt) root@ksp-registry:~# nvidia-smi --query-remapped-rows=gpu_name,remapped_rows.correctable,remapped_rows.uncorrectable --format=csv
gpu_name, remapped_rows.correctable, remapped_rows.uncorrectable
NVIDIA A40, 0, 0

3.4 Device Monitoring

3.4.1 nvidia-smi dmon

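# Device-level monitoring: prints live monitoring data per GPU and keeps sampling (default interval 1 second) until Ctrl+C or until the requested number of samples is reached
# New lines are appended to standard output at a fixed interval
# View the help text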
nvidia-smi dmon --help

# Example
(kt) root@ksp-registry:~# nvidia-smi dmon
# gpu    pwr  gtemp  mtemp     sm    mem    enc    dec    jpg    ofa   mclk   pclk 
# Idx      W      C      C      %      %      %      %      %      %    MHz    MHz 
    0     90     48      -     11      8      0      0      0      0   7250   1740 
    0     90     48      -     10      7      0      0      0      0   7250   1740 
    0     92     48      -     10      7      0      0      0      0   7250   1740 
    0     88     48      -      7      5      0      0      0      0   7250   1740 
    0     92     48      -      9      6      0      0      0      0   7250   1740 
    0     92     49      -     10      7      0      0      0      0   7250   1740 
...

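# 1) -i selects the GPU to monitor and accepts one of three kinds of value: Idx, PCI bus ID, or UUID
# Idx is the GPU index starting from 0 and UUID is the GPU's unique identifier; both can be looked up with: nvidia-smi -L
# The PCI bus ID can be looked up with: nvidia-smi -q | grep -i "bus id"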
nvidia-smi dmon -i xxx

# 2) -d sets the sampling interval, in seconds
nvidia-smi dmon -d xxx

# 3) -c sets the number of samples to take before exiting
nvidia-smi dmon -c xxx

# 4) -s selects the metrics to monitor. The following are supported and several can be combined
[-s | --select]:      One or more metrics [default=puc]
                      Can be any of the following:
                          p - Power Usage and Temperature
                          u - Utilization
                          c - Proc and Mem Clocks
                          v - Power and Thermal Violations
                          m - FB, Bar1 and CC Protected Memory
                          e - ECC Errors and PCIe Replay errors
                          t - PCIe Rx and Tx Throughput
# For example, to monitor power and temperature together with GPU utilization:
nvidia-smi dmon -s pu

# 5) -o prefixes each sample with the date and/or time it was taken: D for date, T for time
nvidia-smi dmon -s pu -o DT

# 6) -f writes the monitoring data to the given file instead of standard output
nvidia-smi dmon -s pu -o DT -f gpu0.log
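A log written by dmon can be read back into per-sample records for plotting or alerting. The sketch below parses the default column layout from the dmon example earlier in this section; `parse_dmon` is a hypothetical helper, it assumes the log was produced without -o (no date or time columns) and that every value is an integer or "-".

```python
# First lines of the dmon example shown earlier (default metrics, no -o option).
SAMPLE = """\
# gpu    pwr  gtemp  mtemp     sm    mem    enc    dec    jpg    ofa   mclk   pclk
# Idx      W      C      C      %      %      %      %      %      %    MHz    MHz
    0     90     48      -     11      8      0      0      0      0   7250   1740
    0     90     48      -     10      7      0      0      0      0   7250   1740
"""


def parse_dmon(text):
    """Return one dict per sample line, keyed by the names in the first header line."""
    columns = None
    samples = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            if columns is None:  # first header line holds the column names
                columns = line.lstrip("#").split()
            continue  # units line and repeated headers are skipped
        if columns is None:
            raise ValueError("data line found before the header")
        values = line.split()
        # "-" marks a metric the GPU does not report
        samples.append({name: None if v == "-" else int(v) for name, v in zip(columns, values)})
    return samples


samples = parse_dmon(SAMPLE)
print(samples[0])
```

Because dmon repeats its header periodically in long runs, skipping every line that starts with "#" after the first keeps the parser working on large logs.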

3.4.2 nvidia-smi pmon

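# Process-level monitoring: prints live monitoring data per process using the GPU and keeps sampling (default interval 1 second) until Ctrl+C or until the requested number of samples is reached
# New lines are appended to standard output at a fixed interval
# View the help text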
nvidia-smi pmon --help

# Example
(kt) root@ksp-registry:~# nvidia-smi pmon
# gpu         pid   type     sm    mem    enc    dec    jpg    ofa    command 
# Idx           #    C/G      %      %      %      %      %      %    name 
    0       9507     C      0      0      -      -      -      -    python         
    0       9507     C      0      0      -      -      -      -    python         
    0       9507     C      0      0      -      -      -      -    python         
    0       9507     C      0      0      -      -      -      -    python         
...

# The common options of "nvidia-smi pmon" are largely the same as those of "nvidia-smi dmon"

3.5 DEVICE MODIFICATION OPTIONS

# View the help text; see the "DEVICE MODIFICATION OPTIONS" part of the output of the following command
nvidia-smi --help

# Set persistence mode, 0/DISABLED, 1/ENABLED
nvidia-smi -pm 0/1

# Toggle ECC support, 0/DISABLED, 1/ENABLED
nvidia-smi -e 0/1

# Reset the ECC error counts, 0/VOLATILE, 1/AGGREGATE
nvidia-smi -p 0/1

# Set the GPU compute mode, 0/DEFAULT, 1/EXCLUSIVE_THREAD (DEPRECATED),
# 2/PROHIBITED, 3/EXCLUSIVE_PROCESS
nvidia-smi -c xx

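# Set the clocks at which applications run on the GPU (for the clocks the GPU supports, see nvidia-smi --query-supported-clocks)
# Takes two values, <memory,graphics>: the memory clock and the graphics (core) clock, in that order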
nvidia-smi -ac 2000,800

# Reset the application clocks set above
nvidia-smi -rac

# Enable or disable GPU accounting mode, 0/DISABLED, 1/ENABLED
nvidia-smi -am 0/1

# Enable or disable MIG, 0/DISABLED, 1/ENABLED (may need a GPU reset or a server reboot to take effect)
nvidia-smi -mig 0/1

#...

https://jiangsanyin.github.io/2025/03/20/nvidia-smi相关命令执行时输出内容/

Author: sanyinjiang

Published: March 20, 2025
