Routine GPU check: steps for testing graphics cards with gpu-burn (GPUBURN)

2023-07-27 16:40:43 NJTST 3850

gpu-burn is an essential tool in our basic server checks.

=========================================

http://wili.cc/blog/gpu-burn.html

https://github.com/wilicc/gpu-burn

=========================================

1. Download the software on Linux

wget -O gpu-burn-master.zip https://codeload.github.com/wilicc/gpu-burn/zip/master

Easy docker build and run

git clone  

cd gpu-burn

docker build -t gpu_burn .

docker run --rm --gpus all gpu_burn


2. Unzip the archive

unzip gpu-burn-master.zip

3. Enter the directory and build (make sure the CUDA environment variables are configured, i.e. nvcc -V prints version information)

cd gpu-burn-master

make

4. After a successful build, the file gpu_burn is generated in the current directory

gpu_burn

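5. Run with the defaults, which uses every GPU. The argument after the space is the run time in seconds; we usually set 100 for a quick test and 500 for a stability test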

[root@localhost gpu-burn-master]# ./gpu_burn 100

GPU 0: Tesla V100 (UUID: GPU-6250466c-35ed-c279-fc0b-3b9b613a586f)

GPU 1: Tesla V100 (UUID: GPU-0a4a2b9c-d32c-1ba2-42a0-151ed9907d57)

GPU 2: Tesla V100 (UUID: GPU-f6cf184f-9173-1edd-648f-71e841afe152)

GPU 3: Tesla V100 (UUID: GPU-044f96e6-cc66-cc93-6283-07b829216f91)

Initialized device 2 with 11178 MB of memory (10993 MB available, using 9894 MB of it), using FLOATS

Initialized device 1 with 11178 MB of memory (10993 MB available, using 9894 MB of it), using FLOATS

Initialized device 3 with 11178 MB of memory (10993 MB available, using 9894 MB of it), using FLOATS

Initialized device 0 with 11178 MB of memory (10993 MB available, using 9894 MB of it), using FLOATS

6. You can run on selected cards only, for example cards 0 and 1

export CUDA_VISIBLE_DEVICES=0,1

./gpu_burn 100
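
An export like the one above stays in effect for the rest of the shell session. As an alternative, the variable can be set for a single command only. The following is a minimal sketch; the helper name run_on is hypothetical and not part of gpu-burn.

```shell
#!/bin/sh
# run_on: hypothetical helper (not part of gpu-burn).
# Runs a command with CUDA_VISIBLE_DEVICES set only for that command,
# so the calling shell's environment is left unchanged.
# Usage: run_on DEVICE_LIST COMMAND [ARGS...]
run_on() {
    device_list=$1
    shift
    env CUDA_VISIBLE_DEVICES="$device_list" "$@"
}

# Example (needs a built gpu_burn in the current directory):
#   run_on 0,1 ./gpu_burn 100
```

Because the variable is passed through env, a later ./gpu_burn call in the same shell sees all cards again.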

How to find the faulty card

1. Run dmesg -l err to filter out the Bus-Id of the failing card

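2. Use the Bus-Id to find the corresponding GPU index and exclude that card when running the test. For example, on a machine with 8 cards where device 2 is faulty, set the variable like this: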

export CUDA_VISIBLE_DEVICES=0,1,3,4,5,6,7 # device 2 is left out

./gpu_burn 100

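3. After the run, shut the machine down and find the card that has not heated up. It was excluded from the test, so it stayed cool; that is the faulty card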
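
The Bus-Id lookup and the exclusion list from the steps above can be sketched as two small shell helpers. This is only an illustration under stated assumptions: the function names are hypothetical, the bus id in the example is made up, and the input is assumed to be the "index, pci.bus_id" lines that nvidia-smi prints with --query-gpu=index,pci.bus_id --format=csv,noheader.

```shell
#!/bin/sh
# index_for_bus_id BUS_ID (hypothetical helper):
# reads "index, pci.bus_id" lines on stdin and prints the index of the
# first device whose bus id ends with BUS_ID (case-insensitive).
index_for_bus_id() {
    awk -F', *' -v want="$1" '
        BEGIN { want = toupper(want) }
        {
            id = toupper($2)
            # Compare the tail, so "3B:00.0" also matches "00000000:3B:00.0".
            if (length(id) >= length(want) &&
                substr(id, length(id) - length(want) + 1) == want) {
                print $1
                exit
            }
        }'
}

# visible_except TOTAL BAD (hypothetical helper):
# prints 0..TOTAL-1 as a comma-separated list, leaving out index BAD.
visible_except() {
    total=$1
    bad=$2
    list=
    i=0
    while [ "$i" -lt "$total" ]; do
        if [ "$i" -ne "$bad" ]; then
            list=${list:+$list,}$i
        fi
        i=$((i + 1))
    done
    printf '%s\n' "$list"
}

# Example (needs an NVIDIA driver; 3B:00.0 is a made-up bus id):
#   bad=$(nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader \
#         | index_for_bus_id 3B:00.0)
#   export CUDA_DEVICE_ORDER=PCI_BUS_ID
#   export CUDA_VISIBLE_DEVICES=$(visible_except 8 "$bad")
```

Setting CUDA_DEVICE_ORDER=PCI_BUS_ID is included because CUDA's default device numbering is not guaranteed to match the order nvidia-smi reports.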

==============================================================

Building

To build GPU Burn:

make

To remove artifacts built by GPU Burn:

make clean

GPU Burn builds with a default Compute Capability of 5.0. To override this with a different value:

make COMPUTE=
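
One way to choose the value is to ask the driver for the card's compute capability. The sketch below rests on two assumptions that the text above does not confirm: that the installed nvidia-smi supports the compute_cap query field (newer drivers do), and that the Makefile accepts the capability written without the dot, for example 70 for 7.0. The helper name compute_value is hypothetical.

```shell
#!/bin/sh
# compute_value CAP (hypothetical helper):
# turns a dotted compute capability such as "7.0" into "70".
compute_value() {
    printf '%s\n' "$1" | tr -d '.'
}

# Example (needs an NVIDIA driver whose nvidia-smi knows compute_cap):
#   cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
#   make COMPUTE="$(compute_value "$cap")"
```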

CFLAGS can be added when invoking make to add to the default list of compiler flags:

make CFLAGS=-Wall

LDFLAGS can be added when invoking make to add to the default list of linker flags:

make LDFLAGS=-lmylib

NVCCFLAGS can be added when invoking make to add to the default list of nvcc flags:

make NVCCFLAGS=-ccbin

CUDAPATH can be added to point to a non standard install or specific version of the cuda toolkit (default is /usr/local/cuda):

make CUDAPATH=/usr/local/cuda-

CCPATH can be specified to point to a specific gcc (default is /usr/bin):

make CCPATH=/usr/local/bin

CUDA_VERSION and IMAGE_DISTRO can be used to override the base images used when building the Docker image target, while IMAGE_NAME can be set to change the resulting image tag:

make IMAGE_NAME=myregistry.private.com/gpu-burn CUDA_VERSION=12.0.1 IMAGE_DISTRO=ubuntu22.04 image

Usage

GPU Burn
Usage: gpu_burn [OPTIONS] [TIME]

-m X   Use X MB of memory
-m N%  Use N% of the available GPU memory
-d     Use doubles
-tc    Try to use Tensor cores (if available)
-l     List all GPUs in the system
-i N   Execute only on GPU N
-h     Show this help message

Example:
gpu_burn -d 3600
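
For long runs it can help to keep the output in a log and pull the failing devices out of it afterwards. The sketch below assumes that gpu_burn ends with one summary line per device of the form "GPU N: OK" or "GPU N: FAULTY"; that format is not shown in this article, so check it against your own output first. The helper name faulty_gpus and the file name burn.log are hypothetical.

```shell
#!/bin/sh
# faulty_gpus (hypothetical helper):
# reads gpu_burn output on stdin and prints the index of every device
# whose summary line reads "GPU N: FAULTY" (assumed summary format).
faulty_gpus() {
    sed -n 's/^[[:space:]]*GPU \([0-9][0-9]*\): FAULTY[[:space:]]*$/\1/p'
}

# Example (needs a built gpu_burn in the current directory):
#   ./gpu_burn -d 3600 | tee burn.log
#   faulty_gpus < burn.log
```

An empty result then means no device was reported as faulty under that assumed format.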
