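On Windows, HDTune can show the state of a disk, so you do not have to wait until the disk has died to find out something was wrong. On CentOS, smartmontools reads the SMART (Self-Monitoring, Analysis and Reporting Technology) data and performs the same kind of disk health check.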
http://www.smartmontools.org/
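The examples below use a Dell R720 server: /dev/sda is an ordinary 1 TB SCSI disk, and /dev/sdd is a RAID 5 array built from three disks.
# df -h    # show the disk names
# dmesg | grep sdd    # look up the disk in the boot messages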
sd 0:2:0:0: [sdd] Attached SCSI disk
# hdparm -I /dev/sda    # show the disk's hardware information, enabled features and so on; the output is very detailed
Next, use smartctl to check the state of the disks:
# yum install smartmontools    # install smartmontools
# smartctl -H /dev/sdd    # check disk health
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.10.56-11.el6.centos.alt.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

SMART Health Status: OK
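For scripting, the health verdict can be pulled out of the `smartctl -H` output. The following is a minimal sketch, not part of the original setup: `parse_health` is a hypothetical helper name, and it assumes the two verdict lines smartctl prints, "SMART Health Status:" for SCSI/SAS devices (as shown above) and "SMART overall-health self-assessment test result:" for ATA/SATA devices.

```shell
#!/bin/bash
# Hypothetical helper: read "smartctl -H" output on stdin and print the verdict.
# Prints UNKNOWN when neither verdict line is present.
parse_health() {
    awk -F': *' '
        # SCSI/SAS devices report e.g. "SMART Health Status: OK"
        /^SMART Health Status:/ { print $2; found = 1 }
        # ATA/SATA devices report e.g. "... test result: PASSED"
        /^SMART overall-health self-assessment test result:/ { print $2; found = 1 }
        END { if (!found) print "UNKNOWN" }
    '
}

# Example (needs root and a real device):
#   smartctl -H /dev/sdd | parse_health
```

Splitting on the colon rather than counting fields keeps the helper working for both line formats, since the verdict is always the text after the colon.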
# smartctl -a /dev/sda    # all SMART information for the disk; the long form is smartctl --all /dev/sda
# smartctl -a /dev/sdd
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.10.56-11.el6.centos.alt.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor:               DELL
Product:              PERC H310
Revision:             2.12
User Capacity:        598,879,502,336 bytes [598 GB]
Logical block size:   512 bytes
Logical Unit id:
Serial number:
Device type:          disk
Local Time is:        Wed Jan 14 15:37:39 2015 CST
Device does not support SMART

Error Counter logging not supported
Device does not support Self Test logging
The output says "Device does not support SMART": /dev/sdd is the logical volume presented by the PERC H310 RAID controller, not a physical disk. The member disks therefore have to be queried through the controller, as follows.
Check the state of the first disk in the RAID 5 array:
# smartctl -a -d megaraid,0 /dev/sdd
Check the second and third disks the same way and, depending on your own monitoring setup, add Nagios or Zabbix alerts for them:
# smartctl -a -d megaraid,1 /dev/sdd
# smartctl -a -d megaraid,2 /dev/sdd
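The three commands above differ only in the disk index, so they can be wrapped in a loop. This is a rough sketch under stated assumptions: `report_members` is a hypothetical function name, the disk count is passed in by hand rather than discovered, and the indices are assumed to run from 0 upward as in the commands above.

```shell
#!/bin/bash
# Path to smartctl; can be overridden through the environment.
SMARTCTL="${SMARTCTL:-/usr/sbin/smartctl}"

# Hypothetical helper: print the full SMART report for each member disk
# behind a MegaRAID/PERC controller.
#   $1 = block device that identifies the controller (e.g. /dev/sdd)
#   $2 = number of member disks (assumed indices 0 .. N-1)
report_members() {
    local dev="$1"
    local count="$2"
    local i=0
    while [ "$i" -lt "$count" ]; do
        echo "== $dev member disk $i =="
        "$SMARTCTL" -a -d "megaraid,$i" "$dev"
        i=$((i + 1))
    done
}

# Example (needs root):
#   report_members /dev/sdd 3
```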
For everything else smartctl can do, the built-in help is very detailed:
# smartctl -h
Usage: smartctl [options] device

============================================ SHOW INFORMATION OPTIONS =====

  -h, --help, --usage
         Display this help and exit

  -V, --version, --copyright, --license
         Print license, copyright, and version information and exit

  -i, --info
         Show identity information for device

  -g NAME, --get=NAME
        Get device setting: all, aam, apm, lookahead, security, wcache

  -a, --all
         Show all SMART information for device

  -x, --xall
         Show all information for device

  --scan
         Scan for devices

  --scan-open
         Scan for devices and try to open each device

================================== SMARTCTL RUN-TIME BEHAVIOR OPTIONS =====

  -q TYPE, --quietmode=TYPE                                           (ATA)
         Set smartctl quiet mode to one of: errorsonly, silent, noserial

  -d TYPE, --device=TYPE
         Specify device type to one of: ata, scsi, sat[,auto][,N][+TYPE], usbcypress[,X], usbjmicron[,x][,N], usbsunplus, marvell, areca,N/E, 3ware,N, hpt,L/M/N, megaraid,N, cciss,N, auto, test

  -T TYPE, --tolerance=TYPE                                           (ATA)
         Tolerance: normal, conservative, permissive, verypermissive

  -b TYPE, --badsum=TYPE                                              (ATA)
         Set action on bad checksum to one of: warn, exit, ignore

  -r TYPE, --report=TYPE
         Report transactions (see man page)

  -n MODE, --nocheck=MODE                                             (ATA)
         No check if: never, sleep, standby, idle (see man page)

============================== DEVICE FEATURE ENABLE/DISABLE COMMANDS =====

  -s VALUE, --smart=VALUE
        Enable/disable SMART on device (on/off)

  -o VALUE, --offlineauto=VALUE                                       (ATA)
        Enable/disable automatic offline testing on device (on/off)

  -S VALUE, --saveauto=VALUE                                          (ATA)
        Enable/disable Attribute autosave on device (on/off)

  -s NAME[,VALUE], --set=NAME[,VALUE]
        Enable/disable/change device setting: aam,[N|off], apm,[N|off],
        lookahead,[on|off], security-freeze, standby,[N|off|now],
        wcache,[on|off]

======================================= READ AND DISPLAY DATA OPTIONS =====

  -H, --health
        Show device SMART health status

  -c, --capabilities                                                  (ATA)
        Show device SMART capabilities

  -A, --attributes
        Show device SMART vendor-specific Attributes and values

  -f FORMAT, --format=FORMAT                                          (ATA)
        Set output format for attributes: old, brief, hex[,id|val]

  -l TYPE, --log=TYPE
        Show device log. TYPE: error, selftest, selective, directory[,g|s],
                               xerror[,N][,error], xselftest[,N][,selftest],
                               background, sasphy[,reset], sataphy[,reset],
                               scttemp[sts,hist], scttempint,N[,p],
                               scterc[,N,M], devstat[,N], ssd,
                               gplog,N[,RANGE], smartlog,N[,RANGE]

  -v N,OPTION , --vendorattribute=N,OPTION                            (ATA)
        Set display OPTION for vendor Attribute N (see man page)

  -F TYPE, --firmwarebug=TYPE                                         (ATA)
        Use firmware bug workaround: none, samsung, samsung2,
                                     samsung3, swapid

  -P TYPE, --presets=TYPE                                             (ATA)
        Drive-specific presets: use, ignore, show, showall

  -B [+]FILE, --drivedb=[+]FILE                                       (ATA)
        Read and replace [add] drive database from FILE
        [default is +/etc/smart_drivedb.h
         and then    /usr/share/smartmontools/drivedb.h]

============================================ DEVICE SELF-TEST OPTIONS =====

  -t TEST, --test=TEST
        Run test. TEST: offline, short, long, conveyance, force, vendor,N,
                        select,M-N, pending,N, afterselect,[on|off]

  -C, --captive
        Do test in captive mode (along with -t)

  -X, --abort
        Abort any non-captive test on device

=================================================== SMARTCTL EXAMPLES =====

  smartctl --all /dev/hda                    (Prints all SMART information)

  smartctl --smart=on --offlineauto=on --saveauto=on /dev/hda
                                              (Enables SMART on first disk)

  smartctl --test=long /dev/hda          (Executes extended disk self-test)

  smartctl --attributes --log=selftest --quietmode=errorsonly /dev/hda
                                      (Prints Self-Test & Attribute errors)
  smartctl --all --device=3ware,2 /dev/sda
  smartctl --all --device=3ware,2 /dev/twe0
  smartctl --all --device=3ware,2 /dev/twa0
  smartctl --all --device=3ware,2 /dev/twl0
          (Prints all SMART info for 3rd ATA disk on 3ware RAID controller)
  smartctl --all --device=hpt,1/1/3 /dev/sda
          (Prints all SMART info for the SATA disk attached to the 3rd PMPort
           of the 1st channel on the 1st HighPoint RAID controller)
  smartctl --all --device=areca,3/1 /dev/sg2
          (Prints all SMART info for 3rd ATA disk of the 1st enclosure
           on Areca RAID controller)
http://linux-wiki.cn/wiki/zh-hans/SSD_(%E5%9B%BA%E6%80%81%E7%A1%AC%E7%9B%98)
Nagios setup
The script below checks the RAID 5 array, which has three disks in total.
root@web: /usr/local/nagios/libexec # vim check_disk_status.sh
#!/bin/bash
#
# Nagios plugin exit codes
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2

SMARTCTL="/usr/sbin/smartctl"
# With -d megaraid,N the device name only selects the controller
CHECK_DISK="/dev/sda"

# First member disk
DISK_HEALTH1=$($SMARTCTL -a -d megaraid,0 $CHECK_DISK | grep "SMART Health Status" | awk '{print $4}')
if [ "$DISK_HEALTH1" = "OK" ] || [ "$DISK_HEALTH1" = "PASSED" ]; then
    echo "OK - $CHECK_DISK 1 status is $DISK_HEALTH1 "
else
    echo "CRITICAL - $CHECK_DISK status is $DISK_HEALTH1 "
    exit $STATE_CRITICAL
fi

# Second member disk
DISK_HEALTH2=$($SMARTCTL -a -d megaraid,1 $CHECK_DISK | grep "SMART Health Status" | awk '{print $4}')
if [ "$DISK_HEALTH2" = "OK" ] || [ "$DISK_HEALTH2" = "PASSED" ]; then
    echo "OK - $CHECK_DISK 2 status is $DISK_HEALTH2 "
else
    echo "CRITICAL - $CHECK_DISK status is $DISK_HEALTH2 "
    exit $STATE_CRITICAL
fi

# Third member disk
DISK_HEALTH3=$($SMARTCTL -a -d megaraid,2 $CHECK_DISK | grep "SMART Health Status" | awk '{print $4}')
if [ "$DISK_HEALTH3" = "OK" ] || [ "$DISK_HEALTH3" = "PASSED" ]; then
    echo "OK - $CHECK_DISK 3 status is $DISK_HEALTH3 "
else
    echo "CRITICAL - $CHECK_DISK status is $DISK_HEALTH3 "
    exit $STATE_CRITICAL
fi

exit $STATE_OK
# chmod 755 check_disk_status.sh
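The three near-identical blocks in the script can also be written as a loop. The version below is only a sketch of that idea, not the script deployed in this post: `check_raid_disks` is a hypothetical function name, the disk count is an assumed parameter, and unlike the script above it keeps checking the remaining disks after a failure and reports CRITICAL at the end.

```shell
#!/bin/bash
# Nagios plugin exit codes used by this sketch.
STATE_OK=0
STATE_CRITICAL=2
# Path to smartctl; can be overridden through the environment.
SMARTCTL="${SMARTCTL:-/usr/sbin/smartctl}"

# Hypothetical helper: check every member disk behind the controller.
#   $1 = device that identifies the controller, $2 = number of member disks
# Prints one line per disk and returns STATE_CRITICAL if any disk is not OK/PASSED.
check_raid_disks() {
    local dev="$1"
    local count="$2"
    local rc="$STATE_OK"
    local i=0
    local health
    while [ "$i" -lt "$count" ]; do
        # Take the text after the colon on the health verdict line.
        health=$("$SMARTCTL" -H -d "megaraid,$i" "$dev" |
                 awk -F': *' '/^SMART Health Status:|^SMART overall-health/ {print $2}')
        case "$health" in
            OK|PASSED)
                echo "OK - $dev $((i + 1)) status is $health" ;;
            *)
                echo "CRITICAL - $dev $((i + 1)) status is ${health:-unknown}"
                rc="$STATE_CRITICAL" ;;
        esac
        i=$((i + 1))
    done
    return "$rc"
}

# Example plugin body (needs root):
#   check_raid_disks /dev/sda 3
#   exit $?
```

Checking all disks before exiting means a single run shows every failing member, at the cost of a slightly longer check.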
vim /usr/local/nagios/etc/nrpe.cfg
command[check_disk_status]=/usr/bin/sudo /usr/local/nagios/libexec/check_disk_status.sh
/usr/sbin/smartctl has to run as root to read the disk state, so the nagios user is allowed to run the script through sudo without a password:
vim /etc/sudoers
#Defaults requiretty
nagios ALL=(ALL) NOPASSWD:/usr/local/nagios/libexec/check_disk_status.sh
Test it by running the command on the Nagios server:
root@nagios: /usr/local/nagios/libexec # ./check_nrpe -H 192.168.2.2 -c check_disk_status
OK - /dev/sda 1 status is OK
OK - /dev/sda 2 status is OK
OK - /dev/sda 3 status is OK
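Define the Nagios service
define service{
        use                     linux-service
        host_name               192_168_2_2
        service_description     check disk status
        check_command           check_nrpe!check_disk_status
        }
Then set the check interval to once a day. That avoids querying the disks constantly, which is not good for them either.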
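Reference: http://blog.chinaunix.net/uid-20592013-id-2436813.html
Run a script and send mail
The simplest approach is to add the script to crontab and just read the mail it sends. The script is below.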