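http://nolinux.blog.51cto.com/4824967/1665075
The post linked above covers the basics of server hardware monitoring and mentions the check_openmanage tool at the end.
This article focuses on what that tool does for server hardware monitoring.
1. Introduction to check_openmanage
check_openmanage is a Nagios plugin that checks the health of Dell servers running OpenManage Server Administrator (OMSA). It relies on OMSA for its data and reports on the storage subsystem, power supplies, temperatures, and other components.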
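Official site: http://folk.uio.no/trondham/software/check_openmanage.html
Latest version download link: http://folk.uio.no/trondham/software/files/check_openmanage-3.7.12.tar.gz
Architecture:
As the figure above shows, Nagios can collect the monitoring data in two ways.
(1) check_nrpe on the Nagios server calls check_openmanage on the monitored host. This requires OMSA and check_openmanage on the monitored host.
(2) The Nagios server runs check_openmanage itself and monitors the host remotely. This requires perl-Net-SNMP on the Nagios server, and SNMP plus OMSA on the monitored host.
Note:
Because check_nrpe in the first approach consumes resources on the monitored server, the second approach is recommended. The second approach also suits operations environments that monitor with Zabbix.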
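The two approaches differ only in where the plugin runs. As a rough sketch, the helper below prints the command the Nagios server would execute in each mode. The check_nrpe path, the plugin path, the function name om_command, and the NRPE command name check_openmanage are all assumptions for illustration, not values taken from this post.

```shell
# Hypothetical install locations; adjust to your environment.
CHECK_NRPE="/usr/lib64/nagios/plugins/check_nrpe"
CHECK_OM="/usr/local/src/check_openmanage/check_openmanage"

# Print the command line for a given mode ("nrpe" or "snmp") and host.
om_command() {
    mode=$1
    host=$2
    case "$mode" in
        nrpe)
            # Approach 1: NRPE runs check_openmanage on the monitored host.
            # Assumes nrpe.cfg on that host defines a command named check_openmanage.
            printf '%s -H %s -c check_openmanage\n' "$CHECK_NRPE" "$host"
            ;;
        snmp)
            # Approach 2: the plugin runs on the Nagios server and queries SNMP.
            printf '%s -H %s\n' "$CHECK_OM" "$host"
            ;;
        *)
            printf 'unknown mode: %s\n' "$mode" >&2
            return 1
            ;;
    esac
}
```

For example, `om_command snmp 192.168.0.210` prints the direct SNMP invocation, which is also the form a Zabbix external script could call.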
2. Installing check_openmanage
Installing check_openmanage is very simple: fetch the package and unpack it. Since it is available both from a git repository and as a .tar.gz archive, two installation methods are shown here.
Method 1:
[root@kvm-phy04-jz ~]# cd /usr/local/src
[root@kvm-phy04-jz src]# git clone git://git.uio.no/check_openmanage
[root@kvm-phy04-jz src]# cd check_openmanage
[root@kvm-phy04-jz check_openmanage]# ./check_openmanage   # with no arguments, prints the server's warning and critical alerts
Method 2:
[root@kvm-phy04-jz ~]# cd /usr/local/src
[root@kvm-phy04-jz src]# wget http://folk.uio.no/trondham/software/files/check_openmanage-3.7.11.tar.gz
[root@kvm-phy04-jz src]# tar zxf check_openmanage-3.7.11.tar.gz
[root@kvm-phy04-jz src]# cd check_openmanage-3.7.11
[root@kvm-phy04-jz check_openmanage-3.7.11]# ./check_openmanage
Note:
If the output reports "Storage Error", add the --no-storage option:
[root@kvm-phy04-jz check_openmanage-3.7.11]# ./check_openmanage --no-storage
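If you script this, the retry can be automated. The following is a minimal sketch, assuming the plugin output contains the literal phrase "Storage Error" quoted above; the function name run_om_with_fallback is made up for illustration.

```shell
# Run the plugin; if its output mentions "Storage Error", retry with --no-storage.
# $1 is the path to check_openmanage; remaining arguments are passed through.
run_om_with_fallback() {
    plugin=$1
    shift
    status=0
    output=$("$plugin" "$@") || status=$?
    case "$output" in
        *"Storage Error"*)
            # Second attempt without the storage checks.
            status=0
            output=$("$plugin" --no-storage "$@") || status=$?
            ;;
    esac
    printf '%s\n' "$output"
    return "$status"
}
```

A call such as `run_om_with_fallback ./check_openmanage -s` keeps the plugin's own exit status, so it can still be used as a check command.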
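3. Using check_openmanage in detail
check_openmanage offers many options and parameters. Because the official help text is only in English, I have translated and annotated it here based on my own experience, to help you get started with the tool quickly.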
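[General options]
-f, --config            # specify a configuration file
-p, --perfdata          # output performance data; often combined with --only, do not combine with -d
-t, --timeout TIME      # set the execution timeout for check_openmanage
-c, --critical          # custom critical threshold for temperature
-w, --warning           # custom warning threshold for temperature
-F, --fahrenheit        # use Fahrenheit as the temperature unit
-d, --debug             # show all checked items
-h, --help              # show the check_openmanage help text
-V, --version           # show the check_openmanage version
[SNMP options]
-H, --hostname          # use SNMP to fetch hardware information from the given hostname or IP
-C, --community         # custom SNMP community string, default public
-P, --protocol          # custom SNMP protocol version, default 2c
--port                  # custom SNMP port, default 161
-6, --ipv6              # use IPv6 instead of IPv4, default no
--tcp                   # use TCP instead of UDP, default no
[Output options]
-i, --info              # prefix alerts with the server's serial number (SN)
-e, --extinfo           # output system information
-s, --state             # prefix each message with its alert level, such as WARNING or CRITICAL
-S, --short-state       # prefix each message with the abbreviated alert level, such as W or C
-o, --okinfo            # output the information on one line (default)
-B, --show-blacklist    # show the blacklist; useful for reviewing it once many items have been blacklisted
-I, --htmlinfo          # output HTML with clickable links
[Check control and blacklisting]
-a, --all               # also check log statistics and detailed log content
-b, --blacklist component=ID   # blacklist: suppress output for the given ID of a component. IDs are shown by ./check_openmanage -d. Has no effect when combined with -d
--only                  # output only one kind of monitoring data
--check component=[0|1],esmlog=[0|1]   # check a single item or a combination of items; 0 disables a check, 1 enables it; use on its own
--no-storage            # do not check storage
--vdisk-critical        # raise any alert on a virtual disk to critical
[Custom output]
--postmsg 'custom message'   # append the custom message to the end of the output
The following variables can be used in the custom message:
%m    # system model
%s    # system serial number (SN)
%b    # BIOS version
%d    # BIOS release date
%o    # operating system name
%r    # operating system kernel version
%p    # number of physical disks
%l    # number of logical disks
%n    # a line break
%%    # a literal percent sign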
References:
(1) http://folk.uio.no/trondham/software/check_openmanage.html#download
(2) check_openmanage -h
4. Practical examples
Because check_openmanage has many options, it can be confusing in practice, so here are some common needs with the matching command combinations. As described above, check_openmanage can obtain information in two ways; the examples here mainly cover the first part of approach one, that is, running the check_openmanage command locally.
(1) Running without any arguments
With no arguments, it prints the server's warning and critical alerts by default.
[root@kvm-phy04-jz check_openmanage-3.7.11]# ./check_openmanage
Controller 0 [PERC H310 Mini]: Firmware '20.12.1-0002' is out of date
Physical Disk 0:1:0 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
Physical Disk 0:1:1 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
Physical Disk 0:1:2 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
Physical Disk 0:1:3 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
Physical Disk 0:1:4 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
Physical Disk 0:1:5 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
(2) Output messages prefixed with their state
[root@kvm-phy04-jz check_openmanage-3.7.11]# ./check_openmanage -s
WARNING: Controller 0 [PERC H310 Mini]: Firmware '20.12.1-0002' is out of date
WARNING: Physical Disk 0:1:0 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
WARNING: Physical Disk 0:1:1 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
WARNING: Physical Disk 0:1:2 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
WARNING: Physical Disk 0:1:3 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
WARNING: Physical Disk 0:1:4 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
WARNING: Physical Disk 0:1:5 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
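The state prefixes make the output easy to summarize, for instance to feed a numeric item into a system such as Zabbix. The sketch below counts alert lines by state. The function name count_om_states is hypothetical, and the "CRITICAL:" prefix is assumed by analogy with the "WARNING:" prefix shown above.

```shell
# Count alert lines by state in `check_openmanage -s` output read from stdin.
# Prints a single line of the form "warning=N critical=M".
count_om_states() {
    awk '
        /^WARNING:/  { w++ }
        /^CRITICAL:/ { c++ }   # prefix assumed by analogy with WARNING:
        END { printf "warning=%d critical=%d\n", w, c }
    '
}
```

It would typically be used as `./check_openmanage -s | count_om_states`.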
(3) Use the blacklist to suppress the firmware update notice
[root@kvm-phy04-jz check_openmanage-3.7.11]# ./check_openmanage -s -b ctrl_fw=0
WARNING: Physical Disk 0:1:0 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
WARNING: Physical Disk 0:1:1 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
WARNING: Physical Disk 0:1:2 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
WARNING: Physical Disk 0:1:3 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
WARNING: Physical Disk 0:1:4 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
WARNING: Physical Disk 0:1:5 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
(4) Use the blacklist to suppress the "Not Certified" disk notices
[root@kvm-phy04-jz check_openmanage-3.7.11]# ./check_openmanage -s -b pdisk_cert=all
WARNING: Controller 0 [PERC H310 Mini]: Firmware '20.12.1-0002' is out of date
(5) Use the blacklist to suppress the firmware update notice for controller ID 0 and the "Not Certified" notice for the physical disk with ID 0:0:1:0
[root@kvm-phy04-jz check_openmanage-3.7.11]# ./check_openmanage -b ctrl_fw=0\/pdisk=0:0:1:0
Physical Disk 0:1:1 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
Physical Disk 0:1:2 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
Physical Disk 0:1:3 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
Physical Disk 0:1:4 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
Physical Disk 0:1:5 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
(6) Use the blacklist to suppress the firmware update notice for controller ID 0 and the notices for all uncertified physical disks
[root@kvm-phy04-jz check_openmanage-3.7.11]# ./check_openmanage -b ctrl_fw=0\/pdisk=ALL
OK - System: 'PowerEdge R720', SN: '33R0G42', 32 GB ram (4 dimms), 1 logical drives, 6 physical drives
(7) Output every checked item
[root@kvm-phy04-jz check_openmanage-3.7.11]# ./check_openmanage -d
   System:      PowerEdge R720           OMSA version:    8.1.0
   ServiceTag:  33R0G42                  Plugin version:  3.7.11
   BIOS/date:   2.4.3 07/09/2014         Checking mode:   local
-----------------------------------------------------------------------------
   Storage Components
=============================================================================
  STATE  |    ID    |  MESSAGE TEXT
---------+----------+--------------------------------------------------------
 WARNING |        0 | Controller 0 [PERC H310 Mini]: Firmware '20.12.1-0002' is out of date
      OK |        0 | Controller 0 [PERC H310 Mini] is Degraded
 WARNING |  0:0:1:0 | Physical Disk 0:1:0 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
 WARNING |  0:0:1:1 | Physical Disk 0:1:1 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
 WARNING |  0:0:1:2 | Physical Disk 0:1:2 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
 WARNING |  0:0:1:3 | Physical Disk 0:1:3 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
 WARNING |  0:0:1:4 | Physical Disk 0:1:4 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
 WARNING |  0:0:1:5 | Physical Disk 0:1:5 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
      OK |      0:0 | Logical Drive '/dev/sda' [RAID-10, 836.63 GB] is Ready
      OK |      0:0 | Connector 0 [SAS Port RAID Mode] on controller 0 is Ready
      OK |      0:1 | Connector 1 [SAS Port RAID Mode] on controller 0 is Ready
      OK |    0:0:1 | Enclosure 0:0:1 [Backplane] on controller 0 is Ready
-----------------------------------------------------------------------------
   Chassis Components
=============================================================================
  STATE  |  ID  |  MESSAGE TEXT
---------+------+------------------------------------------------------------
      OK |    0 | Memory module 0 [DIMM_A1, 8192 MB] is Ok
      OK |    1 | Memory module 1 [DIMM_A2, 8192 MB] is Ok
      OK |    2 | Memory module 2 [DIMM_B1, 8192 MB] is Ok
      OK |    3 | Memory module 3 [DIMM_B2, 8192 MB] is Ok
      OK |    0 | Chassis fan 0 [System Board Fan1 RPM] reading: 3000 RPM
      OK |    1 | Chassis fan 1 [System Board Fan2 RPM] reading: 3000 RPM
      OK |    2 | Chassis fan 2 [System Board Fan3 RPM] reading: 2880 RPM
      OK |    3 | Chassis fan 3 [System Board Fan4 RPM] reading: 3000 RPM
      OK |    4 | Chassis fan 4 [System Board Fan5 RPM] reading: 2880 RPM
      OK |    5 | Chassis fan 5 [System Board Fan6 RPM] reading: 3000 RPM
      OK |    0 | Power Supply 0 [AC]: Presence Detected
      OK |    0 | Temperature Probe 0 [System Board Inlet Temp] reads 27 C (min=3/-7, max=42/47)
      OK |    1 | Temperature Probe 1 [System Board Exhaust Temp] reads 31 C (min=8/3, max=70/75)
      OK |    2 | Temperature Probe 2 [CPU1 Temp] reads 36 C (min=8/3, max=79/84)
      OK |    3 | Temperature Probe 3 [CPU2 Temp] reads 31 C (min=8/3, max=79/84)
      OK |    0 | Processor 0 [Intel Xeon E5-2630 v2 2.60GHz] is Present
      OK |    1 | Processor 1 [Intel Xeon E5-2630 v2 2.60GHz] is Present
      OK |    0 | Voltage sensor 0 [CPU1 VCORE PG] is Good
      OK |    1 | Voltage sensor 1 [CPU2 VCORE PG] is Good
      OK |    2 | Voltage sensor 2 [System Board 3.3V PG] is Good
      OK |    3 | Voltage sensor 3 [System Board 5V PG] is Good
      OK |    4 | Voltage sensor 4 [CPU2 PLL PG] is Good
      OK |    5 | Voltage sensor 5 [CPU1 PLL PG] is Good
      OK |    6 | Voltage sensor 6 [System Board 1.1V PG] is Good
      OK |    7 | Voltage sensor 7 [CPU1 M23 VDDQ PG] is Good
      OK |    8 | Voltage sensor 8 [CPU1 M23 VTT PG] is Good
      OK |    9 | Voltage sensor 9 [System Board FETDRV PG] is Good
      OK |   10 | Voltage sensor 10 [CPU2 VSA PG] is Good
      OK |   11 | Voltage sensor 11 [CPU1 VSA PG] is Good
      OK |   12 | Voltage sensor 12 [CPU2 M01 VDDQ PG] is Good
      OK |   13 | Voltage sensor 13 [CPU1 M01 VDDQ PG] is Good
      OK |   14 | Voltage sensor 14 [CPU2 M23 VTT PG] is Good
      OK |   15 | Voltage sensor 15 [CPU2 M01 VTT PG] is Good
      OK |   16 | Voltage sensor 16 [System Board NDC PG] is Good
      OK |   17 | Voltage sensor 17 [CPU2 VTT PG] is Good
      OK |   18 | Voltage sensor 18 [CPU1 VTT PG] is Good
      OK |   19 | Voltage sensor 19 [CPU2 M23 VDDQ PG] is Good
      OK |   20 | Voltage sensor 20 [System Board 1.5V PG] is Good
      OK |   21 | Voltage sensor 21 [System Board PS2 PG Fail] is Good
      OK |   22 | Voltage sensor 22 [System Board PS1 PG Fail] is Good
      OK |   23 | Voltage sensor 23 [System Board BP1 5V PG] is Good
      OK |   24 | Voltage sensor 24 [CPU1 M01 VTT PG] is Good
      OK |   25 | Voltage sensor 25 [PS1 Voltage 1] reads 220 V
      OK |    0 | Battery probe 0 [System Board CMOS Battery] is Good
      OK |    1 | Amperage probe 1 [System Board Pwr Consumption] reads 112 W
      OK |    0 | Chassis intrusion 0 detection: Ok (Chassis is closed)
      OK |    0 | SD Card 0 [vFlash] is Absent
-----------------------------------------------------------------------------
   Other messages
=============================================================================
  STATE  |  MESSAGE TEXT
---------+-------------------------------------------------------------------
      OK | ESM log health is Ok (less than 80% full)
      OK | Chassis Service Tag is sane
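The ID column in this output is the value that -b component=ID expects. As a small sketch, the function below lists the IDs of every row that is not OK, which can help when deciding what to blacklist. The name list_non_ok_ids is made up for illustration, and the parsing assumes the pipe-separated STATE | ID | MESSAGE layout shown above.

```shell
# Read `check_openmanage -d` output on stdin and print "ID<TAB>message"
# for every three-column row whose STATE is WARNING or CRITICAL.
list_non_ok_ids() {
    awk -F'|' '
        NF >= 3 {
            state = $1; id = $2; msg = $3
            # Trim the padding around each column.
            gsub(/^[ \t]+|[ \t]+$/, "", state)
            gsub(/^[ \t]+|[ \t]+$/, "", id)
            gsub(/^[ \t]+|[ \t]+$/, "", msg)
            if (state == "WARNING" || state == "CRITICAL")
                printf "%s\t%s\n", id, msg
        }
    '
}
```

Used as `./check_openmanage -d | list_non_ok_ids`. Note that the ID alone does not name the component keyword (ctrl_fw, pdisk, and so on); that still has to be chosen from the message text.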
(8) Prefix alerts with the server's serial number (SN)
[root@kvm-phy04-jz check_openmanage-3.7.11]# ./check_openmanage -i
[33R0G42] Controller 0 [PERC H310 Mini]: Firmware '20.12.1-0002' is out of date
[33R0G42] Physical Disk 0:1:0 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
[33R0G42] Physical Disk 0:1:1 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
[33R0G42] Physical Disk 0:1:2 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
[33R0G42] Physical Disk 0:1:3 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
[33R0G42] Physical Disk 0:1:4 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
[33R0G42] Physical Disk 0:1:5 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
(9) Skip the storage checks
[root@kvm-phy04-jz check_openmanage-3.7.11]# ./check_openmanage --no-storage
OK - System: 'PowerEdge R720', SN: '33R0G42', 32 GB ram (4 dimms), not checking storage
(10) Use the blacklist to hide the firmware update and uncertified-disk notices, and output system information
[root@kvm-phy04-jz check_openmanage-3.7.11]# ./check_openmanage -e -b ctrl_fw=0\/pdisk=ALL
------ SYSTEM: PowerEdge R720, SN: 33R0G42
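5. Using check_openmanage to query remote servers
Normally, to check the local machine you can run check_openmanage directly, as in the commands above. It can also check other physical servers centrally from a single machine; in that case you must add -H ip_address. The following packages are also required (net-snmp and srvadmin-all on the monitored server, perl-Net-SNMP on the monitoring server, as the installation example shows):
net-snmp
perl-Net-SNMP
srvadmin-all
In terms of order, net-snmp must be installed before srvadmin-all, so that the srvadmin-all installation sets up the SNMP configuration for you automatically.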
Installation example:
Monitored server kvm-phy05-jz:
[root@kvm-phy05-jz ~]# yum install -y net-snmp net-snmp-devel net-snmp-utils
[root@kvm-phy05-jz ~]# wget -q -O - http://linux.dell.com/repo/hardware/latest/bootstrap.cgi | bash
[root@kvm-phy05-jz ~]# yum -y install OpenIPMI srvadmin-all
[root@kvm-phy05-jz ~]# yum remove -y srvadmin-tomcat srvadmin-jre srvadmin-smweb
[root@kvm-phy05-jz ~]# rm -rf /opt/dell/srvadmin/lib64/openmanage/apache-tomcat
[root@kvm-phy05-jz ~]# /etc/init.d/snmpd restart
[root@kvm-phy05-jz ~]# chkconfig snmpd on
[root@kvm-phy05-jz ~]# /opt/dell/srvadmin/sbin/srvadmin-services.sh restart
[root@kvm-phy05-jz ~]# /opt/dell/srvadmin/sbin/srvadmin-services.sh enable
Monitoring server kvm-phy04-jz:
[root@kvm-phy04-jz check_openmanage-3.7.11]# yum install -y perl-Net-SNMP
[root@kvm-phy04-jz check_openmanage-3.7.11]# ./check_openmanage -H 192.168.0.210
Controller 0 [PERC H310 Mini]: Firmware '20.12.0-0004' is out of date
Physical Disk 0:1:0 [Unknown vendor INTEL SSDSC2BA200G3, 199GB] on ctrl 0 is Online, Not Certified
Physical Disk 0:1:1 [Unknown vendor INTEL SSDSC2BA200G3, 199GB] on ctrl 0 is Online, Not Certified
Physical Disk 0:1:2 [Unknown vendor INTEL SSDSC2BA200G3, 199GB] on ctrl 0 is Online, Not Certified
Physical Disk 0:1:3 [Unknown vendor INTEL SSDSC2BA200G3, 199GB] on ctrl 0 is Online, Not Certified
Physical Disk 0:1:4 [Unknown vendor INTEL SSDSC2BA200G3, 199GB] on ctrl 0 is Online, Not Certified
Physical Disk 0:1:5 [Unknown vendor INTEL SSDSC2BA200G3, 199GB] on ctrl 0 is Online, Not Certified
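Checking several servers from one machine is then a matter of looping over -H. The following is a minimal sketch, not the author's setup: the function name check_hosts is hypothetical, and treating the highest exit status as the worst relies on the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), which is a simplification.

```shell
# Check each host over SNMP and return the highest exit status seen.
# $1: path to check_openmanage; remaining arguments: hostnames or IPs.
check_hosts() {
    plugin=$1
    shift
    worst=0
    for host in "$@"; do
        rc=0
        out=$("$plugin" -H "$host" -s) || rc=$?
        printf '== %s (exit %d)\n%s\n' "$host" "$rc" "$out"
        if [ "$rc" -gt "$worst" ]; then
            worst=$rc
        fi
    done
    return "$worst"
}
```

For example: `check_hosts ./check_openmanage 192.168.0.210 192.168.0.211`, where the second address is a placeholder.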
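Summary:
If your operations environment uses a Nagios + Cacti monitoring stack, check_openmanage makes it very convenient to monitor and alert on production server hardware. Because our company's monitoring is built on Zabbix, I will not go into the Nagios-specific setup here. Interested readers can refer to the following posts: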
http://dreamway.blog.51cto.com/1281816/1048274
http://www.2cto.com/os/201505/397023.html
http://www.2cto.com/os/201405/301212.html
Common errors:
Error 1:
ERROR: You need perl module Net::SNMP to run check_openmanage in SNMP mode
Cause:
In SNMP monitoring mode, check_openmanage requires perl-Net-SNMP.
Solution:
Install the perl-Net-SNMP package:
# yum install -y perl-Net-SNMP
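Error 2:
ERROR: (SNMP) OpenManage is not installed or is not working correctly
Cause:
SNMP is not configured. If net-snmp is installed first, installing OMSA configures SNMP for you automatically.
The configuration is as follows:
Solution:
(1) Install net-snmp first, then install OMSA (that is, srvadmin-all)
or
(2) Configure SNMP manually according to the figure above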
Error 3:
SNMP CRITICAL: No response from remote host 'X.X.X.X'
Cause:
The SNMP service is not installed on the monitored host.
Solution:
Install the SNMP service:
# yum install -y net-snmp
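The three errors above can be recognized by their message text. As an illustrative sketch only, the helper below maps a message to the remedy listed in this section. The function name om_error_hint is hypothetical, and the matching relies on the exact phrases quoted above.

```shell
# Read a check_openmanage error message on stdin and print the matching remedy.
om_error_hint() {
    msg=$(cat)
    case "$msg" in
        *"You need perl module Net::SNMP"*)
            echo "Install perl-Net-SNMP on the monitoring server." ;;
        *"OpenManage is not installed or is not working correctly"*)
            echo "Install net-snmp before srvadmin-all on the monitored host, or configure SNMP manually." ;;
        *"No response from remote host"*)
            echo "Install and start the SNMP service on the monitored host." ;;
        *)
            echo "No known hint for this message." ;;
    esac
}
```

For example: `./check_openmanage -H 192.168.0.210 2>&1 | om_error_hint`.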