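1. NSClient++ and NRPE
nagios can monitor Windows hosts in two main ways: through NSClient++, or through NRPE.
The biggest difference between NSClient++ and NRPE is this:
(1) With NRPE, the monitored host runs nrpe together with a set of plugins, and the plugins do the actual checking. When the monitoring server sends a check request to nrpe, nrpe calls the matching plugin to carry it out.
(2) With NSClient++ as used here, the monitored host runs only NSClient++, with no separate plugins. When the monitoring server sends a check request to NSClient++, NSClient++ performs the check itself; all monitoring is done by NSClient++.
This also points to the main limitation of NSClient++ in this setup: it is inflexible. Queried through check_nt, it can only perform the checks it has built in, and plugins cannot extend them. Fortunately those built-in checks are good enough to cover essentially all of our monitoring needs.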
[Figure: how NSClient++ works]
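2. Deployment

2.1 Install NSClient++ on Windows
(1) Click Next through the installer
(2) Enter the IP address of the nagios server
(3) Check that the NSClient++ port has been opened
If the service is not running: Win+R --> services.msc --> NSClient++ --> start the service
(4) Open TCP port 12489 in the firewall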
2.2 Configure the nagios server
(1) Check that the nagios plugin can query the Windows host
    [root@cacti libexec]# ./check_nt -H 192.168.200.15 -p 12489 -s dianyi123 -v UPTIME
    System Uptime - 3 day(s) 12 hour(s) 32 minute(s)

    [root@cacti libexec]# ./check_nt -H 192.168.200.15 -p 12489 -s dianyi123 -v CPULOAD -w 80 -c 90 -l 5,80,90
    CPU Load 0% (5 min average) | '5 min avg Load'=0%;80;90;0;100
    # -w: warning threshold (%); -c: critical threshold (%); -l (lowercase L) 5,80,90: average over the last 5 minutes, warning at 80%, critical at 90%

    [root@cacti libexec]# ./check_nt -H 192.168.200.15 -p 12489 -s dianyi123 -v USEDDISKSPACE -w 80 -c 90 -l C
    C:\ - total: 100.83 Gb - used: 13.71 Gb (14%) - free 87.12 Gb (86%) | 'C:\ Used Space'=13.71Gb;80.66;90.74;0.00;100.83
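The -w and -c thresholds decide which state a check reports. The sketch below illustrates that mapping for whole-number percentages. threshold_state is a hypothetical helper, not part of check_nt, and it assumes thresholds are inclusive (a value equal to the threshold already triggers the state), which may differ from what the plugin does at the boundary.

```shell
# threshold_state VALUE WARN CRIT
# Hypothetical helper: prints the state name and returns the matching
# Nagios exit code (0 = OK, 1 = WARNING, 2 = CRITICAL).
# Assumption: integer percentages, inclusive thresholds.
threshold_state() {
    value=$1 warn=$2 crit=$3
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL"
        return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING"
        return 1
    fi
    echo "OK"
    return 0
}

threshold_state 14 80 90            # 14% used, as in the C:\ sample: prints OK
threshold_state 85 80 90 || true    # prints WARNING
```

The return codes follow the plugin convention that nagios relies on, the same one used by the exit 0 / exit 2 lines in the check_mysql_slave script later in this article.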
(2) Define the command, host and services
① Define the command
    [root@cacti ~]# vim /usr/local/nagios/etc/objects/commands.cfg
    define command{
            command_name    check_win
            command_line    $USER1$/check_nt -H "$HOSTADDRESS$" -p 12489 -s dianyi123 -v $ARG1$ $ARG2$
            }
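A service refers to this command as name!arg1!arg2, and nagios substitutes the parts after each "!" for $ARG1$ and $ARG2$. The sketch below imitates that substitution in a simplified form to show what ends up on the command line. expand_command is a hypothetical helper and not how nagios itself is implemented; it leaves $USER1$ and $HOSTADDRESS$ untouched and assumes the arguments contain no "|", "&" or backslash characters.

```shell
# expand_command TEMPLATE CHECK_COMMAND
# Hypothetical, simplified imitation of $ARGn$ substitution.
# Runs in a subshell so the IFS and globbing changes do not leak out.
expand_command() (
    template=$1
    IFS='!'
    set -f                   # no filename globbing while splitting
    set -- $2                # $1 = command name, $2 = ARG1, $3 = ARG2
    printf '%s\n' "$template" |
        sed -e 's|\$ARG1\$|'"${2:-}"'|' -e 's|\$ARG2\$|'"${3:-}"'|'
)

template='$USER1$/check_nt -H "$HOSTADDRESS$" -p 12489 -s dianyi123 -v $ARG1$ $ARG2$'
expand_command "$template" 'check_win!CPULOAD!-l 5,80,90'
# prints: $USER1$/check_nt -H "$HOSTADDRESS$" -p 12489 -s dianyi123 -v CPULOAD -l 5,80,90
```

This is why a service can pass "-l 5,80,90" as a single second argument: everything between two "!" separators becomes one $ARGn$ value.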
② Define the host and services
For convenience, the host and its services are defined in the same configuration file.
First create a directory named servers under /usr/local/nagios/etc to hold the per-server configuration files, and name each file after the server's IP address.
For this to work, nagios.cfg must enable the servers directory with a cfg_dir line.
    [root@cacti etc]# pwd
    /usr/local/nagios/etc
    [root@cacti etc]# ls
    cgi.cfg  htpasswd.users  nagios.cfg  objects  resource.cfg  servers
    [root@cacti etc]# vim nagios.cfg
    cfg_dir=/usr/local/nagios/etc/servers

    [root@cacti etc]# vim servers/192.168.200.15.cfg
    define host{
            use             windows-server  ; Name of host template to use
            host_name       192.168.200.15
            alias           my computer
            address         192.168.200.15
            }
    #define hostgroup{
    #       hostgroup_name  windows-servers ; The name of the hostgroup
    #       alias           Windows Servers ; Long name of the group
    #       }
    define service{
            use                     generic-service
            host_name               192.168.200.15
            service_description     NSClient++ Version
            check_command           check_win!CLIENTVERSION
            }
    define service{
            use                     generic-service
            host_name               192.168.200.15
            service_description     Uptime
            check_command           check_win!UPTIME
            }
    define service{
            use                     generic-service
            host_name               192.168.200.15
            service_description     CPU Load
            check_command           check_win!CPULOAD!-l 5,80,90
            }
    define service{
            use                     generic-service
            host_name               192.168.200.15
            service_description     Memory Usage
            check_command           check_win!MEMUSE!-w 80 -c 90
            }
    define service{
            use                     generic-service
            host_name               192.168.200.15
            service_description     C:\ Drive Space
            check_command           check_win!USEDDISKSPACE!-l c -w 80 -c 90
            }
    define service{
            use                     generic-service
            host_name               192.168.200.15
            service_description     D:\ Drive Space
            check_command           check_win!USEDDISKSPACE!-l d -w 80 -c 90
            }
    define service{
            use                     generic-service
            host_name               192.168.200.15
            service_description     E:\ Drive Space
            check_command           check_win!USEDDISKSPACE!-l e -w 80 -c 90
            }
    #define service{
    #       use                     generic-service
    #       host_name               192.168.200.15
    #       service_description     W3SVC
    #       check_command           check_win!SERVICESTATE!-d SHOWALL -l W3SVC
    #       }
    define service{
            use                     generic-service
            host_name               192.168.200.15
            service_description     Explorer
            check_command           check_win!PROCSTATE!-d SHOWALL -l Explorer.exe
            }
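The three drive-space services above differ only in the drive letter. As a rough sketch, a small generator could print such blocks instead of copying them by hand. drive_services is a hypothetical helper, and the 80/90 thresholds are simply copied from the definitions above.

```shell
# drive_services HOST DRIVE...
# Hypothetical generator: prints one USEDDISKSPACE service block per
# drive letter, in the same shape as the hand-written definitions.
drive_services() {
    host=$1
    shift
    for drive in "$@"; do
        upper=$(printf '%s' "$drive" | tr '[:lower:]' '[:upper:]')
        cat <<EOF
define service{
        use                     generic-service
        host_name               ${host}
        service_description     ${upper}:\\ Drive Space
        check_command           check_win!USEDDISKSPACE!-l ${drive} -w 80 -c 90
        }
EOF
    done
}

drive_services 192.168.200.15 c d e
```

The output could be appended to servers/192.168.200.15.cfg, after which the configuration still has to be verified and nagios restarted as usual.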
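(3) Check the configuration files for errors
    /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
If the check reports no errors, the configuration is fine and the nagios service can be restarted.
(4) Restart the nagios service
    [root@cacti ~]# service nagios restart
    Stopping nagios: [ OK ]
    Starting nagios: [ OK ]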
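Steps (3) and (4) can be chained so that a broken configuration never reaches the restart. The sketch below is one way to do that, relying on nagios -v returning a non-zero exit status when the configuration has errors. reload_if_valid is a hypothetical wrapper, and both commands are passed in as strings so it can be tried without nagios installed.

```shell
# reload_if_valid VERIFY_CMD RESTART_CMD
# Hypothetical wrapper: run RESTART_CMD only if VERIFY_CMD succeeds.
reload_if_valid() {
    verify_cmd=$1 restart_cmd=$2
    if sh -c "$verify_cmd" >/dev/null 2>&1; then
        sh -c "$restart_cmd"
    else
        echo "configuration check failed; not restarting" >&2
        return 1
    fi
}

# Intended use on the nagios server (paths as used in this article):
# reload_if_valid \
#     "/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg" \
#     "service nagios restart"
```

When the check fails, rerun the verify command by hand to read its full output, since the wrapper discards it.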
3. Viewing hosts and services in the nagios web interface
3.1 Host status
3.2 Service status
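4. Troubleshooting
Two problems came up while setting up nagios monitoring of the Windows host.
4.1 Host status is DOWN instead of the normal UP
Cause: this usually means the server blocks ping. The monitoring server checks whether a monitored host is online by pinging it; once ICMP echo requests were allowed on the Windows server, the host check succeeded.
Fix (Windows Server 2008): Server Manager > Configuration > Windows Firewall with Advanced Security > Inbound Rules > find "File and Printer Sharing (Echo Request - ICMPv4-In)", right-click it and choose "Enable Rule".
4.2 could not fetch information from server
After the first problem was fixed the host status became UP, but every service showed "could not fetch information from server".
Cause: the nagios server was not getting any data back from the monitored host. The direct cause was that the Allowed hosts IP in the NSClient++ configuration file was wrong.
Fix: in the NSClient++ configuration file, set Allowed hosts = <nagios server IP>
When I installed NSClient++ I entered Allowed hosts = 192.168.200.105, which was correct, yet the value had somehow changed to 15, and I do not know why.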
5. Monitoring Linux hosts with nagios
5.1 Define the host and services on the server
    define host{
            use             linux-server
            host_name       192.168.200.111
            alias           linux
            address         192.168.200.111
            }
    define service{
            use                     generic-service
            host_name               192.168.200.111
            service_description     root_/
            check_command           check_nrpe!check_xvda!5%!10%
            }
    define service{
            use                     generic-service
            host_name               192.168.200.111
            service_description     /dev/xvdb2
            check_command           check_nrpe!check_xvdb2!5%!10%
            }
    define service{
            use                     generic-service
            host_name               192.168.200.111
            service_description     Check Swap
            check_command           check_nrpe!check_swap
            }
    define service{
            use                     generic-service
            host_name               192.168.200.111
            service_description     total
            check_command           check_nrpe!check_total_procs
            }
    define service{
            use                     generic-service
            host_name               192.168.200.111
            service_description     check_load
            check_command           check_nrpe!check_load
            }
    define service{
            use                     generic-service
            host_name               192.168.200.111
            service_description     check_tcp_3306
            check_command           check_tcp!3306
            }
    define service{
            use                     generic-service
            host_name               192.168.200.111
            service_description     check_users
            check_command           check_nrpe!check_users
            }
    define service{
            use                     generic-service
            host_name               192.168.200.111
            service_description     check_mem
            check_command           check_nrpe!check_mem
            }
    define service{
            use                     generic-service
            host_name               192.168.200.111
            service_description     check_mysql
            check_command           check_nrpe!check_mysql
            }
    define service{
            use                     generic-service
            host_name               192.168.200.111
            service_description     check_mysql_slave
            check_command           check_nrpe!check_mysql_slave
            }
    define service{
            use                     generic-service
            host_name               192.168.200.111
            service_description     check_http 192.168.200.111/test.html
            check_command           check_http!'-u /test.html'
    # nagios checks the page's HTTP status (e.g. 200); commands.cfg ships with a check_http command, which can also monitor a domain name
            }
5.2 On the client, edit /usr/local/nagios/etc/nrpe.cfg
    command[check_users]=/usr/local/nagios/libexec/check_users -w 3 -c 5
    command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
    command[check_xvda]=/usr/local/nagios/libexec/check_disk -w 10% -c 5% -p /dev/xvda
    command[check_zombie_procs]=/usr/local/nagios/libexec/check_procs -w 5 -c 10 -s Z
    command[check_total_procs]=/usr/local/nagios/libexec/check_procs -w 150 -c 200
    # Alibaba Cloud
    command[check_xvdb2]=/usr/local/nagios/libexec/check_disk -w 10% -c 5% -p /dev/xvdb2
    # the /dev/xvdb1 partition is used as swap
    command[check_swap]=/usr/local/nagios/libexec/check_swap -w 20% -c 10%
    command[check_mem]=/usr/bin/sudo /usr/local/nagios/libexec/check_mem -w 20 -c 10
    command[check_mysql]=/usr/local/nagios/libexec/check_mysql -H 192.168.200.111 -unagios -dnagios_monitor -p dianyi123
    command[check_mysql_slave]=/usr/local/nagios/libexec/check_mysql_slave
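Every name the server passes with check_nrpe -c has to match one of these command[...] entries, otherwise nrpe answers that the command is not defined. The sketch below lists the names defined in an nrpe.cfg-style file so they can be compared with the service definitions. nrpe_command_names and nrpe_has_command are hypothetical helpers, not part of NRPE.

```shell
# nrpe_command_names FILE
# Hypothetical helper: print the NAME of every command[NAME]=... line.
nrpe_command_names() {
    sed -n 's/^[[:space:]]*command\[\([^]]*\)]=.*/\1/p' "$1"
}

# nrpe_has_command FILE NAME
# Succeeds only if NAME is defined in FILE (exact, literal match).
nrpe_has_command() {
    nrpe_command_names "$1" | grep -qxF -- "$2"
}

# Example use on the monitored host (path as used in this article):
# nrpe_has_command /usr/local/nagios/etc/nrpe.cfg check_xvda \
#     || echo "check_xvda is not defined in nrpe.cfg"
```

Commented-out entries are skipped because the pattern only accepts leading whitespace before "command[".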
5.3 Allow the nagios server's IP in nrpe.cfg
    [root@localhost ~]# vim /usr/local/nagios/etc/nrpe.cfg
    allowed_hosts=127.0.0.1,192.168.200.105
5.4 Start nrpe on the client as a standalone daemon
    /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d
5.5 Add the check_nrpe command to the nagios command definitions
    [root@monitor ~]# vim /usr/local/nagios/etc/objects/commands.cfg
    # add the following definition
    define command {
            command_name    check_nrpe
            command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
            }
Without this definition, restarting nagios fails with:
    Error: Service check command 'check_nrpe!check_total_procs' specified in service 'total' for host '192.168.200.105' not defined anywhere!
5.6 Test from the server:
    /usr/local/nagios/libexec/check_nrpe -H 192.168.200.111 -c check_xvda
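6. Supplement
6.1 Monitoring a Windows port with nagios
Most programs built on sockets listen on a TCP port, so monitoring that port is effectively monitoring the program. FTP 21, POP 110 and SMTP 25 are common TCP ports, and nagios generally ships predefined check commands for such ports; for any other port you have to define a command for the program's TCP port yourself.
Before monitoring, confirm that the port is open, for example by running telnet against it from CMD:
    C:\Users\Administrator>telnet 192.168.200.15 3389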
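The same reachability test can be run from the nagios server without telnet. The sketch below uses bash's built-in /dev/tcp redirection together with the coreutils timeout command; port_open is a hypothetical helper, and it assumes both bash and timeout are installed on the server.

```shell
# port_open HOST PORT [TIMEOUT_SECONDS]
# Hypothetical helper: succeeds if a TCP connection to HOST:PORT can be
# opened within the timeout (default 3 seconds).
# Assumes bash (for /dev/tcp) and coreutils timeout are available.
port_open() {
    timeout "${3:-3}" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Example use (address and port as used in this article):
# if port_open 192.168.200.15 3389; then
#     echo "3389 is reachable"
# else
#     echo "3389 is closed or filtered"
# fi
```

A success here only shows that the port accepts connections from the nagios server, which is the same thing check_tcp will test.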
(1) Define the command
    [root@cacti objects]# vim /usr/local/nagios/etc/objects/commands.cfg
    define command{
            command_name    tcp3389
            command_line    $USER1$/check_tcp -H $HOSTADDRESS$ -p 3389
            }
(2) Define the service
The host is already defined; the host and its services share one configuration file.
    [root@cacti servers]# vim /usr/local/nagios/etc/servers/192.168.200.15.cfg
    define service{
            use                     generic-service
            host_name               192.168.200.15
            service_description     port3389
            check_command           tcp3389
            }
(3) Restart the nagios service
(4) Verify in the web interface
6.2 Monitoring Linux ports with nagios
    [root@cacti servers]# pwd
    /usr/local/nagios/etc/servers
    [root@cacti servers]# vim 192.168.200.18.cfg
    define service{
            use                     generic-service
            host_name               192.168.200.18
            service_description     check_tcp_3306
            check_command           check_tcp!3306
            }
    define service{
            use                     generic-service
            host_name               192.168.200.18
            service_description     check_tcp_873
            check_command           check_tcp!873
            }

    [root@cacti ~]# service nagios restart
If a port listens on one specific address, like the first line below, rather than on all addresses like *:5666:
    tcp LISTEN 0 50 61.138.78.59:7003 *:*
    tcp LISTEN 0 5 *:5666 *:*
then change $HOSTADDRESS$ in the command definition to 61.138.78.59, give the command a new command_name, and define the service with it.
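To find out which address a port is bound to, the local-address column of the ss output can be extracted. The sketch below reads ss-style lines on standard input and prints the address part for one port. listen_addresses is a hypothetical helper, and it assumes the local address is the third field after the LISTEN state, as in the sample lines above.

```shell
# listen_addresses PORT   (reads ss-style output on stdin)
# Hypothetical helper: print the local address of every LISTEN socket
# on PORT. "*" or "::" means the port listens on all addresses.
listen_addresses() {
    awk -v port="$1" '{
        for (i = 1; i <= NF; i++) {
            if ($i == "LISTEN") {
                laddr = $(i + 3)                 # Local Address:Port column
                n = split(laddr, parts, ":")
                if (parts[n] == port)
                    print substr(laddr, 1, length(laddr) - length(parts[n]) - 1)
            }
        }
    }'
}

# Example use on the monitored host:
# ss -antl | listen_addresses 7003
```

If the printed address is a single IP such as 61.138.78.59, check_tcp has to be pointed at that IP rather than at the host's default address.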
6.3 Monitoring MySQL master-slave replication with nagios
Whether replication is healthy comes down to two threads, Slave_IO and Slave_SQL: if both report Yes, replication is working.
    MariaDB [(none)]> show slave status\G
    *************************** 1. row ***************************
    Slave_IO_State: Waiting for master to send event
    Master_Host: 192.168.200.17
    Master_User: doteyplay
    Master_Port: 3306
    Connect_Retry: 60
    Master_Log_File: master-bin.000008
    Read_Master_Log_Pos: 1277
    Relay_Log_File: relay-bin.000025
    Relay_Log_Pos: 1486
    Relay_Master_Log_File: master-bin.000008
    Slave_IO_Running: Yes
    Slave_SQL_Running: Yes
Part 1: client configuration
(1) Add a user on the slave server that is to be monitored
    MariaDB [(none)]> grant Replication client on *.* to nagios@localhost identified by 'nagios';
    Query OK, 0 rows affected (0.00 sec)

    MariaDB [(none)]> flush privileges;
    Query OK, 0 rows affected (0.00 sec)
(2) Verify that the command runs with this user
    [root@DBSlave ~]# mysql -unagios -pnagios -e "show slave status\G;"
    *************************** 1. row ***************************
    Slave_IO_State: Waiting for master to send event
    Master_Host: 192.168.200.17
    Master_User: doteyplay
    Master_Port: 3306
    Connect_Retry: 60
    Master_Log_File: master-bin.000008
    Read_Master_Log_Pos: 1277
    Relay_Log_File: relay-bin.000025
    Relay_Log_Pos: 1486
    Relay_Master_Log_File: master-bin.000008
    Slave_IO_Running: Yes
    Slave_SQL_Running: Yes
(3) Write the script /usr/local/nagios/libexec/check_mysql_slave (this is the core of the check)
    #!/bin/bash
    # bash is required: the script uses declare -a and arrays
    declare -a slave_is
    # collect the values of the lines containing "Running" (IO thread first, SQL thread second)
    slave_is=($(/usr/local/mysql/bin/mysql -unagios -pnagios -e "show slave status\G"|grep Running |awk '{print $2}'))
    if [ "${slave_is[0]}" = "Yes" -a "${slave_is[1]}" = "Yes" ]
    then
        echo "OK C2-slave is running"
        exit 0
    else
        echo "Critical C2-slave is error"
        exit 2
    fi

    [root@DBSlave libexec]# chmod +x check_mysql_slave    # make the script executable
    [root@DBSlave libexec]# chown nagios.nagios check_mysql_slave
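The script above takes the first two lines that contain "Running". A slightly stricter variant, sketched below, matches the two field names exactly, so that any other line containing "Running" cannot shift the positions. slave_status_check is a hypothetical function that reads the output of show slave status\G on standard input, which also makes it easy to try without a database.

```shell
# slave_status_check   (reads "show slave status\G" output on stdin)
# Hypothetical variant of the check: exit 0 if both replication
# threads report Yes, exit 2 (CRITICAL) otherwise.
slave_status_check() {
    awk '
        $1 == "Slave_IO_Running:"  { io  = $2 }
        $1 == "Slave_SQL_Running:" { sql = $2 }
        END {
            if (io == "Yes" && sql == "Yes") {
                print "OK slave is running"
                exit 0
            }
            printf "CRITICAL slave error (IO=%s SQL=%s)\n", io, sql
            exit 2
        }'
}

# Example use on the slave (credentials as used in this article):
# /usr/local/mysql/bin/mysql -unagios -pnagios -e "show slave status\G" | slave_status_check
```

Printing the state of each thread in the CRITICAL message shows in the nagios interface which of the two threads has stopped.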
(4) Install nrpe on the slave server, then add one line to nrpe.cfg
    [root@DBSlave ~]# vim /usr/local/nagios/etc/nrpe.cfg
    command[check_mysql_slave]=/usr/local/nagios/libexec/check_mysql_slave
(5) Run the script by hand and check its output
    [root@DBSlave libexec]# sh check_mysql_slave
    OK C2-slave is running
(6) Check port 5666 on the monitored host
    [root@DBSlave libexec]# ss -antulp | grep 5666
    tcp LISTEN 0 5 :::5666 :::* users:(("nrpe",26512,5))
    tcp LISTEN 0 5 *:5666 *:* users:(("nrpe",26512,4))
    [root@DBSlave libexec]# /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d
Part 2: server configuration
(1) On the monitoring server, check that the monitored host can be queried
    [root@cacti ~]# /usr/local/nagios/libexec/check_nrpe -H 192.168.200.18 -c check_mysql_slave
    NRPE: Command 'check_mysql_slave' not defined    # ran into a problem
Troubleshooting: NRPE: Command 'check_mysql_slave' not defined
    [root@cacti ~]# /usr/local/nagios/libexec/check_nrpe -H 192.168.200.18
    NRPE v2.15
This shows that NRPE on the monitored host is working and that the monitoring server can talk to it over SSL. The running nrpe process, however, was started on Aug 15, before check_mysql_slave was added to nrpe.cfg, and nrpe reads its configuration when it starts, so it has to be restarted.
    [root@DBSlave libexec]# ps -ef | grep nrpe
    root 10287 9703 0 12:01 pts/1 00:00:00 vim /usr/local/nagios/etc/nrpe.cfg
    root 10522 9639 0 12:30 pts/0 00:00:00 grep nrpe
    nagios 26512 1 0 Aug15 ? 00:01:09 /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d
    # nrpe here was started by hand as a standalone process rather than as a managed service, so kill it first
    [root@DBSlave libexec]# kill -9 26512    # kill the nrpe process
    [root@DBSlave libexec]# ps -ef | grep nrpe
    root 10287 9703 0 12:01 pts/1 00:00:00 vim /usr/local/nagios/etc/nrpe.cfg
    root 10524 9639 0 12:31 pts/0 00:00:00 grep nrpe
    # the kill succeeded
    [root@DBSlave libexec]# /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d    # start nrpe again
    [root@DBSlave libexec]# ps -ef | grep nrpe
    root 10287 9703 0 12:01 pts/1 00:00:00 vim /usr/local/nagios/etc/nrpe.cfg
    nagios 10526 1 0 12:31 ? 00:00:00 /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d
    root 10528 9639 0 12:31 pts/0 00:00:00 grep nrpe
Test again from the monitoring server
    [root@cacti ~]# /usr/local/nagios/libexec/check_nrpe -H 192.168.200.18 -c check_mysql_slave
    OK C2-slave is running    # it finally passes; the stale nrpe process was the cause
(2) Define the host and service
    [root@cacti servers]# pwd
    /usr/local/nagios/etc/servers
    [root@cacti servers]# vim 192.168.200.18.cfg
    define host{
            use             linux-server
            host_name       192.168.200.18
            alias           linux
            address         192.168.200.18
            }
    define service{
            use                     generic-service
            host_name               192.168.200.18
            service_description     check_mysql_slave
            check_command           check_nrpe!check_mysql_slave
            }
(3) Restart the nagios service
(4) Check the monitoring status in the web interface
6.4 Error when changing a service through the nagios web interface
The web interface lets you, for example, reschedule the next check of a service or disable its notifications. Submitting such a change often fails with the following error:
    Could not open command file '/usr/local/nagios/var/rw/nagios.cmd' for update!
    The permissions on the external command file and/or directory may be incorrect. Read the FAQs on how to setup proper permissions.
    An error occurred while attempting to commit your command for processing.
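(1) Change the owner and group of the command directory
    [root@monitor ~]# chown -R nagios.nagios /usr/local/nagios/var/rw/
(2) Add the apache user to the nagios group (-a appends, so apache keeps its existing groups)
    [root@monitor ~]# usermod -a -G nagios apache
(3) Restart the services
    [root@monitor ~]# service nagios restart
    [root@monitor ~]# service httpd restart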
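Whether step (2) took effect can be confirmed by listing the groups of the web server user. The sketch below wraps that in a small test. user_in_group is a hypothetical helper built on the standard id command, and the user and group names in the example are the ones used in this article.

```shell
# user_in_group USER GROUP
# Hypothetical helper: succeeds if USER belongs to GROUP
# (as primary or supplementary group).
user_in_group() {
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qxF -- "$2"
}

# Example use on the nagios server:
# user_in_group apache nagios || echo "apache is not in the nagios group yet"
```

A running httpd only picks up the new group after it has been restarted, which is why step (3) restarts httpd as well as nagios.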