This write-up is largely based on the following two articles:
http://quenlang.blog.51cto.com/4813803/1571635
http://www.cnblogs.com/mchina/archive/2013/02/20/2883404.html#!comments
ganglia : ganglia-3.6.0.tar.gz
nagios : http://sourceforge.net/projects/nagios/files/nagios-4.x/nagios-4.1.1/nagios-4.1.1.tar.gz/download
nagios-plugins : http://www.nagios-plugins.org/download/nagios-plugins-2.1.1.tar.gz
nrpe : http://sourceforge.net/projects/nagios/files/nrpe-2.x/nrpe-2.15/nrpe-2.15.tar.gz/download
Install Ganglia's gmetad, gmond, and ganglia-web on hadoop1
Create a ganglia.rpm file and list the required dependency packages in it:
$ vim ganglia.rpm
apr-devel apr-util check-devel cairo-devel pango-devel libxml2-devel glib2-devel dbus-devel freetype-devel fontconfig-devel gcc-c++ expat-devel python-devel rrdtool rrdtool-devel libXrender-devel zlib libart_lgpl libpng dejavu-lgc-sans-mono-fonts dejavu-sans-mono-fonts perl-ExtUtils-CBuilder perl-ExtUtils-MakeMaker
Check whether these packages are already installed:
$ rpm -q `cat ganglia.rpm`
package apr-devel is not installed
apr-util-1.3.9-3.el6_0.1.x86_64
check-devel-0.9.8-1.1.el6.x86_64
cairo-devel-1.8.8-3.1.el6.x86_64
pango-devel-1.28.1-10.el6.x86_64
libxml2-devel-2.7.6-14.el6_5.2.x86_64
glib2-devel-2.28.8-4.el6.x86_64
dbus-devel-1.2.24-7.el6_3.x86_64
freetype-devel-2.3.11-14.el6_3.1.x86_64
fontconfig-devel-2.8.0-5.el6.x86_64
gcc-c++-4.4.7-11.el6.x86_64
package expat-devel is not installed
python-devel-2.6.6-52.el6.x86_64
libXrender-devel-0.9.8-2.1.el6.x86_64
zlib-1.2.3-29.el6.x86_64
libart_lgpl-2.3.20-5.1.el6.x86_64
libpng-1.2.49-1.el6_2.x86_64
package dejavu-lgc-sans-mono-fonts is not installed
package dejavu-sans-mono-fonts is not installed
perl-ExtUtils-CBuilder-0.27-136.el6.x86_64
perl-ExtUtils-MakeMaker-6.55-136.el6.x86_64
Use yum install to install the packages that are missing on the machine.
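For example, on the machine above yum can pull in exactly the packages that rpm -q reported as missing (the list depends on your own output):

$ yum install -y apr-devel expat-devel dejavu-lgc-sans-mono-fonts dejavu-sans-mono-fonts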
Confuse also needs to be installed.
Download: http://www.nongnu.org/confuse/
$ tar -zxf confuse-2.7.tar.gz
$ cd confuse-2.7
$ ./configure CFLAGS=-fPIC --disable-nls
$ make && make install
Install on hadoop1:
$ tar -xvf /home/hadoop/ganglia-3.6.0.tar.gz -C /opt/soft/
$ cd /opt/soft/ganglia-3.6.0/
## install gmetad
$ ./configure --prefix=/usr/local/ganglia --with-gmetad --with-libpcre=no --enable-gexec --enable-status --sysconfdir=/etc/ganglia
$ make && make install
$ cp gmetad/gmetad.init /etc/init.d/gmetad
$ cp /usr/local/ganglia/sbin/gmetad /usr/sbin/
$ chkconfig --add gmetad
## install gmond
$ cp gmond/gmond.init /etc/init.d/gmond
$ cp /usr/local/ganglia/sbin/gmond /usr/sbin/
$ gmond --default_config>/etc/ganglia/gmond.conf
$ chkconfig --add gmond
With gmetad and gmond installed, the next step is ganglia-web; first install php and httpd:
yum install php httpd -y
Edit the httpd configuration file /etc/httpd/conf/httpd.conf, changing only the listening port to 8080:
Listen 8080
Install ganglia-web:
$ tar xf ganglia-web-3.6.2.tar.gz -C /opt/soft/
$ cd /opt/soft/
$ chmod -R 777 ganglia-web-3.6.2/
$ cd ganglia-web-3.6.2/
$ useradd www-data
$ make install
$ chmod 777 /var/lib/ganglia-web/dwoo/cache/
$ chmod 777 /var/lib/ganglia-web/dwoo/compiled/
$ cd /opt/soft/
$ mv ganglia-web-3.6.2/ /var/www/html/ganglia
ganglia-web is now installed. Edit conf_default.php to point at the ganglia-web directory and the rrds data directory by changing the following two lines:
36 # Where gmetad stores the rrd archives.
37 $conf['gmetad_root'] = "/var/www/html/ganglia";    ## change to the directory where the web app was installed
38 $conf['rrds'] = "/var/lib/ganglia/rrds";           ## path where the rrd data is stored
Create the rrd data directory and set its ownership:
$ mkdir /var/lib/ganglia/rrds -p
$ chown nobody:nobody /var/lib/ganglia/rrds/ -R
At this point all the Ganglia installation work on hadoop1 is done. Next, install the gmond client on every other node.
Install gmond on the other nodes
Again, install the dependencies first and then gmond. The installation is identical on every node, so write a script:
$ vim install_ganglia.sh
#!/bin/sh
# Install the dependencies. These are the ones I already knew were missing on my machines;
# adjust the list to whatever is missing in your own environment.
yum install -y apr-devel expat-devel rrdtool rrdtool-devel
mkdir /opt/soft; cd /opt/soft
tar -xvf /home/hadoop/confuse-2.7.tar.gz
cd confuse-2.7
./configure CFLAGS=-fPIC --disable-nls
make && make install
cd /opt/soft
# Install ganglia gmond
tar -xvf /home/hadoop/ganglia-3.6.0.tar.gz
cd ganglia-3.6.0/
./configure --prefix=/usr/local/ganglia --with-libpcre=no --enable-gexec --enable-status --sysconfdir=/etc/ganglia
make && make install
cp gmond/gmond.init /etc/init.d/gmond
cp /usr/local/ganglia/sbin/gmond /usr/sbin/
gmond --default_config>/etc/ganglia/gmond.conf
chkconfig --add gmond
Copy this script to all nodes and run it.
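One way to push the script out is a simple loop; a minimal sketch, assuming the other nodes are reachable as a02 through a18 (matching the host names used in the Nagios section later) and that passwordless SSH is set up:

$ for i in $(seq -w 2 18); do scp install_ganglia.sh a$i:/home/hadoop/; ssh a$i "sh /home/hadoop/install_ganglia.sh"; done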
Configuration is split into a server side and a client side: the server configuration file is gmetad.conf and the client configuration file is gmond.conf.
First configure gmetad.conf on hadoop1; only hadoop1 has this file.
$ vi /etc/ganglia/gmetad.conf
## Define the data source name and its address; gmond sends the data it collects to the rrd data directory on the machine this data source points at
## "hadoop cluster" is a name you pick yourself
data_source "hadoop cluster" 192.168.0.101:8649
Next, configure gmond.conf:
$ head -n 80 /etc/ganglia/gmond.conf
/* This configuration is as close to 2.5.x default behavior as possible
   The values closely match ./gmond/metric.h definitions in 2.5.x */
globals {
  daemonize = yes            ## run as a daemon
  setuid = yes
  user = nobody              ## user that gmond runs as
  debug_level = 0            ## set to 1 to print debug information at startup
  max_udp_msg_len = 1472
  mute = no                  ## "mute": if yes, this node will not broadcast any data it collects
  deaf = no                  ## "deaf": if yes, this node will not accept data broadcast by other nodes
  allow_extra_data = yes
  host_dmax = 86400 /*secs. Expires (removes from web interface) hosts in 1 day */
  host_tmax = 20 /*secs */
  cleanup_threshold = 300 /*secs */
  gexec = no
  # By default gmond will use reverse DNS resolution when displaying your hostname
  # Uncommeting following value will override that value.
  # override_hostname = "mywebserver.domain.com"
  # If you are not using multicast this value should be set to something other than 0.
  # Otherwise if you restart aggregator gmond you will get empty graphs. 60 seconds is reasonable
  send_metadata_interval = 0 /*secs */
}

/*
 * The cluster attributes specified will be used as part of the <CLUSTER>
 * tag that will wrap all hosts collected by this instance.
 */
cluster {
  name = "hadoop cluster"    ## cluster name
  owner = "nobody"           ## cluster owner
  latlong = "unspecified"
  url = "unspecified"
}

/* The host section describes attributes of the host, like the location */
host {
  location = "unspecified"
}

/* Feel free to specify as many udp_send_channels as you like.  Gmond
   used to only support having a single channel */
udp_send_channel {
  #bind_hostname = yes # Highly recommended, soon to be default.
                       # This option tells gmond to use a source address
                       # that resolves to the machine's hostname.  Without
                       # this, the metrics may appear to come from any
                       # interface and the DNS names associated with
                       # those IPs will be used to create the RRDs.
  # mcast_join = 239.2.11.71   ## comment this line out in unicast mode
  host = 192.168.0.101         ## unicast mode: the host that receives the data
  port = 8649                  ## listening port
  ttl = 1
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
  #mcast_join = 239.2.11.71    ## comment this line out in unicast mode
  port = 8649
  #bind = 239.2.11.71          ## comment this line out in unicast mode
  retry_bind = true
  # Size of the UDP buffer. If you are handling lots of metrics you really
  # should bump it up to e.g. 10MB or even higher.
  # buffer = 10485760
}

/* You can specify as many tcp_accept_channels as you like to share
   an xml description of the state of the cluster */
tcp_accept_channel {
  port = 8649
  # If you want to gzip XML output
  gzip_output = no
}

/* Channel to receive sFlow datagrams */
#udp_recv_channel {
#  port = 6343
#}

/* Optional sFlow settings */
That completes the changes to gmetad.conf and gmond.conf on hadoop1. Now simply scp the gmond.conf from hadoop1 to the same path on every other node, overwriting the original gmond.conf there.
Start the gmond service on all nodes:
/etc/init.d/gmond start
Start the gmetad and httpd services on hadoop1:
/etc/init.d/gmetad start
/etc/init.d/httpd start
Configuration complete.
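Before moving on it is worth a quick sanity check (not part of the original write-up) that gmond and gmetad are really listening; the address and port come from the unicast settings above, and nc is assumed to be available:

$ netstat -lnp | grep -E '8649|865[12]'    ## gmond listens on 8649 (tcp/udp), gmetad on 8651/8652
$ nc 192.168.0.101 8649 | head -n 5        ## the tcp_accept_channel answers with an XML dump of the cluster state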
At this point Ganglia is only monitoring basic host metrics and does not yet monitor Hadoop itself. Next, edit the Hadoop configuration. The example below uses the configuration files on hadoop1; the corresponding files on the other nodes should be copied from hadoop1. The first file to modify is hadoop-metrics2.properties in the Hadoop configuration directory:
$ cd /usr/local/hadoop-2.6.0/etc/hadoop/
$ vim hadoop-metrics2.properties
# for Ganglia 3.1 support
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
*.sink.ganglia.period=10
# default for supportsparse is false
*.sink.ganglia.supportsparse=true
*.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both
*.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40
# Tag values to use for the ganglia prefix. If not defined no tags are used.
# If '*' all tags are used. If specifiying multiple tags separate them with
# commas. Note that the last segment of the property name is the context name.
#
#*.sink.ganglia.tagsForPrefix.jvm=ProcesName
#*.sink.ganglia.tagsForPrefix.dfs=
#*.sink.ganglia.tagsForPrefix.rpc=
#*.sink.ganglia.tagsForPrefix.mapred=
namenode.sink.ganglia.servers=192.168.0.101:8649
datanode.sink.ganglia.servers=192.168.0.101:8649
resourcemanager.sink.ganglia.servers=192.168.0.101:8649
nodemanager.sink.ganglia.servers=192.168.0.101:8649
mrappmaster.sink.ganglia.servers=192.168.0.101:8649
jobhistoryserver.sink.ganglia.servers=192.168.0.101:8649
Copy the file to all nodes and restart the Hadoop cluster.
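After copying the file out (for example with an scp loop like the one used for install_ganglia.sh earlier), the daemons must be restarted to pick up the new metrics sink. A minimal sketch, assuming the stock Hadoop 2.6.0 scripts under /usr/local/hadoop-2.6.0/sbin and that hadoop1 runs the NameNode and ResourceManager:

$ cd /usr/local/hadoop-2.6.0
$ sbin/stop-yarn.sh && sbin/stop-dfs.sh
$ sbin/start-dfs.sh && sbin/start-yarn.sh
$ sbin/mr-jobhistory-daemon.sh stop historyserver && sbin/mr-jobhistory-daemon.sh start historyserver    ## only if the JobHistory server is used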
Hadoop metrics now show up in the monitoring.
Create the nagios user:
# useradd -s /sbin/nologin nagios
# mkdir /usr/local/nagios
# chown -R nagios.nagios /usr/local/nagios
$ cd /opt/soft
$ tar zxvf nagios-4.1.1.tar.gz
$ cd nagios-4.1.1
$ ./configure --prefix=/usr/local/nagios
$ make all
$ make install
$ make install-init
$ make install-config
$ make install-commandmode
$ make install-webconf
Switch to the installation path (here /usr/local/nagios) and check whether the directories etc, bin, sbin, share, and var exist; if they do, the program was installed correctly.
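A quick way to check:

$ ls /usr/local/nagios
## expect to see at least: bin  etc  sbin  share  var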
$ cd /opt/soft
$ tar zxvf nagios-plugins-2.1.1.tar.gz
$ cd nagios-plugins-2.1.1
$ mkdir /usr/local/nagios
$ ./configure --prefix=/usr/local/nagios
$ make && make install
$ cd /opt/soft/
$ tar -xvf /home/hadoop/nrpe-2.15.tar.gz
$ cd nrpe-2.15/
$ ./configure
$ make all
$ make install-plugin
The datanodes only need nagios-plugins and nrpe.
Since all the nodes are the same, write a script:
#!/bin/sh
adduser nagios
cd /opt/soft
tar xvf /home/hadoop/nagios-plugins-2.1.1.tar.gz
cd nagios-plugins-2.1.1
mkdir /usr/local/nagios
./configure --prefix=/usr/local/nagios
make && make install
chown nagios.nagios /usr/local/nagios
chown -R nagios.nagios /usr/local/nagios/libexec
# Install xinetd. Check whether the machine already has xinetd; install it only if it is missing.
yum install xinetd -y
cd ../
tar xvf /home/hadoop/nrpe-2.15.tar.gz
cd nrpe-2.15
./configure
make all
make install-daemon
make install-daemon-config
make install-xinetd
After the installation completes, edit nrpe.cfg:
$ vim /usr/local/nagios/etc/nrpe.cfg
log_facility=daemon
pid_file=/var/run/nrpe.pid
## nrpe listening port
server_port=5666
nrpe_user=nagios
nrpe_group=nagios
## address of the nagios server
allowed_hosts=xx.xxx.x.xx
dont_blame_nrpe=0
allow_bash_command_substitution=0
debug=0
command_timeout=60
connection_timeout=300
## system load
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
## number of logged-in users
command[check_users]=/usr/local/nagios/libexec/check_users -w 5 -c 10
## free space on the root partition
command[check_sda2]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/sda2
## mysql status
command[check_mysql]=/usr/local/nagios/libexec/check_mysql -H localhost -P 3306 -d kora -u kora -p upbjsxt
## host alive check
command[check_ping]=/usr/local/nagios/libexec/check_ping -H localhost -w 100.0,20% -c 500.0,60%
## total number of processes
command[check_total_procs]=/usr/local/nagios/libexec/check_procs -w 150 -c 200
## swap usage
command[check_swap]=/usr/local/nagios/libexec/check_swap -w 20 -c 10
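Each command defined here can also be run by hand on the datanode to confirm that the plugin works and to see the output format Nagios will receive (a quick check, not in the original):

$ /usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
## typical output: OK - load average: 0.01, 0.05, 0.10|load1=0.010;15.000;30.000;0; ...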
Only the commands defined in this file on the monitored machine can be fetched by the monitoring machine (hadoop1) through the nrpe plugin. In other words, every metric you want to collect from a host must be defined here.
Sync this file to all the other datanode nodes.
You can see that the file /etc/xinetd.d/nrpe has been created.
Edit this file (the screenshots in the original were taken from another article, so the version numbers differ from this setup, but the idea is the same):
Add the monitoring host's IP address after only_from.
Edit /etc/services and add the NRPE service.
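Both edits can also be done non-interactively; a hedged sketch, assuming the default file produced by make install-xinetd, with 192.168.0.101 standing in for the Nagios server address (the same one you put in allowed_hosts):

$ sed -i 's/only_from *= 127.0.0.1/only_from       = 127.0.0.1 192.168.0.101/' /etc/xinetd.d/nrpe
$ echo "nrpe            5666/tcp                # NRPE" >> /etc/services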
Restart the xinetd service:
# service xinetd restart
Check whether NRPE has started.
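For example:

$ netstat -tnlp | grep 5666
## xinetd should show up as the process listening on port 5666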
You can see that port 5666 is now listening.
On hadoop1:
To integrate Nagios with Ganglia, copy the ganglia plugin shipped in the ganglia source package into the nagios plugin directory on hadoop1:
$ cd /opt/soft/ganglia-3.6.0
$ cp contrib/check_ganglia.py /usr/local/nagios/libexec/
By default check_ganglia.py only handles the case where the metric's actual value is greater than the critical threshold. Here we add the case where the actual value is smaller than the critical threshold, i.e. the block of code appended at the end:
$ vim /usr/local/nagios/libexec/check_ganglia.py
  if critical > warning:
    if value >= critical:
      print "CHECKGANGLIA CRITICAL: %s is %.2f" % (metric, value)
      sys.exit(2)
    elif value >= warning:
      print "CHECKGANGLIA WARNING: %s is %.2f" % (metric, value)
      sys.exit(1)
    else:
      print "CHECKGANGLIA OK: %s is %.2f" % (metric, value)
      sys.exit(0)
  else:
    if critical >= value:
      print "CHECKGANGLIA CRITICAL: %s is %.2f" % (metric, value)
      sys.exit(2)
    elif warning >= value:
      print "CHECKGANGLIA WARNING: %s is %.2f" % (metric, value)
      sys.exit(1)
    else:
      print "CHECKGANGLIA OK: %s is %.2f" % (metric, value)
      sys.exit(0)
In the end it should look like the above.
Configure each host and its monitored services on hadoop1
Before any changes, the directory looks like this:
$ cd /usr/local/nagios/etc/objects/
$ ll
total 48
-rw-rw-r-- 1 nagios nagios  8010 9月  11 14:59 commands.cfg
-rw-rw-r-- 1 nagios nagios  2138 9月  11 11:35 contacts.cfg
-rw-rw-r-- 1 nagios nagios  5375 9月  11 11:35 localhost.cfg
-rw-rw-r-- 1 nagios nagios  3096 9月  11 11:35 printer.cfg
-rw-rw-r-- 1 nagios nagios  3265 9月  11 11:35 switch.cfg
-rw-rw-r-- 1 nagios nagios 10621 9月  11 11:35 templates.cfg
-rw-rw-r-- 1 nagios nagios  3180 9月  11 11:35 timeperiods.cfg
-rw-rw-r-- 1 nagios nagios  3991 9月  11 11:35 windows.cfg
Note: in the .cfg files, inline comments that follow a configuration entry must start with a semicolon (;), not '#'. I used '#' at first and it caused problems that took a long time to track down.
Edit commands.cfg
Append the following to the end of the file:
# 'check_ganglia' command definition
define command{
        command_name check_ganglia
        command_line $USER1$/check_ganglia.py -h $HOSTADDRESS$ -m $ARG1$ -w $ARG2$ -c $ARG3$
        }
# 'check_nrpe' command definition
define command{
        command_name check_nrpe
        command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
        }
Edit templates.cfg
I have 18 datanode machines; for space reasons only 5 entries are shown here, the rest follow the same pattern.
define service {
        use generic-service
        name ganglia-service1               ; referenced in service1.cfg
        hostgroup_name a01                  ; referenced in hadoop1.cfg
        service_groups ganglia-metrics1     ; referenced in service1.cfg
        register 0
}
define service {
        use generic-service
        name ganglia-service2               ; referenced in service2.cfg
        hostgroup_name a02                  ; referenced in hadoop2.cfg
        service_groups ganglia-metrics2     ; referenced in service2.cfg
        register 0
}
define service {
        use generic-service
        name ganglia-service3               ; referenced in service3.cfg
        hostgroup_name a03                  ; referenced in hadoop3.cfg
        service_groups ganglia-metrics3     ; referenced in service3.cfg
        register 0
}
define service {
        use generic-service
        name ganglia-service4               ; referenced in service4.cfg
        hostgroup_name a04                  ; referenced in hadoop4.cfg
        service_groups ganglia-metrics4     ; referenced in service4.cfg
        register 0
}
define service {
        use generic-service
        name ganglia-service5               ; referenced in service5.cfg
        hostgroup_name a05                  ; referenced in hadoop5.cfg
        service_groups ganglia-metrics5     ; referenced in service5.cfg
        register 0
}
hadoop1.cfg configuration
This file does not exist by default; copy it from localhost.cfg:
$cp localhost.cfg hadoop1.cfg
# vim hadoop1.cfg
define host{
        use                  linux-server
        host_name            a01
        alias                a01
        address              a01
        }
define hostgroup {
        hostgroup_name       a01
        alias                a01
        members              a01
        }
define service{
        use                  local-service
        host_name            a01
        service_description  PING
        check_command        check_ping!100,20%!500,60%
        }
define service{
        use                  local-service
        host_name            a01
        service_description  Root Partition
        check_command        check_local_disk!20%!10%!/
#       contact_groups       admins
        }
define service{
        use                  local-service
        host_name            a01
        service_description  Current Users
        check_command        check_local_users!20!50
        }
define service{
        use                  local-service
        host_name            a01
        service_description  Total Processes
        check_command        check_local_procs!550!650!RSZDT
        }
define service{
        use                  local-service
        host_name            a01
        service_description  Current Load
        check_command        check_local_load!5.0,4.0,3.0!10.0,6.0,4.0
        }
service1.cfg configuration
There is no service1.cfg by default; create one:
$ vim service1.cfg
define servicegroup {
        servicegroup_name    ganglia-metrics1
        alias                Ganglia Metrics1
        }
## check_ganglia here is the check_ganglia command declared in commands.cfg
define service{
        use                  ganglia-service1
        service_description  Memory Free
        check_command        check_ganglia!mem_free!200!50
        }
define service{
        use                  ganglia-service1
        service_description  NameNode Sync
        check_command        check_ganglia!dfs.namenode.SyncsAvgTime!10!50
        }
hadoop2.cfg configuration
Note that any service using the check_nrpe plugin must have its command declared in nrpe.cfg on hadoop2.
In other words, each service's check_command only works if it has been declared in that machine's nrpe.cfg, and the names must match exactly.
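Before wiring the services into Nagios, each nrpe command can be tested from hadoop1; a small sketch, assuming a02 resolves to one of the datanodes:

$ /usr/local/nagios/libexec/check_nrpe -H a02 -c check_swap
## should return something like: SWAP OK - 100% free (2047 MB out of 2047 MB) |swap=2047MB;...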
$ cp localhost.cfg hadoop2.cfg
$ vim hadoop2.cfg
define host{
        use                  linux-server    ; Name of host template to use
                                              ; This host definition will inherit all variables that are defined
                                              ; in (or inherited by) the linux-server host template definition.
        host_name            a02
        alias                a02
        address              a02
        }
# Define an optional hostgroup for Linux machines
define hostgroup{
        hostgroup_name       a02             ; The name of the hostgroup
        alias                a02             ; Long name of the group
        members              a02             ; Comma separated list of hosts that belong to this group
        }
# Define a service to "ping" the local machine
define service{
        use                  local-service   ; Name of service template to use
        host_name            a02
        service_description  PING
        check_command        check_nrpe!check_ping
        }
# Define a service to check the disk space of the root partition
# on the local machine.  Warning if < 20% free, critical if
# < 10% free space on partition.
define service{
        use                  local-service
        host_name            a02
        service_description  Root Partition
        check_command        check_nrpe!check_sda2
        }
# Define a service to check the number of currently logged in
# users on the local machine.  Warning if > 20 users, critical
# if > 50 users.
define service{
        use                  local-service
        host_name            a02
        service_description  Current Users
        check_command        check_nrpe!check_users
        }
# Define a service to check the number of currently running procs
# on the local machine.  Warning if > 250 processes, critical if
# > 400 users.
define service{
        use                  local-service
        host_name            a02
        service_description  Total Processes
        check_command        check_nrpe!check_total_procs
        }
define service{
        use                  local-service
        host_name            a02
        service_description  Current Load
        check_command        check_nrpe!check_load
        }
# Define a service to check the swap usage the local machine.
# Critical if less than 10% of swap is free, warning if less than 20% is free
define service{
        use                  local-service
        host_name            a02
        service_description  Swap Usage
        check_command        check_nrpe!check_swap
        }
hadoop2 is done; make 16 copies of it, since the datanode configs are basically identical except for the hostname:
$ for i in {3..18};do cp hadoop2.cfg hadoop$i.cfg;done
Just change the hostname in each copy; I won't repeat this step later.
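A sed loop can take care of the renaming; a sketch, assuming the hosts follow the aNN naming used above (a02 through a18):

$ for i in {3..18}; do n=$(printf '%02d' $i); sed -i "s/a02/a$n/g" hadoop$i.cfg; done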
service2.cfg configuration
Create the file and configure it:
$ vim service2.cfg
define servicegroup {
        servicegroup_name    ganglia-metrics2
        alias                Ganglia Metrics2
        }
define service{
        use                  ganglia-service2
        service_description  Memory Free
        check_command        check_ganglia!mem_free!200!50
        }
define service{
        use                  ganglia-service2
        service_description  RegionServer_Get
        check_command        check_ganglia!yarn.NodeManagerMetrics.AvailableVCores!7!7
        }
define service{
        use                  ganglia-service2
        service_description  DataNode_Heartbeat
        check_command        check_ganglia!dfs.datanode.HeartbeatsAvgTime!15!40
        }
service2 is done; make 16 copies of it, since the datanode configs are basically the same except for servicegroup_name and use:
$ for i in {3..18};do scp service2.cfg service$i.cfg;done
Change the numbers in each copy accordingly.
Edit nagios.cfg
$ vim ../nagios.cfg
cfg_file=/usr/local/nagios/etc/objects/commands.cfg
cfg_file=/usr/local/nagios/etc/objects/contacts.cfg
cfg_file=/usr/local/nagios/etc/objects/timeperiods.cfg
cfg_file=/usr/local/nagios/etc/objects/templates.cfg
# host definition files
cfg_file=/usr/local/nagios/etc/objects/hadoop1.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop2.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop3.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop4.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop5.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop6.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop7.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop8.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop9.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop10.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop11.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop12.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop13.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop14.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop15.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop16.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop17.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop18.cfg
# service definition files
cfg_file=/usr/local/nagios/etc/objects/service1.cfg
cfg_file=/usr/local/nagios/etc/objects/service2.cfg
cfg_file=/usr/local/nagios/etc/objects/service3.cfg
cfg_file=/usr/local/nagios/etc/objects/service4.cfg
cfg_file=/usr/local/nagios/etc/objects/service5.cfg
cfg_file=/usr/local/nagios/etc/objects/service6.cfg
cfg_file=/usr/local/nagios/etc/objects/service7.cfg
cfg_file=/usr/local/nagios/etc/objects/service8.cfg
cfg_file=/usr/local/nagios/etc/objects/service9.cfg
cfg_file=/usr/local/nagios/etc/objects/service10.cfg
cfg_file=/usr/local/nagios/etc/objects/service11.cfg
cfg_file=/usr/local/nagios/etc/objects/service12.cfg
cfg_file=/usr/local/nagios/etc/objects/service13.cfg
cfg_file=/usr/local/nagios/etc/objects/service14.cfg
cfg_file=/usr/local/nagios/etc/objects/service15.cfg
cfg_file=/usr/local/nagios/etc/objects/service16.cfg
cfg_file=/usr/local/nagios/etc/objects/service17.cfg
cfg_file=/usr/local/nagios/etc/objects/service18.cfg
Verify that the configuration is correct:
$ pwd
/usr/local/nagios/etc
$ ../bin/nagios -v nagios.cfg

Nagios Core 4.1.1
Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 08-19-2015
License: GPL
Website: https://www.nagios.org
Reading configuration data...
   Read main config file okay...
   Read object config files okay...

Running pre-flight check on configuration data...

Checking objects...
        Checked 161 services.
        Checked 18 hosts.
        Checked 18 host groups.
        Checked 18 service groups.
        Checked 1 contacts.
        Checked 1 contact groups.
        Checked 26 commands.
        Checked 5 time periods.
        Checked 0 host escalations.
        Checked 0 service escalations.
Checking for circular paths...
        Checked 18 hosts
        Checked 0 service dependencies
        Checked 0 host dependencies
        Checked 5 timeperiods
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...

Total Warnings: 0
Total Errors:   0

Things look okay - No serious problems were detected during the pre-flight check
No errors, so the nagios service on hadoop1 can now be started:
$ /etc/init.d/nagios start
Starting nagios: done.
The nrpe daemons on the datanodes were already started earlier, so test whether hadoop1 can talk to nrpe on each datanode:
$ for i in {10..28};do /usr/local/nagios/libexec/check_nrpe -H xx.xxx.x.$i;done
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
OK, communication is fine. Verify that the check_ganglia.py plugin works:
$ /usr/local/nagios/libexec/check_ganglia.py -h a01 -m mem_free -w 200 -c 50
CHECKGANGLIA OK: mem_free is 61840868.00
It works. Now open the nagios web page to see whether the monitoring shows up:
localhost:8080/nagios
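If the page asks for a login and no web user has been created yet, the usual step (not covered in the original) is to create the htpasswd.users file that the Apache config generated by make install-webconf points at, then restart httpd:

$ htpasswd -c /usr/local/nagios/etc/htpasswd.users nagiosadmin
$ service httpd restart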
First check whether sendmail is installed on the server:
$ rpm -q sendmail
$ yum install sendmail        # install sendmail if it is missing
$ service sendmail restart    # restart sendmail
Sending mail to external addresses would require running our own mail server, which is cumbersome and resource-hungry. Instead, configure the system to use an existing SMTP server.
The configuration file is /etc/mail.rc:
$ vim /etc/mail.rc
set from=systeminformation@xxx.com
set smtp=mail.xxx.com smtp-auth-user=systeminformation smtp-auth-password=111111 smtp-auth=login
Once that is configured, test from the command line whether mail can be sent:
$ echo "hello world" |mail -s "test" pingjie@xxx.com
If the message arrives in your mailbox, the mail setup is working.
Next, configure Nagios email alerts:
$ vim /usr/local/nagios/etc/objects/contacts.cfg
define contact{
        contact_name                    nagiosadmin             ; Short name of user
        use                             generic-contact         ; Inherit default values from generic-contact template (defined above)
        alias                           Nagios Admin            ; Full name of user
        ## alert time periods
        service_notification_period     24x7
        host_notification_period        24x7
        ## alert message options
        service_notification_options    w,u,c,r,f,s
        host_notification_options       d,u,r,f,s
        ## alert method: email
        service_notification_commands   notify-service-by-email
        host_notification_commands      notify-host-by-email
        email                           pingjie@xxx.com         ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
        }
# We only have one contact in this simple configuration file, so there is
# no need to create more than one contact group.
define contactgroup{
        contactgroup_name       admins
        alias                   Nagios Administrators
        members                 nagiosadmin
        }
That completes the whole configuration.
1. Script to monitor the datanodes
It simply uses Python to fetch the HDFS status page and extract the Live Nodes count with a regex.
#!/usr/bin/env python

import commands
import sys
from optparse import OptionParser
import urllib
import re

def get_value():
    urlItem = urllib.urlopen("http://namenode:50070/dfshealth.jsp")
    html = urlItem.read()
    urlItem.close()
    return float(re.findall('.+Live Nodes</a> <td id="col2"> :<td id="col3">\\s+(\d+)\\s+\\(Decommissioned: \d+\\)<tr class="rowNormal">.+', html)[0])

if __name__ == '__main__':

    parser = OptionParser(usage="%prog [-w] [-c]", version="%prog 1.0")
    parser.add_option("-w", "--warning", type="int", dest="w", default=16)
    parser.add_option("-c", "--critical", type="int", dest="c", default=15)
    (options, args) = parser.parse_args()

    if(options.c >= options.w):
        print '-w must greater then -c'
        sys.exit(1)

    value = get_value()

    if(value <= options.c ) :
        print 'CRITICAL - Live Nodes %d' %(value)
        sys.exit(2)
    elif(value <= options.w):
        print 'WARNING - Live Nodes %d' %(value)
        sys.exit(1)
    else:
        print 'OK - Live Nodes %d' %(value)
        sys.exit(0)
2. Script to monitor DFS space:
#!/usr/bin/env python

import commands
import sys
from optparse import OptionParser
import urllib
import re

def get_dfs_free_percent():
    urlItem = urllib.urlopen("http://namenode:50070/dfshealth.jsp")
    html = urlItem.read()
    urlItem.close()
    return float(re.findall('.+<td id="col1"> DFS Remaining%<td id="col2"> :<td id="col3">\\s+(\d+\\.\d+)%<tr class="rowAlt">.+', html)[0])

if __name__ == '__main__':

    parser = OptionParser(usage="%prog [-w] [-c]", version="%prog 1.0")
    parser.add_option("-w", "--warning", type="int", dest="w", default=30, help="total dfs used percent")
    parser.add_option("-c", "--critical", type="int", dest="c", default=20, help="total dfs used percent")
    (options, args) = parser.parse_args()

    if(options.c >= options.w):
        print '-w must greater then -c'
        sys.exit(1)

    dfs_free_percent = get_dfs_free_percent()

    if(dfs_free_percent <= options.c ) :
        print 'CRITICAL - DFS free %d%%' %(dfs_free_percent)
        sys.exit(2)
    elif(dfs_free_percent <= options.w):
        print 'WARNING - DFS free %d%%' %(dfs_free_percent)
        sys.exit(1)
    else:
        print 'OK - DFS free %d%%' %(dfs_free_percent)
        sys.exit(0)
If a script errors out, open a Python shell and debug the regex against the actual HTML of the page.
Copy these two scripts to /usr/local/nagios/etc/objects/
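The scripts must also be executable, and note that $USER1$ in commands.cfg normally expands to /usr/local/nagios/libexec (set in resource.cfg), so make sure copies end up wherever that macro points on your install; a sketch:

$ chmod 755 check_hadoop_datanode.py check_hadoop_dfs.py
$ cp check_hadoop_datanode.py check_hadoop_dfs.py /usr/local/nagios/libexec/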
Try running the two scripts directly from the command line (e.g. ./check_hadoop_datanode.py). If you get this error:
: No such file or directory
the files have Windows line endings, so the shebang points at a nonexistent interpreter; open each file in vim, run :set ff=unix in command mode, and save.
3. Update the Nagios configuration
Add the following two commands to commands.cfg:
$ vim /usr/local/nagios/etc/objects/commands.cfg
define command{
        command_name check_datanode
        command_line $USER1$/check_hadoop_datanode.py -w $ARG1$ -c $ARG2$
        }
define command{
        command_name check_dfs
        command_line $USER1$/check_hadoop_dfs.py -w $ARG1$ -c $ARG2$
        }
Edit service1.cfg and add the following two services:
$ vim service1.cfg
define service{
        use                  ganglia-service1
        service_description  Live DataNodes
        check_command        check_datanode!16!15
        }
define service{
        use                  ganglia-service1
        service_description  DFS Free Space
        check_command        check_dfs!30!20
        }
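After adding the new commands and services, re-run the pre-flight check and restart Nagios so they take effect (same procedure as before):

$ /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
$ /etc/init.d/nagios restart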
Done.
5.1 A problem with the metrics Ganglia reports
Problem: to test Nagios alerting I killed the datanode process on one node, but Nagios kept showing that datanode as healthy. Since these Nagios checks take their values from Ganglia, I looked at Ganglia, and it also showed the node as normal. This is puzzling: why does a killed datanode apparently keep sending heartbeats?
Solution: none yet; if you know one, please share. As a workaround, use a script-based Nagios check to monitor the process directly.