Ganglia + Nagios: monitoring Hadoop resources and alerting

This article draws mainly on the following two articles:

http://quenlang.blog.51cto.com/4813803/1571635

http://www.cnblogs.com/mchina/archive/2013/02/20/2883404.html#!comments

1. Downloads

ganglia-3.6.0.tar.gz

ganglia-web-3.6.2.tar.gz

nagios : http://sourceforge.net/projects/nagios/files/nagios-4.x/nagios-4.1.1/nagios-4.1.1.tar.gz/download

nagios-plugins : http://www.nagios-plugins.org/download/nagios-plugins-2.1.1.tar.gz

nrpe : http://sourceforge.net/projects/nagios/files/nrpe-2.x/nrpe-2.15/nrpe-2.15.tar.gz/download

php-5.4.10.tar.gz

 

2. Installing Ganglia

hadoop1 gets ganglia's gmetad, gmond, and ganglia-web.

2.1 Checking and installing dependencies

Create a file named ganglia.rpm listing the required packages:

$ vim ganglia.rpm
apr-devel
apr-util
check-devel
cairo-devel
pango-devel
libxml2-devel
glib2-devel
dbus-devel
freetype-devel
fontconfig-devel
gcc-c++
expat-devel
python-devel
rrdtool
rrdtool-devel
libXrender-devel
zlib
libart_lgpl
libpng
dejavu-lgc-sans-mono-fonts
dejavu-sans-mono-fonts
perl-ExtUtils-CBuilder
perl-ExtUtils-MakeMaker

Check which of these packages are already installed:

$ rpm -q `cat ganglia.rpm`
package apr-devel is not installed
apr-util-1.3.9-3.el6_0.1.x86_64
check-devel-0.9.8-1.1.el6.x86_64
cairo-devel-1.8.8-3.1.el6.x86_64
pango-devel-1.28.1-10.el6.x86_64
libxml2-devel-2.7.6-14.el6_5.2.x86_64
glib2-devel-2.28.8-4.el6.x86_64
dbus-devel-1.2.24-7.el6_3.x86_64
freetype-devel-2.3.11-14.el6_3.1.x86_64
fontconfig-devel-2.8.0-5.el6.x86_64
gcc-c++-4.4.7-11.el6.x86_64
package expat-devel is not installed
python-devel-2.6.6-52.el6.x86_64
libXrender-devel-0.9.8-2.1.el6.x86_64
zlib-1.2.3-29.el6.x86_64
libart_lgpl-2.3.20-5.1.el6.x86_64
libpng-1.2.49-1.el6_2.x86_64
package dejavu-lgc-sans-mono-fonts is not installed
package dejavu-sans-mono-fonts is not installed
perl-ExtUtils-CBuilder-0.27-136.el6.x86_64
perl-ExtUtils-MakeMaker-6.55-136.el6.x86_64

Use yum install to install whichever packages are missing.
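One way to feed the missing ones straight into yum in a single shot (a sketch that just parses the rpm -q output above):

$ rpm -q `cat ganglia.rpm` | grep "is not installed" | awk '{print $2}' | xargs yum install -y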

 

You also need to install confuse.

Download: http://www.nongnu.org/confuse/

$ tar -zxf confuse-2.7.tar.gz
$ cd confuse-2.7
$ ./configure CFLAGS=-fPIC --disable-nls
$ make && make install
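confuse installs its library under /usr/local/lib by default. If gmond later fails to load libconfuse at runtime, registering that path with the dynamic linker usually helps; this extra step is my own addition, not from the articles above:

$ echo "/usr/local/lib" > /etc/ld.so.conf.d/confuse.conf
$ ldconfig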

2.2 Installing ganglia

On hadoop1:

$ tar -xvf /home/hadoop/ganglia-3.6.0.tar.gz -C /opt/soft/
$ cd /opt/soft/ganglia-3.6.0
## install gmetad
$ ./configure --prefix=/usr/local/ganglia --with-gmetad --with-libpcre=no --enable-gexec --enable-status --sysconfdir=/etc/ganglia
$ make && make install
$ cp gmetad/gmetad.init /etc/init.d/gmetad
$ cp /usr/local/ganglia/sbin/gmetad /usr/sbin/
$ chkconfig --add gmetad
## install gmond
$ cp gmond/gmond.init /etc/init.d/gmond
$ cp /usr/local/ganglia/sbin/gmond /usr/sbin/
$ gmond --default_config>/etc/ganglia/gmond.conf
$ chkconfig --add gmond

With gmetad and gmond installed, the next step is ganglia-web, which needs php and httpd first:

yum install php httpd -y

Edit httpd's configuration file /etc/httpd/conf/httpd.conf, changing only the listening port to 8080:

Listen 8080

 

Install ganglia-web:

$ tar xf ganglia-web-3.6.2.tar.gz  -C /opt/soft/
$ cd /opt/soft/
$ chmod -R 777 ganglia-web-3.6.2/
$ cd ganglia-web-3.6.2
$ useradd www-data
$ make install
$ chmod 777 /var/lib/ganglia-web/dwoo/cache/
$ chmod 777 /var/lib/ganglia-web/dwoo/compiled/
$ cd ..
$ mv ganglia-web-3.6.2/ /var/www/html/ganglia

That completes the ganglia-web installation. Now edit conf_default.php to point at the ganglia-web directory and the rrd data directory, changing these two lines:

36 # Where gmetad stores the rrd archives.
37 $conf['gmetad_root'] = "/var/www/html/ganglia"; ## set to the web app's install directory
38 $conf['rrds'] = "/var/lib/ganglia/rrds";        ## path where the rrd data is stored

Create the rrd data directory and set its ownership:

$ mkdir /var/lib/ganglia/rrds -p
$ chown nobody:nobody /var/lib/ganglia/rrds/ -R

That completes all the ganglia work on hadoop1. Next, install the gmond client on every other node.

 

Installing gmond on the other nodes

The dependencies must be installed first, then gmond. Since every node is identical, a script does it all:

$ vim install_ganglia.sh

#!/bin/sh

#install dependencies. I already knew which packages were missing, so only these are installed; adjust the list for your environment
yum install -y apr-devel expat-devel rrdtool rrdtool-devel

mkdir /opt/soft;cd /opt/soft
tar -xvf /home/hadoop/confuse-2.7.tar.gz
cd confuse-2.7
./configure CFLAGS=-fPIC --disable-nls
make && make install
cd /opt/soft
#install ganglia gmond
tar -xvf /home/hadoop/ganglia-3.6.0.tar.gz
cd ganglia-3.6.0/
./configure --prefix=/usr/local/ganglia --with-libpcre=no --enable-gexec --enable-status --sysconfdir=/etc/ganglia
make && make install
cp gmond/gmond.init /etc/init.d/gmond
cp /usr/local/ganglia/sbin/gmond /usr/sbin/
gmond --default_config>/etc/ganglia/gmond.conf
chkconfig --add gmond

Copy this script to every node and run it.

2.3 Configuring ganglia

There are server-side and client-side configurations: the server side uses gmetad.conf, the client side gmond.conf.

First configure gmetad.conf on hadoop1; only hadoop1 has this file.

$ vi  /etc/ganglia/gmetad.conf
## Define the data source name and listen address. gmond sends the data it
## collects to the rrd data directory on the machine named by this data source.
## "hadoop cluster" is a name you choose yourself.
data_source "hadoop cluster" 192.168.0.101:8649
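For reference, the full syntax is data_source "name" [polling interval in seconds] host1 [host2 ...]. A second cluster would simply be another line; the second entry below is purely hypothetical:

data_source "hadoop cluster" 192.168.0.101:8649
data_source "hbase cluster" 15 192.168.0.201:8650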

Next configure gmond.conf:

$ head -n 80 /etc/ganglia/gmond.conf

/* This configuration is as close to 2.5.x default behavior as possible
   The values closely match ./gmond/metric.h definitions in 2.5.x */
globals {
  daemonize = yes        ## run as a daemon
  setuid = yes           
  user = nobody          ## the user gmond runs as
  debug_level = 0        ## set to 1 to print debug output at startup
  max_udp_msg_len = 1472
  mute = no              ## mute: this node stops broadcasting the data it collects
  deaf = no              ## deaf: this node stops receiving data broadcast by other nodes
  allow_extra_data = yes
  host_dmax = 86400 /*secs. Expires (removes from web interface) hosts in 1 day */
  host_tmax = 20 /*secs */
  cleanup_threshold = 300 /*secs */
  gexec = no
  # By default gmond will use reverse DNS resolution when displaying your hostname
  # Uncommeting following value will override that value.
  # override_hostname = "mywebserver.domain.com"
  # If you are not using multicast this value should be set to something other than 0.
  # Otherwise if you restart aggregator gmond you will get empty graphs. 60 seconds is reasonable
  send_metadata_interval = 0 /*secs */
 
}
 
/*
 * The cluster attributes specified will be used as part of the <CLUSTER>
 * tag that will wrap all hosts collected by this instance.
 */
cluster {
  name = "hadoop cluster"    ## 指定集羣的名字
  owner = "nobody"           ## 集羣的全部者
  latlong = "unspecified"
  url = "unspecified"
}
 
/* The host section describes attributes of the host, like the location */
host {
  location = "unspecified"
}
 
/* Feel free to specify as many udp_send_channels as you like.  Gmond
   used to only support having a single channel */
udp_send_channel {
  #bind_hostname = yes # Highly recommended, soon to be default.
                       # This option tells gmond to use a source address
                       # that resolves to the machine's hostname.  Without
                       # this, the metrics may appear to come from any
                       # interface and the DNS names associated with
                       # those IPs will be used to create the RRDs.
#  mcast_join = 239.2.11.71    ## comment this out for unicast mode
  host = 192.168.0.101    ## unicast mode: the host that receives the data
  port = 8649             ## listening port
  ttl = 1
}
 
/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
  #mcast_join = 239.2.11.71    ## comment this out for unicast mode
  port = 8649
  #bind = 239.2.11.71          ## comment this out for unicast mode
  retry_bind = true
  # Size of the UDP buffer. If you are handling lots of metrics you really
  # should bump it up to e.g. 10MB or even higher.
  # buffer = 10485760
}
 
/* You can specify as many tcp_accept_channels as you like to share
   an xml description of the state of the cluster */
tcp_accept_channel {
  port = 8649
  # If you want to gzip XML output
  gzip_output = no
}
 
/* Channel to receive sFlow datagrams */
#udp_recv_channel {
#  port = 6343
#}
 
/* Optional sFlow settings */

That finishes gmetad.conf and gmond.conf on hadoop1. Now simply scp hadoop1's gmond.conf to the same path on every other node, overwriting the existing gmond.conf.
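For example, assuming passwordless ssh and the other nodes named hadoop2 through hadoop18 (adjust the hostnames to your cluster):

$ for i in {2..18}; do scp /etc/ganglia/gmond.conf hadoop$i:/etc/ganglia/gmond.conf; done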

2.4 Starting ganglia

Start the gmond service on every node:

/etc/init.d/gmond start

Start the gmetad and httpd services on the hadoop1 node:

/etc/init.d/gmetad start
/etc/init.d/httpd start
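Before opening the web UI, a quick sanity check: gmond serves the cluster state as XML on its tcp_accept_channel, so you can dump it with nc (ports as configured above; gmetad's default XML ports are 8651/8652):

$ netstat -lnt | grep -E '8649|8651'
$ nc 192.168.0.101 8649 | head    # should print <GANGLIA_XML ...> with <HOST> entries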

2.5 Visit hadoop1:8080/ganglia in a browser and the ganglia overview page appears.

Configuration complete.

3. Configuring Hadoop

So far ganglia only monitors basic host metrics, not Hadoop itself. Next, edit the Hadoop configuration files, using hadoop1's as the example; the other nodes' copies should be taken from hadoop1. The first file to change is hadoop-metrics2.properties in the Hadoop configuration directory.

$ cd /usr/local/hadoop-2.6.0/etc/hadoop/
$ vim hadoop-metrics2.properties
# for Ganglia 3.1 support
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31

*.sink.ganglia.period=10

# default for supportsparse is false
*.sink.ganglia.supportsparse=true

*.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both
*.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40

# Tag values to use for the ganglia prefix. If not defined no tags are used.
# If '*' all tags are used. If specifiying multiple tags separate them with 
# commas. Note that the last segment of the property name is the context name.
#
#*.sink.ganglia.tagsForPrefix.jvm=ProcesName
#*.sink.ganglia.tagsForPrefix.dfs=
#*.sink.ganglia.tagsForPrefix.rpc=
#*.sink.ganglia.tagsForPrefix.mapred=

namenode.sink.ganglia.servers=192.168.0.101:8649
datanode.sink.ganglia.servers=192.168.0.101:8649
resourcemanager.sink.ganglia.servers=192.168.0.101:8649
nodemanager.sink.ganglia.servers=192.168.0.101:8649
mrappmaster.sink.ganglia.servers=192.168.0.101:8649
jobhistoryserver.sink.ganglia.servers=192.168.0.101:8649

Copy it to every node and restart the Hadoop cluster.
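One way to push the file out and bounce the daemons, again assuming nodes hadoop2..hadoop18 and the install path above (adapt to however you manage your cluster):

$ for i in {2..18}; do scp hadoop-metrics2.properties hadoop$i:/usr/local/hadoop-2.6.0/etc/hadoop/; done
$ /usr/local/hadoop-2.6.0/sbin/stop-all.sh
$ /usr/local/hadoop-2.6.0/sbin/start-all.sh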

Hadoop metrics now show up in the monitoring pages.

 

4. Installing Nagios

4.1 The hadoop1 machine

Create the nagios user:

# useradd -s /sbin/nologin nagios
# mkdir /usr/local/nagios
# chown -R nagios.nagios /usr/local/nagios

4.1.1 Building and installing nagios

$ cd /opt/soft
$ tar zxvf nagios-4.1.1.tar.gz
$ cd nagios-4.1.1
$ ./configure --prefix=/usr/local/nagios
$ make all
$ make install
$ make install-init
$ make install-config
$ make install-commandmode
$ make install-webconf

Change to the install prefix (here /usr/local/nagios) and check that the five directories etc, bin, sbin, share, and var exist; if so, the program was installed correctly.
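For example:

$ ls /usr/local/nagios
bin  etc  sbin  share  var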

4.1.2 Building and installing nagios-plugins

$ cd /opt/soft
$ tar zxvf nagios-plugins-2.1.1.tar.gz
$ cd nagios-plugins-2.1.1
$ mkdir -p /usr/local/nagios
$ ./configure --prefix=/usr/local/nagios
$ make && make install

4.1.3 Installing the check_nrpe plugin

$ cd /opt/soft/
$ tar -xvf /home/hadoop/nrpe-2.15.tar.gz
$ cd nrpe-2.15/
$ ./configure
$ make all
$ make install-plugin

4.2 The datanode nodes

The datanodes only need nagios-plugins and nrpe.

Since every node is identical, here again is a script:

#!/bin/sh

adduser nagios

cd /opt/soft
tar xvf /home/hadoop/nagios-plugins-2.1.1.tar.gz
cd nagios-plugins-2.1.1
mkdir /usr/local/nagios
./configure --prefix=/usr/local/nagios
make && make install

chown nagios.nagios /usr/local/nagios
chown -R nagios.nagios /usr/local/nagios/libexec

#install xinetd. Check whether your machine already has xinetd; install it only if it is missing
yum install xinetd -y
cd /opt/soft
tar xvf /home/hadoop/nrpe-2.15.tar.gz
cd nrpe-2.15
./configure
make all
make install-daemon
make install-daemon-config
make install-xinetd

After the installation finishes, edit nrpe.cfg:

$ vim /usr/local/nagios/etc/nrpe.cfg 
log_facility=daemon
pid_file=/var/run/nrpe.pid
## nrpe listening port
server_port=5666
nrpe_user=nagios
nrpe_group=nagios
## the nagios server's address
allowed_hosts=xx.xxx.x.xx
dont_blame_nrpe=0
allow_bash_command_substitution=0
debug=0
command_timeout=60
connection_timeout=300
## system load
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
## number of logged-in users
command[check_users]=/usr/local/nagios/libexec/check_users -w 5 -c 10
## free space on the root partition
command[check_sda2]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/sda2
## mysql status
command[check_mysql]=/usr/local/nagios/libexec/check_mysql -H localhost -P 3306 -d kora -u kora -p upbjsxt
## host reachability
command[check_ping]=/usr/local/nagios/libexec/check_ping -H localhost -w 100.0,20% -c 500.0,60%
## total number of processes
command[check_total_procs]=/usr/local/nagios/libexec/check_procs -w 150 -c 200
## swap usage
command[check_swap]=/usr/local/nagios/libexec/check_swap -w 20 -c 10

Only commands defined in this file on the monitored machine can be fetched through the nrpe plugin from the monitoring machine (hadoop1). In other words, whatever you want to monitor must be defined here first.
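You can confirm this rule from hadoop1: asking for a command that was never declared is refused (check_foo below is a deliberately undefined name, and the IP is a placeholder):

$ /usr/local/nagios/libexec/check_nrpe -H xx.xxx.x.xx -c check_foo
NRPE: Command 'check_foo' not defined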

Sync the file to all the other datanode nodes.

 

You can see that the file /etc/xinetd.d/nrpe was created.

Edit this file (the screenshots here were borrowed from another article, so the version number differs from this configuration; the idea is what matters):

Add the monitoring host's IP address after only_from.

Edit the /etc/services file and add the NRPE service.
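The line to append (5666 is the standard NRPE port):

nrpe            5666/tcp                        # NRPE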

Restart the xinetd service:

# service xinetd restart

Check whether NRPE has started.

Port 5666 is now listening.
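For example:

$ netstat -lnt | grep 5666
tcp        0      0 0.0.0.0:5666        0.0.0.0:*         LISTEN

And from hadoop1, a round trip through one declared command should now work, printing something along the lines of:

$ /usr/local/nagios/libexec/check_nrpe -H xx.xxx.x.xx -c check_load
OK - load average: 0.01, 0.05, 0.10|load1=0.010;15.000;30.000;0;...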

4.3 Configuration

On hadoop1:

To integrate nagios with ganglia, copy the ganglia plugin shipped in the ganglia source tree into nagios's plugin directory on hadoop1:

$ cd /opt/soft/ganglia-3.6.0
$ cp contrib/check_ganglia.py /usr/local/nagios/libexec/

By default, check_ganglia.py only handles the case where a value above the critical threshold is bad; we also need the case where a value below the critical threshold is bad, i.e. the else branch appended at the end:

$ vim  /usr/local/nagios/libexec/check_ganglia.py

  if critical > warning:
    if value >= critical:
      print "CHECKGANGLIA CRITICAL: %s is %.2f" % (metric, value)
      sys.exit(2)
    elif value >= warning:
      print "CHECKGANGLIA WARNING: %s is %.2f" % (metric, value)
      sys.exit(1)
    else:
      print "CHECKGANGLIA OK: %s is %.2f" % (metric, value)
      sys.exit(0)
  else:
    if value <= critical:
      print "CHECKGANGLIA CRITICAL: %s is %.2f" % (metric, value)
      sys.exit(2)
    elif value <= warning:
      print "CHECKGANGLIA WARNING: %s is %.2f" % (metric, value)
      sys.exit(1)
    else:
      print "CHECKGANGLIA OK: %s is %.2f" % (metric, value)
      sys.exit(0)

The end of the script should look like the above.

 

Configure the hosts and their checks on hadoop1.

Before any configuration, the directory looks like this:

$ cd /usr/local/nagios/etc/objects/
$ ll
total 48
-rw-rw-r-- 1 nagios nagios  8010 Sep 11 14:59 commands.cfg
-rw-rw-r-- 1 nagios nagios  2138 Sep 11 11:35 contacts.cfg
-rw-rw-r-- 1 nagios nagios  5375 Sep 11 11:35 localhost.cfg
-rw-rw-r-- 1 nagios nagios  3096 Sep 11 11:35 printer.cfg
-rw-rw-r-- 1 nagios nagios  3265 Sep 11 11:35 switch.cfg
-rw-rw-r-- 1 nagios nagios 10621 Sep 11 11:35 templates.cfg
-rw-rw-r-- 1 nagios nagios  3180 Sep 11 11:35 timeperiods.cfg
-rw-rw-r-- 1 nagios nagios  3991 Sep 11 11:35 windows.cfg

Note: inline comments after a directive in the cfg files must start with a semicolon (;), not a # sign. I used # at first and it caused problems that took a long time to track down.
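In other words, inline comments must look like the first line below; the second silently corrupts the value (this mirrors the mistake I made):

hostgroup_name  a01    ; a valid inline comment
hostgroup_name  a01    # WRONG - the # does not start a comment here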

 

Edit commands.cfg

Append the following at the end of the file:

# 'check_ganglia' command definition
define command{
        command_name    check_ganglia
        command_line    $USER1$/check_ganglia.py -h $HOSTADDRESS$ -m $ARG1$ -w $ARG2$ -c $ARG3$
        }

# 'check_nrpe' command definition
define command{
        command_name    check_nrpe
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
        }

Edit templates.cfg

I have 18 datanode machines; for space, only 5 entries are shown here, and the rest follow the same pattern.

define service { 
        use generic-service 
        name ganglia-service1     ;referenced in service1.cfg
        hostgroup_name a01    ;referenced in hadoop1.cfg
        service_groups ganglia-metrics1    ;referenced in service1.cfg
        register        0
}
 
define service { 
        use generic-service    
        name ganglia-service2    ;referenced in service2.cfg
        hostgroup_name a02    ;referenced in hadoop2.cfg
        service_groups ganglia-metrics2    ;referenced in service2.cfg
        register        0
}
define service { 
        use generic-service 
        name ganglia-service3    ;referenced in service3.cfg
        hostgroup_name a03    ;referenced in hadoop3.cfg
        service_groups ganglia-metrics3    ;referenced in service3.cfg
        register        0
}
define service { 
        use generic-service 
        name ganglia-service4    ;referenced in service4.cfg
        hostgroup_name a04    ;referenced in hadoop4.cfg
        service_groups ganglia-metrics4    ;referenced in service4.cfg
        register        0
}
define service { 
        use generic-service     
        name ganglia-service5    ;referenced in service5.cfg
        hostgroup_name a05    ;referenced in hadoop5.cfg
        service_groups ganglia-metrics5    ;referenced in service5.cfg
        register        0
}

hadoop1.cfg configuration

This file does not exist by default; copy it from localhost.cfg:

$ cp localhost.cfg hadoop1.cfg
$ vim hadoop1.cfg
define host{   
        use                     linux-server 
        host_name               a01
        alias                   a01
        address                a01
        }
 
define hostgroup { 
        hostgroup_name  a01
        alias  a01
        members a01
        }
define service{
        use                             local-service
        host_name                       a01
        service_description             PING
        check_command                   check_ping!100,20%!500,60%
        }
 
define service{
        use                             local-service
        host_name                      a01
        service_description             Root Partition
        check_command                   check_local_disk!20%!10%!/
#       contact_groups                  admins
        }
 
define service{
        use                             local-service
        host_name                       a01
        service_description             Current Users
        check_command                   check_local_users!20!50
        }
 
define service{
        use                             local-service
        host_name                       a01
        service_description             Total Processes
        check_command                   check_local_procs!550!650!RSZDT
        }
 
define service{ 
        use                             local-service         
        host_name                       a01
        service_description             System Load
        check_command                   check_local_load!5.0,4.0,3.0!10.0,6.0,4.0
} 

service1.cfg configuration

There is no service1.cfg by default; create one:

$ vim service1.cfg

define servicegroup { 
        servicegroup_name ganglia-metrics1
        alias Ganglia Metrics1
} 
## check_ganglia here is the check_ganglia command declared in commands.cfg
define service{ 
        use                             ganglia-service1
        service_description             Free Memory
        check_command                   check_ganglia!mem_free!200!50
} 
 
define service{
        use                             ganglia-service1
        service_description             NameNode Sync
        check_command                   check_ganglia!dfs.namenode.SyncsAvgTime!10!50
}

hadoop2.cfg configuration

Note that any check using the check_nrpe plugin must be declared in nrpe.cfg on hadoop2.

That is, each service's check_command only works if it was declared in that machine's nrpe.cfg, and the names must match exactly.

$ cp localhost.cfg hadoop2.cfg
$ vim hadoop2.cfg
define host{
        use                     linux-server            ; Name of host template to use
                                                        ; This host definition will inherit all variables that are defined
                                                        ; in (or inherited by) the linux-server host template definition.
        host_name               a02
        alias                   a02
        address                 a02
        }

# Define an optional hostgroup for Linux machines

define hostgroup{
        hostgroup_name  a02; The name of the hostgroup
        alias           a02 ; Long name of the group
        members         a02    ; Comma separated list of hosts that belong to this group
        }

# Define a service to "ping" the local machine

define service{
        use                             local-service         ; Name of service template to use
        host_name                       a02
        service_description             PING
        check_command                   check_nrpe!check_ping
        }


# Define a service to check the disk space of the root partition
# on the local machine.  Warning if < 20% free, critical if
# < 10% free space on partition.

define service{
        use                             local-service         ; Name of service template to use
        host_name                       a02
        service_description             Root Partition
        check_command                   check_nrpe!check_sda2
        }



# Define a service to check the number of currently logged in
# users on the local machine.  Warning if > 20 users, critical
# if > 50 users.

define service{
        use                             local-service         ; Name of service template to use
        host_name                       a02
        service_description             Current Users
        check_command                   check_nrpe!check_users
        }


# Define a service to check the number of currently running procs
# on the local machine.  Warning if > 250 processes, critical if
# > 400 users.

define service{
        use                             local-service         ; Name of service template to use
        host_name                       a02
        service_description             Total Processes
        check_command                   check_nrpe!check_total_procs
        }

define service{
        use                             local-service         ; Name of service template to use
        host_name                       a02
        service_description             Current Load
        check_command                   check_nrpe!check_load
        }



# Define a service to check the swap usage the local machine. 
# Critical if less than 10% of swap is free, warning if less than 20% is free

define service{
        use                             local-service         ; Name of service template to use
        host_name                       a02
        service_description             Swap Usage
        check_command                   check_nrpe!check_swap
        }

With hadoop2 configured, make 16 copies; the datanode configs are basically identical apart from the hostname:

$ for i in {3..18};do cp hadoop2.cfg hadoop$i.cfg;done

Just change the hostname inside each remaining copy; this won't be repeated below. A bulk version is sketched next.
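One way to do the renaming in bulk, assuming your hosts follow my a01..a18 two-digit naming (adjust the pattern to your hostnames):

$ for i in {3..18}; do sed -i "s/a02/$(printf 'a%02d' $i)/g" hadoop$i.cfg; done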

service2.cfg configuration

Create the file and configure it:

$ vim service2.cfg 
define servicegroup {
        servicegroup_name ganglia-metrics2
        alias Ganglia Metrics2
}

define service{
        use     ganglia-service2
        service_description     Free Memory
        check_command   check_ganglia!mem_free!200!50
}

define service{
        use     ganglia-service2
        service_description     RegionServer_Get
        check_command   check_ganglia!yarn.NodeManagerMetrics.AvailableVCores!!7
}

define service{
        use     ganglia-service2
        service_description     DataNode_Heartbeat
        check_command   check_ganglia!dfs.datanode.HeartbeatsAvgTime!15!40
}
With service2 configured, make 16 copies; the datanode service files are basically identical apart from servicegroup_name and use:

$ for i in {3..18};do cp service2.cfg service$i.cfg;done

Change each copy to its corresponding number; a bulk version is sketched below.
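A sketch of that renumbering; it only touches the three names that differ between the files:

$ for i in {3..18}; do sed -i "s/ganglia-metrics2/ganglia-metrics$i/g; s/Ganglia Metrics2/Ganglia Metrics$i/g; s/ganglia-service2/ganglia-service$i/g" service$i.cfg; done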

Edit nagios.cfg

$ vim  ../nagios.cfg
cfg_file=/usr/local/nagios/etc/objects/commands.cfg
cfg_file=/usr/local/nagios/etc/objects/contacts.cfg
cfg_file=/usr/local/nagios/etc/objects/timeperiods.cfg
cfg_file=/usr/local/nagios/etc/objects/templates.cfg

#include the host files
cfg_file=/usr/local/nagios/etc/objects/hadoop1.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop2.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop3.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop4.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop5.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop6.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop7.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop8.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop9.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop10.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop11.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop12.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop13.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop14.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop15.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop16.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop17.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop18.cfg

#include the service (check) files
cfg_file=/usr/local/nagios/etc/objects/service1.cfg
cfg_file=/usr/local/nagios/etc/objects/service2.cfg
cfg_file=/usr/local/nagios/etc/objects/service3.cfg
cfg_file=/usr/local/nagios/etc/objects/service4.cfg
cfg_file=/usr/local/nagios/etc/objects/service5.cfg
cfg_file=/usr/local/nagios/etc/objects/service6.cfg
cfg_file=/usr/local/nagios/etc/objects/service7.cfg
cfg_file=/usr/local/nagios/etc/objects/service8.cfg
cfg_file=/usr/local/nagios/etc/objects/service9.cfg
cfg_file=/usr/local/nagios/etc/objects/service10.cfg
cfg_file=/usr/local/nagios/etc/objects/service11.cfg
cfg_file=/usr/local/nagios/etc/objects/service12.cfg
cfg_file=/usr/local/nagios/etc/objects/service13.cfg
cfg_file=/usr/local/nagios/etc/objects/service14.cfg
cfg_file=/usr/local/nagios/etc/objects/service15.cfg
cfg_file=/usr/local/nagios/etc/objects/service16.cfg
cfg_file=/usr/local/nagios/etc/objects/service17.cfg
cfg_file=/usr/local/nagios/etc/objects/service18.cfg

 

Validate the configuration:

$ pwd
/usr/local/nagios/etc

$ ../bin/nagios -v nagios.cfg 

Nagios Core 4.1.1
Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 08-19-2015
License: GPL

Website: https://www.nagios.org
Reading configuration data...
   Read main config file okay...
   Read object config files okay...

Running pre-flight check on configuration data...

Checking objects...
    Checked 161 services.
    Checked 18 hosts.
    Checked 18 host groups.
    Checked 18 service groups.
    Checked 1 contacts.
    Checked 1 contact groups.
    Checked 26 commands.
    Checked 5 time periods.
    Checked 0 host escalations.
    Checked 0 service escalations.
Checking for circular paths...
    Checked 18 hosts
    Checked 0 service dependencies
    Checked 0 host dependencies
    Checked 5 timeperiods
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...

Total Warnings: 0
Total Errors:   0

Things look okay - No serious problems were detected during the pre-flight check

No errors, so nagios can now be started on hadoop1:

$ /etc/init.d/nagios start
Starting nagios: done.

nrpe was already started on the datanodes earlier.

Test that hadoop1 can talk to nrpe on the datanodes:

$ for i in {10..28};do /usr/local/nagios/libexec/check_nrpe -H xx.xxx.x.$i;done
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15

OK, communication works. Now verify that the check_ganglia.py plugin functions:

$ /usr/local/nagios/libexec/check_ganglia.py -h a01 -m mem_free -w 200 -c 50
CHECKGANGLIA OK: mem_free is 61840868.00

It works. Now open nagios's web page to confirm the monitoring is in place:

localhost:8080/nagios

4.4 Configuring email alerts

First check whether sendmail is installed on the server:

$ rpm -q sendmail
$ yum install sendmail #install sendmail if it is missing
$ service sendmail restart #restart sendmail

Sending mail to outside addresses would require this server to run its own mail service, which is troublesome and resource-hungry, so instead we configure it to relay through an existing SMTP server.

Edit /etc/mail.rc:

$ vim /etc/mail.rc

set from=systeminformation@xxx.com
set smtp=mail.xxx.com smtp-auth-user=systeminformation smtp-auth-password=111111 smtp-auth=login

Once configured, test from the command line whether mail can be sent:

$ echo "hello world" |mail -s "test" pingjie@xxx.com

If the message shows up in your mailbox, sendmail is working.

Now configure nagios's email alerting:

$ vim /usr/local/nagios/etc/objects/contacts.cfg
define contact{
        contact_name                    nagiosadmin             ; Short name of user
        use                             generic-contact         ; Inherit default values from generic-contact template (defined above)
        alias                           Nagios Admin            ; Full name of user
        ## notification time period
        service_notification_period     24x7
        host_notification_period        24x7
        ## states that trigger notifications
        service_notification_options    w,u,c,r,f,s
        host_notification_options       d,u,r,f,s
        ## notification method: email
        service_notification_commands   notify-service-by-email
        host_notification_commands      notify-host-by-email
        email                           pingjie@xxx.com       ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
        }


# We only have one contact in this simple configuration file, so there is
# no need to create more than one contact group.

define contactgroup{
        contactgroup_name       admins
        alias                   Nagios Administrators
        members                 nagiosadmin
        }

That completes the whole configuration.

 

Monitoring Hadoop processes with scripts

1. Script that monitors the datanodes

It simply reads the HDFS status page with Python and regex-matches the Live Nodes section.

#!/usr/bin/env python

import sys
from optparse import OptionParser
import urllib
import re

def get_value():
    urlItem = urllib.urlopen("http://namenode:50070/dfshealth.jsp")
    html = urlItem.read()
    urlItem.close()
    return float(re.findall('.+Live Nodes</a> <td id="col2"> :<td id="col3">\\s+(\d+)\\s+\\(Decommissioned: \d+\\)<tr class="rowNormal">.+', html)[0])

if __name__ == '__main__':

    parser = OptionParser(usage="%prog [-w] [-c]", version="%prog 1.0")
    parser.add_option("-w", "--warning", type="int", dest="w", default=16)
    parser.add_option("-c", "--critical", type="int", dest="c", default=15)
    (options, args) = parser.parse_args()

    if(options.c >= options.w):
        print '-w must be greater than -c'
        sys.exit(1)

    value = get_value()

    if(value <= options.c ) :
        print 'CRITICAL - Live Nodes %d' %(value)
        sys.exit(2)
    elif(value <= options.w):
        print 'WARNING - Live Nodes %d' %(value)
        sys.exit(1)
    else:
        print 'OK - Live Nodes %d' %(value)
        sys.exit(0)
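A quick manual run, with the defaults meaning warn below 16 live nodes and go critical at 15 or fewer (the output shape follows the script above; the node count shown is illustrative):

$ ./check_hadoop_datanode.py -w 16 -c 15
OK - Live Nodes 18
$ echo $?
0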

2. Script that monitors DFS free space:

#!/usr/bin/env python

import sys
from optparse import OptionParser
import urllib
import re

def get_dfs_free_percent():
    urlItem = urllib.urlopen("http://namenode:50070/dfshealth.jsp")
    html = urlItem.read()
    urlItem.close()
    return float(re.findall('.+<td id="col1"> DFS Remaining%<td id="col2"> :<td id="col3">\\s+(\d+\\.\d+)%<tr class="rowAlt">.+', html)[0])

if __name__ == '__main__':

    parser = OptionParser(usage="%prog [-w] [-c]", version="%prog 1.0")
    parser.add_option("-w", "--warning", type="int", dest="w", default=30, help="total dfs used percent")
    parser.add_option("-c", "--critical", type="int", dest="c", default=20, help="total dfs used percent")
    (options, args) = parser.parse_args()

    if(options.c >= options.w):
        print '-w must be greater than -c'
        sys.exit(1)

    dfs_free_percent = get_dfs_free_percent()

    if(dfs_free_percent <= options.c ) :
        print 'CRITICAL - DFS free %d%%' %(dfs_free_percent)
        sys.exit(2)
    elif(dfs_free_percent <= options.w):
        print 'WARNING - DFS free %d%%' %(dfs_free_percent)
        sys.exit(1)
    else:
        print 'OK - DFS free %d%%' %(dfs_free_percent)
        sys.exit(0)

If a script errors out, open a Python shell and debug the regex against the actual html.
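A debugging session might look like this (Python 2, to match the scripts; namenode is the placeholder host used above):

$ python
>>> import urllib, re
>>> html = urllib.urlopen("http://namenode:50070/dfshealth.jsp").read()
>>> re.findall('Live Nodes.*', html)    # inspect the surrounding markup
>>> # then tighten the pattern until it captures exactly the number you want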

Copy these 2 scripts to /usr/local/nagios/libexec/ (that is where $USER1$ in commands.cfg points).

Try running each script directly from the command line first, e.g. ./check_hadoop_datanode.py. If it reports this error:

: No such file or directory

open the file in vim, run :set ff=unix in command mode, and save; the script was saved with DOS line endings.
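Equivalently, strip the carriage returns from the shell; either tool below works:

$ sed -i 's/\r$//' check_hadoop_datanode.py check_hadoop_dfs.py
# or: dos2unix check_hadoop_datanode.py check_hadoop_dfs.py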

3. Updating the nagios configuration

Add the following 2 commands to commands.cfg:

$ vim /usr/local/nagios/etc/objects/commands.cfg
define command{
        command_name    check_datanode
        command_line    $USER1$/check_hadoop_datanode.py -w $ARG1$ -c $ARG2$
        }

define command{
        command_name    check_dfs
        command_line    $USER1$/check_hadoop_dfs.py -w $ARG1$ -c $ARG2$
        }

Edit service1.cfg, adding the following 2 services:

$ vim service1.cfg 
define service{
        use     ganglia-service1
        service_description     Live DataNodes
        check_command   check_datanode!16!15
}


define service{
        use     ganglia-service1
        service_description     DFS Free Space
        check_command   check_dfs!30!20
}

Done.

 

5. Problem log

5.1 A ganglia metric that would not change

Problem: to test nagios alerting, I killed the datanode on one node, yet nagios kept showing that datanode as healthy. Since nagios gets these metrics from ganglia, I checked ganglia, and it showed the node as normal too. Odd: why does a killed datanode appear to keep sending heartbeats?

Resolution: none found yet; if you know the cause, please share. As a workaround, nagios monitors the processes with the scripts above instead.
