Building Professional Business Status Monitoring with Nagios

Date: 2019-11-24

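Most companies have probably deployed a monitoring system such as Zabbix to track server resource usage and the running state of each service. But is that kind of monitoring enough? Have you ever seen the monitoring system report everything as healthy while the project itself could not serve users? This article discusses how to use Nagios to monitor business status in a simple way.

In this article, "business" means the products users actually touch: the website pages they visit, the public API endpoints, mobile apps, and so on.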

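Thinking about monitoring

We usually deploy a monitoring system in the same data center as the project to watch our servers and shared services such as MySQL, define alerting policies, and get notified by email or SMS when something goes wrong so that we can handle it promptly.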

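This kind of monitoring focuses on two things:

Resource usage, such as load, memory, and disk space
Service state, such as Nginx status and MySQL replication status

It also has two main problems:

Business status is not monitored, so the current state of the business is not directly visible: the servers and services may all be healthy while the business itself is down
The monitoring server and the business servers sit in the same data center, so a fault in the monitoring network or congestion at the data center's ingress can keep alerts from reaching us. It also only sees what happens inside the data center; the path from users to the data center's ingress is not monitored

So how do we solve these two problems?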

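Business status monitoring should reflect, as directly as possible, whether the business is currently healthy or failing. How do we do that? Take a web project as an example. First, check the status returned by specific URLs: 200 OK, 404 Not Found, and so on. Second, consider whether the page content is correct. What the user finally sees is an HTML page assembled from static resources and backend API data, and verifying that the content is truly correct is hard. As a compromise, we treat the page as healthy if all static resources and backend APIs return a normal status, which is much easier to monitor.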
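As a minimal sketch of the first step (checking the status a URL returns), the shell function below maps an HTTP status code to a Nagios-style result. The function name and the choice to treat redirects as a warning are assumptions made for illustration; the exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn an HTTP status code into a Nagios plugin result.
# Exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
http_status_to_nagios() {
    local status="$1"
    case "$status" in
        2[0-9][0-9])    echo "OK - HTTP $status"; return 0 ;;
        3[0-9][0-9])    echo "WARNING - HTTP $status (redirect)"; return 1 ;;
        [45][0-9][0-9]) echo "CRITICAL - HTTP $status"; return 2 ;;
        000)            echo "CRITICAL - no HTTP response"; return 2 ;;
        *)              echo "UNKNOWN - unexpected status '$status'"; return 3 ;;
    esac
}

# Usage sketch (needs network access, so it is left commented out).
# curl prints 000 as the status code when it gets no response at all.
# status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "https://ops-coffee.cn/")
# http_status_to_nagios "$status"; exit $?
```

In practice the check_http plugin used later in this article already does this job; the sketch only makes the status-to-state mapping explicit.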

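Static resources can be served directly by nginx, which handles high concurrency well and is rarely the performance bottleneck; for static resources we can rely on ELK alongside this monitoring. Backend APIs perform far worse, so business status monitoring is mainly about the state of backend APIs. Do we need to monitor every API? That is tedious to implement and, in my view, unnecessary; monitoring a few representative endpoints is enough. For example, we ask developers to add a dedicated health check endpoint to every project. This endpoint connects to every service the project uses and performs a real operation: it queries MySQL to confirm that MySQL can serve requests, and runs get and set against Redis to confirm that Redis is working. Monitoring this one endpoint covers the services along the whole request path.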
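The aggregation logic behind such a health check endpoint can be sketched in shell as follows. This is not the author's implementation (the real endpoint lives in the backend application); the function name, the name=command probe format, and the probe commands in the usage note are all assumptions. The point is only that the overall result is healthy when every dependency probe succeeds.

```shell
#!/usr/bin/env bash
# Sketch of health check aggregation: each argument is "name=command".
# The check passes only if every probe command exits with status 0.
run_health_checks() {
    local failed=()
    local probe name cmd
    for probe in "$@"; do
        name="${probe%%=*}"   # text before the first '='
        cmd="${probe#*=}"     # text after the first '='
        if ! bash -c "$cmd" >/dev/null 2>&1; then
            failed+=("$name")
        fi
    done
    if [ "${#failed[@]}" -eq 0 ]; then
        echo "OK - all dependencies healthy"
        return 0
    fi
    echo "CRITICAL - failed: ${failed[*]}"
    return 2
}

# Usage sketch (hypothetical probes; adjust hosts and credentials):
# run_health_checks \
#     "mysql=mysqladmin --host=127.0.0.1 ping" \
#     "redis=redis-cli -h 127.0.0.1 ping"
```

A real endpoint would go further than a ping, as the article describes: run an actual query against MySQL and a set followed by a get against Redis.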

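For the problems caused by keeping the monitoring server and the business servers in the same data center (the second problem above), we can deploy an independent status monitor in a different network environment, for example Nagios in the office network. Monitoring from a different network is also closer to what users experience. This status monitor is distinct from the resource monitoring deployed in the data center: its main job is to monitor business status, that is, the URL and API status described above.

Could we save one monitoring system by putting the whole thing outside the data center, for example in the office or in another data center? That is not a good option. Cross-network monitoring performs poorly: latency between networks is much higher than within one data center, and the frequent data transfer for a large number of monitored items puts considerable pressure on bandwidth.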

Nagios monitoring

We use Nagios for business status monitoring. Nagios is simple to deploy and flexible to configure, which suits this scenario well.

System environment: Debian 8
Architecture: nginx + Nagios

Deploying Nagios

1. Install the base environment

# apt-get update
# apt-get install -y build-essential libgd2-xpm-dev autoconf gcc libc6 make wget
# apt-get install -y nginx php5-fpm spawn-fcgi fcgiwrap

2. Download, unpack, and build Nagios

# wget https://assets.nagios.com/downloads/nagioscore/releases/nagios-4.0.8.tar.gz
# tar -zxvf nagios-4.0.8.tar.gz 
# cd nagios-4.0.8

# ./configure && make all

# make install-groups-users
# usermod -a -G nagios www-data

# make install
# make install-init
# make install-config
# make install-commandmode
# cd ..

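Nagios can be started once the installation finishes, but the web page is not reachable yet, and the log reports the error (No output on stdout) stderr: execvp(/usr/local/nagios/libexec/check_ping, ...) failed. errno is 2: No such file or directory. This is because only Nagios Core is installed so far; the plugins that Core relies on to run checks still have to be installed.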

3. Install nagios-plugins

# wget https://nagios-plugins.org/download/nagios-plugins-2.2.1.tar.gz
# tar -zxvf nagios-plugins-2.2.1.tar.gz
# cd nagios-plugins-2.2.1
# ./configure
# make
# make install
# cd ..

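The plugins mainly add check scripts such as check_ping and check_http, installed by default under /usr/local/nagios/libexec/. We can use them to monitor HTTP endpoints, hosts, and service states.

4. Create the account and password for Nagios web access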

# vi /usr/local/bin/htpasswd.pl
#!/usr/bin/perl
 
use strict;
 
if ( @ARGV != 2 ){
    print "usage: /usr/local/bin/htpasswd.pl <username> <password>\n";
}
else {
    print $ARGV[0].":".crypt($ARGV[1],$ARGV[1])."\n";
}
# chmod +x /usr/local/bin/htpasswd.pl

# Use the Perl script to write the account and password hash to the htpasswd.users file
# /usr/local/bin/htpasswd.pl nagiosadmin nagios@ops-coffee > /usr/local/nagios/htpasswd.users

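Nagios enables account authentication by default; the related settings are in /usr/local/nagios/etc/cgi.cfg.
If httpd is installed, you can generate the password directly with the htpasswd command. We do not have httpd here, so we write a Perl script to generate it.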
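One caveat about the Perl script: it passes the password to crypt() as its own salt, and with traditional DES crypt the two salt characters are stored at the start of the hash, so the first two characters of the password end up readable in htpasswd.users. A possible alternative, assuming the openssl command-line tool is installed, is to let openssl generate an apr1 (MD5-based) hash with a random salt, a format nginx's auth_basic_user_file accepts. The function name and the placeholder password below are made up for illustration.

```shell
#!/usr/bin/env bash
# Build one htpasswd-style line ("user:hash") using an apr1 hash.
# openssl picks a random salt, so the hash differs on every run.
make_htpasswd_line() {
    local user="$1" password="$2"
    printf '%s:%s\n' "$user" "$(openssl passwd -apr1 "$password")"
}

# Usage sketch (placeholder password; run as a user who may write the file):
# make_htpasswd_line nagiosadmin 'change-me' > /usr/local/nagios/htpasswd.users
```

Like the Perl version, this passes the password on the command line, where other local users may see it in the process list, so it is best run on a machine you control.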

5. Add an nginx server block so that the browser can reach Nagios

server {
    listen       80;
    server_name  ngs.domain.com;

    access_log /var/log/nginx/nagios.access.log;
    error_log /var/log/nginx/nagios.error.log;

    auth_basic "Private";
    auth_basic_user_file /usr/local/nagios/htpasswd.users;

    root /usr/local/nagios/share;
    index index.php index.html;

    location / {
        try_files $uri $uri/ index.php /nagios;
    }

    location /nagios {
        alias /usr/local/nagios/share;
    }

    location ~ \.php$ {
        include /etc/nginx/fastcgi_params;
        fastcgi_pass unix:/var/run/php5-fpm.sock;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    }

    location ~ ^/nagios/(.*\.php)$ {
        alias /usr/local/nagios/share/$1;
        include /etc/nginx/fastcgi_params;
        fastcgi_pass unix:/var/run/php5-fpm.sock;
    }

    location ~ \.cgi$ {
        root /usr/local/nagios/sbin/;
        rewrite ^/nagios/cgi-bin/(.*)\.cgi /$1.cgi break;
        fastcgi_param AUTH_USER $remote_user;
        fastcgi_param REMOTE_USER $remote_user;
        include /etc/nginx/fastcgi_params;
        fastcgi_pass unix:/var/run/fcgiwrap.socket;
    }
}

6. Check the configuration and start the services

# Check the configuration files for syntax errors
# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg 

# Start the Nagios service
# /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg

# Start the fcgiwrap and php5-fpm services
# service fcgiwrap restart
# service php5-fpm restart

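7. Open the server's IP address or domain name in a browser to see the Nagios pages. Monitoring data for the local machine is present by default; if you do not need it, remove it from the localhost.cfg configuration file.

Configuring Nagios

The main Nagios configuration file is /usr/local/nagios/etc/nagios.cfg. It already lists the paths of several configuration files: every cfg_file= entry names a configuration file that Nagios reads. We can add a new file dedicated to monitoring HTTP APIs: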

cfg_file=/usr/local/nagios/etc/objects/check_api.cfg

The contents of check_api.cfg are as follows:

define service{
    use                     generic-service
    host_name               localhost
    service_description     web_project_01
    check_command           check_http!ops-coffee.cn -S
}

define service{
    use                     generic-service
    host_name               localhost
    service_description     web_project_02
    check_command           check_http!ops-coffee.cn -S -u / -e 200
}

define service{
    use                     generic-service
    host_name               localhost
    service_description     web_project_03
    check_command           check_http!ops-coffee.cn -S -u /action/health -k "sign:e5dhn"
}

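define service: defines a service; each page or API is one service
use: the template the service inherits from; templates are defined in /usr/local/nagios/etc/objects/templates.cfg
host_name: the host the service belongs to; distinguishing hosts matters little here, so everything is assigned to localhost
service_description: a description of the service; this value is shown in the Service column of the web page, so keep it short and meaningful
check_command: the command used to check the service; commands are defined in /usr/local/nagios/etc/objects/commands.cfg
When check_http checks an HTTPS endpoint, use the -S option. If it reports "SSL is not available", install the libssl-dev package first, then rebuild (./configure --with-openssl=/usr/bin/openssl) and reinstall nagios-plugins to add SSL support

Since check_command uses check_http, change the default check_http definition in commands.cfg to the following: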

define command {
    command_name    check_http
    command_line    $USER1$/check_http -H $ARG1$
}

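define command: defines a command
command_name: the name of the command, which host and service definitions can reference
command_line: the path of the command and how it is run. This check_http is the one installed by nagios-plugins under /usr/local/nagios/libexec/. Run check_http -h for detailed usage; it supports a wide range of options

use is set to generic-service. A service template can define many default settings, as follows: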
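To make the link between a service's check_command and the command definition concrete, the sketch below mimics how the text after the first "!" becomes $ARG1$ and how $USER1$ points at the plugin directory. It is an illustration only: the function name is made up, the template "$USER1$/check_http -H $ARG1$" is hard-coded, and real Nagios handles many more macros.

```shell
#!/usr/bin/env bash
# Illustration: build the command line Nagios would run for a service whose
# command is defined as "$USER1$/<command_name> -H $ARG1$".
expand_check_http() {
    local check_command="$1"                        # e.g. 'check_http!ops-coffee.cn -S'
    local user1="${2:-/usr/local/nagios/libexec}"   # value assumed for $USER1$
    local command_name="${check_command%%!*}"       # text before the first '!'
    local rest="${check_command#*!}"                # text after the first '!'
    local arg1="${rest%%!*}"                        # $ARG1$ ends at the next '!', if any
    echo "$user1/$command_name -H $arg1"
}

expand_check_http 'check_http!ops-coffee.cn -S -u / -e 200'
# prints: /usr/local/nagios/libexec/check_http -H ops-coffee.cn -S -u / -e 200
```

Running the printed command by hand on the monitoring host is a quick way to debug a service definition before reloading Nagios.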

define service {
    name                            generic-service         ; The 'name' of this service template
    active_checks_enabled           1                       ; Active service checks are enabled
    passive_checks_enabled          1                       ; Passive service checks are enabled/accepted
    parallelize_check               1                       ; Active service checks should be parallelized (disabling this can lead to major performance problems)
    obsess_over_service             1                       ; We should obsess over this service (if necessary)
    check_freshness                 0                       ; Default is to NOT check service 'freshness'
    notifications_enabled           1                       ; Service notifications are enabled
    event_handler_enabled           1                       ; Service event handler is enabled
    flap_detection_enabled          1                       ; Flap detection is enabled
    process_perf_data               1                       ; Process performance data
    retain_status_information       1                       ; Retain status information across program restarts
    retain_nonstatus_information    1                       ; Retain non-status information across program restarts
    is_volatile                     0                       ; The service is not volatile
    check_period                    24x7                    ; The service can be checked at any time of the day
    max_check_attempts              2                       ; Check the service up to 2 times in order to determine its final (hard) state
    check_interval                  1                       ; Check the service every minute under normal conditions
    retry_interval                  1                       ; Re-check the service every minute until a hard state can be determined
    contact_groups                  admins                  ; Notifications get sent out to everyone in the 'admins' group
    notification_options            w,u,c,r                 ; Send notifications about warning, unknown, critical, and recovery events
    notification_interval           60                      ; Re-notify about service problems every hour
    notification_period             24x7                    ; Notifications can be sent out at any time
    register                        0                       ; DON'T REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}

There are too many settings to explain one by one, and the English comments should make them understandable. A few important ones:

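max_check_attempts: how many checks it takes to settle the final state of a service. For example, with a value of 3, a service that goes down must fail three consecutive checks before Nagios treats it as really down and notifies us by email or SMS; the template above uses 2
check_interval: how often the service is checked while it is healthy; we set it to once a minute to get results sooner
retry_interval: how often the service is re-checked after its state changes, until the state is confirmed; we also set it to once a minute
contact_groups: the contact group that receives alerts when a fault needs to be reported; contact groups are defined in /usr/local/nagios/etc/objects/contacts.cfg

contact_groups is set to admins, so let's look at contacts.cfg next: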
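These three values together determine roughly how long a failure can go unnoticed. The sketch below works through the arithmetic under two assumptions: interval_length is left at its default of 60 seconds (so one interval unit is one minute), and check execution time and scheduling delays are ignored. The function name is made up for illustration.

```shell
#!/usr/bin/env bash
# Approximate worst-case minutes between a service failing and Nagios
# reaching a HARD problem state (the point at which it notifies).
minutes_to_hard_state() {
    local max_check_attempts="$1" check_interval="$2" retry_interval="$3"
    # Up to one check_interval passes before the first failing check;
    # each remaining attempt happens one retry_interval later.
    echo $(( check_interval + (max_check_attempts - 1) * retry_interval ))
}

minutes_to_hard_state 2 1 1    # values used in this article: prints 2
minutes_to_hard_state 3 10 2   # 3 attempts, 10-minute checks, 2-minute retries: prints 14
```

With the values used here, an outage should trigger a notification within about two minutes, which is why the intervals were shortened to one minute.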

define contact{
    contact_name                    sa       ; Short name of user
    use                             generic-contact     ; Inherit default values from generic-contact template (defined above)
    alias                           Nagios Admin        ; Full name of user

    service_notification_period     24x7
    host_notification_period        24x7
    service_notification_options    w,u,c,r
    host_notification_options       d,u,r
    host_notification_commands      notify-host-by-email,notify-host-by-sms
    service_notification_commands   notify-service-by-email,notify-service-by-sms

    email                           ops-coffee@domain.com
    pager                           15821212121,15822222222
}

define contactgroup{
    contactgroup_name       admins
    alias                   Nagios Administrators
    members                 sa
}

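contactgroup defines the contact group admins referenced above
The admins group has the contact sa as its member
The sa contact defines the notification commands for hosts and services, for example notify-host-by-email and notify-host-by-sms here, which send email and SMS. Like check_http, these commands are defined in /usr/local/nagios/etc/objects/commands.cfg

After all the configuration is done, restart the Nagios service and the monitoring should show up as working.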

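Nagstamon

Nagstamon is an excellent companion to Nagios. It is a desktop tool for Nagios (these days it also works with Zabbix and other systems). Once started it stays in the system tray, and when a monitored state changes it pops up promptly and plays a sound alert, so you learn about business status changes faster.

The configuration is as follows:

Update interval sets how often Nagstamon fetches the Nagios status; we set it to 1s
When an alert fires, the desktop turns red at once, which certainly gets your heart racing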

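Closing thoughts

Business status monitoring complements process-oriented monitoring such as Zabbix; it does not replace it. It is especially useful when process monitoring is not yet comprehensive: at present a considerable share of our alerts are first detected by this business status monitoring
We chose Nagios mainly because it is focused: it concentrates on status monitoring (plugins exist for recording historical data), and we were already familiar with it. Nagios configuration looks complicated, with several configuration files that reference one another, but once you sort out how the files relate, the configuration turns out to be sensible and simple
The more status monitoring nodes we deploy and the more regions they cover, the more accurately we see what users experience. But networks are complex, and we cannot deploy a monitoring system in every province, city, and network node. If necessary, consider a commercial monitoring service that offers monitoring from nodes around the world; the cost rises accordingly, so weigh the trade-offs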