Batch-adding monitoring alarms to AWS ELBv2 with the CLI

Requirements:

Configure monitoring and alarms for the ELBv2 load balancers in every AWS region. Monitoring uses AWS CloudWatch, and alarm notifications go through AWS SNS.

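  1. HealthyHostCount: checked once per minute. Alarm when the number of healthy hosts equals 0, that is, when the Maximum of the healthy host count is less than or equal to 0.
  2. UnHealthyHostCount: checked every five minutes. Alarm when the number of unhealthy hosts is greater than or equal to 1, that is, when the Minimum of the unhealthy host count is greater than or equal to 1.
  3. HTTP_5XX: collected once per minute with a 1-minute period. Alarm when 1 out of 1 datapoints breaches the threshold; a datapoint breaches when the 5xx count exceeds 10.
  4. HTTP_4XX: collected once per minute over a 5-minute window. Alarm when 3 out of 5 datapoints breach the threshold; a datapoint breaches when 4xx responses exceed 10%.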
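The four requirements can also be written down as data, which makes the evaluation window of each alarm easy to check. The following is a minimal sketch; the names `ALARM_SPECS` and `evaluation_window_seconds` are illustrative and do not appear in the script, while the values are the ones the script passes to `put-metric-alarm`.

```python
# Illustrative names only: ALARM_SPECS and evaluation_window_seconds are not part of the script.
ALARM_SPECS = {
    "HealthyHostCount": {
        "statistic": "Maximum", "period": 60, "threshold": 0,
        "evaluation_periods": 1, "datapoints_to_alarm": 1,
        "comparison": "LessThanOrEqualToThreshold",
    },
    "UnHealthyHostCount": {
        "statistic": "Minimum", "period": 300, "threshold": 1,
        "evaluation_periods": 1, "datapoints_to_alarm": 1,
        "comparison": "GreaterThanOrEqualToThreshold",
    },
    "HTTP_5XX": {
        "statistic": "Sum", "period": 60, "threshold": 10,
        "evaluation_periods": 1, "datapoints_to_alarm": 1,
        "comparison": "GreaterThanOrEqualToThreshold",
    },
    "HTTP_4XX": {
        "statistic": "Sum", "period": 60, "threshold": 10,
        "evaluation_periods": 5, "datapoints_to_alarm": 3,
        "comparison": "GreaterThanOrEqualToThreshold",
    },
}


def evaluation_window_seconds(spec):
    """Total look-back time: number of evaluated periods times the period length."""
    return spec["evaluation_periods"] * spec["period"]


for name, spec in ALARM_SPECS.items():
    print("%s: %d of %d datapoints over %d seconds" % (
        name, spec["datapoints_to_alarm"], spec["evaluation_periods"],
        evaluation_window_seconds(spec)))
```

For HTTP_4XX this gives five 60-second periods, so the alarm looks at a 300-second window and fires when three of the five datapoints breach.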

  5. Challenge:

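    Each region has multiple ELBv2 load balancers, and each load balancer has listeners on ports 80 and 443. All four alarms above must be configured for every listener, so the total workload is considerable.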
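To get a feel for the scale, the number of alarms can be estimated from the listener counts. This is only a back-of-the-envelope sketch: `count_alarms` is a hypothetical helper and the region with ten load balancers is a made-up example, not a figure from the real environment.

```python
ALARMS_PER_LISTENER = 4  # the four requirements listed above


def count_alarms(listeners_per_lb):
    """listeners_per_lb holds the number of listeners of each load balancer in one region."""
    return sum(listeners_per_lb) * ALARMS_PER_LISTENER


# Made-up example: 10 load balancers, each with listeners on ports 80 and 443
print(count_alarms([2] * 10))  # 10 * 2 * 4 = 80
```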

    Solution:

    Use the AWS CLI to create the alarms in bulk.

    The AWS CLI commands involved are:

    aws elbv2 describe-load-balancers    # list every load balancer in the current region
    aws elbv2 describe-listeners --load-balancer-arn XXXXXXX    # list the listeners of one load balancer
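Both commands print JSON by default, so their output can be parsed with the `json` module. The sketch below shows the fields the script relies on (`LoadBalancers`, `LoadBalancerArn`, `LoadBalancerName`, `Listeners`, `Port`, `DefaultActions`, `TargetGroupArn`). The sample documents, the account ID 123456789012, and the helper names are made up for illustration; real output contains many more fields.

```python
import json

# Made-up, heavily trimmed sample outputs
SAMPLE_LBS = '''
{"LoadBalancers": [
  {"LoadBalancerName": "my-load-balancer",
   "LoadBalancerArn": "arn:aws:elasticloadbalancing:us-west-2:123456789012:loadbalancer/app/my-load-balancer/50dc6c495c0c9188"}
]}
'''

SAMPLE_LISTENERS = '''
{"Listeners": [
  {"Port": 80,
   "DefaultActions": [{"Type": "forward",
     "TargetGroupArn": "arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067"}]},
  {"Port": 443,
   "DefaultActions": [{"Type": "forward",
     "TargetGroupArn": "arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067"}]}
]}
'''


def parse_load_balancers(text):
    """Map load balancer name -> load balancer ARN."""
    return {lb["LoadBalancerName"]: lb["LoadBalancerArn"]
            for lb in json.loads(text)["LoadBalancers"]}


def parse_listeners(text):
    """Map listener port -> target group ARN of the first default action."""
    result = {}
    for listener in json.loads(text)["Listeners"]:
        action = listener["DefaultActions"][0]
        # Actions that do not forward to a target group carry no TargetGroupArn; skip them.
        if "TargetGroupArn" in action:
            result[listener["Port"]] = action["TargetGroupArn"]
    return result


print(parse_load_balancers(SAMPLE_LBS))
print(parse_listeners(SAMPLE_LISTENERS))
```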

    What needs to be configured:

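    • Each region needs its own SNS configuration, and the script runs on a different ops machine in each region. For every run, log in to that region's ops machine and change the SNS topic ARN to the one for that region, as in the following code:
    if __name__ == '__main__':
        # 1. Get the ARNs of the ELBv2 load balancers
        print("1. Fetching the ARNs of all ELBv2 load balancers in the current region.")
        arndict = getAllArn()
        sns = "arn:aws:sns:ap-south-1:2617960:AWS_Alert_Ops"
        getSimpleArnAndTaggroup(arndict,sns)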
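    • Next, the ops machine that runs the script must be granted IAM permissions. Currently the ops machines in all regions have them.
    • After logging in to the ops machine, run: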
    aws configure
    # Press Enter to skip the key prompts and enter only the region of this ops machine.
    # The regions Meitu currently uses are:
    #     ap-south-1       Mumbai
    #     us-west-2        Oregon
    #     ap-southeast-1   Singapore
    #     ap-northeast-1   Tokyo
    • Finally, run the script: python aws_elb_add_alert.py
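Editing the SNS topic ARN by hand before every run is easy to get wrong. One possible alternative, sketched here under assumptions, is a lookup table keyed by region. The table `SNS_TOPIC_BY_REGION`, the helper `sns_for_region`, and the account ID 123456789012 are hypothetical placeholders; the region passed in could come from the machine's CLI configuration, for example from `aws configure get region`.

```python
# Hypothetical topic ARNs; replace the placeholder account ID with the real one.
SNS_TOPIC_BY_REGION = {
    "ap-south-1": "arn:aws:sns:ap-south-1:123456789012:AWS_Alert_Ops",
    "us-west-2": "arn:aws:sns:us-west-2:123456789012:AWS_Alert_Ops",
    "ap-southeast-1": "arn:aws:sns:ap-southeast-1:123456789012:AWS_Alert_Ops",
    "ap-northeast-1": "arn:aws:sns:ap-northeast-1:123456789012:AWS_Alert_Ops",
}


def sns_for_region(region):
    """Return the SNS topic ARN for a region, failing loudly for unknown regions."""
    try:
        return SNS_TOPIC_BY_REGION[region]
    except KeyError:
        raise ValueError("no SNS topic configured for region %r" % region)


print(sns_for_region("us-west-2"))
```

Failing on an unknown region is deliberate: an alarm that notifies a topic in the wrong region is harder to notice than a script that stops early.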

    The meaning of each parameter in the script's aws commands is as follows (see the AWS CLI documentation for more detail):

    def getHealthyHostCountComm(lbname,port,taggroup,lbarn,sns):
        # Adjacent string literals are concatenated, so each flag can carry its own comment.
        comm = (
            "%s cloudwatch put-metric-alarm "
            "--alarm-name 'AWS_ELB_%s_PORT_%s_HealthyHostCount' "  # name of the alarm
            "--alarm-description 'aws elb HealthyHostCount' "  # description of the alarm
            "--metric-name HealthyHostCount "  # metric to alarm on, e.g. HTTPCode_Target_4XX_Count, UnHealthyHostCount, HealthyHostCount
            "--namespace AWS/ApplicationELB "  # namespace of the metric, e.g. AWS/ApplicationELB, AWS/RDS, AWS/EC2
            "--statistic Maximum "  # statistic applied to the data collected in each period
            "--period 60 "  # length of each collection period, in seconds
            "--threshold 0 "  # alarm threshold
            "--evaluation-periods 1 "  # number of periods evaluated; total time = count * period
            "--datapoints-to-alarm 1 "  # number of breaching datapoints required to trigger the alarm
            "--comparison-operator LessThanOrEqualToThreshold "  # how the value is compared with the threshold: GreaterThanOrEqualToThreshold, GreaterThanThreshold, LessThanThreshold, LessThanOrEqualToThreshold
            "--treat-missing-data notBreaching "  # how missing datapoints are treated: missing, notBreaching, breaching, ignore
            "--alarm-actions '%s' "  # action taken when the alarm fires; here the ARN of the SNS topic
            "--dimensions 'Name=TargetGroup,Value=targetgroup/%s' 'Name=LoadBalancer,Value=app/%s'"
        ) % (Contants['AWSCLI'],lbname,port,sns,taggroup,lbarn)
        return comm
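The `--dimensions` values are not full ARNs. CloudWatch expects only the tail of each ARN, `app/<name>/<id>` for the load balancer and `targetgroup/<name>/<id>` for the target group. The sketch below shows one way to derive them; the helper names and the sample ARNs (with the placeholder account ID 123456789012) are illustrative and not taken from the script.

```python
def lb_dimension(lb_arn):
    """'arn:...:loadbalancer/app/name/id' -> 'app/name/id'"""
    return lb_arn.split(":loadbalancer/", 1)[1]


def target_group_dimension(tg_arn):
    """'arn:...:targetgroup/name/id' -> 'targetgroup/name/id'"""
    return "targetgroup/" + tg_arn.split(":targetgroup/", 1)[1]


def dimension_args(lb_arn, tg_arn):
    """Build the two values that follow --dimensions."""
    return [
        "Name=TargetGroup,Value=%s" % target_group_dimension(tg_arn),
        "Name=LoadBalancer,Value=%s" % lb_dimension(lb_arn),
    ]


# Made-up sample ARNs
LB_ARN = "arn:aws:elasticloadbalancing:us-west-2:123456789012:loadbalancer/app/my-load-balancer/50dc6c495c0c9188"
TG_ARN = "arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067"

for arg in dimension_args(LB_ARN, TG_ARN):
    print(arg)
```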

    Final code:

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    
    # @Version : 1.0
    # @Time : 2018/5/28 18:12
    # @Author : *************
    # @File : aws_elb_add_alert.py
    # @Function: batch-configure alarms for AWS ELB
    # @Note : every region has its own ops machine, so run this script separately in each region, or use ansible
    
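    import os,sys,re,json
    try:
        import commands  # Python 2 only
    except ImportError:
        import subprocess as commands  # Python 3: subprocess.getstatusoutput() has the same interface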
    
    Contants = {
        "AWSCLI":'/usr/bin/aws',
        "AWSREGION":['ap-south-1','us-west-2','ap-southeast-1','ap-northeast-1'] # Mumbai, Oregon, Singapore, Tokyo
    }
    
    # Dictionary that creates a nested dictionary on first access to a missing key
    class CreateDict(dict):
        def __getitem__(self, item):
            try:
                return dict.__getitem__(self, item)
            except KeyError:
                value = self[item] = type(self)()
                return value
    
    #########################################################################################################
    # Alarm configuration
    
    # HealthyHostCount: checked once per minute; alarm when the healthy host count equals 0, i.e. when its Maximum is <= 0
    def getHealthyHostCountComm(lbname,port,taggroup,lbarn,sns):
        comm = ''' %s cloudwatch put-metric-alarm \
        --alarm-name 'AWS_ELB_%s_PORT_%s_HealthyHostCount' \
        --alarm-description 'aws elb HealthyHostCount' \
        --metric-name HealthyHostCount \
        --namespace AWS/ApplicationELB \
        --statistic Maximum \
        --period 60 \
        --threshold 0 \
        --evaluation-periods 1 \
        --datapoints-to-alarm 1 \
        --comparison-operator LessThanOrEqualToThreshold \
        --treat-missing-data notBreaching \
        --alarm-actions '%s' \
        --dimensions 'Name=TargetGroup,Value=targetgroup/%s' 'Name=LoadBalancer,Value=app/%s' ''' %(Contants['AWSCLI'],lbname,port,sns,taggroup,lbarn)
        return comm
    # UnHealthyHostCount: checked every five minutes; alarm when the unhealthy host count is >= 1, i.e. when its Minimum is >= 1
    def getUnHealthyHostCountComm(lbname,port,taggroup,lbarn,sns):
        comm = ''' %s cloudwatch put-metric-alarm \
        --alarm-name 'AWS_ELB_%s_PORT_%s_UnHealthyHostCount' \
        --alarm-description 'aws elb UnHealthyHostCount' \
        --metric-name UnHealthyHostCount \
        --namespace AWS/ApplicationELB \
        --statistic Minimum \
        --period 300 \
        --threshold 1 \
        --evaluation-periods 1 \
        --datapoints-to-alarm 1 \
        --comparison-operator GreaterThanOrEqualToThreshold \
        --treat-missing-data notBreaching \
        --alarm-actions '%s' \
        --dimensions 'Name=TargetGroup,Value=targetgroup/%s' 'Name=LoadBalancer,Value=app/%s' ''' %(Contants['AWSCLI'],lbname,port,sns,taggroup,lbarn)
        return comm
    
    # HTTP_5XX: collected once per minute with a 1-minute period; alarm when 1 of 1 datapoints breaches the threshold of 10 5xx responses
    def getHTTP_5XXComm(lbname,port,taggroup,lbarn,sns):
        comm = ''' %s cloudwatch put-metric-alarm \
        --alarm-name 'AWS_ELB_%s_PORT_%s_HTTP_5XX' \
        --alarm-description 'aws elb http 5xx alert' \
        --metric-name HTTPCode_Target_5XX_Count \
        --namespace AWS/ApplicationELB \
        --statistic Sum \
        --period 60 \
        --threshold 10 \
        --comparison-operator GreaterThanOrEqualToThreshold \
        --treat-missing-data notBreaching \
        --evaluation-periods 1 \
        --datapoints-to-alarm 1 \
        --alarm-actions '%s' \
        --dimensions 'Name=LoadBalancer,Value=app/%s' '''%(Contants['AWSCLI'],lbname,port,sns,lbarn)
        return comm
    
    # HTTP_4XX: collected once per minute over a 5-minute window; alarm when 3 of 5 datapoints breach the threshold (intended: 4xx above 10%)
    # NOTE: HTTPCode_Target_4XX_Count is a count, not a percentage. --unit only filters datapoints by their unit,
    # so with --unit Percent this alarm may never receive data; a true 10% threshold needs the 4xx count divided by RequestCount.
    def getHTTP_4XXComm(lbname,port,taggroup,lbarn,sns):
        comm = ''' %s cloudwatch put-metric-alarm \
        --alarm-name 'AWS_ELB_%s_PORT_%s_HTTP_4XX' \
        --alarm-description 'aws elb http 4xx alert' \
        --metric-name HTTPCode_Target_4XX_Count \
        --namespace AWS/ApplicationELB \
        --statistic Sum \
        --period 60 \
        --threshold 10 \
        --comparison-operator GreaterThanOrEqualToThreshold \
        --treat-missing-data notBreaching \
        --evaluation-periods 5 \
        --datapoints-to-alarm 3 \
        --unit Percent \
        --alarm-actions '%s' \
        --dimensions 'Name=TargetGroup,Value=targetgroup/%s' 'Name=LoadBalancer,Value=app/%s' ''' %(Contants['AWSCLI'],lbname,port,sns,taggroup,lbarn)
        return comm
    
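    # Run a shell command and return its output
    def execCommand(comm):
        try:
            (exitstatus, outtext) = commands.getstatusoutput(comm)
            if exitstatus != 0:
                # Report failures instead of silently discarding them
                print("command failed with exit status %s: %s" % (exitstatus, outtext))
            return outtext
        except Exception as e:
            print(e)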
    # Collect basic information about every ELBv2 load balancer in the current region
    def getAllArn():
        comm1 = "%s elbv2 describe-load-balancers" % Contants['AWSCLI']
        # The CLI prints JSON; json.loads also handles true/false/null, which eval() cannot
        AllLb2Details = json.loads(execCommand(comm1))['LoadBalancers']
        arndict = CreateDict()
        for i in range(0,len(AllLb2Details)):
            lbarn = AllLb2Details[i]["LoadBalancerArn"]
            lbname = AllLb2Details[i]["LoadBalancerName"]
            arndict[lbname]["lbarn"] = lbarn
            comm2 = "%s elbv2 describe-listeners --load-balancer-arn %s" %(Contants['AWSCLI'],lbarn)
            alllisten = json.loads(execCommand(comm2))['Listeners']
            for j in range(0,len(alllisten)):
                taggroup = alllisten[j]['DefaultActions'][0]['TargetGroupArn']
                port = alllisten[j]['Port']
                arndict[lbname]["lbgroup"][port] = taggroup
        return json.dumps(arndict)
    # Shorten the ARNs to the form the alarm dimensions expect, then create the four alarms for every listener
    def getSimpleArnAndTaggroup(arndict,sns):
        # arndict is the JSON string returned by getAllArn(); items() works on Python 2 and 3
        for lbname,lbvalue in json.loads(arndict).items():
            lbarn = re.split(r'loadbalancer/app/',lbvalue['lbarn'])[-1]
            for port,taggroup in lbvalue['lbgroup'].items():
                print("######################################################")
                taggroup = re.split(r':targetgroup/',taggroup)[-1]
                print("#####Configuring HealthyHostCountAlert#####")
                comm1 = getHealthyHostCountComm(lbname,port,taggroup,lbarn,sns)
                print(comm1)
                execCommand(comm1)
                print("#####Configuring UnHealthyHostCountAlert#####")
                comm2 = getUnHealthyHostCountComm(lbname,port,taggroup,lbarn,sns)
                print(comm2)
                execCommand(comm2)
                print("#####Configuring HTTP_5XX#####")
                comm3 = getHTTP_5XXComm(lbname,port,taggroup,lbarn,sns)
                print(comm3)
                execCommand(comm3)
                print("#####Configuring HTTP_4XX#####")
                comm4 = getHTTP_4XXComm(lbname,port,taggroup,lbarn,sns)
                print(comm4)
                execCommand(comm4)
    
    if __name__ == '__main__':
        # 1. Get the ARNs of the ELBv2 load balancers
        print("1. Fetching the ARNs of all ELBv2 load balancers in the current region.")
        arndict = getAllArn()
        sns = "arn:aws:sns:us-west-2:217608:AWS_Alert_Ops"
        getSimpleArnAndTaggroup(arndict,sns)
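Requirement 4 asks for a 4xx rate above 10%, while `HTTPCode_Target_4XX_Count` is a plain count. One way to alarm on an actual percentage is CloudWatch metric math: divide the 4xx count by `RequestCount` and pass the expression through the `--metrics` option of `put-metric-alarm`. The following is a sketch under that assumption, not the author's method. The function name `build_4xx_rate_alarm`, the alarm name suffix `_Rate`, and the query IDs are made up for illustration. The command is built as an argument list, which could be handed to `subprocess` without shell quoting, and nothing is executed here.

```python
import json


def build_4xx_rate_alarm(lbname, port, lb_dim, tg_dim, sns, aws="/usr/bin/aws"):
    """Return the argv list of a put-metric-alarm call on 100 * 4xx / requests."""
    dims = [
        {"Name": "TargetGroup", "Value": tg_dim},
        {"Name": "LoadBalancer", "Value": lb_dim},
    ]

    def stat(metric_id, metric_name):
        # One input metric: per-minute Sum, used only inside the expression
        return {
            "Id": metric_id,
            "ReturnData": False,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": metric_name,
                    "Dimensions": dims,
                },
                "Period": 60,
                "Stat": "Sum",
            },
        }

    metrics = [
        # The only query with ReturnData true is the one the alarm evaluates
        {"Id": "e1", "Expression": "100 * m1 / m2",
         "Label": "4XX rate (percent)", "ReturnData": True},
        stat("m1", "HTTPCode_Target_4XX_Count"),
        stat("m2", "RequestCount"),
    ]
    return [
        aws, "cloudwatch", "put-metric-alarm",
        "--alarm-name", "AWS_ELB_%s_PORT_%s_HTTP_4XX_Rate" % (lbname, port),
        "--alarm-description", "aws elb http 4xx rate alert",
        "--metrics", json.dumps(metrics),
        "--threshold", "10",
        "--comparison-operator", "GreaterThanOrEqualToThreshold",
        "--evaluation-periods", "5",
        "--datapoints-to-alarm", "3",
        "--treat-missing-data", "notBreaching",
        "--alarm-actions", sns,
    ]


cmd = build_4xx_rate_alarm(
    "my-load-balancer", 443,
    "app/my-load-balancer/50dc6c495c0c9188",
    "targetgroup/my-targets/73e2d6bc24d8a067",
    "arn:aws:sns:us-west-2:123456789012:AWS_Alert_Ops",  # placeholder account ID
)
print(" ".join(cmd[:3]))
```

When `--metrics` is used, the single-metric options (`--metric-name`, `--namespace`, `--statistic`, `--period`, `--dimensions`, `--unit`) are left out, because each query carries its own metric definition.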