Flume: A Distributed Log Collection System

Date: 2019-12-08


Flume

1. Preface

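Flume is a distributed log collection system originally developed by Cloudera. It was donated to the Apache Software Foundation in 2011 and is one of the Hadoop-related components. In recent years Flume has been steadily refined and upgraded, most notably with Flume NG. Its built-in components have also grown richer, which has made it much more convenient to develop with, and it is now an Apache top-level project.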

2. Overview

2.1. What is Flume?
http://flume.apache.org/index.html

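Apache Flume is a tool/service, or data collection mechanism, that gathers data such as logs and events from many sources and moves these large volumes of data into centralized storage. Flume is highly available, distributed, and driven by configuration. Its design is based on streaming data, such as log data, from many web servers into a centralized store such as HDFS or HBase. Its structure is shown in the figure below: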

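2.2. Flume features

Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transporting massive volumes of log data.
Flume can collect source data in many forms, including files, directories, socket packets, and Kafka, and can write the collected data (through a sink) to many external storage systems such as HDFS, HBase, Hive, and Kafka.
Ordinary collection requirements can be met with simple Flume configuration alone.
Flume also offers good custom extension points for special scenarios, so it fits most everyday data collection use cases.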

3. How Flume works

3.1. Flume components in detail

Each agent is an independent daemon process (a JVM) that receives data from clients. The figure below shows Flume's basic model:

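1. The core role in a distributed Flume system is the agent. A Flume collection system is formed by connecting agents to one another.

2. Each agent acts as a courier for data (wrapped in Event objects) and contains three components:

a) Source: the collection component. It connects to the data source and fetches the data.

b) Sink: the delivery component. It passes data on to the next agent in the chain or to the final storage system.

c) Channel: the transport component. It carries data from the source to the sink.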
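The relationship between the three components can be sketched in a few lines of Python. This is only an illustration of the data flow, not Flume's implementation: the `Event`, `MemoryChannel`, `run_source`, and `run_sink` names are hypothetical, and the channel here simply rejects events when it is full.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """Hypothetical stand-in for a Flume event: a byte body plus string headers."""
    body: bytes
    headers: dict = field(default_factory=dict)


class MemoryChannel:
    """Bounded FIFO buffer that sits between a source and a sink (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._queue = deque()

    def put(self, event):
        # Refuse the event when the buffer is full instead of growing without bound.
        if len(self._queue) >= self.capacity:
            return False
        self._queue.append(event)
        return True

    def take(self):
        # Hand back the oldest event, or None when the channel is empty.
        return self._queue.popleft() if self._queue else None


def run_source(lines, channel):
    """Source side: wrap each input line in an Event and push it into the channel."""
    accepted = 0
    for line in lines:
        if channel.put(Event(body=line.encode("utf-8"))):
            accepted += 1
    return accepted


def run_sink(channel):
    """Sink side: drain the channel and return the delivered bodies in order."""
    delivered = []
    while (event := channel.take()) is not None:
        delivered.append(event.body.decode("utf-8"))
    return delivered


channel = MemoryChannel(capacity=2)
accepted = run_source(["login ok", "login failed", "logout"], channel)
delivered = run_sink(channel)
print(accepted, delivered)
```

The channel decouples the two sides: the source only needs to know how to put events in, and the sink only needs to know how to take them out.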

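First, consider how the official Flume site defines an Event: a unit of data flow that has a byte payload and an optional set of string attributes.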

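A line of text is deserialized into one event. (Serialization converts an object's state into a format that can be stored or transmitted; deserialization is the reverse and converts a stream back into an object. Together the two make it easy to store and transport data.) With the default line deserializer, a single event holds at most 2048 characters of a line. A longer line is cut at that point and the remainder is placed in the next event. The default encoding is UTF-8.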
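The splitting rule described above can be sketched as follows. The function name `split_line` is hypothetical, and the sketch counts Python characters rather than reproducing Flume's own deserializer.

```python
def split_line(line, max_len=2048):
    """Cut one line into consecutive chunks of at most max_len characters.

    Each chunk stands for the body of one event: the first chunk is the
    truncated line, and the remainder spills into the following chunks.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [line[i:i + max_len] for i in range(0, len(line), max_len)]


chunks = split_line("a" * 5000)
print([len(chunk) for chunk in chunks])
```

A consumer that needs whole lines therefore has to keep lines under the limit or raise the limit, because nothing in the later events marks them as continuations.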

3.2. Flume collection topologies

3.2.1. Simple topology

A single agent collects the data.

3.2.2. Complex topology

Several tiers of agents are chained together.

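4. Flume in practice

4.1. Installing and deploying Flume

1. Installing Flume is very simple: just unpack the archive. This assumes a Hadoop environment is already in place.

Upload the installation package to the node where the data source lives.

Then unpack it: tar -zxvf apache-flume-1.6.0-bin.tar.gz

Then enter the Flume directory, edit flume-env.sh under conf, and set JAVA_HOME in it.

2. Describe the collection plan in a configuration file according to the collection requirements (the file name is arbitrary).

3. Start a Flume agent on the appropriate node, pointing it at that configuration file.

Start with the simplest possible example to check that the environment works.

1. Create a new configuration file (the collection plan) in Flume's conf directory: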

vi netcat-logger.properties

# Name the components of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe and configure the source: r1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe and configure the sink: k1
a1.sinks.k1.type = logger

# Describe and configure the channel; this one buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Wire the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

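The last two lines of the configuration are the easiest to get wrong, because a source or sink that points at an undeclared channel leaves the agent without a working pipeline. As a rough sketch, the wiring can be checked before starting the agent. The helpers `parse_properties` and `check_wiring` are hypothetical and only understand the simple key = value lines used in this article; they are not part of Flume.

```python
def parse_properties(text):
    """Parse key = value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_wiring(props, agent):
    """Return a list of wiring problems for the named agent (empty if none)."""
    declared = set(props.get(f"{agent}.channels", "").split())
    problems = []
    for source in props.get(f"{agent}.sources", "").split():
        channels = props.get(f"{agent}.sources.{source}.channels", "").split()
        if not channels:
            problems.append(f"source {source} has no channels")
        problems += [f"source {source} uses undeclared channel {name}"
                     for name in channels if name not in declared]
    for sink in props.get(f"{agent}.sinks", "").split():
        channel = props.get(f"{agent}.sinks.{sink}.channel")
        if channel is None:
            problems.append(f"sink {sink} has no channel")
        elif channel not in declared:
            problems.append(f"sink {sink} uses undeclared channel {channel}")
    return problems


CONFIG = """
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
"""

problems = check_wiring(parse_properties(CONFIG), "a1")
print(problems or "wiring looks consistent")
```

Note the asymmetry the check relies on: a source lists its channels in the plural (`channels`), while a sink reads from exactly one (`channel`).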

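2. Start the agent to collect data:

bin/flume-ng agent -c conf -f conf/netcat-logger.properties -n a1 -Dflume.root.logger=INFO,console

-c conf specifies the directory that holds Flume's own configuration files

-f conf/netcat-logger.properties specifies the collection plan we described

-n a1 specifies the name of this agent

3. Test

First send data to the port that the agent's source listens on, so the agent has something to collect.

From any machine that can reach the agent node over the network, run:

telnet agent-hostname port (for example, telnet localhost 44444)

Because the source above binds to localhost, only connections from the agent's own machine are accepted as configured. Bind the source to the host's address to test from another machine.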

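4.2. Collection examples

4.2.1. Collecting a directory into HDFS

Structure diagram:

Requirement: new files keep appearing in a particular directory on a server, and every new file must be collected into HDFS.

Based on the requirement, first define the following three elements:

• Source component: monitors a file directory: spooldir

spooldir characteristics:

1. It watches a directory and collects the contents of any new file that appears there.

2. Once a file has been fully collected, the agent renames it by appending the suffix .COMPLETED.

3. File names must not be reused in the watched directory.

• Sink component: the HDFS file system: hdfs sink

• Channel component: either a file channel or a memory channel can be used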
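The spooldir rules above put two obligations on whatever produces the files: a file should be complete before it appears in the watched directory, and its name must never repeat. One way to meet both is sketched below, under the assumption that a staging directory on the same filesystem is available. The function `drop_into_spool` and the directory names are hypothetical.

```python
import os
import tempfile
import uuid


def drop_into_spool(staging_dir, spool_dir, base_name, data):
    """Write data in a staging directory, then move it into the spool
    directory under a name that will not repeat."""
    unique_name = f"{base_name}.{uuid.uuid4().hex}"
    staged = os.path.join(staging_dir, unique_name)
    with open(staged, "wb") as handle:
        handle.write(data)
    final = os.path.join(spool_dir, unique_name)
    # A rename within one filesystem is atomic, so the watcher never
    # observes a half-written file.
    os.rename(staged, final)
    return final


with tempfile.TemporaryDirectory() as root:
    staging = os.path.join(root, "staging")
    spool = os.path.join(root, "spool")
    os.makedirs(staging)
    os.makedirs(spool)
    first = drop_into_spool(staging, spool, "access_log", b"GET /index.html\n")
    second = drop_into_spool(staging, spool, "access_log", b"GET /about.html\n")
    spooled = sorted(os.listdir(spool))
    leftover = os.listdir(staging)
    print(len(spooled), leftover)
```

The random suffix is just one option; a timestamp or sequence number works as well, as long as two files can never end up with the same name.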

Configuration file:

# Name the three components
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

# Configure the source
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /home/hadoop/logs/
agent1.sources.source1.fileHeader = false

# Configure the interceptor
agent1.sources.source1.interceptors = i1
agent1.sources.source1.interceptors.i1.type = host
agent1.sources.source1.interceptors.i1.hostHeader = hostname

# Configure the sink
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path =hdfs://hdp-node-01:9000/weblog/flume-collection/%y-%m-%d/%H-%M
agent1.sinks.sink1.hdfs.filePrefix = access_log
agent1.sinks.sink1.hdfs.maxOpenFiles = 5000
agent1.sinks.sink1.hdfs.batchSize= 100
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat =Text
agent1.sinks.sink1.hdfs.rollSize = 102400
agent1.sinks.sink1.hdfs.rollCount = 1000000
agent1.sinks.sink1.hdfs.rollInterval = 60
#agent1.sinks.sink1.hdfs.round = true
#agent1.sinks.sink1.hdfs.roundValue = 10
#agent1.sinks.sink1.hdfs.roundUnit = minute
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true
# Use a channel which buffers events in memory
agent1.channels.channel1.type = memory
agent1.channels.channel1.keep-alive = 120
agent1.channels.channel1.capacity = 500000
agent1.channels.channel1.transactionCapacity = 600

# Bind the source and sink to the channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1

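The three roll settings in the sink configuration (rollSize, rollCount, rollInterval) decide when the sink closes the current HDFS file and starts a new one. The decision can be sketched as below, assuming that a file is rolled as soon as any one threshold is reached and that a value of 0 disables that threshold. The function `should_roll` is hypothetical and simplifies the timing, which Flume handles with a timer rather than a check per event.

```python
def should_roll(bytes_written, events_written, seconds_open,
                roll_size=102400, roll_count=1000000, roll_interval=60):
    """Return True when any enabled threshold has been reached.

    Defaults mirror the configuration above; a threshold of 0 is ignored.
    """
    return ((roll_size > 0 and bytes_written >= roll_size)
            or (roll_count > 0 and events_written >= roll_count)
            or (roll_interval > 0 and seconds_open >= roll_interval))


# A file that is 500 bytes and 5 seconds old stays open; after 60 seconds it rolls.
print(should_roll(500, 10, 5), should_roll(500, 10, 60))
```

With the values used here, a quiet log produces one small file per minute, so on a low-traffic source it may be worth raising rollInterval to avoid filling HDFS with tiny files.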

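Channel parameters explained:

capacity: the maximum number of events the channel can hold

transactionCapacity: the maximum number of events taken from the source or delivered to the sink in one transaction

keep-alive: the timeout, in seconds, allowed for adding an event to the channel or removing one from it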
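These parameters constrain one another: a transaction cannot be larger than the channel itself, and a sink that takes more events per batch than one transaction allows will fail at run time. A small sanity check along those lines is sketched below. The function `check_channel_settings` is hypothetical, and the values come from the configuration in this section.

```python
def check_channel_settings(capacity, transaction_capacity, sink_batch_size):
    """Return a list of inconsistencies between channel and sink settings."""
    problems = []
    if transaction_capacity > capacity:
        problems.append("transactionCapacity exceeds capacity")
    if sink_batch_size > transaction_capacity:
        problems.append("sink batchSize exceeds transactionCapacity")
    return problems


# Values from the example: capacity 500000, transactionCapacity 600, hdfs.batchSize 100.
print(check_channel_settings(500000, 600, 100))
```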

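4.2.2. Collecting a file into HDFS

Requirement: a business system writes logs with log4j, for example, and the log keeps growing. Data appended to the log file must be collected into HDFS in real time.

Based on the requirement, first define the following three elements:

Source: monitors updates to the file contents: exec 'tail -F file'
Sink: the HDFS file system: hdfs sink
Channel between the source and the sink: either a file channel or a memory channel can be used

Configuration file: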

agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

# Describe/configure tail -F source1
agent1.sources.source1.type = exec
agent1.sources.source1.command = tail -F /home/hadoop/logs/access_log
agent1.sources.source1.channels = channel1

#configure host for source
agent1.sources.source1.interceptors = i1
agent1.sources.source1.interceptors.i1.type = host
agent1.sources.source1.interceptors.i1.hostHeader = hostname

# Describe sink1
agent1.sinks.sink1.type = hdfs
#a1.sinks.k1.channel = c1
agent1.sinks.sink1.hdfs.path =hdfs://hdp-node-01:9000/weblog/flume-collection/%y-%m-%d/%H-%M
agent1.sinks.sink1.hdfs.filePrefix = access_log
agent1.sinks.sink1.hdfs.maxOpenFiles = 5000
agent1.sinks.sink1.hdfs.batchSize= 100
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat =Text
agent1.sinks.sink1.hdfs.rollSize = 102400
agent1.sinks.sink1.hdfs.rollCount = 1000000
agent1.sinks.sink1.hdfs.rollInterval = 60
agent1.sinks.sink1.hdfs.round = true
agent1.sinks.sink1.hdfs.roundValue = 10
agent1.sinks.sink1.hdfs.roundUnit = minute
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true

# Use a channel which buffers events in memory
agent1.channels.channel1.type = memory
agent1.channels.channel1.keep-alive = 120
agent1.channels.channel1.capacity = 500000
agent1.channels.channel1.transactionCapacity = 600

# Bind the source and sink to the channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1

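The hdfs.path in this configuration contains time escapes (%y-%m-%d/%H-%M), and the round, roundValue, and roundUnit settings round the timestamp down before those escapes are filled in. The effect can be sketched as follows. The functions `round_down` and `bucket_path` are hypothetical, and Python's strftime is used here only because it happens to share these escape codes.

```python
from datetime import datetime


def round_down(ts, round_value, round_unit):
    """Round a timestamp down to a multiple of round_value in the given unit."""
    if round_unit == "second":
        return ts.replace(second=ts.second - ts.second % round_value, microsecond=0)
    if round_unit == "minute":
        return ts.replace(minute=ts.minute - ts.minute % round_value,
                          second=0, microsecond=0)
    if round_unit == "hour":
        return ts.replace(hour=ts.hour - ts.hour % round_value,
                          minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported unit: {round_unit}")


def bucket_path(template, ts, round_value=10, round_unit="minute"):
    """Fill the time escapes in template using the rounded timestamp."""
    return round_down(ts, round_value, round_unit).strftime(template)


TEMPLATE = "hdfs://hdp-node-01:9000/weblog/flume-collection/%y-%m-%d/%H-%M"
path = bucket_path(TEMPLATE, datetime(2019, 12, 8, 14, 37, 52))
print(path)
```

With roundValue = 10 and roundUnit = minute, every event from 14:30:00 through 14:39:59 lands in the same 14-30 directory, so the sink creates one directory per ten minutes instead of one per minute.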

3. Chaining two agents

4.3. More source and sink components

Flume supports many source and sink types. See the official documentation for the full manual:

http://flume.apache.org/FlumeUserGuide.html

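4.4. HA Flume configuration example

Having set up a single-node Flume NG, we now build a highly available Flume NG cluster. The architecture is shown below:

As the figure shows, Flume supports several storage back ends. Only HDFS and Kafka are listed here (for example, to store the most recent week of logs and to feed a real-time log stream to Spark Streaming).

4.4.1. Role assignment

The Flume agents and collectors are distributed as shown in the table below: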

Name	HOST	Role
Agent1	mini1	Web Server
Agent2	mini2	Web Server
Agent3	mini3	Web Server
Collector1	mini4	AgentMstr1
Collector2	mini5	AgentMstr2

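As the figure shows, data from Agent1, Agent2, and Agent3 flows into Collector1 and Collector2. Flume NG itself provides a failover mechanism that switches over and recovers automatically. In the figure, three log-producing servers sit in different data centers, and all of their logs must be collected into a single cluster for storage. Next we configure the Flume NG cluster.

4.4.2. Configuration

The single-node Flume setup above already covers the basic configuration. We only need to add two new configuration files, agent.properties and collector.properties, with the following contents:

1. Agent configuration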

vi conf/agent.properties

#agent1 name
agent1.channels = c1
agent1.sources = r1
agent1.sinks = k1 k2

#set group
agent1.sinkgroups = g1

#set channel
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100

agent1.sources.r1.channels = c1
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /root/log/test.log

agent1.sources.r1.interceptors = i1 i2
agent1.sources.r1.interceptors.i1.type = static
agent1.sources.r1.interceptors.i1.key = Type
agent1.sources.r1.interceptors.i1.value = LOGIN
agent1.sources.r1.interceptors.i2.type = timestamp

# set sink1
agent1.sinks.k1.channel = c1
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = mini2
agent1.sinks.k1.port = 52020

# set sink2
agent1.sinks.k2.channel = c1
agent1.sinks.k2.type = avro
agent1.sinks.k2.hostname = mini3
agent1.sinks.k2.port = 52020

#set sink group
agent1.sinkgroups.g1.sinks = k1 k2

#set failover
agent1.sinkgroups.g1.processor.type = failover
agent1.sinkgroups.g1.processor.priority.k1 = 10
agent1.sinkgroups.g1.processor.priority.k2 = 1
agent1.sinkgroups.g1.processor.maxpenalty = 10000

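The failover processor configured above always prefers the live sink with the highest priority, so k1 (priority 10) is used while it is healthy and k2 (priority 1) takes over only when k1 has failed. The selection rule can be sketched as below. The function `pick_sink` is hypothetical, and the sketch leaves out the back-off period that maxpenalty caps for a failed sink.

```python
def pick_sink(priorities, failed):
    """Return the live sink with the highest priority, or None if all have failed."""
    alive = {name: priority for name, priority in priorities.items()
             if name not in failed}
    if not alive:
        return None
    return max(alive, key=alive.get)


PRIORITIES = {"k1": 10, "k2": 1}

normal = pick_sink(PRIORITIES, failed=set())
during_outage = pick_sink(PRIORITIES, failed={"k1"})
after_recovery = pick_sink(PRIORITIES, failed=set())
print(normal, during_outage, after_recovery)
```

Because the rule depends only on priority and liveness, traffic returns to k1 as soon as it is considered healthy again, without any change to the agent configuration.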

Start command:

bin/flume-ng agent -n agent1 -c conf -f conf/agent.properties -Dflume.root.logger=DEBUG,console

2. Collector configuration

vi conf/collector.properties

#set Agent name
a1.sources = r1
a1.channels = c1
a1.sinks = k1

#set channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# other node,nna to nns
a1.sources.r1.type = avro
a1.sources.r1.bind = mini2
a1.sources.r1.port = 52020
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = Collector
a1.sources.r1.interceptors.i1.value = mini2
a1.sources.r1.channels = c1

#set sink to hdfs
a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.path=/home/hdfs/flume/logdfs
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=TEXT
a1.sinks.k1.hdfs.rollInterval=10
a1.sinks.k1.channel=c1
a1.sinks.k1.hdfs.filePrefix=%Y-%m-%d


On mini3, change the mini2 host names in the configuration above (a1.sources.r1.bind and the interceptor value) to mini3.

Start command:

bin/flume-ng agent -n a1 -c conf -f conf/collector.properties -Dflume.root.logger=DEBUG,console

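4.4.3. Failover test

Now we test the high availability (failover) of the Flume NG cluster. The scenario is as follows. We upload a file on the Agent1 node. Because Collector1 is configured with a higher priority than Collector2, Collector1 collects the data first and uploads it to the storage system. We then kill Collector1, and Collector2 takes over collecting and uploading the logs. After that we manually restore the Flume service on the Collector1 node and upload a file on Agent1 again, and Collector1 resumes its priority role in collection. Screenshots follow: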