Flume Official Documentation Translation: Flume 1.7.0 User Guide (unreleased version), Part 1

Flume 1.7.0 User Guide

 

 

Introduction

Overview

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.

The use of Apache Flume is not restricted to log data aggregation. Since data sources are customizable, Flume can be used to transport massive quantities of event data, including but not limited to network traffic data, social-media-generated data, email messages and almost any other data source.

Apache Flume is a top-level project at the Apache Software Foundation.

There are currently two release code lines available, versions 0.9.x and 1.x.

Documentation for the 0.9.x track is available at the Flume 0.9.x User Guide.

This documentation applies to the 1.4.x track.

New and existing users are encouraged to use the 1.x releases so as to leverage the performance improvements and configuration flexibility available in the latest architecture.

System Requirements

    1. Java Runtime Environment - Java 1.7 or later
    2. Memory - Sufficient memory for configurations used by sources, channels or sinks
    3. Disk Space - Sufficient disk space for configurations used by channels or sinks
    4. Directory Permissions - Read/Write permissions for directories used by agent

Architecture

Data flow model

A Flume event is defined as a unit of data flow having a byte payload and an optional set of string attributes. A Flume agent is a (JVM) process that hosts the components through which events flow from an external source to the next destination (hop).

 

A Flume source consumes events delivered to it by an external source like a web server. The external source sends events to Flume in a format that is recognized by the target Flume source. For example, an Avro Flume source can be used to receive Avro events from Avro clients or from other Flume agents in the flow that send events from an Avro sink. A similar flow can be defined using a Thrift Flume Source to receive events from a Thrift Sink, a Flume Thrift Rpc Client, or Thrift clients written in any language generated from the Flume thrift protocol.

When a Flume source receives an event, it stores it into one or more channels. The channel is a passive store that keeps the event until it’s consumed by a Flume sink. The file channel is one example; it is backed by the local filesystem. The sink removes the event from the channel and puts it into an external repository like HDFS (via Flume HDFS sink) or forwards it to the Flume source of the next Flume agent (next hop) in the flow. The source and sink within the given agent run asynchronously with the events staged in the channel.
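To make "stores it into one or more channels" concrete, here is a minimal sketch of one source feeding two channels, each drained by its own sink. The agent and component names (a1, r1, c1, c2, k1, k2) and the HDFS path are assumptions chosen for illustration; the sketch relies on the source's default replicating channel selector, which copies every event to all listed channels.

```properties
# Hypothetical agent "a1": one source, two channels, two sinks
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2

# The source writes each event to both channels (default replicating selector)
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1 c2

# c1 buffers in memory, c2 is backed by the local filesystem
a1.channels.c1.type = memory
a1.channels.c2.type = file

# Each sink drains exactly one channel
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
a1.sinks.k2.type = hdfs
# Assumed path, replace with a real HDFS location
a1.sinks.k2.hdfs.path = hdfs://namenode/flume/events
a1.sinks.k2.channel = c2
```

Because the source and the sinks only meet through the channels, a slow HDFS sink in this sketch would fill c2 without blocking the logger sink on c1.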

Complex flows

Flume allows a user to build multi-hop flows where events travel through multiple agents before reaching the final destination. It also allows fan-in and fan-out flows, contextual routing and backup routes (fail-over) for failed hops.
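A multi-hop flow can be sketched by pointing the Avro sink of one agent at the Avro source of the next. In the fragment below the agent names (a1, a2), the host collector.example.com and the port 4545 are assumptions for illustration only; both agents are shown in one file for brevity, although in practice each would usually run on its own machine.

```properties
# First hop (hypothetical agent "a1"): netcat source -> memory channel -> Avro sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.sinks.k1.type = avro
# Assumed address of the machine running agent a2
a1.sinks.k1.hostname = collector.example.com
a1.sinks.k1.port = 4545
a1.sinks.k1.channel = c1

# Second hop (hypothetical agent "a2"): Avro source -> file channel -> logger sink
a2.sources = r1
a2.channels = c1
a2.sinks = k1
a2.sources.r1.type = avro
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4545
a2.sources.r1.channels = c1
a2.channels.c1.type = file
a2.sinks.k1.type = logger
a2.sinks.k1.channel = c1
```

The only coupling between the two hops is that the sink's hostname and port match the address the next agent's source listens on.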

Reliability

The events are staged in a channel on each agent. The events are then delivered to the next agent or terminal repository (like HDFS) in the flow. The events are removed from a channel only after they are stored in the channel of the next agent or in the terminal repository. This is how the single-hop message delivery semantics in Flume provide end-to-end reliability of the flow.

Flume uses a transactional approach to guarantee the reliable delivery of the events. The sources and sinks encapsulate the storage and retrieval, respectively, of events in a transaction provided by the channel. This ensures that the set of events is reliably passed from point to point in the flow. In the case of a multi-hop flow, the sink from the previous hop and the source from the next hop both have their transactions running to ensure that the data is safely stored in the channel of the next hop.

Recoverability

The events are staged in the channel, which manages recovery from failure. Flume supports a durable file channel which is backed by the local file system. There’s also a memory channel which simply stores the events in an in-memory queue. It is faster, but any events still left in the memory channel when an agent process dies can’t be recovered.
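The trade-off between the two channel types shows up directly in the channel configuration. The fragment below is a minimal sketch; the agent name a1, the channel names and the directories under /var/flume are assumptions, and only one of the two channels would normally be wired into a given flow.

```properties
# Hypothetical agent "a1" declaring both channel types side by side
a1.channels = durable fast

# Durable: events are written to the local filesystem and survive an agent restart
a1.channels.durable.type = file
# Assumed directories, the agent needs read/write permission on them
a1.channels.durable.checkpointDir = /var/flume/checkpoint
a1.channels.durable.dataDirs = /var/flume/data

# Fast: events live in an in-memory queue and are lost if the agent process dies
a1.channels.fast.type = memory
a1.channels.fast.capacity = 1000
```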

Setup

Setting up an agent

Flume agent configuration is stored in a local configuration file. This is a text file that follows the Java properties file format. Configurations for one or more agents can be specified in the same configuration file. The configuration file includes properties of each source, sink and channel in an agent and how they are wired together to form data flows.

Configuring individual components

Each component (source, sink or channel) in the flow has a name, a type, and a set of properties that are specific to the type and instantiation. For example, an Avro source needs a hostname (or IP address) and a port number to receive data from. A memory channel can have a max queue size ("capacity"), and an HDFS sink needs to know the file system URI, the path to create files, the frequency of file rotation ("hdfs.rollInterval"), etc. All such attributes of a component need to be set in the properties file of the hosting Flume agent.
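As a rough illustration of the per-component properties just mentioned, the fragment below sets them for an Avro source, a memory channel and an HDFS sink. The agent and component names (a1, r1, c1, k1), the port and the HDFS path are assumptions; the listing of component names and the wiring between them is left out here.

```properties
# Avro source: address and port to listen on
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

# Memory channel: maximum number of events held in the queue
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# HDFS sink: target path (including the file system URI) and rotation interval in seconds
a1.sinks.k1.type = hdfs
# Assumed path, replace with a real HDFS location
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/webdata
a1.sinks.k1.hdfs.rollInterval = 30
```

Every key follows the same pattern: agent name, component kind (sources, channels or sinks), component name, then the property.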

Wiring the pieces together

The agent needs to know which individual components to load and how they are connected in order to constitute the flow. This is done by listing the names of each of the sources, sinks and channels in the agent, and then specifying the connecting channel for each sink and source. For example, an agent flows events from an Avro source called avroWeb to an HDFS sink called hdfs-cluster1 via a file channel called file-channel. The configuration file will contain the names of these components, with file-channel as a shared channel for both the avroWeb source and the hdfs-cluster1 sink.
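The avroWeb example described above could be wired roughly as follows. The agent name agent_foo is an assumption; the component names are the ones used in the text, and the per-component properties (type, port, path and so on) are omitted to keep the focus on the wiring.

```properties
# List the components of the (hypothetically named) agent "agent_foo"
agent_foo.sources = avroWeb
agent_foo.sinks = hdfs-cluster1
agent_foo.channels = file-channel

# A source may write to several channels, so the key is "channels" (plural)
agent_foo.sources.avroWeb.channels = file-channel

# A sink reads from exactly one channel, so the key is "channel" (singular)
agent_foo.sinks.hdfs-cluster1.channel = file-channel
```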

Starting an agent

An agent is started using a shell script called flume-ng, which is located in the bin directory of the Flume distribution. You need to specify the agent name, the config directory, and the config file on the command line:

$ bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template

 

Now the agent will start running the sources and sinks configured in the given properties file.

A simple example

Here, we give an example configuration file, describing a single-node Flume deployment. This configuration lets a user generate events and subsequently logs them to the console.

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

 

This configuration defines a single agent named a1. a1 has a source that listens for data on port 44444, a channel that buffers event data in memory, and a sink that logs event data to the console. The configuration file names the various components, then describes their types and configuration parameters. A given configuration file might define several named agents; when a given Flume process is launched, a flag is passed telling it which named agent to run.

Given this configuration file, we can start Flume as follows:

$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console

 

Note that in a full deployment we would typically include one more option: --conf=<conf-dir>. The <conf-dir> directory would include a shell script flume-env.sh and potentially a log4j properties file. In this example, we pass a Java option to force Flume to log to the console and we go without a custom environment script.

From a separate terminal, we can then telnet to port 44444 and send Flume an event:

$ telnet localhost 44444
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
Hello world! <ENTER>
OK

 

The original Flume terminal will output the event in a log message:

12/06/19 15:32:19 INFO source.NetcatSource: Source starting
12/06/19 15:32:19 INFO source.NetcatSource: Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:44444]
12/06/19 15:32:34 INFO sink.LoggerSink: Event: { headers:{} body: 48 65 6C 6C 6F 20 77 6F 72 6C 64 21 0D          Hello world!. }

 

Congratulations - you’ve successfully configured and deployed a Flume agent! Subsequent sections cover agent configuration in much more detail.
