Flink's Stream Processing API (Part 2)

Date: 2020-08-04

1、Environment

1，getExecutionEnvironment

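　　getExecutionEnvironment decides which execution environment to return based on how the program is being run. It is the most commonly used way to create an execution environment.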

2，createLocalEnvironment

　　Returns a local execution environment. The default parallelism can be specified in the call.

val env = StreamExecutionEnvironment.createLocalEnvironment(1) //parallelism

3，createRemoteEnvironment

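　　Returns a cluster execution environment and submits the Jar to a remote server. The call must specify the JobManager's IP and port, as well as the Jar to run on the cluster.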

//hostname, port, jarFiles (the streaming environment is used here to match the rest of this article)
val env = StreamExecutionEnvironment.createRemoteEnvironment(host, port, "/flink/wc.jar")

4，Maven dependencies

<dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-scala_2.11</artifactId>
        <version>1.7.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-scala_2.11</artifactId>
        <version>1.7.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-kafka-0.11_2.11</artifactId>
        <version>1.7.0</version>
    </dependency>
</dependencies>
<build>
    <plugins>
        <!-- This plugin compiles Scala code into class files -->
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.4.6</version>
            <executions>
                <execution>
                    <!-- Bind to Maven's compile phase -->
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.0.0</version>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

2、Source

1，Basic ways to read data

//read from a file
val fileDs = env.readTextFile("in/tbStock.txt")
//read from a socket port
val socketDs = env.socketTextStream("localhost",777)
//read from a collection
val collectDs = env.fromCollection(List("aaa","bbb","ccc","aaa"))

2，Kafka source

import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011

//Kafka consumer configuration
val properties = new Properties()
properties.setProperty("bootstrap.servers", "hadoop102:9092")
properties.setProperty("group.id", "consumer-group")
properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
properties.setProperty("auto.offset.reset", "latest")
//receive the data published to the Kafka topic "topic-demo"
val kafkaDataStream: DataStream[String] = env.addSource(new FlinkKafkaConsumer011[String]("topic-demo", new SimpleStringSchema(), properties))

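3，How Flink and Kafka achieve exactly-once

See: https://www.aboutyun.com/forum.php?mod=viewthread&tid=27395

　　Flink uses checkpoints to save state that records whether data has been fully processed.

　　The JobManager coordinates the TaskManagers to take checkpoints. Checkpoints are stored in a StateBackend. The default StateBackend is in-memory, and it can be switched to a file-based one for durable storage.

　　The process is in effect a two-phase commit: each operator performs a "pre-commit" when it finishes, and once the sink operation has completed, a "commit" is issued. If execution fails, the pre-commits are discarded.

　　After a crash, recovery goes through the StateBackend, and only operations that were fully committed can be restored.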
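The pre-commit / commit / abort cycle described above can be sketched in plain Scala. This is only a simplified in-memory model and not Flink's API (Flink's own abstraction for this pattern is TwoPhaseCommitSinkFunction). The names TwoPhaseSink, preCommit, commit, abort and the sample records are hypothetical.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical in-memory model of a two-phase-commit sink; not Flink's API.
class TwoPhaseSink {
  private val pending = ArrayBuffer.empty[String]   // pre-committed, not yet visible
  private val committed = ArrayBuffer.empty[String] // visible to downstream readers

  // Phase 1: stage a record inside the current transaction.
  def preCommit(record: String): Unit = pending += record

  // Phase 2: the checkpoint completed, so publish everything that was staged.
  def commit(): Unit = {
    committed ++= pending
    pending.clear()
  }

  // Failure before the checkpoint completed: throw the staged records away.
  def abort(): Unit = pending.clear()

  def visible: List[String] = committed.toList
}

object TwoPhaseCommitDemo {
  def run(): List[String] = {
    val sink = new TwoPhaseSink
    sink.preCommit("a")
    sink.preCommit("b")
    sink.commit()        // checkpoint 1 succeeded
    sink.preCommit("c")
    sink.abort()         // failure before checkpoint 2 completed
    sink.preCommit("c")  // the source replays "c" after recovery
    sink.commit()        // checkpoint 2 succeeded on the retry
    sink.visible
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Because the aborted attempt never became visible, the replayed record "c" appears exactly once in the committed output.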

4，Custom source

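import org.apache.flink.streaming.api.functions.source.SourceFunction
import scala.util.Random

env.addSource(new MySource)
//custom source
class MySource extends SourceFunction[(String,Double)] {
  //flag: whether the source is still running (cancel() is called from another thread)
  @volatile var running: Boolean = true
  override def cancel(): Unit = {
    running = false
  }
  override def run(ctx: SourceFunction.SourceContext[(String,Double)]): Unit = {
    //initialize a random number generator
    val rand = new Random()
    //generate one reading per item, once; the same readings are re-emitted on every loop
    val curTemp = 1.to(10).map(
      i => ("item_" + i, 65 + rand.nextGaussian() * 20)
    )
    while (running) {
      curTemp.foreach(
        t => ctx.collect(t)
      )
      Thread.sleep(5000)  //emit one batch every 5 seconds
    }
  }
}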

3、Transform

1，Basic transformation operators

//map
val streamMap = stream.map { x => x * 2 }
//flatmap
val streamFlatMap = stream.flatMap{
    x => x.split(" ")
}
//filter
val streamFilter = stream.filter{
    x => x == 1
}

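2，KeyBy and Reduce

　　keyBy (DataStream → KeyedStream): logically splits a stream into disjoint partitions, each containing the elements that share the same key. Internally it is implemented with hash partitioning. When the key is given by position, as in keyBy(0), the input must be a Tuple type.

　　reduce (KeyedStream → DataStream): an aggregation over a keyed stream that combines the current element with the result of the previous aggregation to produce a new value. The returned stream contains the result of every aggregation step, not only the final result of the last one.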

val keyedStream: KeyedStream[(String, Int), Tuple] = startUplogDstream.map(startuplog=>(startuplog.ch,1)).keyBy(0)
//reduce //sum
keyedStream.reduce{  (ch1,ch2)=>
  (ch1._1,ch1._2+ch2._2)
}.print()
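To make the "every aggregation step is emitted" behaviour concrete, here is a plain-Scala model of keyBy followed by reduce over a finite list. It does not use Flink. The function name rollingReduce and the sample channel names are hypothetical, and summing stands in for the reduce function.

```scala
object RollingReduceDemo {
  // Model of keyBy(0) + reduce(sum): `state` holds the latest aggregate per key,
  // and every incoming element emits the updated aggregate for its key.
  def rollingReduce(events: List[(String, Int)]): List[(String, Int)] = {
    val start = (Map.empty[String, Int], Vector.empty[(String, Int)])
    val (_, out) = events.foldLeft(start) {
      case ((state, emitted), (key, n)) =>
        val sum = state.getOrElse(key, 0) + n
        (state.updated(key, sum), emitted :+ (key -> sum))
    }
    out.toList
  }

  def main(args: Array[String]): Unit = {
    val events = List("appstore" -> 1, "huawei" -> 1, "appstore" -> 1)
    // prints List((appstore,1), (huawei,1), (appstore,2))
    println(rollingReduce(events))
  }
}
```

Three inputs produce three outputs: the second "appstore" element yields the running total 2 rather than replacing the earlier result.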

3，Split and Select

　　split (DataStream → SplitStream): splits one DataStream into two or more DataStreams according to some characteristic of the elements.

　　select (SplitStream → DataStream): obtains one or more DataStreams from a SplitStream.

//split by the Item's id
val splitStream:SplitStream[Item] = dStream.split {
  item =>
    List(item.id)
}
//get the elements tagged item_1
splitStream.select("item_1").print()

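4，Connect and CoMap

　　connect (DataStream, DataStream → ConnectedStreams): connects two data streams while preserving their types. After being connected, the two streams are merely placed in the same stream. Internally each keeps its own data and form unchanged, and the two streams remain independent of each other.

　　CoMap, CoFlatMap (ConnectedStreams → DataStream): applied to ConnectedStreams. They work like map and flatMap, applying a separate map or flatMap to each of the streams in the ConnectedStreams.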

val connStream: ConnectedStreams[StartUpLog, StartUpLog] = appStoreStream.connect(otherStream)
val allStream: DataStream[String] = connStream.map(
  (log1: StartUpLog) => log1.ch,
  (log2: StartUpLog) => log2.ch
)

5，Union

　　DataStream → DataStream: performs a union of two or more DataStreams, producing a new DataStream that contains all elements of all the input DataStreams. Note: if you union a DataStream with itself, every element appears twice in the new DataStream.

val unionStream: DataStream[StartUpLog] = appStoreStream.union(otherStream)
unionStream.print("union:::")

6，Differences between Connect and Union:

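　　1) The two streams must have the same type before a Union. With Connect they may differ, and the types can be reconciled afterwards in the coMap.

　　2) Connect can only operate on two streams, while Union can operate on more than two.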
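The type difference between the two operators can be modelled with ordinary lists. This sketch does not use Flink, and the helper names union and connectAndMap are hypothetical. Real streams also interleave elements in arrival order, which this model ignores by simply concatenating.

```scala
object ConnectVsUnionDemo {
  // union: both inputs must already share one element type A.
  def union[A](left: List[A], right: List[A]): List[A] = left ++ right

  // connect + CoMap: the inputs may have different types A and B;
  // one function per input maps them into a common result type R.
  def connectAndMap[A, B, R](left: List[A], right: List[B])(f1: A => R, f2: B => R): List[R] =
    left.map(f1) ++ right.map(f2)

  def main(args: Array[String]): Unit = {
    println(union(List(1, 2), List(3)))
    println(connectAndMap(List(1, 2), List("x"))(i => s"int:$i", s => s"str:$s"))
  }
}
```

The signature of connectAndMap takes exactly two inputs with independent type parameters, which mirrors why Connect is limited to two streams while Union, with a single shared type, extends naturally to many.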

4、Implementing UDF functions

1，Function classes

　　Flink exposes interfaces for all UDF functions (implemented as interfaces or abstract classes), for example MapFunction, FilterFunction, ProcessFunction, and so on.

val flinkTweets = tweets.filter(new FlinkFilter)
//custom filter class
class FlinkFilter extends FilterFunction[String] {
  override def filter(value: String): Boolean = {
    value.contains("flink")
  }
}

2，Anonymous functions (Lambda Functions)

val flinkTweets = tweets.filter(_.contains("flink"))

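3，Rich functions

　　A rich function is a function-class interface provided by the DataStream API, and every Flink function class has a Rich version. Unlike a regular function, it can access the runtime context and has lifecycle methods, so it can implement more complex functionality.

　　open() is the initialization method of a rich function. It is called before the operator's processing method, such as map or filter, is invoked.

　　close() is the last method called in the lifecycle and is used for cleanup work.

　　getRuntimeContext() provides information from the function's RuntimeContext, such as the parallelism the function runs with, the task name, and state.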

5、Sink

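　　Flink has no equivalent of Spark's foreach method for letting users iterate over the data. All output to external systems must go through a Sink. The final output of a job is done in a way similar to the following.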

1，Kafka

dstream.addSink(new FlinkKafkaProducer011[String]("linux01:9092","test", new SimpleStringSchema()))

2，Redis

<dependency>
    <groupId>org.apache.bahir</groupId>
    <artifactId>flink-connector-redis_2.11</artifactId>
    <version>1.0</version>
</dependency>

val config = new FlinkJedisPoolConfig.Builder().setHost("127.0.0.1").setPort(6379).build()
resultDStream.addSink(new RedisSink[Item](config,new MyRedisMapper))
//define the RedisMapper
class MyRedisMapper extends RedisMapper[Item] {
  override def getCommandDescription: RedisCommandDescription = {
    new RedisCommandDescription(RedisCommand.HSET,"item_test") //hkey
  }
  override def getKeyFromData(data: Item): String = data.id 
  override def getValueFromData(data: Item): String = data.toString
}

3，Elasticsearch

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-elasticsearch6_2.11</artifactId>
    <version>1.7.2</version>
</dependency>

//define the list of Elasticsearch hosts
val list = new util.ArrayList[HttpHost]()
list.add(new HttpHost("linux01", 9200))
//define the esBuilder
val esBuilder = new ElasticsearchSink.Builder[Item](list,new ElasticsearchSinkFunction[Item] {
  override def process(element: Item, ctx: RuntimeContext, indexer: RequestIndexer): Unit = {
    //define how the data is stored in Elasticsearch and the value to store
    val json = new util.HashMap[String, String]()
    json.put("data", element.toString)
    //define the target index, the type, and the source document
    val indexRequest = Requests.indexRequest().index("indexName").`type`("_doc").source(json)
    indexer.add(indexRequest)
  }
})
resultDStream.addSink(esBuilder.build())

4，Custom sink (JDBC)

<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.44</version>
</dependency>

import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{RichSinkFunction, SinkFunction}

resultDStream.addSink(new MyJDBCSink)
//custom JDBC sink (Sensor is expected to have id: String and temp: Double)
class MyJDBCSink extends RichSinkFunction[Sensor]{
  var conn: Connection = _
  var insertStmt: PreparedStatement = _
  var updateStmt: PreparedStatement = _
  //open: establish the connection and prepare the statements
  override def open(parameters: Configuration): Unit = {
    conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "root", "123456")
    insertStmt = conn.prepareStatement("INSERT INTO item_test (id, num) VALUES (?, ?)")
    updateStmt = conn.prepareStatement("UPDATE item_test SET num = ? WHERE id = ?")
  }
  //invoke: called for each record; try an update first, insert if no row was updated
  override def invoke(value: Sensor, context: SinkFunction.Context[_]): Unit = {
    updateStmt.setDouble(1, value.temp)
    updateStmt.setString(2, value.id)
    updateStmt.execute()
    if (updateStmt.getUpdateCount == 0) {
      insertStmt.setString(1, value.id)
      insertStmt.setDouble(2, value.temp)
      insertStmt.execute()
    }
  }
  //close: release resources
  override def close(): Unit = {
    insertStmt.close()
    updateStmt.close()
    conn.close()
  }
}
