Java Big Data Development (3) Hadoop (20): FileInputFormat Implementation Classes

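Introduction: The split mechanism should be clear by now. Input files come in many formats, though, so how should data of different types be read? That is the job of the FileInputFormat implementation classes covered next.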

Question

A MapReduce program can take input in many formats, for example line-based log files, binary files, and database tables. How does MapReduce read each of these data types?

Common FileInputFormat implementation classes include TextInputFormat, KeyValueTextInputFormat, NLineInputFormat, CombineTextInputFormat, and custom InputFormat classes.

1. TextInputFormat

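TextInputFormat is the default FileInputFormat implementation. It reads the input line by line, one record per line. The key is the byte offset at which the line starts within the whole file, of type LongWritable. The value is the content of the line, excluding any line terminators (line feed and carriage return), of type Text.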

Here is an example. Suppose a split contains the following 4 text records.

Rich learning form
Intelligent learning engine
Learning more convenient
From the real demand for more close to the enterprise

Each record is represented as the following key/value pair:

(0,Rich learning form)
(19,Intelligent learning engine)
(47,Learning more convenient)
(72,From the real demand for more close to the enterprise)
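The offsets above can be checked with a small plain-Java sketch. It does not use any Hadoop classes; TextOffsetsDemo and toRecords are hypothetical names, and the sketch assumes every line ends with a single one-byte '\n' terminator.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class TextOffsetsDemo {
    /** Builds "(offset,line)" pairs, where offset is the byte position at which the line starts. */
    static List<String> toRecords(String content) {
        List<String> records = new ArrayList<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            records.add("(" + offset + "," + line + ")");
            // Advance past the bytes of this line plus the one-byte '\n' (assumption).
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        String content = "Rich learning form\n"
                + "Intelligent learning engine\n"
                + "Learning more convenient\n"
                + "From the real demand for more close to the enterprise\n";
        // Prints one "(offset,line)" pair per line: offsets 0, 19, 47, 72.
        toRecords(content).forEach(System.out::println);
    }
}
```

Each key is the previous key plus the previous line's byte length plus one for the line terminator, which is why the second record starts at 19 rather than 18.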

2. KeyValueTextInputFormat

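Each line is one record, split by a separator into a key and a value. The separator can be set in the driver class with conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, "\t");. The default separator is the tab character (\t).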

Here is an example. The input is a split containing 4 records, where ——> stands for a (horizontal) tab character.

line1 ——>Rich learning form
line2 ——>Intelligent learning engine
line3 ——>Learning more convenient
line4 ——>From the real demand for more close to the enterprise

Each record is represented as the following key/value pair:

(line1,Rich learning form)
(line2,Intelligent learning engine)
(line3,Learning more convenient)
(line4,From the real demand for more close to the enterprise)

Note: the key here is the Text sequence that precedes the tab on each line.
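The splitting rule can be sketched in plain Java without Hadoop classes. KeyValueSplitDemo and splitRecord are hypothetical names. The sketch assumes the line is cut at the first occurrence of the separator and that a line without a separator becomes the key with an empty value.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

class KeyValueSplitDemo {
    /** Splits one line into (key, value) at the first occurrence of the separator. */
    static Map.Entry<String, String> splitRecord(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            // Assumption: no separator means the whole line is the key and the value is empty.
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, pos), line.substring(pos + 1));
    }

    public static void main(String[] args) {
        Map.Entry<String, String> record = splitRecord("line1\tRich learning form", '\t');
        System.out.println("(" + record.getKey() + "," + record.getValue() + ")");
    }
}
```

Because only the first separator is used, any later tabs stay inside the value.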

3. NLineInputFormat

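With NLineInputFormat, the InputSplit handled by each map task is no longer divided by block, but by the number of lines N specified for NLineInputFormat. That is, number of splits = total number of lines in the input file / N; if the division is not exact, number of splits = quotient + 1.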
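The split-count rule is a ceiling division, which can be sketched as follows. NLineSplitCount and splitCount are hypothetical names used only for this illustration.

```java
class NLineSplitCount {
    /** Number of splits for one file: totalLines / linesPerSplit, rounded up. */
    static long splitCount(long totalLines, int linesPerSplit) {
        if (linesPerSplit <= 0) {
            throw new IllegalArgumentException("linesPerSplit must be positive");
        }
        // Integer ceiling division: adds 1 to the quotient only when there is a remainder.
        return (totalLines + linesPerSplit - 1) / linesPerSplit;
    }

    public static void main(String[] args) {
        System.out.println(splitCount(4, 2)); // 2: divides exactly
        System.out.println(splitCount(5, 2)); // 3: quotient 2, plus 1 for the leftover line
    }
}
```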

Here is an example, again using the 4 input lines from above.

Rich learning form
Intelligent learning engine
Learning more convenient
From the real demand for more close to the enterprise

For example, if N is 2, each input split contains two lines and 2 MapTasks are started. One mapper receives the first two lines:

(0,Rich learning form)
(19,Intelligent learning engine)

The other mapper receives the last two lines:

(47,Learning more convenient)
(72,From the real demand for more close to the enterprise)

The keys and values here are the same as those produced by TextInputFormat.
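The grouping in this example can be reproduced with a plain-Java simulation. It is only a sketch and does not use Hadoop classes; NLineSplitDemo and splitRecords are hypothetical names, and each line is assumed to end with a one-byte '\n'. Keys keep counting from the start of the file, so the second split starts at offset 47 rather than 0.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NLineSplitDemo {
    /** Groups lines into splits of n lines each; records are "(fileOffset,line)" pairs. */
    static List<List<String>> splitRecords(List<String> lines, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<List<String>> splits = new ArrayList<>();
        long offset = 0;
        for (int i = 0; i < lines.size(); i++) {
            if (i % n == 0) {
                splits.add(new ArrayList<>()); // every n-th line opens a new split
            }
            String line = lines.get(i);
            splits.get(splits.size() - 1).add("(" + offset + "," + line + ")");
            // The offset is relative to the whole file, not to the current split.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Rich learning form",
                "Intelligent learning engine",
                "Learning more convenient",
                "From the real demand for more close to the enterprise");
        List<List<String>> splits = splitRecords(lines, 2);
        for (int i = 0; i < splits.size(); i++) {
            System.out.println("split " + i + ": " + splits.get(i));
        }
    }
}
```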

This article is shared from the WeChat public account 跟我一塊兒學大數據 (java_big_data).