【大數據應用技術】做業十一｜分佈式並行計算MapReduce

時間 2020-06-04

原文原文鏈接

本次做業在要求來自：https://edu.cnblogs.com/campus/gzcc/GZCC-16SE2/homework/3319html

1.用本身的話闡明Hadoop平臺上HDFS和MapReduce的功能、工做原理和工做過程。python

1）HDFSshell

HDFS是分佈式文件系統，用來存儲海量數據。HDFS中有兩類節點：NameNode和DataNode。vim

NameNode是管理節點，存放文件元數據。也就是存放着文件和數據塊的映射表，數據塊和數據節點的映射表。也就是說，經過NameNode，咱們就能夠找到文件存放的地方，找到存放的數據。DataNode是工做節點，用來存放數據塊，也就是文件實際存儲的地方。bash

工做原理：客戶端向NameNode發起讀取元數據的消息，NameNode就會查詢它的Block Map，找到對應的數據節點。而後客戶端就能夠去對應的數據節點中找到數據塊，拼接成文件就能夠了，這就是讀寫的流程。app

2）MapReduce框架

MapReduce是並行處理框架，實現任務分解和調度。分佈式

工做原理：將一個大任務分解成多個小任務(map)，小任務執行完了以後，合併計算結果(reduce)。也就是說，JobTracker拿到job以後，會把job分紅不少個maptask和reducetask，交給他們執行。 MapTask、ReduceTask函數的輸入、輸出都是的形式。HDFS存儲的輸入數據通過解析後，以鍵值對的形式，輸入到MapReduce()函數中進行處理，輸出一系列鍵值對做爲中間結果，在Reduce階段，對擁有一樣Key值的中間數據進行合併造成最後結果。函數

2.HDFS上運行MapReduceoop

1）準備文本文件，放在本地/home/hadoop/wc

先準備一個大一點英文文本文件，我這裏準備的是一個名爲Mrstandfast.txt的文本文件，放在了下載目錄下，以下圖所示。

使用 mkdir wc 命令新建一個名爲wc的文件夾，再使用 mv /home/chen/下載/MrStandfast.txt MrStandfast.txt 命令把MrStandfast.txt文件複製到wc中，以下圖所示。

2）編寫mapper函數和reducer函數，在本地運行測試經過

首先，咱們能夠先編寫 mapper函數和reducer函數，使用 gedit mapper.py 命令創建mapper.py文件，在其中插入所要執行的語句，並保存關閉。同理，reducer.py也是這樣。

mapper.py

#!/usr/bin/env python
import sys for line in sys.stdin: line = line.strip() words = line.split() for word in words: print "%s\t%s" % (word, 1)

reducer.py

#!/usr/bin/env python
 
from operator import itemgetter import sys current_word = None current_count = 0 word = None # input comes from STDIN
for line in sys.stdin: # remove leading and trailing whitespace
    line = line.strip() # parse the input we got from mapper.py
    word, count = line.split('\t', 1) # convert count (currently a string) to int
    try: count = int(count) except ValueError: # count was not a number, so silently
        # ignore/discard this line
        continue
 
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word: current_count += count else: if current_word: # write result to STDOUT
            print '%s\t%s' % (current_word, current_count) current_count = count current_word = word # do not forget to output the last word if needed!
if current_word == word: print '%s\t%s' % (current_word, current_count)

完成上述步驟後，可使用 cat mapper.py 命令和 cat reducer.py 命令來查看，以下圖所示。

分別使用 chmod a+x /home/chen/wc/mapper.py 和 chmod a+x /home/chen/wc/reducer.py 命令修改mapper和reducer的權限。

分別使用 echo "foo foo quux labs foo bar quux" | ./mapper 命令和 echo "foo foo quux labs foo bar quux" | ./mapper.py | sort -k1,1 | ./reducer.py 命令在本地測試python代碼是否可執行，以下圖所示。

3）啓動Hadoop：HDFS, JobTracker, TaskTracker

使用 start-all.sh 命令啓動hadoop，再使用 jps 命令查看是否啓動成功，以下圖所示。

4）把文本文件上傳到hdfs文件系統上 user/chen/input

因爲我先前作實驗時已經建立了 /user/chen/input 這個文件夾了，因此在這裏我就直接將本地文件上傳便可，使用 hdfs dfs -put /home/chen/wc/MrStandfast.txt /user/chen/input/ 命令把本地的MrStandfast.txt上傳至hdfs文件系統上 user/chen/input上，再使用 hdfs dfs -ls /user/chen/input/ 命令來查看文件，以下圖所示。

注意：若是先前尚未建立文件夾的，可使用 hdfs dfs -mkdir -p /user/chen/input 命令來建立文件夾，詳見https://www.cnblogs.com/bhuan/p/10964927.html

5）streaming的jar文件的路徑寫入環境變量，讓環境變量生效

使用 vim ~/.bashrc 命令將streaming的jar文件的路徑寫入~/.bashrc中，並使用 source ~/.bashrc 讓環境變量生效，以下圖所示。

streaming的jar文件的路徑：

export STREAM=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar

6）創建一個shell腳本文件：streaming接口運行的腳本，名稱爲run.sh

使用 vim run.sh 命令或者 gedit run.sh 命令添加streaming接口運行的腳本，再使用 source run.sh 命令使其生效，以下圖所示。

run.sh文件內容

hadoop jar $STREAM  \
-file /home/chen/wc/mapper.py \
-mapper  /home/chen/wc/mapper.py \
-file /home/chen/wc/reducer.py \
-reducer  /home/chen/wc/reducer.py \
-input /user/chen/input/*.txt \
-output /user/chen/wcountput

7）source run.sh來執行mapreduce

8）查看運行結果

使用 hdfs dfs -cat /user/chen/wcountput/* 命令來查看運行結果，以下圖所示。