A Walkthrough of the MapReduce Example Programs

    MapReduce is a programming framework for distributed computation. Its core job is to combine the user's business-logic code with the framework's built-in default components into a complete distributed program that runs in parallel on a Hadoop cluster. MapReduce takes a divide-and-conquer approach: a large dataset stored in the distributed file system is cut into many independent splits, which can then be processed in parallel by multiple map tasks.

    The four major components of Hadoop:

    (1) HDFS: the distributed storage system;
    (2) MapReduce: the distributed computation system;
    (3) YARN: Hadoop's resource scheduling system;
    (4) Common: the underlying support layer for the other three components, mainly providing basic utilities, the RPC framework, and so on.

    The MapReduce component ships with some official example programs, the most famous of which are wordcount and pi. Their code is packaged in the hadoop-mapreduce-examples jar (hadoop-mapreduce-examples-2.7.6.jar here), which lives under the Hadoop installation directory at:

/share/hadoop/mapreduce

    Below we walk through these two example programs one by one.

    Before testing, turn off the firewall, then start ZooKeeper and the Hadoop cluster, in this order:

./start-dfs.sh
./start-yarn.sh

    Once everything has started successfully, check that all the expected processes are running (for example with jps); see the earlier posts on building the cluster for details.

    1. The pi example program

    (1) Run the command with its arguments

[hadoop@slave01 mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.7.6.jar pi 5 5
Number of Maps  = 5
Samples per Map = 5
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Starting Job
...
(part of the output is omitted here)
...
18/06/27 16:22:56 INFO mapreduce.Job:  map 0% reduce 0%
18/06/27 16:28:12 INFO mapreduce.Job:  map 73% reduce 0%
18/06/27 16:28:13 INFO mapreduce.Job:  map 100% reduce 0%
18/06/27 16:29:26 INFO mapreduce.Job:  map 100% reduce 100%
18/06/27 16:29:29 INFO mapreduce.Job: Job job_1530087649012_0001 completed successfully
18/06/27 16:29:30 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=116
		FILE: Number of bytes written=738477
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=1320
		HDFS: Number of bytes written=215
		HDFS: Number of read operations=23
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=3
	Job Counters 
		Launched map tasks=5
		Launched reduce tasks=1
		Data-local map tasks=5
		Total time spent by all maps in occupied slots (ms)=1625795
		Total time spent by all reduces in occupied slots (ms)=48952
		Total time spent by all map tasks (ms)=1625795
		Total time spent by all reduce tasks (ms)=48952
		Total vcore-milliseconds taken by all map tasks=1625795
		Total vcore-milliseconds taken by all reduce tasks=48952
		Total megabyte-milliseconds taken by all map tasks=1664814080
		Total megabyte-milliseconds taken by all reduce tasks=50126848
	Map-Reduce Framework
		Map input records=5
		Map output records=10
		Map output bytes=90
		Map output materialized bytes=140
		Input split bytes=730
		Combine input records=0
		Combine output records=0
		Reduce input groups=2
		Reduce shuffle bytes=140
		Reduce input records=10
		Reduce output records=0
		Spilled Records=20
		Shuffled Maps =5
		Failed Shuffles=0
		Merged Map outputs=5
		GC time elapsed (ms)=107561
		CPU time spent (ms)=32240
		Physical memory (bytes) snapshot=500453376
		Virtual memory (bytes) snapshot=12460331008
		Total committed heap usage (bytes)=631316480
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=590
	File Output Format Counters 
		Bytes Written=97
Job Finished in 452.843 seconds
Estimated value of Pi is 3.68000000000000000000

    Meaning of the two arguments:

    The first argument, 5, is the number of map tasks to run;
    the second argument, 5, is the number of darts each map task throws.

    The product of the two is the total number of throws (the pi code estimates pi by throwing darts).

    The run above gave us an estimate of pi = 3.68. You can of course vary the arguments and check how the result depends on them; with my arguments changed to 10 and 10, for example, the result was 3.20. Evidently, the larger the arguments, the more accurate the estimate.
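
    The bundled pi job (the class is named QuasiMonteCarlo in Hadoop 2.x, and it generates its points with a Halton sequence rather than a random generator, spreading the throws across the map tasks) is more elaborate than we need here, but the dart-throwing idea it relies on fits in a few lines of plain, single-process Java. The class name PiSketch below is made up for illustration; only the 4 * inside / total estimate mirrors the real example:

import java.util.Random;

// Single-process illustration of the dart-throwing idea behind the pi example:
// throw points at the unit square, count how many land inside the inscribed
// quarter circle, and estimate pi from the ratio of the two areas.
public class PiSketch {
  public static void main(String[] args) {
    int maps = 5;           // first command-line argument: number of map tasks
    int samplesPerMap = 5;  // second argument: throws per map task
    long total = (long) maps * samplesPerMap;  // 5 * 5 = 25 throws in total

    Random rnd = new Random();
    long inside = 0;
    for (long i = 0; i < total; i++) {
      double x = rnd.nextDouble();
      double y = rnd.nextDouble();
      if (x * x + y * y <= 1.0) {
        inside++;  // the point fell inside the quarter circle
      }
    }
    // Quarter circle area / square area = pi/4, hence the factor of 4.
    // With only 25 throws the estimate is coarse (values like 3.68 above);
    // it converges as the number of throws grows.
    System.out.println("Estimated pi = " + 4.0 * inside / total);
  }
}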

    (2) Check on the running job

    The run can take a varying amount of time, so we can watch its progress through the web UI, by visiting:

slave01:8088

    The UI shows the list of applications (the screenshot is not reproduced here). As you can see there, once the job's Progress bar completes, the computation is finished; you can also click into the application for the details, which we won't demonstrate here.

    2. The wordcount example program

    (1) Prepare the data and upload it to HDFS

    Wordcount is, simply put, word counting. Here we create a new txt file and type in some words, to make the counts easy to check:

[hadoop@slave01 mapreduce]$ touch wordcount.txt
[hadoop@slave01 mapreduce]$ vim wordcount.txt

    Type in the following words and save the file:

hello word !
you can help me ?
yes , I can
How do you do ?

    Now upload it to HDFS: first create a directory on HDFS, then put the txt file into that directory. Below is one way to create it; hadoop fs -mkdir works just as well, so pick either one, and mind the paths:

[hadoop@slave01 bin]$ hdfs dfs -mkdir -p /wordcount
[hadoop@slave01 bin]$ hdfs dfs -put ../share/hadoop/mapreduce/wordcount.txt /wordcount
[hadoop@slave01 bin]$
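
    As an aside, the same mkdir-plus-put can be done programmatically through the HDFS Java API. A minimal sketch, assuming the cluster's configuration files (core-site.xml and friends) are on the classpath; the class name UploadToHdfs is made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Create /wordcount on HDFS and upload the local wordcount.txt into it,
// mirroring the hdfs dfs -mkdir / -put commands above.
public class UploadToHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // picks up core-site.xml etc.
    FileSystem fs = FileSystem.get(conf);      // handle to the default FS (HDFS)
    fs.mkdirs(new Path("/wordcount"));
    fs.copyFromLocalFile(new Path("wordcount.txt"), new Path("/wordcount"));
    fs.close();
  }
}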

    We can confirm the upload succeeded by browsing the HDFS file system at slave01:50070 (screenshot not reproduced here).

    (2) Run the program

    Run the command below, again minding the paths:

[hadoop@slave01 bin]$ yarn jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.6.jar wordcount /wordcount /word_output
18/06/27 17:34:24 INFO client.RMProxy: Connecting to ResourceManager at slave01/127.0.0.1:8032
18/06/27 17:34:30 INFO input.FileInputFormat: Total input paths to process : 1
18/06/27 17:34:30 INFO mapreduce.JobSubmitter: number of splits:1
18/06/27 17:34:31 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1530087649012_0003
18/06/27 17:34:32 INFO impl.YarnClientImpl: Submitted application application_1530087649012_0003
18/06/27 17:34:33 INFO mapreduce.Job: The url to track the job: http://slave01:8088/proxy/application_1530087649012_0003/
18/06/27 17:34:33 INFO mapreduce.Job: Running job: job_1530087649012_0003
18/06/27 17:34:52 INFO mapreduce.Job: Job job_1530087649012_0003 running in uber mode : false
18/06/27 17:34:52 INFO mapreduce.Job:  map 0% reduce 0%
18/06/27 17:35:02 INFO mapreduce.Job:  map 100% reduce 0%
18/06/27 17:35:31 INFO mapreduce.Job:  map 100% reduce 100%
18/06/27 17:35:32 INFO mapreduce.Job: Job job_1530087649012_0003 completed successfully
...
(part of the output is omitted here)
...
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=59
	File Output Format Counters 
		Bytes Written=72

    Meaning of the command arguments:

    The first is the path to the jar; the second is the name of the example program to run, wordcount; the third is the HDFS path of the input file; the fourth is the output directory (it must not already exist).
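
    Under the hood, the example's driver takes those last two arguments and wires them into the job as its input and output paths. A minimal driver in the spirit of the bundled example might look like this (the names WordCountDriver, WordCountMapper and WordCountReducer are illustrative; the mapper and reducer themselves are sketched at the end of this post):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of a wordcount driver: configures the job and hands it the two
// path arguments from the command line.
public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // /wordcount
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // /word_output, must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}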

    The log shown above is the run's output; as before, we can track the job's progress at slave01:8088.

    When the job has finished, you can see on the HDFS file system that the output directory has been created and the output files are inside it (screenshot omitted).

    The result file can be viewed with this command:

[hadoop@slave01 bin]$ hdfs dfs -text /word_output/part*
!	1
,	1
?	2
How	1
I	1
can	2
do	2
hello	1
help	1
me	1
word	1
yes	1
you	2
[hadoop@slave01 bin]$

    As the output shows, the word counts are complete, and you can verify them against the input file.
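
    For completeness, the counting logic that produced the output above boils down to a mapper that emits (word, 1) for every token and a reducer that sums those ones per word. A minimal sketch in the style of the bundled example (class names are again illustrative, matching the driver sketched earlier):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: split each input line into tokens and emit (token, 1).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);  // e.g. ("can", 1)
    }
  }
}

// Reducer: sum the 1s for each word, producing the counts seen above.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    result.set(sum);
    context.write(key, result);  // e.g. ("can", 2)
  }
}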

    That's it for the walkthrough of these two bundled examples; as for the real source code, we can dig into it together another time.
