This article lowers the barrier to learning big data all the way down to ground level.

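Introduction to Hadoop

Hadoop: the Adam and Eve of the open-source big data world.
Its core consists of HDFS, a data storage system, and MapReduce, a distributed computing framework.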

HDFS

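The idea is to cut large data into blocks.

Each block is replicated three times and the copies are placed on three separate commodity machines, so that three usable copies always back each other up. A read only needs to fetch one of the replicas to recover that block.

The nodes that store data are called DataNodes (the cubicles); the node that manages the DataNodes is called the NameNode (the umbrella holder).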
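To make the block-and-replica idea concrete, here is a minimal in-memory sketch in Python. It is not how HDFS is implemented: the function names, the 4-byte block size, and the round-robin placement are all assumptions chosen for illustration. It only shows that each block lives on three different nodes and that one live replica is enough to read it back.

```python
def split_into_blocks(data, block_size):
    """Cut the data into fixed-size blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def store(data, datanodes, block_size=4, replication=3):
    """Store data on the DataNodes and return (index, storage).

    index   : what the NameNode tracks, block number -> DataNodes holding it
    storage : what the DataNodes hold, node name -> {block number: bytes}
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    index = {}
    storage = {node: {} for node in datanodes}
    for number, block in enumerate(split_into_blocks(data, block_size)):
        # Round-robin placement: an assumption for illustration only.
        holders = [datanodes[(number + r) % len(datanodes)]
                   for r in range(replication)]
        for node in holders:
            storage[node][number] = block
        index[number] = holders
    return index, storage


def read(index, storage, dead_nodes=frozenset()):
    """Rebuild the data, reading each block from the first live replica."""
    parts = []
    for number in sorted(index):
        live = [node for node in index[number] if node not in dead_nodes]
        if not live:
            raise IOError("block %d has no live replica" % number)
        parts.append(storage[live[0]][number])
    return b"".join(parts)
```

With four hypothetical nodes, calling store(b"hello hdfs world", ["dn1", "dn2", "dn3", "dn4"]) and then read(...) with two of the nodes marked dead still returns the original bytes, because every block keeps at least one live replica.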

MapReduce

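The idea is to first split a big task into piles and process them (Map), then aggregate the results (Reduce). Both the splitting and the aggregating run in parallel across many servers, which is where the cluster's power comes from. The hard part is decomposing a task into split and aggregate steps that fit the MapReduce model, and deciding what the <k,v> inputs and outputs of each intermediate step should be.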
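As a rough illustration of the model, the sketch below runs a word count through map, shuffle, and reduce steps in plain Python. It runs sequentially in one process, so it shows only the shape of the <k,v> data at each step, not the parallelism. The function names are made up for this example and are not Hadoop APIs.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word, 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    """Run map over every chunk, then shuffle, then reduce each group."""
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())
```

Here the map output is <word, 1>, the reduce input is <word, [1, 1, ...]>, and the reduce output is <word, total>. On a cluster, each chunk would go to a different map task, and each group of keys to a different reduce task.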

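Standalone Hadoop

For anyone studying how Hadoop works or developing on Hadoop, setting up a Hadoop system is a must. But:

  • Configuring the system is a real headache, and many people give up partway through.
  • You may have no servers to use.

This section describes a zero-configuration way to install and use standalone Hadoop, so that you can quickly run Hadoop examples to support learning, development, and testing.
It requires a Linux virtual machine on your laptop, with Docker installed in the VM.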

Installation

Use Docker to pull the sequenceiq/hadoop-docker:2.7.0 image and run it.

[root@bogon ~]# docker pull sequenceiq/hadoop-docker:2.7.0
2.7.0: Pulling from sequenceiq/hadoop-docker
860d0823bcab: Pulling fs layer
e592c61b2522: Pulling fs layer

A successful download ends with:

Digest: sha256:a40761746eca036fee6aafdf9fdbd6878ac3dd9a7cd83c0f3f5d8a0e6350c76a
Status: Downloaded newer image for sequenceiq/hadoop-docker:2.7.0

Startup

[root@bogon ~]# docker run -it --privileged=true sequenceiq/hadoop-docker:2.7.0 /etc/bootstrap.sh -bash
Starting sshd:                                             [  OK  ]
Starting namenodes on [b7a42f79339c]
b7a42f79339c: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-b7a42f79339c.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-b7a42f79339c.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-b7a42f79339c.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanager-b7a42f79339c.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-b7a42f79339c.out

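Once startup succeeds, the shell drops straight into the Hadoop container, so there is no need to run docker exec. Inside the container, go to /usr/local/hadoop/sbin and run ./start-all.sh and ./mr-jobhistory-daemon.sh start historyserver, as follows: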

bash-4.1# cd /usr/local/hadoop/sbin
bash-4.1# ./start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh

Starting namenodes on [b7a42f79339c]
b7a42f79339c: namenode running as process 128. Stop it first.

localhost: datanode running as process 219. Stop it first.
Starting secondary namenodes [0.0.0.0]
0.0.0.0: secondarynamenode running as process 402. Stop it first.

starting yarn daemons
resourcemanager running as process 547. Stop it first.
localhost: nodemanager running as process 641. Stop it first.  

bash-4.1# ./mr-jobhistory-daemon.sh start historyserver
chown: missing operand after `/usr/local/hadoop/logs'
Try `chown --help' for more information.
starting historyserver, logging to /usr/local/hadoop/logs/mapred--historyserver-b7a42f79339c.out

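Hadoop is now up. It is that simple.

If you want to know how painful a distributed deployment is, just count how many configuration files there are! I once watched a Hadoop veteran spend a whole morning on a set of new servers whose hostnames contained a hyphen ("-"), and the environment still would not come up.

Running the bundled example

Go back to the Hadoop home directory and run the example program: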

bash-4.1# cd /usr/local/hadoop
bash-4.1# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output 'dfs[a-z.]+' 
20/07/05 22:34:41 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
20/07/05 22:34:43 INFO input.FileInputFormat: Total input paths to process : 31
20/07/05 22:34:43 INFO mapreduce.JobSubmitter: number of splits:31
20/07/05 22:34:44 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1594002714328_0001
20/07/05 22:34:44 INFO impl.YarnClientImpl: Submitted application application_1594002714328_0001
20/07/05 22:34:45 INFO mapreduce.Job: The url to track the job: http://b7a42f79339c:8088/proxy/application_1594002714328_0001/
20/07/05 22:34:45 INFO mapreduce.Job: Running job: job_1594002714328_0001
20/07/05 22:35:04 INFO mapreduce.Job: Job job_1594002714328_0001 running in uber mode : false
20/07/05 22:35:04 INFO mapreduce.Job:  map 0% reduce 0%
20/07/05 22:37:59 INFO mapreduce.Job:  map 11% reduce 0%
20/07/05 22:38:05 INFO mapreduce.Job:  map 12% reduce 0%

When the MapReduce computation finishes, it prints the following:

20/07/05 22:55:26 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=291
                FILE: Number of bytes written=230541
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=569
                HDFS: Number of bytes written=197
                HDFS: Number of read operations=7
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters 
                Launched map tasks=1
                Launched reduce tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=5929
                Total time spent by all reduces in occupied slots (ms)=8545
                Total time spent by all map tasks (ms)=5929
                Total time spent by all reduce tasks (ms)=8545
                Total vcore-seconds taken by all map tasks=5929
                Total vcore-seconds taken by all reduce tasks=8545
                Total megabyte-seconds taken by all map tasks=6071296
                Total megabyte-seconds taken by all reduce tasks=8750080
        Map-Reduce Framework
                Map input records=11
                Map output records=11
                Map output bytes=263
                Map output materialized bytes=291
                Input split bytes=132
                Combine input records=0
                Combine output records=0
                Reduce input groups=5
                Reduce shuffle bytes=291
                Reduce input records=11
                Reduce output records=11
                Spilled Records=22
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=159
                CPU time spent (ms)=1280
                Physical memory (bytes) snapshot=303452160
                Virtual memory (bytes) snapshot=1291390976
                Total committed heap usage (bytes)=136450048
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters 
                Bytes Read=437
        File Output Format Counters 
                Bytes Written=197

Use the hdfs command to view the output:

bash-4.1# bin/hdfs dfs -cat output/*
6       dfs.audit.logger
4       dfs.class
3       dfs.server.namenode.
2       dfs.period
2       dfs.audit.log.maxfilesize
2       dfs.audit.log.maxbackupindex
1       dfsmetrics.log
1       dfsadmin
1       dfs.servers
1       dfs.replication
1       dfs.file

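The example explained

grep is a MapReduce program that counts regular-expression matches in its input: it extracts every string that matches the regex, along with the number of times each one occurs.

The shell's grep prints the whole matching line, whereas this command prints only the matched substring:
grep input output 'dfs[a-z.]+'

The regular expression dfs[a-z.]+ matches the literal dfs followed by one or more characters, each of which is a lowercase letter or a literal dot. (Inside a character class, . stands for itself, not for "any character except newline".)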
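The behaviour of this pattern can be checked with Python's re module, whose character-class rules are the same for this pattern. This is only a sketch of the matching step, not the Hadoop example's code, and the sample lines are made up.

```python
import re

# Same pattern as in the Hadoop command line above.
PATTERN = re.compile(r"dfs[a-z.]+")


def matched_substrings(line):
    """Return only the matched substrings, not the whole line."""
    return PATTERN.findall(line)


# The match may start anywhere in the line, and it stops at the first
# character that is neither a lowercase letter nor a dot.
print(matched_substrings("log4j.logger.dfs.audit.logger=INFO"))

# "dfs-x", "dfs9" and a bare "dfs" do not match: "-" and "9" are outside
# the class, and "+" needs at least one character after "dfs".
print(matched_substrings("dfs-x dfs9 dfs"))
```

The first call yields the single substring dfs.audit.logger; the second yields an empty list.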
The input is every file in input:

bash-4.1# ls -lrt
total 48
-rw-r--r--. 1 root root  690 May 16  2015 yarn-site.xml
-rw-r--r--. 1 root root 5511 May 16  2015 kms-site.xml
-rw-r--r--. 1 root root 3518 May 16  2015 kms-acls.xml
-rw-r--r--. 1 root root  620 May 16  2015 httpfs-site.xml
-rw-r--r--. 1 root root  775 May 16  2015 hdfs-site.xml
-rw-r--r--. 1 root root 9683 May 16  2015 hadoop-policy.xml
-rw-r--r--. 1 root root  774 May 16  2015 core-site.xml
-rw-r--r--. 1 root root 4436 May 16  2015 capacity-scheduler.xml

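The results are written to output.
The computation flow is as follows:

[Figure: computation flow of the grep example]

What is slightly different here is that there are two reduce steps; the second one simply sorts the results by occurrence count. Developers are free to combine map and reduce steps however they like, as long as each step's output fits the next step's input.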
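The two-step flow can be imitated in a few lines of Python. This is a single-process sketch under the assumption that the first step counts matches and the second step reorders them by count. The function names are hypothetical and the real example's classes are not shown here.

```python
import re
from collections import Counter


def count_matches(lines, pattern):
    """Step 1: map emits (match, 1) per hit; reduce sums the ones per match."""
    regex = re.compile(pattern)
    counts = Counter()
    for line in lines:                      # map side: scan every input line
        for hit in regex.finditer(line):
            counts[hit.group(0)] += 1       # reduce side: sum per key
    return dict(counts)


def sort_by_count(counts):
    """Step 2: flip to (count, match) pairs and order by count, largest first."""
    flipped = [(n, match) for match, n in counts.items()]
    return sorted(flipped, key=lambda pair: pair[0], reverse=True)


def grep_job(lines, pattern):
    """Chain the two steps: the output of step 1 is the input of step 2."""
    return sort_by_count(count_matches(lines, pattern))
```

The output of the first step, <match, count>, is exactly the input the second step expects, which is the "outputs must connect to inputs" rule in miniature.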

Management UIs

Hadoop provides web-based management UIs:

Port    Purpose
50070   Hadoop NameNode UI
50075   Hadoop DataNode UI
50090   Hadoop SecondaryNameNode
50030   JobTracker monitoring
50060   TaskTracker
8088    YARN job monitoring
60010   HBase HMaster monitoring UI
60030   HBase HRegionServer
8080    Spark monitoring UI
4040    Spark job UI
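Adding command-line options

The docker run command needs port mappings before the management UIs can be reached:

docker run -it --privileged=true -p 50070:50070 -p 8088:8088 -p 50075:50075 sequenceiq/hadoop-docker:2.7.0 /etc/bootstrap.sh -bash

After running this command, you can open the UIs in a browser on the host machine; if the Linux VM has a browser, that works too. My Linux has no graphical interface, so I view them from the host.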
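If a page does not load, it can help to check first whether the mapped ports accept connections at all. The sketch below uses only Python's standard socket module. The helper names and the default host are assumptions for this example; replace the host with whatever address your Docker machine is reachable at.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_ui_ports(host="localhost", ports=(50070, 50075, 8088)):
    """Check the three UI ports mapped by the docker run command above."""
    return {port: is_port_open(host, port) for port in ports}


# Example call (the result depends on your environment):
#   check_ui_ports("localhost")
```

A False entry means nothing is listening on that port from where the script runs, which usually points to a missing -p mapping or a container that is not running, rather than a browser problem.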

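[Screenshot: Hadoop NameNode UI on port 50070]

[Screenshot: Hadoop DataNode UI on port 50075]

[Screenshot: YARN job monitoring on port 8088]

Both finished and running MapReduce jobs can be viewed on port 8088; the screenshot shows two jobs, grep and wordcount.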

Some problems

1. ./sbin/mr-jobhistory-daemon.sh start historyserver must be run; otherwise jobs report the following while running:

20/06/29 21:18:49 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
java.io.IOException: java.net.ConnectException: Call From 87a4217b9f8a/172.17.0.1 to 0.0.0.0:10020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

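2. ./start-all.sh must be run; otherwise you get an error of the form
Unknown Job job_1592960164748_0001

3. The docker run command must include --privileged=true; otherwise jobs fail while running with java.io.IOException: Job status not available. The flag is a docker option, so it belongs before the image name; anything after the image name is passed to the container's command instead.

4. Note that Hadoop does not overwrite existing results by default, so running the example above a second time fails with an error. Delete the output directory first (for example with bin/hdfs dfs -rm -r output), or write to a different directory such as output01.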

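Summary

The approach in this article gets Hadoop installed and configured at very low cost, which helps with learning, understanding, development, and testing. To develop your own Hadoop program, package it as a jar, upload it to the share/hadoop/mapreduce/ directory, and run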

bin/hadoop jar share/hadoop/mapreduce/yourtest.jar

to run the program and see the result.
