Hadoop: the Adam and Eve of the open-source big data world.
Its core is HDFS, the data storage system, and MapReduce, the distributed computing framework.
HDFS works by chopping large data into pieces,
replicating each piece three times and spreading the copies across three cheap machines, so that three usable replicas always back each other up. When the data is needed, it is read from any one of the replicas and the piece is recovered.
The nodes that store the data are called DataNodes (the storage cubicles); the node that manages the DataNodes is called the NameNode (the umbrella holder).
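Once the single-node environment described below is up, you can poke at HDFS directly. A minimal sketch, run from /usr/local/hadoop inside the container and assuming the default /user/root home directory in HDFS; the demo file and directory names are illustrative, not part of the original example:

```bash
# put a local file into HDFS, read it back, then inspect its blocks and replicas
bin/hdfs dfs -mkdir -p demo
bin/hdfs dfs -put etc/hadoop/core-site.xml demo/
bin/hdfs dfs -cat demo/core-site.xml
bin/hdfs fsck /user/root/demo/core-site.xml -files -blocks -locations
```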
MapReduce works by first splitting a big task into chunks that are processed separately (Map), then aggregating the partial results (Reduce). Both the splitting and the aggregation run in parallel across many servers; that is where the power of the cluster shows. The hard part is decomposing a task into the Map and Reduce shape, and deciding what the intermediate <k,v> inputs and outputs should be.
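As a rough analogy only (no Hadoop involved), the classic Unix pipeline below counts words in the same map, shuffle, reduce shape; demo.txt is a hypothetical local text file:

```bash
# "map":     split each line into one word per line
# "shuffle": sort brings identical words together
# "reduce":  uniq -c counts each run of identical words
cat demo.txt | tr -s '[:space:]' '\n' | sort | uniq -c | sort -rn
```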
For anyone studying how Hadoop works or doing Hadoop development, setting up a Hadoop system is a must.
Here, however, is a configuration-free, single-node way to install and use Hadoop, so you can quickly run the Hadoop examples to support learning, development and testing.
All you need is a Linux virtual machine on your laptop, with Docker installed inside the VM.
Use Docker to pull the sequenceiq/hadoop-docker:2.7.0 image and run it.
```
[root@bogon ~]# docker pull sequenceiq/hadoop-docker:2.7.0
2.7.0: Pulling from sequenceiq/hadoop-docker
860d0823bcab: Pulling fs layer
e592c61b2522: Pulling fs layer
```
On a successful download, the output ends with:
```
Digest: sha256:a40761746eca036fee6aafdf9fdbd6878ac3dd9a7cd83c0f3f5d8a0e6350c76a
Status: Downloaded newer image for sequenceiq/hadoop-docker:2.7.0
```
```
[root@bogon ~]# docker run -it sequenceiq/hadoop-docker:2.7.0 /etc/bootstrap.sh -bash --privileged=true
Starting sshd: [ OK ]
Starting namenodes on [b7a42f79339c]
b7a42f79339c: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-b7a42f79339c.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-b7a42f79339c.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-b7a42f79339c.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanager-b7a42f79339c.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-b7a42f79339c.out
```
After startup, the shell drops straight into the Hadoop container environment, so there is no need to run docker exec. Inside the container, go to /usr/local/hadoop/sbin and run ./start-all.sh and ./mr-jobhistory-daemon.sh start historyserver, as follows:
```
bash-4.1# cd /usr/local/hadoop/sbin
bash-4.1# ./start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [b7a42f79339c]
b7a42f79339c: namenode running as process 128. Stop it first.
localhost: datanode running as process 219. Stop it first.
Starting secondary namenodes [0.0.0.0]
0.0.0.0: secondarynamenode running as process 402. Stop it first.
starting yarn daemons
resourcemanager running as process 547. Stop it first.
localhost: nodemanager running as process 641. Stop it first.
bash-4.1# ./mr-jobhistory-daemon.sh start historyserver
chown: missing operand after `/usr/local/hadoop/logs'
Try `chown --help' for more information.
starting historyserver, logging to /usr/local/hadoop/logs/mapred--historyserver-b7a42f79339c.out
```
Hadoop is now up. That simple.
If you wonder how painful a distributed deployment is, just count the configuration files. I once watched a Hadoop veteran spend an entire morning on a newly provisioned server whose hostname contained a hyphen "-", and the environment still refused to come up.
Go back to the Hadoop home directory and run the example program:
```
bash-4.1# cd /usr/local/hadoop
bash-4.1# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output 'dfs[a-z.]+'
20/07/05 22:34:41 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
20/07/05 22:34:43 INFO input.FileInputFormat: Total input paths to process : 31
20/07/05 22:34:43 INFO mapreduce.JobSubmitter: number of splits:31
20/07/05 22:34:44 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1594002714328_0001
20/07/05 22:34:44 INFO impl.YarnClientImpl: Submitted application application_1594002714328_0001
20/07/05 22:34:45 INFO mapreduce.Job: The url to track the job: http://b7a42f79339c:8088/proxy/application_1594002714328_0001/
20/07/05 22:34:45 INFO mapreduce.Job: Running job: job_1594002714328_0001
20/07/05 22:35:04 INFO mapreduce.Job: Job job_1594002714328_0001 running in uber mode : false
20/07/05 22:35:04 INFO mapreduce.Job:  map 0% reduce 0%
20/07/05 22:37:59 INFO mapreduce.Job:  map 11% reduce 0%
20/07/05 22:38:05 INFO mapreduce.Job:  map 12% reduce 0%
```
When the MapReduce job finishes, it prints output like the following:
```
20/07/05 22:55:26 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=291
                FILE: Number of bytes written=230541
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=569
                HDFS: Number of bytes written=197
                HDFS: Number of read operations=7
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=1
                Launched reduce tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=5929
                Total time spent by all reduces in occupied slots (ms)=8545
                Total time spent by all map tasks (ms)=5929
                Total time spent by all reduce tasks (ms)=8545
                Total vcore-seconds taken by all map tasks=5929
                Total vcore-seconds taken by all reduce tasks=8545
                Total megabyte-seconds taken by all map tasks=6071296
                Total megabyte-seconds taken by all reduce tasks=8750080
        Map-Reduce Framework
                Map input records=11
                Map output records=11
                Map output bytes=263
                Map output materialized bytes=291
                Input split bytes=132
                Combine input records=0
                Combine output records=0
                Reduce input groups=5
                Reduce shuffle bytes=291
                Reduce input records=11
                Reduce output records=11
                Spilled Records=22
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=159
                CPU time spent (ms)=1280
                Physical memory (bytes) snapshot=303452160
                Virtual memory (bytes) snapshot=1291390976
                Total committed heap usage (bytes)=136450048
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=437
        File Output Format Counters
                Bytes Written=197
```
Use the hdfs command to view the result:
```
bash-4.1# bin/hdfs dfs -cat output/*
6       dfs.audit.logger
4       dfs.class
3       dfs.server.namenode.
2       dfs.period
2       dfs.audit.log.maxfilesize
2       dfs.audit.log.maxbackupindex
1       dfsmetrics.log
1       dfsadmin
1       dfs.servers
1       dfs.replication
1       dfs.file
```
This grep is a MapReduce program that evaluates a regular expression over the input, picking out the strings that match and counting how many times each one occurs.
Unlike the shell's grep, which prints the entire matching line, this program outputs only the matched substring from each line.
```
grep input output 'dfs[a-z.]+'
```
The regular expression dfs[a-z.]+ matches strings that start with dfs, followed by one or more characters from the class [a-z.], that is, lowercase letters or a literal dot (inside a character class the dot loses its "any character" meaning).
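To see what the job extracts, here is the same pattern applied with ordinary shell grep; -o prints only the matched part of the line and -E enables extended regular expressions (the input string is made up for illustration):

```bash
echo "some text dfs.replication more text" | grep -oE 'dfs[a-z.]+'
# prints: dfs.replication
```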
The input is every file in the input directory:
```
bash-4.1# ls -lrt
total 48
-rw-r--r--. 1 root root  690 May 16  2015 yarn-site.xml
-rw-r--r--. 1 root root 5511 May 16  2015 kms-site.xml
-rw-r--r--. 1 root root 3518 May 16  2015 kms-acls.xml
-rw-r--r--. 1 root root  620 May 16  2015 httpfs-site.xml
-rw-r--r--. 1 root root  775 May 16  2015 hdfs-site.xml
-rw-r--r--. 1 root root 9683 May 16  2015 hadoop-policy.xml
-rw-r--r--. 1 root root  774 May 16  2015 core-site.xml
-rw-r--r--. 1 root root 4436 May 16  2015 capacity-scheduler.xml
```
The results are written to the output directory.
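Both input and output here are directories in HDFS, under the running user's home directory (assumed to be /user/root in this image). If you want to double-check what the job read and wrote, the standard listing commands should work from /usr/local/hadoop:

```bash
# list the example job's input and output directories in HDFS
bin/hdfs dfs -ls input
bin/hdfs dfs -ls output
```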
The computation flow is as follows.
A slight difference here is that there are two reduce passes; the second reduce simply sorts the results by number of occurrences. Developers are free to combine map and reduce stages however they like, as long as each stage's output lines up with the next stage's input.
Hadoop provides web-based management UIs:
| Port | Purpose |
| --- | --- |
| 50070 | Hadoop NameNode UI |
| 50075 | Hadoop DataNode UI |
| 50090 | Hadoop SecondaryNameNode |
| 50030 | JobTracker monitoring |
| 50060 | TaskTracker |
| 8088 | YARN job monitoring |
| 60010 | HBase HMaster monitoring UI |
| 60030 | HBase HRegionServer |
| 8080 | Spark monitoring UI |
| 4040 | Spark job UI |
Extra port-mapping arguments must be added to the docker run command before the UI management pages can be reached:
```
docker run -it -p 50070:50070 -p 8088:8088 -p 50075:50075 sequenceiq/hadoop-docker:2.7.0 /etc/bootstrap.sh -bash --privileged=true
```
Once this command is running, you can browse the system from the host machine; of course, if your Linux VM has a browser you can browse from there too. My Linux has no graphical interface, so I view it from the host.
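With the ports above mapped, the UIs should be reachable at addresses like these (the host name or IP depends on where your Docker daemon runs; the placeholders are illustrative):

```
http://<docker-host-ip>:50070   # HDFS NameNode UI
http://<docker-host-ip>:8088    # YARN job monitoring
http://<docker-host-ip>:50075   # DataNode UI
```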
Both finished and running MapReduce jobs can be viewed on port 8088; the screenshot above shows two jobs, grep and wordcount.
1. ./sbin/mr-jobhistory-daemon.sh start historyserver must be run, otherwise jobs fail midway with:
```
20/06/29 21:18:49 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
java.io.IOException: java.net.ConnectException: Call From 87a4217b9f8a/172.17.0.1 to 0.0.0.0:10020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
```
2. ./start-all.sh must be run, otherwise you get errors such as
Unknown Job job_1592960164748_0001.
3. The docker run command must include --privileged=true, otherwise jobs fail with java.io.IOException: Job status not available. (Strictly speaking, Docker options such as --privileged=true belong before the image name in docker run; placed after the image, as in the commands above, they are passed as arguments to /etc/bootstrap.sh instead.)
4. Note that Hadoop does not overwrite result files by default, so running the example above a second time will report an error; you need to delete ./output first. Or try changing the output name to output01.
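Since the output directory lives in HDFS, the deletion would be done with the standard HDFS shell, roughly:

```bash
# remove the previous job output in HDFS so the example can be rerun
bin/hdfs dfs -rm -r output
```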
The approach in this article gets a working Hadoop installation at very low cost, which helps with learning and understanding as well as with development and testing. To develop your own Hadoop program, package it as a jar, upload it to the share/hadoop/mapreduce/ directory, and execute
```
bin/hadoop jar share/hadoop/mapreduce/yourtest.jar
```
to run the program and observe the result.
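If the jar's manifest does not declare a main class, pass the class name explicitly, followed by your job's own arguments. A sketch with placeholder names (MyJobMainClass, myinput and myoutput01 are not from the original article):

```bash
# class name and HDFS paths below are placeholders for your own job
bin/hadoop jar share/hadoop/mapreduce/yourtest.jar MyJobMainClass myinput myoutput01
```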