Hadoop is Java-based, so the first step is to install and configure a Java environment. Download the JDK from the official site; I used version 1.8. On a Mac you can use the scp command in the terminal to copy it to the Linux virtual machine.
danieldu@daniels-MacBook-Pro-857 ~/Downloads
scp jdk-8u121-linux-x64.tar.gz root@hadoop100:/opt/software
root@hadoop100's password:
danieldu@daniels-MacBook-Pro-857 ~/Downloads
Actually, I installed a gem of a tool on the Mac called Forklift. It can connect to the remote Linux machine over SFTP, and then you just drag files across as if you were working on your local computer. It also seems the config files no longer need to be edited with vi on Linux; you can open and edit them remotely with Sublime on the Mac :)
Then, in the Linux VM (logged in over ssh), extract it into /opt/modules:
[root@hadoop100 include]# tar -zxvf /opt/software/jdk-8u121-linux-x64.tar.gz -C /opt/modules/
Next, set up the environment variables: open /etc/profile, add JAVA_HOME, and extend PATH. Editing with vi works, or, if you have also installed a remote-editing tool like Forklift, that is even more convenient.
vi /etc/profile
Press Shift+G to jump to the end of the file, press i to enter insert mode, and add the content below. The main thing is to get the path right.
#JAVA_HOME
export JAVA_HOME=/opt/modules/jdk1.8.0_121
export PATH=$PATH:$JAVA_HOME/bin
Press ESC, then :wq to save and exit.
Run the following to make the changes take effect:
[root@hadoop100 include]# source /etc/profile
Check whether Java installed successfully; if the version information shows up, the installation worked.
[root@hadoop100 include]# java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
[root@hadoop100 include]#
Installing Hadoop is likewise just a matter of copying the Hadoop tarball to Linux, extracting it, and setting environment variables. Then use the xsync script prepared earlier to push the updates to the other machines in the cluster. If you don't know how xcall and xsync are written, see the earlier posts. With that, every machine in the cluster is set up.
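For readers without the earlier posts, here is a minimal sketch of what such an xsync script might look like (the host list and rsync options are my assumptions; the echo makes it a dry run):

```shell
#!/bin/bash
# xsync (sketch): push a file or directory to the other cluster nodes
# with rsync over ssh. Host names assumed from this cluster's layout.
hosts="hadoop101 hadoop102 hadoop103 hadoop104"

xsync() {
    local path
    path=$(readlink -f "$1")    # resolve to an absolute path
    for host in $hosts; do
        # echo = dry run; delete the echo to really transfer
        echo rsync -av "$path" "root@${host}:$(dirname "$path")/"
    done
}
```

Running `xsync /etc/profile` would then print the four rsync commands, one per node; removing the echo performs the actual copy.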
[root@hadoop100 include]# tar -zxvf /opt/software/hadoop-2.7.3.tar.gz -C /opt/modules/
Continue in /etc/profile, adding HADOOP_HOME:
[root@hadoop100 include]# vi /etc/profile
#JAVA_HOME
export JAVA_HOME=/opt/modules/jdk1.8.0_121
export PATH=$PATH:$JAVA_HOME/bin
#HADOOP_HOME
export HADOOP_HOME=/opt/modules/hadoop-2.7.3
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
[root@hadoop100 include]# source /etc/profile
Sync the changes to the other machines in the cluster:
[root@hadoop100 include]# xsync /etc/profile
[root@hadoop100 include]# xcall source /etc/profile
[root@hadoop100 include]# xsync hadoop-2.7.3/
Next, the Hadoop cluster itself needs to be configured. Here is how the roles are arranged across the cluster; admittedly hadoop100 ends up with a somewhat heavy load :)
Edit the JAVA_HOME setting in mapred-env.sh, yarn-env.sh, and hadoop-env.sh under /opt/modules/hadoop-2.7.3/etc/hadoop, setting it to the real absolute path:
export JAVA_HOME=/opt/modules/jdk1.8.0_121
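Rather than opening each file by hand, the edit can be scripted. A sketch (the helper name is mine; note that in stock yarn-env.sh and mapred-env.sh the JAVA_HOME line is commented out, which the append branch covers):

```shell
# Replace an existing "export JAVA_HOME=..." line, or append one if the
# file only has it commented out (as the stock env scripts do).
set_java_home() {
    local file=$1 jdk=$2
    if grep -q '^export JAVA_HOME=' "$file"; then
        sed -i "s|^export JAVA_HOME=.*|export JAVA_HOME=$jdk|" "$file"
    else
        echo "export JAVA_HOME=$jdk" >> "$file"
    fi
}

# Example usage against the three env scripts:
# for f in hadoop-env.sh mapred-env.sh yarn-env.sh; do
#     set_java_home /opt/modules/hadoop-2.7.3/etc/hadoop/$f /opt/modules/jdk1.8.0_121
# done
```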
Open /opt/modules/hadoop-2.7.3/etc/hadoop/core-site.xml and edit it to look like this:
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop100:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/modules/hadoop-2.7.3/data/tmp</value>
    </property>
</configuration>
Edit /opt/modules/hadoop-2.7.3/etc/hadoop/hdfs-site.xml and set the DFS replication factor to 5, since my cluster consists of 5 virtual machines and every machine acts as a DataNode. For now I am also putting the secondary NameNode on hadoop100. This is not ideal; it is better to keep it on a different machine from the primary NameNode.
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>5</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop100:50090</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
</configuration>
YARN is Hadoop's central resource-management service, and it runs on hadoop100. Edit /opt/modules/hadoop-2.7.3/etc/hadoop/yarn-site.xml:
<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop100</value>
    </property>
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
    </property>
</configuration>
So that the whole cluster can be started in one go, edit the slaves file (/opt/modules/hadoop-2.7.3/etc/hadoop/slaves) and add every machine in the cluster, one host per line:
hadoop100
hadoop101
hadoop102
hadoop103
hadoop104
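The start scripts read this file and ssh into each listed host. A quick hedged sketch to preview those connections (the echo makes it a dry run; the function name is mine):

```shell
# Print the ssh command that would be issued for each host in a slaves
# file, skipping blank lines. Drop the echo to actually connect.
list_slave_cmds() {
    local slaves_file=$1
    while read -r host; do
        [ -n "$host" ] && echo ssh "$host" hostname
    done < "$slaves_file"
}
```

`list_slave_cmds /opt/modules/hadoop-2.7.3/etc/hadoop/slaves` should print one ssh line per node; if a hostname is missing here, that node simply won't be started.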
Finally, once all the configuration changes are done on hadoop100, sync them to the other machines in the cluster:
xsync hadoop-2.7.3/
Before starting Hadoop, HDFS needs to be formatted (this initializes the NameNode metadata):
hdfs namenode -format
Then Hadoop can be started. The output below shows all five machines in the cluster, hadoop100 through hadoop104, coming up. jps lists the running Java processes.
[root@hadoop100 sbin]# ./start-dfs.sh
Starting namenodes on [hadoop100]
hadoop100: starting namenode, logging to /opt/modules/hadoop-2.7.3/logs/hadoop-root-namenode-hadoop100.out
hadoop101: starting datanode, logging to /opt/modules/hadoop-2.7.3/logs/hadoop-root-datanode-hadoop101.out
hadoop102: starting datanode, logging to /opt/modules/hadoop-2.7.3/logs/hadoop-root-datanode-hadoop102.out
hadoop100: starting datanode, logging to /opt/modules/hadoop-2.7.3/logs/hadoop-root-datanode-hadoop100.out
hadoop103: starting datanode, logging to /opt/modules/hadoop-2.7.3/logs/hadoop-root-datanode-hadoop103.out
hadoop104: starting datanode, logging to /opt/modules/hadoop-2.7.3/logs/hadoop-root-datanode-hadoop104.out
Starting secondary namenodes [hadoop100]
hadoop100: starting secondarynamenode, logging to /opt/modules/hadoop-2.7.3/logs/hadoop-root-secondarynamenode-hadoop100.out
[root@hadoop100 sbin]# jps
2945 NameNode
3187 SecondaryNameNode
3047 DataNode
3351 Jps
[root@hadoop100 sbin]# ./start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /opt/modules/hadoop-2.7.3/logs/yarn-root-resourcemanager-hadoop100.out
hadoop103: starting nodemanager, logging to /opt/modules/hadoop-2.7.3/logs/yarn-root-nodemanager-hadoop103.out
hadoop102: starting nodemanager, logging to /opt/modules/hadoop-2.7.3/logs/yarn-root-nodemanager-hadoop102.out
hadoop104: starting nodemanager, logging to /opt/modules/hadoop-2.7.3/logs/yarn-root-nodemanager-hadoop104.out
hadoop101: starting nodemanager, logging to /opt/modules/hadoop-2.7.3/logs/yarn-root-nodemanager-hadoop101.out
hadoop100: starting nodemanager, logging to /opt/modules/hadoop-2.7.3/logs/yarn-root-nodemanager-hadoop100.out
[root@hadoop100 sbin]# jps
3408 ResourceManager
2945 NameNode
3187 SecondaryNameNode
3669 Jps
3047 DataNode
3519 NodeManager
[root@hadoop100 sbin]#
Hadoop can be driven through its API; here, let's first use the command line to make sure the Hadoop environment is running properly.
There was a small hiccup along the way: when I tried to list the files on HDFS with the command below, the connection was refused.
[root@hadoop100 ~]# hadoop fs -ls
ls: Call From hadoop100/192.168.56.100 to hadoop100:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
It turned out that I had changed the XML configuration files mentioned earlier and forgotten to format. Remember to format again after changing the configuration. Note that formatting generates a new clusterID: if the DataNodes have already stored data, clear each node's data and log directories first, otherwise they will refuse to register with the newly formatted NameNode.
hdfs namenode -format
[root@hadoop100 sbin]# hadoop fs -ls /
[root@hadoop100 sbin]# hadoop fs -put ~/anaconda-ks.cfg /
[root@hadoop100 sbin]# hadoop fs -ls /
Found 1 items
-rw-r--r--   5 root supergroup       1233 2019-09-16 16:31 /anaconda-ks.cfg
[root@hadoop100 sbin]# hadoop fs -cat /anaconda-ks.cfg
(file contents)
[root@hadoop100 ~]# mkdir tmp
[root@hadoop100 ~]# hadoop fs -get /anaconda-ks.cfg ./tmp/
[root@hadoop100 ~]# ll tmp/
total 4
-rw-r--r--. 1 root root 1233 Sep 16 16:34 anaconda-ks.cfg
Hadoop ships with an example MapReduce program, wordcount, which counts the number of times each word appears in a file. I grabbed a file at random, anaconda-ks.cfg, and tried it out:
[root@hadoop100 ~]# hadoop jar /opt/modules/hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /anaconda-ks.cfg ~/tmp
19/09/16 16:43:28 INFO client.RMProxy: Connecting to ResourceManager at hadoop100/192.168.56.100:8032
19/09/16 16:43:29 INFO input.FileInputFormat: Total input paths to process : 1
19/09/16 16:43:29 INFO mapreduce.JobSubmitter: number of splits:1
19/09/16 16:43:30 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1568622576365_0001
19/09/16 16:43:30 INFO impl.YarnClientImpl: Submitted application application_1568622576365_0001
19/09/16 16:43:31 INFO mapreduce.Job: The url to track the job: http://hadoop100:8088/proxy/application_1568622576365_0001/
19/09/16 16:43:31 INFO mapreduce.Job: Running job: job_1568622576365_0001
19/09/16 16:43:49 INFO mapreduce.Job: Job job_1568622576365_0001 running in uber mode : false
19/09/16 16:43:49 INFO mapreduce.Job:  map 0% reduce 0%
19/09/16 16:43:58 INFO mapreduce.Job:  map 100% reduce 0%
19/09/16 16:44:10 INFO mapreduce.Job:  map 100% reduce 100%
19/09/16 16:44:11 INFO mapreduce.Job: Job job_1568622576365_0001 completed successfully
19/09/16 16:44:12 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=1470
		FILE: Number of bytes written=240535
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=1335
		HDFS: Number of bytes written=1129
		HDFS: Number of read operations=6
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=1
		Launched reduce tasks=1
		Rack-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=6932
		Total time spent by all reduces in occupied slots (ms)=7991
		Total time spent by all map tasks (ms)=6932
		Total time spent by all reduce tasks (ms)=7991
		Total vcore-milliseconds taken by all map tasks=6932
		Total vcore-milliseconds taken by all reduce tasks=7991
		Total megabyte-milliseconds taken by all map tasks=7098368
		Total megabyte-milliseconds taken by all reduce tasks=8182784
	Map-Reduce Framework
		Map input records=46
		Map output records=120
		Map output bytes=1704
		Map output materialized bytes=1470
		Input split bytes=102
		Combine input records=120
		Combine output records=84
		Reduce input groups=84
		Reduce shuffle bytes=1470
		Reduce input records=84
		Reduce output records=84
		Spilled Records=168
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=169
		CPU time spent (ms)=1440
		Physical memory (bytes) snapshot=300003328
		Virtual memory (bytes) snapshot=4159303680
		Total committed heap usage (bytes)=141471744
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=1233
	File Output Format Counters
		Bytes Written=1129
[root@hadoop100 ~]#
The corresponding application shows up in the web management UI:
In the results, "#" appears the most often, 12 times. No surprise: a lot of the file is comments.
[root@hadoop100 tmp]# hadoop fs -ls /root/tmp
Found 2 items
-rw-r--r--   5 root supergroup          0 2019-09-16 16:44 /root/tmp/_SUCCESS
-rw-r--r--   5 root supergroup       1129 2019-09-16 16:44 /root/tmp/part-r-00000
[root@hadoop100 tmp]# hadoop fs -cat /root/tmp/part-r-0000
cat: `/root/tmp/part-r-0000': No such file or directory
[root@hadoop100 tmp]# hadoop fs -cat /root/tmp/part-r-00000
#	12
#version=DEVEL	1
$6$JBLRSbsT070BPmiq$Of51A9N3Zjn/gZ23mLMlVs8vSEFL6ybkfJ1K1uJLAwumtkt1PaLcko1SSszN87FLlCRZsk143gLSV22Rv0zDr/	1
%addon	1
%anaconda	1
%end	3
%packages	1
--addsupport=zh_CN.UTF-8	1
--boot-drive=sda	1
--bootproto=dhcp	1
--device=enp0s3	1
--disable	1
--disabled="chronyd"	1
--emptyok	1
...
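As a sanity check, a similar count can be reproduced locally with standard Unix tools (a rough sketch: tokenization here is whitespace-based, close to but not guaranteed identical to what the Java example's StringTokenizer produces):

```shell
# Count word occurrences in a file, most frequent first,
# roughly mirroring what the wordcount example computes.
wordcount() {
    tr -s ' \t' '\n' < "$1" | grep -v '^$' | sort | uniq -c | sort -rn
}
```

Running `wordcount ~/anaconda-ks.cfg | head` on the local copy should likewise put "#" near the top of the list.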
The list of files in HDFS can also be browsed through the web UI: http://192.168.56.100:50070/explorer.html#
Hadoop has a lot more fun things waiting for me to discover. I'll come back and update in a few days.