The core design of the Hadoop framework consists of HDFS and MapReduce: HDFS provides storage for massive data sets, while MapReduce provides computation over them.
Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS).
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and suits applications with large data sets. HDFS relaxes some POSIX requirements to allow streaming access to file system data.
HDFS uses a master/slave architecture: an HDFS cluster consists of a single NameNode and a number of DataNodes. The NameNode acts as the master server, managing the file system namespace and client access to files; the DataNodes manage the data they store. HDFS exposes data in the form of files.
Internally, a file is split into blocks that are stored across a set of DataNodes. The NameNode executes namespace operations such as opening, closing, and renaming files and directories, and maintains the mapping of blocks to specific DataNodes. DataNodes serve read and write requests from file system clients, and create, delete, and replicate blocks under the NameNode's coordination. The NameNode manages all HDFS metadata; user data never flows through it.
Hadoop MapReduce is a clone of Google's MapReduce.
MapReduce is a computational model for processing data at large scale. Map applies a specified operation to independent elements of a data set, producing intermediate results as key-value pairs; Reduce then aggregates all values that share the same key into the final result. This division of work makes MapReduce well suited to data processing in a distributed, parallel environment built from many machines.
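As a concrete illustration of the map/reduce split, Hadoop ships a streaming jar that lets plain shell commands play the two roles. This is only a sketch: it assumes the 2.7.3 layout installed later in this walkthrough and an input file already present in HDFS (the same /anaconda-ks.cfg used in the test at the end).

# Map: emit one word per line (each word becomes a key).
# Reduce: reducer input arrives sorted by key, so uniq -c yields per-word counts.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
    -input /anaconda-ks.cfg -output /wc-streaming \
    -mapper 'tr -s "[:space:]" "\n"' \
    -reducer 'uniq -c'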
Hadoop MapReduce uses a Master/Slave (M/S) architecture, consisting mainly of the following components: Client, JobTracker, TaskTracker, and Task.
# Download the distribution
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
# Unpack and install it
tar xf hadoop-2.7.3.tar.gz && mv hadoop-2.7.3 /usr/local/hadoop
# Create the data directories
mkdir -p /home/hadoop/{name,data,log,journal}
Create the file /etc/profile.d/hadoop.sh:
# HADOOP ENV
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Make the Hadoop environment variables take effect:
source /etc/profile.d/hadoop.sh
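A quick way to confirm the variables took effect is to check that the hadoop binary now resolves on the PATH:

hadoop version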
Edit the file /usr/local/hadoop/etc/hadoop/hadoop-env.sh and modify the following fields:
export JAVA_HOME=/usr/java/default
export HADOOP_HOME=/usr/local/hadoop
Edit the file /usr/local/hadoop/etc/hadoop/yarn-env.sh and modify the following field:
export JAVA_HOME=/usr/java/default
Edit the file /usr/local/hadoop/etc/hadoop/slaves:
datanode01
datanode02
datanode03
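These hostnames (and the NameNode's) must resolve on every host. If no DNS is available, /etc/hosts entries along these lines will do; the DataNode IPs below are illustrative, only 192.168.1.200 for namenode01 appears later in this walkthrough:

192.168.1.200  namenode01
192.168.1.201  datanode01   # illustrative
192.168.1.202  datanode02   # illustrative
192.168.1.203  datanode03   # illustrative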
Edit the file /usr/local/hadoop/etc/hadoop/core-site.xml as follows:
<configuration>
    <!-- Default file system URI; the nameservice is cluster1 -->
    <property>
        <name>fs.default.name</name>
        <value>hdfs://cluster1:9000</value>
    </property>
    <!-- Hadoop temporary directory -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/data</value>
    </property>
    <!-- ZooKeeper quorum addresses -->
    <property>
        <name>ha.zookeeper.quorum</name>
        <value>zk01:2181,zk02:2181,zk03:2181</value>
    </property>
    <!-- Disable HDFS permission checking -->
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
    <!-- Set the I/O buffer size to 128 KB (131072 bytes) -->
    <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
    </property>
</configuration>
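Note that fs.default.name is the deprecated alias of fs.defaultFS in Hadoop 2.x; both still work. To sanity-check which default file system a client will actually use, hdfs getconf can print the effective value:

hdfs getconf -confKey fs.defaultFS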
Edit the file /usr/local/hadoop/etc/hadoop/hdfs-site.xml as follows:
<configuration>
    <!-- Directory where the NameNode stores its metadata -->
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/hadoop/name</value>
    </property>
    <!-- Directory where DataNodes store their blocks -->
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/hadoop/data</value>
    </property>
    <!-- Default HDFS replication factor -->
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <!-- Enable the HDFS web management interface (WebHDFS) -->
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
    <!-- HDFS nameservice, cluster1; must match core-site.xml -->
    <property>
        <name>dfs.nameservices</name>
        <value>cluster1</value>
    </property>
</configuration>
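The replication factor set here is only the default for newly written files; it can also be changed per path afterwards with the standard fs command (the path below is illustrative):

# Set the replication of an existing file to 2; -w waits until replication completes
hadoop fs -setrep -w 2 /path/to/file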
Edit the file /usr/local/hadoop/etc/hadoop/mapred-site.xml as follows:
<configuration>
    <!-- Run MapReduce on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <!-- Local scratch directory for MapReduce; multiple disks may be listed, comma-separated -->
    <property>
        <name>mapred.local.dir</name>
        <value>/home/hadoop/data</value>
    </property>
    <!-- Administrator-specified heap size for map tasks -->
    <property>
        <name>mapreduce.admin.map.child.java.opts</name>
        <value>-Xmx256m</value>
    </property>
    <!-- Administrator-specified heap size for reduce tasks -->
    <property>
        <name>mapreduce.admin.reduce.child.java.opts</name>
        <value>-Xmx4096m</value>
    </property>
    <!-- JVM memory for each TaskTracker child process -->
    <property>
        <name>mapred.child.java.opts</name>
        <value>-Xmx512m</value>
    </property>
    <!-- Raise the task timeout to avoid spurious failures; the default is 600000 ms, i.e. 600 s -->
    <property>
        <name>mapred.task.timeout</name>
        <value>1200000</value>
        <final>true</final>
    </property>
    <!-- Hosts excluded from the NameNode; used to decommission nodes dynamically -->
    <property>
        <name>dfs.hosts.exclude</name>
        <value>slaves.exclude</value>
    </property>
    <!-- TaskTrackers excluded from the JobTracker; useful when removing nodes -->
    <property>
        <name>mapred.hosts.exclude</name>
        <value>slaves.exclude</value>
    </property>
</configuration>
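The two exclude-list properties above point at a file named slaves.exclude. A hypothetical decommissioning flow using it would look like the following; note that dfs.hosts.exclude is normally given an absolute path:

# Add the node to the exclude file, then tell the NameNode to re-read it
echo datanode03 >> /usr/local/hadoop/etc/hadoop/slaves.exclude
hdfs dfsadmin -refreshNodes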
Edit the file /usr/local/hadoop/etc/hadoop/yarn-site.xml as follows:
<configuration>
    <!-- Hostname of the ResourceManager -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>namenode01</value>
    </property>
    <!-- Address the RM exposes to clients, used to submit or kill applications -->
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>${yarn.resourcemanager.hostname}:8032</value>
    </property>
    <!-- Address the RM exposes to ApplicationMasters, used to request and release resources -->
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>${yarn.resourcemanager.hostname}:8030</value>
    </property>
    <!-- Web HTTP address of the RM; cluster information can be viewed here in a browser -->
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>${yarn.resourcemanager.hostname}:8088</value>
    </property>
    <!-- Address the RM exposes to NodeManagers, used for heartbeats and task assignment -->
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>${yarn.resourcemanager.hostname}:8031</value>
    </property>
    <!-- Address the RM exposes to administrators for management commands -->
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>${yarn.resourcemanager.hostname}:8033</value>
    </property>
    <!-- Scheduler implementation; FIFO, Capacity Scheduler, and Fair Scheduler are available -->
    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
    </property>
    <!-- Maximum memory (MB) a single container may request; requests above it are rejected,
         requests below the minimum are rounded up to the minimum -->
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>8182</value>
    </property>
    <!-- Minimum memory (MB) a single container may request -->
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>2048</value>
    </property>
    <!-- Maximum virtual CPU cores a single container may request; e.g. with a minimum of 1
         and a maximum of 4, each MapReduce task may request between 1 and 4 vcores -->
    <property>
        <name>yarn.scheduler.maximum-allocation-vcores</name>
        <value>512</value>
    </property>
    <!-- Enable log aggregation -->
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <!-- How long aggregated logs are kept on HDFS (seconds) -->
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
    </property>
    <!-- Total virtual CPU cores available to containers on each NodeManager (default 8) -->
    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>12</value>
    </property>
    <!-- Physical memory (MB) available to containers on each NodeManager;
         cannot be changed at runtime once set (default 8192) -->
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>8192</value>
    </property>
    <!-- Whether to kill tasks that exceed their virtual memory allocation (default true) -->
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
    <!-- Whether to kill tasks that exceed their physical memory allocation (default true) -->
    <property>
        <name>yarn.nodemanager.pmem-check-enabled</name>
        <value>false</value>
    </property>
    <!-- Virtual memory allowed per 1 MB of physical memory (default 2.1) -->
    <property>
        <name>yarn.nodemanager.vmem-pmem-ratio</name>
        <value>2.1</value>
    </property>
    <!-- Disk utilization threshold; a disk above this percentage is marked bad
         and no longer used (default 100, i.e. 100%) -->
    <property>
        <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
        <value>98.0</value>
    </property>
    <!-- Auxiliary service required to run MapReduce programs -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- Have each NodeManager load the shuffle server at startup -->
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
</configuration>
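A quick sanity check of the numbers above: with yarn.nodemanager.resource.memory-mb at 8192 and yarn.scheduler.minimum-allocation-mb at 2048, each NodeManager can host at most four minimum-sized containers. Once YARN is running, the registered NodeManagers can be listed with the standard CLI:

yarn node -list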
Distribute the configuration to all DataNodes and set ownership and permissions:

cd /usr/local/hadoop/etc/hadoop
scp * datanode01:/usr/local/hadoop/etc/hadoop
scp * datanode02:/usr/local/hadoop/etc/hadoop
scp * datanode03:/usr/local/hadoop/etc/hadoop
chown -R hadoop:hadoop /usr/local/hadoop
chmod 755 /usr/local/hadoop/etc/hadoop
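The scp commands above assume the user on namenode01 can already reach each DataNode over SSH without a password; if not, a typical one-time setup looks like this:

# Generate a key pair once, then copy the public key to each DataNode
ssh-keygen -t rsa
ssh-copy-id datanode01
ssh-copy-id datanode02
ssh-copy-id datanode03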
On namenode01, format HDFS and start the NameNode:

hdfs namenode -format
hadoop-daemon.sh start namenode
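Before restarting the whole cluster, you can confirm that the NameNode answers RPC requests (it will report zero DataNodes until the others are started):

hdfs dfsadmin -report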
Restart the full cluster so every daemon comes up:

stop-all.sh
start-all.sh
[root@namenode01 ~]# jps
17419 NameNode
17780 ResourceManager
18152 Jps

[root@datanode01 ~]# jps
2227 DataNode
1292 QuorumPeerMain
2509 Jps
2334 NodeManager

[root@datanode02 ~]# jps
13940 QuorumPeerMain
18980 DataNode
19093 NodeManager
19743 Jps

[root@datanode03 ~]# jps
19238 DataNode
19350 NodeManager
14215 QuorumPeerMain
20014 Jps
Visit http://192.168.1.200:50070/ to view the HDFS NameNode web UI.
Visit http://192.168.1.200:8088/ to view the YARN ResourceManager web UI.
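Because dfs.webhdfs.enabled was set to true above, HDFS can also be queried over REST through the NameNode's web port; for example, listing the root directory via the standard WebHDFS API:

curl -s "http://192.168.1.200:50070/webhdfs/v1/?op=LISTSTATUS"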
Upload a test file to HDFS:

[root@namenode01 ~]# hadoop fs -put /root/anaconda-ks.cfg /anaconda-ks.cfg
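To confirm the upload landed, list the HDFS root:

hadoop fs -ls /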
[root@namenode01 ~]# cd /usr/local/hadoop/share/hadoop/mapreduce/
[root@namenode01 mapreduce]# hadoop jar hadoop-mapreduce-examples-2.7.3.jar wordcount /anaconda-ks.cfg /test
18/11/17 00:04:45 INFO client.RMProxy: Connecting to ResourceManager at namenode01/192.168.1.200:8032
18/11/17 00:04:45 INFO input.FileInputFormat: Total input paths to process : 1
18/11/17 00:04:45 INFO mapreduce.JobSubmitter: number of splits:1
18/11/17 00:04:45 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1541095016765_0004
18/11/17 00:04:46 INFO impl.YarnClientImpl: Submitted application application_1541095016765_0004
18/11/17 00:04:46 INFO mapreduce.Job: The url to track the job: http://namenode01:8088/proxy/application_1541095016765_0004/
18/11/17 00:04:46 INFO mapreduce.Job: Running job: job_1541095016765_0004
18/11/17 00:04:51 INFO mapreduce.Job: Job job_1541095016765_0004 running in uber mode : false
18/11/17 00:04:51 INFO mapreduce.Job:  map 0% reduce 0%
18/11/17 00:04:55 INFO mapreduce.Job:  map 100% reduce 0%
18/11/17 00:04:59 INFO mapreduce.Job:  map 100% reduce 100%
18/11/17 00:04:59 INFO mapreduce.Job: Job job_1541095016765_0004 completed successfully
18/11/17 00:04:59 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=1222
		FILE: Number of bytes written=241621
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=1023
		HDFS: Number of bytes written=941
		HDFS: Number of read operations=6
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=1758
		Total time spent by all reduces in occupied slots (ms)=2125
		Total time spent by all map tasks (ms)=1758
		Total time spent by all reduce tasks (ms)=2125
		Total vcore-milliseconds taken by all map tasks=1758
		Total vcore-milliseconds taken by all reduce tasks=2125
		Total megabyte-milliseconds taken by all map tasks=1800192
		Total megabyte-milliseconds taken by all reduce tasks=2176000
	Map-Reduce Framework
		Map input records=38
		Map output records=90
		Map output bytes=1274
		Map output materialized bytes=1222
		Input split bytes=101
		Combine input records=90
		Combine output records=69
		Reduce input groups=69
		Reduce shuffle bytes=1222
		Reduce input records=69
		Reduce output records=69
		Spilled Records=138
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=99
		CPU time spent (ms)=970
		Physical memory (bytes) snapshot=473649152
		Virtual memory (bytes) snapshot=4921606144
		Total committed heap usage (bytes)=441450496
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=922
	File Output Format Counters
		Bytes Written=941
[root@namenode01 mapreduce]# hadoop fs -cat /test/part-r-00000
#	11
#version=DEVEL	1
$6$kRQ2y1nt/B6c6ETs$ITy0O/E9P5p0ePWlHJ7fRTqVrqGEQf7ZGi5IX2pCA7l25IdEThUNjxelq6wcD9SlSa1cGcqlJy2jjiV9/lMjg/	1
%addon	1
%end	2
%packages	1
--all	1
--boot-drive=sda	1
--bootproto=dhcp	1
--device=enp1s0	1
--disable	1
--drives=sda	1
--enable	1
--enableshadow	1
--hostname=localhost.localdomain	1
--initlabel	1
--ipv6=auto	1
--isUtc	1
--iscrypted	1
--location=mbr	1
--onboot=off	1
--only-use=sda	1
--passalgo=sha512	1
--reserve-mb='auto'	1
--type=lvm	1
--vckeymap=cn	1
--xlayouts='cn'	1
@^minimal	1
@core	1
Agent	1
Asia/Shanghai	1
CDROM	1
Keyboard	1
Network	1
Partition	1
Root	1
Run	1
Setup	1
System	4
Use	2
auth	1
authorization	1
autopart	1
boot	1
bootloader	2
cdrom	1
clearing	1
clearpart	1
com_redhat_kdump	1
configuration	1
first	1
firstboot	1
graphical	2
ignoredisk	1
information	3
install	1
installation	1
keyboard	1
lang	1
language	1
layouts	1
media	1
network	2
on	1
password	1
rootpw	1
the	1
timezone	2
zh_CN.UTF-8	1
Show fs command help:            hadoop fs -help
Show HDFS disk usage:            hadoop fs -df -h
Create a directory:              hadoop fs -mkdir
Upload a local file:             hadoop fs -put
List files:                      hadoop fs -ls
Show file contents:              hadoop fs -cat
Copy a file:                     hadoop fs -cp
Download an HDFS file locally:   hadoop fs -get
Move a file:                     hadoop fs -mv
Delete a file:                   hadoop fs -rm -r -f
Delete a directory:              hadoop fs -rm -r