Purpose: In the previous posts I studied some basic machine learning algorithms. In real enterprise applications the algorithms are the core, but the runtime environment and the data-processing platform are the foundation.
Approach: Build a simple Hadoop cluster (due to hardware limitations, it is built with virtual machines on my own laptop).
Environment:
Win10
VMware 15.0.0
3 Ubuntu virtual machines (1 as master, the other 2 as slave1 and slave2)
Hadoop 2.8.5
JDK 1.8
1. Install VMware and Ubuntu. Install one VM first; after the configuration is complete, clone it directly. (Not described in detail here; refer to other documentation.)
2. Basic Linux environment configuration
a) Create a user test that performs all installation-related operations:
sudo useradd -m test -s /bin/bash
sudo passwd test
b) Install basic software
1. Basic tools:
sudo apt-get install vim (editing tool)
sudo apt-get install openssh-client openssh-server (OpenSSH, for logging in to the servers via ssh)
sudo apt-get install nfs-common (for NFS mounting)
sudo apt-get install git (git tool)
2. Set up the NFS service on Ubuntu for mounting:
sudo apt-get install nfs-kernel-server (install the NFS server)
sudo mkdir /nfsroot; sudo chmod 777 /nfsroot (create /nfsroot as the mount directory)
sudo vim /etc/exports (configure the exported directory) and add the following line to /etc/exports:
/nfsroot *(rw,sync,no_root_squash)
sudo service nfs-kernel-server restart (restart the NFS service)
3. Set up the Samba service to share folders with Windows:
sudo apt-get install samba smbclient (install the necessary tools)
sudo vim /etc/samba/smb.conf (configure the Samba server) and add the following section to /etc/samba/smb.conf:
[nfsroot]
comment = nfsroot
path = /nfsroot
public = yes
guest ok = yes
browseable = yes
writeable = yes
sudo service smbd restart (restart the Samba service)
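A quick way to verify both services from another machine (optional; master is the hostname assigned later in this guide, and /mnt is just an example mount point):
showmount -e master (lists the NFS exports; /nfsroot should appear)
sudo mount -t nfs master:/nfsroot /mnt (mount the NFS export)
smbclient -L //master -N (list the Samba shares without a password)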
c) Configure passwordless SSH access between the servers (using public/private key pairs):
ssh-keygen -t rsa # press Enter at every prompt
cat id_rsa.pub >> authorized_keys # add the key to the authorized keys (run inside ~/.ssh)
Once all nodes have been cloned, test the ssh login: ssh test@192.168.xx.xxx
3. Install and configure the Java and Hadoop software
Download JDK 1.8, extract it to /opt/java, and configure the environment variables (test with java -version).
Download Hadoop 2.8.5, extract it to /opt/hadoop, and configure the environment variables (test with hadoop version).
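As a minimal sketch, the environment variables can be appended to ~/.bashrc; the JDK directory name below is an assumption, and HADOOP_HOME should point at wherever the archive was actually extracted (later sections use /opt/hadoop-2.8.5):
export JAVA_HOME=/opt/java/jdk1.8.0_181
export HADOOP_HOME=/opt/hadoop-2.8.5
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Then run source ~/.bashrc to apply the changes.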
4. Clone the configured Linux VM
VMware can clone virtual machines, and the clone carries over all of the configuration.
Set the hostname of each machine (the hostname command only takes effect until the next reboot; to make it permanent, also update /etc/hostname on each machine):
sudo hostname master (master node)
sudo hostname slave1 (slave node)
sudo hostname slave2 (slave node)
Configure a static IP address for each machine (a sketch follows the host list below) and add hostname resolution: edit the /etc/hosts file (vim /etc/hosts) and add the following entries:
127.0.0.1 localhost
192.168.61.100 master
192.168.61.101 slave1
192.168.61.102 slave2
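A static-IP sketch for the master, assuming the classic /etc/network/interfaces style of network configuration and an interface named ens33 (newer Ubuntu releases use netplan, and the interface name and gateway below are assumptions to adjust):
Edit /etc/network/interfaces and add:
auto ens33
iface ens33 inet static
address 192.168.61.100
netmask 255.255.255.0
gateway 192.168.61.2
Use 192.168.61.101 and 192.168.61.102 on slave1 and slave2, then restart networking or reboot the VM.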
You can configure one machine first and then copy the files to the other two with scp; the HDFS, YARN, and MapReduce configuration files below can be distributed the same way, as in the sketch that follows.
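For example, after editing the configuration on the master, the whole Hadoop configuration directory can be pushed to both slaves (assuming the /opt/hadoop-2.8.5 layout used below):
scp -r /opt/hadoop-2.8.5/etc/hadoop test@slave1:/opt/hadoop-2.8.5/etc/
scp -r /opt/hadoop-2.8.5/etc/hadoop test@slave2:/opt/hadoop-2.8.5/etc/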
5. Configure HDFS
In the Hadoop installation directory, edit ./etc/hadoop/core-site.xml:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/home/test/hadoop-2.8.5/hdfs/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
Configure HDFS: vim ./etc/hadoop/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/opt/hadoop-2.8.5/hdfs/name</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/opt/hadoop-2.8.5/hdfs/data</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>master:9001</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>
6. Configure YARN: vim ./etc/hadoop/yarn-site.xml
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:18040</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:18030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>master:18088</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:18025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>master:18141</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>3.0</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.nodemanager.localizer.address</name>
    <value>0.0.0.0:8040</value>
  </property>
  <property>
    <description>The address of the container manager in the NM.</description>
    <name>yarn.nodemanager.address</name>
    <value>0.0.0.0:8041</value>
  </property>
  <property>
    <description>NM Webapp address.</description>
    <name>yarn.nodemanager.webapp.address</name>
    <value>0.0.0.0:8042</value>
  </property>
</configuration>
7. Configure MapReduce: vim ./etc/hadoop/mapred-site.xml (if the file does not exist, copy it from mapred-site.xml.template first)
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>1024</value>
  </property>
</configuration>
8. Test:
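Two preparation steps that the steps above do not spell out but that a fresh cluster needs before the first start (a sketch, run in the Hadoop directory on the master): list the worker nodes in etc/hadoop/slaves, then format the NameNode.
vim ./etc/hadoop/slaves (one hostname per line: slave1 and slave2)
./bin/hdfs namenode -format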
On the master node, run ./sbin/start-all.sh
Use jps to check the NameNode on the master and the DataNode on the slaves; the results are shown below.
test@master:/opt/hadoop-2.8.5$ jps
8960 Jps
7940 NameNode
8373 ResourceManager
8206 SecondaryNameNode
The output on slave2 is as follows:
test@slave2:/opt/hadoop-2.8.5/logs$ jps
7301 Jps
6938 NodeManager
6767 DataNode
After the start-all.sh script has finished, you can run the wordcount example that ships with Hadoop.
1. Upload the input files to /wc_input in HDFS
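A sketch of this step (file1.txt and file2.txt are placeholder names; the original does not say which local files were used):
./bin/hdfs dfs -mkdir /wc_input
./bin/hdfs dfs -put file1.txt file2.txt /wc_input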
2. Run the example program:
./bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar wordcount /wc_input /wc_output.out7
3. The output is as follows:
18/10/21 16:13:18 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.61.100:18040
18/10/21 16:13:20 INFO input.FileInputFormat: Total input files to process : 2
18/10/21 16:13:20 INFO mapreduce.JobSubmitter: number of splits:2
18/10/21 16:13:20 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1540109557238_0001
18/10/21 16:13:21 INFO impl.YarnClientImpl: Submitted application application_1540109557238_0001
18/10/21 16:13:21 INFO mapreduce.Job: The url to track the job: http://master:18088/proxy/application_1540109557238_0001/
18/10/21 16:13:21 INFO mapreduce.Job: Running job: job_1540109557238_0001
18/10/21 16:13:35 INFO mapreduce.Job: Job job_1540109557238_0001 running in uber mode : false
18/10/21 16:13:35 INFO mapreduce.Job:  map 0% reduce 0%
18/10/21 16:13:42 INFO mapreduce.Job:  map 50% reduce 0%
18/10/21 16:13:46 INFO mapreduce.Job:  map 100% reduce 0%
18/10/21 16:13:51 INFO mapreduce.Job:  map 100% reduce 100%
18/10/21 16:13:52 INFO mapreduce.Job: Job job_1540109557238_0001 completed successfully
18/10/21 16:13:52 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=93
        FILE: Number of bytes written=473483
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=242
        HDFS: Number of bytes written=39
        HDFS: Number of read operations=9
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=2
        Launched reduce tasks=1
        Data-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=7691
        Total time spent by all reduces in occupied slots (ms)=3635
        Total time spent by all map tasks (ms)=7691
        Total time spent by all reduce tasks (ms)=3635
        Total vcore-milliseconds taken by all map tasks=7691
        Total vcore-milliseconds taken by all reduce tasks=3635
        Total megabyte-milliseconds taken by all map tasks=7875584
        Total megabyte-milliseconds taken by all reduce tasks=3722240
    Map-Reduce Framework
        Map input records=3
        Map output records=8
        Map output bytes=71
        Map output materialized bytes=99
        Input split bytes=203
        Combine input records=8
        Combine output records=8
        Reduce input groups=6
        Reduce shuffle bytes=99
        Reduce input records=8
        Reduce output records=6
        Spilled Records=16
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=178
        CPU time spent (ms)=2180
        Physical memory (bytes) snapshot=721473536
        Virtual memory (bytes) snapshot=5936779264
        Total committed heap usage (bytes)=474480640
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=39
    File Output Format Counters
        Bytes Written=39
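To inspect the word counts afterwards, read the reducer output file (part-r-00000 is the standard name MapReduce gives the single reducer's output):
./bin/hdfs dfs -cat /wc_output.out7/part-r-00000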
Note: Problems are unavoidable during configuration, installation, and execution; learn to read the logs to track them down.