Download the hadoop-2.8.0 tarball from the official Hadoop website onto hadoop1, then place it under /opt and extract it:
```
$ gunzip hadoop-2.8.0.tar.gz
$ tar -xvf hadoop-2.8.0.tar
```
Then change the ownership of the hadoop-2.8.0 directory so that both the hdfs and yarn users can read and write it:
```
# chown -R hdfs:hadoop /opt/hadoop-2.8.0
```
Edit /etc/profile:
```bash
export HADOOP_HOME=/opt/hadoop-2.8.0
export HADOOP_PREFIX=/opt/hadoop-2.8.0
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export LD_LIBRARY_PATH=$JAVA_HOME/jre/lib/amd64/server
export PATH=${HADOOP_HOME}/bin:$PATH
```
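A quick way to apply and sanity-check these settings (assuming the JDK is already installed and JAVA_HOME is set elsewhere in /etc/profile):

```bash
# Reload the profile in the current shell and confirm the hadoop binary resolves.
source /etc/profile
which hadoop     # should print /opt/hadoop-2.8.0/bin/hadoop
hadoop version   # should report Hadoop 2.8.0
```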
Edit /opt/hadoop-2.8.0/etc/hadoop/hadoop-env.sh:
```bash
#export JAVA_HOME=/usr/local/java/jdk1.8.0_121
#export HADOOP_HOME=/opt/hadoop/hadoop-2.7.3
# Maximum heap size of the Hadoop daemons (namenode/datanode/secondarynamenode, etc.), default 1000 MB
#export HADOOP_HEAPSIZE=
# Initial heap size of the namenode; defaults to the value above, allocate as needed
#export HADOOP_NAMENODE_INIT_HEAPSIZE=""
# JVM startup options, empty by default
#export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"
# Memory can also be configured per component:
#export HADOOP_NAMENODE_OPTS=
#export HADOOP_DATANODE_OPTS=
#export HADOOP_SECONDARYNAMENODE_OPTS=
# Hadoop log directory, default $HADOOP_HOME/logs
#export HADOOP_LOG_DIR=${HADOOP_LOG_DIR}/$USER
export HADOOP_LOG_DIR=/var/log/hadoop/
```
Set these parameters according to your own system plan. Note that the NameNode's block map and namespace both live inside its heap, so production environments should use a larger heapsize.

Also watch the total memory used by all components; in production, leave 5-15% of memory (typically about 10 GB) for the Linux OS.
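As a rough sketch of that budgeting (the 64 GB node size and the heap values below are purely hypothetical, not taken from this cluster):

```bash
# Hypothetical sizing for a 64 GB master node; derive real values from your block/namespace counts.
TOTAL_MB=65536
OS_RESERVE_MB=$((TOTAL_MB * 10 / 100))           # keep ~10% of RAM for the Linux OS
DAEMON_BUDGET_MB=$((TOTAL_MB - OS_RESERVE_MB))   # what remains for Hadoop daemon heaps
echo "OS reserve: ${OS_RESERVE_MB} MB, daemon budget: ${DAEMON_BUDGET_MB} MB"

# The budget would then be split across daemons in hadoop-env.sh, for example:
# export HADOOP_NAMENODE_OPTS="-Xms32g -Xmx32g $HADOOP_NAMENODE_OPTS"
# export HADOOP_DATANODE_OPTS="-Xms4g -Xmx4g $HADOOP_DATANODE_OPTS"
```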
Edit /opt/hadoop-2.8.0/etc/hadoop/yarn-env.sh:
```bash
#export JAVA_HOME=/usr/local/java/jdk1.8.0_121
#JAVA_HEAP_MAX=-Xmx1000m
#YARN_HEAPSIZE=1000                         # heap size for the YARN daemons
#export YARN_RESOURCEMANAGER_HEAPSIZE=1000  # heap size for the ResourceManager only
#export YARN_TIMELINESERVER_HEAPSIZE=1000   # heap size for the TimelineServer (jobhistoryServer) only
#export YARN_RESOURCEMANAGER_OPTS=          # JVM options for the ResourceManager only
#export YARN_NODEMANAGER_HEAPSIZE=1000      # heap size for the NodeManager only
#export YARN_NODEMANAGER_OPTS=              # JVM options for the NodeManager only
export YARN_LOG_DIR=/var/log/yarn           # YARN log directory
```
Configure these according to your environment; they are left unset here. In production, pay attention to the JVM options and the log file locations.
Edit /opt/hadoop-2.8.0/etc/hadoop/mapred-env.sh:

```bash
# export JAVA_HOME=/home/y/libexec/jdk1.6.0/
#export HADOOP_JOB_HISTORYSERVER_HEAPSIZE=1000
#export HADOOP_MAPRED_ROOT_LOGGER=INFO,RFA
#export HADOOP_JOB_HISTORYSERVER_OPTS=
#export HADOOP_MAPRED_LOG_DIR=""       # Where log files are stored. $HADOOP_MAPRED_HOME/logs by default.
#export HADOOP_JHS_LOGGER=INFO,RFA     # Hadoop JobSummary logger.
#export HADOOP_MAPRED_PID_DIR=         # The pid files are stored. /tmp by default.
#export HADOOP_MAPRED_IDENT_STRING=    # A string representing this instance of hadoop. $USER by default.
#export HADOOP_MAPRED_NICENESS=        # The scheduling priority for daemons. Defaults to 0.
export HADOOP_MAPRED_LOG_DIR=/var/log/yarn
```
Configure these according to your environment; they are left unset here. In production, pay attention to the JVM options and the log file locations.
All of the settings below can be found in the official documentation at http://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-common/ClusterSetup.html. Start with core-site.xml:
```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop1:9000</value>
    <description>HDFS address and port</description>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
  <property>
    <name>fs.trash.interval</name>
    <value>1440</value>
    <description>Enable the HDFS trash; deleted files are kept for 1440 minutes</description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop-2.8.0/tmp</value>
    <description>Defaults to /tmp/hadoop-${user.name}; change it to a persistent directory</description>
  </property>
</configuration>
```
core-site.xml has a great many parameters, but only a couple of them (fs.defaultFS and hadoop.tmp.dir) need changing to get the cluster started; see the official documentation for the rest.
For hdfs-site.xml, only the following parameters are set here:
```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Number of block replicas; 3 is recommended in production</description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/opt/hadoop-2.8.0/namenodedir</value>
    <description>Directory holding the NameNode metadata; put it on RAID in production</description>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
    <description>Block size, 128 MB; tune for your workload, use a larger value if you mostly store large files</description>
  </property>
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>100</value>
    <description>Number of RPC handler threads on the NameNode; use a larger value for large clusters</description>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/opt/hadoop-2.8.0/datadir</value>
    <description>Directories where the DataNode stores data; in production list one path per disk, RAID is not recommended</description>
  </property>
</configuration>
```
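Once HDFS is running (see the startup steps further down), the value that actually took effect for any of these keys can be checked with the standard getconf tool, for example:

```bash
# Ask the configuration for the effective values of the keys set above.
hdfs getconf -confKey dfs.blocksize     # expect 134217728
hdfs getconf -confKey dfs.replication   # expect 1
```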
For yarn-site.xml, only the following parameters are set here; refer to the official documentation for the others.
```xml
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop1</value>
    <description>The ResourceManager node</description>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
    <description>Auxiliary service run by the NodeManager</description>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>32</value>
    <description>Minimum size of a container, in MB</description>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>128</value>
    <description>Maximum size of a container, in MB</description>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>1024</value>
    <description>Maximum memory, in MB, that this NodeManager offers to containers</description>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/home/yarn/nm-local-dir</value>
    <description>NodeManager local directories</description>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>1</value>
    <description>CPUs available on each NodeManager machine; the default -1 lets YARN auto-detect the count, but detection does not work here, and the machine actually has 8</description>
  </property>
</configuration>
```
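With the values above, the container arithmetic works out roughly as follows (just an illustration of how the three memory settings interact):

```bash
# Rough container math for the yarn-site.xml values above.
NM_MEM=1024     # yarn.nodemanager.resource.memory-mb
MIN_ALLOC=32    # yarn.scheduler.minimum-allocation-mb
MAX_ALLOC=128   # yarn.scheduler.maximum-allocation-mb
echo "containers per node at the minimum size: $((NM_MEM / MIN_ALLOC))"   # 32
echo "largest request a single container may make: ${MAX_ALLOC} MB"
```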
For production, set the following.
ResourceManager parameters:
Parameter | Value | Notes |
---|---|---|
yarn.resourcemanager.address | ResourceManager host:port for clients to submit jobs. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. Address of the ResourceManager, in host:port form. |
yarn.resourcemanager.scheduler.address | ResourceManager host:port for ApplicationMasters to talk to Scheduler to obtain resources. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. Scheduler address. |
yarn.resourcemanager.resource-tracker.address | ResourceManager host:port for NodeManagers. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. Port on which NodeManagers report to the RM. |
yarn.resourcemanager.admin.address | ResourceManager host:port for administrative commands. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. RM admin address. |
yarn.resourcemanager.webapp.address | ResourceManager web-ui host:port. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. RM web UI address; has a default value. |
yarn.resourcemanager.hostname | ResourceManager host. | host. Single hostname that can be set in place of setting all yarn.resourcemanager*address resources. Results in default ports for ResourceManager components. The RM host; components then use their default ports. |
yarn.resourcemanager.scheduler.class | ResourceManager Scheduler class. | CapacityScheduler (recommended), FairScheduler (also recommended), or FifoScheduler |
yarn.scheduler.minimum-allocation-mb | Minimum limit of memory to allocate to each container request at the Resource Manager. | In MBs. Minimum container memory (smallest allocation per container). |
yarn.scheduler.maximum-allocation-mb | Maximum limit of memory to allocate to each container request at the Resource Manager. | In MBs. Maximum container memory (largest allocation per container). |
yarn.resourcemanager.nodes.include-path / yarn.resourcemanager.nodes.exclude-path | List of permitted/excluded NodeManagers. | If necessary, use these files to control the list of allowable NodeManagers. Which NodeManagers the RM may manage. |
NodeManager parameters:
Parameter | Value | Notes |
---|---|---|
yarn.nodemanager.resource.memory-mb | Resource i.e. available physical memory, in MB, for given NodeManager | Defines total available resources on the NodeManager to be made available to running containers. Maximum memory YARN may use on this NodeManager. |
yarn.nodemanager.vmem-pmem-ratio | Maximum ratio by which virtual memory usage of tasks may exceed physical memory | The virtual memory usage of each task may exceed its physical memory limit by this ratio. The total amount of virtual memory used by tasks on the NodeManager may exceed its physical memory usage by this ratio. Tasks whose virtual memory exceeds this multiple of their allowed physical memory are killed. |
yarn.nodemanager.local-dirs | Comma-separated list of paths on the local filesystem where intermediate data is written. | Multiple paths help spread disk i/o. Directories where intermediate data of running MR tasks is written; spreading them over several disks is recommended. |
yarn.nodemanager.log-dirs | Comma-separated list of paths on the local filesystem where logs are written. | Multiple paths help spread disk i/o. Directories for MR task logs; spreading them over several disks is recommended. |
yarn.nodemanager.log.retain-seconds | 10800 | Default time (in seconds) to retain log files on the NodeManager Only applicable if log-aggregation is disabled. |
yarn.nodemanager.remote-app-log-dir | /logs | HDFS directory where the application logs are moved on application completion. Need to set appropriate permissions. Only applicable if log-aggregation is enabled. |
yarn.nodemanager.aux-services | mapreduce_shuffle | Shuffle service that needs to be set for Map Reduce applications. Type of shuffle service. |
mapred-site.xml:

```xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
    <description>Run MapReduce on YARN</description>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop2:10020</value>
    <description>JobHistory server address</description>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop2:19888</value>
    <description>JobHistory web UI address</description>
  </property>
  <property>
    <name>mapreduce.jobhistory.intermediate-done-dir</name>
    <value>/opt/hadoop/hadoop-2.8.0/mrHtmp</value>
    <description>Directory for history data of running MR jobs</description>
  </property>
  <property>
    <name>mapreduce.jobhistory.done-dir</name>
    <value>/opt/hadoop/hadoop-2.8.0/mrhHdone</value>
    <description>Directory for history data of completed MR jobs</description>
  </property>
</configuration>
```
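Since the history addresses above point at hadoop2, the JobHistory server would be started there once the cluster is up; a minimal sketch using the stock daemon script:

```bash
# Run on hadoop2: start the MapReduce JobHistory server configured above.
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver
```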
List the slave nodes in /opt/hadoop-2.8.0/etc/hadoop/slaves:
```
hadoop3
hadoop4
hadoop5
```
Copy /etc/profile and /opt/* to the other nodes:
```
$ scp hdfs@hadoop1:/etc/profile /etc
$ scp -r hdfs@hadoop1:/opt/* /opt/
```
It is better to compress before transferring…
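For example, a sketch of compressing on the fly while copying (run from hadoop1; hadoop3 stands in for each target node):

```bash
# Stream a compressed copy of the Hadoop directory straight into /opt on a worker node.
tar -czf - -C /opt hadoop-2.8.0 | ssh hdfs@hadoop3 "tar -xzf - -C /opt"
```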
Format the NameNode:

```
$HADOOP_HOME/bin/hdfs namenode -format
```
As the hdfs user, start the whole HDFS cluster with $HADOOP_HOME/sbin/start-dfs.sh, or start the daemons individually:
```
$HADOOP_HOME/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode   # start a single namenode
$HADOOP_HOME/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start datanode   # start a single datanode
```
Startup logs are written under $HADOOP_HOME/logs; the log path can be changed in hadoop-env.sh.
To verify, open http://hadoop1:50070 or run `hdfs dfs -mkdir /test`.
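A couple of extra sanity checks with standard commands:

```bash
jps                     # hadoop1 should show a NameNode; hadoop3-5 should show DataNodes
hdfs dfsadmin -report   # should list the live DataNodes and their capacity
hdfs dfs -ls /          # the /test directory created above should appear
```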
As the yarn user, start the ResourceManager on hadoop1:
```
$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager
```
As the yarn user, start the NodeManager on hadoop3, hadoop4, and hadoop5:
```
$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager
```
If the slaves file is in place and passwordless SSH is configured for the yarn user, the whole YARN cluster can be started from any one node by running start-yarn.sh.
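Once YARN is up, a quick check with the standard client:

```bash
yarn node -list   # hadoop3, hadoop4 and hadoop5 should show up as RUNNING NodeManagers
```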
Then open the RM web UI (http://hadoop1:8088 by default).
If startup fails, look for the cause in the logs under the YARN_LOG_DIR set in yarn-env.sh, and check the permissions of the directories YARN uses at startup.
Next, run one of the bundled MapReduce examples as a test:

```
[hdfs@hadoop1 hadoop-2.8.0]$ hdfs dfs -mkdir -p /user/hdfs/input
[hdfs@hadoop1 hadoop-2.8.0]$ hdfs dfs -put etc/hadoop/ /user/hdfs/input
[hdfs@hadoop1 hadoop-2.8.0]$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar grep input output 'dfs[a-z.]+'
17/06/27 04:16:45 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/hdfs/.staging/job_1498507021248_0003
java.io.IOException: org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=1536, maxMemory=128
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:279)
```
And it fails: the requested memory is 1536 MB while the maximum is 128 MB. 1536 MB is the default minimum resource an MR job requests, but is 128 MB really the maximum? My cluster clearly has 3 GB of resources. The message is misleading: the error is raised when a container's maximum size cannot satisfy the minimum memory a single map task asks for, and what it reports is the container memory limit, not the cluster's total memory. With the current configuration each container may use at least 32 MB and at most 128 MB, while a map task by default requests at least 1024 MB.
Now, lower the minimum resources used by each map and reduce task.
Edit mapred-site.xml and add:
```xml
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>128</value>
  <description>Minimum memory used by a map task</description>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>128</value>
  <description>Minimum memory used by a reduce task</description>
</property>
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>128</value>
  <description>Default memory used by the MapReduce ApplicationMaster</description>
</property>
```
Run it again:
```
[hdfs@hadoop1 hadoop-2.8.0]$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar grep input output 'dfs[a-z.]+'
17/06/27 05:04:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
……..
17/06/27 05:04:36 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/hdfs/.staging/job_1498510574463_0006
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://hadoop1:9000/user/hdfs/grep-temp-2069706136
……..
```
The resource problem is solved, which confirms my reading of the error. But this time there is a different error: a missing directory. I tested the same thing on 2.6.3 and 2.7.3 without hitting this problem, so I will set it aside for now; I suspect the examples jar is at fault and will verify MR availability by other means later.
You may have noticed that every time an hdfs command runs, it prints:
WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
This is because the native library cannot be used. Hadoop relies on a number of native Linux libraries, such as zlib, to improve performance.
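To see which native libraries the current build can actually load, the standard checknative command helps:

```bash
# Reports whether native hadoop, zlib, snappy, lz4, bzip2 and openssl support is available.
hadoop checknative -a
```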
For more on the native libraries, see my other article:
As for the parameters, I will cover the more important ones in a separate article.
Next up: setting up HDFS HA.