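(1) Requirement: count the occurrences of each word in 1 GB of data. The cluster has 3 servers, each with 4 GB of memory and a 4-core, 4-thread CPU.
(2) Requirements analysis:
1 GB / 128 MB = 8 MapTasks; 1 ReduceTask; 1 MRAppMaster
That is 10 tasks in total, so each node runs about 10 / 3 ≈ 3 tasks (4, 3, 3).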
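The arithmetic above can be checked with a small script. This is only a sketch: it assumes one MapTask per 128 MB split of a single input file, and the function names `map_tasks` and `spread` are made up for illustration. The real split count also depends on how many files the input contains.

```shell
# Estimate how many tasks one job needs and how they spread over the nodes.

# ceil(input_mb / split_mb): one MapTask per input split (simplifying assumption).
map_tasks() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# Print how many of $1 tasks land on each of $2 nodes when spread evenly.
spread() {
  sp_total=$1; sp_nodes=$2; sp_i=0; sp_out=""
  while [ "$sp_i" -lt "$sp_nodes" ]; do
    sp_n=$(( sp_total / sp_nodes ))
    # The first (total mod nodes) nodes take one extra task.
    if [ "$sp_i" -lt $(( sp_total % sp_nodes )) ]; then sp_n=$(( sp_n + 1 )); fi
    sp_out="$sp_out$sp_n "
    sp_i=$(( sp_i + 1 ))
  done
  echo "${sp_out% }"
}

maps=$(map_tasks 1024 128)     # 1 GB input, 128 MB splits
total=$(( maps + 1 + 1 ))      # plus 1 ReduceTask and 1 MRAppMaster
echo "MapTasks=$maps total=$total per-node: $(spread "$total" 3)"
```

With the figures from this case it reports 8 MapTasks, 10 tasks in total, and a 4 3 3 distribution.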
(1) Edit hadoop-env.sh
export HDFS_NAMENODE_OPTS="-Dhadoop.security.logger=INFO,RFAS -Xmx1024m"
export HDFS_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS -Xmx1024m"
(2) Edit hdfs-site.xml
<!-- The NameNode has a pool of worker threads; the default size is 10 -->
<property>
  <name>dfs.namenode.handler.count</name>
  <value>21</value>
</property>
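The value 21 is consistent with a commonly cited rule of thumb, 20 × ln(cluster size), rounded down. The article itself does not state this formula, so treat the sketch below as an assumption about where the number comes from; `handler_count` is a hypothetical helper name.

```shell
# Rule of thumb (assumption, not stated in the article):
# handler count = floor(20 * ln(number of nodes)).
handler_count() {
  awk -v n="$1" 'BEGIN { print int(20 * log(n)) }'
}

handler_count 3   # the 3-node cluster from this case
```

For 3 nodes, 20 × ln(3) ≈ 21.97, which truncates to 21.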
(3) Edit core-site.xml
<!-- Keep deleted files in the trash for 60 minutes -->
<property>
  <name>fs.trash.interval</name>
  <value>60</value>
</property>
(4) Distribute the configuration to the three servers
rsync -av <file-to-distribute> <user>@<host>:<config-directory>
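Running the same rsync command once per host is easy to script. The sketch below assumes the hostnames hadoop102 to hadoop104 and the configuration path /opt/module/hadoop-3.1.3/etc/hadoop; only hadoop103 and the 3.1.3 version appear in this article, so the rest are placeholders to adjust. It defaults to a dry run that only prints the commands.

```shell
# Placeholder values (assumptions): override them to match the real cluster.
HOSTS=${HOSTS:-"hadoop102 hadoop103 hadoop104"}
CONF_DIR=${CONF_DIR:-/opt/module/hadoop-3.1.3/etc/hadoop}
REMOTE_USER=${REMOTE_USER:-$(id -un)}
DRY_RUN=${DRY_RUN:-1}   # 1 = print the commands, 0 = actually run rsync

# Copy one configuration file to the same directory on every host.
distribute() {
  for host in $HOSTS; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "rsync -av $1 $REMOTE_USER@$host:$CONF_DIR/"
    else
      rsync -av "$1" "$REMOTE_USER@$host:$CONF_DIR/"
    fi
  done
}

distribute "$CONF_DIR/hdfs-site.xml"
```

Set DRY_RUN=0 once the printed commands look right.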
(1) Edit mapred-site.xml
<!-- Circular buffer size, default 100 MB -->
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>100</value>
</property>
<!-- Circular buffer spill threshold, default 0.8 -->
<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.80</value>
</property>
<!-- Number of streams merged at once, default 10 -->
<property>
  <name>mapreduce.task.io.sort.factor</name>
  <value>10</value>
</property>
<!-- MapTask memory, default 1 GB; unless mapreduce.map.java.opts is set, the MapTask heap size is derived from this value -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>-1</value>
  <description>The amount of memory to request from the scheduler for each map task. If this is not specified or is non-positive, it is inferred from mapreduce.map.java.opts and mapreduce.job.heap.memory-mb.ratio. If java-opts are also not specified, we set it to 1024.
  </description>
</property>
<!-- MapTask CPU vcores, default 1 -->
<property>
  <name>mapreduce.map.cpu.vcores</name>
  <value>1</value>
</property>
<!-- Maximum attempts for a failing MapTask, default 4 -->
<property>
  <name>mapreduce.map.maxattempts</name>
  <value>4</value>
</property>
<!-- Number of parallel copies each ReduceTask uses to fetch map output, default 5 -->
<property>
  <name>mapreduce.reduce.shuffle.parallelcopies</name>
  <value>5</value>
</property>
<!-- Share of the ReduceTask's available memory used as the shuffle buffer, default 0.7 -->
<property>
  <name>mapreduce.reduce.shuffle.input.buffer.percent</name>
  <value>0.70</value>
</property>
<!-- Fill level of the shuffle buffer at which data starts being written to disk, default 0.66 -->
<property>
  <name>mapreduce.reduce.shuffle.merge.percent</name>
  <value>0.66</value>
</property>
<!-- ReduceTask memory, default 1 GB; unless mapreduce.reduce.java.opts is set, the ReduceTask heap size is derived from this value -->
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>-1</value>
  <description>The amount of memory to request from the scheduler for each reduce task. If this is not specified or is non-positive, it is inferred from mapreduce.reduce.java.opts and mapreduce.job.heap.memory-mb.ratio. If java-opts are also not specified, we set it to 1024.
  </description>
</property>
<!-- ReduceTask CPU vcores, default 1 (set to 2 here) -->
<property>
  <name>mapreduce.reduce.cpu.vcores</name>
  <value>2</value>
</property>
<!-- Maximum attempts for a failing ReduceTask, default 4 -->
<property>
  <name>mapreduce.reduce.maxattempts</name>
  <value>4</value>
</property>
<!-- Resources for ReduceTasks are requested only after this fraction of MapTasks has completed, default 0.05 -->
<property>
  <name>mapreduce.job.reduce.slowstart.completedmaps</name>
  <value>0.05</value>
</property>
<!-- A task that reads no data within the timeout (default 10 minutes) is forcibly terminated -->
<property>
  <name>mapreduce.task.timeout</name>
  <value>600000</value>
</property>
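To see what the buffer percentages mean in megabytes, the thresholds can be worked out directly. This is a rough sketch: the 1024 MB reduce heap is an illustrative figure rather than a value from this configuration, and `scaled_mb` is a made-up helper.

```shell
# Multiply a size in MB by a fraction and round to the nearest MB.
scaled_mb() {
  awk -v mb="$1" -v f="$2" 'BEGIN { printf "%.0f\n", mb * f }'
}

# Map side: the circular buffer spills once it is 80% full.
spill_mb=$(scaled_mb 100 0.80)

# Reduce side: shuffle buffer as a share of the heap, then its merge threshold.
heap_mb=1024                                  # illustrative heap size (assumption)
buffer_mb=$(scaled_mb "$heap_mb" 0.70)
merge_mb=$(scaled_mb "$buffer_mb" 0.66)

echo "map spill starts at ${spill_mb} MB"
echo "reduce shuffle buffer ${buffer_mb} MB, merge to disk at ${merge_mb} MB"
```

Raising mapreduce.task.io.sort.mb or the spill percentage moves the first figure up, which means fewer spill files per MapTask.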
(2) Distribute the configuration file to the servers
rsync -av <file-to-distribute> <user>@<host>:<config-directory>
(1) Edit yarn-site.xml
<!-- Scheduler to use; the default is the Capacity Scheduler -->
<property>
  <description>The class to use as the resource scheduler.</description>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
<!-- Number of threads the ResourceManager uses to handle scheduler requests, default 50. If more than 50 jobs are submitted it can be raised, but on this cluster it must not exceed 3 servers * 4 threads = 12 threads (in practice no more than 8 once other applications are accounted for) -->
<property>
  <description>Number of threads to handle scheduler interface.</description>
  <name>yarn.resourcemanager.scheduler.client.thread-count</name>
  <value>8</value>
</property>
<!-- Whether YARN detects the hardware and configures itself automatically, default false. Configure manually if the node runs many other applications; auto-detection is fine if it runs nothing else -->
<property>
  <description>Enable auto-detection of node capabilities such as memory and CPU.</description>
  <name>yarn.nodemanager.resource.detect-hardware-capabilities</name>
  <value>false</value>
</property>
<!-- Whether to count logical processors (hyperthreads) as CPU cores; default false, so physical cores are used -->
<property>
  <description>Flag to determine if logical processors (such as hyperthreads) should be counted as cores. Only applicable on Linux when yarn.nodemanager.resource.cpu-vcores is set to -1 and yarn.nodemanager.resource.detect-hardware-capabilities is true.
  </description>
  <name>yarn.nodemanager.resource.count-logical-processors-as-cores</name>
  <value>false</value>
</property>
<!-- Multiplier from physical cores to vcores, default 1.0 -->
<property>
  <description>Multiplier to determine how to convert physical cores to vcores. This value is used if yarn.nodemanager.resource.cpu-vcores is set to -1 (which implies auto-calculate vcores) and yarn.nodemanager.resource.detect-hardware-capabilities is set to true. The number of vcores will be calculated as number of CPUs * multiplier.
  </description>
  <name>yarn.nodemanager.resource.pcores-vcores-multiplier</name>
  <value>1.0</value>
</property>
<!-- Memory available to the NodeManager, default 8 GB, changed to 4 GB -->
<property>
  <description>Amount of physical memory, in MB, that can be allocated for containers. If set to -1 and yarn.nodemanager.resource.detect-hardware-capabilities is true, it is automatically calculated (in case of Windows and Linux). In other cases, the default is 8192MB.
  </description>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>
<!-- NodeManager CPU vcores; 8 by default when not detected from the hardware, changed to 4 -->
<property>
  <description>Number of vcores that can be allocated for containers. This is used by the RM scheduler when allocating resources for containers. This is not used to limit the number of CPUs used by YARN containers. If it is set to -1 and yarn.nodemanager.resource.detect-hardware-capabilities is true, it is automatically determined from the hardware in case of Windows and Linux. In other cases, number of vcores is 8 by default.
  </description>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>4</value>
</property>
<!-- Minimum container memory, default 1 GB -->
<property>
  <description>The minimum allocation for every container request at the RM in MBs. Memory requests lower than this will be set to the value of this property. Additionally, a node manager that is configured to have less memory than this value will be shut down by the resource manager.
  </description>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>
<!-- Maximum container memory, default 8 GB, changed to 2 GB -->
<property>
  <description>The maximum allocation for every container request at the RM in MBs. Memory requests higher than this will throw an InvalidResourceRequestException.
  </description>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>2048</value>
</property>
<!-- Minimum container CPU vcores, default 1 -->
<property>
  <description>The minimum allocation for every container request at the RM in terms of virtual CPU cores. Requests lower than this will be set to the value of this property. Additionally, a node manager that is configured to have fewer virtual cores than this value will be shut down by the resource manager.
  </description>
  <name>yarn.scheduler.minimum-allocation-vcores</name>
  <value>1</value>
</property>
<!-- Maximum container CPU vcores, default 4, changed to 2 -->
<property>
  <description>The maximum allocation for every container request at the RM in terms of virtual CPU cores. Requests higher than this will throw an InvalidResourceRequestException.
  </description>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>2</value>
</property>
<!-- Virtual memory check, enabled by default, disabled here -->
<property>
  <description>Whether virtual memory limits will be enforced for containers.</description>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
<!-- Ratio of virtual memory to physical memory, default 2.1 -->
<property>
  <description>Ratio between virtual memory to physical memory when setting memory limits for containers. Container allocations are expressed in terms of physical memory, and virtual memory usage is allowed to exceed this allocation by this ratio.
  </description>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
</property>
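A quick capacity check shows whether these NodeManager limits leave room for the job. The sketch below assumes every container asks for 1024 MB and 1 vcore and takes the smaller of the memory and CPU bounds, which is a conservative estimate; `containers_per_node` is a hypothetical helper, not a YARN command.

```shell
# Containers that fit on one node: the smaller of the memory and vcore bounds.
# Arguments: node_mb node_vcores container_mb container_vcores
containers_per_node() {
  by_mem=$(( $1 / $3 ))
  by_cpu=$(( $2 / $4 ))
  if [ "$by_mem" -lt "$by_cpu" ]; then echo "$by_mem"; else echo "$by_cpu"; fi
}

nodes=3
per_node=$(containers_per_node 4096 4 1024 1)
echo "containers per node: $per_node, cluster total: $(( per_node * nodes ))"
```

Under these assumptions the cluster holds 12 containers, enough for the 10 tasks of this job to run in a single wave. A ReduceTask that requests 2 vcores would reduce that headroom slightly.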
(2) Distribute the configuration file to the servers
rsync -av <file-to-distribute> <user>@<host>:<config-directory>
(1) Restart the cluster
sbin/stop-yarn.sh
sbin/start-yarn.sh
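(2) Run the WordCount program
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /wcinput /wcoutput
Note: run the command from the Hadoop installation directory. /wcinput is the directory holding the 1 GB of data to count, and /wcoutput is the directory the results are written to (it must not already exist).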
(3) Watch the job on the YARN web page
URL: hadoop103:8088
(4) Results
Original contents of /wcinput/work.txt:
Result: the directory /wcoutput is generated.
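For a small sample it can help to reproduce locally what the job computes and compare it with the files under /wcoutput. The pipeline below is only an approximation of the example's behavior, assuming words are separated by whitespace and the output is one tab-separated word and count per line, sorted by word. The sample text is invented.

```shell
# Local stand-in for WordCount: split on whitespace, count, print "word<TAB>count".
word_count() {
  tr -s '[:space:]' '\n' | sed '/^$/d' | LC_ALL=C sort | uniq -c |
    awk '{ print $2 "\t" $1 }'
}

printf 'hadoop yarn\nhadoop mapreduce\n' | word_count
```

For the sample above it lists hadoop with a count of 2, then mapreduce and yarn with 1 each.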