[Translation] Selecting the Right Hardware for a New Hadoop Cluster (Part 3)

Continued from the previous part: http://www.javashuo.com/article/p-kuzqisfj-hn.html

Other Considerations

It is important to remember that the Hadoop ecosystem is designed with a parallel environment in mind. When purchasing processors, we do not recommend getting the highest-GHz chips, which draw high wattage (130W+). This causes two problems: higher power consumption and greater heat output. The mid-range models tend to offer the best bang for the buck in terms of GHz, price, and core count.

When we encounter applications that produce large amounts of intermediate data (outputting data on the same order as the amount read in), we recommend two ports on a single Ethernet card or two channel-bonded Ethernet cards to provide 2 Gbps per machine. Bonded 2 Gbps is tolerable for up to about 12TB of data per node. Once you move above 12TB, you will want to move to bonded 4 Gbps (4 x 1 Gbps). Alternatively, for customers that have already moved to 10 Gigabit Ethernet or InfiniBand, these solutions can be used to address network-bound workloads. Confirm that your operating system and BIOS are compatible if you're considering switching to 10 Gigabit Ethernet.
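
A minimal sketch of this rule of thumb, assuming the thresholds quoted above (bonded 2 Gbps up to roughly 12TB per node, bonded 4 Gbps beyond that, and 10 GbE for heavier network-bound workloads); the exact point at which to jump from 4 Gbps bonding to 10 GbE is an assumption for illustration, not a figure from the article:

```python
def recommended_network(data_tb_per_node):
    """Rule of thumb from the article: bonded 2 Gbps (2 x 1 GbE) is tolerable
    up to ~12TB of data per node; above that, move to bonded 4 Gbps
    (4 x 1 GbE), or to 10 GbE / InfiniBand for network-bound workloads."""
    if data_tb_per_node <= 12:
        return "2 x 1 GbE bonded (2 Gbps)"
    elif data_tb_per_node <= 24:  # assumed cut-over point, not from the article
        return "4 x 1 GbE bonded (4 Gbps)"
    return "10 GbE (verify OS and BIOS compatibility first)"

for tb in (8, 16, 36):
    print(f"{tb}TB/node -> {recommended_network(tb)}")
```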

When computing memory requirements, remember that Java uses up to 10 percent of it for managing the virtual machine. We recommend configuring Hadoop to use strict heap size restrictions in order to avoid memory swapping to disk. Swapping greatly impacts MapReduce job performance and can be avoided by configuring machines with more RAM, as well as setting appropriate kernel settings on most Linux distributions.
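
A hedged sizing sketch of that advice: budget roughly 10 percent of each heap for JVM overhead and confirm that the sum of all task JVMs, plus the OS and Hadoop daemons, fits in physical RAM so that nothing swaps. The slot counts and headroom figures below are illustrative assumptions, not recommendations from the article; on most Linux distributions the kernel setting commonly adjusted for this is vm.swappiness.

```python
# Check that the combined heap of all task JVMs on a worker node fits in
# physical RAM with headroom, so the kernel never swaps MapReduce processes.
JVM_OVERHEAD = 0.10  # the article: Java uses up to ~10% for managing the VM

def fits_in_ram(ram_gb, task_slots, heap_per_task_gb, os_and_daemons_gb=8):
    total_jvm_gb = task_slots * heap_per_task_gb * (1 + JVM_OVERHEAD)
    return total_jvm_gb + os_and_daemons_gb <= ram_gb

print(fits_in_ram(ram_gb=96, task_slots=20, heap_per_task_gb=2))  # True
print(fits_in_ram(ram_gb=48, task_slots=20, heap_per_task_gb=2))  # False: add RAM or cut slots
```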

It is also important to optimize RAM for the memory channel width. For example, when using dual-channel memory, each machine should be configured with pairs of DIMMs. With triple-channel memory, each machine should have triplets of DIMMs. Similarly, with quad-channel memory, DIMMs should be installed in groups of four.
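
A tiny sketch of the DIMM rule: a layout matches the memory-channel width when the number of DIMMs per machine is a multiple of the channel count (2, 3, or 4).

```python
# A DIMM layout keeps all memory channels populated evenly when the DIMM
# count is a multiple of the channel count.
def dimms_match_channels(dimm_count, channels):
    return dimm_count % channels == 0

print(dimms_match_channels(8, 4))  # True: quad-channel in groups of four
print(dimms_match_channels(6, 4))  # False: leaves channels unbalanced
```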

More Than MapReduce

Hadoop is far bigger than HDFS and MapReduce; it's an all-encompassing data platform. For that reason, CDH includes many different ecosystem products (and, in fact, is rarely used solely for MapReduce). Additional software components to consider when sizing your cluster include Apache HBase, Cloudera Impala, and Cloudera Search. They should all be run on the DataNode process to maintain data locality.

HBase is a reliable, column-oriented data store that provides consistent, low-latency, random read/write access. Cloudera Search solves the need for full-text search on content stored in CDH to simplify access for new types of users, but also opens the door for new types of data storage inside Hadoop. Cloudera Search is based on Apache Lucene/Solr Cloud and Apache Tika and extends valuable functionality and flexibility for search through its wider integration with CDH. The Apache-licensed Impala project brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries to data stored in HDFS and HBase without requiring data movement or transformation.

HBase users should be aware of heap-size limits due to garbage collector (GC) timeouts. Other JVM column stores also face this issue. Thus, we recommend a maximum of ~16GB heap per Region Server. HBase does not require too many other resources to run on top of Hadoop, but to maintain real-time SLAs you should use schedulers, such as the Fair and Capacity Schedulers, along with Linux cgroups.

Impala uses memory for most of its functionality, consuming up to 80 percent of available RAM resources under default configurations, so we recommend at least 96GB of RAM per node. Users that run Impala alongside MapReduce should consult our recommendations in "Configuring Impala and MapReduce for Multi-tenant Performance." It is also possible to specify a per-process or per-query memory limit for Impala.
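
A rough sketch of that budget, assuming the default 80 percent figure quoted above; the helper name is illustrative, and per-process limits (for example, Impala's mem_limit setting) can reduce the fraction Impala is allowed to take:

```python
# Estimate how much RAM Impala may claim under default settings and how much
# is left for the OS, DataNode, and any co-located MapReduce tasks.
def impala_footprint_gb(ram_gb, impala_fraction=0.80):
    impala_gb = ram_gb * impala_fraction
    return impala_gb, ram_gb - impala_gb

impala, remainder = impala_footprint_gb(96)  # the article's minimum recommendation
print(f"Impala may take ~{impala:.0f}GB, leaving ~{remainder:.0f}GB for everything else")
```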

Search is the most interesting component to size. The recommended sizing exercise is to purchase one node, install Solr and Lucene, and load your documents. Once the documents are indexed and searched in the desired manner, scalability comes into play. Keep loading documents until the indexing and query latency exceed the values the project can tolerate; this gives you a baseline for the maximum number of documents per node based on available resources, and a baseline node count that does not include any desired replication factor.
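
A small sketch of the scaling arithmetic, assuming you have already measured a per-node document baseline on the single test machine; all of the numbers below are made-up examples, not measurements:

```python
import math

# Scale the node count from the total corpus size, the measured per-node
# baseline, and the desired replication factor.
def search_nodes_needed(total_docs, docs_per_node_baseline, replication_factor=1):
    base_nodes = math.ceil(total_docs / docs_per_node_baseline)
    return base_nodes * replication_factor

print(search_nodes_needed(total_docs=300_000_000,
                          docs_per_node_baseline=25_000_000,
                          replication_factor=2))  # -> 24 nodes
```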

Conclusion

Purchasing appropriate hardware for a Hadoop cluster requires benchmarking and careful planning to fully understand the workload. However, Hadoop clusters are commonly heterogeneous, and Cloudera recommends deploying initial hardware with balanced specifications when getting started. It is important to remember that when using multiple ecosystem components, resource usage will vary, and focusing on resource management will be your key to success.

We encourage you to chime in about your experience configuring production Hadoop clusters in the comments!

Kevin O’Dell is a Systems Engineer at Cloudera.
