Because Cloudera’s customers need to thoroughly understand their workloads in order to fully optimize Hadoop hardware, a classic chicken-and-egg problem ensues. Most teams looking to build a Hadoop cluster don’t yet know the eventual profile of their workload, and often the first jobs that an organization runs with Hadoop are far different from the jobs that Hadoop is ultimately used for as proficiency increases. Furthermore, some workloads might be bound in unforeseen ways. For example, some theoretically IO-bound workloads might actually be CPU-bound because of a user’s choice of compression, or different implementations of an algorithm might change how the MapReduce job is constrained.
For these reasons, when the team is unfamiliar with the types of jobs it is going to run, it makes sense to invest in a balanced Hadoop cluster as an initial approach. The next step is to benchmark MapReduce jobs running on the balanced cluster to analyze how they’re bound. With thorough monitoring in place, it is straightforward to measure live workloads and determine bottlenecks. We recommend installing Cloudera Manager on the Hadoop cluster to provide real-time statistics about CPU, disk, and network load. With Cloudera Manager installed, Hadoop administrators can then run their MapReduce jobs and check the Cloudera Manager dashboard to see how each machine is performing.
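To make the idea of "how a job is bound" concrete, here is a minimal sketch that labels a job from utilization readings collected while it ran. It does not use any Cloudera Manager API; the `classify_bottleneck` function, the sample format, and the 0.85 saturation threshold are all assumptions chosen for illustration.

```python
from statistics import mean

# Hypothetical threshold: a resource whose average utilization (as a
# fraction of capacity) reaches this value is treated as the bottleneck.
SATURATION = 0.85


def classify_bottleneck(samples, threshold=SATURATION):
    """Return the sorted names of resources that look saturated.

    `samples` maps a resource name ("cpu", "disk", "network") to a list of
    utilization readings between 0.0 and 1.0 taken during one job run.
    An empty list means no resource reached the threshold.
    """
    averages = {name: mean(values) for name, values in samples.items()}
    return sorted(name for name, avg in averages.items() if avg >= threshold)


if __name__ == "__main__":
    # Made-up readings for a job that compresses its output heavily.
    job = {
        "cpu": [0.95, 0.90, 0.92],
        "disk": [0.40, 0.50, 0.45],
        "network": [0.20, 0.30, 0.10],
    }
    print(classify_bottleneck(job))
```

In practice the readings would come from whatever monitoring is in place, and a single threshold is a simplification: a job can shift between CPU-bound and IO-bound phases over its lifetime.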
In addition to building out a cluster appropriate for the workload, we encourage customers to work with their hardware vendor to understand the economics of power and cooling. Since Hadoop runs on tens, hundreds, or thousands of nodes, an operations team can save a significant amount of money by investing in power-efficient hardware. Each hardware vendor will be able to provide tools and recommendations for how to monitor power and cooling.
The first step in choosing a machine configuration is to understand the type of hardware your operations team already manages. Operations teams often have opinions or hard requirements about new machine purchases, and will prefer to work with hardware with which they’re already familiar. Hadoop is not the only system that benefits from efficiencies of scale. Again, as a general suggestion, if the cluster is new or you can’t accurately predict your ultimate workload, we advise that you use balanced hardware.
There are four types of roles in a basic Hadoop cluster: NameNode (and Standby NameNode), JobTracker, TaskTracker, and DataNode. (A node is a machine performing a particular task.) Most machines in your cluster will perform two of these roles, functioning as both DataNode (for data storage) and TaskTracker (for data processing).
Here are the recommended specifications for DataNode/TaskTrackers in a balanced Hadoop cluster:
The NameNode role is responsible for coordinating data storage on the cluster, and the JobTracker for coordinating data processing. (The Standby NameNode should not be co-located with the NameNode on the same machine, and it should run on hardware identical to that of the NameNode.) Cloudera recommends that customers purchase enterprise-class machines for running the NameNode and JobTracker, with redundant power and enterprise-grade disks in RAID 1 or 10 configurations.
The NameNode will also require RAM directly proportional to the number of data blocks in the cluster. A good rule of thumb is to assume 1GB of NameNode memory for every 1 million blocks stored in the distributed file system. With 100 DataNodes in a cluster, 64GB of RAM on the NameNode provides plenty of room to grow the cluster. We also recommend having HA configured on both the NameNode and JobTracker, features that have been available in the CDH4 line for some time.
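The rule of thumb above can be turned into a quick sizing check. The sketch below is only an estimate under stated assumptions: the 1 GB per million blocks ratio comes from the text, while the 128 MB block size, replication factor of 3, and 24 TB of storage per DataNode are hypothetical inputs, and the helper names are ours.

```python
import math

# From the rule of thumb in the text: 1 GB of NameNode memory per
# 1 million blocks stored in the distributed file system.
BLOCKS_PER_GB = 1_000_000


def namenode_memory_gb(total_blocks):
    """Memory the rule of thumb suggests for a given block count."""
    return math.ceil(total_blocks / BLOCKS_PER_GB)


def estimate_blocks(data_nodes, tb_per_node, block_mb=128, replication=3):
    """Estimate unique blocks when the cluster's storage is full.

    Assumes every block is completely full. Many small files would push
    the real block count well above this figure.
    """
    raw_mb = data_nodes * tb_per_node * 1024 * 1024
    return int(raw_mb // (block_mb * replication))


if __name__ == "__main__":
    # Hypothetical cluster: 100 DataNodes with 24 TB of storage each.
    blocks = estimate_blocks(100, 24)
    print(blocks, "blocks ->", namenode_memory_gb(blocks), "GB")
```

Under these assumptions the estimate lands far below 64GB, which is consistent with the claim that 64GB leaves room to grow. Small files are the main way a real cluster departs from this estimate, so a block count taken from the running cluster is the better input once one is available.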
Here are the recommended specifications for NameNode/JobTracker/Standby NameNode nodes. The drive count will fluctuate depending on the amount of redundancy:
With a Hadoop cluster in place, the team can start identifying workloads and prepare to benchmark those workloads to identify hardware bottlenecks. After some time benchmarking and monitoring, the team will understand how additional machines should be configured. Heterogeneous Hadoop clusters are common, especially as they grow in size and number of use cases, so starting with a set of machines that are not “ideal” for your workload will not be a waste of time. Cloudera Manager offers templates that allow different hardware profiles to be managed in groups, making it simple to manage heterogeneous clusters.
Below is a list of various hardware configurations for different workloads, including our original “balanced” recommendation: