One of the first questions Cloudera customers raise when getting started with Apache Hadoop is how to select appropriate hardware for their new Hadoop clusters.
Although Hadoop is designed to run on industry-standard hardware, recommending an ideal cluster configuration is not as easy as delivering a list of hardware specifications. Selecting hardware that provides the best balance of performance and economy for a given workload requires testing and validation. (For example, users with IO-intensive workloads will invest in more spindles per core.)
In this blog post, you’ll learn some of the principles of workload evaluation and the critical role it plays in hardware selection. You’ll also learn the various factors that Hadoop administrators should take into account during this process.
Over the past decade, IT organizations have standardized on blades and SANs (Storage Area Networks) to satisfy their grid and processing-intensive workloads. While this model makes a lot of sense for a number of standard applications such as web servers, app servers, smaller structured databases, and data movement, the requirements for infrastructure have changed as the amount of data and number of users has grown. Web servers now have caching tiers, databases have gone massively parallel with local disk, and data movement jobs are pushing more data than they can handle locally.
Hardware vendors have created innovative systems to address these requirements including storage blades, SAS (Serial Attached SCSI) switches, external SATA arrays, and larger capacity rack units. However, Hadoop is based on a new approach to storing and processing complex data, with data movement minimized. Instead of relying on a SAN for massive storage and reliability, then moving it to a collection of blades for processing, Hadoop handles large data volumes and reliability in the software tier.
Hadoop distributes data across a cluster of balanced machines and uses replication to ensure data reliability and fault tolerance. Because data is distributed on machines with compute power, processing can be sent directly to the machines storing the data. Since each machine in a Hadoop cluster stores as well as processes data, those machines need to be configured to satisfy both data storage and processing requirements.
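As a rough illustration (not from the original post), the sketch below uses the standard HDFS client API to control a file's replication factor from the software tier; the file path and the replication value of 3 are illustrative assumptions.

```java
// Minimal sketch: replication is a software-tier setting in HDFS, not a
// property of the underlying storage hardware. Assumes a Hadoop client
// configuration is available on the classpath; the path is hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");   // the cluster default is typically 3 replicas

        FileSystem fs = FileSystem.get(conf);
        // Replication can also be changed per file after it is written.
        fs.setReplication(new Path("/data/example.txt"), (short) 3);
        fs.close();
    }
}
```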
In nearly all cases, a MapReduce job will either encounter a bottleneck reading data from disk or from the network (known as an IO-bound job) or in processing data (CPU-bound). An example of an IO-bound job is sorting, which requires very little processing (simple comparisons) and a lot of reading and writing to disk. An example of a CPU-bound job is classification, where some input data is processed in very complex ways to determine ontology.
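As a hedged sketch (not part of the original post), the job below illustrates why sorting is IO-bound: the mapper and reducer are the framework's identity implementations, so almost all of the work happens in the shuffle, which reads, moves, and writes the data. Input and output paths are assumed to be passed on the command line.

```java
// Minimal IO-bound sort sketch using the Hadoop MapReduce API.
// The default Mapper and Reducer pass records through unchanged; the
// framework's shuffle/sort orders them by key, so disk and network dominate.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "io-bound-sort");
        job.setJarByClass(SortSketch.class);

        // Identity map and reduce: very little CPU, lots of reading and writing.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);

        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```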
Here are several more examples of IO-bound workloads:
Sorting / Grouping / Data import and export / Data movement and transformation
Here are several more examples of CPU-bound workloads:
Clustering/classification / Complex text mining / Natural language processing / Feature extraction
To be continued; read the rest at: http://www.javashuo.com/article/p-kuzqisfj-hn.html