Filesystems that manage the storage across a network of machines are called distributed filesystems. Since they are network based, all the complications of network programming kick in, thus making distributed filesystems more complex than regular disk filesystems.
1) The Design of HDFS
a) HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
b) HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, and then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.
c) Hadoop doesn’t require expensive, highly reliable hardware. It’s designed to run on clusters of commodity hardware (commonly available hardware that can be obtained from multiple vendors) for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.
2) HDFS Concepts
a) A disk has a block size, which is the minimum amount of data that it can read or write. Filesystems for a single disk build on this by dealing with data in blocks, which are an integral multiple of the disk block size. Filesystem blocks are typically a few kilobytes in size, whereas disk blocks are normally 512 bytes.
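The "integral multiple" relationship can be made concrete with a short sketch. The figures below (a 4 KB filesystem block built from 512-byte disk blocks, and a 5000-byte file) are illustrative assumptions, not values from the text:

```python
# Illustrative numbers: a 4 KB filesystem block, an integral multiple
# of the underlying 512-byte disk block.
DISK_BLOCK = 512
FS_BLOCK = 8 * DISK_BLOCK  # 4096 bytes

def blocks_needed(file_size, block_size):
    """Number of whole blocks a file occupies (ceiling division)."""
    return -(-file_size // block_size)

# On a regular single-disk filesystem, a 5000-byte file occupies
# 2 whole filesystem blocks (8192 bytes of underlying storage).
print(blocks_needed(5000, FS_BLOCK))             # 2
print(blocks_needed(5000, FS_BLOCK) * FS_BLOCK)  # 8192
```

This is exactly the behavior that point b) below contrasts with HDFS, where a file smaller than one block does not consume a full block of underlying storage.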
b) Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage.
c) HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. If the block is large enough, the time it takes to transfer the data from the disk can be significantly longer than the time to seek to the start of the block. Thus, transferring a large file made of multiple blocks operates at the disk transfer rate.
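A back-of-envelope calculation shows why a large block amortizes the seek. The seek time (10 ms) and transfer rate (100 MB/s) below are illustrative assumptions, not measured values:

```python
# Illustrative figures: 10 ms to seek to the start of a block,
# 100 MB/s sustained transfer rate off the disk.
SEEK_S = 0.010
TRANSFER_BPS = 100e6  # bytes per second

def seek_overhead(block_size):
    """Fraction of total read time spent seeking, for one block."""
    transfer_s = block_size / TRANSFER_BPS
    return SEEK_S / (SEEK_S + transfer_s)

# With a 128 MB HDFS-style block, the seek is under 1% of the read time,
# so the read proceeds at roughly the disk transfer rate. With a 4 KB
# filesystem-style block, seeking dominates almost completely.
print(round(seek_overhead(128 * 1024 * 1024), 3))
print(round(seek_overhead(4 * 1024), 3))
```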
d) Having a block abstraction for a distributed filesystem brings several benefits. The first benefit is the most obvious: a file can be larger than any single disk in the network. Second, making the unit of abstraction a block rather than a file simplifies the storage subsystem. Furthermore, blocks fit well with replication for providing fault tolerance and availability.
e) An HDFS cluster has two types of nodes operating in a master-worker pattern: a namenode (the master) and a number of datanodes (workers). The namenode manages the filesystem namespace.
f) Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of blocks that they are storing.
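The relationship between block reports and the namenode's block map can be sketched with a toy model. This is not the real Hadoop API — the class and method names are hypothetical, chosen only to mirror the flow described above:

```python
from collections import defaultdict

# Toy model of the namenode's in-memory block map, rebuilt from the
# periodic block reports that datanodes send. Not the real API.
class ToyNamenode:
    def __init__(self):
        # block id -> set of datanodes currently holding a replica
        self.block_map = defaultdict(set)

    def receive_block_report(self, datanode, blocks):
        """A datanode reports the full list of blocks it stores."""
        # Drop this datanode's stale entries, then re-register its blocks.
        for holders in self.block_map.values():
            holders.discard(datanode)
        for block in blocks:
            self.block_map[block].add(datanode)

    def locate(self, block):
        """Which datanodes can serve this block?"""
        return sorted(self.block_map[block])

nn = ToyNamenode()
nn.receive_block_report("dn1", ["blk_1", "blk_2"])
nn.receive_block_report("dn2", ["blk_2", "blk_3"])
print(nn.locate("blk_2"))  # ['dn1', 'dn2']
```

Because the map is rebuilt from reports, it lives only in memory — which is why, as the HA discussion below notes, datanodes must report to both namenodes rather than the mapping being read from disk.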
g) For this reason, it is important to make the namenode resilient to failure, and Hadoop provides two mechanisms for this. The first way is to back up the files that make up the persistent state of the filesystem metadata.
h) It is also possible to run a secondary namenode, which despite its name does not act as a namenode. Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large.
i) However, the state of the secondary namenode lags that of the primary, so in the event of total failure of the primary, data loss is almost certain. The usual course of action in this case is to copy the namenode’s metadata files that are on NFS to the secondary and run it as the new primary.
j) Normally a datanode reads blocks from disk, but for frequently accessed files the blocks may be explicitly cached in the datanode’s memory, in an off-heap block cache.
k) HDFS federation, introduced in the 2.x release series, allows a cluster to scale by adding namenodes, each of which manages a portion of the filesystem namespace.
l) Hadoop 2 remedied this situation by adding support for HDFS high availability (HA). In this implementation, there are a pair of namenodes in an active-standby configuration. In the event of the failure of the active namenode, the standby takes over its duties to continue servicing client requests without a significant interruption. A few architectural changes are needed to allow this to happen:
m) The namenodes must use highly available shared storage to share the edit log. Datanodes must send block reports to both namenodes because the block mappings are stored in a namenode’s memory, and not on disk. Clients must be configured to handle namenode failover, using a mechanism that is transparent to users. The secondary namenode’s role is subsumed by the standby, which takes periodic checkpoints of the active namenode’s namespace.
n) There are two choices for the highly available shared storage: an NFS filer, or a quorum journal manager (QJM). The QJM is a dedicated HDFS implementation, designed for the sole purpose of providing a highly available edit log, and is the recommended choice for most HDFS installations.
o) If the active namenode fails, the standby can take over very quickly (in a few tens of seconds) because it has the latest state available in memory: both the latest edit log entries and an up-to-date block mapping.
p) The transition from the active namenode to the standby is managed by a new entity in the system called the failover controller. There are various failover controllers, but the default implementation uses ZooKeeper to ensure that only one namenode is active. Each namenode runs a lightweight failover controller process whose job it is to monitor its namenode for failures and trigger a failover should a namenode fail.
q) The HA implementation goes to great lengths to ensure that the previously active namenode is prevented from doing any damage and causing corruption — a method known as fencing.
r) The QJM only allows one namenode to write to the edit log at one time; however, it is still possible for the previously active namenode to serve stale read requests to clients, so setting up an SSH fencing command that will kill the namenode’s process is a good idea.
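In a real deployment, SSH fencing is wired up through the HA settings in `hdfs-site.xml`. The fragment below is a sketch: the key names (`dfs.ha.fencing.methods`, `dfs.ha.fencing.ssh.private-key-files`) are standard Hadoop HA configuration properties, while the key path is a placeholder:

```xml
<!-- Sketch of an hdfs-site.xml fragment. sshfence logs into the old
     active namenode over SSH and kills its process so it cannot keep
     serving stale read requests. The key path is a placeholder. -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hdfs/.ssh/id_rsa</value>
</property>
```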
3) The Command-Line Interface
a) You can type hadoop fs -help to get detailed help on every command.
b) Let’s copy the file back to the local filesystem and check whether it’s the same:
% hadoop fs -copyToLocal quangle.txt quangle.copy.txt
% md5 input/docs/quangle.txt quangle.copy.txt
MD5 (input/docs/quangle.txt) = e7891a2627cf263a079fb0f18256ffb2
MD5 (quangle.copy.txt) = e7891a2627cf263a079fb0f18256ffb2
The MD5 digests are the same, showing that the file survived its trip to HDFS and is back intact.
4) Hadoop Filesystems
a) Hadoop has an abstract notion of filesystems, of which HDFS is just one implementation. The Java abstract class org.apache.hadoop.fs.FileSystem represents the client interface to a filesystem in Hadoop, and there are several concrete implementations.
b) Hadoop is written in Java, so most Hadoop filesystem interactions are mediated through the Java API.
c) By exposing its filesystem interface as a Java API, Hadoop makes it awkward for non-Java applications to access HDFS. The HTTP REST API exposed by the WebHDFS protocol makes it easier for other languages to interact with HDFS. Note that the HTTP interface is slower than the native Java client, so should be avoided for very large data transfers if possible.
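The WebHDFS convention places operations on a URL of the form `http://<host>:<port>/webhdfs/v1/<path>?op=<OPERATION>`. A minimal sketch of how a non-Java client would construct such a request — the host name and port below are placeholders, not values from the text:

```python
# Sketch of the WebHDFS REST URL layout. Host and port are placeholder
# assumptions; the path layout follows the WebHDFS convention
# http://<host>:<port>/webhdfs/v1/<path>?op=<OPERATION>.
def webhdfs_url(host, port, path, op):
    return f"http://{host}:{port}/webhdfs/v1{path}?op={op}"

# Reading a file corresponds to an HTTP GET against the OPEN operation:
url = webhdfs_url("namenode.example.com", 9870, "/user/tom/quangle.txt", "OPEN")
print(url)
```

Any language with an HTTP client can then issue the request, which is what makes WebHDFS the practical path for non-Java applications despite being slower than the native client.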