[Repost: Hadoop & Spark Hands-On Practice 2] Hadoop 2.7.3 HDFS Theory and Hands-On Practice

Introduction

HDFS (Hadoop Distributed File System) is Hadoop's distributed file system. It is modeled on the design Google published as GFS (the Google File System).

HDFS has several notable characteristics:

    ① It keeps multiple replicas of each block and provides fault tolerance: lost replicas and failed nodes are recovered automatically. The default is 3 replicas.

    ② It runs on inexpensive commodity machines.

    ③ It is suited to processing very large data sets. How large, and how small? HDFS splits a file into blocks (64 MB per block by default in older releases; 128 MB since Hadoop 2.x), stores the blocks across the cluster, and keeps the block mapping in the NameNode's memory. If there are too many small files, that memory burden becomes very heavy.

As shown in the figure above, HDFS also follows a Master/Slave structure, with the roles NameNode, SecondaryNameNode, and DataNode.

NameNode: the Master node, the boss. It manages the block mapping, handles client read/write requests, applies the replication policy, and manages the HDFS namespace.

SecondaryNameNode: a helper that offloads part of the NameNode's work. It is a cold backup of the NameNode: it merges fsimage and edits and sends the result back to the NameNode.

DataNode: a Slave node, the worker. It stores the data blocks sent by clients and performs block read and write operations.

Hot backup: b is a hot backup of a; if a fails, b immediately takes over a's work.

Cold backup: b is a cold backup of a; if a fails, b cannot take over immediately, but b holds some of a's state, which reduces the damage when a fails.

fsimage: the metadata image file (the file system's directory tree).

edits: the metadata operation log (a record of modifications made to the file system).

What the NameNode holds in memory = fsimage + edits.

The SecondaryNameNode periodically (every hour by default) fetches fsimage and edits from the NameNode, merges them, and sends the result back, reducing the NameNode's workload.

 


 

How it works

Write path:

Suppose there is a 100 MB file, FileA, and a client writes FileA into HDFS.

HDFS uses its default configuration.

The DataNodes are spread across three racks: Rack1, Rack2, and Rack3.

 

a. The client splits FileA into 64 MB blocks, producing two blocks: block1 and block2.

b. The client sends a write request to the NameNode (blue dashed line ① in the figure).

c. The NameNode records the block information and returns the available DataNodes (pink dashed line ②):

    Block1: host2,host1,host3

    Block2: host7,host8,host4

    Placement rules:

        The NameNode is rack-aware (RackAware); this behaviour is configurable.

        If the client itself runs on a DataNode, replicas are placed as follows: replica 1 on the client's own node; replica 2 on a node in a different rack; replica 3 on another node in the same rack as replica 2; any further replicas on randomly chosen nodes.

        If the client is not a DataNode: replica 1 on a randomly chosen node; replica 2 on a rack different from replica 1; replica 3 on another node in the same rack as replica 2; any further replicas on randomly chosen nodes.

d. The client streams block1 to the DataNodes.

    The streaming write works like this:

        1> block1 (64 MB) is divided into 64 KB packets;

        2> the client sends the first packet to host2;

        3> once host2 has received it, host2 forwards the first packet to host1 while the client sends the second packet to host2;

        4> once host1 has received the first packet, it forwards it to host3 while receiving the second packet from host2;

        5> and so on (solid red lines in the figure) until block1 has been fully transferred;

        6> host2, host1, and host3 notify the NameNode, and host2 notifies the client, that the block has been received (solid pink lines);

        7> when the client receives host2's acknowledgement, it tells the NameNode that block1 is written; block1 is now truly complete (thick yellow line);

        8> after block1, the client sends block2 to host7, host8, and host4 (solid blue lines);

        9> after block2 is sent, host7, host8, and host4 notify the NameNode, and host7 notifies the client (solid light-green lines);

        10> the client tells the NameNode that the write is finished (thick yellow line), and the whole write is complete.

From this write path we can see that:

    Writing a 1 TB file requires 3 TB of storage and 3 TB of network bandwidth, one copy per replica.

    While reads and writes are in progress, the NameNode and DataNodes stay in contact through heartbeats, so the NameNode knows which DataNodes are alive. If a DataNode dies, its data is re-replicated onto other nodes, and subsequent reads go to those other nodes.

    Losing one node does not matter, because other nodes hold replicas; even losing a whole rack does not matter, because other racks hold replicas too.
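From the client's point of view, this whole pipeline is driven by FileSystem.create. The following is a minimal, hedged sketch of my own (not the article's code) that writes roughly 100 MB to a made-up path /user/hadoop/fileA and lets HDFS do the block splitting and pipelined replication; the replication factor and block size passed to create mirror the scenario above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteFileA {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // fs.defaultFS must point at the NameNode
        FileSystem fs = FileSystem.get(conf);

        Path dst = new Path("/user/hadoop/fileA");          // hypothetical destination
        // create() asks the NameNode for target DataNodes; the returned stream
        // splits our bytes into packets and pushes them down the DataNode pipeline.
        FSDataOutputStream out = fs.create(dst, true /* overwrite */, 4096,
                (short) 3 /* replication */, 64L * 1024 * 1024 /* block size */);
        byte[] payload = new byte[1024 * 1024];             // stand-in for FileA's contents
        for (int i = 0; i < 100; i++) {                     // about 100 MB, i.e. two 64 MB blocks
            out.write(payload);
        }
        out.close();                                        // flushes the last packets and seals the blocks
        fs.close();
    }
}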

 

Read path:

The read path is simpler. As shown in the figure, the client wants to read FileA from the DataNodes, and FileA consists of block1 and block2.

The read proceeds as follows:

a. The client sends a read request to the NameNode.

b. The NameNode consults its metadata and returns the block locations of FileA:

    block1:host2,host1,host3

    block2:host7,host8,host4

c. The block locations are ordered: block1 is read first, then block2; block1 is read from host2 and block2 from host7.

In the example above the client sits outside the racks. If the client runs on a DataNode inside a rack, say host6, then the rule it follows when reading is:

it prefers to read the replica on its own rack.
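To make the read path concrete, here is a minimal sketch of my own (not from the original article) that first asks the NameNode which hosts hold each block and then streams the file; the path is a placeholder, and the hosts printed are whatever the NameNode returns, ordered by network distance from the reader:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFileA {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path src = new Path("/user/hadoop/fileA");           // hypothetical file

        // Metadata step: which DataNodes hold each block?
        FileStatus status = fs.getFileStatus(src);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset " + b.getOffset() + " -> hosts " + String.join(",", b.getHosts()));
        }

        // Data step: the stream reads each block from the closest replica.
        FSDataInputStream in = fs.open(src);
        IOUtils.copyBytes(in, System.out, 4096, false);
        in.close();
        fs.close();
    }
}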

 


 

Commonly used HDFS commands

1. hadoop fs

hadoop fs -ls /
hadoop fs -lsr
hadoop fs -mkdir /user/hadoop
hadoop fs -put a.txt /user/hadoop/
hadoop fs -get /user/hadoop/a.txt /
hadoop fs -cp src dst
hadoop fs -mv src dst
hadoop fs -cat /user/hadoop/a.txt
hadoop fs -rm /user/hadoop/a.txt
hadoop fs -rmr /user/hadoop/a.txt
hadoop fs -text /user/hadoop/a.txt
hadoop fs -copyFromLocal localsrc dst   (similar to hadoop fs -put)
hadoop fs -moveFromLocal localsrc dst   (uploads a local file to HDFS and deletes the local copy)
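The commands above can also be driven from Java. A small, hedged sketch (assuming the standard org.apache.hadoop.fs.FsShell class and a configured cluster) that runs the equivalents of "hadoop fs -ls /" and "hadoop fs -mkdir /user/hadoop":

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FsShell;
import org.apache.hadoop.util.ToolRunner;

public class ShellFromJava {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Each call behaves like the corresponding "hadoop fs ..." command line and returns its exit code.
        int lsExit    = ToolRunner.run(conf, new FsShell(), new String[] {"-ls", "/"});
        int mkdirExit = ToolRunner.run(conf, new FsShell(), new String[] {"-mkdir", "/user/hadoop"});
        System.out.println("ls exit=" + lsExit + ", mkdir exit=" + mkdirExit);
    }
}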

2. hadoop dfsadmin

hadoop dfsadmin -report
hadoop dfsadmin -safemode enter | leave | get | wait
hadoop dfsadmin -setBalancerBandwidth  1000

3. hadoop fsck

4. start-balancer.sh

 

 

========================================================================

Here is a more detailed write-up!

1. Introduction to HDFS

  HDFS (Hadoop Distributed File System) is a core subproject of Hadoop and the storage foundation for distributed computing in the Hadoop ecosystem. It was developed for streaming access to very large files and runs on inexpensive commodity servers. Its high fault tolerance, reliability, scalability, availability, and throughput make it a robust store for massive amounts of data and make working with very large data sets (Large Data Sets) much more convenient.

  Hadoop integrates many file systems behind a common file system abstraction that defines the interfaces a file system implementation must provide; HDFS is just one instance of that abstraction. The abstract class org.apache.hadoop.fs.FileSystem represents a file system and has several concrete implementations, listed in Table 1-1.

Table 1-1 Hadoop file systems

File system | URI scheme | Java implementation (org.apache.hadoop) | Description
Local | file | fs.LocalFileSystem | A local file system with client-side checksums. The raw local file system without checksums is implemented by fs.RawLocalFileSystem.
HDFS | hdfs | hdfs.DistributedFileSystem | Hadoop's distributed file system.
HFTP | hftp | hdfs.HftpFileSystem | Read-only access to HDFS over HTTP; often used with distcp to copy data between HDFS clusters of different versions.
HSFTP | hsftp | hdfs.HsftpFileSystem | Read-only access to HDFS over HTTPS.
HAR | har | fs.HarFileSystem | Layered on another Hadoop file system to archive files. Hadoop archives are mainly used to reduce the NameNode's memory usage.
KFS | kfs | fs.kfs.KosmosFileSystem | CloudStore (formerly the Kosmos file system) is a distributed file system similar to HDFS and Google's GFS, written in C++.
FTP | ftp | fs.ftp.FTPFileSystem | A file system backed by an FTP server.
S3 (native) | s3n | fs.s3native.NativeS3FileSystem | A file system backed by Amazon S3.
S3 (block-based) | s3 | fs.s3.S3FileSystem | A file system backed by Amazon S3 that stores files in blocks (much like HDFS) to work around S3's 5 GB file size limit.

  Hadoop exposes interfaces to many file systems; by choosing the appropriate URI scheme, a user picks which file system to interact with.

2. Basic HDFS concepts

2.1 Data blocks (block)

  • The basic storage unit of HDFS is a 64 MB data block (the default in the Hadoop version this article describes; 128 MB since Hadoop 2.x).
  • As in an ordinary file system, files in HDFS are stored split into block-sized chunks.
  • Unlike an ordinary file system, a file in HDFS that is smaller than one block does not occupy a whole block's worth of storage (a minimal sketch of checking this from the API follows below).
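A small sketch of my own (the path is a placeholder) that reads the block size a file was written with and estimates how many blocks it occupies; getFileStatus, getBlockSize, and getLen are standard FileSystem/FileStatus calls:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path p = new Path("/user/hadoop/a.txt");              // hypothetical file

        FileStatus st = fs.getFileStatus(p);
        long blockSize = st.getBlockSize();                   // block size this file was written with
        long length    = st.getLen();
        long blocks    = (length + blockSize - 1) / blockSize; // ceiling division
        System.out.println(length + " bytes, block size " + blockSize + " -> " + blocks + " block(s)");
        // A 100 MB file with 64 MB blocks occupies 2 blocks, but the second block
        // only stores 36 MB; the unused remainder is not consumed on disk.
        fs.close();
    }
}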

2.2 NameNode and DataNode

  An HDFS cluster has two kinds of nodes: the NameNode, also called the metadata node, and the DataNodes, also called data nodes. They act respectively as the Master and as the Workers that carry out the actual tasks.

  1) The metadata node (NameNode) manages the file system namespace.

  • It keeps the metadata of every file and directory in a file system tree.
  • This information is also persisted on disk as two files: the namespace image and the edit log.
  • The NameNode also knows which blocks make up each file and which DataNodes hold them, but this mapping is not stored on disk; it is rebuilt from the DataNodes when the system starts.

  2) DataNodes are where the file system actually stores data.

  • Clients (or the NameNode) can ask a DataNode to write or read data blocks.
  • DataNodes periodically report the blocks they store back to the NameNode.

  3) The secondary metadata node (SecondaryNameNode)

  • The SecondaryNameNode is not a standby that takes over when the NameNode fails; it has a different job.
  • Its main job is to periodically merge the NameNode's namespace image with the edit log so that the edit log does not grow without bound. This is described in more detail below.
  • The merged namespace image is also kept on the SecondaryNameNode, so it can be used for recovery if the NameNode fails.

2.3 NameNode directory structure

 

  

 

  The VERSION file is a Java properties file that records the HDFS version.

  • layoutVersion is a negative integer recording the version of HDFS's persistent on-disk data structure format.
  • namespaceID uniquely identifies the file system; it is generated when the file system is first formatted.
  • cTime is 0 here.
  • storageType indicates that this directory holds a NameNode's data structures (a small sketch of reading this file follows the example values below).

 

namespaceID=1232737062

cTime=0

storageType=NAME_NODE

layoutVersion=-18
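Since VERSION is a plain Java properties file, it can be read with java.util.Properties. A minimal sketch of my own, assuming a hypothetical local path (the real location is ${dfs.namenode.name.dir}/current/VERSION):

import java.io.FileInputStream;
import java.util.Properties;

public class ReadVersionFile {
    public static void main(String[] args) throws Exception {
        String path = "/tmp/hadoop/dfs/name/current/VERSION";   // hypothetical storage directory
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream(path)) {
            props.load(in);                                     // key=value lines, as shown above
        }
        System.out.println("namespaceID   = " + props.getProperty("namespaceID"));
        System.out.println("cTime         = " + props.getProperty("cTime"));
        System.out.println("storageType   = " + props.getProperty("storageType"));
        System.out.println("layoutVersion = " + props.getProperty("layoutVersion"));
    }
}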

 

2.4 DataNode directory structure

 

  

  • A DataNode's VERSION file looks like this:

 

namespaceID=1232737062

storageID=DS-1640411682-127.0.1.1-50010-1254997319480

cTime=0

storageType=DATA_NODE

layoutVersion=-18

 

  • blk_<id> files hold HDFS data blocks, that is, the actual binary data.
  • blk_<id>.meta files hold each block's attributes: version and type information plus checksums.
  • When the number of blocks in a directory reaches a threshold, subdirectories are created to hold further blocks and their attribute files.

2.5 The namespace image and the edit log

  • When a file system client performs a write, the operation is first recorded in the edit log.
  • The NameNode keeps the file system metadata in memory; after recording the operation in the edit log, it updates the in-memory data structures.
  • The edit log is synced to the file system before each write is acknowledged as successful.
  • The fsimage file, the namespace image, is an on-disk checkpoint of the in-memory metadata. It is a serialized format and cannot be modified directly on disk.
  • As on the data side, when the NameNode fails, the metadata from the latest checkpoint is loaded from fsimage into memory and the operations in the edit log are replayed one by one.
  • The SecondaryNameNode exists to help the NameNode checkpoint its in-memory metadata to disk.
  • The checkpoint process works like this:
    • The SecondaryNameNode asks the NameNode to roll the edit log; subsequent edits go to a new log file.
    • The SecondaryNameNode fetches fsimage and the old edit log from the NameNode with HTTP GET.
    • It loads fsimage into memory, replays the operations in the edit log, and produces a new fsimage.
    • It sends the new fsimage back to the NameNode with HTTP POST.
    • The NameNode replaces the old fsimage and old edit log with the new fsimage and the new edit log (created in the first step), and updates the fstime file with the time of this checkpoint.
    • The fsimage on the NameNode now reflects the latest checkpoint, and the edit log starts fresh instead of growing without bound.

 

 

3. HDFS architecture

  HDFS has a master/slave architecture. From the end user's point of view it looks like a traditional file system: files can be created, read, updated, and deleted (CRUD) through directory paths. Because storage is distributed, an HDFS cluster has one NameNode and a number of DataNodes: the NameNode manages the file system metadata, while the DataNodes store the actual data. A client accesses the file system by interacting with both: it contacts the NameNode for file metadata, while the actual file I/O goes directly to the DataNodes.

 

Figure 3.1 Overall HDFS architecture

 

  1) NameNode, DataNode, and Client

  • The NameNode is the manager of the distributed file system. It is mainly responsible for the file system namespace, cluster configuration information, and block replication. It keeps the file system metadata in memory, including file information, the blocks that make up each file, and which DataNodes hold each block.
  • The DataNode is the basic unit of file storage. It stores blocks in its local file system, keeps each block's metadata, and periodically reports all of its blocks to the NameNode.
  • The Client is any application that needs to access files in the distributed file system.

  2) Writing a file

  • The Client sends a file write request to the NameNode.
  • Based on the file size and the block configuration, the NameNode returns information about the DataNodes it manages.
  • The Client splits the file into blocks and, using the DataNode addresses, writes the blocks in order to the DataNodes.

  3) Reading a file

  • The Client sends a file read request to the NameNode.
  • The NameNode returns the DataNodes that store the file.
  • The Client reads the file from those DataNodes.

 

  A typical HDFS deployment runs the NameNode on a dedicated machine and one DataNode on each of the other machines in the cluster; it is also possible to run a DataNode on the NameNode's machine, or to run several DataNodes on one machine. Having a single NameNode per cluster greatly simplifies the architecture.

4. Strengths and weaknesses of HDFS

4.1 Strengths of HDFS

  1) Handles very large files

  Here "very large" means files ranging from hundreds of MB up to hundreds of TB. In practice, HDFS is already used to store and manage data at the PB scale.

  2) Streaming data access

  HDFS is designed around a write-once, read-many access pattern. Once a data set is generated by its source, it is copied and distributed to storage nodes and then serves all kinds of analysis requests. In most cases an analysis task touches a large fraction of the data set, so for HDFS it is more efficient to read the whole data set than to fetch a single record.

  3) Runs on clusters of cheap commodity machines

  Hadoop has modest hardware requirements: it only needs inexpensive commodity hardware rather than expensive high-availability machines. Cheap commodity machines also mean that node failures are very likely in a large cluster, so HDFS was designed with data reliability, safety, and availability firmly in mind.

4.2 Weaknesses of HDFS

  1) Not suitable for low-latency access

  HDFS is not a good fit for applications that need low-latency responses. It was built for analyzing large data sets and is optimized for throughput, which may come at the cost of latency.

  Mitigations: for applications that need low latency, HBase is a better choice; a data management layer on top of HDFS can compensate for much of this weakness and has improved performance considerably (its slogan is "goes real time"). Caching or a multi-master design can reduce the pressure of client requests and cut latency. Modifying HDFS internals is also possible, but then throughput must be traded against latency; HDFS is not a silver bullet.

  2) Cannot efficiently store huge numbers of small files

  Because the NameNode keeps the file system metadata in memory, the number of files the file system can hold is limited by the NameNode's memory. Roughly, every file, directory, and block takes about 150 bytes, so 1,000,000 files, each occupying one block, need at least 300 MB of memory. Millions of files are still feasible today, but scaling to billions is beyond current hardware. There is a second problem: the number of map tasks is determined by the number of splits, so running MapReduce over many small files produces far too many map tasks, and the thread management overhead inflates job time. For example, processing 10,000 MB of data with 1 MB splits creates 10,000 map tasks and a lot of thread overhead; with 100 MB splits there are only 100 map tasks, each doing more work, and the management overhead shrinks considerably.

  Mitigations: there are several ways to make HDFS handle small files better.

  • Archive small files with SequenceFile, MapFile, HAR files, and similar containers; the idea is to pack small files into larger ones for management (HBase builds on this idea). To get an individual small file back, you need the mapping into the archive file (a minimal sketch follows this list).
  • Scale horizontally: one Hadoop cluster can only manage so many small files, so put several Hadoop clusters behind a virtual server to form one large cluster. Google has done this as well.
  • Use a multi-master design, whose benefit is obvious. The GFS II design under development also moves to a distributed multi-master layout, supports master failover, and changes the block size to 1 MB, clearly aiming at small files.
  • For reference, Alibaba's DFS is also multi-master; it separates the storage and the management of the metadata mapping, with multiple metadata storage nodes and a single query master node.
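A hedged sketch of the SequenceFile idea: pack many small local files into one HDFS SequenceFile keyed by file name. The paths are made up; the calls used (SequenceFile.createWriter with Writer options) are standard Hadoop 2.x API:

import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path archive = new Path("/user/hadoop/smallfiles.seq");   // hypothetical output file

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(archive),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (File f : new File("/data/small").listFiles()) {  // hypothetical local directory
                byte[] bytes = Files.readAllBytes(f.toPath());
                // key = original file name, value = file contents;
                // this key is the mapping you need to retrieve the small file later.
                writer.append(new Text(f.getName()), new BytesWritable(bytes));
            }
        }
    }
}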

  3) No support for multiple writers or arbitrary file modification

  An HDFS file has a single writer, and writes can only happen at the end of the file, that is, appends only. HDFS does not yet support several users writing the same file, nor modifications at arbitrary positions within a file.

五、HDFS經常使用操做

  先說一下"hadoop fs 和hadoop dfs的區別",看兩本Hadoop書上各有用到,但效果同樣,求證與網絡發現下面一解釋比較中肯。

  粗略的講,fs是個比較抽象的層面,在分佈式環境中,fs就是dfs,但在本地環境中,fs是local file system,這個時候dfs就不能用。

5.1 文件操做

  1)列出HDFS文件

  此處爲你展現如何經過"-ls"命令列出HDFS下的文件:

 

hadoop fs -ls

 

  The result is shown in Figure 5-1-1. Note that "-ls" without an argument does not return everything; by default it returns the contents of your HDFS home directory. HDFS has no notion of a current directory and no cd command.

Figure 5-1-1 Listing HDFS files

  2) Listing the files inside an HDFS directory

  The "-ls directory" form lists the files in the HDFS directory named "input":

 

hadoop fs -ls input

 

  The result is shown in Figure 5-1-2.

Figure 5-1-2 Files in the HDFS directory named input

  3) Uploading a file to HDFS

  The "-put file1 file2" form uploads the file "file" from "/home/hadoop" on the machine "Master.Hadoop" to HDFS and renames it "test":

 

hadoop fs -put ~/file test

 

  The result is shown in Figure 5-1-3. A "-put" has only two possible outcomes: it succeeds or it fails. When a file is uploaded it is first copied to the DataNodes, and the upload only counts as successful once all of the target DataNodes have received the complete data; anything else (for example an interrupted upload) leaves HDFS with nothing useful.

Figure 5-1-3 Successfully uploading file to HDFS

  4) Copying an HDFS file to the local system

  The "-get file1 file2" form copies the HDFS file "output" to the local system under the name "getout":

 

hadoop fs -get output getout

 

  The result is shown in Figure 5-1-4.

Figure 5-1-4 Successfully copying the HDFS file output to the local system

  Note: like "-put", "-get" works on both files and directories.

  5) Deleting an HDFS directory

  The "-rmr path" form deletes the HDFS directory named "newoutput":

 

hadoop fs -rmr newoutput

 

  The result is shown in Figure 5-1-5.

Figure 5-1-5 Successfully deleting the HDFS directory newoutput

  6) Viewing an HDFS file

  The "-cat path" form prints the contents of the HDFS files under input:

 

hadoop fs -cat input/*

 

  The result is shown in Figure 5-1-6.

Figure 5-1-6 Contents of the HDFS files under input

  "hadoop fs" has far more commands than these, but the ones covered in this section handle most routine HDFS work. For the rest, the list printed by "-help commandName" is a good starting point for further exploration.

5.2 Administration and maintenance

  1) Reporting basic HDFS statistics

  The "-report" option shows basic HDFS statistics:

 

hadoop dfsadmin -report

 

  The result is shown in Figure 5-2-1.

Figure 5-2-1 Basic HDFS statistics

  2) Leaving safe mode

  The NameNode enters safe mode automatically at startup. Safe mode is a NameNode state in which the file system accepts no modifications. Its purpose is to check the validity of the data blocks on the DataNodes at startup and to replicate or delete blocks as the policy requires; once the minimum percentage of blocks satisfies the minimum replication condition, the NameNode leaves safe mode automatically.

  If the system reports "Name node is in safe mode", it is still in safe mode. You can simply wait a short while for it to exit on its own, or leave safe mode explicitly with the following command:

 

hadoop dfsadmin -safemode leave

 

  The result of successfully leaving safe mode is shown in Figure 5-2-2.

Figure 5-2-2 Successfully leaving safe mode

  3) Entering safe mode

  When necessary, HDFS can be put into safe mode with the following command:

 

hadoop dfsadmin -safemode enter

 

  The result is shown in Figure 5-2-3.

Figure 5-2-3 Entering HDFS safe mode

  4) Adding a node

  Scalability is an important property of HDFS, and adding a node to the cluster is easy. To add a new DataNode, install Hadoop on the new machine with the same configuration as the NameNode (you can copy it directly) and set "/usr/hadoop/conf/master" to the NameNode's hostname. On the NameNode, add the new node's hostname to "/usr/hadoop/conf/slaves", set up passwordless SSH to the new node, and then run the start command:

 

start-all.sh

 

  5) Load balancing

  Data can end up distributed very unevenly across DataNodes, especially after a DataNode fails or new DataNodes are added; the NameNode's placement choices for new blocks can also lead to an uneven block distribution. The following command rebalances the data blocks across the DataNodes:

 

start-balancer.sh

 

  Before running the command, the data distribution across the DataNodes is as shown in Figure 5-2-4.

  After balancing finishes, the distribution is as shown in Figure 5-2-5.

  Running the balancing command is shown in Figure 5-2-6.

 

  

  

====================================================================

=============================================================

http://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/FileSystemShell.html

Overview

The File System (FS) shell includes various shell-like commands that directly interact with the Hadoop Distributed File System (HDFS) as well as other file systems that Hadoop supports, such as Local FS, HFTP FS, S3 FS, and others. The FS shell is invoked by:

bin/hadoop fs <args>

All FS shell commands take path URIs as arguments. The URI format is scheme://authority/path. For HDFS the scheme is hdfs, and for the Local FS the scheme is file. The scheme and authority are optional. If not specified, the default scheme specified in the configuration is used. An HDFS file or directory such as /parent/child can be specified as hdfs://namenodehost/parent/child or simply as /parent/child (given that your configuration is set to point to hdfs://namenodehost).

Most of the commands in FS shell behave like corresponding Unix commands. Differences are described with each of the commands. Error information is sent to stderr and the output is sent to stdout.

If HDFS is being used, hdfs dfs is a synonym.

See the Commands Manual for generic shell options.

appendToFile

Usage: hadoop fs -appendToFile <localsrc> ... <dst>

Append single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and appends to destination file system.

  • hadoop fs -appendToFile localfile /user/hadoop/hadoopfile
  • hadoop fs -appendToFile localfile1 localfile2 /user/hadoop/hadoopfile
  • hadoop fs -appendToFile localfile hdfs://nn.example.com/hadoop/hadoopfile
  • hadoop fs -appendToFile - hdfs://nn.example.com/hadoop/hadoopfile Reads the input from stdin.

Exit Code:

Returns 0 on success and 1 on error.

cat

Usage: hadoop fs -cat URI [URI ...]

Copies source paths to stdout.

Example:

  • hadoop fs -cat hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
  • hadoop fs -cat file:///file3 /user/hadoop/file4

Exit Code:

Returns 0 on success and -1 on error.

checksum

Usage: hadoop fs -checksum URI

Returns the checksum information of a file.

Example:

  • hadoop fs -checksum hdfs://nn1.example.com/file1
  • hadoop fs -checksum file:///etc/hosts

chgrp

Usage: hadoop fs -chgrp [-R] GROUP URI [URI ...]

Change group association of files. The user must be the owner of files, or else a super-user. Additional information is in the Permissions Guide.

Options

  • The -R option will make the change recursively through the directory structure.

chmod

Usage: hadoop fs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]

Change the permissions of files. With -R, make the change recursively through the directory structure. The user must be the owner of the file, or else a super-user. Additional information is in the Permissions Guide.

Options

  • The -R option will make the change recursively through the directory structure.

chown

Usage: hadoop fs -chown [-R] [OWNER][:[GROUP]] URI [URI ]

Change the owner of files. The user must be a super-user. Additional information is in the Permissions Guide.

Options

  • The -R option will make the change recursively through the directory structure.

copyFromLocal

Usage: hadoop fs -copyFromLocal <localsrc> URI

Similar to put command, except that the source is restricted to a local file reference.

Options:

  • The -f option will overwrite the destination if it already exists.

copyToLocal

Usage: hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

Similar to get command, except that the destination is restricted to a local file reference.

count

Usage: hadoop fs -count [-q] [-h] [-v] <paths>

Count the number of directories, files and bytes under the paths that match the specified file pattern. The output columns with -count are: DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME

The output columns with -count -q are: QUOTA, REMAINING_QUATA, SPACE_QUOTA, REMAINING_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME

The -h option shows sizes in human readable format.

The -v option displays a header line.

Example:

  • hadoop fs -count hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
  • hadoop fs -count -q hdfs://nn1.example.com/file1
  • hadoop fs -count -q -h hdfs://nn1.example.com/file1
  • hdfs dfs -count -q -h -v hdfs://nn1.example.com/file1

Exit Code:

Returns 0 on success and -1 on error.

cp

Usage: hadoop fs -cp [-f] [-p | -p[topax]] URI [URI ...] <dest>

Copy files from source to destination. This command allows multiple sources as well in which case the destination must be a directory.

‘raw.*’ namespace extended attributes are preserved if (1) the source and destination filesystems support them (HDFS only), and (2) all source and destination pathnames are in the /.reserved/raw hierarchy. Determination of whether raw.* namespace xattrs are preserved is independent of the -p (preserve) flag.

Options:

  • The -f option will overwrite the destination if it already exists.
  • The -p option will preserve file attributes [topx] (timestamps, ownership, permission, ACL, XAttr). If -p is specified with no arg, then preserves timestamps, ownership, permission. If -pa is specified, then preserves permission also because ACL is a super-set of permission. Determination of whether raw namespace extended attributes are preserved is independent of the -p flag.

Example:

  • hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2
  • hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir

Exit Code:

Returns 0 on success and -1 on error.

createSnapshot

See HDFS Snapshots Guide.

deleteSnapshot

See HDFS Snapshots Guide.

df

Usage: hadoop fs -df [-h] URI [URI ...]

Displays free space.

Options:

  • The -h option will format file sizes in a "human-readable" fashion (e.g. 64.0m instead of 67108864)

Example:

  • hadoop dfs -df /user/hadoop/dir1

du

Usage: hadoop fs -du [-s] [-h] URI [URI ...]

Displays sizes of files and directories contained in the given directory, or the length of a file in case it's just a file.

Options:

  • The -s option will result in an aggregate summary of file lengths being displayed, rather than the individual files.
  • The -h option will format file sizes in a "human-readable" fashion (e.g. 64.0m instead of 67108864)

Example:

  • hadoop fs -du /user/hadoop/dir1 /user/hadoop/file1 hdfs://nn.example.com/user/hadoop/dir1

Exit Code: Returns 0 on success and -1 on error.

dus

Usage: hadoop fs -dus <args>

Displays a summary of file lengths.

Note: This command is deprecated. Instead use hadoop fs -du -s.

expunge

Usage: hadoop fs -expunge

Permanently delete files in checkpoints older than the retention threshold from trash directory, and create new checkpoint.

When checkpoint is created, recently deleted files in trash are moved under the checkpoint. Files in checkpoints older than fs.trash.checkpoint.interval will be permanently deleted on the next invocation of -expunge command.

If the file system supports the feature, users can configure to create and delete checkpoints periodically by the parameter stored as fs.trash.checkpoint.interval (in core-site.xml). This value should be smaller or equal to fs.trash.interval.

Refer to the HDFS Architecture guide for more information about trash feature of HDFS.

find

Usage: hadoop fs -find <path> ... <expression> ...

Finds all files that match the specified expression and applies selected actions to them. If no path is specified then defaults to the current working directory. If no expression is specified then defaults to -print.

The following primary expressions are recognised:

  • -name pattern
    -iname pattern

    Evaluates as true if the basename of the file matches the pattern using standard file system globbing. If -iname is used then the match is case insensitive.

  • -print
    -print0

    Always evaluates to true. Causes the current pathname to be written to standard output. If the -print0 expression is used then an ASCII NULL character is appended.

The following operators are recognised:

  • expression -a expression
    expression -and expression
    expression expression

    Logical AND operator for joining two expressions. Returns true if both child expressions return true. Implied by the juxtaposition of two expressions and so does not need to be explicitly specified. The second expression will not be applied if the first fails.

Example:

hadoop fs -find / -name test -print

Exit Code:

Returns 0 on success and -1 on error.

get

Usage: hadoop fs -get [-ignorecrc] [-crc] <src> <localdst>

Copy files to the local file system. Files that fail the CRC check may be copied with the -ignorecrc option. Files and CRCs may be copied using the -crc option.

Example:

  • hadoop fs -get /user/hadoop/file localfile
  • hadoop fs -get hdfs://nn.example.com/user/hadoop/file localfile

Exit Code:

Returns 0 on success and -1 on error.

getfacl

Usage: hadoop fs -getfacl [-R] <path>

Displays the Access Control Lists (ACLs) of files and directories. If a directory has a default ACL, then getfacl also displays the default ACL.

Options:

  • -R: List the ACLs of all files and directories recursively.
  • path: File or directory to list.

Examples:

  • hadoop fs -getfacl /file
  • hadoop fs -getfacl -R /dir

Exit Code:

Returns 0 on success and non-zero on error.

getfattr

Usage: hadoop fs -getfattr [-R] -n name | -d [-e en] <path>

Displays the extended attribute names and values (if any) for a file or directory.

Options:

  • -R: Recursively list the attributes for all files and directories.
  • -n name: Dump the named extended attribute value.
  • -d: Dump all extended attribute values associated with pathname.
  • -e encoding: Encode values after retrieving them. Valid encodings are "text", "hex", and "base64". Values encoded as text strings are enclosed in double quotes ("), and values encoded as hexadecimal and base64 are prefixed with 0x and 0s, respectively.
  • path: The file or directory.

Examples:

  • hadoop fs -getfattr -d /file
  • hadoop fs -getfattr -R -n user.myAttr /dir

Exit Code:

Returns 0 on success and non-zero on error.

getmerge

Usage: hadoop fs -getmerge [-nl] <src> <localdst>

Takes a source directory and a destination file as input and concatenates files in src into the destination local file. Optionally -nl can be set to enable adding a newline character (LF) at the end of each file.

Examples:

  • hadoop fs -getmerge -nl /src /opt/output.txt
  • hadoop fs -getmerge -nl /src/file1.txt /src/file2.txt /output.txt

Exit Code:

Returns 0 on success and non-zero on error.

help

Usage: hadoop fs -help

Return usage output.

ls

Usage: hadoop fs -ls [-d] [-h] [-R] <args>

Options:

  • -d: Directories are listed as plain files.
  • -h: Format file sizes in a human-readable fashion (eg 64.0m instead of 67108864).
  • -R: Recursively list subdirectories encountered.

For a file ls returns stat on the file with the following format:

permissions number_of_replicas userid groupid filesize modification_date modification_time filename

For a directory it returns list of its direct children as in Unix. A directory is listed as:

permissions userid groupid modification_date modification_time dirname

Files within a directory are ordered by filename by default.

Example:

  • hadoop fs -ls /user/hadoop/file1

Exit Code:

Returns 0 on success and -1 on error.

lsr

Usage: hadoop fs -lsr <args>

Recursive version of ls.

Note: This command is deprecated. Instead use hadoop fs -ls -R

mkdir

Usage: hadoop fs -mkdir [-p] <paths>

Takes path uri’s as argument and creates directories.

Options:

  • The -p option behavior is much like Unix mkdir -p, creating parent directories along the path.

Example:

  • hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
  • hadoop fs -mkdir hdfs://nn1.example.com/user/hadoop/dir hdfs://nn2.example.com/user/hadoop/dir

Exit Code:

Returns 0 on success and -1 on error.

moveFromLocal

Usage: hadoop fs -moveFromLocal <localsrc> <dst>

Similar to put command, except that the source localsrc is deleted after it’s copied.

moveToLocal

Usage: hadoop fs -moveToLocal [-crc] <src> <dst>

Displays a "Not implemented yet" message.

mv

Usage: hadoop fs -mv URI [URI ...] <dest>

Moves files from source to destination. This command allows multiple sources as well in which case the destination needs to be a directory. Moving files across file systems is not permitted.

Example:

  • hadoop fs -mv /user/hadoop/file1 /user/hadoop/file2
  • hadoop fs -mv hdfs://nn.example.com/file1 hdfs://nn.example.com/file2 hdfs://nn.example.com/file3 hdfs://nn.example.com/dir1

Exit Code:

Returns 0 on success and -1 on error.

put

Usage: hadoop fs -put <localsrc> ... <dst>

Copy single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and writes to destination file system.

  • hadoop fs -put localfile /user/hadoop/hadoopfile
  • hadoop fs -put localfile1 localfile2 /user/hadoop/hadoopdir
  • hadoop fs -put localfile hdfs://nn.example.com/hadoop/hadoopfile
  • hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile Reads the input from stdin.

Exit Code:

Returns 0 on success and -1 on error.

renameSnapshot

See HDFS Snapshots Guide.

rm

Usage: hadoop fs -rm [-f] [-r |-R] [-skipTrash] URI [URI ...]

Delete files specified as args.

If trash is enabled, file system instead moves the deleted file to a trash directory (given by FileSystem#getTrashRoot).

Currently, the trash feature is disabled by default. User can enable trash by setting a value greater than zero for parameter fs.trash.interval (in core-site.xml).

See expunge about deletion of files in trash.

Options:

  • The -f option will not display a diagnostic message or modify the exit status to reflect an error if the file does not exist.
  • The -R option deletes the directory and any content under it recursively.
  • The -r option is equivalent to -R.
  • The -skipTrash option will bypass trash, if enabled, and delete the specified file(s) immediately. This can be useful when it is necessary to delete files from an over-quota directory.

Example:

  • hadoop fs -rm hdfs://nn.example.com/file /user/hadoop/emptydir

Exit Code:

Returns 0 on success and -1 on error.

rmdir

Usage: hadoop fs -rmdir [--ignore-fail-on-non-empty] URI [URI ...]

Delete a directory.

Options:

  • --ignore-fail-on-non-empty: When using wildcards, do not fail if a directory still contains files.

Example:

  • hadoop fs -rmdir /user/hadoop/emptydir

rmr

Usage: hadoop fs -rmr [-skipTrash] URI [URI ...]

Recursive version of delete.

Note: This command is deprecated. Instead use hadoop fs -rm -r

setfacl

Usage: hadoop fs -setfacl [-R] [-b |-k -m |-x <acl_spec> <path>] |[--set <acl_spec> <path>]

Sets Access Control Lists (ACLs) of files and directories.

Options:

  • -b: Remove all but the base ACL entries. The entries for user, group and others are retained for compatibility with permission bits.
  • -k: Remove the default ACL.
  • -R: Apply operations to all files and directories recursively.
  • -m: Modify ACL. New entries are added to the ACL, and existing entries are retained.
  • -x: Remove specified ACL entries. Other ACL entries are retained.
  • --set: Fully replace the ACL, discarding all existing entries. The acl_spec must include entries for user, group, and others for compatibility with permission bits.
  • acl_spec: Comma separated list of ACL entries.
  • path: File or directory to modify.

Examples:

  • hadoop fs -setfacl -m user:hadoop:rw- /file
  • hadoop fs -setfacl -x user:hadoop /file
  • hadoop fs -setfacl -b /file
  • hadoop fs -setfacl -k /dir
  • hadoop fs -setfacl --set user::rw-,user:hadoop:rw-,group::r--,other::r-- /file
  • hadoop fs -setfacl -R -m user:hadoop:r-x /dir
  • hadoop fs -setfacl -m default:user:hadoop:r-x /dir

Exit Code:

Returns 0 on success and non-zero on error.

setfattr

Usage: hadoop fs -setfattr -n name [-v value] | -x name <path>

Sets an extended attribute name and value for a file or directory.

Options:

  • -b: Remove all but the base ACL entries. The entries for user, group and others are retained for compatibility with permission bits.
  • -n name: The extended attribute name.
  • -v value: The extended attribute value. There are three different encoding methods for the value. If the argument is enclosed in double quotes, then the value is the string inside the quotes. If the argument is prefixed with 0x or 0X, then it is taken as a hexadecimal number. If the argument begins with 0s or 0S, then it is taken as a base64 encoding.
  • -x name: Remove the extended attribute.
  • path: The file or directory.

Examples:

  • hadoop fs -setfattr -n user.myAttr -v myValue /file
  • hadoop fs -setfattr -n user.noValue /file
  • hadoop fs -setfattr -x user.myAttr /file

Exit Code:

Returns 0 on success and non-zero on error.

setrep

Usage: hadoop fs -setrep [-R] [-w] <numReplicas> <path>

Changes the replication factor of a file. If path is a directory then the command recursively changes the replication factor of all files under the directory tree rooted at path.

Options:

  • The -w flag requests that the command wait for the replication to complete. This can potentially take a very long time.
  • The -R flag is accepted for backwards compatibility. It has no effect.

Example:

  • hadoop fs -setrep -w 3 /user/hadoop/dir1

Exit Code:

Returns 0 on success and -1 on error.

stat

Usage: hadoop fs -stat [format] <path> ...

Print statistics about the file/directory at <path> in the specified format. Format accepts filesize in blocks (%b), type (%F), group name of owner (%g), name (%n), block size (%o), replication (%r), user name of owner (%u), and modification date (%y, %Y). %y shows UTC date as "yyyy-MM-dd HH:mm:ss" and %Y shows milliseconds since January 1, 1970 UTC. If the format is not specified, %y is used by default.

Example:

  • hadoop fs -stat "%F %u:%g %b %y %n" /file

Exit Code: Returns 0 on success and -1 on error.

tail

Usage: hadoop fs -tail [-f] URI

Displays last kilobyte of the file to stdout.

Options:

  • The -f option will output appended data as the file grows, as in Unix.

Example:

  • hadoop fs -tail pathname

Exit Code: Returns 0 on success and -1 on error.

test

Usage: hadoop fs -test -[defsz] URI

Options:

  • -d: if the path is a directory, return 0.
  • -e: if the path exists, return 0.
  • -f: if the path is a file, return 0.
  • -s: if the path is not empty, return 0.
  • -z: if the file is zero length, return 0.

Example:

  • hadoop fs -test -e filename

text

Usage: hadoop fs -text <src>

Takes a source file and outputs the file in text format. The allowed formats are zip and TextRecordInputStream.

touchz

Usage: hadoop fs -touchz URI [URI ...]

Create a file of zero length.

Example:

  • hadoop fs -touchz pathname

Exit Code: Returns 0 on success and -1 on error.

truncate

Usage: hadoop fs -truncate [-w] <length> <paths>

Truncate all files that match the specified file pattern to the specified length.

Options:

  • The -w flag requests that the command waits for block recovery to complete, if necessary. Without -w flag the file may remain unclosed for some time while the recovery is in progress. During this time file cannot be reopened for append.

Example:

  • hadoop fs -truncate 55 /user/hadoop/file1 /user/hadoop/file2
  • hadoop fs -truncate -w 127 hdfs://nn1.example.com/user/hadoop/file1

usage

Usage: hadoop fs -usage command

Return the help for an individual command.

 

 

 

http://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html

 

HDFS Commands Guide

Overview

All HDFS commands are invoked by the bin/hdfs script. Running the hdfs script without any arguments prints the description for all commands.

Usage: hdfs [SHELL_OPTIONS] COMMAND [GENERIC_OPTIONS] [COMMAND_OPTIONS]

Hadoop has an option parsing framework that employs parsing generic options as well as running classes.

COMMAND_OPTIONS Description
--config
--loglevel
The common set of shell options. These are documented on the Commands Manual page.
GENERIC_OPTIONS The common set of options supported by multiple commands. See the Hadoop Commands Manual for more information.
COMMAND COMMAND_OPTIONS Various commands with their options are described in the following sections. The commands have been grouped into User Commands and Administration Commands.

User Commands

Commands useful for users of a hadoop cluster.

classpath

Usage: hdfs classpath

Prints the class path needed to get the Hadoop jar and the required libraries

dfs

Usage: hdfs dfs [COMMAND [COMMAND_OPTIONS]]

Run a filesystem command on the file system supported in Hadoop. The various COMMAND_OPTIONS can be found at File System Shell Guide.

fetchdt

Usage: hdfs fetchdt [--webservice <namenode_http_addr>] <path>

COMMAND_OPTION Description
--webservice https_address use http protocol instead of RPC
fileName File name to store the token into.

Gets Delegation Token from a NameNode. See fetchdt for more info.

fsck

Usage:

   hdfs fsck <path>
          [-list-corruptfileblocks |
          [-move | -delete | -openforwrite]
          [-files [-blocks [-locations | -racks]]]
          [-includeSnapshots]
          [-storagepolicies] [-blockId <blk_Id>]
COMMAND_OPTION Description
path Start checking from this path.
-delete Delete corrupted files.
-files Print out files being checked.
-files -blocks Print out the block report
-files -blocks -locations Print out locations for every block.
-files -blocks -racks Print out network topology for data-node locations.
-includeSnapshots Include snapshot data if the given path indicates a snapshottable directory or there are snapshottable directories under it.
-list-corruptfileblocks Print out list of missing blocks and files they belong to.
-move Move corrupted files to /lost+found.
-openforwrite Print out files opened for write.
-storagepolicies Print out storage policy summary for the blocks.
-blockId Print out information about the block.

Runs the HDFS filesystem checking utility. See fsck for more info.

getconf

Usage:

   hdfs getconf -namenodes
   hdfs getconf -secondaryNameNodes
   hdfs getconf -backupNodes
   hdfs getconf -includeFile
   hdfs getconf -excludeFile
   hdfs getconf -nnRpcAddresses
   hdfs getconf -confKey [key]
COMMAND_OPTION Description
-namenodes gets list of namenodes in the cluster.
-secondaryNameNodes gets list of secondary namenodes in the cluster.
-backupNodes gets list of backup nodes in the cluster.
-includeFile gets the include file path that defines the datanodes that can join the cluster.
-excludeFile gets the exclude file path that defines the datanodes that need to decommissioned.
-nnRpcAddresses gets the namenode rpc addresses
-confKey [key] gets a specific key from the configuration

Gets configuration information from the configuration directory, post-processing.

groups

Usage: hdfs groups [username ...]

Returns the group information given one or more usernames.

lsSnapshottableDir

Usage: hdfs lsSnapshottableDir [-help]

COMMAND_OPTION Description
-help print help

Get the list of snapshottable directories. When this is run as a super user, it returns all snapshottable directories. Otherwise it returns those directories that are owned by the current user.

jmxget

Usage: hdfs jmxget [-localVM ConnectorURL | -port port | -server mbeanserver | -service service]

COMMAND_OPTION Description
-help print help
-localVM ConnectorURL connect to the VM on the same machine
-port mbean server port specify mbean server port, if missing it will try to connect to MBean Server in the same VM
-service specify jmx service, either DataNode or NameNode (the default)

Dump JMX information from a service.

oev

Usage: hdfs oev [OPTIONS] -i INPUT_FILE -o OUTPUT_FILE

Required command line arguments:

COMMAND_OPTION Description
-i,--inputFile arg edits file to process, xml (case insensitive) extension means XML format, any other filename means binary format
-o,--outputFile arg Name of output file. If the specified file exists, it will be overwritten, format of the file is determined by -p option

Optional command line arguments:

COMMAND_OPTION Description
-f,--fix-txids Renumber the transaction IDs in the input, so that there are no gaps or invalid transaction IDs.
-h,--help Display usage information and exit
-r,--recover When reading binary edit logs, use recovery mode. This will give you the chance to skip corrupt parts of the edit log.
-p,--processor arg Select which type of processor to apply against image file, currently supported processors are: binary (native binary format that Hadoop uses), xml (default, XML format), stats (prints statistics about edits file)
-v,--verbose More verbose output, prints the input and output filenames, for processors that write to a file, also output to screen. On large image files this will dramatically increase processing time (default is false).

Hadoop offline edits viewer.

oiv

Usage: hdfs oiv [OPTIONS] -i INPUT_FILE

Required command line arguments:

COMMAND_OPTION Description
-i,--inputFile arg image file to process, xml (case insensitive) extension means XML format, any other filename means binary format

Optional command line arguments:

COMMAND_OPTION Description
-h,--help Display usage information and exit
-o,--outputFile arg Name of output file. If the specified file exists, it will be overwritten, format of the file is determined by -p option
-p,--processor arg Select which type of processor to apply against image file, currently supported processors are: binary (native binary format that Hadoop uses), xml (default, XML format), stats (prints statistics about edits file)

Hadoop Offline Image Viewer for newer image files.

oiv_legacy

Usage: hdfs oiv_legacy [OPTIONS] -i INPUT_FILE -o OUTPUT_FILE

COMMAND_OPTION Description
-h,--help Display usage information and exit
-i,--inputFile arg edits file to process, xml (case insensitive) extension means XML format, any other filename means binary format
-o,--outputFile arg Name of output file. If the specified file exists, it will be overwritten, format of the file is determined by -p option

Hadoop offline image viewer for older versions of Hadoop.

snapshotDiff

Usage: hdfs snapshotDiff <path> <fromSnapshot> <toSnapshot>

Determine the difference between HDFS snapshots. See the HDFS Snapshot Documentation for more information.

version

Usage: hdfs version

Prints the version.

Administration Commands

Commands useful for administrators of a hadoop cluster.

balancer

Usage:

    hdfs balancer
          [-threshold <threshold>]
          [-policy <policy>]
          [-exclude [-f <hosts-file> | <comma-separated list of hosts>]]
          [-include [-f <hosts-file> | <comma-separated list of hosts>]]
          [-idleiterations <idleiterations>]
COMMAND_OPTION Description
-policy <policy> datanode (default): Cluster is balanced if each datanode is balanced.
blockpool: Cluster is balanced if each block pool in each datanode is balanced.
-threshold <threshold> Percentage of disk capacity. This overwrites the default threshold.
-exclude -f <hosts-file> | <comma-separated list of hosts> Excludes the specified datanodes from being balanced by the balancer.
-include -f <hosts-file> | <comma-separated list of hosts> Includes only the specified datanodes to be balanced by the balancer.
-idleiterations <iterations> Maximum number of idle iterations before exit. This overwrites the default idleiterations(5).

Runs a cluster balancing utility. An administrator can simply press Ctrl-C to stop the rebalancing process. See Balancer for more details.

Note that the blockpool policy is more strict than the datanode policy.

cacheadmin

Usage: hdfs cacheadmin -addDirective -path <path> -pool <pool-name> [-force] [-replication <replication>] [-ttl <time-to-live>]

See the HDFS Cache Administration Documentation for more information.

crypto

Usage:

  hdfs crypto -createZone -keyName <keyName> -path <path>
  hdfs crypto -help <command-name>
  hdfs crypto -listZones

See the HDFS Transparent Encryption Documentation for more information.

datanode

Usage: hdfs datanode [-regular | -rollback | -rollingupgrade rollback]

COMMAND_OPTION Description
-regular Normal datanode startup (default).
-rollback Rollback the datanode to the previous version. This should be used after stopping the datanode and distributing the old hadoop version.
-rollingupgrade rollback Rollback a rolling upgrade operation.

Runs a HDFS datanode.

dfsadmin

Usage:

    hdfs dfsadmin [GENERIC_OPTIONS]
          [-report [-live] [-dead] [-decommissioning]]
          [-safemode enter | leave | get | wait]
          [-saveNamespace]
          [-rollEdits]
          [-restoreFailedStorage true |false |check]
          [-refreshNodes]
          [-setQuota <quota> <dirname>...<dirname>]
          [-clrQuota <dirname>...<dirname>]
          [-setSpaceQuota <quota> <dirname>...<dirname>]
          [-clrSpaceQuota <dirname>...<dirname>]
          [-setStoragePolicy <path> <policyName>]
          [-getStoragePolicy <path>]
          [-finalizeUpgrade]
          [-rollingUpgrade [<query> |<prepare> |<finalize>]]
          [-metasave filename]
          [-refreshServiceAcl]
          [-refreshUserToGroupsMappings]
          [-refreshSuperUserGroupsConfiguration]
          [-refreshCallQueue]
          [-refresh <host:ipc_port> <key> [arg1..argn]]
          [-reconfig <datanode |...> <host:ipc_port> <start |status>]
          [-printTopology]
          [-refreshNamenodes datanodehost:port]
          [-deleteBlockPool datanode-host:port blockpoolId [force]]
          [-setBalancerBandwidth <bandwidth in bytes per second>]
          [-allowSnapshot <snapshotDir>]
          [-disallowSnapshot <snapshotDir>]
          [-fetchImage <local directory>]
          [-shutdownDatanode <datanode_host:ipc_port> [upgrade]]
          [-getDatanodeInfo <datanode_host:ipc_port>]
          [-triggerBlockReport [-incremental] <datanode_host:ipc_port>]
          [-help [cmd]]
COMMAND_OPTION Description
-report [-live] [-dead] [-decommissioning] Reports basic filesystem information and statistics. Optional flags may be used to filter the list of displayed DataNodes.
-safemode enter|leave|get|wait Safe mode maintenance command. Safe mode is a Namenode state in which it 
1. does not accept changes to the name space (read-only) 
2. does not replicate or delete blocks. 
Safe mode is entered automatically at Namenode startup, and leaves safe mode automatically when the configured minimum percentage of blocks satisfies the minimum replication condition. Safe mode can also be entered manually, but then it can only be turned off manually as well.
-saveNamespace Save current namespace into storage directories and reset edits log. Requires safe mode.
-rollEdits Rolls the edit log on the active NameNode.
-restoreFailedStorage true|false|check This option will turn on/off automatic attempt to restore failed storage replicas. If a failed storage becomes available again the system will attempt to restore edits and/or fsimage during checkpoint. ‘check’ option will return current setting.
-refreshNodes Re-read the hosts and exclude files to update the set of Datanodes that are allowed to connect to the Namenode and those that should be decommissioned or recommissioned.
-setQuota <quota> <dirname>…<dirname> See HDFS Quotas Guide for the detail.
-clrQuota <dirname>…<dirname> See HDFS Quotas Guide for the detail.
-setSpaceQuota <quota> <dirname>…<dirname> See HDFS Quotas Guide for the detail.
-clrSpaceQuota <dirname>…<dirname> See HDFS Quotas Guide for the detail.
-setStoragePolicy <path> <policyName> Set a storage policy to a file or a directory.
-getStoragePolicy <path> Get the storage policy of a file or a directory.
-finalizeUpgrade Finalize upgrade of HDFS. Datanodes delete their previous version working directories, followed by Namenode doing the same. This completes the upgrade process.
-rollingUpgrade [<query>|<prepare>|<finalize>] See Rolling Upgrade document for the detail.
-metasave filename Save Namenode’s primary data structures to filename in the directory specified by hadoop.log.dir property. filename is overwritten if it exists. filename will contain one line for each of the following
1. Datanodes heart beating with Namenode
2. Blocks waiting to be replicated
3. Blocks currently being replicated
4. Blocks waiting to be deleted
-refreshServiceAcl Reload the service-level authorization policy file.
-refreshUserToGroupsMappings Refresh user-to-groups mappings.
-refreshSuperUserGroupsConfiguration Refresh superuser proxy groups mappings
-refreshCallQueue Reload the call queue from config.
-refresh <host:ipc_port> <key> [arg1..argn] Triggers a runtime-refresh of the resource specified by <key> on <host:ipc_port>. All other args after are sent to the host.
-reconfig <datanode |…> <host:ipc_port> <start|status> Start reconfiguration or get the status of an ongoing reconfiguration. The second parameter specifies the node type. Currently, only reloading DataNode’s configuration is supported.
-printTopology Print a tree of the racks and their nodes as reported by the Namenode
-refreshNamenodes datanodehost:port For the given datanode, reloads the configuration files, stops serving the removed block-pools and starts serving new block-pools.
-deleteBlockPool datanode-host:port blockpoolId [force] If force is passed, block pool directory for the given blockpool id on the given datanode is deleted along with its contents, otherwise the directory is deleted only if it is empty. The command will fail if datanode is still serving the block pool. Refer to refreshNamenodes to shutdown a block pool service on a datanode.
-setBalancerBandwidth <bandwidth in bytes per second> Changes the network bandwidth used by each datanode during HDFS block balancing. <bandwidth> is the maximum number of bytes per second that will be used by each datanode. This value overrides the dfs.balance.bandwidthPerSec parameter. NOTE: The new value is not persistent on the DataNode.
-allowSnapshot <snapshotDir> Allowing snapshots of a directory to be created. If the operation completes successfully, the directory becomes snapshottable. See the HDFS Snapshot Documentation for more information.
-disallowSnapshot <snapshotDir> Disallowing snapshots of a directory to be created. All snapshots of the directory must be deleted before disallowing snapshots. See the HDFS Snapshot Documentation for more information.
-fetchImage <local directory> Downloads the most recent fsimage from the NameNode and saves it in the specified local directory.
-shutdownDatanode <datanode_host:ipc_port> [upgrade] Submit a shutdown request for the given datanode. See Rolling Upgrade document for the detail.
-getDatanodeInfo <datanode_host:ipc_port> Get the information about the given datanode. See Rolling Upgrade document for the detail.
-triggerBlockReport [-incremental] <datanode_host:ipc_port> Trigger a block report for the given datanode. If 'incremental' is specified, it will be an incremental block report; otherwise, it will be a full block report.
-help [cmd] Displays help for the given command or all commands if none is specified.

Runs a HDFS dfsadmin client.

haadmin

Usage:

    hdfs haadmin -checkHealth <serviceId>
    hdfs haadmin -failover [--forcefence] [--forceactive] <serviceId> <serviceId>
    hdfs haadmin -getServiceState <serviceId>
    hdfs haadmin -help <command>
    hdfs haadmin -transitionToActive <serviceId> [--forceactive]
    hdfs haadmin -transitionToStandby <serviceId>
COMMAND_OPTION Description
-checkHealth check the health of the given NameNode
-failover initiate a failover between two NameNodes
-getServiceState determine whether the given NameNode is Active or Standby
-transitionToActive transition the state of the given NameNode to Active (Warning: No fencing is done)
-transitionToStandby transition the state of the given NameNode to Standby (Warning: No fencing is done)

See HDFS HA with NFS or HDFS HA with QJM for more information on this command.

journalnode

Usage: hdfs journalnode

This command starts a journalnode for use with HDFS HA with QJM.

mover

Usage: hdfs mover [-p <files/dirs> | -f <local file name>]

COMMAND_OPTION Description
-f <local file> Specify a local file containing a list of HDFS files/dirs to migrate.
-p <files/dirs> Specify a space separated list of HDFS files/dirs to migrate.

Runs the data migration utility. See Mover for more details.

Note that, when both -p and -f options are omitted, the default path is the root directory.

namenode

Usage:

  hdfs namenode [-backup] |
          [-checkpoint] |
          [-format [-clusterid cid ] [-force] [-nonInteractive] ] |
          [-upgrade [-clusterid cid] [-renameReserved<k-v pairs>] ] |
          [-upgradeOnly [-clusterid cid] [-renameReserved<k-v pairs>] ] |
          [-rollback] |
          [-rollingUpgrade <downgrade |rollback> ] |
          [-finalize] |
          [-importCheckpoint] |
          [-initializeSharedEdits] |
          [-bootstrapStandby] |
          [-recover [-force] ] |
          [-metadataVersion ]
COMMAND_OPTION Description
-backup Start backup node.
-checkpoint Start checkpoint node.
-format [-clusterid cid] [-force] [-nonInteractive] Formats the specified NameNode. It starts the NameNode, formats it and then shuts it down. -force option formats if the name directory exists. -nonInteractive option aborts if the name directory exists, unless -force option is specified.
-upgrade [-clusterid cid] [-renameReserved <k-v pairs>] Namenode should be started with upgrade option after the distribution of new Hadoop version.
-upgradeOnly [-clusterid cid] [-renameReserved <k-v pairs>] Upgrade the specified NameNode and then shut it down.
-rollback Rollback the NameNode to the previous version. This should be used after stopping the cluster and distributing the old Hadoop version.
-rollingUpgrade <downgrade|rollback|started> See Rolling Upgrade document for the detail.
-finalize Finalize will remove the previous state of the files system. Recent upgrade will become permanent. Rollback option will not be available anymore. After finalization it shuts the NameNode down.
-importCheckpoint Loads image from a checkpoint directory and save it into the current one. Checkpoint dir is read from property fs.checkpoint.dir
-initializeSharedEdits Format a new shared edits dir and copy in enough edit log segments so that the standby NameNode can start up.
-bootstrapStandby Allows the standby NameNode’s storage directories to be bootstrapped by copying the latest namespace snapshot from the active NameNode. This is used when first configuring an HA cluster.
-recover [-force] Recover lost metadata on a corrupt filesystem. See HDFS User Guide for the detail.
-metadataVersion Verify that configured directories exist, then print the metadata versions of the software and the image.

Runs the namenode. More info about the upgrade, rollback and finalize is at Upgrade Rollback.

nfs3

Usage: hdfs nfs3

This command starts the NFS3 gateway for use with the HDFS NFS3 Service.

portmap

Usage: hdfs portmap

This command starts the RPC portmap for use with the HDFS NFS3 Service.

secondarynamenode

Usage: hdfs secondarynamenode [-checkpoint [force]] | [-format] | [-geteditsize]

COMMAND_OPTION Description
-checkpoint [force] Checkpoints the SecondaryNameNode if EditLog size >= fs.checkpoint.size. If force is used, checkpoint irrespective of EditLog size.
-format Format the local storage during startup.
-geteditsize Prints the number of uncheckpointed transactions on the NameNode.

Runs the HDFS secondary namenode. See Secondary Namenode for more info.

storagepolicies

Usage: hdfs storagepolicies

Lists out all storage policies. See the HDFS Storage Policy Documentation for more information.

zkfc

Usage: hdfs zkfc [-formatZK [-force] [-nonInteractive]]

COMMAND_OPTION Description
-formatZK Format the Zookeeper instance
-h Display help

This command starts a Zookeeper Failover Controller process for use with HDFS HA with QJM.

Debug Commands

Useful commands to help administrators debug HDFS issues, like validating block files and calling recoverLease.

verify

Usage: hdfs debug verify [-meta <metadata-file>] [-block <block-file>]

COMMAND_OPTION Description
-block block-file Optional parameter to specify the absolute path for the block file on the local file system of the data node.
-meta metadata-file Absolute path for the metadata file on the local file system of the data node.

Verify HDFS metadata and block files. If a block file is specified, we will verify that the checksums in the metadata file match the block file.

recoverLease

Usage: hdfs debug recoverLease [-path <path>] [-retries <num-retries>]

COMMAND_OPTION Description
[-path path] HDFS path for which to recover the lease.
[-retries num-retries] Number of times the client will retry calling recoverLease. The default number of retries is 1.

Recover the lease on the specified path. The path must reside on an HDFS filesystem. The default number of retries is 1.

 


6. The HDFS API in detail

  Hadoop's file-manipulation classes live almost entirely in the "org.apache.hadoop.fs" package. The operations they support include opening, reading, writing, and deleting files, among others.

  The class the Hadoop library ultimately exposes to users is FileSystem. It is abstract, and a concrete instance can only be obtained through its get methods. get has several overloads; the most commonly used is:

 

static FileSystem get(Configuration conf);

 

  該類封裝了幾乎全部的文件操做,例如mkdir,delete等。綜上基本上能夠得出操做文件的程序庫框架:

 

operator()

{

    獲得Configuration對象

    獲得FileSystem對象

    進行文件操做

}
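Filled in as compilable Java, that skeleton looks roughly like the following sketch (the directory name is a placeholder of my own):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Skeleton {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // 1. obtain a Configuration
        FileSystem hdfs = FileSystem.get(conf);            // 2. obtain the FileSystem it describes
        boolean ok = hdfs.mkdirs(new Path("/tmp/demo"));   // 3. perform the file operation
        System.out.println("mkdirs returned " + ok);
        hdfs.close();
    }
}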

 

6.1 Uploading a local file

  "FileSystem.copyFromLocalFile(Path src, Path dst)" uploads a local file to the given location in HDFS, where src and dst are both full paths. Example:

 

package com.hebut.file;

 

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.fs.FileStatus;

import org.apache.hadoop.fs.FileSystem;

import org.apache.hadoop.fs.Path;

 

public class CopyFile {

    public static void main(String[] args) throws Exception {

        Configuration conf=new Configuration();

        FileSystem hdfs=FileSystem.get(conf);

       

        // local source file

        Path src =new Path("D:\\HebutWinOS");

        // destination path in HDFS

        Path dst =new Path("/");

       

        hdfs.copyFromLocalFile(src, dst);

        System.out.println("Upload to"+conf.get("fs.default.name"));

       

        FileStatus files[]=hdfs.listStatus(dst);

        for(FileStatus file:files){

            System.out.println(file.getPath());

        }

    }

}

 

  The result can be checked in the console, the project browser, and SecureCRT, as shown in Figures 6-1-1, 6-1-2, and 6-1-3.

  1) Console output

Figure 6-1-1 Result (1)

  2) Project browser

Figure 6-1-2 Result (2)

  3) SecureCRT output

Figure 6-1-3 Result (3)

6.2 Creating an HDFS file

  "FileSystem.create(Path f)" creates a file in HDFS, where f is the file's full path. Example:

 

package com.hebut.file;

 

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.fs.FSDataOutputStream;

import org.apache.hadoop.fs.FileSystem;

import org.apache.hadoop.fs.Path;

 

public class CreateFile {

 

    public static void main(String[] args) throws Exception {

        Configuration conf=new Configuration();

        FileSystem hdfs=FileSystem.get(conf);

       

        byte[] buff="hello hadoop world!\n".getBytes();

       

        Path dfs=new Path("/test");

       

        FSDataOutputStream outputStream=hdfs.create(dfs);

        outputStream.write(buff,0,buff.length);
        outputStream.close();   // close the stream so the write is flushed and the file is finalized

    }

}

 

  The result is shown in Figures 6-2-1 and 6-2-2.

  1) Project browser

Figure 6-2-1 Result (1)

  2) SecureCRT output

Figure 6-2-2 Result (2)

6.3 Creating an HDFS directory

  "FileSystem.mkdirs(Path f)" creates a directory in HDFS, where f is the directory's full path. Example:

 

package com.hebut.dir;

 

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.fs.FileSystem;

import org.apache.hadoop.fs.Path;

 

public class CreateDir {

 

    public static void main(String[] args) throws Exception{

        Configuration conf=new Configuration();

        FileSystem hdfs=FileSystem.get(conf);

       

        Path dfs=new Path("/TestDir");

       

        hdfs.mkdirs(dfs);

 

    }

}

 

  The result is shown in Figures 6-3-1 and 6-3-2.

  1) Project browser

Figure 6-3-1 Result (1)

  2) SecureCRT output

Figure 6-3-2 Result (2)

6.4 Renaming an HDFS file

  "FileSystem.rename(Path src, Path dst)" renames the given HDFS file, where src and dst are both full paths. Example:

 

package com.hebut.file;

 

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.fs.FileSystem;

import org.apache.hadoop.fs.Path;

 

public class Rename{

    public static void main(String[] args) throws Exception {

        Configuration conf=new Configuration();

        FileSystem hdfs=FileSystem.get(conf);

     

        Path frpaht=new Path("/test");    //舊的文件名

        Path topath=new Path("/test1");    //新的文件名

       

        boolean isRename=hdfs.rename(frpath, topath);

       

        String result=isRename?"success":"failure";

        System.out.println("Rename result: "+result);

       

    }

}

 

  The result is shown in Figures 6-4-1 and 6-4-2.

  1) Project browser

Figure 6-4-1 Result (1)

  2) SecureCRT

Figure 6-4-2 Result (2)

6.5 Deleting a file on HDFS

  "FileSystem.delete(Path f, boolean recursive)" deletes the specified HDFS file, where f is the full path of the file to delete and recursive determines whether the delete is recursive. The implementation is as follows:

 

package com.hebut.file;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteFile {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);

        Path delef = new Path("/test1");

        boolean isDeleted = hdfs.delete(delef, false);
        // recursive delete:
        // boolean isDeleted = hdfs.delete(delef, true);
        System.out.println("Delete?" + isDeleted);

        hdfs.close();
    }
}

 

  The result is shown in Figures 6-5-1 and 6-5-2.

  1) Console output

Figure 6-5-1 Result (1)

  2) Project browser

Figure 6-5-2 Result (2)

6.6 Deleting a directory on HDFS

  The code is the same as for deleting a file; just pass a directory path instead. If the directory contains files, a recursive delete is required, as in the sketch below.
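A minimal sketch, assuming a class name of DeleteDir and reusing the "/TestDir" directory from section 6.3 (both illustrative choices):

package com.hebut.dir;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteDir {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);

        Path delDir = new Path("/TestDir");

        // recursive=true removes the directory together with everything inside it
        boolean isDeleted = hdfs.delete(delDir, true);
        System.out.println("Delete?" + isDeleted);

        hdfs.close();
    }
}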

6.7 Checking whether an HDFS file exists

  "FileSystem.exists(Path f)" checks whether the specified HDFS file exists, where f is the file's full path. The implementation is as follows:

 

package com.hebut.file;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        Path findf = new Path("/test1");
        boolean isExists = hdfs.exists(findf);
        System.out.println("Exist?" + isExists);
        hdfs.close();
    }
}

 

  The result is shown in Figures 6-7-1 and 6-7-2.

  1) Console output

Figure 6-7-1 Result (1)

  2) Project browser

Figure 6-7-2 Result (2)

6.8 Getting the last modification time of an HDFS file

  The modification time of an HDFS file can be read from the FileStatus returned by "FileSystem.getFileStatus(Path f)", via "FileStatus.getModificationTime()". The implementation is as follows:

 

package com.hebut.file;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GetLTime {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);

        Path fpath = new Path("/user/hadoop/test/file1.txt");

        FileStatus fileStatus = hdfs.getFileStatus(fpath);
        long modiTime = fileStatus.getModificationTime();   // milliseconds since the epoch

        System.out.println("Modification time of file1.txt: " + modiTime);
        hdfs.close();
    }
}

 

  The result is shown in Figure 6-8-1.

Figure 6-8-1 Console output

6.9 Listing all files under an HDFS directory

  "FileSystem.listStatus(Path f)" returns the FileStatus entries of every file under the specified HDFS directory, and "FileStatus.getPath()" then yields each file's path. The implementation is as follows:

 

package com.hebut.file;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListAllFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);

        Path listf = new Path("/user/hadoop/test");

        FileStatus stats[] = hdfs.listStatus(listf);
        for (int i = 0; i < stats.length; ++i) {
            System.out.println(stats[i].getPath().toString());
        }
        hdfs.close();
    }
}

 

  The result is shown in Figures 6-9-1 and 6-9-2.

  1) Console output

Figure 6-9-1 Result (1)

  2) Project browser

Figure 6-9-2 Result (2)

6.10 Finding where a file is stored in the HDFS cluster

  "FileSystem.getFileBlockLocations(FileStatus file, long start, long len)" finds where the specified file's blocks are stored in the HDFS cluster, where file is the file's FileStatus and start and len delimit the byte range of the file to query. The implementation is as follows:

 

package com.hebut.file;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileLoc {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        Path fpath = new Path("/user/hadoop/cygwin");

        FileStatus filestatus = hdfs.getFileStatus(fpath);
        BlockLocation[] blkLocations = hdfs.getFileBlockLocations(filestatus, 0, filestatus.getLen());

        int blockLen = blkLocations.length;
        for (int i = 0; i < blockLen; i++) {
            // print the first replica's host for each block
            String[] hosts = blkLocations[i].getHosts();
            System.out.println("block_" + i + "_location:" + hosts[0]);
        }
        hdfs.close();
    }
}

 

  The result is shown in Figures 6-10-1 and 6-10-2.

  1) Console output

Figure 6-10-1 Result (1)

  2) Project browser

Figure 6-10-2 Result (2)

6.11 Getting the names of all nodes in the HDFS cluster

  "DatanodeInfo.getHostName()" returns the host name of each node in the HDFS cluster. The implementation is as follows:

 

package com.hebut.file;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class GetList {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // only DistributedFileSystem exposes DataNode statistics
        DistributedFileSystem hdfs = (DistributedFileSystem) fs;
        DatanodeInfo[] dataNodeStats = hdfs.getDataNodeStats();

        for (int i = 0; i < dataNodeStats.length; i++) {
            System.out.println("DataNode_" + i + "_Name:" + dataNodeStats[i].getHostName());
        }
        hdfs.close();
    }
}

 

  The result is shown in Figure 6-11-1.

Figure 6-11-1 Console output

7. The HDFS read and write data flow

7.1 Anatomy of a file read

  The file read proceeds as follows (a minimal client-side sketch follows the two walkthroughs):

  1) Walkthrough one

  • The client opens the file with FileSystem's open() method.
  • DistributedFileSystem calls the NameNode over RPC to obtain the file's block information.
  • For each block, the NameNode returns the addresses of the DataNodes that hold it.
  • DistributedFileSystem returns an FSDataInputStream to the client for reading the data.
  • The client calls read() on the stream to start reading.
  • DFSInputStream connects to the closest DataNode holding the first block of the file.
  • Data is streamed from that DataNode to the client.
  • When the block has been read, DFSInputStream closes the connection to that DataNode and connects to the closest DataNode holding the next block of the file.
  • When the client has finished reading, it calls close() on the FSDataInputStream.
  • If the client encounters an error while communicating with a DataNode, it tries the next DataNode that holds the block.
  • Failed DataNodes are recorded and not contacted again.

  2) Walkthrough two

  • Using the client library provided by HDFS, the client sends an RPC request to the remote NameNode;
  • The NameNode returns part or all of the file's block list as appropriate; for each block it returns the addresses of the DataNodes that hold a copy;
  • The client library picks the DataNode closest to the client to read each block;
  • After the current block has been read, the connection to that DataNode is closed and the best DataNode for the next block is selected;
  • When the blocks in the list are exhausted but the file has not been fully read, the client library asks the NameNode for the next batch of blocks;
  • Every block is checksum-verified after it is read; if reading from a DataNode fails, the client notifies the NameNode and then continues from the next DataNode that holds a copy of the block.
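  A minimal client-side sketch of this read path (the class name ReadFile and the path "/user/hadoop/test/file1.txt" are illustrative assumptions; the stream copy uses org.apache.hadoop.io.IOUtils):

package com.hebut.file;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);

        // open() asks the NameNode for block locations and returns an FSDataInputStream
        FSDataInputStream in = hdfs.open(new Path("/user/hadoop/test/file1.txt"));
        try {
            // read() pulls the bytes from the closest DataNode for each block in turn
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
            hdfs.close();
        }
    }
}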

7.2 Anatomy of a file write

  Writing a file is somewhat more involved than reading one (again, a minimal sketch follows the two walkthroughs):

  1) Walkthrough one

  • The client calls create() to create the file.
  • DistributedFileSystem calls the NameNode over RPC to create a new file in the filesystem namespace.
  • The NameNode first verifies that the file does not already exist and that the client has permission to create it, then creates the new file.
  • DistributedFileSystem returns a DFSOutputStream, which the client uses to write data.
  • As the client writes, DFSOutputStream splits the data into packets and places them on the data queue.
  • The DataStreamer consumes the data queue and asks the NameNode to allocate DataNodes for storing the blocks (three replicas per block by default). The allocated DataNodes form a pipeline.
  • The DataStreamer writes each packet to the first DataNode in the pipeline; the first DataNode forwards it to the second, and the second forwards it to the third.
  • DFSOutputStream also keeps an ack queue of the packets it has sent, waiting for the DataNodes in the pipeline to confirm that the data has been written.
  • If a DataNode fails during the write:
    • The pipeline is closed and the packets in the ack queue are moved back to the front of the data queue.
    • The NameNode gives the current block on the healthy DataNodes a new identity, so that when the failed node comes back it recognises its copy as stale and deletes it.
    • The failed DataNode is removed from the pipeline and the rest of the block's data is written to the remaining two DataNodes.
    • The NameNode notices that the block is under-replicated and arranges for a third replica to be created later.
  • When the client finishes writing, it calls close() on the stream. This writes all remaining packets to the pipeline, waits for the ack queue to drain, and finally notifies the NameNode that the write is complete.

  2) Walkthrough two

  • Using the client library provided by HDFS, the client sends an RPC request to the remote NameNode;
  • The NameNode checks whether the file already exists and whether the creator has permission to proceed; on success it creates a record for the file, otherwise an exception is thrown back to the client;
  • As the client starts writing, the library splits the file into packets, manages them internally as a "data queue", and asks the NameNode for new blocks, obtaining the list of DataNodes that will store the replicas; the size of the list is determined by the replication setting on the NameNode.
  • The packets are then written to all replicas through a pipeline: the library streams each packet to the first DataNode, which stores it and passes it on to the next DataNode in the pipeline, and so on to the last, so the write proceeds in a pipelined fashion.
  • After the last DataNode has stored a packet, it returns an ack packet, which travels back along the pipeline to the client; the client library maintains an "ack queue" and removes the corresponding packet once the ack has been received.
  • If a DataNode fails during transmission, the current pipeline is closed, the failed DataNode is removed from it, the remaining data of the block continues through the surviving DataNodes, and the NameNode allocates a new DataNode so that the configured number of replicas is maintained.
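  A minimal client-side sketch of this write path (the class name WriteFile and the destination path are illustrative assumptions):

package com.hebut.file;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);

        // create() asks the NameNode to add the file to the namespace and
        // returns an output stream that writes through the DataNode pipeline
        FSDataOutputStream out = hdfs.create(new Path("/user/hadoop/test/pipeline.txt"));
        try {
            out.write("written through the DataNode pipeline\n".getBytes("UTF-8"));
            out.hflush();   // push buffered packets down the pipeline
        } finally {
            out.close();    // waits for the ack queue to drain, then tells the NameNode the write is done
            hdfs.close();
        }
    }
}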