Introduction
HDFS (Hadoop Distributed File System) is Hadoop's distributed file system. It is modeled on the paper Google published on GFS (Google File System), available in Chinese and English.
HDFS has several notable characteristics:
① It keeps multiple replicas and provides fault tolerance; lost replicas or crashed machines are recovered automatically. Three copies are stored by default.
② It runs on inexpensive machines.
③ It is suited to processing big data. How big, and how small? HDFS splits files into blocks, 64 MB per block by default, stores the blocks on HDFS as key-value pairs, and keeps the mapping of those key-value pairs in memory. If there are too many small files, that memory burden becomes very heavy.
As shown in the figure above, HDFS also follows a Master/Slave structure, with the roles NameNode, SecondaryNameNode, and DataNode.
NameNode: the Master node, the boss. It manages the block mapping, handles clients' read and write requests, applies the replication policy, and manages the HDFS namespace.
SecondaryNameNode: a helper that takes over part of the NameNode's workload; it is a cold backup of the NameNode; it merges the fsimage and edits and then sends the result back to the NameNode.
DataNode: a Slave node, the worker that does the actual work. It stores the data blocks sent by clients and performs block read and write operations.
Hot backup: b is a hot backup of a; if a fails, b immediately takes over a's work.
Cold backup: b is a cold backup of a; if a fails, b cannot take over immediately, but b holds some of a's information, reducing the loss after a fails.
fsimage: the metadata image file (the file system's directory tree).
edits: the metadata operation log (the record of modifications made to the file system).
What the NameNode holds in memory = fsimage + edits.
The SecondaryNameNode periodically (every hour by default) fetches the fsimage and edits from the NameNode, merges them, and sends the result back to the NameNode, reducing the NameNode's workload.
How It Works
Write operation:
There is a file FileA, 100 MB in size. The client writes FileA to HDFS.
HDFS uses its default configuration.
HDFS is spread across three racks: Rack1, Rack2, and Rack3.
a. The client splits FileA into 64 MB blocks, producing two blocks, block1 and block2.
b. The client sends a write request to the NameNode, shown as the blue dashed line ① in the figure.
c. The NameNode records the block information and returns the available DataNodes, shown as the pink dashed line ②:
Block1: host2, host1, host3
Block2: host7, host8, host4
Principle:
The NameNode is rack-aware (RackAware), and this can be configured.
If the client is itself a DataNode, the block placement rule is: replica 1 on the client's own node; replica 2 on a node in a different rack; replica 3 on another node in the same rack as the second replica; any further replicas are placed randomly.
If the client is not a DataNode, the rule is: replica 1 on a randomly chosen node; replica 2 on a rack different from replica 1; replica 3 on another node in the same rack as replica 2; any further replicas are placed randomly.
d. The client sends block1 to the DataNodes; the transfer is a streaming write.
The streaming write proceeds as follows:
1> split the 64 MB block1 into 64 KB packets;
2> send the first packet to host2;
3> once host2 has received it, host2 forwards the first packet to host1 while the client sends the second packet to host2;
4> once host1 has received the first packet, it forwards it to host3 while receiving the second packet from host2;
5> and so on, as shown by the solid red lines in the figure, until block1 has been sent completely;
6> host2, host1, and host3 notify the NameNode, and host2 notifies the client, that "the transfer is finished", as shown by the solid pink lines;
7> when the client receives host2's message, it tells the NameNode that the write is done, which truly completes block1, as shown by the thick yellow line;
8> after block1 is sent, block2 is sent to host7, host8, and host4, as shown by the solid blue lines;
9> after block2 is sent, host7, host8, and host4 notify the NameNode, and host7 notifies the client, as shown by the light-green solid lines;
10> the client tells the NameNode that the write is finished, as shown by the thick yellow line, and the whole write is complete.
Analysis — from the write process we can see that:
① writing a 1 TB file requires 3 TB of storage and 3 TB of network bandwidth;
② during reads and writes, the NameNode and the DataNodes stay in touch through heartbeats, confirming that each DataNode is alive. If a DataNode is found to be dead, the data on it is re-replicated to other nodes, and reads go to those other nodes;
③ losing one node does not matter, since other nodes hold replicas; even losing a whole rack does not matter, because other racks hold replicas too.
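To make the client side of this pipeline concrete, here is a minimal sketch using the FileSystem API; the HDFS path and the local file are made up for illustration, and the explicit replication factor of 3 and 64 MB block size simply mirror the scenario described above:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import java.io.FileInputStream;
import java.io.InputStream;
public class WriteFileA {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // reads core-site.xml / fs.default.name
        FileSystem hdfs = FileSystem.get(conf);
        Path dst = new Path("/user/hadoop/FileA");           // illustrative HDFS destination
        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = hdfs.create(dst, true, 4096, (short) 3, 64 * 1024 * 1024L);
        InputStream in = new FileInputStream("/tmp/FileA");  // illustrative local source
        IOUtils.copyBytes(in, out, 4096, true);              // copies the data and closes both streams
    }
}
Behind the stream returned by create(...) sits exactly the block/packet pipeline described above; closing the stream (done here by copyBytes) completes the file with the NameNode.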
讀操做:
讀操做就簡單一些了,如圖所示,client要從datanode上,讀取FileA。而FileA由block1和block2組成。
那麼,讀操做流程爲:
a. client向namenode發送讀請求。
b. namenode查看Metadata信息,返回fileA的block的位置。
block1:host2,host1,host3
block2:host7,host8,host4
c. block的位置是有前後順序的,先讀block1,再讀block2。並且block1去host2上讀取;而後block2,去host7上讀取;
上面例子中,client位於機架外,那麼若是client位於機架內某個DataNode上,例如,client是host6。那麼讀取的時候,遵循的規律是:
優選讀取本機架上的數據。
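The client-side counterpart is just as small; a minimal sketch with an illustrative path. FileSystem.open asks the NameNode for the block locations, and the returned stream then reads each block from a nearby DataNode:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
public class ReadFileA {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        FSDataInputStream in = hdfs.open(new Path("/user/hadoop/FileA"));  // illustrative path
        IOUtils.copyBytes(in, System.out, 4096, false);                    // stream the whole file to stdout
        IOUtils.closeStream(in);
    }
}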
Commonly Used HDFS Commands
1. hadoop fs
hadoop fs -ls /
hadoop fs -lsr
hadoop fs -mkdir /user/hadoop
hadoop fs -put a.txt /user/hadoop/
hadoop fs -get /user/hadoop/a.txt /
hadoop fs -cp src dst
hadoop fs -mv src dst
hadoop fs -cat /user/hadoop/a.txt
hadoop fs -rm /user/hadoop/a.txt
hadoop fs -rmr /user/hadoop/a.txt
hadoop fs -text /user/hadoop/a.txt
hadoop fs -copyFromLocal localsrc dst    similar to hadoop fs -put.
hadoop fs -moveFromLocal localsrc dst    uploads a local file to HDFS and deletes the local copy.
2. hadoop dfsadmin
hadoop dfsadmin -report
hadoop dfsadmin -safemode enter | leave | get | wait
hadoop dfsadmin -setBalancerBandwidth 1000
3. hadoop fsck
4. start-balancer.sh
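The same shell commands can also be driven from Java through org.apache.hadoop.fs.FsShell; a minimal sketch (the argument list here is illustrative):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FsShell;
import org.apache.hadoop.util.ToolRunner;
public class ShellFromJava {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // equivalent to: hadoop fs -ls /
        int exitCode = ToolRunner.run(new FsShell(conf), new String[]{"-ls", "/"});
        System.out.println("exit code: " + exitCode);
    }
}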
========================================================================
Here is a more detailed write-up!
HDFS (Hadoop Distributed File System) is a core subproject of the Hadoop project and the foundation of data storage and management in distributed computing. It was developed to access and process very large files in a streaming-data fashion and can run on inexpensive commodity servers. Its high fault tolerance, high reliability, high scalability, high availability, and high throughput provide failure-tolerant storage for massive amounts of data and make it very convenient to work with very large data sets (Large Data Sets).
Hadoop integrates many file systems behind one comprehensive file-system abstraction that defines the interfaces a file-system implementation must provide; HDFS is just one instance of this abstract file system. Hadoop offers the high-level abstract class org.apache.hadoop.fs.FileSystem, which represents a distributed file system and has several concrete implementations, as shown in Table 1-1.
Table 1-1 Hadoop file systems
File system | URI scheme | Java implementation (org.apache.hadoop) | Description |
---|---|---|---|
Local | file | fs.LocalFileSystem | A file system for a locally connected disk with client-side checksums. The local file system without checksums is implemented in fs.RawLocalFileSystem. |
HDFS | hdfs | hdfs.DistributedFileSystem | Hadoop's distributed file system. |
HFTP | hftp | hdfs.HftpFileSystem | Provides read-only access to HDFS over HTTP; distcp often uses it to copy data between different HDFS clusters. |
HSFTP | hsftp | hdfs.HsftpFileSystem | Provides read-only access to HDFS over HTTPS. |
HAR | har | fs.HarFileSystem | A file system layered on top of another Hadoop file system for archiving files. Hadoop Archives are mainly used to reduce the NameNode's memory usage. |
KFS | kfs | fs.kfs.KosmosFileSystem | CloudStore (formerly the Kosmos file system) is a distributed file system similar to HDFS and Google's GFS, written in C++. |
FTP | ftp | fs.ftp.FTPFileSystem | A file system backed by an FTP server. |
S3 (native) | s3n | fs.s3native.NativeS3FileSystem | A file system backed by Amazon S3. |
S3 (block-based) | s3 | fs.s3.S3FileSystem | A file system backed by Amazon S3 that stores files in blocks, working around S3's 5 GB file-size limit. |
Hadoop provides interfaces to many file systems, and users can pick the appropriate file system to interact with by its URI scheme.
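In code, the scheme in the URI is what selects the implementation; a rough sketch (the namenode host and port are placeholders), where FileSystem.get(URI, Configuration) hands back a LocalFileSystem for file:// and a DistributedFileSystem for hdfs://:
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
public class SchemeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem local = FileSystem.get(URI.create("file:///"), conf);
        FileSystem hdfs  = FileSystem.get(URI.create("hdfs://namenodehost:9000/"), conf); // placeholder host:port
        System.out.println(local.getClass().getName()); // org.apache.hadoop.fs.LocalFileSystem
        System.out.println(hdfs.getClass().getName());  // org.apache.hadoop.hdfs.DistributedFileSystem
    }
}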
The HDFS architecture has two kinds of nodes: the NameNode, also called the "metadata node", and the DataNodes, also called "data nodes". They are, respectively, the Master and the Workers that carry out the concrete tasks.
1) The metadata node manages the file system's namespace.
2) The data nodes are where the file system's data is actually stored.
3) The secondary metadata node (secondary namenode).
The VERSION file is a Java properties file holding the HDFS version information.
A NameNode's VERSION file looks like this:
namespaceID=1232737062
cTime=0
storageType=NAME_NODE
layoutVersion=-18
A DataNode's VERSION file looks like this:
namespaceID=1232737062
storageID=DS-1640411682-127.0.1.1-50010-1254997319480
cTime=0
storageType=DATA_NODE
layoutVersion=-18
HDFS is a master/slave (Master/Slave) architecture. From the end user's point of view it looks like a traditional file system: files can be created, read, updated, and deleted (CRUD) through directory paths. But because of the nature of distributed storage, an HDFS cluster has one NameNode and a number of DataNodes. The NameNode manages the file system's metadata, while the DataNodes store the actual data. Clients access the file system by interacting with both: they contact the NameNode for a file's metadata, and the actual file I/O goes directly to the DataNodes.
Figure 3.1 Overall structure of HDFS
1) NameNode, DataNode, and Client
2) File write
3) File read
A typical HDFS deployment runs the NameNode on a dedicated machine, with each of the other machines in the cluster running one DataNode; it is also possible to run a DataNode on the NameNode's machine, or to run multiple DataNodes on one machine. Having a single NameNode per cluster greatly simplifies the system architecture.
1) Handling very large files
"Very large" here means files from hundreds of MB up to hundreds of TB. In practice, HDFS is already used to store and manage petabytes of data.
2) Streaming data access
HDFS is designed around a "write once, read many" workload. Once a data set is generated by a data source, it is copied and distributed to different storage nodes and then serves all kinds of data-analysis requests. In most cases an analysis task touches a large portion of the data set, so for HDFS it is more efficient to read the entire data set than to fetch a single record.
3) Running on clusters of inexpensive commodity machines
Hadoop's hardware requirements are modest: it only needs to run on cheap commodity hardware rather than expensive, highly available machines. Cheap commodity machines also mean that node failures in a large cluster are very common, so HDFS was designed with data reliability, safety, and high availability firmly in mind.
1) Not suitable for low-latency data access
HDFS is not suitable for applications that need short, low-latency responses. It was built for large data-set analysis tasks and is designed mainly for high data throughput, which may come at the cost of high latency.
Improvement strategies: for applications with low-latency requirements, HBase is a better choice; an upper-layer data-management project can go a long way toward making up for this shortcoming. HBase's performance has improved greatly, and its slogan is "goes real time". Caching or a multi-master design can reduce the data-request pressure from clients and thereby the latency. HDFS internals could also be modified, but that means weighing high throughput against low latency — HDFS is not a silver bullet.
2) Unable to store large numbers of small files efficiently
Because the NameNode keeps the file system's metadata in memory, the number of files the file system can hold is bounded by the NameNode's memory. As a rough rule, every file, directory, and block occupies about 150 bytes, so with one million files, each occupying one block, at least 300 MB of memory is needed. A few million files is still feasible today, but scaling to billions is beyond current hardware. There is a second problem: because the number of map tasks is determined by the number of splits, processing many small files with MapReduce produces far too many map tasks, and the thread-management overhead lengthens the job. For example, processing 10000 MB of data with 1 MB splits yields 10000 map tasks and a large threading overhead; with 100 MB splits there are only 100 map tasks, each doing much more work, and the thread-management overhead shrinks considerably.
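The arithmetic behind those numbers can be written out directly; a tiny self-contained check of the 150-bytes-per-object rule of thumb and of the split counts used above:
public class SmallFileMath {
    public static void main(String[] args) {
        long files = 1000000L;          // one million files
        long objectsPerFile = 2L;       // roughly one file entry plus one block
        long bytesPerObject = 150L;     // rule-of-thumb NameNode memory per namespace object
        long namenodeBytes = files * objectsPerFile * bytesPerObject;
        System.out.println("NameNode memory ~ " + namenodeBytes / (1024 * 1024) + " MB"); // about 300 MB

        long dataMB = 10000L;           // 10000 MB of input
        System.out.println("map tasks with 1 MB splits:   " + (dataMB / 1));
        System.out.println("map tasks with 100 MB splits: " + (dataMB / 100));
    }
}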
Improvement strategies: there are quite a few ways to make HDFS handle small files well.
3) No support for multiple writers or arbitrary file modification
An HDFS file has only one writer at a time, and writes can only happen at the end of the file, i.e. only appends are possible. HDFS currently supports neither multiple users writing to the same file nor modification at arbitrary positions within a file.
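Where the cluster and Hadoop version support it (and it is enabled, e.g. via dfs.support.append on older releases), appending is exposed through FileSystem.append; a hedged sketch with an illustrative path:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class AppendDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        // append is only valid at the end of an existing file; in-place edits elsewhere are not possible
        FSDataOutputStream out = hdfs.append(new Path("/user/hadoop/a.txt"));  // illustrative path
        out.write("one more line\n".getBytes());
        out.close();
    }
}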
First, a note on the difference between "hadoop fs" and "hadoop dfs": two Hadoop books each use a different one with the same effect, and after checking online the following explanation seems the most accurate.
Roughly speaking, fs is the more abstract layer: in a distributed environment fs is dfs, while in a local environment fs is the local file system and dfs cannot be used there.
1) Listing HDFS files
This shows how to list the files in HDFS with the "-ls" command:
hadoop fs -ls
The result is shown in Figure 5-1-1. Note that here "-ls" without arguments returned nothing; by default it lists the contents of the HDFS "home" directory. HDFS has no notion of a current directory, and no cd command.
Figure 5-1-1 Listing HDFS files
2) Listing the files in an HDFS directory
This shows how to browse the files in the HDFS directory named "input" with "-ls directory-name":
hadoop fs -ls input
The result is shown in Figure 5-1-2.
Figure 5-1-2 Files under the HDFS directory named input
3) Uploading a file to HDFS
This shows how to use "-put file1 file2" to upload the file "file" from the "/home/hadoop" directory on the machine "Master.Hadoop" to HDFS and rename it "test":
hadoop fs -put ~/file test
The result is shown in Figure 5-1-3. Note that "-put" either succeeds or fails completely: the file is first copied to the DataNodes, and the upload only succeeds once every DataNode has received all the data. Anything else (such as an interrupted upload) is wasted work as far as HDFS is concerned.
Figure 5-1-3 Successfully uploading file to HDFS
4) Copying a file from HDFS to the local system
This shows how to use "-get file1 file2" to copy the HDFS file "output" to the local system as "getout":
hadoop fs -get output getout
The result is shown in Figure 5-1-4.
Figure 5-1-4 Successfully copying the HDFS output file to the local system
Note: like "-put", "-get" can operate on files as well as directories.
5) Deleting a document in HDFS
This shows how to delete the HDFS directory named "newoutput" with "-rmr path":
hadoop fs -rmr newoutput
The result is shown in Figure 5-1-5.
Figure 5-1-5 Successfully deleting the newoutput directory in HDFS
6) Viewing a file in HDFS
This shows how to view the contents of the HDFS input files with "-cat path":
hadoop fs -cat input/*
The result is shown in Figure 5-1-6.
Figure 5-1-6 Contents of the HDFS input files
The "hadoop fs" commands go far beyond these, but the ones covered in this section handle most everyday HDFS operations. For the rest, the list printed by "-help commandName" is a good starting point for further study.
1) Reporting basic HDFS statistics
This shows how to view basic HDFS statistics with "-report":
hadoop dfsadmin -report
The result is shown in Figure 5-2-1.
Figure 5-2-1 Basic HDFS statistics
2) Leaving safe mode
The NameNode enters safe mode automatically at startup. Safe mode is a NameNode state in which the file system does not accept any modification. Its purpose is to check the validity of the data blocks on the DataNodes when the system starts up and to replicate or delete blocks as dictated by policy; when the minimum percentage of blocks satisfies the configured minimum replication level, safe mode is exited automatically.
If the system reports "Name node is in safe mode", it is in safe mode; you can simply wait a short while, or leave safe mode with the following command:
hadoop dfsadmin -safemode leave
A successful exit from safe mode is shown in Figure 5-2-2.
Figure 5-2-2 Successfully leaving safe mode
3) Entering safe mode
When necessary, HDFS can be put into safe mode with the following command:
hadoop dfsadmin -safemode enter
The result is shown in Figure 5-2-3.
Figure 5-2-3 Entering HDFS safe mode
4) Adding a node
Scalability is an important feature of HDFS, and adding nodes to an HDFS cluster is easy. To add a new DataNode, first install Hadoop on the new node using the same configuration as the NameNode (it can simply be copied from the NameNode) and add the NameNode's hostname to "/usr/hadoop/conf/master". Then, on the NameNode, add the new node's hostname to "/usr/hadoop/conf/slaves", set up passwordless SSH to the new node, and run the start command:
start-all.sh
5) Load balancing
HDFS data may be distributed very unevenly across the DataNodes, especially after a DataNode fails or new DataNodes are added. The NameNode's DataNode-selection policy for new blocks can also lead to uneven block placement. Users can rebalance the block distribution across DataNodes with:
start-balancer.sh
The data distribution across the DataNodes before running the command is shown in Figure 5-2-4.
The distribution after load balancing is shown in Figure 5-2-5.
Running the load-balancing command is shown in Figure 5-2-6.
========================================================================
http://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/FileSystemShell.html
The File System (FS) shell includes various shell-like commands that directly interact with the Hadoop Distributed File System (HDFS) as well as other file systems that Hadoop supports, such as Local FS, HFTP FS, S3 FS, and others. The FS shell is invoked by:
bin/hadoop fs <args>
All FS shell commands take path URIs as arguments. The URI format is scheme://authority/path. For HDFS the scheme is hdfs, and for the Local FS the scheme is file. The scheme and authority are optional. If not specified, the default scheme specified in the configuration is used. An HDFS file or directory such as /parent/child can be specified as hdfs://namenodehost/parent/child or simply as /parent/child (given that your configuration is set to point to hdfs://namenodehost).
Most of the commands in FS shell behave like corresponding Unix commands. Differences are described with each of the commands. Error information is sent to stderr and the output is sent to stdout.
If HDFS is being used, hdfs dfs is a synonym.
See the Commands Manual for generic shell options.
Usage: hadoop fs -appendToFile <localsrc> ... <dst>
Append single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and appends to destination file system.
Exit Code:
Returns 0 on success and 1 on error.
Usage: hadoop fs -cat URI [URI ...]
Copies source paths to stdout.
Example:
Exit Code:
Returns 0 on success and -1 on error.
Usage: hadoop fs -checksum URI
Returns the checksum information of a file.
Example:
Usage: hadoop fs -chgrp [-R] GROUP URI [URI ...]
Change group association of files. The user must be the owner of files, or else a super-user. Additional information is in the Permissions Guide.
Options
Usage: hadoop fs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]
Change the permissions of files. With -R, make the change recursively through the directory structure. The user must be the owner of the file, or else a super-user. Additional information is in the Permissions Guide.
Options
Usage: hadoop fs -chown [-R] [OWNER][:[GROUP]] URI [URI ]
Change the owner of files. The user must be a super-user. Additional information is in the Permissions Guide.
Options
Usage: hadoop fs -copyFromLocal <localsrc> URI
Similar to put command, except that the source is restricted to a local file reference.
Options:
Usage: hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
Similar to get command, except that the destination is restricted to a local file reference.
Usage: hadoop fs -count [-q] [-h] [-v] <paths>
Count the number of directories, files and bytes under the paths that match the specified file pattern. The output columns with -count are: DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME
The output columns with -count -q are: QUOTA, REMAINING_QUOTA, SPACE_QUOTA, REMAINING_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME
The -h option shows sizes in human readable format.
The -v option displays a header line.
Example:
Exit Code:
Returns 0 on success and -1 on error.
Usage: hadoop fs -cp [-f] [-p | -p[topax]] URI [URI ...] <dest>
Copy files from source to destination. This command allows multiple sources as well in which case the destination must be a directory.
‘raw.*’ namespace extended attributes are preserved if (1) the source and destination filesystems support them (HDFS only), and (2) all source and destination pathnames are in the /.reserved/raw hierarchy. Determination of whether raw.* namespace xattrs are preserved is independent of the -p (preserve) flag.
Options:
Example:
Exit Code:
Returns 0 on success and -1 on error.
See HDFS Snapshots Guide.
See HDFS Snapshots Guide.
Usage: hadoop fs -df [-h] URI [URI ...]
Displays free space.
Options:
Example:
Usage: hadoop fs -du [-s] [-h] URI [URI ...]
Displays sizes of files and directories contained in the given directory or the length of a file in case it's just a file.
Options:
Example:
Exit Code: Returns 0 on success and -1 on error.
Usage: hadoop fs -dus <args>
Displays a summary of file lengths.
Note: This command is deprecated. Instead use hadoop fs -du -s.
Usage: hadoop fs -expunge
Permanently delete files in checkpoints older than the retention threshold from trash directory, and create new checkpoint.
When checkpoint is created, recently deleted files in trash are moved under the checkpoint. Files in checkpoints older than fs.trash.checkpoint.interval will be permanently deleted on the next invocation of -expunge command.
If the file system supports the feature, users can configure to create and delete checkpoints periodically by the parameter stored as fs.trash.checkpoint.interval (in core-site.xml). This value should be smaller or equal to fs.trash.interval.
Refer to the HDFS Architecture guide for more information about trash feature of HDFS.
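Programmatically, the same trash behaviour is available through org.apache.hadoop.fs.Trash; a small sketch (the path is illustrative, and trash must be enabled via fs.trash.interval for the move to happen):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;
public class TrashDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // moves the path into the user's trash directory instead of deleting it outright
        boolean moved = Trash.moveToAppropriateTrash(fs, new Path("/user/hadoop/old.txt"), conf);
        System.out.println("moved to trash: " + moved);
    }
}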
Usage: hadoop fs -find <path> ... <expression> ...
Finds all files that match the specified expression and applies selected actions to them. If no path is specified then defaults to the current working directory. If no expression is specified then defaults to -print.
The following primary expressions are recognised:
-name pattern
-iname pattern
Evaluates as true if the basename of the file matches the pattern using standard file system globbing. If -iname is used then the match is case insensitive.
-print
-print0
Always evaluates to true. Causes the current pathname to be written to standard output. If the -print0 expression is used then an ASCII NULL character is appended.
The following operators are recognised:
expression -a expression
expression -and expression
expression expression
Logical AND operator for joining two expressions. Returns true if both child expressions return true. Implied by the juxtaposition of two expressions and so does not need to be explicitly specified. The second expression will not be applied if the first fails.
Example:
hadoop fs -find / -name test -print
Exit Code:
Returns 0 on success and -1 on error.
Usage: hadoop fs -get [-ignorecrc] [-crc] <src> <localdst>
Copy files to the local file system. Files that fail the CRC check may be copied with the -ignorecrc option. Files and CRCs may be copied using the -crc option.
Example:
Exit Code:
Returns 0 on success and -1 on error.
Usage: hadoop fs -getfacl [-R] <path>
Displays the Access Control Lists (ACLs) of files and directories. If a directory has a default ACL, then getfacl also displays the default ACL.
Options:
Examples:
Exit Code:
Returns 0 on success and non-zero on error.
Usage: hadoop fs -getfattr [-R] -n name | -d [-e en] <path>
Displays the extended attribute names and values (if any) for a file or directory.
Options:
Examples:
Exit Code:
Returns 0 on success and non-zero on error.
Usage: hadoop fs -getmerge [-nl] <src> <localdst>
Takes a source directory and a destination file as input and concatenates files in src into the destination local file. Optionally -nl can be set to enable adding a newline character (LF) at the end of each file.
Examples:
Exit Code:
Returns 0 on success and non-zero on error.
Usage: hadoop fs -ls [-d] [-h] [-R] <args>
Options:
For a file ls returns stat on the file with the following format:
permissions number_of_replicas userid groupid filesize modification_date modification_time filename
For a directory it returns list of its direct children as in Unix. A directory is listed as:
permissions userid groupid modification_date modification_time dirname
Files within a directory are ordered by filename by default.
Example:
Exit Code:
Returns 0 on success and -1 on error.
Usage: hadoop fs -lsr <args>
Recursive version of ls.
Note: This command is deprecated. Instead use hadoop fs -ls -R
Usage: hadoop fs -mkdir [-p] <paths>
Takes path uri’s as argument and creates directories.
Options:
Example:
Exit Code:
Returns 0 on success and -1 on error.
Usage: hadoop fs -moveFromLocal <localsrc> <dst>
Similar to put command, except that the source localsrc is deleted after it’s copied.
Usage: hadoop fs -moveToLocal [-crc] <src> <dst>
Displays a "Not implemented yet" message.
Usage: hadoop fs -mv URI [URI ...] <dest>
Moves files from source to destination. This command allows multiple sources as well in which case the destination needs to be a directory. Moving files across file systems is not permitted.
Example:
Exit Code:
Returns 0 on success and -1 on error.
Usage: hadoop fs -put <localsrc> ... <dst>
Copy single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and writes to destination file system.
Exit Code:
Returns 0 on success and -1 on error.
See HDFS Snapshots Guide.
Usage: hadoop fs -rm [-f] [-r |-R] [-skipTrash] URI [URI ...]
Delete files specified as args.
If trash is enabled, file system instead moves the deleted file to a trash directory (given by FileSystem#getTrashRoot).
Currently, the trash feature is disabled by default. User can enable trash by setting a value greater than zero for parameter fs.trash.interval (in core-site.xml).
See expunge about deletion of files in trash.
Options:
Example:
Exit Code:
Returns 0 on success and -1 on error.
Usage: hadoop fs -rmdir [--ignore-fail-on-non-empty] URI [URI ...]
Delete a directory.
Options:
Example:
Usage: hadoop fs -rmr [-skipTrash] URI [URI ...]
Recursive version of delete.
Note: This command is deprecated. Instead use hadoop fs -rm -r
Usage: hadoop fs -setfacl [-R] [-b |-k -m |-x <acl_spec> <path>] |[--set <acl_spec> <path>]
Sets Access Control Lists (ACLs) of files and directories.
Options:
Examples:
Exit Code:
Returns 0 on success and non-zero on error.
Usage: hadoop fs -setfattr -n name [-v value] | -x name <path>
Sets an extended attribute name and value for a file or directory.
Options:
Examples:
Exit Code:
Returns 0 on success and non-zero on error.
Usage: hadoop fs -setrep [-R] [-w] <numReplicas> <path>
Changes the replication factor of a file. If path is a directory then the command recursively changes the replication factor of all files under the directory tree rooted at path.
Options:
Example:
Exit Code:
Returns 0 on success and -1 on error.
Usage: hadoop fs -stat [format] <path> ...
Print statistics about the file/directory at <path> in the specified format. Format accepts filesize in blocks (%b), type (%F), group name of owner (%g), name (%n), block size (%o), replication (%r), user name of owner (%u), and modification date (%y, %Y). %y shows UTC date as "yyyy-MM-dd HH:mm:ss" and %Y shows milliseconds since January 1, 1970 UTC. If the format is not specified, %y is used by default.
Example:
Exit Code: Returns 0 on success and -1 on error.
Usage: hadoop fs -tail [-f] URI
Displays last kilobyte of the file to stdout.
Options:
Example:
Exit Code: Returns 0 on success and -1 on error.
Usage: hadoop fs -test -[defsz] URI
Options:
Example:
Usage: hadoop fs -text <src>
Takes a source file and outputs the file in text format. The allowed formats are zip and TextRecordInputStream.
Usage: hadoop fs -touchz URI [URI ...]
Create a file of zero length.
Example:
Exit Code: Returns 0 on success and -1 on error.
Usage: hadoop fs -truncate [-w] <length> <paths>
Truncate all files that match the specified file pattern to the specified length.
Options:
Example:
http://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html
All HDFS commands are invoked by the bin/hdfs script. Running the hdfs script without any arguments prints the description for all commands.
Usage: hdfs [SHELL_OPTIONS] COMMAND [GENERIC_OPTIONS] [COMMAND_OPTIONS]
Hadoop has an option parsing framework that employs parsing generic options as well as running classes.
COMMAND_OPTIONS | Description |
---|---|
--config --loglevel | The common set of shell options. These are documented on the Commands Manual page. |
GENERIC_OPTIONS | The common set of options supported by multiple commands. See the Hadoop Commands Manual for more information. |
COMMAND COMMAND_OPTIONS | Various commands with their options are described in the following sections. The commands have been grouped into User Commands and Administration Commands. |
Commands useful for users of a hadoop cluster.
Usage: hdfs classpath
Prints the class path needed to get the Hadoop jar and the required libraries
Usage: hdfs dfs [COMMAND [COMMAND_OPTIONS]]
Run a filesystem command on the file system supported in Hadoop. The various COMMAND_OPTIONS can be found at File System Shell Guide.
Usage: hdfs fetchdt [--webservice <namenode_http_addr>] <path>
COMMAND_OPTION | Description |
---|---|
--webservice https_address | use http protocol instead of RPC |
fileName | File name to store the token into. |
Gets Delegation Token from a NameNode. See fetchdt for more info.
Usage:
hdfs fsck <path> [-list-corruptfileblocks | [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]] [-includeSnapshots] [-storagepolicies] [-blockId <blk_Id>]
COMMAND_OPTION | Description |
---|---|
path | Start checking from this path. |
-delete | Delete corrupted files. |
-files | Print out files being checked. |
-files -blocks | Print out the block report |
-files -blocks -locations | Print out locations for every block. |
-files -blocks -racks | Print out network topology for data-node locations. |
-includeSnapshots | Include snapshot data if the given path indicates a snapshottable directory or there are snapshottable directories under it. |
-list-corruptfileblocks | Print out list of missing blocks and files they belong to. |
-move | Move corrupted files to /lost+found. |
-openforwrite | Print out files opened for write. |
-storagepolicies | Print out storage policy summary for the blocks. |
-blockId | Print out information about the block. |
Runs the HDFS filesystem checking utility. See fsck for more info.
Usage:
hdfs getconf -namenodes
hdfs getconf -secondaryNameNodes
hdfs getconf -backupNodes
hdfs getconf -includeFile
hdfs getconf -excludeFile
hdfs getconf -nnRpcAddresses
hdfs getconf -confKey [key]
COMMAND_OPTION | Description |
---|---|
-namenodes | gets list of namenodes in the cluster. |
-secondaryNameNodes | gets list of secondary namenodes in the cluster. |
-backupNodes | gets list of backup nodes in the cluster. |
-includeFile | gets the include file path that defines the datanodes that can join the cluster. |
-excludeFile | gets the exclude file path that defines the datanodes that need to decommissioned. |
-nnRpcAddresses | gets the namenode rpc addresses |
-confKey [key] | gets a specific key from the configuration |
Gets configuration information from the configuration directory, post-processing.
Usage: hdfs lsSnapshottableDir [-help]
COMMAND_OPTION | Description |
---|---|
-help | print help |
Get the list of snapshottable directories. When this is run as a super user, it returns all snapshottable directories. Otherwise it returns those directories that are owned by the current user.
Usage: hdfs jmxget [-localVM ConnectorURL | -port port | -server mbeanserver | -service service]
COMMAND_OPTION | Description |
---|---|
-help | print help |
-localVM ConnectorURL | connect to the VM on the same machine |
-port mbean server port | specify mbean server port, if missing it will try to connect to MBean Server in the same VM |
-service | specify jmx service, either DataNode or NameNode, the default |
Dump JMX information from a service.
Usage: hdfs oev [OPTIONS] -i INPUT_FILE -o OUTPUT_FILE
COMMAND_OPTION | Description |
---|---|
-i,--inputFile arg | edits file to process, xml (case insensitive) extension means XML format, any other filename means binary format |
-o,--outputFile arg | Name of output file. If the specified file exists, it will be overwritten, format of the file is determined by -p option |
COMMAND_OPTION | Description |
---|---|
-f,--fix-txids | Renumber the transaction IDs in the input, so that there are no gaps or invalid transaction IDs. |
-h,--help | Display usage information and exit |
-r,--recover | When reading binary edit logs, use recovery mode. This will give you the chance to skip corrupt parts of the edit log. |
-p,--processor arg | Select which type of processor to apply against image file, currently supported processors are: binary (native binary format that Hadoop uses), xml (default, XML format), stats (prints statistics about edits file) |
-v,--verbose | More verbose output, prints the input and output filenames, for processors that write to a file, also output to screen. On large image files this will dramatically increase processing time (default is false). |
Hadoop offline edits viewer.
Usage: hdfs oiv [OPTIONS] -i INPUT_FILE
COMMAND_OPTION | Description |
---|---|
-i,--inputFile arg | edits file to process, xml (case insensitive) extension means XML format, any other filename means binary format |
COMMAND_OPTION | Description |
---|---|
-h,--help | Display usage information and exit |
-o,--outputFile arg | Name of output file. If the specified file exists, it will be overwritten, format of the file is determined by -p option |
-p,--processor arg | Select which type of processor to apply against image file, currently supported processors are: binary (native binary format that Hadoop uses), xml (default, XML format), stats (prints statistics about edits file) |
Hadoop Offline Image Viewer for newer image files.
Usage: hdfs oiv_legacy [OPTIONS] -i INPUT_FILE -o OUTPUT_FILE
COMMAND_OPTION | Description |
---|---|
-h,--help | Display usage information and exit |
-i,--inputFile arg | edits file to process, xml (case insensitive) extension means XML format, any other filename means binary format |
-o,--outputFile arg | Name of output file. If the specified file exists, it will be overwritten, format of the file is determined by -p option |
Hadoop offline image viewer for older versions of Hadoop.
Usage: hdfs snapshotDiff <path> <fromSnapshot> <toSnapshot>
Determine the difference between HDFS snapshots. See the HDFS Snapshot Documentation for more information.
Commands useful for administrators of a hadoop cluster.
Usage:
hdfs balancer [-threshold <threshold>] [-policy <policy>] [-exclude [-f <hosts-file> | <comma-separated list of hosts>]] [-include [-f <hosts-file> | <comma-separated list of hosts>]] [-idleiterations <idleiterations>]
COMMAND_OPTION | Description |
---|---|
-policy <policy> | datanode (default): Cluster is balanced if each datanode is balanced. blockpool: Cluster is balanced if each block pool in each datanode is balanced. |
-threshold <threshold> | Percentage of disk capacity. This overwrites the default threshold. |
-exclude -f <hosts-file> | <comma-separated list of hosts> | Excludes the specified datanodes from being balanced by the balancer. |
-include -f <hosts-file> | <comma-separated list of hosts> | Includes only the specified datanodes to be balanced by the balancer. |
-idleiterations <iterations> | Maximum number of idle iterations before exit. This overwrites the default idleiterations(5). |
Runs a cluster balancing utility. An administrator can simply press Ctrl-C to stop the rebalancing process. See Balancer for more details.
Note that the blockpool policy is more strict than the datanode policy.
Usage: hdfs cacheadmin -addDirective -path <path> -pool <pool-name> [-force] [-replication <replication>] [-ttl <time-to-live>]
See the HDFS Cache Administration Documentation for more information.
Usage:
hdfs crypto -createZone -keyName <keyName> -path <path>
hdfs crypto -help <command-name>
hdfs crypto -listZones
See the HDFS Transparent Encryption Documentation for more information.
Usage: hdfs datanode [-regular | -rollback | -rollingupgrade rollback]
COMMAND_OPTION | Description |
---|---|
-regular | Normal datanode startup (default). |
-rollback | Rollback the datanode to the previous version. This should be used after stopping the datanode and distributing the old hadoop version. |
-rollingupgrade rollback | Rollback a rolling upgrade operation. |
Runs a HDFS datanode.
Usage:
hdfs dfsadmin [GENERIC_OPTIONS] [-report [-live] [-dead] [-decommissioning]] [-safemode enter | leave | get | wait] [-saveNamespace] [-rollEdits] [-restoreFailedStorage true |false |check] [-refreshNodes] [-setQuota <quota> <dirname>...<dirname>] [-clrQuota <dirname>...<dirname>] [-setSpaceQuota <quota> <dirname>...<dirname>] [-clrSpaceQuota <dirname>...<dirname>] [-setStoragePolicy <path> <policyName>] [-getStoragePolicy <path>] [-finalizeUpgrade] [-rollingUpgrade [<query> |<prepare> |<finalize>]] [-metasave filename] [-refreshServiceAcl] [-refreshUserToGroupsMappings] [-refreshSuperUserGroupsConfiguration] [-refreshCallQueue] [-refresh <host:ipc_port> <key> [arg1..argn]] [-reconfig <datanode |...> <host:ipc_port> <start |status>] [-printTopology] [-refreshNamenodes datanodehost:port] [-deleteBlockPool datanode-host:port blockpoolId [force]] [-setBalancerBandwidth <bandwidth in bytes per second>] [-allowSnapshot <snapshotDir>] [-disallowSnapshot <snapshotDir>] [-fetchImage <local directory>] [-shutdownDatanode <datanode_host:ipc_port> [upgrade]] [-getDatanodeInfo <datanode_host:ipc_port>] [-triggerBlockReport [-incremental] <datanode_host:ipc_port>] [-help [cmd]]
COMMAND_OPTION | Description |
---|---|
-report [-live] [-dead] [-decommissioning] | Reports basic filesystem information and statistics. Optional flags may be used to filter the list of displayed DataNodes. |
-safemode enter|leave|get|wait | Safe mode maintenance command. Safe mode is a Namenode state in which it 1. does not accept changes to the name space (read-only) 2. does not replicate or delete blocks. Safe mode is entered automatically at Namenode startup, and leaves safe mode automatically when the configured minimum percentage of blocks satisfies the minimum replication condition. Safe mode can also be entered manually, but then it can only be turned off manually as well. |
-saveNamespace | Save current namespace into storage directories and reset edits log. Requires safe mode. |
-rollEdits | Rolls the edit log on the active NameNode. |
-restoreFailedStorage true|false|check | This option will turn on/off automatic attempt to restore failed storage replicas. If a failed storage becomes available again the system will attempt to restore edits and/or fsimage during checkpoint. ‘check’ option will return current setting. |
-refreshNodes | Re-read the hosts and exclude files to update the set of Datanodes that are allowed to connect to the Namenode and those that should be decommissioned or recommissioned. |
-setQuota <quota> <dirname>…<dirname> | See HDFS Quotas Guide for the detail. |
-clrQuota <dirname>…<dirname> | See HDFS Quotas Guide for the detail. |
-setSpaceQuota <quota> <dirname>…<dirname> | See HDFS Quotas Guide for the detail. |
-clrSpaceQuota <dirname>…<dirname> | See HDFS Quotas Guide for the detail. |
-setStoragePolicy <path> <policyName> | Set a storage policy to a file or a directory. |
-getStoragePolicy <path> | Get the storage policy of a file or a directory. |
-finalizeUpgrade | Finalize upgrade of HDFS. Datanodes delete their previous version working directories, followed by Namenode doing the same. This completes the upgrade process. |
-rollingUpgrade [<query>|<prepare>|<finalize>] | See Rolling Upgrade document for the detail. |
-metasave filename | Save Namenode’s primary data structures to filename in the directory specified by hadoop.log.dir property. filename is overwritten if it exists. filename will contain one line for each of the following 1. Datanodes heart beating with Namenode 2. Blocks waiting to be replicated 3. Blocks currently being replicated 4. Blocks waiting to be deleted |
-refreshServiceAcl | Reload the service-level authorization policy file. |
-refreshUserToGroupsMappings | Refresh user-to-groups mappings. |
-refreshSuperUserGroupsConfiguration | Refresh superuser proxy groups mappings |
-refreshCallQueue | Reload the call queue from config. |
-refresh <host:ipc_port> <key> [arg1..argn] | Triggers a runtime-refresh of the resource specified by <key> on <host:ipc_port>. All other args after are sent to the host. |
-reconfig <datanode |…> <host:ipc_port> <start|status> | Start reconfiguration or get the status of an ongoing reconfiguration. The second parameter specifies the node type. Currently, only reloading DataNode’s configuration is supported. |
-printTopology | Print a tree of the racks and their nodes as reported by the Namenode |
-refreshNamenodes datanodehost:port | For the given datanode, reloads the configuration files, stops serving the removed block-pools and starts serving new block-pools. |
-deleteBlockPool datanode-host:port blockpoolId [force] | If force is passed, block pool directory for the given blockpool id on the given datanode is deleted along with its contents, otherwise the directory is deleted only if it is empty. The command will fail if datanode is still serving the block pool. Refer to refreshNamenodes to shutdown a block pool service on a datanode. |
-setBalancerBandwidth <bandwidth in bytes per second> | Changes the network bandwidth used by each datanode during HDFS block balancing. <bandwidth> is the maximum number of bytes per second that will be used by each datanode. This value overrides the dfs.balance.bandwidthPerSec parameter. NOTE: The new value is not persistent on the DataNode. |
-allowSnapshot <snapshotDir> | Allowing snapshots of a directory to be created. If the operation completes successfully, the directory becomes snapshottable. See the HDFS Snapshot Documentation for more information. |
-disallowSnapshot <snapshotDir> | Disallowing snapshots of a directory to be created. All snapshots of the directory must be deleted before disallowing snapshots. See the HDFS Snapshot Documentation for more information. |
-fetchImage <local directory> | Downloads the most recent fsimage from the NameNode and saves it in the specified local directory. |
-shutdownDatanode <datanode_host:ipc_port> [upgrade] | Submit a shutdown request for the given datanode. See Rolling Upgrade document for the detail. |
-getDatanodeInfo <datanode_host:ipc_port> | Get the information about the given datanode. See Rolling Upgrade document for the detail. |
-triggerBlockReport [-incremental] <datanode_host:ipc_port> | Trigger a block report for the given datanode. If 'incremental' is specified, it will be an incremental block report; otherwise, it will be a full block report. |
-help [cmd] | Displays help for the given command or all commands if none is specified. |
Runs a HDFS dfsadmin client.
Usage:
hdfs haadmin -checkHealth <serviceId>
hdfs haadmin -failover [--forcefence] [--forceactive] <serviceId> <serviceId>
hdfs haadmin -getServiceState <serviceId>
hdfs haadmin -help <command>
hdfs haadmin -transitionToActive <serviceId> [--forceactive]
hdfs haadmin -transitionToStandby <serviceId>
COMMAND_OPTION | Description |
---|---|
-checkHealth | check the health of the given NameNode |
-failover | initiate a failover between two NameNodes |
-getServiceState | determine whether the given NameNode is Active or Standby |
-transitionToActive | transition the state of the given NameNode to Active (Warning: No fencing is done) |
-transitionToStandby | transition the state of the given NameNode to Standby (Warning: No fencing is done) |
See HDFS HA with NFS or HDFS HA with QJM for more information on this command.
Usage: hdfs journalnode
This command starts a journalnode for use with HDFS HA with QJM.
Usage: hdfs mover [-p <files/dirs> | -f <local file name>]
COMMAND_OPTION | Description |
---|---|
-f <local file> | Specify a local file containing a list of HDFS files/dirs to migrate. |
-p <files/dirs> | Specify a space separated list of HDFS files/dirs to migrate. |
Runs the data migration utility. See Mover for more details.
Note that, when both -p and -f options are omitted, the default path is the root directory.
Usage:
hdfs namenode [-backup] | [-checkpoint] | [-format [-clusterid cid ] [-force] [-nonInteractive] ] | [-upgrade [-clusterid cid] [-renameReserved<k-v pairs>] ] | [-upgradeOnly [-clusterid cid] [-renameReserved<k-v pairs>] ] | [-rollback] | [-rollingUpgrade <downgrade |rollback> ] | [-finalize] | [-importCheckpoint] | [-initializeSharedEdits] | [-bootstrapStandby] | [-recover [-force] ] | [-metadataVersion ]
COMMAND_OPTION | Description |
---|---|
-backup | Start backup node. |
-checkpoint | Start checkpoint node. |
-format [-clusterid cid] [-force] [-nonInteractive] | Formats the specified NameNode. It starts the NameNode, formats it and then shut it down. -force option formats if the name directory exists. -nonInteractive option aborts if the name directory exists, unless -force option is specified. |
-upgrade [-clusterid cid] [-renameReserved <k-v pairs>] | Namenode should be started with upgrade option after the distribution of new Hadoop version. |
-upgradeOnly [-clusterid cid] [-renameReserved <k-v pairs>] | Upgrade the specified NameNode and then shutdown it. |
-rollback | Rollback the NameNode to the previous version. This should be used after stopping the cluster and distributing the old Hadoop version. |
-rollingUpgrade <downgrade|rollback|started> | See Rolling Upgrade document for the detail. |
-finalize | Finalize will remove the previous state of the files system. Recent upgrade will become permanent. Rollback option will not be available anymore. After finalization it shuts the NameNode down. |
-importCheckpoint | Loads image from a checkpoint directory and save it into the current one. Checkpoint dir is read from property fs.checkpoint.dir |
-initializeSharedEdits | Format a new shared edits dir and copy in enough edit log segments so that the standby NameNode can start up. |
-bootstrapStandby | Allows the standby NameNode’s storage directories to be bootstrapped by copying the latest namespace snapshot from the active NameNode. This is used when first configuring an HA cluster. |
-recover [-force] | Recover lost metadata on a corrupt filesystem. See HDFS User Guide for the detail. |
-metadataVersion | Verify that configured directories exist, then print the metadata versions of the software and the image. |
Runs the namenode. More info about the upgrade, rollback and finalize is at Upgrade Rollback.
Usage: hdfs secondarynamenode [-checkpoint [force]] | [-format] | [-geteditsize]
COMMAND_OPTION | Description |
---|---|
-checkpoint [force] | Checkpoints the SecondaryNameNode if EditLog size >= fs.checkpoint.size. If force is used, checkpoint irrespective of EditLog size. |
-format | Format the local storage during startup. |
-geteditsize | Prints the number of uncheckpointed transactions on the NameNode. |
Runs the HDFS secondary namenode. See Secondary Namenode for more info.
Usage: hdfs storagepolicies
Lists out all storage policies. See the HDFS Storage Policy Documentation for more information.
Usage: hdfs zkfc [-formatZK [-force] [-nonInteractive]]
COMMAND_OPTION | Description |
---|---|
-formatZK | Format the Zookeeper instance |
-h | Display help |
This command starts a Zookeeper Failover Controller process for use with HDFS HA with QJM.
Useful commands to help administrators debug HDFS issues, like validating block files and calling recoverLease.
Usage: hdfs debug verify [-meta <metadata-file>] [-block <block-file>]
COMMAND_OPTION | Description |
---|---|
-block block-file | Optional parameter to specify the absolute path for the block file on the local file system of the data node. |
-meta metadata-file | Absolute path for the metadata file on the local file system of the data node. |
Verify HDFS metadata and block files. If a block file is specified, we will verify that the checksums in the metadata file match the block file.
Usage: hdfs debug recoverLease [-path <path>] [-retries <num-retries>]
COMMAND_OPTION | Description |
---|---|
[-path path] | HDFS path for which to recover the lease. |
[-retries num-retries] | Number of times the client will retry calling recoverLease. The default number of retries is 1. |
Recover the lease on the specified path. The path must reside on an HDFS filesystem. The default number of retries is 1.
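From Java, the same lease recovery can be requested when the underlying file system is HDFS; a small sketch (the cast and the path are illustrative):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
public class RecoverLeaseDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        DistributedFileSystem dfs = (DistributedFileSystem) fs;   // only valid when fs is HDFS
        // returns true if the lease was recovered and the file is now closed
        boolean recovered = dfs.recoverLease(new Path("/user/hadoop/stuck-file"));
        System.out.println("lease recovered: " + recovered);
    }
}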
Essentially all of Hadoop's file-manipulation classes live in the package "org.apache.hadoop.fs". The operations these APIs support include opening, reading, writing, and deleting files, among others.
The interface class that Hadoop ultimately exposes to users is FileSystem. It is an abstract class, and concrete instances can only be obtained through its get methods. There are several overloads of get; the most commonly used is:
static FileSystem get(Configuration conf);
This class wraps almost every file operation, such as mkdir and delete. From this, the general skeleton of a program that manipulates files is:
void operator() throws IOException {
    Configuration conf = new Configuration();   // obtain a Configuration object
    FileSystem hdfs = FileSystem.get(conf);     // obtain a FileSystem object
    // ... perform the file operations (mkdir, delete, copy, ...)
}
A local file can be uploaded to a specified location on HDFS with "FileSystem.copyFromLocalFile(Path src, Path dst)", where src and dst are both full paths. A concrete example follows:
package com.hebut.file;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class CopyFile {
public static void main(String[] args) throws Exception {
Configuration conf=new Configuration();
FileSystem hdfs=FileSystem.get(conf);
//local file
Path src =new Path("D:\\HebutWinOS");
//HDFS destination
Path dst =new Path("/");
hdfs.copyFromLocalFile(src, dst);
System.out.println("Upload to "+conf.get("fs.default.name"));
FileStatus files[]=hdfs.listStatus(dst);
for(FileStatus file:files){
System.out.println(file.getPath());
}
}
}
The result can be checked in the console, the project browser, and SecureCRT, as shown in Figures 6-1-1, 6-1-2, and 6-1-3.
1) Console output
Figure 6-1-1 Result (1)
2) Project browser
Figure 6-1-2 Result (2)
3) SecureCRT output
Figure 6-1-3 Result (3)
A file can be created on HDFS with "FileSystem.create(Path f)", where f is the file's full path. A concrete implementation follows:
package com.hebut.file;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class CreateFile {
public static void main(String[] args) throws Exception {
Configuration conf=new Configuration();
FileSystem hdfs=FileSystem.get(conf);
byte[] buff="hello hadoop world!\n".getBytes();
Path dfs=new Path("/test");
FSDataOutputStream outputStream=hdfs.create(dfs);
outputStream.write(buff,0,buff.length);
outputStream.close();
}
}
The results are shown in Figures 6-2-1 and 6-2-2.
1) Project browser
Figure 6-2-1 Result (1)
2) SecureCRT output
Figure 6-2-2 Result (2)
A directory can be created on HDFS with "FileSystem.mkdirs(Path f)", where f is the directory's full path. A concrete implementation follows:
package com.hebut.dir;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class CreateDir {
public static void main(String[] args) throws Exception{
Configuration conf=new Configuration();
FileSystem hdfs=FileSystem.get(conf);
Path dfs=new Path("/TestDir");
hdfs.mkdirs(dfs);
}
}
The results are shown in Figures 6-3-1 and 6-3-2.
1) Project browser
Figure 6-3-1 Result (1)
2) SecureCRT output
Figure 6-3-2 Result (2)
A specified HDFS file can be renamed with "FileSystem.rename(Path src, Path dst)", where src and dst are both full file paths. A concrete implementation follows:
package com.hebut.file;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class Rename{
public static void main(String[] args) throws Exception {
Configuration conf=new Configuration();
FileSystem hdfs=FileSystem.get(conf);
Path frpath=new Path("/test"); //old file name
Path topath=new Path("/test1"); //new file name
boolean isRename=hdfs.rename(frpath, topath);
String result=isRename?"successful":"failed";
System.out.println("Result of renaming the file: "+result);
}
}
The results are shown in Figures 6-4-1 and 6-4-2.
1) Project browser
Figure 6-4-1 Result (1)
2) SecureCRT output
Figure 6-4-2 Result (2)
A specified HDFS file can be deleted with "FileSystem.delete(Path f, Boolean recursive)", where f is the full path of the file to delete and recursive determines whether the deletion is recursive. A concrete implementation follows:
package com.hebut.file;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class DeleteFile {
public static void main(String[] args) throws Exception {
Configuration conf=new Configuration();
FileSystem hdfs=FileSystem.get(conf);
Path delef=new Path("/test1");
boolean isDeleted=hdfs.delete(delef,false);
//recursive delete:
//boolean isDeleted=hdfs.delete(delef,true);
System.out.println("Delete?"+isDeleted);
}
}
The results are shown in Figures 6-5-1 and 6-5-2.
1) Console output
Figure 6-5-1 Result (1)
2) Project browser
Figure 6-5-2 Result (2)
Deleting a directory works the same as deleting a file; just pass the directory path instead, and if the directory contains files, delete recursively, as sketched below.
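A minimal sketch along those lines (the class name DeleteDir and the directory path are illustrative):
package com.hebut.file;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class DeleteDir {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        Path dir = new Path("/TestDir");              // illustrative directory
        boolean isDeleted = hdfs.delete(dir, true);   // true = delete the directory recursively
        System.out.println("Delete?" + isDeleted);
    }
}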
Whether a specified HDFS file exists can be checked with "FileSystem.exists(Path f)", where f is the file's full path. A concrete implementation follows:
package com.hebut.file;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class CheckFile {
public static void main(String[] args) throws Exception {
Configuration conf=new Configuration();
FileSystem hdfs=FileSystem.get(conf);
Path findf=new Path("/test1");
boolean isExists=hdfs.exists(findf);
System.out.println("Exist?"+isExists);
}
}
The results are shown in Figures 6-7-1 and 6-7-2.
1) Console output
Figure 6-7-1 Result (1)
2) Project browser
Figure 6-7-2 Result (2)
The modification time of a specified HDFS file can be read with "FileStatus.getModificationTime()". A concrete implementation follows:
package com.hebut.file;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class GetLTime {
public static void main(String[] args) throws Exception {
Configuration conf=new Configuration();
FileSystem hdfs=FileSystem.get(conf);
Path fpath =new Path("/user/hadoop/test/file1.txt");
FileStatus fileStatus=hdfs.getFileStatus(fpath);
long modiTime=fileStatus.getModificationTime();
System.out.println("file1.txt的修改時間是"+modiTime);
}
}
The result is shown in Figure 6-8-1.
Figure 6-8-1 Console output
All files under a given HDFS directory can be listed with "FileStatus.getPath()". A concrete implementation follows:
package com.hebut.file;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class ListAllFile {
public static void main(String[] args) throws Exception {
Configuration conf=new Configuration();
FileSystem hdfs=FileSystem.get(conf);
Path listf =new Path("/user/hadoop/test");
FileStatus stats[]=hdfs.listStatus(listf);
for(int i = 0; i < stats.length; ++i)
{
System.out.println(stats[i].getPath().toString());
}
hdfs.close();
}
}
The results are shown in Figures 6-9-1 and 6-9-2.
1) Console output
Figure 6-9-1 Result (1)
2) Project browser
Figure 6-9-2 Result (2)
The location of a given file on the HDFS cluster can be found with "FileSystem.getFileBlockLocations(FileStatus file, long start, long len)", where file is the file's full path and start and len identify the byte range of the file to look up. A concrete implementation follows:
package com.hebut.file;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class FileLoc {
public static void main(String[] args) throws Exception {
Configuration conf=new Configuration();
FileSystem hdfs=FileSystem.get(conf);
Path fpath=new Path("/user/hadoop/cygwin");
FileStatus filestatus = hdfs.getFileStatus(fpath);
BlockLocation[] blkLocations = hdfs.getFileBlockLocations(filestatus, 0, filestatus.getLen());
int blockLen = blkLocations.length;
for(int i=0;i<blockLen;i++){
String[] hosts = blkLocations[i].getHosts();
System.out.println("block_"+i+"_location:"+hosts[0]);
}
}
}
The results are shown in Figures 6-10-1 and 6-10-2.
1) Console output
Figure 6-10-1 Result (1)
2) Project browser
Figure 6-10-2 Result (2)
The names of all nodes in the HDFS cluster can be obtained with "DatanodeInfo.getHostName()". A concrete implementation follows:
package com.hebut.file;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
public class GetList {
public static void main(String[] args) throws Exception {
Configuration conf=new Configuration();
FileSystem fs=FileSystem.get(conf);
DistributedFileSystem hdfs = (DistributedFileSystem)fs;
DatanodeInfo[] dataNodeStats = hdfs.getDataNodeStats();
for(int i=0;i<dataNodeStats.length;i++){
System.out.println("DataNode_"+i+"_Name:"+dataNodeStats[i].getHostName());
}
}
}
The result is shown in Figure 6-11-1.
Figure 6-11-1 Console output
The file read process is as follows:
1) Explanation one
2) Explanation two
Writing a file is somewhat more complex than reading it:
1) Explanation one
2) Explanation two
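For reference, a minimal hedged sketch of the client side of the write path on Hadoop 2.x (the path is illustrative): the output stream buffers data into packets, hflush() pushes them through the DataNode pipeline, and close() waits for the acknowledgements and completes the file with the NameNode.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class WritePipelineSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        FSDataOutputStream out = hdfs.create(new Path("/user/hadoop/pipeline.txt")); // illustrative path
        out.write("first record\n".getBytes());
        out.hflush();                       // flush buffered packets through the DataNode pipeline
        out.write("second record\n".getBytes());
        out.close();                        // waits for acknowledgements and completes the file
    }
}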