[Hadoop] HDFS - Hadoop Distributed File System

Preface


1. Historical evolution

Outdated notes: [Spark] 01 - What is Spark

Official documentation: https://hadoop.apache.org/docs/r2.7.3/

 

Three core modules: HDFS, MapReduce, YARN

The big-data learning technology stack, from business layer down to data collection:

Project practice    : big-data business practice
Visualization       : ECharts.js, D3.js, open-source reporting systems
ETL                 : Sqoop, DataX
Analysis & mining   : Mahout, Hive, Pig, R, MLlib
Data computation    : MapReduce, Storm, Impala, Tez, Presto, Spark, Spark Streaming
Data storage        : HDFS, HBase, Cassandra
Data collection     : Flume, Kafka, Scribe

 

2. Originators

Even more impressive than this person are, naturally, two mysterious figures inside Google (Dr. Uri Lerner and engineer Mike Yar).

Very few companies are capable of writing an entire distributed framework at the C++ level; most still rely on the Java platform.

Ref: [Distributed ML] Yi WANG's talk

 

3. Configuration files

hdfs-site.xml

<configuration>
    <!-- Directory where the NameNode keeps its metadata (fsimage + edits) -->
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/usr/local/hadoop/namenode_dir</value>
    </property>

    <!-- Directory where each DataNode stores its block data -->
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/usr/local/hadoop/datanode_dir</value>
    </property>

    <!-- Number of replicas kept for each block -->
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
</configuration>
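
After editing the file, the three properties can be read back programmatically as a quick sanity check. A minimal Python sketch; the config path below assumes the usual /usr/local/hadoop/etc/hadoop layout and is only an illustration:

import xml.etree.ElementTree as ET

# Assumed location of the config file; adjust to your installation.
conf_path = "/usr/local/hadoop/etc/hadoop/hdfs-site.xml"
root = ET.parse(conf_path).getroot()

# Collect <name>/<value> pairs from every <property> element.
props = {p.findtext("name"): p.findtext("value") for p in root.findall("property")}

print("NameNode dir :", props.get("dfs.namenode.name.dir"))
print("DataNode dir :", props.get("dfs.datanode.data.dir"))
print("Replication  :", props.get("dfs.replication"))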

 

 

 

NameNode


1. Metadata

Link: https://www.jianshu.com/p/4ee877c14957

Metadata is data about data.

There are two approaches to metadata management: centralized and distributed. With centralized management, one node in the system is dedicated to metadata management and all metadata is stored on that node's storage; before any request for a file, a client must first ask this metadata manager for the metadata. With distributed management, metadata can live on any node in the system and can migrate dynamically, and the responsibility for managing it is likewise spread across different nodes. Most cluster file systems use centralized metadata management, because it is simple to implement, consistency is easy to maintain, and it delivers satisfactory performance as long as the rate of operations stays within limits. Its drawbacks are the single point of failure (if that server fails, the whole system stops working) and that, once metadata operations become too frequent, the centralized metadata manager becomes the performance bottleneck of the whole system.

The advantage of distributed metadata management is that it removes the single point of failure of the centralized approach, and performance does not hit a bottleneck as operations become more frequent. Its drawbacks are a more complex implementation, more complex consistency maintenance, and some impact on performance.

 

2. Secondary NameNode

 
 
The paths used by the Edits + Fsimage mechanism are specified in hdfs-site.xml.
When (re)starting the service you may have to delete some stale files: sudo rm -r /usr/local/hadoop_store/hdfs/datanode/current

The files found under the current folder:

 

1. Fsimage file: a permanent checkpoint of the HDFS file system metadata, containing the serialized information of all directory and file inodes in the file system;
2. Fsimage.md5 file: the MD5 checksum of the image file, used to detect whether the image file has been modified;
3. Edits file: records every update operation on the HDFS file system; all write operations performed by file system clients are first logged to the Edits file.
4. seen_txid file: holds the last transaction id of the edits_* files on the NameNode; when the NameNode restarts, it replays the edits in order from edits_0000001 up to the number recorded in seen_txid.
5. VERSION file: records basic information about the current NameNode (namespaceID, clusterID, and so on).
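
These files can be inspected directly on the NameNode host. A minimal Python sketch, assuming dfs.namenode.name.dir is /usr/local/hadoop/namenode_dir as in the hdfs-site.xml above:

from pathlib import Path

# Assumption: dfs.namenode.name.dir = /usr/local/hadoop/namenode_dir
current = Path("/usr/local/hadoop/namenode_dir/current")

# seen_txid: the last transaction id recorded by this NameNode
print("seen_txid:", (current / "seen_txid").read_text().strip())

# VERSION: namespaceID, clusterID, layoutVersion, ...
print((current / "VERSION").read_text())

# Checkpoint images and edit-log segments, with their sizes
for f in sorted(current.glob("fsimage_*")) + sorted(current.glob("edits_*")):
    print(f.name, f.stat().st_size, "bytes")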

3. Read/write flow

Writing flow

The NameNode supplies the block information, the replication (redundancy) information, the DataNode list, and so on.

This involves two client-side queues: the data queue (packets waiting to be streamed down the DataNode pipeline) and the ack queue (packets already sent but not yet acknowledged by every replica).
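
The data/ack queues are internals of the Java DFSClient; from an application's point of view a write is just a stream. A minimal Python sketch using the third-party HdfsCLI package over WebHDFS (an assumption; these notes do not fix a client library):

from hdfs import InsecureClient   # pip install hdfs (HdfsCLI)

# WebHDFS endpoint of the NameNode (Hadoop 3.x HTTP port 9870).
client = InsecureClient("http://node-master:9870", user="hadoop")

# The client streams the data; replication across the DataNode
# pipeline is handled on the server side.
with client.write("/user/hadoop/demo.txt", overwrite=True, encoding="utf-8") as writer:
    writer.write("hello hdfs\n")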

 

Reading flow
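
On the read path the client first asks the NameNode for the block locations, then streams each block from a DataNode that holds a replica. A matching read sketch with the same assumed HdfsCLI setup:

from hdfs import InsecureClient

client = InsecureClient("http://node-master:9870", user="hadoop")

# WebHDFS redirects the read to a DataNode that actually holds the block.
with client.read("/user/hadoop/demo.txt", encoding="utf-8") as reader:
    print(reader.read())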

 

4. Fault Recovery

This involves the heartbeat mechanism, High Availability (HA), and Federation.

Ref: Hadoop NameNode High Availability (HA) implementation explained

The content behind that link deserves a separate article of its own.

 

 

 

Interactive commands and programming


基本操做

Ref: HDFS Commands Guide

1. Upload local files

hdfs dfs -put alice.txt holmes.txt frankenstein.txt books

 

2. Download to local

hdfs dfs -get /hdfsPath /localPath
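
The same put/get round trip can also be scripted. A sketch with the assumed HdfsCLI client used above:

from hdfs import InsecureClient

client = InsecureClient("http://node-master:9870", user="hadoop")

# Equivalent of: hdfs dfs -put alice.txt books
client.upload("/user/hadoop/books", "alice.txt", overwrite=True)

# Equivalent of: hdfs dfs -get /hdfsPath /localPath
client.download("/user/hadoop/books/alice.txt", "/tmp/alice.txt", overwrite=True)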

 

3. Common questions

Which physical blocks does a file actually live on?

Goto: FileStatus, BlockLocation, LocatedBlocks and InputSplit in Hadoop

hadoop@node-master:~$ hdfs fsck /alice.txt -files -blocks -locations
Connecting to namenode via http://node-master:9870/fsck?ugi=hadoop&files=1&blocks=1&locations=1&path=%2Falice.txt
FSCK started by hadoop (auth:SIMPLE) from /192.168.56.2 for path /alice.txt at Thu Oct 24 23:12:07 AEDT 2019
/alice.txt 173595 bytes, replicated: replication=2, 1 block(s):  OK
0. BP-1744165533-192.168.56.2-1571910405666:blk_1073741825_1001 len=173595 Live_repl=2  [DatanodeInfoWithStorage[192.168.56.102:9866,DS-c46e8bf7-6e6b-4307-9fd3-87366ecb33de,DISK], DatanodeInfoWithStorage[192.168.56.101:9866,DS-67f3e0a5-e281-401f-b5d9-f1edae42eb33,DISK]]


Status: HEALTHY
 Number of data-nodes:    3
 Number of racks:        1
 Total dirs:            0
 Total symlinks:        0

Replicated Blocks:
 Total size:    173595 B
 Total files:    1
 Total blocks (validated):    1 (avg. block size 173595 B)
 Minimally replicated blocks:    1 (100.0 %)
 Over-replicated blocks:    0 (0.0 %)
 Under-replicated blocks:    0 (0.0 %)
 Mis-replicated blocks:        0 (0.0 %)
 Default replication factor:    2
 Average block replication:    2.0
 Missing blocks:        0
 Corrupt blocks:        0
 Missing replicas:        0 (0.0 %)

Erasure Coded Block Groups:
 Total size:    0 B
 Total files:    0
 Total block groups (validated):    0
 Minimally erasure-coded block groups:    0
 Over-erasure-coded block groups:    0
 Under-erasure-coded block groups:    0
 Unsatisfactory placement block groups:    0
 Average block group size:    0.0
 Missing block groups:        0
 Corrupt block groups:        0
 Missing internal blocks:    0
FSCK ended at Thu Oct 24 23:12:07 AEDT 2019 in 15 milliseconds


The filesystem under path '/alice.txt' is HEALTHY

 

 

 

其餘操做

1. Using hdfs getconf

hadoop@node-master:/usr/local/hadoop/data/nameNode$ hdfs getconf -namenodes
node-master

hadoop@node-master:/usr/local/hadoop/data/nameNode$ hdfs getconf -confKey dfs.namenode.fs-limits.min-block-size
1048576
hadoop@node-master:/usr/local/hadoop/data/nameNode$ hdfs getconf -nnRpcAddresses
node-master:9000

 

2. Safe mode

There are three DataNodes; why is one of them not up?

hadoop@node-master:/usr/local/hadoop/data/nameNode$ hdfs dfsadmin -safemode get
Safe mode is ON

hadoop@node-master:/usr/local/hadoop/data/nameNode$ hdfs dfsadmin -safemode enter
Safe mode is ON

hadoop@node-master:/usr/local/hadoop/data/nameNode$ hdfs dfsadmin -report
Safe mode is ON

WARNING: 
Name node has detected blocks with generation stamps in future.
Forcing exit from safemode will cause 10368971 byte(s) to be deleted.
If you are sure that the NameNode was started with the correct metadata files then you may proceed with '-safemode forceExit'

Configured Capacity: 20014161920 (18.64 GB)
Present Capacity: 6477033472 (6.03 GB)
DFS Remaining: 6474620928 (6.03 GB)
DFS Used: 2412544 (2.30 MB)
DFS Used%: 0.04%
Replicated Blocks:
    Under replicated blocks: 0
    Blocks with corrupt replicas: 0
    Missing blocks: 0
    Missing blocks (with replication factor 1): 0
    Low redundancy blocks with highest priority to recover: 0
    Pending deletion blocks: 0
Erasure Coded Block Groups: 
    Low redundancy block groups: 0
    Block groups with corrupt internal blocks: 0
    Missing block groups: 0
    Low redundancy blocks with highest priority to recover: 0
    Pending deletion blocks: 0

-------------------------------------------------
Live datanodes (2):

Name: 192.168.56.101:9866 (node1)
Hostname: node1
Decommission Status : Normal
Configured Capacity: 10007080960 (9.32 GB)
DFS Used: 811008 (792 KB)
Non DFS Used: 6239903744 (5.81 GB)
DFS Remaining: 3237834752 (3.02 GB)
DFS Used%: 0.01%
DFS Remaining%: 32.36%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Thu Oct 24 20:15:20 AEDT 2019
Last Block Report: Thu Oct 24 20:11:35 AEDT 2019
Num of Blocks: 0


Name: 192.168.56.103:9866 (node3)
Hostname: node3
Decommission Status : Normal
Configured Capacity: 10007080960 (9.32 GB)
DFS Used: 1601536 (1.53 MB)
Non DFS Used: 6240161792 (5.81 GB)
DFS Remaining: 3236786176 (3.01 GB)
DFS Used%: 0.02%
DFS Remaining%: 32.34%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Thu Oct 24 20:15:21 AEDT 2019
Last Block Report: Thu Oct 24 20:12:33 AEDT 2019
Num of Blocks: 3

  

3. Display HDFS block information

hadoop@node-master:/usr/local/hadoop/data/nameNode$ hdfs fsck / -files -blocks
Connecting to namenode via http://node-master:9870/fsck?ugi=hadoop&files=1&blocks=1&path=%2F
FSCK started by hadoop (auth:SIMPLE) from /192.168.56.2 for path / at Thu Oct 24 20:16:39 AEDT 2019
/ <dir>
/user <dir>
/user/hadoop <dir>
/user/hadoop/books <dir>
/user/hadoop/books/alice.txt 173595 bytes, replicated: replication=1, 1 block(s):  OK
0. BP-1068893594-192.168.56.2-1571893848809:blk_1073741825_1001 len=173595 Live_repl=1

/user/hadoop/books/frankenstein.txt 450783 bytes, replicated: replication=1, 1 block(s):  OK
0. BP-1068893594-192.168.56.2-1571893848809:blk_1073741827_1003 len=450783 Live_repl=1

/user/hadoop/books/holmes.txt 607788 bytes, replicated: replication=1, 1 block(s):  OK
0. BP-1068893594-192.168.56.2-1571893848809:blk_1073741826_1002 len=607788 Live_repl=1


Status: HEALTHY
 Number of data-nodes:    2
 Number of racks:        1
 Total dirs:            4
 Total symlinks:        0

Replicated Blocks:
 Total size:    1232166 B
 Total files:    3
 Total blocks (validated):    3 (avg. block size 410722 B)
 Minimally replicated blocks:    3 (100.0 %)
 Over-replicated blocks:    0 (0.0 %)
 Under-replicated blocks:    0 (0.0 %)
 Mis-replicated blocks:        0 (0.0 %)
 Default replication factor:    2
 Average block replication:    1.0
 Missing blocks:        0
 Corrupt blocks:        0
 Missing replicas:        0 (0.0 %)

Erasure Coded Block Groups:
 Total size:    0 B
 Total files:    0
 Total block groups (validated):    0
 Minimally erasure-coded block groups:    0
 Over-erasure-coded block groups:    0
 Under-erasure-coded block groups:    0
 Unsatisfactory placement block groups:    0
 Average block group size:    0.0
 Missing block groups:        0
 Corrupt block groups:        0
 Missing internal blocks:    0
FSCK ended at Thu Oct 24 20:16:39 AEDT 2019 in 12 milliseconds


The filesystem under path '/' is HEALTHY

    

4. Format the NameNode

(Use with caution: normally this is run only once, when the cluster is first set up. On current releases the preferred form is hdfs namenode -format.)

hadoop namenode -format

 

 

 

Operating on HDFS from Python


<Hadoop With Python>

Download: https://pythonizame.s3.amazonaws.com/media/Book/demo-demo/file/25755e12-ae04-11e6-ba9c-040196293901.pdf

 

/* The library is not very stable */
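
The notes do not say which library was found unstable. One commonly used option is HdfsCLI (pip install hdfs), which talks to the NameNode over WebHDFS; a minimal sketch under that assumption:

from hdfs import InsecureClient

# NameNode WebHDFS endpoint; WebHDFS must be enabled (it is by default).
client = InsecureClient("http://node-master:9870", user="hadoop")

# Directory listing, like: hdfs dfs -ls /user/hadoop/books
print(client.list("/user/hadoop/books"))

# File status: size, replication, owner, modification time, ...
print(client.status("/user/hadoop/books/alice.txt"))

Other choices are snakebite (used in the Hadoop with Python book) and pydoop; they differ mainly in whether they speak the native RPC protocol or WebHDFS.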

 

End.
