Outdated notes: [Spark] 01 - What is Spark
The three core modules: HDFS, MapReduce, YARN
Big data learning technology stack:
Big data business practice | Project practice |
Echarts.js, D3.js, open-source reporting systems | Visualization |
Sqoop, DataX | ETL |
Mahout, Hive, Pig, R, MLlib | Analysis and mining |
MapReduce, Storm, Impala, Tez, Presto, Spark, Spark Streaming | Data computation |
HDFS, HBase, Cassandra | Data storage |
Flume, Kafka, Scribe | Data collection |
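Even more impressive than this person, naturally, are the two mysterious figures inside Google (Dr. Uri Lerner and engineer Mike Yar).
Very few companies are able to write an entire distributed framework at the C++ level; most still rely on the Java platform.
Ref: [Distributed ML] Yi WANG's talk
hdfs-site.xml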
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/namenode_dir</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/datanode_dir</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
</configuration>
Link: https://www.jianshu.com/p/4ee877c14957
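To make the three properties in the hdfs-site.xml fragment concrete, here is a minimal sketch that reads them with Python's standard library. The XML is inlined from the fragment above so the example is self-contained; the helper name `load_properties` is hypothetical and not part of any Hadoop tooling.

```python
import xml.etree.ElementTree as ET

# Inlined copy of the hdfs-site.xml fragment shown above.
HDFS_SITE = """<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/namenode_dir</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/datanode_dir</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
</configuration>"""


def load_properties(xml_text):
    """Return a dict mapping each <name> to its <value> (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name] = prop.findtext("value")
    return props


props = load_properties(HDFS_SITE)
for key, value in props.items():
    print(f"{key} = {value}")
```

The values come back as strings, so a numeric setting such as `dfs.replication` needs an explicit `int(...)` before it is used in arithmetic.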
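Metadata is data about data.
There are two approaches to metadata management: centralized and distributed. In centralized management, one node in the system is dedicated to metadata management, and all metadata is stored on that node's storage devices. Before accessing a file, every client must first request the metadata from this metadata manager. In distributed management, metadata can be placed on any node in the system and can be migrated dynamically; the responsibility for managing it is likewise spread across different nodes. Most cluster file systems use centralized metadata management because it is simple to implement, makes consistency easy to maintain, and delivers satisfactory performance up to a certain operation frequency. Its drawback is the single point of failure: if that server fails, the whole system stops working. Moreover, when metadata operations become too frequent, the centralized metadata manager becomes the performance bottleneck of the whole system.
Distributed metadata management removes the single point of failure of the centralized approach, and performance does not hit a bottleneck as operations become more frequent. Its drawbacks are a complex implementation, complex consistency maintenance, and some impact on performance.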
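The trade-off between the two approaches can be illustrated with a toy model. This is only a sketch under strong assumptions: the class names are hypothetical, the "distributed" variant places metadata by hashing the path (static placement, with no dynamic migration and no consistency protocol), and neither class reflects how any real file system is implemented.

```python
import hashlib


class MetadataNode:
    """A single node holding a metadata table (toy model)."""

    def __init__(self):
        self.table = {}
        self.alive = True

    def put(self, path, meta):
        self._check()
        self.table[path] = meta

    def get(self, path):
        self._check()
        return self.table[path]

    def _check(self):
        if not self.alive:
            raise RuntimeError("metadata node is down")


class PartitionedMetadata:
    """Metadata spread over several nodes by hashing the path (assumption)."""

    def __init__(self, node_count):
        self.nodes = [MetadataNode() for _ in range(node_count)]

    def owner(self, path):
        digest = hashlib.sha256(path.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, path, meta):
        self.owner(path).put(path, meta)

    def get(self, path):
        return self.owner(path).get(path)


paths = ["/a", "/b", "/c", "/d", "/e", "/f"]

# Centralized: one failure makes every lookup fail.
central = MetadataNode()
for p in paths:
    central.put(p, {"size": 1})
central.alive = False

# Partitioned: one failure only affects the paths owned by that node.
spread = PartitionedMetadata(3)
for p in paths:
    spread.put(p, {"size": 1})
failed = spread.owner("/a")
failed.alive = False
still_readable = [p for p in paths if spread.owner(p) is not failed]
print("readable after one node failure:", still_readable)
```

The extra routing step in `owner` hints at why the distributed approach is harder to implement: every client has to agree on where each piece of metadata lives.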
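Files in the current folder:
1. Fsimage file: a persistent checkpoint of the HDFS file system metadata, containing the serialized information of all directory and file inodes in HDFS;
2. Fsimage.md5 file: the MD5 checksum of the image file, used to determine whether the image file has been modified;
3. Edits file: stores all update operations on the HDFS file system; every write operation performed by a file system client is first recorded in the Edits file.
4. seen_txid file: holds the trailing number (transaction ID) of the NameNode's edits_* files. When the NameNode restarts, it replays the edits in order, from edits_0000001~ up to the number in seen_txid.
5. VERSION file: records some information about the current NameNode.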
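The relationship between the Fsimage checkpoint, the Edits log, and seen_txid can be sketched as a replay loop. This is a toy in-memory model, not the NameNode's on-disk format: the function name, the dictionary layout, and the operation names are all hypothetical, and the sketch assumes the checkpoint records the last transaction it already includes so that only later edits are replayed.

```python
def rebuild_namespace(fsimage, edits, seen_txid):
    """Rebuild the set of paths from a checkpoint plus logged edits (toy model)."""
    namespace = set(fsimage["paths"])
    for txid, op, path in sorted(edits):
        # Skip edits already folded into the checkpoint, and anything
        # beyond the last transaction recorded in seen_txid.
        if txid <= fsimage["txid"] or txid > seen_txid:
            continue
        if op in ("mkdir", "create"):
            namespace.add(path)
        elif op == "delete":
            namespace.discard(path)
    return namespace


# Checkpoint taken after transaction 2.
fsimage = {"txid": 2, "paths": {"/", "/user", "/user/hadoop"}}

edits = [
    (1, "mkdir", "/user"),
    (2, "mkdir", "/user/hadoop"),
    (3, "create", "/user/hadoop/alice.txt"),
    (4, "create", "/user/hadoop/tmp.txt"),
    (5, "delete", "/user/hadoop/tmp.txt"),
    (6, "create", "/user/hadoop/late.txt"),  # beyond seen_txid, ignored
]

namespace = rebuild_namespace(fsimage, edits, seen_txid=5)
print(sorted(namespace))
```

Because the checkpoint already contains the effect of the early transactions, startup cost depends on how many edits have accumulated since the last checkpoint rather than on the full history.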
The NameNode provides block information, replica information, the DataNode list, and so on.
This involves: data queue, ack queue.
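The roles of the two queues can be sketched as follows: packets wait in the data queue, move to the ack queue while in flight, and leave the ack queue only once acknowledged; on a failure, unacknowledged packets go back to the front of the data queue. This is a simplified, single-threaded sketch. The function `write_packets`, the `send` callback, and the retry limit are hypothetical and do not correspond to the HDFS client API.

```python
from collections import deque


def write_packets(packets, send, max_retries=3):
    """Send packets in order, tracking in-flight ones in an ack queue (toy model)."""
    data_queue = deque(packets)   # packets waiting to be sent
    ack_queue = deque()           # packets sent but not yet acknowledged
    acked = []
    retries = 0
    while data_queue:
        packet = data_queue.popleft()
        ack_queue.append(packet)
        if send(packet):
            # Acknowledged by the pipeline: drop it from the ack queue.
            acked.append(ack_queue.popleft())
            retries = 0
        else:
            retries += 1
            if retries > max_retries:
                raise IOError(f"giving up on packet {packet!r}")
            # Failure: move unacknowledged packets back to the front
            # of the data queue so nothing is lost.
            while ack_queue:
                data_queue.appendleft(ack_queue.pop())
    return acked


attempts = {}


def flaky_send(packet):
    """Hypothetical pipeline that drops the first attempt of packet 'p2'."""
    attempts[packet] = attempts.get(packet, 0) + 1
    return not (packet == "p2" and attempts[packet] == 1)


result = write_packets(["p1", "p2", "p3"], flaky_send)
print(result, attempts)
```

Keeping in-flight packets in a separate queue is what lets the writer resend exactly the data that was never confirmed, without resending what already arrived.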
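This involves: the heartbeat mechanism, high availability, and federation.
Ref: Hadoop NameNode High Availability implementation analysis
The linked material deserves a separate article of its own.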
Ref: HDFS Commands Guide
hdfs dfs -put alice.txt holmes.txt frankenstein.txt books
hdfs dfs -get /hdfsPath /localPath
Which physical block is a file actually stored in?
Goto: FileStatus, BlockLocation, LocatedBlocks, and InputSplit in Hadoop
hadoop@node-master:~$ hdfs fsck /alice.txt -files -blocks -locations
Connecting to namenode via http://node-master:9870/fsck?ugi=hadoop&files=1&blocks=1&locations=1&path=%2Falice.txt
FSCK started by hadoop (auth:SIMPLE) from /192.168.56.2 for path /alice.txt at Thu Oct 24 23:12:07 AEDT 2019
/alice.txt 173595 bytes, replicated: replication=2, 1 block(s): OK
0. BP-1744165533-192.168.56.2-1571910405666:blk_1073741825_1001 len=173595 Live_repl=2 [DatanodeInfoWithStorage[192.168.56.102:9866,DS-c46e8bf7-6e6b-4307-9fd3-87366ecb33de,DISK], DatanodeInfoWithStorage[192.168.56.101:9866,DS-67f3e0a5-e281-401f-b5d9-f1edae42eb33,DISK]]

Status: HEALTHY
Number of data-nodes: 3
Number of racks: 1
Total dirs: 0
Total symlinks: 0

Replicated Blocks:
Total size: 173595 B
Total files: 1
Total blocks (validated): 1 (avg. block size 173595 B)
Minimally replicated blocks: 1 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 2
Average block replication: 2.0
Missing blocks: 0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)

Erasure Coded Block Groups:
Total size: 0 B
Total files: 0
Total block groups (validated): 0
Minimally erasure-coded block groups: 0
Over-erasure-coded block groups: 0
Under-erasure-coded block groups: 0
Unsatisfactory placement block groups: 0
Average block group size: 0.0
Missing block groups: 0
Corrupt block groups: 0
Missing internal blocks: 0
FSCK ended at Thu Oct 24 23:12:07 AEDT 2019 in 15 milliseconds

The filesystem under path '/alice.txt' is HEALTHY
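To answer "which DataNodes hold this block" programmatically, the block line of the fsck output above can be picked apart with regular expressions. This is a rough sketch that only targets the line format seen in this particular output; the function name `parse_block_line` is hypothetical, and the format may differ across Hadoop versions.

```python
import re

# The block line copied from the fsck output above.
FSCK_BLOCK_LINE = (
    "0. BP-1744165533-192.168.56.2-1571910405666:blk_1073741825_1001 "
    "len=173595 Live_repl=2 "
    "[DatanodeInfoWithStorage[192.168.56.102:9866,"
    "DS-c46e8bf7-6e6b-4307-9fd3-87366ecb33de,DISK], "
    "DatanodeInfoWithStorage[192.168.56.101:9866,"
    "DS-67f3e0a5-e281-401f-b5d9-f1edae42eb33,DISK]]"
)


def parse_block_line(line):
    """Extract block id, length, live replicas and DataNode addresses."""
    block = re.search(r"(blk_\d+)_(\d+)", line)
    length = re.search(r"len=(\d+)", line)
    live = re.search(r"Live_repl=(\d+)", line)
    datanodes = re.findall(r"DatanodeInfoWithStorage\[([\d.]+:\d+),", line)
    return {
        "block_id": block.group(1),
        "generation_stamp": int(block.group(2)),
        "length": int(length.group(1)),
        "live_replicas": int(live.group(1)),
        "datanodes": datanodes,
    }


info = parse_block_line(FSCK_BLOCK_LINE)
print(info["block_id"], "is stored on", info["datanodes"])
```

Here the single block of /alice.txt has two live replicas, one on 192.168.56.102 and one on 192.168.56.101, which matches `Live_repl=2`.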
hadoop@node-master:/usr/local/hadoop/data/nameNode$ hdfs getconf -namenodes
node-master
hadoop@node-master:/usr/local/hadoop/data/nameNode$ hdfs getconf -confKey dfs.namenode.fs-limits.min-block-size
1048576
hadoop@node-master:/usr/local/hadoop/data/nameNode$ hdfs getconf -nnRpcAddresses
node-master:9000
There are three DataNodes, so why did one of them fail to start?
hadoop@node-master:/usr/local/hadoop/data/nameNode$ Safe mode is ON
hadoop@node-master:/usr/local/hadoop/data/nameNode$ hdfs dfsadmin -safemode enter
Safe mode is ON
hadoop@node-master:/usr/local/hadoop/data/nameNode$ hdfs dfsadmin -report
Safe mode is ON
WARNING: Name node has detected blocks with generation stamps in future. Forcing exit from safemode will cause 10368971 byte(s) to be deleted. If you are sure that the NameNode was started with the correct metadata files then you may proceed with '-safemode forceExit'

Configured Capacity: 20014161920 (18.64 GB)
Present Capacity: 6477033472 (6.03 GB)
DFS Remaining: 6474620928 (6.03 GB)
DFS Used: 2412544 (2.30 MB)
DFS Used%: 0.04%
Replicated Blocks:
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
Low redundancy blocks with highest priority to recover: 0
Pending deletion blocks: 0
Erasure Coded Block Groups:
Low redundancy block groups: 0
Block groups with corrupt internal blocks: 0
Missing block groups: 0
Low redundancy blocks with highest priority to recover: 0
Pending deletion blocks: 0

-------------------------------------------------
Live datanodes (2):

Name: 192.168.56.101:9866 (node1)
Hostname: node1
Decommission Status : Normal
Configured Capacity: 10007080960 (9.32 GB)
DFS Used: 811008 (792 KB)
Non DFS Used: 6239903744 (5.81 GB)
DFS Remaining: 3237834752 (3.02 GB)
DFS Used%: 0.01%
DFS Remaining%: 32.36%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Thu Oct 24 20:15:20 AEDT 2019
Last Block Report: Thu Oct 24 20:11:35 AEDT 2019
Num of Blocks: 0

Name: 192.168.56.103:9866 (node3)
Hostname: node3
Decommission Status : Normal
Configured Capacity: 10007080960 (9.32 GB)
DFS Used: 1601536 (1.53 MB)
Non DFS Used: 6240161792 (5.81 GB)
DFS Remaining: 3236786176 (3.01 GB)
DFS Used%: 0.02%
DFS Remaining%: 32.34%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Thu Oct 24 20:15:21 AEDT 2019
Last Block Report: Thu Oct 24 20:12:33 AEDT 2019
Num of Blocks: 3
hadoop@node-master:/usr/local/hadoop/data/nameNode$ hdfs fsck / -files -blocks
Connecting to namenode via http://node-master:9870/fsck?ugi=hadoop&files=1&blocks=1&path=%2F
FSCK started by hadoop (auth:SIMPLE) from /192.168.56.2 for path / at Thu Oct 24 20:16:39 AEDT 2019
/ <dir>
/user <dir>
/user/hadoop <dir>
/user/hadoop/books <dir>
/user/hadoop/books/alice.txt 173595 bytes, replicated: replication=1, 1 block(s): OK
0. BP-1068893594-192.168.56.2-1571893848809:blk_1073741825_1001 len=173595 Live_repl=1

/user/hadoop/books/frankenstein.txt 450783 bytes, replicated: replication=1, 1 block(s): OK
0. BP-1068893594-192.168.56.2-1571893848809:blk_1073741827_1003 len=450783 Live_repl=1

/user/hadoop/books/holmes.txt 607788 bytes, replicated: replication=1, 1 block(s): OK
0. BP-1068893594-192.168.56.2-1571893848809:blk_1073741826_1002 len=607788 Live_repl=1

Status: HEALTHY
Number of data-nodes: 2
Number of racks: 1
Total dirs: 4
Total symlinks: 0

Replicated Blocks:
Total size: 1232166 B
Total files: 3
Total blocks (validated): 3 (avg. block size 410722 B)
Minimally replicated blocks: 3 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 2
Average block replication: 1.0
Missing blocks: 0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)

Erasure Coded Block Groups:
Total size: 0 B
Total files: 0
Total block groups (validated): 0
Minimally erasure-coded block groups: 0
Over-erasure-coded block groups: 0
Under-erasure-coded block groups: 0
Unsatisfactory placement block groups: 0
Average block group size: 0.0
Missing block groups: 0
Corrupt block groups: 0
Missing internal blocks: 0
FSCK ended at Thu Oct 24 20:16:39 AEDT 2019 in 12 milliseconds

The filesystem under path '/' is HEALTHY
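The summary figures in the fsck report above can be reproduced from the per-file lines. The sketch below redoes that arithmetic; the file sizes and replication values are copied from the output, while the 128 MiB block size is an assumption (the usual `dfs.blocksize` default), since the report itself does not print it.

```python
# (size in bytes, replication) per file, copied from the fsck output above.
files = {
    "/user/hadoop/books/alice.txt": (173595, 1),
    "/user/hadoop/books/frankenstein.txt": (450783, 1),
    "/user/hadoop/books/holmes.txt": (607788, 1),
}

# Assumption: default dfs.blocksize of 128 MiB; not shown in the report.
BLOCK_SIZE = 128 * 1024 * 1024


def block_count(size, block_size=BLOCK_SIZE):
    """Number of blocks needed for a file of `size` bytes (ceiling division)."""
    return max(1, -(-size // block_size))


total_size = sum(size for size, _ in files.values())
total_blocks = sum(block_count(size) for size, _ in files.values())
total_replicas = sum(block_count(size) * repl for size, repl in files.values())

avg_block_size = total_size // total_blocks
avg_replication = total_replicas / total_blocks

print("Total size:", total_size, "B")
print("Total blocks:", total_blocks)
print("Avg. block size:", avg_block_size, "B")
print("Average block replication:", avg_replication)
```

Each of the three files is far smaller than one block, so each occupies exactly one block, and with `replication=1` per file the average block replication comes out at 1.0, below the reported default replication factor of 2.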
(Use with caution; normally run only once, when first setting up the cluster)
hadoop namenode -format
<Hadoop With Python>
/* The library is not very stable */
End.