The Big Data Learning Path: Hadoop

An Introduction to Hadoop

1. Overview

Hadoop is an open-source distributed computing platform for storing big data and processing it with MapReduce. Hadoop excels at storing huge volumes of data in any format, including unstructured data. It has two core components:

  • HDFS: the Hadoop Distributed File System, written in Java, with high fault tolerance and scalability
  • MapReduce: an open-source implementation of Google's MapReduce, a distributed programming model that makes it easier for users to develop parallel applications

With Hadoop you can easily organize computing resources to build your own distributed computing platform, making full use of the cluster's compute and storage capacity to process massive data sets.
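The division of labor in the MapReduce model can be illustrated with a tiny pure-Python sketch of a word count. This is just the programming model, not Hadoop's actual Java API: Hadoop runs the map and reduce phases across many nodes, with the shuffle handled by the framework.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every input document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["hello hadoop", "hello mapreduce"]
print(reduce_phase(shuffle(map_phase(docs))))
# {'hello': 2, 'hadoop': 1, 'mapreduce': 1}
```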

2. Strengths of Hadoop

  1. High reliability: Hadoop's bit-level approach to storing and processing data is highly reliable
  2. High scalability: Hadoop distributes data and computation across clusters that can scale out to thousands of nodes
  3. Efficiency: Hadoop moves data dynamically between nodes to keep them balanced, so processing is very fast
  4. High fault tolerance: Hadoop automatically keeps multiple replicas of data and automatically reassigns failed tasks

3. Related Projects

  • Common: common utilities that support Hadoop and its subprojects, mainly FileSystem, RPC, and the serialization libraries.
  • Avro: a data serialization system. It provides rich data structure types, a fast and compressible binary format, container files for persistent data, RPC, and simple dynamic-language integration.
  • MapReduce: a programming model for parallel computation over large data sets (larger than 1 TB).
  • HDFS: the distributed file system.
  • YARN: distributed resource management.
  • Chukwa: an open-source data collection system for monitoring and analyzing large distributed systems.
  • Hive: a data warehouse built on top of Hadoop that provides tools for data preparation, ad-hoc queries, and analytical storage over data sets in Hadoop files. Hive imposes structure on the data and offers a query language similar to the SQL of traditional RDBMSs, called HiveQL, so that users familiar with SQL can query data in Hadoop.
  • HBase: a distributed, column-oriented open-source database suited to unstructured data, mainly for big data that needs random access and real-time reads and writes.
  • Pig: a platform for analyzing and evaluating large data sets. Pig's most notable strength is that its structure stands up to a high degree of parallelization.
  • ZooKeeper: a coordination service designed for distributed applications, mainly providing synchronization, configuration management, grouping, and naming services.

4. Building Hadoop from Source

Since I'm on a 32-bit system and the official prebuilt release is 64-bit only, I had to compile from source.

According to BUILDING.txt, the following tools are required before building Hadoop:

Hadoop build instructions

Requirements:
Unix system
JDK 1.8
Maven 3.3 or later
ProtocolBuffers 2.5.0
CMake 3.1 or newer (if compiling native code)
Zlib devel (if compiling native code)
openssl devel (if compiling native hadoop-pipes and to get the best HDFS encryption performance)
Linux FUSE (Filesystem in Userspace) 2.6 or above (if compiling fuse_dfs)
Internet connection for the first build (to fetch all the Maven and Hadoop dependencies)
Python (for releasing the docs)
bats (for shell code testing)
Node.js / bower / Ember-cli (for building the frontend UI)
---------------------------------------------------------------------
The easiest way to get an environment with all of these tools is via the configuration provided by Docker.
This requires a recent, working version of Docker (1.4.1 or higher).

On Linux, install Docker and then run the following command:
$ ./start-build-env.sh
The prompt that appears next is inside an installed version of the source tree, with all the required test and build tools installed and configured.
Note that from within this Docker environment you can only access the Hadoop source tree you started from. So if you want to run
dev-support/bin/test-patch /path/to/my.patch
the patch file must be placed inside the Hadoop source tree.

On Ubuntu, purge and install the required packages:
Oracle JDK 1.8 (preferred)
  $ sudo apt-get purge openjdk*
  $ sudo apt-get install software-properties-common
  $ sudo add-apt-repository ppa:webupd8team/java
  $ sudo apt-get update
  $ sudo apt-get install oracle-java8-installer
Maven
  $ sudo apt-get -y install maven
Native build dependencies
  $ sudo apt-get -y install build-essential autoconf automake libtool cmake zlib1g-dev pkg-config libssl-dev
ProtocolBuffer 2.5.0 (required)
  $ sudo apt-get -y install protobuf-compiler
# 1. Download the source
wget https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.1.2/hadoop-3.1.2-src.tar.gz
# 2. Unpack
tar -zxvf hadoop-3.1.2-src.tar.gz
cd hadoop-3.1.2-src
# 3. Build with Maven
mvn package -Pdist,native -DskipTests -Dtar
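Before kicking off a build that can run for an hour, a quick sanity check of the toolchain can save a failed run. A minimal sketch; the tool names and minimum versions are the ones quoted from BUILDING.txt above, and since `--version` output formats vary between tools, the parsing is deliberately loose:

```python
import re
import shutil
import subprocess

def parse_version(text):
    """Extract the first x.y or x.y.z version number from tool output."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if not match:
        return None
    return tuple(int(part or 0) for part in match.groups())

def check_tool(name, min_version):
    """Return True if `name` is on PATH and reports at least min_version."""
    if shutil.which(name) is None:
        return False
    out = subprocess.run([name, "--version"], capture_output=True, text=True)
    version = parse_version(out.stdout or out.stderr)
    return version is not None and version >= min_version

# e.g. the minimums from the requirements list:
# check_tool("mvn", (3, 3, 0)); check_tool("cmake", (3, 1, 0))
# check_tool("protoc", (2, 5, 0))
```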

Building this thing took me three days on and off. Below is a record of the problems I ran into.

Problem 1:

The build fails during mvn package -Pdist,native -DskipTests -Dtar:

[ERROR] Failed to execute goal org.codehaus.mojo:native-maven-plugin:1.0-alpha-8:javah (default) on project hadoop-common: Error running javah command: Error executing command line. Exit code:2 -> [Help 1]

Solution:

Open hadoop-common-project/hadoop-common/pom.xml in vim and change the javah execution path to an absolute path:

<javahPath>${env.JAVA_HOME}/bin/javah</javahPath>
becomes
<javahPath>/usr/bin/javah</javahPath>
# use the actual path on your machine
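To find the right absolute path for your machine, you can apply the same logic the POM was attempting: prefer $JAVA_HOME/bin/javah and fall back to whatever is on the PATH. A small sketch of that lookup:

```python
import os
import shutil

def find_javah():
    """Prefer $JAVA_HOME/bin/javah, falling back to the PATH."""
    java_home = os.environ.get("JAVA_HOME")
    if java_home:
        candidate = os.path.join(java_home, "bin", "javah")
        if os.path.isfile(candidate):
            return candidate
    # May return None: javah only ships with JDK 8 and earlier.
    return shutil.which("javah")

print(find_javah())
```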

Problem 2:

The build fails during mvn package -Pdist,native -DskipTests -Dtar:

[ERROR] Failed to execute goal org.apache.hadoop:hadoop-maven-plugins:3.1.2:cmake-compile (cmake-compile) on project hadoop-common: CMake failed with error code 1 -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.hadoop:hadoop-maven-plugins:3.1.2:cmake-compile (cmake-compile) on project hadoop-common: CMake failed with error code 1

Solution:

The CMake version was wrong, so I installed CMake 3.0 from source:

# download
wget https://cmake.org/files/v3.0/cmake-3.0.0.tar.gz
tar -zxvf cmake-3.0.0.tar.gz
cd cmake-3.0.0
./configure
make
sudo apt-get install checkinstall
sudo checkinstall
sudo make install
# create symlinks (use absolute paths so the links don't dangle)
sudo ln -s $(pwd)/bin/* /usr/bin/

Still no luck. Running mvn package -Pdist,native -DskipTests -Dtar -e -X to print the full log turned up:

[INFO] Running cmake /home/wangjun/software/hadoop-3.1.2-src/hadoop-common-project/hadoop-common/src -DGENERATED_JAVAH=/home/wangjun/software/hadoop-3.1.2-src/hadoop-common-project/hadoop-common/target/native/javah -DJVM_ARCH_DATA_MODEL=32 -DREQUIRE_BZIP2=false -DREQUIRE_ISAL=false -DREQUIRE_OPENSSL=false -DREQUIRE_SNAPPY=false -DREQUIRE_ZSTD=false -G Unix Makefiles
[INFO] with extra environment variables {}
[WARNING] Soft-float JVM detected
[WARNING] CMake Error at /home/wangjun/software/hadoop-3.1.2-src/hadoop-common-project/hadoop-common/HadoopCommon.cmake:182 (message):
[WARNING]   Soft-float dev libraries required (e.g.  'apt-get install libc6-dev-armel'
[WARNING]   on Debian/Ubuntu)
[WARNING] Call Stack (most recent call first):
[WARNING]   CMakeLists.txt:26 (include)
[WARNING] 
[WARNING] 
[WARNING] -- Configuring incomplete, errors occurred!
[WARNING] See also "/home/wangjun/software/hadoop-3.1.2-src/hadoop-common-project/hadoop-common/target/native/CMakeFiles/CMakeOutput.log".
[WARNING] See also "/home/wangjun/software/hadoop-3.1.2-src/hadoop-common-project/hadoop-common/target/native/CMakeFiles/CMakeError.log".

The hadoop-common-project/hadoop-common/target/native/CMakeFiles/CMakeError.log file showed the error:

gnu/stubs-soft.h: No such file or directory

Solution: edit hadoop-common-project/hadoop-common/HadoopCommon.cmake and change both occurrences of -mfloat-abi=softfp to -mfloat-abi=hard. See https://blog.csdn.net/wuyusheng314/article/details/79428996 and https://stackoverflow.com/questions/49139125/fatal-error-gnu-stubs-soft-h-no-such-file-or-directory. (It's best to re-extract the original tarball, make the change, and rebuild from scratch; otherwise things may break.)
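Both substitutions can be applied in one go with a short script; a sketch, assuming the file fits in memory (HadoopCommon.cmake easily does):

```python
from pathlib import Path

def patch_float_abi(cmake_file):
    """Replace every -mfloat-abi=softfp with -mfloat-abi=hard, in place."""
    path = Path(cmake_file)
    text = path.read_text()
    path.write_text(text.replace("-mfloat-abi=softfp", "-mfloat-abi=hard"))
    # Return the number of sites patched, as a sanity check (expect 2).
    return text.count("-mfloat-abi=softfp")
```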

With that fixed, a new problem appeared: building Apache Hadoop MapReduce NativeTask fails with

[WARNING] /usr/bin/ranlib libgtest.a
[WARNING] make[2]: Leaving directory '/home/wangjun/software/hadoop-3.1.2-src/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask/target/native'
[WARNING] /usr/local/bin/cmake -E cmake_progress_report /home/wangjun/software/hadoop-3.1.2-src/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask/target/native/CMakeFiles  1
[WARNING] [  7%] Built target gtest
[WARNING] make[1]: Leaving directory '/home/wangjun/software/hadoop-3.1.2-src/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask/target/native'
[WARNING] /tmp/ccpXG9td.s: Assembler messages:
[WARNING] /tmp/ccpXG9td.s:2040: Error: bad instruction `bswap r5'
[WARNING] /tmp/ccpXG9td.s:2063: Error: bad instruction `bswap r1'
[WARNING] make[2]: *** [CMakeFiles/nativetask.dir/build.make:79: CMakeFiles/nativetask.dir/main/native/src/codec/BlockCodec.cc.o] Error 1
[WARNING] make[2]: *** Waiting for unfinished jobs....
[WARNING] make[1]: *** [CMakeFiles/Makefile2:96: CMakeFiles/nativetask.dir/all] Error 2
[WARNING] make[1]: *** Waiting for unfinished jobs....
[WARNING] /tmp/ccBbS5rL.s: Assembler messages:
[WARNING] /tmp/ccBbS5rL.s:1959: Error: bad instruction `bswap r5'
[WARNING] /tmp/ccBbS5rL.s:1982: Error: bad instruction `bswap r1'
[WARNING] make[2]: *** [CMakeFiles/nativetask_static.dir/build.make:79: CMakeFiles/nativetask_static.dir/main/native/src/codec/BlockCodec.cc.o] Error 1
[WARNING] make[2]: *** Waiting for unfinished jobs....
[WARNING] /tmp/cc6DHbGO.s: Assembler messages:
[WARNING] /tmp/cc6DHbGO.s:979: Error: bad instruction `bswap r2'
[WARNING] /tmp/cc6DHbGO.s:1003: Error: bad instruction `bswap r3'
[WARNING] make[2]: *** [CMakeFiles/nativetask_static.dir/build.make:125: CMakeFiles/nativetask_static.dir/main/native/src/codec/Lz4Codec.cc.o] Error 1
[WARNING] make[1]: *** [CMakeFiles/Makefile2:131: CMakeFiles/nativetask_static.dir/all] Error 2
[WARNING] make: *** [Makefile:77: all] Error 2

The errors point to an instruction problem. After some googling, I found the fix: https://issues.apache.org/jira/browse/HADOOP-14922 and https://issues.apache.org/jira/browse/HADOOP-11505

Edit the primitives.h file following the git log in https://issues.apache.org/jira/secure/attachment/12693989/HADOOP-11505.001.patch, then rebuild.
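For context: `bswap` is an x86 instruction that reverses the byte order of a register, which the ARM assembler rightly rejects; the HADOOP-11505 patch replaces it with portable code. What the instruction computes, sketched in Python:

```python
def bswap32(value):
    """Reverse the byte order of a 32-bit integer, like x86's bswap."""
    return int.from_bytes(value.to_bytes(4, "little"), "big")

print(hex(bswap32(0x12345678)))  # 0x78563412
```

Byte swapping like this is how NativeTask converts between the platform's native integer layout and a fixed on-disk byte order.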

After three days of torment, it finally succeeded! Behold the output of a successful build:

[INFO] No site descriptor found: nothing to attach.
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Apache Hadoop Main 3.1.2:
[INFO] 
[INFO] Apache Hadoop Main ................................. SUCCESS [  3.532 s]
[INFO] Apache Hadoop Build Tools .......................... SUCCESS [  6.274 s]
[INFO] Apache Hadoop Project POM .......................... SUCCESS [  3.668 s]
[INFO] Apache Hadoop Annotations .......................... SUCCESS [  5.743 s]
[INFO] Apache Hadoop Assemblies ........................... SUCCESS [  1.739 s]
[INFO] Apache Hadoop Project Dist POM ..................... SUCCESS [  4.782 s]
[INFO] Apache Hadoop Maven Plugins ........................ SUCCESS [ 10.777 s]
[INFO] Apache Hadoop MiniKDC .............................. SUCCESS [  5.156 s]
[INFO] Apache Hadoop Auth ................................. SUCCESS [ 18.468 s]
[INFO] Apache Hadoop Auth Examples ........................ SUCCESS [  8.293 s]
[INFO] Apache Hadoop Common ............................... SUCCESS [03:15 min]
[INFO] Apache Hadoop NFS .................................. SUCCESS [ 14.700 s]
[INFO] Apache Hadoop KMS .................................. SUCCESS [ 15.340 s]
[INFO] Apache Hadoop Common Project ....................... SUCCESS [  0.876 s]
[INFO] Apache Hadoop HDFS Client .......................... SUCCESS [ 46.540 s]
[INFO] Apache Hadoop HDFS ................................. SUCCESS [02:34 min]
[INFO] Apache Hadoop HDFS Native Client ................... SUCCESS [ 12.125 s]
[INFO] Apache Hadoop HttpFS ............................... SUCCESS [ 20.005 s]
[INFO] Apache Hadoop HDFS-NFS ............................. SUCCESS [  8.934 s]
[INFO] Apache Hadoop HDFS-RBF ............................. SUCCESS [01:08 min]
[INFO] Apache Hadoop HDFS Project ......................... SUCCESS [  0.892 s]
[INFO] Apache Hadoop YARN ................................. SUCCESS [  0.879 s]
[INFO] Apache Hadoop YARN API ............................. SUCCESS [ 25.531 s]
[INFO] Apache Hadoop YARN Common .......................... SUCCESS [01:57 min]
[INFO] Apache Hadoop YARN Registry ........................ SUCCESS [ 14.521 s]
[INFO] Apache Hadoop YARN Server .......................... SUCCESS [  0.920 s]
[INFO] Apache Hadoop YARN Server Common ................... SUCCESS [ 23.432 s]
[INFO] Apache Hadoop YARN NodeManager ..................... SUCCESS [ 28.782 s]
[INFO] Apache Hadoop YARN Web Proxy ....................... SUCCESS [  9.515 s]
[INFO] Apache Hadoop YARN ApplicationHistoryService ....... SUCCESS [ 14.077 s]
[INFO] Apache Hadoop YARN Timeline Service ................ SUCCESS [ 12.728 s]
[INFO] Apache Hadoop YARN ResourceManager ................. SUCCESS [ 51.338 s]
[INFO] Apache Hadoop YARN Server Tests .................... SUCCESS [  8.675 s]
[INFO] Apache Hadoop YARN Client .......................... SUCCESS [ 13.937 s]
[INFO] Apache Hadoop YARN SharedCacheManager .............. SUCCESS [ 10.853 s]
[INFO] Apache Hadoop YARN Timeline Plugin Storage ......... SUCCESS [ 12.546 s]
[INFO] Apache Hadoop YARN TimelineService HBase Backend ... SUCCESS [  1.069 s]
[INFO] Apache Hadoop YARN TimelineService HBase Common .... SUCCESS [ 17.176 s]
[INFO] Apache Hadoop YARN TimelineService HBase Client .... SUCCESS [ 15.662 s]
[INFO] Apache Hadoop YARN TimelineService HBase Servers ... SUCCESS [  0.901 s]
[INFO] Apache Hadoop YARN TimelineService HBase Server 1.2  SUCCESS [ 17.512 s]
[INFO] Apache Hadoop YARN TimelineService HBase tests ..... SUCCESS [ 17.327 s]
[INFO] Apache Hadoop YARN Router .......................... SUCCESS [ 14.430 s]
[INFO] Apache Hadoop YARN Applications .................... SUCCESS [  1.990 s]
[INFO] Apache Hadoop YARN DistributedShell ................ SUCCESS [ 10.400 s]
[INFO] Apache Hadoop YARN Unmanaged Am Launcher ........... SUCCESS [  7.210 s]
[INFO] Apache Hadoop MapReduce Client ..................... SUCCESS [  2.549 s]
[INFO] Apache Hadoop MapReduce Core ....................... SUCCESS [ 38.022 s]
[INFO] Apache Hadoop MapReduce Common ..................... SUCCESS [ 35.908 s]
[INFO] Apache Hadoop MapReduce Shuffle .................... SUCCESS [ 15.180 s]
[INFO] Apache Hadoop MapReduce App ........................ SUCCESS [ 18.915 s]
[INFO] Apache Hadoop MapReduce HistoryServer .............. SUCCESS [ 15.852 s]
[INFO] Apache Hadoop MapReduce JobClient .................. SUCCESS [ 12.987 s]
[INFO] Apache Hadoop Mini-Cluster ......................... SUCCESS [ 12.106 s]
[INFO] Apache Hadoop YARN Services ........................ SUCCESS [  1.812 s]
[INFO] Apache Hadoop YARN Services Core ................... SUCCESS [  8.685 s]
[INFO] Apache Hadoop YARN Services API .................... SUCCESS [  9.236 s]
[INFO] Apache Hadoop YARN Site ............................ SUCCESS [  0.859 s]
[INFO] Apache Hadoop YARN UI .............................. SUCCESS [  0.840 s]
[INFO] Apache Hadoop YARN Project ......................... SUCCESS [ 34.971 s]
[INFO] Apache Hadoop MapReduce HistoryServer Plugins ...... SUCCESS [  7.376 s]
[INFO] Apache Hadoop MapReduce NativeTask ................. SUCCESS [02:07 min]
[INFO] Apache Hadoop MapReduce Uploader ................... SUCCESS [  9.915 s]
[INFO] Apache Hadoop MapReduce Examples ................... SUCCESS [ 14.651 s]
[INFO] Apache Hadoop MapReduce ............................ SUCCESS [ 15.959 s]
[INFO] Apache Hadoop MapReduce Streaming .................. SUCCESS [ 11.747 s]
[INFO] Apache Hadoop Distributed Copy ..................... SUCCESS [ 16.314 s]
[INFO] Apache Hadoop Archives ............................. SUCCESS [  7.115 s]
[INFO] Apache Hadoop Archive Logs ......................... SUCCESS [  8.686 s]
[INFO] Apache Hadoop Rumen ................................ SUCCESS [ 12.413 s]
[INFO] Apache Hadoop Gridmix .............................. SUCCESS [ 10.490 s]
[INFO] Apache Hadoop Data Join ............................ SUCCESS [  7.894 s]
[INFO] Apache Hadoop Extras ............................... SUCCESS [  7.098 s]
[INFO] Apache Hadoop Pipes ................................ SUCCESS [ 19.457 s]
[INFO] Apache Hadoop OpenStack support .................... SUCCESS [ 12.452 s]
[INFO] Apache Hadoop Amazon Web Services support .......... SUCCESS [04:55 min]
[INFO] Apache Hadoop Kafka Library support ................ SUCCESS [ 36.248 s]
[INFO] Apache Hadoop Azure support ........................ SUCCESS [ 43.752 s]
[INFO] Apache Hadoop Aliyun OSS support ................... SUCCESS [ 34.905 s]
[INFO] Apache Hadoop Client Aggregator .................... SUCCESS [ 17.099 s]
[INFO] Apache Hadoop Scheduler Load Simulator ............. SUCCESS [ 18.819 s]
[INFO] Apache Hadoop Resource Estimator Service ........... SUCCESS [ 29.363 s]
[INFO] Apache Hadoop Azure Data Lake support .............. SUCCESS [ 30.145 s]
[INFO] Apache Hadoop Image Generation Tool ................ SUCCESS [  8.970 s]
[INFO] Apache Hadoop Tools Dist ........................... SUCCESS [ 46.265 s]
[INFO] Apache Hadoop Tools ................................ SUCCESS [  0.883 s]
[INFO] Apache Hadoop Client API ........................... SUCCESS [08:41 min]
[INFO] Apache Hadoop Client Runtime ....................... SUCCESS [06:39 min]
[INFO] Apache Hadoop Client Packaging Invariants .......... SUCCESS [  4.040 s]
[INFO] Apache Hadoop Client Test Minicluster .............. SUCCESS [13:29 min]
[INFO] Apache Hadoop Client Packaging Invariants for Test . SUCCESS [  1.937 s]
[INFO] Apache Hadoop Client Packaging Integration Tests ... SUCCESS [  1.865 s]
[INFO] Apache Hadoop Distribution ......................... SUCCESS [01:56 min]
[INFO] Apache Hadoop Client Modules ....................... SUCCESS [  5.050 s]
[INFO] Apache Hadoop Cloud Storage ........................ SUCCESS [  6.457 s]
[INFO] Apache Hadoop Cloud Storage Project ................ SUCCESS [  0.829 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  01:06 h
[INFO] Finished at: 2019-09-03T14:14:45+08:00
[INFO] ------------------------------------------------------------------------

The build output lands in hadoop-dist. Just look at how many versions I went through to get this thing to compile:

cmake-3.0.0         
cmake-3.3.0       
hadoop-2.7.7-src.tar.gz  
hadoop-2.9.2-src         
hadoop-3.1.2-src                   
protobuf-2.5.0    
cmake-3.1.0         
hadoop-2.8.5-src     
hadoop-2.9.2-src.tar.gz  
hadoop-3.1.2-src.tar.gz  
cmake-3.1.0.tar.gz  
hadoop-2.7.7-src  
hadoop-2.8.5-src.tar.gz  
hadoop-3.1.2             
hadoop-3.1.2.tar.gz

5. Starting Hadoop

Copy hadoop-3.1.2.tar.gz from hadoop-dist/target to wherever you want to install it, and unpack it.

# Enter the bin directory; format HDFS before the first start
cd hadoop-3.1.2/bin
./hdfs namenode -format
......
......
2019-09-03 14:35:53,356 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at raspberrypi/127.0.1.1
************************************************************/
# Start all services
cd ../sbin/
./start-all.sh
WARNING: Attempting to start all Apache Hadoop daemons as wangjun in 10 seconds.
WARNING: This is not a recommended production deployment configuration.
WARNING: Use CTRL-C to abort.
Starting namenodes on [raspberrypi]
Starting datanodes
Starting secondary namenodes [raspberrypi]
Starting resourcemanager
Starting nodemanagers

Visit port 8088 at http://localhost:8088 and you'll see Hadoop's management UI!

Hadoop's web UIs:

# All Applications
http://localhost:8088
# DataNode Information
http://localhost:9864
# Namenode Information
http://localhost:9870
# NodeManager
http://localhost:8042
# SecondaryNamenode information
http://localhost:9868
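Beyond the browser, the ResourceManager also exposes a JSON REST API on port 8088, which is handy for scripted health checks. A sketch that reads the cluster state from /ws/v1/cluster/info; the sample payload below is abbreviated and illustrative:

```python
import json
from urllib.request import urlopen

def cluster_state(rm_url="http://localhost:8088"):
    """Ask the ResourceManager REST API for the cluster's state."""
    with urlopen(rm_url + "/ws/v1/cluster/info") as resp:
        return parse_state(resp.read().decode())

def parse_state(payload):
    """Pull clusterInfo.state out of a /ws/v1/cluster/info response."""
    return json.loads(payload)["clusterInfo"]["state"]

# An abbreviated sample response, for illustration:
sample = '{"clusterInfo": {"id": 1567490000000, "state": "STARTED"}}'
print(parse_state(sample))  # STARTED
```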

Problem 1: an error at startup:

$ ./start-all.sh 
WARNING: Attempting to start all Apache Hadoop daemons as wangjun in 10 seconds.
WARNING: This is not a recommended production deployment configuration.
WARNING: Use CTRL-C to abort.
Starting namenodes on [raspberrypi]
raspberrypi: ERROR: JAVA_HOME is not set and could not be found.
Starting datanodes
localhost: ERROR: JAVA_HOME is not set and could not be found.
Starting secondary namenodes [raspberrypi]
raspberrypi: ERROR: JAVA_HOME is not set and could not be found.
Starting resourcemanager
Starting nodemanagers
localhost: ERROR: JAVA_HOME is not set and could not be found.

Solution:

vim ./etc/hadoop/hadoop-env.sh
# export JAVA_HOME=
改成具體的java安裝路徑,好比
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-armhf
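If you're unsure what path to put there, it can usually be derived from the `java` binary on the PATH by resolving its symlinks. A sketch (this assumes the standard layout where `java` lives in `$JAVA_HOME/bin`):

```python
import os
import shutil

def guess_java_home():
    """Derive JAVA_HOME from the `java` binary on PATH, resolving symlinks."""
    java = shutil.which("java")
    if java is None:
        return None
    real = os.path.realpath(java)  # e.g. .../java-8-openjdk-armhf/bin/java
    return os.path.dirname(os.path.dirname(real))

# Print the line to paste into hadoop-env.sh:
home = guess_java_home()
if home:
    print(f"export JAVA_HOME={home}")
```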