Verifying how hadoop streaming -archives unpacks jar, zip, and tar.gz files

Date: 2019-11-19

1. What -archives does:

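-archives is one of Hadoop's DistributedCache mechanisms (for the others, see the references at the end of this post). It distributes the specified files to the working directory of every task and automatically unpacks files whose names end in ".jar", ".zip", ".tar.gz", or ".tgz". By default, the unpacked contents are placed in a directory under the working directory that has the same name as the archive itself; for example, an archive named dict.zip is unpacked into a directory named dict.zip. To avoid this, you can give the file an alias (symlink), such as dict.zip#dict, so that the archive is unpacked into a directory named dict.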
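
The naming rule above can be sketched in a few lines of shell. This is only an illustration of the rule as described here, not Hadoop's own code; the function name archive_dir_name and the path /user/work/cachefile/dict.zip are made up for the example.

```shell
# Illustration only: mimics the directory-naming rule described above.
# archive_dir_name is a hypothetical helper, not part of Hadoop.
archive_dir_name() {
  spec=$1
  case $spec in
    *"#"*)
      # An alias was given: use the text after the last '#'.
      name=${spec##*\#}
      printf '%s\n' "$name"
      ;;
    *)
      # No alias: the directory is named after the archive file itself.
      basename "$spec"
      ;;
  esac
}

default_name=$(archive_dir_name "/user/work/cachefile/dict.zip")
alias_name=$(archive_dir_name "/user/work/cachefile/dict.zip#dict")
echo "without alias: $default_name"
echo "with alias:    $alias_name"
```

Without the alias, a mapper would have to refer to paths such as dict.zip/some_file, which is why the #dict form is usually more convenient.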

2. Testing a jar file (taken almost verbatim from the reference documentation)

$ ls test_jar/
file  file1    file2 
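file = this is file1 (I made a mistake here during the experiment: this file should have been named file1. It does not affect the result, so I left it as is.)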
file2 = this is file2
$ jar cvf cache.jar -C test_jar/ .
$ hdfs dfs -put cache.jar /user/work/cachefile
# create an input.txt file, then put it to /user/work/cachefile
$ hdfs dfs -cat /user/work/cachefile/input.txt
cache/file   (cache is the name of the unpacked directory, an alias defined with #; see the command below)
cache/file2

HADOOP_HOME=/home/hadoop/hadoop-2.3.0-cdh5.1.3
$HADOOP_HOME/bin/hadoop fs -rmr /cacheout/

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.3.0-cdh5.1.3.jar \
 -archives  /user/work/cachefile/cache.jar#cache \
 -Dmapred.map.tasks=1 \
 -Dmapred.reduce.tasks=1 \
 -Dmapred.job.name="Experiment" \
 -input "cachefile/input2.txt"  \
 -output "/cacheout/" \
 -mapper "xargs cat" \
 -reducer "cat"
 
hadoop fs -cat /cacheout/*
this is file 2
this is file1
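
To see why the "xargs cat" mapper prints the file contents, the map step can be imitated locally without a cluster. This is a minimal sketch under assumptions: a temporary directory stands in for the task's working directory, and the cache directory is created by hand instead of being unpacked by Hadoop.

```shell
set -e
# Scratch directory standing in for the task's working directory (assumption).
work=$(mktemp -d)
cd "$work"

# Stand-in for the unpacked archive that -archives would expose as ./cache
mkdir cache
echo "this is file1" > cache/file
echo "this is file2" > cache/file2

# Each input line is a path relative to the working directory.
printf 'cache/file\ncache/file2\n' > input.txt

# Streaming feeds the input lines to the mapper on stdin;
# xargs turns them into arguments for cat.
mapper_out=$(xargs cat < input.txt)
echo "$mapper_out"
```

The mapper only works if the paths in the input file match the layout inside the unpacked directory, which is what the next section runs into.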

3. Testing zip & tar.gz

Build a zip archive and a tar.gz archive from the same files, put them to HDFS, and repeat the test.

-archives  /user/work/cachefile/cache.tar.gz#cache \    (changing only the file extension produces a file-not-found error)

Debugging: to confirm whether the archive is actually unpacked, change the mapper to:

-mapper "ls cache" \

Findings: for the jar file, the listing has 4 entries: META-INF, file, file1, and file2.

For zip & tar.gz, there is only one entry: the directory name test_jar.

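Inspecting the three archives afterwards shows that unpacking clearly succeeded in every case. The file-not-found error comes from the directory layout: the jar was built with -C test_jar/ ., so its files sit at the archive root, while the zip and tar.gz archives contain the test_jar directory itself, so the files end up under cache/test_jar/ instead of cache/. The details depend on how each of the three packaging tools is invoked, which I will not go into here.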
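
The layout difference can be reproduced locally with tar alone. The original packaging commands for the zip and tar.gz archives are not shown in this post, so this sketch assumes the directory was archived from its parent; the archive names nested.tar.gz and flat.tar.gz are made up for the example, and the contents of file1 are left empty because they are not given above.

```shell
set -e
work=$(mktemp -d)          # scratch directory (assumption)
cd "$work"

mkdir test_jar
echo "this is file1" > test_jar/file
: > test_jar/file1         # contents not shown in the post, left empty
echo "this is file2" > test_jar/file2

# Archive the directory itself: every entry is prefixed with test_jar/
tar -czf nested.tar.gz test_jar
# Archive only the contents, like "jar cvf cache.jar -C test_jar/ ."
tar -czf flat.tar.gz -C test_jar .

# Unpack each archive into its own directory, as -archives would into ./cache
mkdir nested flat
tar -xzf nested.tar.gz -C nested
tar -xzf flat.tar.gz -C flat

# Top-level entries, equivalent to running the "ls cache" mapper
nested_top=$(ls nested | xargs)
flat_top=$(ls flat | xargs)
echo "nested: $nested_top"
echo "flat:   $flat_top"
```

With the nested layout, the input file would need paths such as cache/test_jar/file; with the flat layout, the original cache/file paths work unchanged.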

Summary: -archives is a very useful option, but pay particular attention to the directory layout inside the archive when using it.

References:

http://blog.javachen.com/2015/02/12/hadoop-streaming.html

http://hadoop.apache.org/docs/r2.6.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopStreaming.html#Working_with_Large_Files_and_Archives

http://dongxicheng.org/mapreduce-nextgen/hadoop-distributedcache-details/
