Hadoop Streaming (Python Edition)


Feedback

  • If anything is unclear while deploying or using this, feel free to contact me
    • WeChat: Leo-sunhailin
    • QQ: 379978424


Environment Setup

  • OS: Windows 10 (64-bit) <-- must be 64-bit
    • Linux is not covered for now, since deployment there is simpler than on Windows
  • Java version: Java 1.8.0_144
  • Hadoop version: Apache Hadoop 2.7.4
    • (the tutorial below uses 2.7.4 as its example)
  • Python version: Python 3.4.4

Downloads

  • Hadoop download:
    • Prefer a domestic mirror; historical versions can only be downloaded from the official source.
    • 1. Aliyun mirror: link
    • 2. Tsinghua mirror: link
    • 3. Official: link
  • Winutils (Linux users can skip this):
    • 1. The winutils version must match your Hadoop version
    • 2. If you skip this download you will have to find a winutils build compiled by someone else online (which tends to be buggy)
  • Find the Java and Python downloads yourself
    • JDK 1.8
    • Python 3.4.4

Deployment and Testing

  • Setting up the Python environment is left to you and not covered here
  • Step 0 (Java environment variables)
    • The most important issue of all!
      • On Windows, the Java install path must not contain spaces.
      • After installing Java, copy the install directory to a path with no spaces at all.
      • Point the Java environment variables at that path
  • Step 1 (Hadoop setup)
    • Note: the deployment below is single-machine, single-node
    • 1.1. Unpack Hadoop into a directory of your choosing
      • I created a directory on my D drive: D:/bigdata/
      • After unpacking, the hadoop path is D:/bigdata/hadoop-2.7.4
    • 1.2. Go into etc\hadoop\ under the hadoop root directory
    • 1.3. Edit core-site.xml, hdfs-site.xml, yarn-site.xml and mapred-site.xml (mapred-site originally carries a .template suffix; rename it to drop the suffix)
      • core-site.xml:
      <configuration>
          <!-- namenode address -->
          <property>
              <name>fs.defaultFS</name>
              <value>hdfs://localhost:9000</value>
              <description>HDFS URI: filesystem://namenode-host:port</description>
          </property>
          
          <!-- directory for files generated while hadoop runs -->
          <property>
              <name>hadoop.tmp.dir</name>
              <value>/D:/bigdata/hadoop-2.7.4/workplace/tmp</value>
              <description>local hadoop temp directory on the namenode</description>
          </property>
      </configuration>
      • hdfs-site.xml:
      <configuration>
          <!-- number of replicas HDFS keeps for each block -->
          <property>
              <name>dfs.replication</name>
              <value>1</value>
              <description>replica count; the default is 3 and it should not exceed the number of datanodes</description>
          </property>
          
          <property>
              <name>dfs.name.dir</name>
              <value>/D:/bigdata/hadoop-2.7.4/workplace/name</value>
              <description>where the namenode stores HDFS namespace metadata</description>
          </property>
          
          <property>
              <name>dfs.data.dir</name>
              <value>/D:/bigdata/hadoop-2.7.4/workplace/data</value>
              <description>physical storage location of data blocks on the datanode</description>
          </property>
          
          <property>
              <name>dfs.webhdfs.enabled</name>
              <value>true</value>
              <description>WebHDFS interface</description>
          </property>
          
          <property>
              <name>dfs.permissions</name>
              <value>false</value>
              <description>HDFS permission checking</description>
          </property>
      </configuration>
      • yarn-site.xml:
      <configuration>
          <!-- the NodeManager serves map output to reducers via shuffle -->
          <property>
              <name>yarn.nodemanager.aux-services</name>
              <value>mapreduce_shuffle</value>
          </property>
          
          <!-- My machine has 8 GB of RAM; around 6 GB is typical, but for convenience I used the full 8 GB -->
          <property>
              <name>yarn.nodemanager.resource.memory-mb</name>
              <value>8192</value>
          </property>
          
          <!-- minimum memory allocated to an MR task container -->
          <property>
              <name>yarn.scheduler.minimum-allocation-mb</name>
              <value>1536</value>
          </property>
          
          <!-- maximum memory allocated to an MR task container -->
          <property>
              <name>yarn.scheduler.maximum-allocation-mb</name>
              <value>4096</value>
          </property>
          
          <!-- number of CPU vcores available for MR tasks -->
          <property>
              <name>yarn.nodemanager.resource.cpu-vcores</name>
              <value>2</value>
          </property>
          
          <property>
              <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
              <value>org.apache.hadoop.mapred.ShuffleHandler</value>
          </property>
      </configuration>
      • mapred-site.xml:
      <configuration>
          <!-- run MapReduce on YARN from now on -->
          <property>
              <name>mapreduce.framework.name</name>
              <value>yarn</value>
          </property>
          <property>
              <name>mapreduce.map.memory.mb</name>
              <value>2048</value>
          </property>
          <property>
              <name>mapreduce.reduce.memory.mb</name>
              <value>2048</value>
          </property>
          <property>
              <name>mapreduce.jobtracker.http.address</name>
              <value>localhost:50030</value>
          </property>
          <property>
              <name>mapreduce.jobhistory.address</name>
              <value>localhost:10020</value>
          </property>
          <property>
              <name>mapreduce.jobhistory.webapp.address</name>
              <value>localhost:19888</value>
          </property>
          <property>
              <name>mapred.job.tracker</name>
              <value>http://localhost:9001</value>
          </property>
      </configuration>
    • 1.4. Back in the hadoop root directory, create the folders specified in hdfs-site and core-site
    • 1.5. Configure environment variables:
      • 1.5.1. Add a system variable HADOOP_CONF_DIR
      D:\bigdata\hadoop-2.7.4\etc\hadoop\
      • 1.5.2. Add a system variable HADOOP_HOME
      D:\bigdata\hadoop-2.7.4
      • 1.5.3. Append to Path:
      D:\bigdata\hadoop-2.7.4\bin
    • 1.6. Initialize (format) the namenode
      hadoop namenode -format
    • 1.7. (once step 1.6 finishes without errors) Start hadoop
      cd /d D:\bigdata\hadoop-2.7.4\sbin
      dir
      
      ## recommended startup sequence
      start-dfs.cmd
      start-yarn.cmd
      
      ## the blunt way
      start-all.cmd
    • 1.8. (once all the steps above succeed) Open a new cmd window
      jps
      
      # check that each component started successfully and holds a process ID
    • 1.9. Test these links (move on once both load correctly): http://localhost:50070 http://localhost:8088
    • 2.0. Check that the hadoop mapreduce examples run:
      # see which example programs are available
      hadoop jar D:\bigdata\hadoop-2.7.4\share\hadoop\mapreduce\hadoop-mapreduce-examples-2.7.4.jar -info
      
      # use the most common pi test (3 map tasks, 100 samples each; the product of the two is the total sample count)
      hadoop jar D:\bigdata\hadoop-2.7.4\share\hadoop\mapreduce\hadoop-mapreduce-examples-2.7.4.jar pi 3 100
    • 2.1. Once that runs successfully, start writing the Python mapper and reducer
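What the pi example computes can be sketched locally: each map task throws sample points at the unit square and the job estimates π from the fraction that lands inside the quarter circle. (The real example uses a quasi-Monte Carlo Halton sequence rather than plain random sampling; the function below is an illustrative sketch, not the jar's actual code.)

```python
import random

def estimate_pi(num_maps, samples_per_map, seed=42):
    """Local sketch of the examples-jar pi job: sample points in the
    unit square and count those inside the quarter circle."""
    random.seed(seed)
    inside = 0
    total = num_maps * samples_per_map  # 3 * 100 = 300 samples for "pi 3 100"
    for _ in range(total):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / total

# rough estimate of pi; accuracy improves with more samples
print(estimate_pi(3, 100))
```

More maps or more samples per map tighten the estimate, which is why the example is a convenient smoke test: it exercises the whole map/reduce path with trivially parallel work.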

Code Example

  • Usage text from the official Hadoop Streaming jar:
Usage: $HADOOP_PREFIX/bin/hadoop jar hadoop-streaming.jar [options]
Options:
  -input          <path> DFS input file(s) for the Map step.
  -output         <path> DFS output directory for the Reduce step.
  -mapper         <cmd|JavaClassName> Optional. Command to be run as mapper.
  -combiner       <cmd|JavaClassName> Optional. Command to be run as combiner.
  -reducer        <cmd|JavaClassName> Optional. Command to be run as reducer.
  -file           <file> Optional. File/dir to be shipped in the Job jar file.
                  Deprecated. Use generic option "-files" instead.
  -inputformat    <TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName>
                  Optional. The input format class.
  -outputformat   <TextOutputFormat(default)|JavaClassName>
                  Optional. The output format class.
  -partitioner    <JavaClassName>  Optional. The partitioner class.
  -numReduceTasks <num> Optional. Number of reduce tasks.
  -inputreader    <spec> Optional. Input recordreader spec.
  -cmdenv         <n>=<v> Optional. Pass env.var to streaming commands.
  -mapdebug       <cmd> Optional. To run this script when a map task fails.
  -reducedebug    <cmd> Optional. To run this script when a reduce task fails.
  -io             <identifier> Optional. Format to use for input to and output
                  from mapper/reducer commands
  -lazyOutput     Optional. Lazily create Output.
  -background     Optional. Submit the job and don't wait till it completes.
  -verbose        Optional. Print verbose output.
  -info           Optional. Print detailed usage.
  -help           Optional. Print help message.

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|resourcemanager:port>    specify a ResourceManager
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]


Usage tips:
In -input: globbing on <path> is supported and can have multiple -input

Default Map input format: a line is a record in UTF-8 the key part ends at first
  TAB, the rest of the line is the value

To pass a Custom input format:
  -inputformat package.MyInputFormat

Similarly, to pass a custom output format:
  -outputformat package.MyOutputFormat

The files with extensions .class and .jar/.zip, specified for the -file
  argument[s], end up in "classes" and "lib" directories respectively inside
  the working directory when the mapper and reducer are run. All other files
  specified for the -file argument[s] end up in the working directory when the
  mapper and reducer are run. The location of this working directory is
  unspecified.

To set the number of reduce tasks (num. of output files) as, say 10:
  Use -numReduceTasks 10
To skip the sort/combine/shuffle/sort/reduce step:
  Use -numReduceTasks 0
  Map output then becomes a 'side-effect output' rather than a reduce input.
  This speeds up processing. This also feels more like "in-place" processing
  because the input filename and the map input order are preserved.
  This is equivalent to -reducer NONE

To speed up the last maps:
  -D mapreduce.map.speculative=true
To speed up the last reduces:
  -D mapreduce.reduce.speculative=true
To name the job (appears in the JobTracker Web UI):
  -D mapreduce.job.name='My Job'
To change the local temp directory:
  -D dfs.data.dir=/tmp/dfs
  -D stream.tmpdir=/tmp/streaming
Additional local temp directories with -jt local:
  -D mapreduce.cluster.local.dir=/tmp/local
  -D mapreduce.jobtracker.system.dir=/tmp/system
  -D mapreduce.cluster.temp.dir=/tmp/temp
To treat tasks with non-zero exit status as SUCCEDED:
  -D stream.non.zero.exit.is.failure=false
Use a custom hadoop streaming build along with standard hadoop install:
  $HADOOP_PREFIX/bin/hadoop jar /path/my-hadoop-streaming.jar [...]\
    [...] -D stream.shipped.hadoopstreaming=/path/my-hadoop-streaming.jar
For more details about jobconf parameters see:
  http://wiki.apache.org/hadoop/JobConfFile
To set an environement variable in a streaming command:
   -cmdenv EXAMPLE_DIR=/home/example/dictionaries/

Shortcut:
   setenv HSTREAMING "$HADOOP_PREFIX/bin/hadoop jar hadoop-streaming.jar"

Example: $HSTREAMING -mapper "/usr/local/bin/perl5 filter.pl"
           -file /local/filter.pl -input "/logs/0604*/*" [...]
  Ships a script, invokes the non-shipped perl interpreter. Shipped files go to
  the working directory so filter.pl is found by perl. Input files are all the
  daily logs for days in month 2006-04
  • Data preparation (a books.json dataset I found online)
["milton-paradise.txt", "[ Paradise Lost by John Milton 1667 ] Book I Of Man ' s first disobedience , and the fruit Of that forbidden tree whose mortal taste Brought death into the World , and all our woe , With loss of Eden , till one greater Man Restore us , and regain the blissful seat , Sing , Heavenly Muse , that , on the secret top Of Oreb , or of Sinai , didst inspire That shepherd who first taught the chosen seed In the beginning how the heavens and earth Rose out of Chaos : or , if Sion hill Delight thee more , and Siloa ' s brook that flowed Fast by the oracle of God , I thence Invoke thy aid to my adventurous song , That with no middle flight intends to soar Above th ' Aonian mount , while it pursues Things unattempted yet in prose or rhyme ."]
["edgeworth-parents.txt", "[ The Parent ' s Assistant , by Maria Edgeworth ] THE ORPHANS . Near the ruins of the castle of Rossmore , in Ireland , is a small cabin , in which there once lived a widow and her four children . As long as she was able to work , she was very industrious , and was accounted the best spinner in the parish ; but she overworked herself at last , and fell ill , so that she could not sit to her wheel as she used to do , and was obliged to give it up to her eldest daughter , Mary ."]
["austen-emma.txt", "[ Emma by Jane Austen 1816 ] VOLUME I CHAPTER I Emma Woodhouse , handsome , clever , and rich , with a comfortable home and happy disposition , seemed to unite some of the best blessings of existence ; and had lived nearly twenty - one years in the world with very little to distress or vex her . She was the youngest of the two daughters of a most affectionate , indulgent father ; and had , in consequence of her sister ' s marriage , been mistress of his house from a very early period . Her mother had died too long ago for her to have more than an indistinct remembrance of her caresses ; and her place had been supplied by an excellent woman as governess , who had fallen little short of a mother in affection ."]
["chesterton-ball.txt", "[ The Ball and The Cross by G . K . Chesterton 1909 ] I . A DISCUSSION SOMEWHAT IN THE AIR The flying ship of Professor Lucifer sang through the skies like a silver arrow ; the bleak white steel of it , gleaming in the bleak blue emptiness of the evening . That it was far above the earth was no expression for it ; to the two men in it , it seemed to be far above the stars . The professor had himself invented the flying machine , and had also invented nearly everything in it ."]
["bible-kjv.txt", "[ The King James Bible ] The Old Testament of the King James Bible The First Book of Moses : Called Genesis 1 : 1 In the beginning God created the heaven and the earth . 1 : 2 And the earth was without form , and void ; and darkness was upon the face of the deep . And the Spirit of God moved upon the face of the waters . 1 : 3 And God said , Let there be light : and there was light . 1 : 4 And God saw the light , that it was good : and God divided the light from the darkness . 1 : 5 And God called the light Day , and the darkness he called Night . And the evening and the morning were the first day ."]
["chesterton-thursday.txt", "[ The Man Who Was Thursday by G . K . Chesterton 1908 ] To Edmund Clerihew Bentley A cloud was on the mind of men , and wailing went the weather , Yea , a sick cloud upon the soul when we were boys together . Science announced nonentity and art admired decay ; The world was old and ended : but you and I were gay ; Round us in antic order their crippled vices came -- Lust that had lost its laughter , fear that had lost its shame . Like the white lock of Whistler , that lit our aimless gloom , Men showed their own white feather as proudly as a plume . Life was a fly that faded , and death a drone that stung ; The world was very old indeed when you and I were young ."]
["blake-poems.txt", "[ Poems by William Blake 1789 ] SONGS OF INNOCENCE AND OF EXPERIENCE and THE BOOK of THEL SONGS OF INNOCENCE INTRODUCTION Piping down the valleys wild , Piping songs of pleasant glee , On a cloud I saw a child , And he laughing said to me : \" Pipe a song about a Lamb !\" So I piped with merry cheer . \" Piper , pipe that song again ;\" So I piped : he wept to hear . \" Drop thy pipe , thy happy pipe ; Sing thy songs of happy cheer :!\" So I sang the same again , While he wept with joy to hear . \" Piper , sit thee down and write In a book , that all may read .\" So he vanish ' d from my sight ; And I pluck ' d a hollow reed , And I made a rural pen , And I stain ' d the water clear , And I wrote my happy songs Every child may joy to hear ."]
["shakespeare-caesar.txt", "[ The Tragedie of Julius Caesar by William Shakespeare 1599 ] Actus Primus . Scoena Prima . Enter Flauius , Murellus , and certaine Commoners ouer the Stage . Flauius . Hence : home you idle Creatures , get you home : Is this a Holiday ? What , know you not ( Being Mechanicall ) you ought not walke Vpon a labouring day , without the signe Of your Profession ? Speake , what Trade art thou ? Car . Why Sir , a Carpenter Mur . Where is thy Leather Apron , and thy Rule ? What dost thou with thy best Apparrell on ? You sir , what Trade are you ? Cobl . Truely Sir , in respect of a fine Workman , I am but as you would say , a Cobler Mur . But what Trade art thou ? Answer me directly Cob . A Trade Sir , that I hope I may vse , with a safe Conscience , which is indeed Sir , a Mender of bad soules Fla ."]
["whitman-leaves.txt", "[ Leaves of Grass by Walt Whitman 1855 ] Come , said my soul , Such verses for my Body let us write , ( for we are one ,) That should I after return , Or , long , long hence , in other spheres , There to some group of mates the chants resuming , ( Tallying Earth ' s soil , trees , winds , tumultuous waves ,) Ever with pleas ' d smile I may keep on , Ever and ever yet the verses owning -- as , first , I here and now Signing for Soul and Body , set to them my name , Walt Whitman [ BOOK I . INSCRIPTIONS ] } One ' s - Self I Sing One ' s - self I sing , a simple separate person , Yet utter the word Democratic , the word En - Masse ."]
["melville-moby_dick.txt", "[ Moby Dick by Herman Melville 1851 ] ETYMOLOGY . ( Supplied by a Late Consumptive Usher to a Grammar School ) The pale Usher -- threadbare in coat , heart , body , and brain ; I see him now . He was ever dusting his old lexicons and grammars , with a queer handkerchief , mockingly embellished with all the gay flags of all the known nations of the world . He loved to dust his old grammars ; it somehow mildly reminded him of his mortality ."]
  • Task requirements:

    • 1. The data is in JSON format; tokenize the text content of each txt entry
    • 2. Extract every word and output, one line per word, the file or files that word appears in, e.g.:
    Data:
        ["test_1.txt", "[ apple pipe ]"]
        ["test_2.txt", "[ apple company ]"]
    
    Result:
        apple    ["test_1.txt", "test_2.txt"]
        pipe    ["test_1.txt"]
        company    ["test_2.txt"]
  • mapper.py

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
""" Created on 2017-10-30 @author: Leo """

# Python standard library
import sys
import json


for line in sys.stdin:
    line = line.strip()
    # each input line is a JSON array: ["file_name", "text"]
    record = json.loads(line)
    file_name = record[0]
    value = record[1]
    words = value.split()
    for word in words:
        # emit one tab-separated (word, file_name) pair per word
        print("%s\t%s" % (word, file_name))

  • reducer.py
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
""" Created on 2017年10月30日 @author: Leo """

# Python standard library
import sys

media = {}
word_in_media = {}

# map each word to the list of files it appears in
for line in sys.stdin:
    (word, file_name) = line.strip().split('\t', 1)
    media.setdefault(word, [])
    media[word].append(file_name)

for word in media:
    word_in_media.setdefault(word, list(set(media[word])))

for word in word_in_media:
    print("%s\t%s" % (word, word_in_media[word]))

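Both scripts can be checked locally before touching Hadoop, because a streaming job is just map → sort → reduce over text lines. A minimal pure-Python sketch of that pipeline, inlining the same logic as mapper.py and reducer.py (the two sample records here are made up):

```python
import json

# Two sample records in the same shape as books.json (contents made up).
data = [
    '["test_1.txt", "[ apple pipe ]"]',
    '["test_2.txt", "[ apple company ]"]',
]

# Map stage: emit (word, file_name) pairs, as mapper.py does.
pairs = []
for line in data:
    file_name, text = json.loads(line)
    for word in text.split():
        pairs.append((word, file_name))

# Shuffle stage: Hadoop sorts the map output by key before reducing.
pairs.sort()

# Reduce stage: collect the de-duplicated file list per word, as reducer.py does.
media = {}
for word, file_name in pairs:
    media.setdefault(word, set()).add(file_name)

for word in sorted(media):
    print("%s\t%s" % (word, sorted(media[word])))
```

From a shell, the equivalent end-to-end check is `type books.json | python mapper.py | sort | python reducer.py` on Windows (`cat` instead of `type` on Linux).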
  • Upload books.json to HDFS

    • 1. Upload commands
    # If you have not created a folder on HDFS yet, create one first
    # The path below is the one I use; pick your own
    hdfs dfs -mkdir -p /user/Leo/input
    
    # Copy from the local disk up to HDFS
    hdfs dfs -copyFromLocal <absolute path>\books.json /user/Leo/input/
    
    # Open the HDFS page at localhost:50070 and check that the file is there
    • 2. Delete the output folder
    # If the MapReduce job errors out, remember to delete the output folder after fixing the problem and before re-running
    hdfs dfs -rm -r /user/Leo/output
    • 3. Leave HDFS safe mode
    # An abnormal operation may have pushed HDFS into safe mode
    hdfs dfsadmin -safemode leave
    • 4. Run the hadoop streaming command (don't forget the leading dashes! really, don't!)
    # The command is long, so it is shown with shell-style line continuations
    hadoop jar D:/bigdata/hadoop-2.7.4/share/hadoop/tools/lib/hadoop-streaming-2.7.4.jar \
    -D stream.non.zero.exit.is.failure=false \
    -input /user/Leo/input/books.json \
    -output /user/Leo/output \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -file C:/Users/Administrator/Desktop/MingDong_Work/Work_2/mapper.py \
    -file C:/Users/Administrator/Desktop/MingDong_Work/Work_2/reducer.py
    
    # Notes:
    1. The path after jar is the streaming jar; the docs suggest an environment variable plus the path, but an absolute path is shown here for clarity
    2. -D stream.non.zero.exit.is.failure=false: a mapper or reducer that exits with a non-zero status (i.e. does not return 0) would normally fail the task; this option skips that check
    3. -input: the input file on HDFS
    4. -output: where the files produced by the M-R job are stored
    5. -mapper: the script or command to run as the mapper
    6. -reducer: the script or command to run as the reducer
    7. -file: ships the code to the cluster (use -files for multiple files; the older -file form is shown here for clarity)
    • 5. After a successful run, a file named part-00000 appears under the specified output path on HDFS holding the results; download it from the web page or programmatically.
    hdfs dfs -get /user/Leo/output/part-00000 <local path>

Summary

  • Overall, Hadoop Streaming is easy to pick up; the difficulty lies mainly in understanding the Map-Reduce model.
  • You need to understand the following:
    • the Map-Reduce (i.e. Map-Shuffle-Reduce) workflow
    • how Hadoop Streaming splits lines into key/value pairs
    • your own business requirements
    • the characteristics of the scripting language used to meet them
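On the key/value point: by default, streaming treats everything before the first tab on a line as the key and the rest as the value, and the shuffle groups and sorts on that key (the separator can be changed with `stream.map.output.field.separator`; a tab is assumed below). A minimal sketch of the default split:

```python
def split_streaming_line(line):
    """Default Hadoop Streaming framing: key = text before the first tab,
    value = the rest of the line; a line with no tab is all key."""
    key, _sep, value = line.rstrip("\n").partition("\t")
    return key, value

print(split_streaming_line("apple\ttest_1.txt"))  # ('apple', 'test_1.txt')
print(split_streaming_line("no-tab-here"))        # ('no-tab-here', '')
```

This is why mapper.py emits `word\tfile_name`: the word becomes the shuffle key, so all pairs for the same word arrive at the reducer together.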

Advantages (convenient and fast)

  • Any language that supports stdin/stdout will do (Unix style)
  • Flexible; mostly used for ad-hoc tasks, with no need to change the project's code structure
  • Can be debugged locally

Disadvantages (mainly performance)

  • Precisely because data is exchanged over stdin/stdout, data types inevitably have to be converted along the way, which adds to execution time.

Addendum

  • A strange error that can appear on Windows (creating symbolic links fails)
    1. By default administrators can create symbolic links, so start the hadoop processes from an administrator command prompt
    2. Or grant the right by changing user policy:
      1. Win+R -> gpedit.msc
      2. Computer Configuration -> Windows Settings -> Security Settings -> Local Policies -> User Rights Assignment -> Create symbolic links
      3. Add your user, then reboot or log off