Working with Hadoop from Python: Python MapReduce

Date: 2019-12-21


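Environment

Environment used: Hadoop 3.1, Python 3.6, Ubuntu 18.04.

Hadoop is written in Java, so Java is the recommended language for working with HDFS.

Sometimes, however, we need to work with HDFS from Python.

This article covers how to use Python with HDFS to upload files, download files, and list directories, and how to write MapReduce programs in Python.

Working with HDFS from Python

First install the hdfs library with pip install hdfs, then import it.

1. Connect and list the data under a given path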

from hdfs import Client
client = Client('http://ip:port')  # use port 50070 for Hadoop 2.x, 9870 for Hadoop 3.x
client.list('/')   # list the entries under the HDFS root directory

2. Create a directory

client.makedirs('/test')
client.makedirs('/test', permission=777)  # permission sets the octal permissions of the new directory

3. Rename and delete

client.rename('/test', '123')  # rename the /test directory to 123
client.delete('/test', True)  # the second argument enables recursive deletion

4. Download

Download the file /test/log.txt into the /home directory.

client.download('/test/log.txt', '/home')

5. Read

with client.read("/test/[PPT]Google Protocol Buffers.pdf") as reader:
    print(reader.read())

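Other parameters:

read(*args, **kwds)
hdfs_path: path of the file on HDFS.
offset: byte position at which to start reading.
length: number of bytes to read.
buffer_size: size in bytes of the buffer used for transferring data. Defaults to the value set in the HDFS configuration.
encoding: encoding used to decode the data.
chunk_size: if set to a positive number, the context manager returns a generator that yields chunk_size bytes at a time instead of a file-like object.
delimiter: if set, the context manager returns a generator that yields the data up to each occurrence of the delimiter. This parameter requires encoding to be specified.
progress: callback function for tracking progress, called every chunk_size bytes (not available if chunk_size is not specified). It receives two arguments: the path of the file and the number of bytes transferred so far. When the transfer completes, it is called once with -1 as the second argument.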
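
As a rough illustration of the encoding and delimiter parameters, the sketch below counts the newline-separated pieces of a text file by iterating over the reader. count_lines is a hypothetical helper name, not part of the hdfs library; it only assumes that client is an hdfs client object like the one created in step 1 and that the file is UTF-8 text.

```python
def count_lines(client, hdfs_path):
    """Count the newline-separated pieces of a text file stored on HDFS.

    With both encoding and delimiter set, client.read() hands back a
    generator that yields the file piece by piece instead of a file-like
    object, so the whole file is never held in memory at once.
    """
    total = 0
    with client.read(hdfs_path, encoding='utf-8', delimiter='\n') as reader:
        for _piece in reader:
            total += 1
    return total
```

It could be called as count_lines(client, '/test/log.txt'), reusing the path from the download step.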

6. Upload

Upload a file into /test on HDFS.

client.upload('/test', '/home/test/a.log')

Python-MapReduce

Write the mapper code, map.py:

import sys

for line in sys.stdin:
    fields = line.strip().split()
    for item in fields:
        print(item + ' ' + '1')

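Write the reducer code, reduce.py:

import sys

result = {}
for line in sys.stdin:
    # each input line looks like "word 1"
    kvs = line.strip().split(' ')
    k = kvs[0]
    v = int(kvs[1])
    # add the emitted count instead of assuming it is always 1
    if k in result:
        result[k] += v
    else:
        result[k] = v
for k, v in result.items():
    print("%s\t%s" % (k, v))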
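
Hadoop Streaming sorts the mapper output by key before it reaches the reducer (the local test further down imitates this with sort), so all lines for one word arrive consecutively. A reducer can exploit this and keep a single running count instead of a dictionary of every word. The following is a minimal sketch of that alternative, not the script used in this article. reduce_sorted is a hypothetical name, and the input is assumed to be the space-separated "word 1" lines produced by map.py.

```python
from itertools import groupby


def reduce_sorted(lines):
    """Yield 'word<TAB>count' for input lines already sorted by word."""
    # Split each non-empty line into [word, count].
    pairs = (line.split() for line in lines if line.strip())
    # groupby only merges adjacent items, which is why sorted input is required.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(kv[1]) for kv in group)
        yield "%s\t%d" % (word, total)


# In-memory demonstration with already sorted mapper output.
sample = ["as 1", "as 1", "as 1", "be 1", "tale 1"]
for out in reduce_sorted(sample):
    print(out)
```

In a streaming job the function would be fed sys.stdin instead of the sample list.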

Add a test text file, test1.txt:

tale as old as time
true as it can be
beauty and the beast

Test the `map` code locally:

cat test1.txt | python map.py

Result:

tale 1
as 1
old 1
as 1
time 1
true 1
as 1
it 1
can 1
be 1
beauty 1
and 1
the 1
beast 1

Test the `reduce` code locally:

cat test1.txt | python map.py | sort -k1,1 | python reduce.py

Output (the line order depends on dictionary iteration order and may differ on your machine):

and    1
be    1
old    1
beauty    1
true    1
it    1
beast    1
as    3
can    1
time    1
the    1
tale    1
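
The shell pipeline above mirrors the three phases of a streaming job: map, sort (shuffle), and reduce. For readers who prefer to step through the logic in a debugger, the same flow can be sketched in a single Python process. The function names here are illustrative, and the logic is a re-implementation of map.py and reduce.py rather than an import of them.

```python
def map_words(lines):
    """Map phase: emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_counts(pairs):
    """Reduce phase: add up the counts for each word."""
    result = {}
    for word, count in pairs:
        result[word] = result.get(word, 0) + count
    return result


text = [
    "tale as old as time",
    "true as it can be",
    "beauty and the beast",
]
# sorted() plays the role of `sort -k1,1` (the shuffle phase on a real cluster).
counts = reduce_counts(sorted(map_words(text)))
for word, count in counts.items():
    print("%s\t%d" % (word, count))
```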

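Running the `map-reduce` program on Hadoop

With local testing done, write a script that runs the program against HDFS.

Script: run.sh (adjust the paths to your environment)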

HADOOP_CMD="/app/hadoop-3.1.2/bin/hadoop"

STREAM_JAR_PATH="/app/hadoop-3.1.2/share/hadoop/tools/lib/hadoop-streaming-3.1.2.jar"

INPUT_FILE_PATH_1="/py/input/"

OUTPUT_PATH="/output"

# Remove any previous output directory; the job fails if it already exists.
$HADOOP_CMD fs -rm -r -skipTrash $OUTPUT_PATH

# Step 1.

$HADOOP_CMD jar $STREAM_JAR_PATH   \
-input $INPUT_FILE_PATH_1   \
-output $OUTPUT_PATH   \
-mapper "python map.py"   \
-reducer "python reduce.py"  \
-file ./map.py   \
-file ./reduce.py

Make the script executable with chmod a+x run.sh, then run the test with bash run.sh and check the result.
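
A streaming job writes its results as one or more part-* files under the output directory, next to markers such as _SUCCESS. Since this article already uses the hdfs client, one way to inspect the results from Python is sketched below. read_job_output is a hypothetical helper; it assumes client is an hdfs client connected as in step 1 and that the output files are UTF-8 text.

```python
def read_job_output(client, output_dir):
    """Return all lines from the part-* files of a finished job."""
    lines = []
    # client.list() returns the names of the entries in the directory.
    for name in sorted(client.list(output_dir)):
        if not name.startswith('part-'):
            continue  # skip markers such as _SUCCESS
        # With encoding set, the reader returns decoded text.
        with client.read(output_dir + '/' + name, encoding='utf-8') as reader:
            lines.extend(reader.read().splitlines())
    return lines
```

For the script above it could be called as read_job_output(client, '/output').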

Exercises

1. Merging files and removing duplicates

A sample of input file file1:
20150101 x
20150102 y
20150103 x
20150104 y
20150105 z
20150106 x

A sample of input file file2:
20150101 y
20150102 y
20150103 x
20150104 z
20150105 y

A sample of the output file file3 obtained by merging file1 and file2:

20150101 x
20150101 y
20150102 y
20150103 x
20150104 y
20150104 z
20150105 y
20150105 z
20150106 x

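Given the two input files file1 and file2, write a MapReduce program that merges them and removes duplicate content, producing a new output file file3.
To complete the task, the program must merge different files that contain duplicate content into a single file without duplicates, following these rules:

Sort by the first column (the student ID);
For equal student IDs, sort in the order x, y, z.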
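
One possible approach, offered only as a sketch and not as a reference solution: let the mapper emit each whole line as the key, so that identical lines from both files become adjacent after the sort phase, and let the reducer output each distinct line once. The helper names below are hypothetical, and the logic is shown in a single process rather than as two streaming scripts.

```python
def dedup_map(lines):
    """Map phase: emit each non-empty line, stripped, as the key."""
    for line in lines:
        line = line.strip()
        if line:
            yield line


def dedup_reduce(sorted_keys):
    """Reduce phase: emit a key only when it differs from the previous one."""
    previous = None
    for key in sorted_keys:
        if key != previous:
            yield key
            previous = key


file1 = ["20150101 x", "20150102 y", "20150103 x",
         "20150104 y", "20150105 z", "20150106 x"]
file2 = ["20150101 y", "20150102 y", "20150103 x",
         "20150104 z", "20150105 y"]

# sorted() stands in for the shuffle/sort between map and reduce.
file3 = list(dedup_reduce(sorted(dedup_map(file1 + file2))))
print("\n".join(file3))
```

Because the whole line is the key, the sort phase already orders the result by the first column and then by the second, which matches the two rules above.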

2. Mining parent-child relationships

The input file contains:
child parent
Steven Lucy
Steven Jack
Jone Lucy
Jone Jack
Lucy Mary
Lucy Frank
Jack Alice
Jack Jesse
David Alice
David Jesse
Philip David
Philip Alma
Mark David
Mark Alma

The output file contains:

grandchild grandparent
Steven Alice
Steven Jesse
Jone Alice
Jone Jesse
Steven Mary
Steven Frank
Jone Mary
Jone Frank
Philip Alice
Philip Jesse
Mark Alice
Mark Jesse
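
One way to think about this exercise is as a self-join on the middle generation: every person who appears both as a parent and as a child links their own children to their own parents. The sketch below shows that join in a single process. It is an assumption about how the exercise could be solved, not a reference solution, and find_grandparents is a hypothetical name.

```python
from collections import defaultdict


def find_grandparents(pairs):
    """Return (grandchild, grandparent) pairs from (child, parent) pairs."""
    # For every person, collect their children and their parents. In a
    # MapReduce job the mapper would emit each pair twice, keyed once by
    # the parent and once by the child, so that the reducer for a given
    # person sees both lists.
    children_of = defaultdict(list)
    parents_of = defaultdict(list)
    for child, parent in pairs:
        children_of[parent].append(child)
        parents_of[child].append(parent)

    result = []
    for person in children_of:
        # person is the middle generation: link its children to its parents.
        for grandchild in children_of[person]:
            for grandparent in parents_of.get(person, []):
                result.append((grandchild, grandparent))
    return result


pairs = [
    ("Steven", "Lucy"), ("Steven", "Jack"), ("Jone", "Lucy"), ("Jone", "Jack"),
    ("Lucy", "Mary"), ("Lucy", "Frank"), ("Jack", "Alice"), ("Jack", "Jesse"),
    ("David", "Alice"), ("David", "Jesse"), ("Philip", "David"),
    ("Philip", "Alma"), ("Mark", "David"), ("Mark", "Alma"),
]
for grandchild, grandparent in find_grandparents(pairs):
    print(grandchild, grandparent)
```

The pairs come out in a different order from the sample output above, but the set of pairs is the same.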