Hands-On Project from 0 to 1: Hive (34): E-Commerce Data Warehouse, User Behavior Data Collection (Part 2)

Chapter 4: Data Collection Module

4.1 Hadoop Installation

1) Cluster planning (see the cluster-planning figure)

Note: prefer installing from an offline package whenever possible.

4.1.1 Project Experience: Multiple HDFS Storage Directories

If HDFS storage space is running low, the DataNodes need additional disks. 1) Add a disk to each DataNode machine and mount it.

2) Configure the multiple data directories in hdfs-site.xml. Mind the access permissions of the newly mounted disks.

<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///${hadoop.tmp.dir}/dfs/data1,file:///hd2/dfs/data2,file:///hd3/dfs/data3,file:///hd4/dfs/data4</value>
</property>
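The mount-and-permission step above can be sketched as follows. The device name /dev/sdb1 is an assumption for illustration; the mount point /hd2 and the kgg user come from this walkthrough. Adjust for the actual hardware.

```shell
# Assumed new device: /dev/sdb1 -- adjust for the actual disk.
mkfs -t ext4 /dev/sdb1          # format the new disk (destroys any data on it)
mkdir -p /hd2
mount /dev/sdb1 /hd2            # mount it (add an /etc/fstab entry to persist across reboots)
mkdir -p /hd2/dfs/data2
chown -R kgg:kgg /hd2/dfs       # the DataNode process user must own the data directory
```

After changing dfs.datanode.data.dir, restart the DataNode so the new directory takes effect.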

4.1.2 Project Experience: Configuring LZO Compression Support

1) Hadoop itself does not support LZO compression, so the open-source hadoop-lzo component from Twitter is required. hadoop-lzo must be compiled against Hadoop and LZO; the compiled jar is used in the steps below.

2) Put the compiled hadoop-lzo-0.4.20.jar into hadoop-2.7.2/share/hadoop/common/

[kgg@hadoop101 common]$ pwd
/opt/module/hadoop-2.7.2/share/hadoop/common
[kgg@hadoop101 common]$ ls
hadoop-lzo-0.4.20.jar

3) Sync hadoop-lzo-0.4.20.jar to hadoop102 and hadoop103

[kgg@hadoop101 common]$ xsync hadoop-lzo-0.4.20.jar
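xsync is the cluster-distribution helper script (an rsync wrapper) used throughout this series. If it is not available, a rough hand-rolled equivalent for this one jar looks like the following sketch; the host names and path are taken from this walkthrough.

```shell
# Copy the jar to the same path on the other two nodes.
for host in hadoop102 hadoop103; do
    rsync -av /opt/module/hadoop-2.7.2/share/hadoop/common/hadoop-lzo-0.4.20.jar \
          "${host}:/opt/module/hadoop-2.7.2/share/hadoop/common/"
done
```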

4) Add the LZO compression configuration to core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>io.compression.codecs</name>
        <value>
            org.apache.hadoop.io.compress.GzipCodec,
            org.apache.hadoop.io.compress.DefaultCodec,
            org.apache.hadoop.io.compress.BZip2Codec,
            org.apache.hadoop.io.compress.SnappyCodec,
            com.hadoop.compression.lzo.LzoCodec,
            com.hadoop.compression.lzo.LzopCodec
        </value>
    </property>
    <property>
        <name>io.compression.codec.lzo.class</name>
        <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>
</configuration>
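One way to sanity-check that the edited configuration is picked up (assuming the Hadoop binaries are on the PATH on hadoop101) is to ask Hadoop for the effective value of the key:

```shell
# Print io.compression.codecs as Hadoop resolves it;
# the com.hadoop.compression.lzo.* classes should appear in the output.
hdfs getconf -confKey io.compression.codecs
```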

5) Sync core-site.xml to hadoop102 and hadoop103

[kgg@hadoop101 hadoop]$ xsync core-site.xml

6) Start and inspect the cluster

[kgg@hadoop101 hadoop-2.7.2]$ sbin/start-dfs.sh
[kgg@hadoop102 hadoop-2.7.2]$ sbin/start-yarn.sh

7) Test: run wordcount with LZO-compressed output

yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount -Dmapreduce.output.fileoutputformat.compress=true -Dmapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec /input /output

8) Build an index for the LZO output files. A plain .lzo file cannot be split and is processed by a single map task; hadoop-lzo's DistributedLzoIndexer writes a .index file next to each .lzo file so that MapReduce can split it.

hadoop jar ./share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.DistributedLzoIndexer /output

4.1.3 Project Experience: Benchmarking

1) Test HDFS write performance. Test content: write ten 128 MB files to the HDFS cluster.

[kgg@hadoop101 mapreduce]$ hadoop jar /opt/module/hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.2-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 128MB
19/05/02 11:44:26 INFO fs.TestDFSIO: TestDFSIO.1.8
19/05/02 11:44:26 INFO fs.TestDFSIO: nrFiles = 10
19/05/02 11:44:26 INFO fs.TestDFSIO: nrBytes (MB) = 128.0
19/05/02 11:44:26 INFO fs.TestDFSIO: bufferSize = 1000000
19/05/02 11:44:26 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO
19/05/02 11:44:28 INFO fs.TestDFSIO: creating control file: 134217728 bytes, 10 files
19/05/02 11:44:30 INFO fs.TestDFSIO: created control files for: 10 files
19/05/02 11:44:30 INFO client.RMProxy: Connecting to ResourceManager at hadoop102/192.168.1.103:8032
19/05/02 11:44:31 INFO client.RMProxy: Connecting to ResourceManager at hadoop102/192.168.1.103:8032
19/05/02 11:44:32 INFO mapred.FileInputFormat: Total input paths to process : 10
19/05/02 11:44:32 INFO mapreduce.JobSubmitter: number of splits:10
19/05/02 11:44:33 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1556766549220_0003
19/05/02 11:44:34 INFO impl.YarnClientImpl: Submitted application application_1556766549220_0003
19/05/02 11:44:34 INFO mapreduce.Job: The url to track the job: http://hadoop102:8088/proxy/application_1556766549220_0003/
19/05/02 11:44:34 INFO mapreduce.Job: Running job: job_1556766549220_0003
19/05/02 11:44:47 INFO mapreduce.Job: Job job_1556766549220_0003 running in uber mode : false
19/05/02 11:44:47 INFO mapreduce.Job:  map 0% reduce 0%
19/05/02 11:45:05 INFO mapreduce.Job:  map 13% reduce 0%
19/05/02 11:45:06 INFO mapreduce.Job:  map 27% reduce 0%
19/05/02 11:45:08 INFO mapreduce.Job:  map 43% reduce 0%
19/05/02 11:45:09 INFO mapreduce.Job:  map 60% reduce 0%
19/05/02 11:45:10 INFO mapreduce.Job:  map 73% reduce 0%
19/05/02 11:45:15 INFO mapreduce.Job:  map 77% reduce 0%
19/05/02 11:45:18 INFO mapreduce.Job:  map 87% reduce 0%
19/05/02 11:45:19 INFO mapreduce.Job:  map 100% reduce 0%
19/05/02 11:45:21 INFO mapreduce.Job:  map 100% reduce 100%
19/05/02 11:45:22 INFO mapreduce.Job: Job job_1556766549220_0003 completed successfully
19/05/02 11:45:22 INFO mapreduce.Job: Counters: 51
        File System Counters
                FILE: Number of bytes read=856
                FILE: Number of bytes written=1304826
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=2350
                HDFS: Number of bytes written=1342177359
                HDFS: Number of read operations=43
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=12
        Job Counters
                Killed map tasks=1
                Launched map tasks=10
                Launched reduce tasks=1
                Data-local map tasks=8
                Rack-local map tasks=2
                Total time spent by all maps in occupied slots (ms)=263635
                Total time spent by all reduces in occupied slots (ms)=9698
                Total time spent by all map tasks (ms)=263635
                Total time spent by all reduce tasks (ms)=9698
                Total vcore-milliseconds taken by all map tasks=263635
                Total vcore-milliseconds taken by all reduce tasks=9698
                Total megabyte-milliseconds taken by all map tasks=269962240
                Total megabyte-milliseconds taken by all reduce tasks=9930752
        Map-Reduce Framework
                Map input records=10
                Map output records=50
                Map output bytes=750
                Map output materialized bytes=910
                Input split bytes=1230
                Combine input records
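The result summary of this run (the "Throughput mb/sec" lines TestDFSIO prints at the end) is truncated above. TestDFSIO reports per-file throughput, so a rough upper-bound estimate of aggregate write bandwidth is that figure times the number of concurrent map tasks. As a hypothetical sketch (the 10.5 MB/s figure is illustrative, not taken from this run):

```shell
nrFiles=10            # -nrFiles from the command above
perFileMBs=10.5       # hypothetical "Throughput mb/sec" reported per map task
# Upper-bound aggregate bandwidth if all 10 maps write concurrently:
awk -v n="$nrFiles" -v t="$perFileMBs" 'BEGIN { printf "%.1f MB/s\n", n * t }'
# prints: 105.0 MB/s
```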