1. Website Clickstream Data Analysis Project: Recommended Reading
For reference, you can look at how Baidu implements this kind of product: https://tongji.baidu.com/web/welcome/login
For the business-domain knowledge behind website clickstream analysis, the recommended book is 《網站分析實戰——如何以數據驅動決策,提高網站價值》 (Web Analytics in Action: Driving Decisions with Data to Increase Site Value), by 王彥平 and 吳盛鋒: http://download.csdn.net/download/biexiansheng/10160197
2. Overall Technical Process and Architecture
2.1 Data Processing Flow
This is a pure data-analysis project, so its overall flow simply follows the data-processing pipeline, which breaks down into these major steps:
(1) Data collection
First, user access behavior is captured by JS code embedded in the pages and sent to the back end of the web service, which records it in logs (here we assume this data has already been collected). The clickstream logs generated on each server are then gathered into HDFS, in real time or in batches. Of course, in a full analysis system the data sources may include not only clickstream data but also business data from databases (user, product, and order information, etc.) and any external data useful to the analysis.
(2) Data preprocessing
A MapReduce program preprocesses the collected clickstream data: cleaning it, normalizing its format, filtering out dirty records, and so on. This produces detail tables, i.e. wide tables (several of them, trading space for time).
(3) Data loading
The preprocessed data is imported into the corresponding databases and tables of the Hive warehouse.
(4) Data analysis
The core of the project: developing ETL analysis statements according to the requirements to produce the various statistics.
(5) Data presentation
The analysis results are then visualized.
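Step (4) above largely boils down to counting metrics such as page views (PV) and unique visitors (UV) over the cleaned records. In the project these are computed with ETL statements in Hive, but the core arithmetic can be sketched in a few lines of Python over hypothetical in-memory records:

```python
from collections import Counter

# Hypothetical pre-parsed records as (visitor_ip, url) pairs; in the real
# project this data lives in Hive detail tables, not in a Python list.
records = [
    ("58.215.204.118", "/index.html"),
    ("58.215.204.118", "/about.html"),
    ("101.226.68.137", "/index.html"),
]

pv = len(records)                       # page views: one per request
uv = len({ip for ip, _ in records})     # unique visitors, approximated by distinct IPs
top_pages = Counter(url for _, url in records).most_common(1)
print(pv, uv, top_pages)                # 3 2 [('/index.html', 2)]
```

In HiveQL the same counts would be a `count(*)` and a `count(distinct ip)` over the detail table.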
2.2 Project Structure
Since this is a pure data-analysis project, its overall structure matches the analysis flow and is not especially complex, as shown in the figure below:
One point worth stressing:
The analysis is not a one-off computation; it is repeated at a fixed frequency, so the stages of the processing chain must be linked according to their dependency order. This involves managing and scheduling a large number of task units, which is why the project needs a task-scheduling module.
2.3 Data Presentation
The goal of data presentation is to visualize the analysis results so that operations and decision-making staff can obtain the data more conveniently and understand it faster and more easily.
3. Module Development: Data Collection
3.1 Requirements
Broadly speaking, data collection has two parts.
1) Capturing user access behavior on the pages. The concrete development work:
a. develop the page-tracking JS that captures user access behavior;
b. develop the back end that receives the JS requests and writes the logs. This part can also be classified as the "data source", and is usually built by the web development team.
2) Gathering the logs from the web servers into HDFS. This is the data-collection step of the analysis system itself and is built by the data-platform team. There are many ways to implement it:
Shell scripts
Pros: lightweight and simple to develop.
Cons: error handling during collection is hard to control.
A Java collection program
Pros: fine-grained control over the collection process.
Cons: a heavy development workload.
The Flume log-collection framework
A mature open-source log-collection system that is itself a member of the Hadoop ecosystem, so it has a natural affinity with the various Hadoop framework components and is highly extensible.
3.2 Technology Selection
In a clickstream log-analysis scenario, the reliability and fault-tolerance requirements on the collection step are usually not very strict, so the general-purpose Flume log-collection framework fully satisfies the need.
This project therefore uses Flume to implement log collection.
3.3 Building the Flume Log Collection
a. Data source
The data analyzed in this project is the traffic log generated by nginx servers, stored on each nginx server (details omitted).
b. Sample data
The exact content of the data does not matter much at the collection stage.
Field breakdown of one sample line:
1. Visitor IP address: 58.215.204.118
2. Visitor user info: - -
3. Request time: [18/Sep/2013:06:51:35 +0000]
4. Request method: GET
5. Requested URL: /wp-includes/js/jquery/jquery.js?ver=1.10.2
6. Protocol: HTTP/1.1
7. Response code: 304
8. Bytes returned: 0
9. Referrer URL: http://blog.fens.me/nodejs-socketio-chat/
10. Visitor user agent: Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0
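This field layout is the standard nginx "combined" log format, so a single regular expression can split a line into its ten fields. A minimal Python sketch (not the project's actual MapReduce parser), run against the sample line above:

```python
import re
from datetime import datetime

# One line of the nginx "combined" access-log format described above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<proto>\S+)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Parse one access-log line into a dict, or return None for malformed lines."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None
    rec = m.groupdict()
    # Convert the bracketed timestamp into a datetime object.
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    return rec

sample = ('58.215.204.118 - - [18/Sep/2013:06:51:35 +0000] '
          '"GET /wp-includes/js/jquery/jquery.js?ver=1.10.2 HTTP/1.1" 304 0 '
          '"http://blog.fens.me/nodejs-socketio-chat/" '
          '"Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"')
rec = parse_line(sample)
print(rec["ip"], rec["method"], rec["status"])   # 58.215.204.118 GET 304
```

Returning None for unparseable lines is the hook the later "filter out dirty data" step can build on.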
開始實際操做,現學現賣,使用flume採集數據以下所示:shell
因爲是直接使用現成的數據,因此省略獲取原始數據的操做:
(Hadoop, Flume, Hive, Azkaban, MySQL, and all other required tools are assumed to be fully installed and configured.)
Step 1: Assume the data has already been obtained. The data file from the course is named access.log.fensi; here it is renamed to access.log.
Step 2: With the data in place, collect it with the Flume log-collection system.
Step 3: Configure the collection rules. The Flume configuration file is named tail-hdfs.conf: it pulls data with the tail command and sinks it into HDFS.
Start command:

```shell
bin/flume-ng agent -c conf -f conf/tail-hdfs.conf -n a1
```

The annotated configuration:

```properties
# Name the components on this agent.
# a1 is the agent's name; r1, k1, c1 are logical names for the source, sink, and channel.
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source r1.
# type = exec runs a command and collects whatever it outputs.
a1.sources.r1.type = exec
# tail -F follows the file by name; the path and file name to collect go here.
a1.sources.r1.command = tail -F /home/hadoop/log/test.log
a1.sources.r1.channels = c1

# Describe the sink k1.
# Sink type hdfs: the data is written into the HDFS distributed file system.
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# Target path; Flume substitutes the date/time escapes.
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/
# File name prefix.
a1.sinks.k1.hdfs.filePrefix = events-
# Switch to a new directory every 10 minutes.
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# Roll the file every 3 seconds (the default is 30), to make the effect easy to observe.
a1.sinks.k1.hdfs.rollInterval = 3
# Also roll once the file reaches 500 bytes...
a1.sinks.k1.hdfs.rollSize = 500
# ...or after 20 events have been written.
a1.sinks.k1.hdfs.rollCount = 20
# Write events in batches of 5.
a1.sinks.k1.hdfs.batchSize = 5
# Take the timestamp from the local machine.
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Output file type; the default SequenceFile is binary, DataStream gives plain text.
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel.
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```
具體操做以下所示:
```shell
[root@master soft]# cd flume/conf/
[root@master conf]# ls
flume-conf.properties.template  flume-env.ps1.template  flume-env.sh  flume-env.sh.template  log4j.properties
[root@master conf]# vim tail-hdfs.conf
```
With the following contents:
```properties
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/data_hadoop/access.log
a1.sources.r1.channels = c1

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollInterval = 3
a1.sinks.k1.hdfs.rollSize = 500
a1.sinks.k1.hdfs.rollCount = 20
a1.sinks.k1.hdfs.batchSize = 5
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```
Then start HDFS; YARN is not strictly required for this step, but both are started here:

```shell
[root@master hadoop]# start-dfs.sh
[root@master hadoop]# start-yarn.sh
```
Once they are up, check that HDFS is working properly:

```shell
[root@master hadoop]# hdfs dfsadmin -report
```
```
Configured Capacity: 56104357888 (52.25 GB)
Present Capacity: 39446368256 (36.74 GB)
DFS Remaining: 39438364672 (36.73 GB)
DFS Used: 8003584 (7.63 MB)
DFS Used%: 0.02%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Live datanodes (3):

Name: 192.168.199.130:50010 (master)
Hostname: master
Decommission Status : Normal
Configured Capacity: 18611974144 (17.33 GB)
DFS Used: 3084288 (2.94 MB)
Non DFS Used: 7680802816 (7.15 GB)
DFS Remaining: 10928087040 (10.18 GB)
DFS Used%: 0.02%
DFS Remaining%: 58.72%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Dec 16 13:31:03 CST 2017

Name: 192.168.199.132:50010 (slaver2)
Hostname: slaver2
Decommission Status : Normal
Configured Capacity: 18746191872 (17.46 GB)
DFS Used: 1830912 (1.75 MB)
Non DFS Used: 4413718528 (4.11 GB)
DFS Remaining: 14330642432 (13.35 GB)
DFS Used%: 0.01%
DFS Remaining%: 76.45%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Dec 16 13:31:03 CST 2017

Name: 192.168.199.131:50010 (slaver1)
Hostname: slaver1
Decommission Status : Normal
Configured Capacity: 18746191872 (17.46 GB)
DFS Used: 3088384 (2.95 MB)
Non DFS Used: 4563468288 (4.25 GB)
DFS Remaining: 14179635200 (13.21 GB)
DFS Used%: 0.02%
DFS Remaining%: 75.64%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Dec 16 13:31:03 CST 2017

[root@master hadoop]#
```
If HDFS started normally, tail can now pull the data and sink it into HDFS.
Start the collection by launching the Flume agent (note: the -n argument must match the agent name used in the configuration file):

```shell
bin/flume-ng agent -c conf -f conf/tail-hdfs.conf -n a1
```
```shell
[root@master conf]# cd /home/hadoop/soft/flume/
[root@master flume]# ls
bin  CHANGELOG  conf  DEVNOTES  docs  lib  LICENSE  NOTICE  README  RELEASE-NOTES  tools
[root@master flume]# bin/flume-ng agent -c conf -f conf/tail-hdfs.conf -n a1
```
Output like the following means the agent has started and collection is under way:
```shell
[root@master flume]# bin/flume-ng agent -c conf -f conf/tail-hdfs.conf -n a1
Info: Sourcing environment configuration script /home/hadoop/soft/flume/conf/flume-env.sh
Info: Including Hadoop libraries found via (/home/hadoop/soft/hadoop-2.6.4/bin/hadoop) for HDFS access
Info: Excluding /home/hadoop/soft/hadoop-2.6.4/share/hadoop/common/lib/slf4j-api-1.7.5.jar from classpath
Info: Excluding /home/hadoop/soft/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar from classpath
Info: Including Hive libraries found via (/home/hadoop/soft/apache-hive-1.2.1-bin) for Hive access
+ exec /home/hadoop/soft/jdk1.7.0_65/bin/java -Xmx20m -cp '/home/hadoop/soft/flume/conf:/home/hadoop/soft/flume/lib/*:/home/hadoop/soft/hadoop-2.6.4/etc/hadoop: ... (full Hadoop and Hive jar classpath omitted) ... :/home/hadoop/soft/apache-hive-1.2.1-bin/lib/*' -Djava.library.path=:/home/hadoop/soft/hadoop-2.6.4/lib/native org.apache.flume.node.Application -f conf/tail-hdfs.conf -n a1
```
Then check the result, either on the command line or in the NameNode web UI.
As long as /home/hadoop/data_hadoop/access.log keeps growing, collected output keeps being generated as well.

```shell
[root@master hadoop]# hadoop fs -ls /flume/events/17-12-16
```
4. Module Development: Data Preprocessing
4.1 Main goals:
filter out "non-conforming" data;
convert and normalize the format;
according to the downstream statistics, filter out and separate the base data for the different subjects (different page paths).
4.2 Implementation:
develop a MapReduce program, WeblogPreProcess (code not inlined here; see the GitHub repository for details).
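The filtering in such a cleaning pass can be sketched in a few lines of Python; the static-resource suffix list and status-code rule below are illustrative assumptions, not the project's exact filtering logic:

```python
# Illustrative cleaning rules modeled on the goals above.
STATIC_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".ico")

def is_valid(rec):
    """Keep a parsed record only if it looks like a real page request."""
    if rec is None:                      # unparseable ("dirty") line
        return False
    path = rec["url"].split("?")[0]
    if path.endswith(STATIC_SUFFIXES):   # drop static-resource requests
        return False
    if not rec["status"].startswith(("2", "3")):  # drop error responses
        return False
    return True

recs = [
    {"url": "/index.html", "status": "200"},
    {"url": "/wp-includes/js/jquery/jquery.js?ver=1.10.2", "status": "304"},
    {"url": "/missing.html", "status": "404"},
    None,
]
print([r["url"] for r in recs if is_valid(r)])   # ['/index.html']
```

In the real project the same predicate sits inside the WeblogPreProcess mapper, which emits only the records that pass.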
The program is developed in Eclipse on Windows; the imported jars include the Hadoop jars (covered earlier, so not repeated here) and the Hive jars (from apache-hive-1.2.1-bin\lib).
While studying you may want to read the Hadoop source. I had this working before, but today Ctrl-clicking into Hadoop classes stopped working and I had forgotten how I set it up, so recording the quickest fix here: right-click the project, then Build Path, Configure Build Path, Source, Link Source, and select hadoop-2.6.4-src.
If classes in a jar still cannot be viewed: select the jar, open Properties, then Java Source Attachment, External Location, External Folder.
Once the program is ready, run it to preprocess (i.e. clean) the log data:

```shell
[root@master data_hadoop]# hadoop jar webLogPreProcess.java.jar com.bie.dataStream.hive.mapReduce.pre.WeblogPreProcess /flume/events/17-12-16 /flume/filterOutput
```
The execution output:
```
[root@master data_hadoop]# hadoop jar webLogPreProcess.java.jar com.bie.dataStream.hive.mapReduce.pre.WeblogPreProcess /flume/events/17-12-16 /flume/filterOutput
17/12/16 17:57:25 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.199.130:8032
17/12/16 17:57:57 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/12/16 17:58:03 INFO input.FileInputFormat: Total input paths to process : 3
17/12/16 17:58:08 INFO mapreduce.JobSubmitter: number of splits:3
17/12/16 17:58:09 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1513402019656_0001
17/12/16 17:58:19 INFO impl.YarnClientImpl: Submitted application application_1513402019656_0001
17/12/16 17:58:20 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1513402019656_0001/
17/12/16 17:58:20 INFO mapreduce.Job: Running job: job_1513402019656_0001
17/12/16 17:59:05 INFO mapreduce.Job: Job job_1513402019656_0001 running in uber mode : false
17/12/16 17:59:05 INFO mapreduce.Job:  map 0% reduce 0%
17/12/16 18:00:25 INFO mapreduce.Job:  map 100% reduce 0%
17/12/16 18:00:27 INFO mapreduce.Job: Job job_1513402019656_0001 completed successfully
17/12/16 18:00:27 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=318342
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=1749
		HDFS: Number of bytes written=1138
		HDFS: Number of read operations=15
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=6
	Job Counters
		Launched map tasks=3
		Data-local map tasks=3
		Total time spent by all maps in occupied slots (ms)=212389
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=212389
		Total vcore-milliseconds taken by all map tasks=212389
		Total megabyte-milliseconds taken by all map tasks=217486336
	Map-Reduce Framework
		Map input records=10
		Map output records=10
		Input split bytes=381
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=3892
		CPU time spent (ms)=3820
		Physical memory (bytes) snapshot=160026624
		Virtual memory (bytes) snapshot=1093730304
		Total committed heap usage (bytes)=33996800
	File Input Format Counters
		Bytes Read=1368
	File Output Format Counters
		Bytes Written=1138
[root@master data_hadoop]#
```
The output can be inspected with:

```shell
[root@master data_hadoop]# hadoop fs -cat /flume/filterOutput/part-m-00000
[root@master data_hadoop]# hadoop fs -cat /flume/filterOutput/part-m-00001
[root@master data_hadoop]# hadoop fs -cat /flume/filterOutput/part-m-00002
```
At this point I realized I had confused myself. The data-collection step was never actually performed here, so there is no real Flume collection stage: running Flume over access.log collected only a handful of records, which is when I noticed the mistake, because access.log already contains the collected data. The pipeline is data collection, preprocessing, loading, analysis, and presentation; collection is simply satisfied by the ready-made access.log file, so the work can start from preprocessing.
So, for preprocessing, the finished program can also be run in Eclipse on Windows, with the result shown below (the Flume exercise above was practice and is left in place; readers can skip it as they see fit):
```
2017-12-16 21:51:18,078 INFO  [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(1129)) - session.id is deprecated. Instead, use dfs.metrics.session-id
2017-12-16 21:51:18,083 INFO  [main] jvm.JvmMetrics (JvmMetrics.java:init(76)) - Initializing JVM Metrics with processName=JobTracker, sessionId=
2017-12-16 21:51:18,469 WARN  [main] mapreduce.JobResourceUploader (JobResourceUploader.java:uploadFiles(64)) - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2017-12-16 21:51:18,481 WARN  [main] mapreduce.JobResourceUploader (JobResourceUploader.java:uploadFiles(171)) - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
2017-12-16 21:51:18,616 INFO  [main] input.FileInputFormat (FileInputFormat.java:listStatus(281)) - Total input paths to process : 1
2017-12-16 21:51:18,719 INFO  [main] mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(199)) - number of splits:1
2017-12-16 21:51:18,931 INFO  [main] mapreduce.JobSubmitter (JobSubmitter.java:printTokens(288)) - Submitting tokens for job: job_local616550674_0001
2017-12-16 21:51:19,258 INFO  [main] mapreduce.Job (Job.java:submit(1301)) - The url to track the job: http://localhost:8080/
2017-12-16 21:51:19,259 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1346)) - Running job: job_local616550674_0001
2017-12-16 21:51:19,261 INFO  [Thread-5] mapred.LocalJobRunner (LocalJobRunner.java:createOutputCommitter(471)) - OutputCommitter set in config null
2017-12-16 21:51:19,273 INFO  [Thread-5] mapred.LocalJobRunner (LocalJobRunner.java:createOutputCommitter(489)) - OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
2017-12-16 21:51:19,355 INFO  [Thread-5] mapred.LocalJobRunner (LocalJobRunner.java:runTasks(448)) - Waiting for map tasks
2017-12-16 21:51:19,355 INFO  [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:run(224)) - Starting task: attempt_local616550674_0001_m_000000_0
2017-12-16 21:51:19,412 INFO  [LocalJobRunner Map Task Executor #0] util.ProcfsBasedProcessTree (ProcfsBasedProcessTree.java:isAvailable(181)) - ProcfsBasedProcessTree currently is supported only on Linux.
2017-12-16 21:51:19,479 INFO  [LocalJobRunner Map Task Executor #0] mapred.Task (Task.java:initialize(587)) - Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@75805410
2017-12-16 21:51:19,487 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:runNewMapper(753)) - Processing split: file:/C:/Users/bhlgo/Desktop/input/access.log.fensi:0+3025757
2017-12-16 21:51:20,273 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1367)) - Job job_local616550674_0001 running in uber mode : false
2017-12-16 21:51:20,275 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1374)) -  map 0% reduce 0%
2017-12-16 21:51:21,240 INFO  [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) -
2017-12-16 21:51:21,242 INFO  [LocalJobRunner Map Task Executor #0] mapred.Task (Task.java:done(1001)) - Task:attempt_local616550674_0001_m_000000_0 is done. And is in the process of committing
2017-12-16 21:51:21,315 INFO  [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) -
2017-12-16 21:51:21,315 INFO  [LocalJobRunner Map Task Executor #0] mapred.Task (Task.java:commit(1162)) - Task attempt_local616550674_0001_m_000000_0 is allowed to commit now
2017-12-16 21:51:21,377 INFO  [LocalJobRunner Map Task Executor #0] output.FileOutputCommitter (FileOutputCommitter.java:commitTask(439)) - Saved output of task 'attempt_local616550674_0001_m_000000_0' to file:/C:/Users/bhlgo/Desktop/output/_temporary/0/task_local616550674_0001_m_000000
2017-12-16 21:51:21,395 INFO  [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - map
2017-12-16 21:51:21,395 INFO  [LocalJobRunner Map Task Executor #0] mapred.Task (Task.java:sendDone(1121)) - Task 'attempt_local616550674_0001_m_000000_0' done.
2017-12-16 21:51:21,395 INFO  [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:run(249)) - Finishing task: attempt_local616550674_0001_m_000000_0
2017-12-16 21:51:21,405 INFO  [Thread-5] mapred.LocalJobRunner (LocalJobRunner.java:runTasks(456)) - map task executor complete.
2017-12-16 21:51:22,303 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1374)) -  map 100% reduce 0%
2017-12-16 21:51:22,304 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1385)) - Job job_local616550674_0001 completed successfully
2017-12-16 21:51:22,321 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1392)) - Counters: 18
	File System Counters
		FILE: Number of bytes read=3025930
		FILE: Number of bytes written=2898908
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
	Map-Reduce Framework
		Map input records=14619
		Map output records=14619
		Input split bytes=116
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=40
		CPU time spent (ms)=0
		Physical memory (bytes) snapshot=0
		Virtual memory (bytes) snapshot=0
		Total committed heap usage (bytes)=162529280
	File Input Format Counters
		Bytes Read=3025757
	File Output Format Counters
		Bytes Written=2647097
```
Note on generated files: the output directory (for example, output) is created automatically by the job, so do not create it in advance.
4.3 Clickstream model data processing (the preprocessing program and the model-building programs produce three datasets in total; all of them are used here, with hive tables mapped onto them later. This stage is driven by the scheduling script for the preprocessing-stage MapReduce programs.):
Since many metrics are much easier to derive from the clickstream model, MapReduce programs can be used in the preprocessing stage to generate the clickstream model data.
4.3.1 Clickstream model pageviews table: generating the pageviews model data
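The core of the pageviews model generation can be sketched in plain Java (an illustrative sketch only, not the course's actual MapReduce job; the 30-minute timeout and the tab-separated output layout are assumptions, and per-page stay time is omitted for brevity): walk one visitor's requests in time order, tagging each pageview with a session id and a step number, and start a new session whenever the gap between two consecutive requests exceeds the timeout.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Sketch: assign a session id and visit_step to each pageview of one visitor.
// A gap of more than 30 minutes between consecutive requests starts a new session.
public class PageviewSessionizer {
    static final Duration SESSION_TIMEOUT = Duration.ofMinutes(30);

    public static List<String> sessionize(List<LocalDateTime> times, List<String> urls) {
        List<String> rows = new ArrayList<>();
        String sessionId = UUID.randomUUID().toString();
        int step = 1;
        for (int i = 0; i < times.size(); i++) {
            if (i > 0 && Duration.between(times.get(i - 1), times.get(i))
                                 .compareTo(SESSION_TIMEOUT) > 0) {
                sessionId = UUID.randomUUID().toString(); // gap too large: new visit
                step = 1;
            }
            // session \t visit_step \t request, mirroring the pageviews row layout
            rows.add(sessionId + "\t" + step + "\t" + urls.get(i));
            step++;
        }
        return rows;
    }
}
```

In the real job the same logic runs in a reducer, with the map phase grouping pageviews by visitor IP.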
4.3.2 Clickstream model visit table
Note: one "visit" = N consecutive requests.
Deriving each visitor's per-visit information directly from the raw data with HQL is difficult, so a MapReduce program is used first to distill the visit-level records from the raw data; HQL can then be used for richer multi-dimensional statistics.
A MapReduce program reads the pageviews data and derives, for each visit, its start time, end time and page information.
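The visit derivation amounts to collapsing each session's time-ordered pageviews into a single record. A minimal plain-Java sketch (not the course's ClickStreamVisit job; the columns shown are a simplified subset of the click_stream_visit table: session, inTime, outTime, inPage, outPage, pageVisits):

```java
import java.util.List;

// Sketch: build one visit record from a session's time-ordered pageviews,
// keeping entry/exit time, entry/exit page and the number of pages visited.
public class VisitBuilder {
    public static String toVisit(String session, List<String> times, List<String> pages) {
        int n = times.size();
        // session \t inTime \t outTime \t inPage \t outPage \t pageVisits
        return session + "\t" + times.get(0) + "\t" + times.get(n - 1)
                + "\t" + pages.get(0) + "\t" + pages.get(n - 1) + "\t" + n;
    }
}
```

In the real job the pageviews are grouped by session id in the shuffle, so each reduce call sees exactly one session's rows.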
Method 1: run the jobs by hand. A small problem appeared at this point in development; the finished programs can be executed manually as follows:

hadoop jar webLogPreProcess.jar com.bie.dataStream.hive.mapReduce.pre.WeblogPreProcess /data/weblog/preprocess/input /data/weblog/preprocess/output
hadoop jar webLogPreProcess.jar com.bie.dataStream.hive.mapReduce.pre.WeblogPreValid /data/weblog/preprocess/input /data/weblog/preprocess/valid_output
hadoop jar webLogPreProcess.jar com.bie.dataStream.hive.mapReduce.ClickStream /data/weblog/preprocess/output /data/weblog/preprocess/click_pv_out
hadoop jar webLogPreProcess.jar com.bie.dataStream.hive.mapReduce.pre.ClickStreamVisit /data/weblog/preprocess/click_pv_out /data/weblog/preprocess/click_visit_out

Method 2: schedule the jobs with azkaban:
Next, start my azkaban task scheduler:
[root@master flume]# cd /home/hadoop/azkabantools/server/
[root@master server]# nohup bin/azkaban-web-start.sh 1>/tmp/azstd.out 2>/tmp/azerr.out &
[root@master server]# jps
[root@master server]# cd ../executor/
[root@master executor]# bin/azkaban-executor-start.sh
Then log in to the azkaban web UI in a browser at https://master:8443; the account and password are whatever you configured (mine are admin/admin).
Start your cluster beforehand:
[root@master hadoop]# start-dfs.sh
[root@master hadoop]# start-yarn.sh
Create the input directory in advance (the output directory must not be created, or the job fails):
[root@master hadoop]# hadoop fs -mkdir -p /data/weblog/preprocess/input
Then upload the collected data into that input directory:
[root@master data_hadoop]# hadoop fs -put access.log /data/weblog/preprocess/input
I ran into a small problem with azkaban here, so for now I processed the data manually. Problems just keep coming...
[root@master data_hadoop]# hadoop jar webLogPreProcess.jar com.bie.dataStream.hive.mapReduce.pre.WeblogPreProcess /data/weblog/preprocess/input /data/weblog/preprocess/output
Exception in thread "main" java.io.IOException: No FileSystem for scheme: C
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:498)
	at com.bie.dataStream.hive.mapReduce.pre.WeblogPreProcess.main(WeblogPreProcess.java:94)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
The problem above, hit during manual execution, was caused by the paths in my main method still being the hard-coded Windows-style paths used for local runs (hence "No FileSystem for scheme: C"); the fix is to change them back to the command-line arguments and repackage the jar:
This post only becomes meaningful from here on; everything above was trial and error. I am still running things manually at this point, the priority being to get it working first.
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

// FileInputFormat.setInputPaths(job, new Path("c:/weblog/pageviews"));
// FileOutputFormat.setOutputPath(job, new Path("c:/weblog/visitout"));
The run output is as follows:
[root@master data_hadoop]# hadoop jar webLogPreProcess.jar com.bie.dataStream.hive.mapReduce.pre.WeblogPreProcess /data/weblog/preprocess/input /data/weblog/preprocess/output
17/12/17 14:37:29 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.199.130:8032
17/12/17 14:37:44 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/12/17 14:37:54 INFO input.FileInputFormat: Total input paths to process : 1
17/12/17 14:38:07 INFO mapreduce.JobSubmitter: number of splits:1
17/12/17 14:38:10 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1513489846377_0001
17/12/17 14:38:19 INFO impl.YarnClientImpl: Submitted application application_1513489846377_0001
17/12/17 14:38:19 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1513489846377_0001/
17/12/17 14:38:19 INFO mapreduce.Job: Running job: job_1513489846377_0001
17/12/17 14:39:51 INFO mapreduce.Job: Job job_1513489846377_0001 running in uber mode : false
17/12/17 14:39:51 INFO mapreduce.Job: map 0% reduce 0%
17/12/17 14:40:16 INFO mapreduce.Job: map 100% reduce 0%
17/12/17 14:40:29 INFO mapreduce.Job: Job job_1513489846377_0001 completed successfully
17/12/17 14:40:30 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=106127
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=3025880
		HDFS: Number of bytes written=2626565
		HDFS: Number of read operations=5
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=15389
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=15389
		Total vcore-milliseconds taken by all map tasks=15389
		Total megabyte-milliseconds taken by all map tasks=15758336
	Map-Reduce Framework
		Map input records=14619
		Map output records=14619
		Input split bytes=123
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=201
		CPU time spent (ms)=990
		Physical memory (bytes) snapshot=60375040
		Virtual memory (bytes) snapshot=364576768
		Total committed heap usage (bytes)=17260544
	File Input Format Counters
		Bytes Read=3025757
	File Output Format Counters
		Bytes Written=2626565
[root@master data_hadoop]#
Viewed in the browser:
Clickstream model data processing
Since many metrics are much easier to derive from the clickstream model, MapReduce programs can be used in the preprocessing stage to generate the clickstream model data.
[root@master data_hadoop]# hadoop jar webLogPreProcess.jar com.bie.dataStream.hive.mapReduce.ClickStream /data/weblog/preprocess/output /data/weblog/preprocess/click_pv_out
17/12/17 14:47:33 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.199.130:8032
17/12/17 14:47:43 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/12/17 14:48:16 INFO input.FileInputFormat: Total input paths to process : 1
17/12/17 14:48:18 INFO mapreduce.JobSubmitter: number of splits:1
17/12/17 14:48:21 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1513489846377_0002
17/12/17 14:48:22 INFO impl.YarnClientImpl: Submitted application application_1513489846377_0002
17/12/17 14:48:22 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1513489846377_0002/
17/12/17 14:48:22 INFO mapreduce.Job: Running job: job_1513489846377_0002
17/12/17 14:48:44 INFO mapreduce.Job: Job job_1513489846377_0002 running in uber mode : false
17/12/17 14:48:45 INFO mapreduce.Job: map 0% reduce 0%
17/12/17 14:48:58 INFO mapreduce.Job: map 100% reduce 0%
17/12/17 14:49:39 INFO mapreduce.Job: map 100% reduce 100%
17/12/17 14:49:42 INFO mapreduce.Job: Job job_1513489846377_0002 completed successfully
17/12/17 14:49:43 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=17187
		FILE: Number of bytes written=247953
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=2626691
		HDFS: Number of bytes written=18372
		HDFS: Number of read operations=6
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=10414
		Total time spent by all reduces in occupied slots (ms)=38407
		Total time spent by all map tasks (ms)=10414
		Total time spent by all reduce tasks (ms)=38407
		Total vcore-milliseconds taken by all map tasks=10414
		Total vcore-milliseconds taken by all reduce tasks=38407
		Total megabyte-milliseconds taken by all map tasks=10663936
		Total megabyte-milliseconds taken by all reduce tasks=39328768
	Map-Reduce Framework
		Map input records=14619
		Map output records=76
		Map output bytes=16950
		Map output materialized bytes=17187
		Input split bytes=126
		Combine input records=0
		Combine output records=0
		Reduce input groups=53
		Reduce shuffle bytes=17187
		Reduce input records=76
		Reduce output records=76
		Spilled Records=152
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=327
		CPU time spent (ms)=1600
		Physical memory (bytes) snapshot=205991936
		Virtual memory (bytes) snapshot=730013696
		Total committed heap usage (bytes)=127045632
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=2626565
	File Output Format Counters
		Bytes Written=18372
[root@master data_hadoop]#
The execution output is as follows:
Clickstream model visit table:
Note: one "visit" = N consecutive requests.
Deriving each visitor's per-visit information directly from the raw data with HQL is difficult, so a MapReduce program is used first to distill the visit-level records from the raw data; HQL can then be used for richer multi-dimensional statistics.
A MapReduce program reads the pageviews data and derives, for each visit, its start time, end time and page information:
[root@master data_hadoop]# hadoop jar webLogPreProcess.jar com.bie.dataStream.hive.mapReduce.pre.ClickStreamVisit /data/weblog/preprocess/click_pv_out /data/weblog/preprocess/click_visit_out
17/12/17 15:06:30 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.199.130:8032
17/12/17 15:06:32 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/12/17 15:06:33 INFO input.FileInputFormat: Total input paths to process : 1
17/12/17 15:06:33 INFO mapreduce.JobSubmitter: number of splits:1
17/12/17 15:06:34 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1513489846377_0003
17/12/17 15:06:35 INFO impl.YarnClientImpl: Submitted application application_1513489846377_0003
17/12/17 15:06:35 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1513489846377_0003/
17/12/17 15:06:35 INFO mapreduce.Job: Running job: job_1513489846377_0003
17/12/17 15:06:47 INFO mapreduce.Job: Job job_1513489846377_0003 running in uber mode : false
17/12/17 15:06:47 INFO mapreduce.Job: map 0% reduce 0%
17/12/17 15:07:44 INFO mapreduce.Job: map 100% reduce 0%
17/12/17 15:08:15 INFO mapreduce.Job: map 100% reduce 100%
17/12/17 15:08:18 INFO mapreduce.Job: Job job_1513489846377_0003 completed successfully
17/12/17 15:08:18 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=6
		FILE: Number of bytes written=213705
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=18504
		HDFS: Number of bytes written=0
		HDFS: Number of read operations=6
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=55701
		Total time spent by all reduces in occupied slots (ms)=22157
		Total time spent by all map tasks (ms)=55701
		Total time spent by all reduce tasks (ms)=22157
		Total vcore-milliseconds taken by all map tasks=55701
		Total vcore-milliseconds taken by all reduce tasks=22157
		Total megabyte-milliseconds taken by all map tasks=57037824
		Total megabyte-milliseconds taken by all reduce tasks=22688768
	Map-Reduce Framework
		Map input records=76
		Map output records=0
		Map output bytes=0
		Map output materialized bytes=6
		Input split bytes=132
		Combine input records=0
		Combine output records=0
		Reduce input groups=0
		Reduce shuffle bytes=6
		Reduce input records=0
		Reduce output records=0
		Spilled Records=0
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=325
		CPU time spent (ms)=1310
		Physical memory (bytes) snapshot=203296768
		Virtual memory (bytes) snapshot=730161152
		Total committed heap usage (bytes)=126246912
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=18372
	File Output Format Counters
		Bytes Written=0
[root@master data_hadoop]#
The run output is as follows:
5: Module development: data warehouse design (note: a star schema is adopted; data warehouse concepts and the difference between star and snowflake schemas are assumed knowledge):
A star schema is built from a fact table plus dimension tables. The fact tables are created below; the dimension tables are omitted here and not handled.
Raw data table: t_origin_weblog

field | type | description
valid | string | whether the record is valid
remote_addr | string | visitor IP
remote_user | string | visitor user info
time_local | string | request time
request | string | requested URL
status | string | response code
body_bytes_sent | string | response bytes
http_referer | string | referrer URL
http_user_agent | string | visitor user-agent info

ETL intermediate table: t_etl_referurl

field | type | description
valid | string | whether the record is valid
remote_addr | string | visitor IP
remote_user | string | visitor user info
time_local | string | request time
request | string | requested URL
request_host | string | requested domain
status | string | response code
body_bytes_sent | string | response bytes
http_referer | string | referrer URL
http_user_agent | string | visitor user-agent info

With the referrer URL parsed, the intermediate data also carries:

field | type | description
valid | string | whether the record is valid
remote_addr | string | visitor IP
remote_user | string | visitor user info
time_local | string | request time
request | string | requested URL
status | string | response code
body_bytes_sent | string | response bytes
http_referer | string | referrer (external) URL
http_user_agent | string | visitor user-agent info
host | string | domain of the referrer URL
path | string | path of the referrer URL
query | string | query string of the referrer URL
query_id | string | query-string value of the referrer URL

Access log detail wide table: t_ods_access_detail

field | type | description
valid | string | whether the record is valid
remote_addr | string | visitor IP
remote_user | string | visitor user info
time_local | string | request time
request | string | full requested URL string
request_level1 | string | first-level section of the request
request_level2 | string | second-level section of the request
request_level3 | string | third-level section of the request
status | string | response code
body_bytes_sent | string | response bytes
http_referer | string | referrer URL
http_user_agent | string | visitor user-agent info

and, with the referrer URL, user agent and time fields parsed out:

field | type | description
valid | string | whether the record is valid
remote_addr | string | visitor IP
remote_user | string | visitor user info
time_local | string | request time
request | string | requested URL
status | string | response code
body_bytes_sent | string | response bytes
http_referer | string | referrer (external) URL
http_user_agent | string | full visitor user-agent string
http_user_agent_browser | string | visitor browser
http_user_agent_sys | string | visitor operating system
http_user_agent_dev | string | visitor device
host | string | domain of the referrer URL
path | string | path of the referrer URL
query | string | query string of the referrer URL
query_id | string | query-string value of the referrer URL
daystr | string | full date string
tmstr | string | full time string
month | string | month
day | string | day
hour | string | hour
minute | string | minute
mm | string | partition field: month
dd | string | partition field: day
6: Module development: ETL
The project's data analysis runs on the hadoop cluster and mainly uses the hive data warehouse tool, so the collected and preprocessed data must be loaded into hive for the subsequent mining and analysis.
6.1: Create the raw data tables:
-- Create the source (ODS) table ods_weblog_origin in the hive warehouse.
Create the hive database and tables as follows:

[root@master soft]# cd apache-hive-1.2.1-bin/
[root@master apache-hive-1.2.1-bin]# ls
[root@master apache-hive-1.2.1-bin]# cd bin/
[root@master bin]# ls
[root@master bin]# ./hive
hive> show databases;
hive> create database webLog;
hive> show databases;

# Partitioned by date
hive> create table ods_weblog_origin(valid string,remote_addr string,remote_user string,time_local string,request string,status string,body_bytes_sent string,http_referer string,http_user_agent string)
    > partitioned by (datestr string)
    > row format delimited
    > fields terminated by '\001';
hive> show tables;
hive> desc ods_weblog_origin;

# Clickstream pageview model table: ods_click_pageviews
hive> create table ods_click_pageviews(
    > Session string,
    > remote_addr string,
    > remote_user string,
    > time_local string,
    > request string,
    > visit_step string,
    > page_staylong string,
    > http_referer string,
    > http_user_agent string,
    > body_bytes_sent string,
    > status string)
    > partitioned by (datestr string)
    > row format delimited
    > fields terminated by '\001';
hive> show tables;

# Clickstream visit model table: click_stream_visit
hive> create table click_stream_visit(
    > session string,
    > remote_addr string,
    > inTime string,
    > outTime string,
    > inPage string,
    > outPage string,
    > referal string,
    > pageVisits int)
    > partitioned by (datestr string);
hive> show tables;
6.2: Load the data, as follows:
1. Load the cleaned result data into the source table ods_weblog_origin:

hive> load data inpath '/data/weblog/preprocess/output/part-m-00000' overwrite into table ods_weblog_origin partition(datestr='2017-12-17');
hive> show partitions ods_weblog_origin;
hive> select count(*) from ods_weblog_origin;
hive> select * from ods_weblog_origin;

2. Load the pageview model data into ods_click_pageviews:

hive> load data inpath '/data/weblog/preprocess/click_pv_out/part-r-00000' overwrite into table ods_click_pageviews partition(datestr='2017-12-17');
hive> select count(1) from ods_click_pageviews;

3. Load the visit model data into click_stream_visit:

hive> load data inpath '/data/weblog/preprocess/click_visit_out/part-r-00000' overwrite into table click_stream_visit partition(datestr='2017-12-17');
hive> select count(1) from click_stream_visit;
To be continued...