[imooc in Practice] 8. Entering the world of big data with Spark SQL, using imooc log analysis as an example

Date: 2019-12-01

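User behavior logs: all the behavior data produced each time a user visits the site (visits, page views, searches, clicks, ...)

Also known as user behavior trails or traffic logs

Log contents:

1) System attributes of the visit: operating system, browser, and so on

2) Visit characteristics: the URL clicked, the URL the user came from (referer), time spent on the page, and so on

3) Visit information: session_id, visitor IP (visitor city), and so on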

2013-05-19 13:00:00 http://www.taobao.com/17/?tracker_u=1624169&type=1 B58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1 http://hao.360.cn/ 1.196.34.243
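A minimal sketch of how the sample record above could be split into fields. It assumes the fields are whitespace-separated and appear in the order date, time, URL, session ID, referer, IP. The AccessRecord class and its field names are illustrative and not taken from the course.

```python
from dataclasses import dataclass


@dataclass
class AccessRecord:
    access_time: str
    url: str
    session_id: str
    referer: str
    ip: str


def parse_line(line: str) -> AccessRecord:
    """Split one raw log line into named fields (assumed layout, see above)."""
    parts = line.split()
    if len(parts) != 6:
        # Anything that does not match the assumed layout is treated as dirty data.
        raise ValueError(f"unexpected field count: {len(parts)}")
    date, clock, url, session_id, referer, ip = parts
    return AccessRecord(f"{date} {clock}", url, session_id, referer, ip)


SAMPLE = (
    "2013-05-19 13:00:00 http://www.taobao.com/17/?tracker_u=1624169&type=1 "
    "B58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1 http://hao.360.cn/ 1.196.34.243"
)
record = parse_line(SAMPLE)
print(record.access_time, record.ip)
```

Lines that raise ValueError here are candidates for the cleaning step.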

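Data processing pipeline

1) Data collection

Flume: writes web logs to HDFS

2) Data cleaning

Handles dirty data

Spark, Hive, MapReduce, or another distributed computing framework

The cleaned data can be stored on HDFS (Hive/Spark SQL)

3) Data processing

Run the statistics and analysis that the business requires

Spark, Hive, MapReduce, or another distributed computing framework

4) Storing the results

Results can be stored in an RDBMS or a NoSQL store

5) Data visualization

Present the results graphically: pie charts, bar charts, maps, line charts

ECharts, HUE, Zeppelin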

Logs are normally partitioned when they are processed,

using the access time recorded in the log, for example: d, h, m5 (one partition every 5 minutes)
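As a rough illustration of the d/h/m5 scheme, the sketch below derives the three partition keys from an access time. The key formats (yyyyMMdd for d, two digits for h and m5) and the function name are assumptions made for this example.

```python
from datetime import datetime


def partition_keys(access_time: str) -> dict:
    """Derive day, hour and 5-minute partition keys from 'yyyy-MM-dd HH:mm:ss'."""
    ts = datetime.strptime(access_time, "%Y-%m-%d %H:%M:%S")
    # Round the minute down to the start of its 5-minute bucket.
    bucket = ts.minute - ts.minute % 5
    return {
        "d": ts.strftime("%Y%m%d"),
        "h": ts.strftime("%H"),
        "m5": f"{bucket:02d}",
    }


keys = partition_keys("2013-05-19 13:07:42")
print(keys)
```

A record with access time 13:07:42 therefore falls into the bucket that starts at minute 05.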

Input: access time, accessed URL, traffic consumed, access IP address

Output: URL, cmsType (video/article), cmsId (ID number), traffic, IP, city, access time, day
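The notes do not show the URL layout, so the following is only a sketch of how cmsType and cmsId might be derived. It assumes a hypothetical path of the form /video/<id> or /article/<id>; the host www.example.com and the function name split_cms are placeholders.

```python
from urllib.parse import urlparse

KNOWN_TYPES = ("video", "article")


def split_cms(url: str):
    """Return (cmsType, cmsId) for URLs shaped like http://host/<type>/<id>.

    URLs that do not match the assumed layout yield ("", 0).
    """
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) >= 2 and segments[0] in KNOWN_TYPES and segments[1].isdigit():
        return segments[0], int(segments[1])
    return "", 0


cms = split_cms("http://www.example.com/video/4500?from=home")
print(cms)
```

Returning an empty type for unmatched URLs makes it easy to filter those rows out during cleaning.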

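Use an existing open-source project on GitHub (here, ipdatabase)

1) git clone https://github.com/wzhe06/ipdatabase.git

2) Build the downloaded project: mvn clean package -DskipTests

3) Install the jar into your own local Maven repository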

mvn install:install-file -Dfile=/Users/rocky/source/ipdatabase/target/ipdatabase-1.0-SNAPSHOT.jar -DgroupId=com.ggstar -DartifactId=ipdatabase -Dversion=1.0 -Dpackaging=jar

java.io.FileNotFoundException:

file:/Users/rocky/maven_repos/com/ggstar/ipdatabase/1.0/ipdatabase-1.0.jar!/ipRegion.xlsx (No such file or directory)

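Tuning points:

1) Control the size of the output files: coalesce

2) Adjust the data type of partition columns: spark.sql.sources.partitionColumnTypeInference.enabled

3) Insert rows into the database in bulk and commit them as a batch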
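To make the third tuning point concrete, here is a minimal sketch of a batched insert with a single commit. It uses Python's built-in sqlite3 as a stand-in for the real database so that it runs anywhere; the sample rows are made up for illustration.

```python
import sqlite3

DDL = """
create table day_video_access_topn_stat (
  day varchar(8) not null,
  cms_id bigint(10) not null,
  times bigint(10) not null,
  primary key (day, cms_id)
)
"""

INSERT_SQL = (
    "insert into day_video_access_topn_stat (day, cms_id, times) values (?, ?, ?)"
)


def insert_batch(conn, rows):
    """Send all rows in one batch and commit once, instead of once per row."""
    with conn:  # commits on success, rolls back if any row fails
        conn.executemany(INSERT_SQL, rows)


conn = sqlite3.connect(":memory:")
conn.execute(DDL)
# Made-up sample rows: (day, cms_id, times)
sample_rows = [("20170511", 1001, 50), ("20170511", 1002, 30), ("20170511", 1003, 10)]
insert_batch(conn, sample_rows)
stored = conn.execute("select count(*) from day_video_access_topn_stat").fetchone()[0]
print(stored)
```

Committing once per batch avoids paying the transaction overhead for every single row.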

create table day_video_access_topn_stat (
  day varchar(8) not null,
  cms_id bigint(10) not null,
  times bigint(10) not null,
  primary key (day, cms_id)
);

create table day_video_city_access_topn_stat (
  day varchar(8) not null,
  cms_id bigint(10) not null,
  city varchar(20) not null,
  times bigint(10) not null,
  times_rank int not null,
  primary key (day, cms_id, city)
);

create table day_video_traffics_topn_stat (
  day varchar(8) not null,
  cms_id bigint(10) not null,
  traffics bigint(20) not null,
  primary key (day, cms_id)
);
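The notes do not say how times_rank is computed. One plausible reading is that it ranks each video within its day and city by access count, highest first. The sketch below works under that assumption, with made-up rows and hypothetical function names.

```python
from collections import defaultdict


def rank_by_city(rows):
    """rows: (day, cms_id, city, times) tuples.

    Returns (day, cms_id, city, times, times_rank) tuples, where rank 1 is
    the most accessed video within each (day, city) group.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row[0], row[2])].append(row)
    ranked = []
    for group in groups.values():
        group.sort(key=lambda r: r[3], reverse=True)
        for rank, row in enumerate(group, start=1):
            ranked.append(row + (rank,))
    return ranked


def top_n(ranked, n):
    """Keep only the rows whose rank is within the top n."""
    return [row for row in ranked if row[4] <= n]


# Made-up sample rows: (day, cms_id, city, times)
rows = [
    ("20170511", 1, "Beijing", 30),
    ("20170511", 2, "Beijing", 50),
    ("20170511", 1, "Shanghai", 10),
]
ranked = rank_by_city(rows)
print(top_n(ranked, 1))
```

The same grouping and ordering is what a per-city window function would express in SQL.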

Data visualization: the greatest value of a picture is that it lets us actually see far more than we expected to see

Common visualization frameworks

1) ECharts

2) Highcharts

3) D3.js

4) HUE

5) Zeppelin

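Spark supports 4 run modes:

1) Local: used during development

2) Standalone: ships with Spark; if a cluster runs in Standalone mode, the Spark environment has to be deployed on every machine in the cluster

3) YARN: recommended for production, so that YARN handles resource scheduling for all jobs on the cluster (MR, Spark) in one place

4) Mesos

Whichever mode is used, the code of the Spark application is exactly the same; the run mode is simply chosen with the --master parameter at submit time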
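The point that only the submit command changes can be sketched with a small helper that assembles the spark-submit arguments. The helper name and the jar path spark-examples.jar are placeholders; --class, --master and --deploy-mode are standard spark-submit options.

```python
import shlex


def spark_submit_command(master, main_class, app_jar, deploy_mode=None):
    """Build the argument list for spark-submit; only master/deploy mode vary."""
    cmd = ["./bin/spark-submit", "--class", main_class, "--master", master]
    if deploy_mode is not None:
        cmd += ["--deploy-mode", deploy_mode]
    cmd.append(app_jar)
    return cmd


MAIN_CLASS = "org.apache.spark.examples.SparkPi"
APP_JAR = "spark-examples.jar"  # placeholder path

local_cmd = spark_submit_command("local[2]", MAIN_CLASS, APP_JAR)
yarn_cmd = spark_submit_command("yarn", MAIN_CLASS, APP_JAR, deploy_mode="cluster")

for cmd in (local_cmd, yarn_cmd):
    print(" ".join(shlex.quote(part) for part in cmd))
```

The application class and jar are identical in both commands; only the master and deploy mode differ.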

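Client mode

The Driver runs on the client (the machine that submits the Spark job)

The client communicates with the containers it has been granted in order to schedule and run the job, so the client must not exit

Log output goes to the console, which is convenient for testing

Cluster mode

The Driver runs inside the ApplicationMaster

The client can be closed as soon as the job has been submitted, because the job is already running on YARN

Logs are not visible in the terminal, because they live on the Driver; they can only be fetched with yarn logs -applicationId <application_id>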

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --executor-memory 1G \
  --num-executors 1 \
  /home/hadoop/app/spark-2.1.0-bin-2.6.0-cdh5.7.0/examples/jars/spark-examples_2.11-2.1.0.jar

Here, yarn means YARN client mode

For YARN cluster mode, use yarn-cluster instead (since Spark 2.0 the yarn-cluster master is deprecated in favor of --master yarn --deploy-mode cluster)

Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.

To run on YARN, either HADOOP_CONF_DIR or YARN_CONF_DIR must be set, in one of two ways:

1) export HADOOP_CONF_DIR=/home/hadoop/app/hadoop-2.6.0-cdh5.7.0/etc/hadoop

2) set it in $SPARK_HOME/conf/spark-env.sh

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --executor-memory 1G \
  --num-executors 1 \
  /home/hadoop/app/spark-2.1.0-bin-2.6.0-cdh5.7.0/examples/jars/spark-examples_2.11-2.1.0.jar

yarn logs -applicationId application_1495632775836_0002

When packaging, note that the following plugin has to be added to pom.xml

<artifactId>maven-assembly-plugin</artifactId>