Log Data Analysis
1. Background
1.1 The hm forum logs consist of two parts: originally a single large file of 56 GB; after that, one file is generated per day, roughly 150-200 MB each.
1.2 The logs are in the Apache common log format; each record consists of 5 parts: client IP, access time, requested resource, response status, and bytes transferred. Example: 27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/faq.gif HTTP/1.1" 200 1127
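The five fields of a record can be pulled out of the sample line above with ordinary text tools; a minimal sketch (field boundaries assumed from the sample format, not taken from any parser in this project):

```shell
line='27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/faq.gif HTTP/1.1" 200 1127'

# ip: the first whitespace-separated field
ip=$(echo "$line" | awk '{print $1}')
# time: the text between "[" and " +0800]"
time=$(echo "$line" | sed 's/.*\[\([^ ]*\) +0800\].*/\1/')
# url: the text between the double quotes (request line)
url=$(echo "$line" | sed 's/[^"]*"\([^"]*\)".*/\1/')
# status and traffic: the two fields after the closing quote
status=$(echo "$line" | awk -F'" ' '{print $2}' | awk '{print $1}')
traffic=$(echo "$line" | awk -F'" ' '{print $2}' | awk '{print $2}')

echo "ip=$ip time=$time url=$url status=$status traffic=$traffic"
```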
1.3 Analyze a set of core metrics for use by operations decision makers.
1.4 The purpose of this system is to obtain business metrics that cannot be obtained from third-party tools. (Third-party tool: Baidu Analytics)
2. Development steps
2.1 Upload the log data to HDFS for processing
If the log server holds little data and is under light load, the data can be uploaded to HDFS directly with shell commands;
If the log server holds more data and is under heavier load, upload the data via NFS from another server. (NFS (Network File System) is one of the file systems supported by FreeBSD; it lets computers on a network share resources over TCP/IP. With NFS, a local NFS client can transparently read and write files located on a remote NFS server, just as if they were local files.)
If there are very many log servers with a large data volume, use Flume for data collection;
2.2 Use MapReduce to clean the raw data in HDFS;
2.3 Use Hive to run statistics over the cleaned data;
2.4 Use Sqoop to export the statistics produced by Hive to MySQL; metric queries -- MySQL
2.5 If users need to view detailed records, they can be served from HBase; detail queries -- HBase
3. Workflow code (the detailed hands-on steps follow further below)
3.1 Use shell commands to upload the data from the Linux disk to HDFS
3.1.1 Create a directory in HDFS:
$HADOOP_HOME/bin/hadoop fs -mkdir /hmbbs_logs
3.1.2 Write a shell script called upload_to_hdfs.sh, roughly as follows:
yesterday=`date --date='1 days ago' +%Y_%m_%d`
hadoop fs -put /apache_logs/access_${yesterday}.log /hmbbs_logs
3.1.3 Register upload_to_hdfs.sh in crontab; run crontab -e and add the following (it then runs once at 1 AM every day):
0 1 * * * upload_to_hdfs.sh
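The backquoted date expression in the script produces yesterday's date in the underscore format used in the log file names. Pinning the reference date (GNU date) makes the behaviour easy to check:

```shell
# Yesterday relative to a fixed reference date, in the log-file naming format
d=$(date --date='2013-05-31 1 days ago' +%Y_%m_%d)
echo "$d"    # 2013_05_30
```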
3.2 Use MapReduce to clean the data; the cleaned output goes under /hmbbs_cleaned in HDFS, with one subdirectory generated per day.
3.3 Use Hive to run statistics over the cleaned data.
3.3.1 Create an external partitioned table:
CREATE EXTERNAL TABLE hmbbs(ip string, atime string, url string) PARTITIONED BY (logdate string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/hmbbs_cleaned';
3.3.2 Add a partition:
ALTER TABLE hmbbs ADD PARTITION(logdate='2013_05_30') LOCATION '/hmbbs_cleaned/2013_05_30';
Add this to upload_to_hdfs.sh:
hive -e "ALTER TABLE hmbbs ADD PARTITION(logdate='${yesterday}') LOCATION '/hmbbs_cleaned/${yesterday}';"
3.3.3 Compute the daily PV (page views):
CREATE TABLE hmbbs_pv_2013_05_30 AS SELECT COUNT(1) AS PV FROM hmbbs WHERE logdate='2013_05_30';
Compute the daily number of user registrations:
CREATE TABLE hmbbs_reguser_2013_05_30 AS SELECT COUNT(1) AS REGUSER FROM hmbbs WHERE logdate='2013_05_30' AND INSTR(url,'member.php?mod=register')>0;
Compute the daily unique IPs:
CREATE TABLE hmbbs_ip_2013_05_30 AS SELECT COUNT(DISTINCT ip) AS IP FROM hmbbs WHERE logdate='2013_05_30';
Compute the daily bounce users (IPs with exactly one request):
CREATE TABLE hmbbs_jumper_2013_05_30 AS SELECT COUNT(1) AS jumper FROM (SELECT COUNT(ip) AS times FROM hmbbs WHERE logdate='2013_05_30' GROUP BY ip HAVING times=1) e;
Put the day's statistics into a single table:
CREATE TABLE hmbbs_2013_05_30 AS SELECT '2013_05_30', a.pv, b.reguser, c.ip, d.jumper FROM hmbbs_pv_2013_05_30 a JOIN hmbbs_reguser_2013_05_30 b ON 1=1 JOIN hmbbs_ip_2013_05_30 c ON 1=1 JOIN hmbbs_jumper_2013_05_30 d ON 1=1 ;
3.4 Use Sqoop to export the data to MySQL
*********************************************
Detailed log data analysis steps (steps actually carried out and verified):
1. Use shell to upload the data from the Linux disk to HDFS
Create a directory on Linux under /usr/local/: mkdir apache_logs/, then copy two days' worth of log data into that folder.
Create the directory in HDFS to hold the data: hadoop fs -mkdir /hmbbs_logs
hadoop fs -put /usr/local/apache_logs/* /hmbbs_logs
After the upload, the two log files can be seen under the /hmbbs_logs/ directory at hadoop0:50070.
Create an upload script in the /apache_logs directory: vi upload_to_hdfs.sh
#!/bin/sh
#get yesterday format string
yesterday=`date --date='1 days ago' +%Y_%m_%d`
#upload logs to hdfs
hadoop fs -put /apache_logs/access_${yesterday}.log /hmbbs_logs
Register upload_to_hdfs.sh in crontab with crontab -e (the script then runs at 1 AM every day):
0 1 * * * upload_to_hdfs.sh
2. Write the MapReduce cleaning code in Eclipse, then export it as cleaned.jar to the /apache_logs directory on Linux.
package hmbbs;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Cleans the raw log data.
 * @author ahu_lichang
 */
public class HmbbsCleaner extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        final Job job = new Job(new Configuration(),
                HmbbsCleaner.class.getSimpleName());
        job.setJarByClass(HmbbsCleaner.class);
        FileInputFormat.setInputPaths(job, args[0]);
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new HmbbsCleaner(), args);
    }

    static class MyMapper extends
            Mapper<LongWritable, Text, LongWritable, Text> {
        LogParser logParser = new LogParser();
        Text v2 = new Text();

        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            final String[] parsed = logParser.parse(value.toString());

            // Filter out requests for static resources
            if (parsed[2].startsWith("GET /static/")
                    || parsed[2].startsWith("GET /uc_server")) {
                return;
            }

            // Strip the leading "GET /" or "POST /" prefix
            if (parsed[2].startsWith("GET /")) {
                parsed[2] = parsed[2].substring("GET /".length());
            } else if (parsed[2].startsWith("POST /")) {
                parsed[2] = parsed[2].substring("POST /".length());
            }

            // Strip the trailing " HTTP/1.1" suffix
            if (parsed[2].endsWith(" HTTP/1.1")) {
                parsed[2] = parsed[2].substring(0, parsed[2].length()
                        - " HTTP/1.1".length());
            }

            v2.set(parsed[0] + "\t" + parsed[1] + "\t" + parsed[2]);
            context.write(key, v2);
        }
    }

    static class MyReducer extends
            Reducer<LongWritable, Text, Text, NullWritable> {
        protected void reduce(LongWritable k2, Iterable<Text> v2s,
                Context context)
                throws java.io.IOException, InterruptedException {
            for (Text v2 : v2s) {
                context.write(v2, NullWritable.get());
            }
        }
    }

    static class LogParser {
        public static final SimpleDateFormat FORMAT = new SimpleDateFormat(
                "d/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
        public static final SimpleDateFormat dateformat1 = new SimpleDateFormat(
                "yyyyMMddHHmmss");

        public static void main(String[] args) throws ParseException {
            final String S1 = "27.19.74.143 - - [30/May/2013:17:38:20 +0800] \"GET /static/image/common/faq.gif HTTP/1.1\" 200 1127";
            LogParser parser = new LogParser();
            final String[] array = parser.parse(S1);
            System.out.println("Sample line: " + S1);
            System.out.format(
                    "Parsed result: ip=%s, time=%s, url=%s, status=%s, traffic=%s",
                    array[0], array[1], array[2], array[3], array[4]);
        }

        /**
         * Parses the English-locale timestamp string.
         */
        private Date parseDateFormat(String string) {
            Date parse = null;
            try {
                parse = FORMAT.parse(string);
            } catch (ParseException e) {
                e.printStackTrace();
            }
            return parse;
        }

        /**
         * Parses one log record.
         * @return an array of 5 elements: ip, time, url, status, traffic
         */
        public String[] parse(String line) {
            String ip = parseIP(line);
            String time = parseTime(line);
            String url = parseURL(line);
            String status = parseStatus(line);
            String traffic = parseTraffic(line);

            return new String[] { ip, time, url, status, traffic };
        }

        private String parseTraffic(String line) {
            final String trim = line.substring(line.lastIndexOf("\"") + 1)
                    .trim();
            String traffic = trim.split(" ")[1];
            return traffic;
        }

        private String parseStatus(String line) {
            final String trim = line.substring(line.lastIndexOf("\"") + 1)
                    .trim();
            String status = trim.split(" ")[0];
            return status;
        }

        private String parseURL(String line) {
            final int first = line.indexOf("\"");
            final int last = line.lastIndexOf("\"");
            String url = line.substring(first + 1, last);
            return url;
        }

        private String parseTime(String line) {
            final int first = line.indexOf("[");
            final int last = line.indexOf("+0800]");
            String time = line.substring(first + 1, last).trim();
            Date date = parseDateFormat(time);
            return dateformat1.format(date);
        }

        private String parseIP(String line) {
            String ip = line.split("- -")[0].trim();
            return ip;
        }
    }
}
vi upload_to_hdfs.sh
#!/bin/sh
#get yesterday format string
#yesterday=`date --date='1 days ago' +%Y_%m_%d`
#testing cleaning data
yesterday=$1
#upload logs to hdfs
hadoop fs -put /apache_logs/access_${yesterday}.log /hmbbs_logs
#cleaning data
hadoop jar cleaned.jar /hmbbs_logs/access_${yesterday}.log /hmbbs_cleaned/${yesterday}
Make the script executable: chmod u+x upload_to_hdfs.sh
Run it: upload_to_hdfs.sh 2013_05_30
The cleaned data uploaded to HDFS can then be seen in the browser at hadoop0:50070.
3. Use Hive to run statistics over the cleaned data.
Create an external partitioned table:
CREATE EXTERNAL TABLE hmbbs(ip string, atime string, url string) PARTITIONED BY (logdate string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/hmbbs_cleaned';
Add a partition:
ALTER TABLE hmbbs ADD PARTITION(logdate='2013_05_30') LOCATION '/hmbbs_cleaned/2013_05_30';
Add the following to upload_to_hdfs.sh (one partition is created per day):
#alter hive table and then add partition to existed table
hive -e "ALTER TABLE hmbbs ADD PARTITION(logdate='${yesterday}') LOCATION '/hmbbs_cleaned/${yesterday}';"
------ hive -e "statement;" runs the statement from the ordinary shell, without having to enter the interactive Hive CLI.
So from the shell you can run: hive -e "ALTER TABLE hmbbs ADD PARTITION(logdate='2013_05_31') LOCATION '/hmbbs_cleaned/2013_05_31';"
This adds a 2013_05_31 directory under the hmbbs table's location.
select count(1) from hmbbs ----- by watching this count change, you can tell whether the partition was added successfully.
Compute the daily PV:
CREATE TABLE hmbbs_pv_2013_05_30 AS SELECT COUNT(1) AS PV FROM hmbbs WHERE logdate='2013_05_30';
Run hive -e "SELECT COUNT(1) FROM hmbbs WHERE logdate='2013_05_30';" to get the row count for that day, to verify against later.
Run hive -e "CREATE TABLE hmbbs_pv_2013_05_30 AS SELECT COUNT(1) AS PV FROM hmbbs WHERE logdate='2013_05_30';" to store the count (under the alias PV) in the table hmbbs_pv_2013_05_30.
Verify that the table was populated: hive -e "select * from hmbbs_pv_2013_05_30;"
Compute the daily number of user registrations:
CREATE TABLE hmbbs_reguser_2013_05_30 AS SELECT COUNT(1) AS REGUSER FROM hmbbs WHERE logdate='2013_05_30' AND INSTR(url,'member.php?mod=register')>0;
INSTR(url,'member.php?mod=register') is a function that returns the position of the first occurrence of the substring member.php?mod=register in url (0 if it does not occur), so INSTR(...)>0 means the URL contains the registration link.
Running hive -e "SELECT COUNT(1) AS REGUSER FROM hmbbs WHERE logdate='2013_05_30' AND INSTR(url,'member.php?mod=register')>0;" gives the registration count for that day. This number is certainly smaller than the PV count above!
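Hive's INSTR behaves like awk's index(): both return the 1-based position of the first match, or 0 when the substring is absent. A quick local check of the semantics (URLs made up for illustration):

```shell
# index() returns the 1-based position of the substring, or 0 if absent
awk 'BEGIN {
    print index("home.php?mod=space", "member.php?mod=register")        # 0: no match
    print index("member.php?mod=register&x=1", "member.php?mod=register") # 1: match at start
}'
```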
Compute the daily unique IPs (deduplicated):
CREATE TABLE hmbbs_ip_2013_05_30 AS SELECT COUNT(DISTINCT ip) AS IP FROM hmbbs WHERE logdate='2013_05_30';
Query the number of unique IPs in Hive: SELECT COUNT(DISTINCT ip) AS IP FROM hmbbs WHERE logdate='2013_05_30';
Run hive -e "CREATE TABLE hmbbs_ip_2013_05_30 AS SELECT COUNT(DISTINCT ip) AS IP FROM hmbbs WHERE logdate='2013_05_30';"
Compute the daily bounce users:
CREATE TABLE hmbbs_jumper_2013_05_30 AS SELECT COUNT(1) AS jumper FROM (SELECT COUNT(ip) AS times FROM hmbbs WHERE logdate='2013_05_30' GROUP BY ip HAVING times=1) e;
Query in Hive for the IPs that made exactly one request: SELECT COUNT(1) AS jumper FROM (SELECT COUNT(ip) AS times FROM hmbbs WHERE logdate='2013_05_30' GROUP BY ip HAVING times=1) e; --- e is a table alias for the subquery
Run hive -e "CREATE TABLE hmbbs_jumper_2013_05_30 AS SELECT COUNT(1) AS jumper FROM (SELECT COUNT(ip) AS times FROM hmbbs WHERE logdate='2013_05_30' GROUP BY ip HAVING times=1) e;"
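The four Hive metrics can be sanity-checked on a tiny cleaned sample with ordinary shell tools. The sample below (tab-separated ip/time/url, in the cleaned-data layout) is made up for illustration:

```shell
# Hypothetical cleaned sample: ip, time, url separated by tabs
printf '%s\t%s\t%s\n' \
    1.1.1.1 20130530080000 forum.php \
    1.1.1.1 20130530080100 'member.php?mod=register' \
    2.2.2.2 20130530090000 forum.php \
    3.3.3.3 20130530100000 'member.php?mod=register' > /tmp/sample.tsv

pv=$(wc -l < /tmp/sample.tsv)                                  # PV: every record is a page view
ip=$(cut -f1 /tmp/sample.tsv | sort -u | wc -l)                # distinct IPs
reguser=$(grep -c 'member.php?mod=register' /tmp/sample.tsv)   # registration requests
jumper=$(cut -f1 /tmp/sample.tsv | sort | uniq -c | awk '$1 == 1' | wc -l)  # IPs seen exactly once

echo "pv=$pv ip=$ip reguser=$reguser jumper=$jumper"
```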
Put the day's statistics into a single table (via table joins):
CREATE TABLE hmbbs_2013_05_30 AS SELECT '2013_05_30', a.pv, b.reguser, c.ip, d.jumper FROM hmbbs_pv_2013_05_30 a JOIN hmbbs_reguser_2013_05_30 b ON 1=1 JOIN hmbbs_ip_2013_05_30 c ON 1=1 JOIN hmbbs_jumper_2013_05_30 d ON 1=1 ;
Once created, check it:
show tables;
select * from hmbbs_2013_05_30 ;
Use Sqoop to export the data in hmbbs_2013_05_30 to MySQL. (After the export succeeds, the five intermediate tables above can be dropped.)
In a MySQL client tool, connect to hadoop0, create a database hmbbs, and in it create a table hmbbs_logs_stat with the five exported columns: logdate varchar NOT NULL, pv int, reguser int, ip int, jumper int
Note: if creating the database fails, it is likely a remote-login privilege problem!
sqoop export --connect jdbc:mysql://hadoop0:3306/hmbbs --username root --password admin --table hmbbs_logs_stat --fields-terminated-by '\001' --export-dir '/hive/hmbbs_2013_05_30'
---- '\001' is Hive's default column separator. The export directory (e.g. /user/hive/warehouse/hmbbs_2013_05_30) depends on your own configuration and is not necessarily the same!
After the export succeeds, refresh the table in the client tool to see the data.
The statistics and export steps should also go into the script file:
vi upload_to_hdfs.sh
#create hive tables every day
hive -e "CREATE TABLE hmbbs_pv_${yesterday} AS SELECT COUNT(1) AS PV FROM hmbbs WHERE logdate='${yesterday}';"
hive -e "CREATE TABLE hmbbs_reguser_${yesterday} AS SELECT COUNT(1) AS REGUSER FROM hmbbs WHERE logdate='${yesterday}' AND INSTR(url,'member.php?mod=register')>0;"
hive -e "CREATE TABLE hmbbs_ip_${yesterday} AS SELECT COUNT(DISTINCT ip) AS IP FROM hmbbs WHERE logdate='${yesterday}';"
hive -e "CREATE TABLE hmbbs_jumper_${yesterday} AS SELECT COUNT(1) AS jumper FROM (SELECT COUNT(ip) AS times FROM hmbbs WHERE logdate='${yesterday}' GROUP BY ip HAVING times=1) e;"
hive -e "CREATE TABLE hmbbs_${yesterday} AS SELECT '${yesterday}', a.pv, b.reguser, c.ip, d.jumper FROM hmbbs_pv_${yesterday} a JOIN hmbbs_reguser_${yesterday} b ON 1=1 JOIN hmbbs_ip_${yesterday} c ON 1=1 JOIN hmbbs_jumper_${yesterday} d ON 1=1 ;"
#delete intermediate hive tables
hive -e "drop table hmbbs_pv_${yesterday}"
hive -e "drop table hmbbs_reguser_${yesterday}"
hive -e "drop table hmbbs_ip_${yesterday}"
hive -e "drop table hmbbs_jumper_${yesterday}"
#sqoop export to mysql
sqoop export --connect jdbc:mysql://hadoop0:3306/hmbbs --username root --password admin --table hmbbs_logs_stat --fields-terminated-by '\001' --export-dir "/hive/hmbbs_${yesterday}"
#delete the daily summary hive table
hive -e "drop table hmbbs_${yesterday}"
Polish the shell scripts for execution:
1. An initialization script for the historical data
2. A script that runs daily
mv upload_to_hdfs.sh hmbbs_core.sh
vi hmbbs_daily.sh
#!/bin/sh
yesterday=`date --date='1 days ago' +%Y_%m_%d`
hmbbs_core.sh $yesterday
chmod u+x hmbbs_daily.sh
crontab -e
0 1 * * * /apache_logs/hmbbs_daily.sh
vi hmbbs_init.sh
#!/bin/sh
#hive -e "CREATE EXTERNAL TABLE hmbbs(ip string, atime string, url string) PARTITIONED BY (logdate string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/hmbbs_cleaned';"
s1=`date --date="$1" +%s`
s2=`date +%s`
s3=$(((s2-s1)/3600/24))
for ((i=$s3;i>0;i--))
do
tmp=`date --date="$i days ago" +%Y_%m_%d`
echo $tmp
done
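The init script above converts the start date ($1) and the current date to epoch seconds, divides the difference by 3600*24 to get a day count, then counts down to regenerate each historical date (the loop body would presumably call hmbbs_core.sh $tmp rather than just echo it). With two fixed dates substituted for "$1" and "now", the arithmetic is reproducible (bash, GNU date):

```shell
# Fixed dates instead of "$1" and the current date, so the result is reproducible
s1=$(date --date='2013-05-28' +%s)
s2=$(date --date='2013-05-31' +%s)
s3=$(( (s2 - s1) / 3600 / 24 ))
echo "days: $s3"

# Regenerate each missing day, oldest first
for ((i = s3; i > 0; i--)); do
    date --date="2013-05-31 $i days ago" +%Y_%m_%d
done
```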