Website Log Analysis Project Case Study (2): Data Cleaning

1. Data Overview

1.1 Recap of the Data

The forum data comes in two parts:

(1) Historical data, about 56 GB in total, covering everything up to 2012-05-29. Until that date, all log records were appended to a single file.

(2) From 2012-05-30 onward, a new data file is generated every day, each about 150 MB. From that date on, the logs are no longer kept in one file.

Figure 1 shows the record format of the log data. Each line consists of five fields: visitor IP, access time, requested resource, response status (HTTP status code), and the traffic of the request.

(Figure 1: format of a log record)

This case study uses two log files from 2013, access_2013_05_30.log and access_2013_05_31.log, available at: http://pan.baidu.com/s/1pJE7XR9

1.2 The Data to Be Cleaned

(1) According to the key-indicator analysis in the previous article, none of the statistics we want involve the response status (HTTP status code) or the per-request traffic, so these two fields can be dropped first.

(2) Given the log record format, the date needs to be converted into the ordinary form we are used to, such as 20150426, so we write a class that converts the date of each log record.

(3) Requests for static resources are meaningless for our analysis, so records whose resource field starts with "GET /static/" or "GET /uc_server" are filtered out. The literal strings GET and POST carry no information for us either, so they are stripped as well.

2. The Data Cleaning Process

2.1 Uploading Logs to HDFS on a Schedule

First, the log data must be uploaded to HDFS for processing. There are several typical situations:

(1) If the log server holds little data and is under light load, the data can be uploaded to HDFS directly with shell commands.

(2) If the log server holds more data and is under heavier load, use NFS to upload the data from another machine.

(3) If there are many log servers and the data volume is large, use Flume to collect and move the data.

Our experimental data files are small, so we take the first approach and use plain shell commands. Because a log file is produced every day, we also need a scheduled task that automatically uploads the previous day's log file to the target HDFS directory at 1:00 the next morning. We therefore create a shell script, techbbs_core.sh, and schedule it with crontab:

#!/bin/sh

#step1. get yesterday's date string
yesterday=$(date --date='1 days ago' +%Y_%m_%d)

#step2. upload the log to HDFS
hadoop fs -put /usr/local/files/apache_logs/access_${yesterday}.log /project/techbbs/data

Then schedule the script to run at 1:00 every day with crontab -e, adding the following entry (0 1 * * * means minute 0 of hour 1, i.e., 01:00 daily; techbbs_core.sh is the script to execute):

0 1 * * * techbbs_core.sh

To verify, run crontab -l to list the scheduled jobs.

2.2 Writing the MapReduce Program That Cleans the Logs

(1) Write a log-parser class that parses the five fields of each record separately (it requires java.text.SimpleDateFormat, java.text.ParseException, java.util.Date, and java.util.Locale):

static class LogParser {
    public static final SimpleDateFormat FORMAT = new SimpleDateFormat(
            "d/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
    public static final SimpleDateFormat dateformat1 = new SimpleDateFormat(
            "yyyyMMddHHmmss");

    /**
     * Parse the English-style time string.
     */
    private Date parseDateFormat(String string) {
        Date parse = null;
        try {
            parse = FORMAT.parse(string);
        } catch (ParseException e) {
            e.printStackTrace();
        }
        return parse;
    }

    /**
     * Parse one log line.
     *
     * @return an array of 5 elements: ip, time, url, status, traffic
     */
    public String[] parse(String line) {
        String ip = parseIP(line);
        String time = parseTime(line);
        String url = parseURL(line);
        String status = parseStatus(line);
        String traffic = parseTraffic(line);
        return new String[] { ip, time, url, status, traffic };
    }

    private String parseTraffic(String line) {
        final String trim = line.substring(line.lastIndexOf("\"") + 1).trim();
        String traffic = trim.split(" ")[1];
        return traffic;
    }

    private String parseStatus(String line) {
        final String trim = line.substring(line.lastIndexOf("\"") + 1).trim();
        String status = trim.split(" ")[0];
        return status;
    }

    private String parseURL(String line) {
        final int first = line.indexOf("\"");
        final int last = line.lastIndexOf("\"");
        String url = line.substring(first + 1, last);
        return url;
    }

    private String parseTime(String line) {
        final int first = line.indexOf("[");
        final int last = line.indexOf("+0800]");
        String time = line.substring(first + 1, last).trim();
        Date date = parseDateFormat(time);
        return dateformat1.format(date);
    }

    private String parseIP(String line) {
        String ip = line.split("- -")[0].trim();
        return ip;
    }
}
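As a quick sanity check, the parser can be exercised on a single record. The sketch below is illustrative only: the sample line mimics the format in Figure 1 rather than quoting the real data set, and it assumes LogParser is visible from the calling class (e.g., as a nested class of the job class shown later):

import java.util.Arrays;

public class LogParserDemo {
    public static void main(String[] args) {
        // An illustrative record in the Figure 1 format:
        // ip, time, request, status, traffic
        String line = "110.52.250.126 - - [30/May/2013:17:38:20 +0800] "
                + "\"GET /data/cache/style_1_widthauto.css?y7a HTTP/1.1\" 200 1292";
        String[] parsed = new LogParser().parse(line);
        // Expected output:
        // [110.52.250.126, 20130530173820,
        //  GET /data/cache/style_1_widthauto.css?y7a HTTP/1.1, 200, 1292]
        System.out.println(Arrays.toString(parsed));
    }
}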
(2) Write the MapReduce program that filters all the records of the given log file.

Mapper class:

static class MyMapper extends
        Mapper<LongWritable, Text, LongWritable, Text> {
    LogParser logParser = new LogParser();
    Text outputValue = new Text();

    protected void map(LongWritable key, Text value, Context context)
            throws java.io.IOException, InterruptedException {
        final String[] parsed = logParser.parse(value.toString());

        // step1. filter out requests for static resources
        if (parsed[2].startsWith("GET /static/")
                || parsed[2].startsWith("GET /uc_server")) {
            return;
        }
        // step2. strip the leading "GET /" or "POST /"
        if (parsed[2].startsWith("GET /")) {
            parsed[2] = parsed[2].substring("GET /".length());
        } else if (parsed[2].startsWith("POST /")) {
            parsed[2] = parsed[2].substring("POST /".length());
        }
        // step3. strip the trailing " HTTP/1.1"
        if (parsed[2].endsWith(" HTTP/1.1")) {
            parsed[2] = parsed[2].substring(0,
                    parsed[2].length() - " HTTP/1.1".length());
        }
        // step4. write out only the first three fields
        outputValue.set(parsed[0] + "\t" + parsed[1] + "\t" + parsed[2]);
        context.write(key, outputValue);
    }
}

Reducer class:

static class MyReducer extends
        Reducer<LongWritable, Text, Text, NullWritable> {
    protected void reduce(LongWritable k2, Iterable<Text> v2s,
            Context context)
            throws java.io.IOException, InterruptedException {
        for (Text v2 : v2s) {
            context.write(v2, NullWritable.get());
        }
    }
}

(3) The complete LogCleanJob.java consists of the LogParser, Mapper, and Reducer classes above, plus a driver (main method) that wires them into a job; see the sketch below.
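The full listing is not reproduced here, so the following driver is a minimal sketch under the usual Hadoop 1.x (new mapreduce API) conventions: it assumes the nested classes above live in a class named LogCleanJob, takes the input file and output directory as its two command-line arguments (matching the hadoop jar invocation in section 2.3), and prints the "Clean process success!" message seen in the console output in section 2.4.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogCleanJob {
    // static class LogParser { ... }   // as shown in (1)
    // static class MyMapper { ... }    // as shown in (2)
    // static class MyReducer { ... }   // as shown in (2)

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "LogCleanJob");
        job.setJarByClass(LogCleanJob.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        // map output is <LongWritable, Text>; the reducer emits Text only
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));  // log file in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // cleaned output dir
        boolean success = job.waitForCompletion(true);
        if (success) {
            System.out.println("Clean process success!");
        }
        System.exit(success ? 0 : 1);
    }
}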
(4) Export the program as a jar and upload it to the designated directory on the Linux server.

2.3 Cleaning Logs into HDFS on a Schedule

Now we extend the earlier scheduled script so that it also runs the cleaning MapReduce program:

#!/bin/sh

#step1. get yesterday's date string
yesterday=$(date --date='1 days ago' +%Y_%m_%d)

#step2. upload the log to HDFS
hadoop fs -put /usr/local/files/apache_logs/access_${yesterday}.log /project/techbbs/data

#step3. clean the log data
hadoop jar /usr/local/files/apache_logs/mycleaner.jar /project/techbbs/data/access_${yesterday}.log /project/techbbs/cleaned/${yesterday}

In other words, every day at 1:00 the script first uploads the previous day's log file to HDFS, then runs the cleaning program over the file just stored in HDFS, and writes the filtered data into the cleaned directory.

2.4 Testing the Scheduled Task

(1) Since the two log files are from 2013, rename them so that their dates become today and yesterday in 2015, which lets the test run against them.

(2) Run the script: techbbs_core.sh

The console output is shown below. The filtering removed a large share of the records: 548,160 map input records are reduced to 169,857 output records.

15/04/26 04:27:20 INFO input.FileInputFormat: Total input paths to process : 1
15/04/26 04:27:20 INFO util.NativeCodeLoader: Loaded the native-hadoop library
15/04/26 04:27:20 WARN snappy.LoadSnappy: Snappy native library not loaded
15/04/26 04:27:22 INFO mapred.JobClient: Running job: job_201504260249_0002
15/04/26 04:27:23 INFO mapred.JobClient:  map 0% reduce 0%
15/04/26 04:28:01 INFO mapred.JobClient:  map 29% reduce 0%
15/04/26 04:28:07 INFO mapred.JobClient:  map 42% reduce 0%
15/04/26 04:28:10 INFO mapred.JobClient:  map 57% reduce 0%
15/04/26 04:28:13 INFO mapred.JobClient:  map 74% reduce 0%
15/04/26 04:28:16 INFO mapred.JobClient:  map 89% reduce 0%
15/04/26 04:28:19 INFO mapred.JobClient:  map 100% reduce 0%
15/04/26 04:28:49 INFO mapred.JobClient:  map 100% reduce 100%
15/04/26 04:28:50 INFO mapred.JobClient: Job complete: job_201504260249_0002
15/04/26 04:28:50 INFO mapred.JobClient: Counters: 29
15/04/26 04:28:50 INFO mapred.JobClient:   Job Counters
15/04/26 04:28:50 INFO mapred.JobClient:     Launched reduce tasks=1
15/04/26 04:28:50 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=58296
15/04/26 04:28:50 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/04/26 04:28:50 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/04/26 04:28:50 INFO mapred.JobClient:     Launched map tasks=1
15/04/26 04:28:50 INFO mapred.JobClient:     Data-local map tasks=1
15/04/26 04:28:50 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=25238
15/04/26 04:28:50 INFO mapred.JobClient:   File Output Format Counters
15/04/26 04:28:50 INFO mapred.JobClient:     Bytes Written=12794925
15/04/26 04:28:50 INFO mapred.JobClient:   FileSystemCounters
15/04/26 04:28:50 INFO mapred.JobClient:     FILE_BYTES_READ=14503530
15/04/26 04:28:50 INFO mapred.JobClient:     HDFS_BYTES_READ=61084325
15/04/26 04:28:50 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=29111500
15/04/26 04:28:50 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=12794925
15/04/26 04:28:50 INFO mapred.JobClient:   File Input Format Counters
15/04/26 04:28:50 INFO mapred.JobClient:     Bytes Read=61084192
15/04/26 04:28:50 INFO mapred.JobClient:   Map-Reduce Framework
15/04/26 04:28:50 INFO mapred.JobClient:     Map output materialized bytes=14503530
15/04/26 04:28:50 INFO mapred.JobClient:     Map input records=548160
15/04/26 04:28:50 INFO mapred.JobClient:     Reduce shuffle bytes=14503530
15/04/26 04:28:50 INFO mapred.JobClient:     Spilled Records=339714
15/04/26 04:28:50 INFO mapred.JobClient:     Map output bytes=14158741
15/04/26 04:28:50 INFO mapred.JobClient:     CPU time spent (ms)=21200
15/04/26 04:28:50 INFO mapred.JobClient:     Total committed heap usage (bytes)=229003264
15/04/26 04:28:50 INFO mapred.JobClient:     Combine input records=0
15/04/26 04:28:50 INFO mapred.JobClient:     SPLIT_RAW_BYTES=133
15/04/26 04:28:50 INFO mapred.JobClient:     Reduce input records=169857
15/04/26 04:28:50 INFO mapred.JobClient:     Reduce input groups=169857
15/04/26 04:28:50 INFO mapred.JobClient:     Combine output records=0
15/04/26 04:28:50 INFO mapred.JobClient:     Physical memory (bytes) snapshot=154001408
15/04/26 04:28:50 INFO mapred.JobClient:     Reduce output records=169857
15/04/26 04:28:50 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=689442816
15/04/26 04:28:50 INFO mapred.JobClient:     Map output records=169857
Clean process success!

(3) View the log data in HDFS through the web interface:

Raw, unfiltered log data: /project/techbbs/data/
Filtered, cleaned log data: /project/techbbs/cleaned/
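Besides the web interface, the cleaned output can also be spot-checked from code. The snippet below is a minimal sketch and not part of the original project: it assumes a cleaned directory named after a date the script has already processed, and that the single reducer wrote its output to the conventional part-r-00000 file.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CleanedOutputCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Assumed directory name; substitute a date the script has processed
        Path cleaned = new Path("/project/techbbs/cleaned/2015_04_25");
        // List the files the job produced and their sizes
        for (FileStatus status : fs.listStatus(cleaned)) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }
        // Print the first few cleaned records (ip \t time \t url)
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path(cleaned, "part-r-00000"))))) {
            String line;
            for (int i = 0; i < 5 && (line = reader.readLine()) != null; i++) {
                System.out.println(line);
            }
        }
    }
}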