Big Data with Hadoop: Website Log Analysis for a Large Internet E-commerce Company

        Today I'm sharing an analysis demo for a large project: suppose a company's website team wants to analyze site traffic on promotion days and public holidays. An offline Hadoop platform is an extremely cost-effective way to handle this kind of requirement. The tutorial follows below.

        Steps:

        1. Download the logs

               Because the website is often deployed on the customer's intranet while the Hadoop platform sits on our own network, the two networks cannot reach each other and logs cannot be collected automatically. So here I simply downloaded the logs by hand; if you need automated collection, you can write your own scripts or use Flume.

#1. Download the logs
lcd E:\xhexx-logs-anlily\2018-11-11
get /opt/xxx/apache-tomcat-7.0.57/logs/localhost_access_log.2018-11-11.txt
get /opt/xxx/t1/logs/localhost_access_log.2018-11-11.txt
get /opt/xxx/apache-tomcat-7.0.57_msg/logs/localhost_access_log.2018-11-11.txt
#2. Rename the log files
localhost_access_log_10_2_4_155_1_2018-11-11.txt
localhost_access_log_10_2_4_155_2_2018-11-11.txt

        Sample log format:

1x.x.x.xxx - - [11/Nov/2018:15:58:27 +0800] "GET /xxx/xhexx_ui/bbs/js/swiper-3.3.1.min.js HTTP/1.0" 200 78313
1x.x.x.xxx - - [11/Nov/2018:15:58:37 +0800] "GET /xxx/doSign/showSign HTTP/1.0" 200 2397
1x.x.x.xxx - - [11/Nov/2018:15:58:41 +0800] "GET /xxx/ui/metro/js/metro.min.js HTTP/1.0" 200 92107
1x.x.x.xxx - - [11/Nov/2018:15:58:41 +0800] "GET /xxx/mMember/savePage HTTP/1.0" 200 3898
1x.x.x.xxx - - [11/Nov/2018:15:58:45 +0800] "POST /xxx/mMember/checkCode HTTP/1.0" 200 77

        2. Upload the logs to the Hadoop cluster

cd  ~/xhexx-logs-anlily/2018-11-11
hdfs dfs -put *  /webloginput
[hadoop@centos-aaron-h1 2018-11-11]$ hdfs dfs -ls /webloginput
Found 47 items
-rw-r--r--   2 hadoop supergroup    2542507 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_1_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    2621956 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_2_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup       5943 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_1_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    2610415 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_2_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    2613782 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_1_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    2591445 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_2_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup       5474 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_1_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    2585817 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_1_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup     428110 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_2_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    4139168 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_1_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    4119252 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_2_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    2602771 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_1_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    4019158 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_2_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    2597577 2019-01-19 22:09 /webloginput/localhost_access_log_1x_x_x_1xx_1_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    4036010 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_2_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    2591541 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_1_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    2622123 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_2_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    4081492 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_1_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    4139018 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_2_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    2541915 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_1_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    3994434 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_2_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    2054366 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_1_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    2087420 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_2_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    1970492 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_1_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    1999238 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_2_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    2097946 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_1_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    2113500 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_2_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    2065582 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_1_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    1981415 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_2_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    2054112 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_1_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    2084308 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_2_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    1983759 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_1_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    1990587 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_2_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    2049399 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_1_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    2061087 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_2_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    2040909 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_1_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    2068085 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_2_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    2061532 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_1_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    2040548 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_2_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    2070062 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_1_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    2092143 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_2_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    2040414 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_1_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    2075960 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_2_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    2070758 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_1_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    2063688 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_2_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    1964264 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_1_2018-11-11.txt
-rw-r--r--   2 hadoop supergroup    1977532 2019-01-19 22:38 /webloginput/localhost_access_log_1x_x_x_1xx_2_2018-11-11.txt

        3. Write the MapReduce program

package com.empire.hadoop.mr.xhexxweblogwash;

/**
 * WebLogBean.java: bean class representing one web access log record
 * 
 * @author arron 2019年x月xx日 上午xx:xx:xx
 */
public class WebLogBean {

    private String  remoteAddr;   // client IP address
    private String  timeLocal;    // access time and timezone
    private String  requestType;  // request method (GET or POST)
    private String  url;          // requested URL
    private String  protocol;     // protocol, e.g. HTTP/1.0
    private String  status;       // HTTP status code; 200 means success
    private String  traffic;      // response size in bytes
    private boolean valid = true; // whether this record is valid

    public String getRemoteAddr() {
        return remoteAddr;
    }

    public void setRemoteAddr(String remoteAddr) {
        this.remoteAddr = remoteAddr;
    }

    public String getTimeLocal() {
        return timeLocal;
    }

    public void setTimeLocal(String timeLocal) {
        this.timeLocal = timeLocal;
    }

    public String getRequestType() {
        return requestType;
    }

    public void setRequestType(String requestType) {
        this.requestType = requestType;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public String getProtocol() {
        return protocol;
    }

    public void setProtocol(String protocol) {
        this.protocol = protocol;
    }

    public String getStatus() {
        return status;
    }

    public void setStatus(String status) {
        this.status = status;
    }

    public String getTraffic() {
        return traffic;
    }

    public void setTraffic(String traffic) {
        this.traffic = traffic;
    }

    public boolean isValid() {
        return valid;
    }

    public void setValid(boolean valid) {
        this.valid = valid;
    }

    @Override
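    // Serializes the record as the comma-separated line that the mapper emits and
    // that is later imported into MySQL (steps 5 and 6 below).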
    public String toString() {
        StringBuilder sb = new StringBuilder();
        sb.append(this.remoteAddr);
        sb.append(",").append(this.timeLocal);
        sb.append(",").append(this.requestType);
        sb.append(",").append(this.url);
        sb.append(",").append(this.protocol);
        sb.append(",").append(this.status);
        sb.append(",").append(this.traffic);
        return sb.toString();
    }
}
package com.empire.hadoop.mr.xhexxweblogwash;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

/**
 * WebLogParser.java: parses a raw access log line into a WebLogBean
 * 
 * @author arron 2019年x月xx日 上午xx:xx:xx
 */
public class WebLogParser {
    /**
     * Timestamp format as it appears in the raw log
     */
    private static SimpleDateFormat sd1 = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.US);
    /**
     * Timestamp format after parsing
     */
    private static SimpleDateFormat sd2 = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    /**
     * Parse one log line into a WebLogBean
     * 
     * @param line
     * @return
     */
    public static WebLogBean parser(String line) {
        WebLogBean webLogBean = new WebLogBean();
        String[] arr = line.split(" ");
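        // A valid line such as: ip - - [11/Nov/2018:15:58:27 +0800] "GET /path HTTP/1.0" 200 78313
        // splits on spaces into exactly 10 fields: arr[0]=ip, arr[3]="[date:time", arr[4]="+0800]",
        // arr[5]=quote+method, arr[6]=url, arr[7]=protocol+quote, arr[8]=status, arr[9]=bytes or "-".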
        if (arr.length == 10) {
            webLogBean.setRemoteAddr(arr[0]);
            webLogBean.setTimeLocal(parseTime(arr[3].substring(1)));
            webLogBean.setRequestType(arr[5].substring(1, arr[5].length()));
            webLogBean.setUrl(arr[6]);
            webLogBean.setProtocol(arr[7].substring(0, arr[7].length() - 1));
            webLogBean.setStatus(arr[8]);
            if ("-".equals(arr[9])) {
                webLogBean.setTraffic("0");
            } else {
                webLogBean.setTraffic(arr[9]);
            }
            webLogBean.setValid(true);
        } else {
            webLogBean.setValid(false);
        }
        return webLogBean;
    }

    /**
     * Parse the timestamp string
     * 
     * @param dt
     * @return
     */
    public static String parseTime(String dt) {

        String timeString = "";
        try {
            Date parse = sd1.parse(dt);
            timeString = sd2.format(parse);

        } catch (ParseException e) {
            e.printStackTrace();
        }

        return timeString;
    }

    public static void main(String[] args) {
        WebLogParser wp = new WebLogParser();
        String parseTime = WebLogParser.parseTime("18/Sep/2013:06:49:48");
        System.out.println(parseTime);
        String line = "1x.x.x.1xx - - [11/Nov/2018:09:45:31 +0800] \"POST /xxx/wxxlotteryactivitys/xotterxClick HTTP/1.0\" 200 191";
        WebLogBean webLogBean = WebLogParser.parser(line);
        if (!webLogBean.isValid())
            return;
        System.out.println(webLogBean);
    }

}

        The MapReduce driver program:

package com.empire.hadoop.mr.xhexxweblogwash;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * WeblogPreProcess.java: driver class for the web log cleansing job
 * 
 * @author arron 2019年x月xx日 上午xx:xx:xx
 */
public class WeblogPreProcess {

    static class WeblogPreProcessMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        Text         k = new Text();
        NullWritable v = NullWritable.get();

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

            String line = value.toString();
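            // Records the parser marks as invalid (lines that do not split into the
            // expected 10 fields) are dropped, so this map step doubles as data cleansing.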
            WebLogBean webLogBean = WebLogParser.parser(line);
            if (!webLogBean.isValid())
                return;
            k.set(webLogBean.toString());
            context.write(k, v);

        }

    }

    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        job.setJarByClass(WeblogPreProcess.class);

        job.setMapperClass(WeblogPreProcessMapper.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
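        // Note: no Reducer class or reduce-task count is set, so Hadoop falls back to the
        // default (identity) Reducer with a single reduce task; that is why the job run in
        // step 4 shows a reduce phase and produces a single part-r-00000 output file.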

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);

    }
}

        Package the program above into a jar and upload it to the node of the Hadoop cluster from which YARN jobs are submitted.

        4. Run the program

[hadoop@centos-aaron-h1 ~]$  hadoop jar  weblogclear.jar com.empire.hadoop.mr.xhexxweblogwash.WeblogPreProcess /webloginput /weblogout
19/01/20 02:10:53 INFO client.RMProxy: Connecting to ResourceManager at centos-aaron-h1/192.168.29.144:8032
19/01/20 02:10:53 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
19/01/20 02:10:54 INFO input.FileInputFormat: Total input files to process : 47
19/01/20 02:10:54 INFO mapreduce.JobSubmitter: number of splits:47
19/01/20 02:10:54 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
19/01/20 02:10:54 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1547906537546_0004
19/01/20 02:10:54 INFO impl.YarnClientImpl: Submitted application application_1547906537546_0004
19/01/20 02:10:55 INFO mapreduce.Job: The url to track the job: http://centos-aaron-h1:8088/proxy/application_1547906537546_0004/
19/01/20 02:10:55 INFO mapreduce.Job: Running job: job_1547906537546_0004
19/01/20 02:11:03 INFO mapreduce.Job: Job job_1547906537546_0004 running in uber mode : false
19/01/20 02:11:03 INFO mapreduce.Job:  map 0% reduce 0%
19/01/20 02:11:41 INFO mapreduce.Job:  map 7% reduce 0%
19/01/20 02:11:44 INFO mapreduce.Job:  map 10% reduce 0%
19/01/20 02:11:45 INFO mapreduce.Job:  map 13% reduce 0%
19/01/20 02:11:58 INFO mapreduce.Job:  map 20% reduce 0%
19/01/20 02:12:01 INFO mapreduce.Job:  map 23% reduce 0%
19/01/20 02:12:02 INFO mapreduce.Job:  map 25% reduce 0%
19/01/20 02:12:03 INFO mapreduce.Job:  map 29% reduce 0%
19/01/20 02:12:04 INFO mapreduce.Job:  map 35% reduce 0%
19/01/20 02:12:05 INFO mapreduce.Job:  map 37% reduce 0%
19/01/20 02:12:07 INFO mapreduce.Job:  map 39% reduce 0%
19/01/20 02:12:09 INFO mapreduce.Job:  map 41% reduce 0%
19/01/20 02:12:10 INFO mapreduce.Job:  map 47% reduce 0%
19/01/20 02:12:20 INFO mapreduce.Job:  map 55% reduce 0%
19/01/20 02:12:23 INFO mapreduce.Job:  map 56% reduce 0%
19/01/20 02:12:24 INFO mapreduce.Job:  map 59% reduce 0%
19/01/20 02:12:25 INFO mapreduce.Job:  map 60% reduce 0%
19/01/20 02:12:46 INFO mapreduce.Job:  map 62% reduce 20%
19/01/20 02:12:49 INFO mapreduce.Job:  map 66% reduce 20%
19/01/20 02:12:51 INFO mapreduce.Job:  map 79% reduce 20%
19/01/20 02:12:52 INFO mapreduce.Job:  map 83% reduce 21%
19/01/20 02:12:53 INFO mapreduce.Job:  map 84% reduce 21%
19/01/20 02:12:54 INFO mapreduce.Job:  map 85% reduce 21%
19/01/20 02:12:55 INFO mapreduce.Job:  map 89% reduce 21%
19/01/20 02:12:58 INFO mapreduce.Job:  map 89% reduce 30%
19/01/20 02:12:59 INFO mapreduce.Job:  map 91% reduce 30%
19/01/20 02:13:00 INFO mapreduce.Job:  map 95% reduce 30%
19/01/20 02:13:01 INFO mapreduce.Job:  map 96% reduce 30%
19/01/20 02:13:02 INFO mapreduce.Job:  map 98% reduce 30%
19/01/20 02:13:03 INFO mapreduce.Job:  map 100% reduce 30%
19/01/20 02:13:04 INFO mapreduce.Job:  map 100% reduce 40%
19/01/20 02:13:07 INFO mapreduce.Job:  map 100% reduce 100%
19/01/20 02:13:07 INFO mapreduce.Job: Job job_1547906537546_0004 completed successfully
19/01/20 02:13:08 INFO mapreduce.Job: Counters: 51
        File System Counters
                FILE: Number of bytes read=98621200
                FILE: Number of bytes written=206697260
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=110656205
                HDFS: Number of bytes written=96528650
                HDFS: Number of read operations=144
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters 
                Killed map tasks=2
                Launched map tasks=48
                Launched reduce tasks=1
                Data-local map tasks=44
                Rack-local map tasks=4
                Total time spent by all maps in occupied slots (ms)=2281022
                Total time spent by all reduces in occupied slots (ms)=61961
                Total time spent by all map tasks (ms)=2281022
                Total time spent by all reduce tasks (ms)=61961
                Total vcore-milliseconds taken by all map tasks=2281022
                Total vcore-milliseconds taken by all reduce tasks=61961
                Total megabyte-milliseconds taken by all map tasks=2335766528
                Total megabyte-milliseconds taken by all reduce tasks=63448064
        Map-Reduce Framework
                Map input records=911941
                Map output records=911941
                Map output bytes=96661657
                Map output materialized bytes=98621470
                Input split bytes=7191
                Combine input records=0
                Combine output records=0
                Reduce input groups=855824
                Reduce shuffle bytes=98621470
                Reduce input records=911941
                Reduce output records=911941
                Spilled Records=1823882
                Shuffled Maps =47
                Failed Shuffles=0
                Merged Map outputs=47
                GC time elapsed (ms)=64058
                CPU time spent (ms)=130230
                Physical memory (bytes) snapshot=4865814528
                Virtual memory (bytes) snapshot=40510705664
                Total committed heap usage (bytes)=5896761344
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters 
                Bytes Read=110649014
        File Output Format Counters 
                Bytes Written=96528650
[hadoop@centos-aaron-h1 ~]$ hdfs dfs -ls /weblogout
Found 2 items
-rw-r--r--   2 hadoop supergroup          0 2019-01-20 02:13 /weblogout/_SUCCESS
-rw-r--r--   2 hadoop supergroup   96528650 2019-01-20 02:13 /weblogout/part-r-00000
[hadoop@centos-aaron-h1 ~]$
[hadoop@centos-aaron-h1 ~]$  hdfs dfs -cat   /weblogout/part-r-00000 |more
1x.x.4.x0x,2018-11-11 00:00:01,GET,/xxx/xxxAppoint/showAppointmentList,HTTP/1.0,200,1670
1x.x.4.x0x,2018-11-11 00:00:02,GET,/xxx/xxxSign/showSign,HTTP/1.0,200,2384
1x.x.4.x0x,2018-11-11 00:00:05,POST,/xxx/xxxSign/addSign,HTTP/1.0,200,224
1x.x.4.x0x,2018-11-11 00:00:08,GET,/xxx/xxxMember/mycard,HTTP/1.0,200,1984
1x.x.4.x0x,2018-11-11 00:00:08,POST,/xxx/xxx/activity/rotarydisc,HTTP/1.0,200,230
1x.x.4.x0x,2018-11-11 00:00:09,POST,/xxx/xxxMember/mycardresult,HTTP/1.0,200,1104
1x.x.4.x0x,2018-11-11 00:00:10,GET,/xxx/resources/dojo/dojo.js,HTTP/1.0,200,257598
1x.x.4.x0x,2018-11-11 00:00:10,GET,/xxx/resources/spring/Spring-Dojo.js,HTTP/1.0,200,9520
1x.x.4.x0x,2018-11-11 00:00:10,GET,/xxx/resources/spring/Spring.js,HTTP/1.0,200,3170
1x.x.4.x0x,2018-11-11 00:00:10,GET,/xxx/wcclotteryactivitys/showActivity?id=268&code=081AvJud1SCu4t0Bactd11aGud1AvJuM&state=272,HTTP/1.0,200,30334
1x.x.4.x0x,2018-11-11 00:00:11,GET,/xxx/js/date/WdatePicker.js,HTTP/1.0,200,10235
1x.x.4.x0x,2018-11-11 00:00:11,GET,/xxx/js/hcharts/js/highcharts-more.js,HTTP/1.0,200,23172
1x.x.4.x0x,2018-11-11 00:00:11,GET,/xxx/js/hcharts/js/highcharts.js,HTTP/1.0,200,153283
1x.x.4.x0x,2018-11-11 00:00:11,GET,/xxx/js/hcharts/js/modules/exporting.js,HTTP/1.0,200,7449
1x.x.4.x0x,2018-11-11 00:00:11,GET,/xxx/js/jquery.min.js,HTTP/1.0,200,83477
1x.x.4.x0x,2018-11-11 00:00:11,GET,/xxx/js/jquery.widget.min.js,HTTP/1.0,200,6520
1x.x.4.x0x,2018-11-11 00:00:11,GET,/xxx/resources/dojo/nls/dojo_zh-cn.js,HTTP/1.0,200,6757
1x.x.4.x0x,2018-11-11 00:00:11,GET,/xxx/ui/dtGrid/i18n/zh-cn.js,HTTP/1.0,200,8471
1x.x.4.x0x,2018-11-11 00:00:11,GET,/xxx/ui/dtGrid/jquery.dtGrid.js,HTTP/1.0,200,121247
1x.x.4.x0x,2018-11-11 00:00:11,GET,/xxx/ui/metro/js/metro.min.js,HTTP/1.0,200,92107
1x.x.4.x0x,2018-11-11 00:00:11,GET,/xxx/ui/zTree/js/jquery.ztree.core-3.5.js,HTTP/1.0,200,55651
1x.x.4.x0x,2018-11-11 00:00:11,GET,/xxx/ui/zTree/js/jquery.ztree.excheck-3.5.js,HTTP/1.0,200,21486
1x.x.4.x0x,2018-11-11 00:00:11,GET,/xxx/ui/zTree/js/jquery.ztree.exedit-3.5.js,HTTP/1.0,200,42910
1x.x.4.x0x,2018-11-11 00:00:11,POST,/xxx/xxxMember/mycardresult,HTTP/1.0,200,1104
1x.x.4.x0x,2018-11-11 00:00:12,GET,/xxx/js/lottery/jQueryRotate.2.2.js,HTTP/1.0,200,11500
1x.x.4.x0x,2018-11-11 00:00:12,GET,/xxx/js/lottery/jquery.easing.min.js,HTTP/1.0,200,5555
--More--

        5. Prepare the data in MySQL

              Since one day of logs comes to only about 100 MB, which is not a large volume, we simply import the cleaned data into MySQL via Navicat and analyze it there. This makes it easy to prepare analyses for the customer, and we can even hand the SQL over to the customer so they can run the queries themselves.

#a. Download the result file from HDFS
hdfs dfs -get   /weblogout/part-r-00000 
#b. Create the table in MySQL
DROP TABLE IF EXISTS `wcc_web_log_2018-11-11`;
CREATE TABLE `wcc_web_log_2018-11-11`  (
  `remote_addr` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin NULL DEFAULT NULL,
  `time_local` datetime(0) NULL DEFAULT NULL,
  `request_type` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin NULL DEFAULT NULL,
  `url` varchar(500) CHARACTER SET utf8 COLLATE utf8_bin NULL DEFAULT NULL,
  `protocol` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin NULL DEFAULT NULL,
  `status` int(255) NULL DEFAULT NULL,
  `traffic` int(11) NULL DEFAULT NULL,
  INDEX `idx_wcc_web_log_2018_11_11_time_local`(`time_local`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_bin ROW_FORMAT = Compact;

#c. Import into MySQL using Navicat's txt import (choose whatever tool suits you here; Sqoop also works, and a JDBC loader sketch follows below)
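
        If you would rather not use Navicat, a plain JDBC loader also works. The following is a minimal sketch (a hypothetical class, not part of the project above): it reads the downloaded part-r-00000 and batch-inserts the rows into the table created in step b. It assumes mysql-connector-java is on the classpath, that the connection URL, user and password placeholders are replaced with real values, and that the cleaned URLs contain no commas (matching the comma-separated WebLogBean.toString() format).

package com.empire.hadoop.mr.xhexxweblogwash;

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

/**
 * Hypothetical JDBC alternative to the Navicat import: loads the cleaned
 * MapReduce output into the wcc_web_log_2018-11-11 table.
 */
public class MysqlLoader {

    public static void main(String[] args) throws Exception {
        // Placeholder connection settings; adjust to your environment.
        String jdbcUrl = "jdbc:mysql://localhost:3306/weblog?useUnicode=true&characterEncoding=utf8";
        String sql = "INSERT INTO `wcc_web_log_2018-11-11` "
                + "(remote_addr, time_local, request_type, url, protocol, status, traffic) "
                + "VALUES (?,?,?,?,?,?,?)";
        try (Connection conn = DriverManager.getConnection(jdbcUrl, "root", "password");
             PreparedStatement ps = conn.prepareStatement(sql);
             BufferedReader reader = new BufferedReader(new FileReader("part-r-00000"))) {
            conn.setAutoCommit(false);
            String line;
            int count = 0;
            while ((line = reader.readLine()) != null) {
                String[] f = line.split(",");
                if (f.length != 7) {
                    continue; // skip rows that do not have the expected 7 fields
                }
                // MySQL coerces the string values into the datetime/int columns.
                for (int i = 0; i < 7; i++) {
                    ps.setString(i + 1, f[i]);
                }
                ps.addBatch();
                if (++count % 5000 == 0) {
                    ps.executeBatch();
                    conn.commit();
                }
            }
            ps.executeBatch();
            conn.commit();
        }
    }
}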

        6. Statistical analysis

        (1) Requests per hour, ordered by count descending

SELECT
	count( * ),
	DATE_FORMAT( time_local, "%Y-%m-%d %H" ) 
FROM
	`wcc_web_log_2018-11-11` 
GROUP BY
	DATE_FORMAT( time_local, "%Y-%m-%d %H" ) 
ORDER BY
	count( * ) DESC;

 

        (2) Requests and traffic per minute (unit: MB)

SELECT
	count( * ),sum(traffic)/1024/1024,
	DATE_FORMAT( time_local, "%Y-%m-%d %H:%i" ) 
FROM
	`wcc_web_log_2018-11-11` 
GROUP BY
	DATE_FORMAT( time_local, "%Y-%m-%d %H:%i" )
ORDER BY count(*) DESC;

        (3) Requests per second

SELECT
	count( * ),
	DATE_FORMAT( time_local, "%Y-%m-%d %H:%i:%S" ) 
FROM
	`wcc_web_log_2018-11-11` 
GROUP BY
	DATE_FORMAT( time_local, "%Y-%m-%d %H:%i:%S" )
ORDER BY count(*) DESC;

        (4) Requests per URL (ideally, the part of the URL starting at the "?" should be stripped during the MR step; see the sketch after this query)

SELECT  count(*),url from `wcc_web_log_2018-11-11`
GROUP BY url
ORDER BY count(*) DESC;
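
        As a sketch of the tweak mentioned in (4): a small helper like the one below (hypothetical, not in the MR code above) could be called in WebLogParser.parser() right before the URL is set, i.e. webLogBean.setUrl(UrlCleaner.stripQueryString(arr[6])), so that all hits on the same page are grouped under one URL regardless of query parameters.

package com.empire.hadoop.mr.xhexxweblogwash;

/**
 * Hypothetical helper: strips the query string so that URLs such as
 * /xxx/showActivity?id=268&state=272 all group under /xxx/showActivity
 * in the per-URL statistics.
 */
public class UrlCleaner {

    public static String stripQueryString(String url) {
        int idx = url.indexOf('?');
        return idx >= 0 ? url.substring(0, idx) : url;
    }

    public static void main(String[] args) {
        // Quick check of the helper.
        System.out.println(stripQueryString("/xxx/wcclotteryactivitys/showActivity?id=268&state=272"));
    }
}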

        (5) Total traffic (unit: GB)

SELECT sum(traffic)/1024/1024/1024 from `wcc_web_log_2018-11-11`

        (6) Request count and total traffic grouped by HTTP status code (unit: MB)

SELECT count(*),sum(traffic)/1024/1024,status from `wcc_web_log_2018-11-11`  GROUP BY  status

        (7) Total number of requests

select count(*) from `wcc_web_log_2018-11-11`

        7. Summary

               When the analysis needs to be turned into reports, querying this single table is enough, and the same queries can drive scheduled e-mail reports. That said, I would generally recommend Hive for this kind of analysis; I used MySQL directly here only because it was sufficient for what I needed.
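
        For readers who want to try the Hive route, here is a minimal sketch of what the hourly statistic from (1) could look like over JDBC. The host, credentials and the table name weblog_20181111 are placeholders; it assumes HiveServer2 is running, hive-jdbc and its dependencies are on the classpath, and that an external table with columns matching the cleaned log format has been created over /weblogout.

package com.empire.hadoop.mr.xhexxweblogwash;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/**
 * Hypothetical Hive version of the per-hour statistic: counts requests per
 * hour from an external table built over the cleaned log output.
 */
public class HiveHourlyReport {

    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://centos-aaron-h1:10000/default", "hadoop", "");
             Statement st = conn.createStatement()) {
            // Same per-hour count as query (1), expressed in HiveQL;
            // time_local is stored as a string like "2018-11-11 00:00:01".
            ResultSet rs = st.executeQuery(
                    "SELECT substr(time_local, 1, 13) AS hr, count(*) AS pv "
                  + "FROM weblog_20181111 GROUP BY substr(time_local, 1, 13) ORDER BY pv DESC");
            while (rs.next()) {
                System.out.println(rs.getString("hr") + "\t" + rs.getLong("pv"));
            }
        }
    }
}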

               That wraps up this post. If you found it useful, please give it a like; and if you are interested in my other big-data and server-side articles, please follow my blog and feel free to reach out any time.
