Web Scraping in Practice: CBA Historical Game Data

Date: 2019-11-18


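I had some free time, and a friend happened to ask me about web scraping. They mentioned wanting the CBA game data from the last couple of years, to run some analysis on it and maybe do a bit of "big data" work. That caught my interest, so I went ahead and built it. Below I share the approach behind the scraper.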

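1. Choosing a data source

I don't actually follow the CBA. The data source I picked is the CBA section of a major Chinese portal site; the URL appears in the code further down, for anyone who wants to take a look.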

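2. Analyzing the data

Inspecting the page elements shows that the page is rendered on the server, so the data cannot be fetched directly from an API. The next step is to look at the page markup itself: all of the data sits inside tables, which makes things much simpler.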

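3. Settling on an approach

The approach is fairly simple: use a regular expression to pull out every table row, filter out the decorative markup, and what remains is the data we want. I replaced the cell separators in each row with "," so the rows can be saved as CSV.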
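The steps just described can be tried in isolation. The following is a minimal sketch in plain Java, not the author's code: the class name `RowExtractor`, the method `extractRows`, and the sample HTML are all made up for illustration, and only the JDK's `java.util.regex` is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns every <tr>...</tr> in an HTML string into one CSV line.
class RowExtractor {

    // Non-greedy match of a single table row. DOTALL is not needed because
    // all whitespace (including newlines) is removed before matching.
    private static final Pattern ROW = Pattern.compile("<tr.*?</tr>");

    static List<String> extractRows(String html) {
        // Remove all whitespace so a row never spans several lines.
        String compact = html.replaceAll("\\s", "");
        List<String> rows = new ArrayList<>();
        Matcher m = ROW.matcher(compact);
        while (m.find()) {
            String row = m.group()
                    .replaceAll("</?tr.*?>", "")  // remove the opening and closing row tags
                    .replaceAll("</t[dh]>", ",")  // each closing cell tag becomes a comma
                    .replaceAll("<.*?>", "");     // strip whatever markup is left
            if (row.endsWith(",")) {
                // Drop the trailing comma produced by the last cell.
                row = row.substring(0, row.length() - 1);
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample markup, not copied from the real page.
        String html = "<table>\n"
                + "<tr><th>Team</th><th>Total</th></tr>\n"
                + "<tr class=\"odd\"><td>Guangzhou</td><td><b>133</b></td></tr>\n"
                + "</table>";
        // Prints "Team,Total" and then "Guangzhou,133".
        extractRows(html).forEach(System.out::println);
    }
}
```

One side effect of removing all whitespace is that spaces inside cell text disappear too, so a value such as a date followed by a time ends up fused together.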

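After filtering, the data for one game looks like the listing below, shown as scraped in Chinese. The first header row reads team, quarters 1 to 4, and final score; the player header row reads starter, player, minutes played, two-pointers, three-pointers, free throws, offensive rebounds, rebounds, assists, turnovers, steals, fouls, blocks, and points: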

球隊,第一節,第二節,第三節,第四節,總比分
廣州,33,37,36,27,133
北控,23,18,17,34,92
2019-01-1619:35:00輪次：31場序309開始比賽　　比賽已結束
首發,球員,出場時間,兩分球,三分球,罰球,進攻,籃板,助攻,失誤,搶斷,犯規,蓋帽,得分
,張永鵬,25.8,7-9,0-0,1-1,4,8,3,0,0,1,0,15
,鞠明欣,19.1,2-4,1-2,0-0,2,5,2,2,0,1,0,7
,西熱力江,25.5,1-1,4-8,0-0,1,2,4,1,3,1,0,14
,郭凱,15.5,2-2,0-0,0-0,2,3,0,2,0,2,0,4
,凱爾·弗格,38.1,5-9,5-9,11-11,0,10,12,2,2,4,0,36
,姚天一,12.3,0-1,1-4,0-0,0,1,5,0,0,0,0,3
,科裏·傑弗森,24.0,4-4,2-4,3-4,0,6,0,1,0,1,1,17
,陳盈駿,22.6,1-1,2-7,1-1,0,2,4,2,1,2,0,9
,司坤,19.0,2-2,0-2,0-0,0,5,1,0,1,4,0,4
,孫鳴陽,20.6,2-3,0-0,3-3,1,4,1,2,3,4,0,7
,谷玥灼,7.4,1-1,1-2,0-0,0,0,2,0,0,0,0,5
,鄭準,10.1,3-4,2-3,0-0,0,2,0,0,0,1,0,12
,總計,240.0,30-41(73.2%),18-41(43.9%),19-20(95.0%),10,48,34,12,10,21,1,133
首發,球員,出場時間,兩分球,三分球,罰球,進攻,籃板,助攻,失誤,搶斷,犯規,蓋帽,得分
,於梁,20.8,1-3,0-1,0-0,0,0,2,0,1,5,0,2
,於澍龍,17.9,0-1,1-3,0-0,0,2,1,2,0,1,0,3
,許夢君,46.2,1-3,5-12,0-0,1,6,2,1,0,3,0,17
,托馬斯·羅賓遜,43.4,9-20,0-2,9-14,3,11,5,2,1,3,1,27
,楊敬敏,16.0,3-4,0-3,0-0,0,0,0,2,0,1,0,6
,孫賀男,2.8,0-0,0-0,0-0,0,0,0,1,0,1,0,0
,劉大鵬,28.0,1-1,3-5,0-0,1,4,3,2,2,3,0,11
,張銘浩,8.5,0-0,0-0,1-2,0,0,0,0,1,1,0,1
,張帆,27.5,5-7,1-3,0-0,0,1,6,4,1,2,0,13
,王徵,23.3,3-3,0-0,6-8,0,2,0,0,1,1,1,12
,常亞鬆,5.6,0-0,0-1,0-0,0,1,0,1,2,0,0,0
,總計,240.0,23-42(54.8%),10-30(33.3%),16-24(66.7%),5,27,19,15,9,21,2,92
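The shooting columns (兩分球, 三分球, 罰球) hold values in "made-attempted" form, and the totals rows append a percentage, for example `30-41(73.2%)`. As a rough sketch of how such a field could be split for later analysis, assuming plain Java and a hypothetical class `ShotLine` that is not part of the original scraper:

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical value class for one "made-attempted" shooting field.
class ShotLine {
    // Two integers separated by "-", optionally followed by "(...)" as in the totals rows.
    private static final Pattern SHOTS = Pattern.compile("(\\d+)-(\\d+)(?:\\(.*\\))?");

    final int made;
    final int attempted;

    ShotLine(int made, int attempted) {
        this.made = made;
        this.attempted = attempted;
    }

    static ShotLine parse(String field) {
        Matcher m = SHOTS.matcher(field.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a made-attempted field: " + field);
        }
        return new ShotLine(Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)));
    }

    // Shooting percentage; 0.0 when no shot was attempted.
    double percentage() {
        return attempted == 0 ? 0.0 : 100.0 * made / attempted;
    }

    public static void main(String[] args) {
        ShotLine twos = ShotLine.parse("30-41(73.2%)");
        // Prints "30/41 = 73.2%", matching the percentage in the totals row.
        System.out.println(String.format(Locale.ROOT, "%d/%d = %.1f%%",
                twos.made, twos.attempted, twos.percentage()));
    }
}
```

Recomputing the percentage from the two counts is a cheap sanity check that a row was scraped intact.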

Here is the code I used:

package com.fun

import com.fun.frame.Save
import com.fun.frame.httpclient.FanLibrary
import com.fun.utils.Regex
import com.fun.utils.WriteRead

// FanLibrary, Save, Regex and WriteRead come from the author's own framework.
class sd extends FanLibrary {

    public static void main(String[] args) {
        def total = []
        // Scrape every game id produced by range(300, 381) and merge the rows.
        range(300, 381).forEach {x ->
            total.addAll test(x)
        }
        Save.saveStringList(total, "total4.csv")
        testOver()
    }

    // Returns the CSV rows for one game page, reusing the saved file if it already exists.
    static def test(int i) {
        if (new File(LONG_Path + "${i}.csv").exists()) return WriteRead.readTxtFileByLine(LONG_Path + "${i}.csv")
        String url = "http://cbadata.sports.sohu.com/game/content/2017/${i}"
        def get = getHttpGet(url)
        def response = getHttpResponse(get)

        // Remove all whitespace so that each <tr>...</tr> can be matched on a single line.
        def string = response.getString("content").replaceAll("\\s", EMPTY)
        def all = Regex.regexAll(string, "<tr.*?<\\/tr>")
        def list = []
        all.forEach {x ->
            // Drop the row tags, turn each closing cell tag into a comma, then strip the remaining tags.
            def info = x.replaceAll("</*?tr.*?>", EMPTY).replaceAll("</t(d|h)>", ",")
            info = info.replaceAll("<.*?>", EMPTY)
            // Remove the trailing comma left by the last cell; endsWith is also safe for an empty row.
            if (info.endsWith(",")) info = info.substring(0, info.length() - 1)
            // Prefix the totals row with a comma so its columns line up with the player rows.
            if (info.startsWith("總計")) info = "," + info
            list << info
            output(info)
        }
        Save.saveStringList(list, "${i}.csv")
        return list
    }

}
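The first line of `test` avoids re-downloading a game whose CSV file is already on disk, which keeps reruns fast and spares the site repeated requests. A minimal sketch of that cache-or-fetch pattern in plain Java follows; the class `GameCache` and its `scraper` parameter are hypothetical stand-ins, and the real HTTP request is replaced by a function so the example runs without network access.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical helper: reuse a per-game CSV if it exists, otherwise scrape and save it.
class GameCache {
    private final Path dir;
    private final IntFunction<List<String>> scraper; // stands in for the real page scraper

    GameCache(Path dir, IntFunction<List<String>> scraper) {
        this.dir = dir;
        this.scraper = scraper;
    }

    List<String> rowsFor(int gameId) {
        Path file = dir.resolve(gameId + ".csv");
        try {
            if (Files.exists(file)) {
                // Cache hit: read the rows saved by an earlier run.
                return Files.readAllLines(file, StandardCharsets.UTF_8);
            }
            // Cache miss: scrape once, then persist the rows for next time.
            List<String> rows = scraper.apply(gameId);
            Files.createDirectories(dir);
            Files.write(file, rows, StandardCharsets.UTF_8);
            return rows;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("cba");
        int[] calls = {0};
        // Fake scraper that only counts how often it is invoked.
        GameCache cache = new GameCache(dir, id -> {
            calls[0]++;
            return Collections.singletonList("game," + id);
        });
        cache.rowsFor(309);
        cache.rowsFor(309); // served from the file written by the first call
        System.out.println("scraper calls: " + calls[0]); // prints "scraper calls: 1"
    }
}
```

Passing the scraper in as a function also makes the caching logic easy to test on its own.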

If you are interested, reply 「大爺來玩啊」 in the backend to get my WeChat ID and we can chat privately.