Scraping Universities' Historical Admission Score Lines with Python

Date: 2019-12-07

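I have spent the past week helping my younger brother choose his gaokao (college entrance exam) applications, so updates have been less frequent than usual. Thanks for bearing with me.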

While looking through universities' past admission scores, I could not resist the itch: could I scrape them? Well, no time like the present!

1 Workflow analysis

Some time ago I stumbled upon a site that lists each university's historical admission score lines: https://gkcx.eol.cn.

Our goal is to use Python to export the data on a school's score-line page to Excel.

The URL of this page is https://gkcx.eol.cn/schoolhtm..., which is evidently built by splicing in a school_id. So how do we get that school_id?

Short of finding a way to scrape the school_id of every school, the route I chose is to go in through the site's search box.

With that, the overall workflow is clear:

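Call the search URL to get the university's school_id, then splice it into the URL of the university's detail page
Request the detail URL and scrape the target data
Process the target data and store it in Excel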

2 Getting the school_id

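Press F12 and you can see that the search requests this URL: https://gkcx.eol.cn/soudaxue/queryschool.html?&keyWord1=南京郵電大學. However, the response to that request contains no list of universities, so my guess was that a second data request fetches the list, which is then parsed and rendered into the page.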

Following the request flow, we do find such a request.

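Its response happens to be JSON containing the university information. So far so good: we only need to parse what we want out of this JSON and carry on with the later steps. Do pay attention to the Referer of this request.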
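The note about the Referer can be made concrete with a small sketch. It only builds a request object and does not send it. The endpoint `https://example.invalid/api` is a placeholder (the real URL has to be copied from the browser's network panel), the helper name `build_request` is hypothetical, and reusing `keyWord1` as the query parameter is an assumption carried over from the search page URL. The original script appears to use a different HTTP client; the standard library is used here only to keep the sketch dependency-free.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(api_url, params, referer):
    """Build (but do not send) a GET request that carries a Referer header."""
    query = urlencode(params)  # percent-encodes non-ASCII values as UTF-8
    return Request(f"{api_url}?{query}", headers={"Referer": referer})


# Placeholder endpoint and assumed parameter name; replace both with the
# values seen in the browser's network panel.
req = build_request(
    "https://example.invalid/api",
    {"keyWord1": "南京郵電大學"},
    "https://gkcx.eol.cn/soudaxue/queryschool.html",
)
print(req.full_url)
print(req.get_header("Referer"))
```

Passing `req` to `urllib.request.urlopen` would send it; whether the server rejects requests without the header is something to verify against the live site.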

Parsing this JSON runs into one small problem, though. The returned data looks like this:

({
 "totalRecord": {"num": "2"},
 "school":  [
    {
   "schoolid": "160",
   "schoolname": "南京郵電大學",
...
});

It is wrapped in ( and ); so it is not valid JSON. The wrapper has to be stripped before the JSON can be parsed:

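import json

# The returned data is wrapped in "(" ... ");", so strip the wrapper first:
# keep what precedes the first ");" and then what follows the first "("
text = response.text.split(');', 1)[0].split('(', 1)[1]
j = json.loads(text)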
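As a variation on the same idea, the sketch below slices between the first "(" and the last ")", which also copes with a ");" that happens to appear inside a string value. It then picks the school whose name matches exactly, since a search can return more than one school. The helper names `unwrap_json` and `pick_school` are hypothetical, and the second entry's `schoolid` in the sample is a made-up placeholder, not real data.

```python
import json


def unwrap_json(text):
    """Parse a payload wrapped as '( ... );' by slicing between the outer parentheses."""
    start = text.index("(") + 1
    end = text.rindex(")")
    return json.loads(text[start:end])


def pick_school(result, name):
    """Return the entry whose schoolname equals name exactly, or None."""
    for school in result.get("school", []):
        if school.get("schoolname") == name:
            return school
    return None


# Sample shaped like the response shown above; "000" is a placeholder id.
sample = (
    '({"totalRecord": {"num": "2"}, "school": ['
    '{"schoolid": "160", "schoolname": "南京郵電大學"}, '
    '{"schoolid": "000", "schoolname": "南京郵電大學通達學院"}]});'
)

result = unwrap_json(sample)
school = pick_school(result, "南京郵電大學")
print(school["schoolid"])
```

Matching on the exact name is only one possible policy; a script could equally list all matches and let the user choose.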

3 Fetching the score lines

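The school's detail page is https://gkcx.eol.cn/schoolhtm.... Same trick as before: the response after clicking through contains no score-line data, so I assumed another secondary request was involved, and sure enough I found it in the request flow.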

The two requests here return the university's score lines for each year and the score lines for each major, both as XML. Very good!

What remains is parsing the XML.

4 XML parsing

Here we use xml.etree.ElementTree to parse XML that looks like this:

<areapionts>
    <areapiont>
        <year>2017</year>
        <specialname>軟件工程（嵌入式培養）</specialname>
        <maxfs>369</maxfs>
        <varfs>366</varfs>
        <minfs>364</minfs>
        <pc>一批</pc>
        <stype>理科</stype>
    </areapiont>

Because the data is quite regular, parsing it is simple:

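import xml.etree.ElementTree as ET

# The root <areapionts> element iterates over its <areapiont> children
areapionts = ET.fromstring(response.text)
for areapiont in areapionts:
    print(areapiont.find('year').text)
    print(areapiont.find('specialname').text)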
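If the scores are going to be sorted or compared later, it may help to convert them to numbers while parsing. The following is a minimal sketch under some assumptions: the sample reuses the record shown above with the root element closed (the original excerpt is truncated), and `to_int` and `parse_scores` are hypothetical helper names. A missing or non-numeric field becomes None instead of raising.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<areapionts>
    <areapiont>
        <year>2017</year>
        <specialname>軟件工程（嵌入式培養）</specialname>
        <maxfs>369</maxfs>
        <varfs>366</varfs>
        <minfs>364</minfs>
        <pc>一批</pc>
        <stype>理科</stype>
    </areapiont>
</areapionts>"""


def to_int(value):
    """Return value as an int, or None when it is missing or not a whole number."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def parse_scores(xml_text):
    """Turn each <areapiont> element into a dict with numeric score fields."""
    records = []
    for item in ET.fromstring(xml_text):
        records.append({
            "year": to_int(item.findtext("year")),
            "specialname": item.findtext("specialname", default=""),
            "maxfs": to_int(item.findtext("maxfs")),
            "varfs": to_int(item.findtext("varfs")),
            "minfs": to_int(item.findtext("minfs")),
            "pc": item.findtext("pc", default=""),
            "stype": item.findtext("stype", default=""),
        })
    return records


records = parse_scores(SAMPLE)
print(records[0]["year"], records[0]["minfs"])
```

`findtext` returns None for an absent child, which is why `to_int` catches TypeError as well as ValueError.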

5 Writing to Excel

Writing to Excel relies on the openpyxl module.

A simple openpyxl usage example

>>> import openpyxl
>>> wb = openpyxl.Workbook()
# A new workbook starts with one sheet
>>> wb.sheetnames
['Sheet']
# Create a sheet
>>> wb.create_sheet(index=0,title='First')
<Worksheet "First">
# List all sheets
>>> wb.sheetnames
['First', 'Sheet']
# Delete a sheet
>>> wb.remove(wb['Sheet'])
>>> wb.sheetnames
['First']
>>> sheet = wb['First']
# Set cells by coordinate name
>>> sheet['A1'] = '省份'
>>> sheet['B1'] = '學校'
# Set a cell by row and column index
>>> sheet.cell(1,3).value='test'
>>> wb.save('test.xlsx')

Parsing the XML and writing it to Excel

def gen_excel(school, xml, wb):
    # Sheet title: "historical admission score lines by major"
    sheet = wb.create_sheet(title='各專業歷年錄取分數線')
    sheet.column_dimensions['B'].width = 40
    # Header row: year, major, highest, average, lowest, batch, subject track
    sheet['A1'] = '年份'
    sheet['B1'] = '專業'
    sheet['C1'] = '最高分'
    sheet['D1'] = '平均分'
    sheet['E1'] = '最低分'
    sheet['F1'] = '批次'
    # Column G holds <stype> (e.g. 理科), so label it as the subject track
    sheet['G1'] = '科類'

    areapionts = ET.fromstring(xml)
    row = 1  # row 1 holds the headers; data starts at row 2
    for areapiont in areapionts:
        row += 1
        sheet.cell(row, 1).value = areapiont.find('year').text
        sheet.cell(row, 2).value = areapiont.find('specialname').text
        sheet.cell(row, 3).value = areapiont.find('maxfs').text
        sheet.cell(row, 4).value = areapiont.find('varfs').text
        sheet.cell(row, 5).value = areapiont.find('minfs').text
        sheet.cell(row, 6).value = areapiont.find('pc').text
        sheet.cell(row, 7).value = areapiont.find('stype').text
    wb.save('{}.xlsx'.format(school['schoolname']))
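One possible way to shorten the function above is to describe the columns once in a table and build plain rows from it, then hand each row to the worksheet's `append` method. This is only a sketch: `COLUMNS`, `xml_to_rows`, and `save_rows` are hypothetical names, the header used for `stype` is my own label, and openpyxl is imported lazily so the row-building part runs without it.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping of XML tag -> column header, in output order.
COLUMNS = [
    ("year", "年份"),
    ("specialname", "專業"),
    ("maxfs", "最高分"),
    ("varfs", "平均分"),
    ("minfs", "最低分"),
    ("pc", "批次"),
    ("stype", "科類"),  # assumed label for the subject track
]


def xml_to_rows(xml_text):
    """Return a header row followed by one row per <areapiont> element."""
    rows = [[header for _, header in COLUMNS]]
    for item in ET.fromstring(xml_text):
        # A missing child element becomes an empty cell instead of an error
        rows.append([item.findtext(tag, default="") for tag, _ in COLUMNS])
    return rows


def save_rows(rows, path):
    """Write rows to a new workbook; requires the third-party openpyxl package."""
    import openpyxl  # imported here so xml_to_rows works without openpyxl

    wb = openpyxl.Workbook()
    sheet = wb.active
    for row in rows:
        sheet.append(row)
    wb.save(path)
```

A call such as `save_rows(xml_to_rows(xml), '{}.xlsx'.format(school['schoolname']))` would then stand in for the cell-by-cell loop; adding or reordering a column means editing only the `COLUMNS` table.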

Result

$ python gkcx.py
Please the school name：南京郵電大學
共檢索到 2 個高校：['南京郵電大學', '南京郵電大學通達學院']
數據獲取完成，已下載到腳本目錄

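The result looks decent, but there is still a problem: score lines differ from province to province, and what is retrieved here by default is the score line for the province where the school is located. Getting the score lines for other provinces needs further work; interested readers may want to try it themselves. Reply 「高考」 in the official account chat to get the source code.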

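If you find this useful, feel free to follow my WeChat account so we can learn and improve together; book giveaways are run from time to time.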
