Python Web Scraping and Data Visualization Practice: Is 'Golden March, Silver April' Real?

Date: 2020-03-21

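A year's plan starts in spring, but in spring 2020 the pandemic may have upended many people's plans. March and April are traditionally one of the peak hiring seasons for companies, and plenty of young men who visited their future mothers-in-law over the Lunar New Year come back under pressure to buy a home. Both the job market and the housing market are said to have a 'golden March, silver April' (金三銀四). But is that actually true?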

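I recently picked up Python again (why 'again'? Because I forget it as soon as I learn it...), and figured I might as well run a quick check. After all, data doesn't lie.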

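The overall process:

Pick the housing market as the subject of analysis, since it is somewhat related to my current company's business.

Fetch the publicly available transaction statistics for newly built commercial housing from the website of the Wuhan Housing Security and Housing Administration Bureau (武漢市住房保障和房屋管理局).

Read the data, visualize it, and draw a preliminary conclusion from a brief look at the chart.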

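First, the final chart:

[Figure: monthly transaction counts of commercial housing in Wuhan, one line each for 2018 and 2019]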

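1. Fetching the data

First, requests (the 'HTTP library designed for humans') fetches the HTML pages containing the public transaction statistics from the bureau's website. The statistics are published in two forms: daily and monthly. Then lxml, an HTML and XML processing library, parses the page content, and a suitable XPath expression, worked out by inspecting the page, extracts the required data.

My original idea was to read the daily data and compute each month's figures from it. Only after the crawl finished did I notice that the monthly statistics sit right below on the same index page (laugh-cry.jpg). However, the monthly data had only been published up to November 2019, which falls short of two full years, so I used the daily data (published up to 23 January 2020) to compute December 2019. No road in life is walked in vain after all :)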

import requests
import lxml.html
import html
import time

import db_operator

def get_all_monthly_datas():
    """Fetch all monthly transaction data."""
    # Index page (monthly transaction statistics for commercial housing sales)
    index_url = 'http://fgj.wuhan.gov.cn/spzfxysydjcjb/index.jhtml'
    max_page = get_max_page(index_url)
    if max_page > 0:
        print('{} pages in total, fetching monthly data..\n'.format(max_page))
        for index in range(1, max_page + 1):
            # Pages after the first follow the index_<n>.jhtml pattern
            if index >= 2:
                index_url = 'http://fgj.wuhan.gov.cn/spzfxysydjcjb/index_' + str(index) + '.jhtml'
            detail_urls = get_detail_urls(index, index_url)
            for detail_url in detail_urls:
                print('Fetching monthly statistics detail: ' + detail_url)
                monthly_detail_html_str = request_html(detail_url)
                if monthly_detail_html_str:
                    # find_and_insert_monthly_datas() is not shown in this article
                    find_and_insert_monthly_datas(monthly_detail_html_str)
    else:
        print('Total page count is 0.')


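def request_html(target_url):
    """Request the page at the given URL and return its HTML ('' if the request fails)."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Mobile Safari/537.36',
    }
    try:
        # A timeout keeps the crawl from hanging on an unresponsive page
        html_response = requests.get(target_url, headers=headers, timeout=10)
        html_response.raise_for_status()
    except requests.RequestException as error:
        print('Request failed: ' + target_url + ' (' + str(error) + ')')
        return ''
    # bytes.decode() defaults to UTF-8; it is spelled out here for clarity
    return html_response.content.decode('utf-8')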


def get_max_page(index_url) -> int:
    """Read the total number of pages from the index page."""
    print('Fetching total page count..')
    index_html_str = request_html(index_url)
    if not index_html_str:
        return 0
    selector = lxml.html.fromstring(index_html_str)
    max_page_xpath = '//div[@class="whj_padding whj_color pages"]/text()'
    result = selector.xpath(max_page_xpath)
    if result:
        # Strip line breaks and tabs, keep the text before the first
        # non-breaking space, then take the part after the '/'
        index_str = result[0].replace('\r', '').replace('\n', '').replace('\t', '')
        max_page = index_str.split('\xa0')[0]
        max_page = max_page.split('/')[1]
        return int(max_page)
    return 0


def get_detail_urls(index, index_url):
    """Return the list of detail-page URLs found on one index page."""
    print('Fetching statistics list page: ' + index_url + '\n')
    index_html_str = request_html(index_url)
    if not index_html_str:
        return []
    selector = lxml.html.fromstring(index_html_str)
    # Extract the URL list.
    # Open question: '//div[@class="fr hers"]/ul/li/a/@href' should select the data more precisely, but it returns nothing
    detail_urls_xpath = '//div/ul/li/a/@href'
    detail_urls = selector.xpath(detail_urls_xpath)
    return detail_urls

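The page-count parsing in get_max_page() relies on chained replace() and split() calls. As a rough alternative sketch, the same idea can be written with a regular expression. The sample pager text below is hypothetical, modelled only on what the split logic seems to expect (a "current/total" pair followed by non-breaking spaces), and parse_max_page is an illustrative name rather than part of the original code.

```python
import re

# HYPOTHETICAL pager text: the real page content is not shown in the article.
SAMPLE_PAGER_TEXT = '\r\n\t1/12\xa0\xa0first\xa0prev\xa0next\xa0last'


def parse_max_page(pager_text: str) -> int:
    """Return the total page count from text like '1/12', or 0 if none is found."""
    match = re.search(r'(\d+)\s*/\s*(\d+)', pager_text)
    if match is None:
        return 0
    # Group 1 is the current page, group 2 the total
    return int(match.group(2))


print(parse_max_page(SAMPLE_PAGER_TEXT))
print(parse_max_page('no pager here'))
```

Compared with indexing into split() results, this version returns 0 instead of raising IndexError when the pager text has an unexpected shape.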

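2. Saving the data

Once fetched, the data needs to be saved for later processing, incremental updates and so on. Here I store it in MongoDB, a document database that gets along well with Python.

Pitfall: on macOS, many of the MongoDB installation guides online are out of date; follow mongodb/homebrew-brew instead.

With the service started, data can be written: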

from pymongo import MongoClient
from pymongo.collection import Collection
from pymongo.database import Database

# Connect to the local MongoDB instance on the default host and port
client: MongoClient = MongoClient()
db_name: str = 'housing_deal_data'
col_daily_name: str = 'wuhan_daily'
col_monthly_name: str = 'wuhan_monthly'
# Named db so the variable does not shadow the pymongo.database module
db: Database = client[db_name]
col_daily: Collection = db[col_daily_name]
col_monthly: Collection = db[col_monthly_name]


def insert_monthly_data(year_month, monthly_commercial_house):
    """Write one month of statistics, or clean up the stored row if the value is invalid."""
    query = {'year_month': year_month}
    existed_row = col_monthly.find_one(query)
    try:
        monthly_commercial_house_value = int(monthly_commercial_house)
    except (TypeError, ValueError):
        # The scraped value is not a valid integer: treat it as dirty data
        if existed_row:
            print('Monthly data already exists =>')
            col_monthly.delete_one(query)
            print('Deleted: monthly transaction count is not as expected.\n')
        else:
            print('Ignored: monthly transaction count is not as expected.\n')
    else:
        print(str({year_month: monthly_commercial_house_value}))
        item = {'year_month': year_month,
                'commercial_house': monthly_commercial_house_value}
        if existed_row:
            print('Monthly data already exists =>')
            new_values = {'$set': item}
            result = col_monthly.update_one(query, new_values)
            print('Updated: ' + str(item) + '\n' + 'result: ' + str(result) + '\n')
        else:
            result = col_monthly.insert_one(item)
            print('Inserted: ' + str(item) + '\n' + 'result: ' + str(result) + '\n')


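Because the extraction rules were not strict enough early on, some dirty data got written during development. So besides the normal insert and update paths, there is a try-except branch that cleans that dirty data up.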
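The write logic described above boils down to a four-way decision: insert, update, delete, or skip. The sketch below isolates that decision as a pure function so it can be checked without a running MongoDB. The name plan_write and its return strings are made up for illustration and do not appear in the original code.

```python
def plan_write(raw_value, already_stored: bool) -> str:
    """Decide what to do with one scraped monthly value.

    Returns one of 'insert', 'update', 'delete' or 'skip'.
    """
    try:
        int(raw_value)
    except (TypeError, ValueError):
        # Not a valid integer: remove a stored dirty row, otherwise ignore it
        return 'delete' if already_stored else 'skip'
    # Valid integer: update the existing row or insert a new one
    return 'update' if already_stored else 'insert'


for raw, stored in [('1234', False), ('1234', True), ('--', True), (None, False)]:
    print(repr(raw), stored, '->', plan_write(raw, stored))
```

For valid values, the separate insert and update branches could probably be merged into a single pymongo call, `update_one(query, {'$set': item}, upsert=True)`, leaving only the cleanup branch as a special case.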

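3. Reading the data

After the fetch-and-save run finished, I inspected the result with Robo 3T, a MongoDB GUI tool, and confirmed that the data was complete overall and largely as expected.

Next, read the data back from the database: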

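def read_all_monthly_datas():
    """Read all monthly statistics from the database."""
    # The keys double as chart legend labels ('2018年' means 'year 2018')
    return {"2018年": read_monthly_datas('2018'),
            "2019年": read_monthly_datas('2019')}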


def read_monthly_datas(year: str) -> dict:
    """Read one year's monthly statistics, keyed by zero-padded month ('01'..'12')."""
    query = {'year_month': {'$regex': '^' + year}}
    result = col_monthly.find(query).sort('year_month').limit(12)

    monthly_datas = {}
    for data in result:
        year_month = data['year_month']
        commercial_house = data['commercial_house']
        if commercial_house > 0:
            month_key = year_month.split('-')[1]
            monthly_datas[month_key] = commercial_house

    # Fewer than 12 entries means some months are missing:
    # try to compute those months from the daily data
    if len(monthly_datas) < 12:
        for month in range(1, 13):
            month_key = "{:0>2d}".format(month)
            if month_key not in monthly_datas:
                print('{}-{}: data missing..'.format(year, month_key))
                commercial_house = get_month_data_from_daily_datas(year, month_key)
                if commercial_house > 0:
                    monthly_datas[month_key] = commercial_house
    # Sort by month so that months filled in from daily data are plotted in calendar order
    return dict(sorted(monthly_datas.items()))


def get_month_data_from_daily_datas(year: str, month: str) -> int:
    """Compute one month's total from the daily data."""
    print('Computing {}-{} from daily data..'.format(year, month))
    query = {'year_month_day': {'$regex': '^({}-{})'.format(year, month)}}
    result = col_daily.find(query).limit(31)
    total = 0  # named total to avoid shadowing the built-in sum()
    for daily_data in result:
        total += daily_data['commercial_house']
    print('{}-{} total: {}'.format(year, month, total))
    return total


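As you can see, the method that reads monthly data also checks whether the data is complete and, when a month is missing, falls back to computing it from the daily data.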
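The daily-to-monthly fallback can also be illustrated without a database. This is a minimal sketch that groups daily records by month in plain Python. The field names year_month_day and commercial_house come from the code above, but the 'YYYY-MM-DD' date format is an assumption inferred from the regex query, and the sample records are invented.

```python
from collections import defaultdict

# INVENTED sample records; only the field names come from the article's code.
sample_daily = [
    {'year_month_day': '2019-12-01', 'commercial_house': 100},
    {'year_month_day': '2019-12-02', 'commercial_house': 150},
    {'year_month_day': '2020-01-02', 'commercial_house': 80},
]


def monthly_totals(daily_records):
    """Sum daily counts into a dict keyed by 'YYYY-MM'."""
    totals = defaultdict(int)
    for record in daily_records:
        # Assumes dates look like 'YYYY-MM-DD', so the first 7 characters are the month
        year_month = record['year_month_day'][:7]
        totals[year_month] += record['commercial_house']
    return dict(totals)


print(monthly_totals(sample_daily))
```

Grouping everything in one pass avoids issuing a separate query per missing month, at the cost of loading all daily records into memory first.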

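4. Visualizing the data

Since this is just an exercise to get a rough look at the overall trend, I did not aim for anything elaborate and drew a simple chart with the matplotlib plotting library: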

import matplotlib.pyplot as plt
import html_spider
import db_operator

def generate_plot(all_monthly_datas):
    """Generate the chart."""
    # Work around Chinese characters not rendering correctly
    # (the SimHei.ttf font also has to be downloaded manually and placed under /venv/lib/python3.7/site-packages/matplotlib/mpl-data/fonts)
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.rcParams['font.family'] = 'sans-serif'

    # Draw one line per year
    fig, ax = plt.subplots()
    ax.set_title("商品住宅成交統計數據（武漢）", fontsize=20)  # 'Commercial housing transaction statistics (Wuhan)'
    ax.set_ylabel("成交量", fontsize=14)  # 'Transactions'
    ax.set_xlabel("月份", fontsize=14)  # 'Month'
    for year, monthly_datas in all_monthly_datas.items():
        ax.plot(list(monthly_datas.keys()), list(monthly_datas.values()), label=year)
    ax.legend()
    plt.show()


# Scrape the web pages (and write the results to the database)
# html_spider.get_all_daily_datas()
html_spider.get_all_monthly_datas()
# Read the data back and generate the chart
generate_plot(db_operator.read_all_monthly_datas())

Running this produces the chart shown at the beginning.

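5. Brief analysis

Looking at the curves for the past two years, the chart shows at a glance that in both years transactions rise through the first half. As the mother-in-law pressure eases towards mid-year (those who had to buy have bought, and those who haven't are in no hurry), the numbers fall back, then climb again in the second half as a fresh wave of people preparing to meet their mothers-in-law joins in. In detail, February is the lowest month of the year (presumably because of the Lunar New Year holiday). After that the numbers rise steadily until around August, dip in September, and then rise again. The exception is July 2018, which also shows a clear dip; it would be worth checking whether a policy change at the time, such as adjustments to lending rules, was responsible.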

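Looking at March and April specifically, both fall in the rising phase, but the year's peaks actually come at year end and mid-year respectively. So if 'golden March, silver April' is read as a recovery, the saying has some basis; if it is read as the peak season, not quite.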
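The two readings of the saying, recovery versus peak, can be turned into explicit checks. The sketch below is only an illustration: the monthly counts are made up and are not the Wuhan figures, and summarize_year is a hypothetical helper that takes a dict keyed by zero-padded month.

```python
# MADE-UP monthly counts for illustration only; these are NOT the Wuhan figures.
made_up_year = {
    '01': 900, '02': 500, '03': 800, '04': 1000, '05': 1200, '06': 1500,
    '07': 1300, '08': 1400, '09': 1100, '10': 1200, '11': 1300, '12': 1600,
}


def summarize_year(monthly):
    """Summarize one year of counts keyed by zero-padded month ('01'..'12')."""
    months = sorted(monthly)
    peak = max(months, key=lambda m: monthly[m])
    low = min(months, key=lambda m: monthly[m])
    # Months whose count is higher than the previous month's count
    rising = [cur for prev, cur in zip(months, months[1:])
              if monthly[cur] > monthly[prev]]
    return {
        'peak': peak,
        'low': low,
        'march_april_rising': '03' in rising and '04' in rising,
        'march_april_is_peak': peak in ('03', '04'),
    }


print(summarize_year(made_up_year))
```

With numbers shaped like these, March and April both count as rising months while the peak falls elsewhere, which is the same split verdict described above.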

In the end I did not reach a firm true-or-false conclusion. Perhaps many things really don't have a clear-cut answer :)