Python Crawler in Practice (2): A Targeted Stock Data Crawler

Date: 2019-11-09


Overview

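Goal: get the names and trading information of all stocks listed on the Shanghai and Shenzhen stock exchanges.
Output: save the results to a file.
Technical approach: requests, bs4, re
Language: Python 3.5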

Notes

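Site selection criteria: the stock information must exist statically in the HTML page rather than being generated by JavaScript, and the site must not restrict crawling through the Robots protocol.
Selection method: open the page, view its source, and search for the displayed stock price to see whether it is present in the source.
For example, open the Sina stock page, as shown in the figure below: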

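In the figure above, the left side shows the rendered page, where the price of Tianshan Co. (天山股份) is 13.06. The right side shows the page source; searching it for 13.06 finds nothing. The page's data is therefore generated by JavaScript, which makes it unsuitable for this project, so we try another site.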
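The same check can be done in code rather than by eye. The following is a minimal sketch, assuming the page source is already available as a string; the function name value_in_source and both sample fragments are hypothetical and exist only to illustrate the two cases:

```python
def value_in_source(html: str, displayed_value: str) -> bool:
    """Return True if the value seen in the browser also appears in the raw HTML."""
    return displayed_value in html


# Hypothetical fragments: one with the price embedded, one filled in later by a script.
static_page = '<div class="price"><strong>13.06</strong></div>'
dynamic_page = '<div class="price" id="p"></div><script src="quote.js"></script>'

print(value_in_source(static_page, "13.06"))   # True
print(value_in_source(dynamic_page, "13.06"))  # False
```

A page that behaves like the second fragment needs a different approach than plain HTML parsing.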

Next, open the Baidu Stock (百度股票) page, as shown in the figure below:

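The figure above shows that Baidu Stock's data is present directly in the HTML source, which meets the requirements of this project, so we use Baidu Stock as the source of per-stock data.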
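The other selection criterion, the Robots protocol, can be checked with Python's standard urllib.robotparser module. The sketch below parses a made-up robots.txt (the rules and the example.com URLs are assumptions for illustration, not the rules of any real site) and asks whether a given URL may be fetched:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(sample_rules)

allowed = parser.can_fetch("*", "https://example.com/stock/sz300023.html")
blocked = parser.can_fetch("*", "https://example.com/private/data.html")
print(allowed, blocked)
```

For a live site, calling set_url() with the site's robots.txt address and then read() loads the real rules instead of a hand-written list.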

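Because Baidu Stock only provides pages for individual stocks, we also need a list of all stocks currently on the market. For that we use Eastmoney (東方財富網); its stock list page is shown in the figure below: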

How It Works

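Look at the URL of an individual stock on Baidu Stock: https://gupiao.baidu.com/stock/sz300023.html. The number 300023 in the URL is that stock's code, and sz stands for the Shenzhen Stock Exchange. The program is therefore structured as follows: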

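Step 1: get the stock list from Eastmoney.
Step 2: take each stock code in turn, append it to the Baidu Stock base URL, and visit each resulting URL to get that stock's information.
Step 3: save the results to a file.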
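The URL pattern described above can be sketched as two small helpers. The names BASE_URL, EXCHANGES, build_stock_url and split_code are hypothetical; only the base URL and the sz/sh prefixes come from the article:

```python
BASE_URL = "https://gupiao.baidu.com/stock/"

# Exchange prefixes as described in the text.
EXCHANGES = {"sz": "Shenzhen", "sh": "Shanghai"}


def build_stock_url(code: str) -> str:
    """Build the page URL for a prefixed code such as 'sz300023'."""
    return BASE_URL + code + ".html"


def split_code(code: str) -> tuple:
    """Split 'sz300023' into its exchange name and six-digit number."""
    return EXCHANGES[code[:2]], code[2:]


print(build_stock_url("sz300023"))
print(split_code("sz300023"))
```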

Next, look at the source of an individual stock page on Baidu Stock. Each stock's information is stored in the HTML as shown below:

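So when we store each stock's information, we can follow the way the HTML in the figure above organizes it. Each field name corresponds to one value, that is, the data is stored as key-value pairs. In Python, key-value pairs map naturally onto the dict type. This project therefore stores each stock's information in a dict, uses a dict to record the information for all stocks, and finally writes the data to a file.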
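As a rough illustration of this storage scheme, the sketch below keeps one dict per stock inside an outer dict keyed by stock code. The field names and values are invented for the example and do not come from the site:

```python
# Hypothetical field names and values, for illustration only.
stock_info = {"name": "ExampleCo", "open": "13.00", "high": "13.20"}

# An outer dict keyed by stock code collects the per-stock dicts.
all_stocks = {}
all_stocks["sz300023"] = stock_info

for code, info in all_stocks.items():
    print(code, info["name"], info["open"])
```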

Writing the Code

First comes the function that fetches a page's HTML. It needs little explanation; the code is as follows:

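import requests

# Fetch the HTML text of a page; return "" if the request fails
def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()                # raise on 4xx/5xx responses
        r.encoding = r.apparent_encoding    # guess the encoding from the content
        return r.text
    except requests.RequestException:
        return ""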

Next comes the HTML parsing code. The first page to parse is the Eastmoney stock list. Its source is shown in the figure below:

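As the figure above shows, the href attribute of each a tag contains a link that includes the stock's code, so all we need to do is extract the code from each link. The steps are as follows.
Step 1: fetch the page: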

html = getHTMLText(stockURL)

Step 2: parse the page and find all a tags:

soup = BeautifulSoup(html, 'html.parser') 
a = soup.find_all('a')

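Step 3: iterate over the a tags and process each one, as follows.
1. Read the tag's href attribute and extract the stock code from the link, using a regular expression. Codes on the Shenzhen exchange begin with sz, codes on the Shanghai exchange begin with sh, and each code has six digits, so the regular expression can be written as [s][hz]\d{6}. In other words, we search each link for a string that matches this expression and extract it. The code is as follows: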

for i in a:
    href = i.attrs['href']
    lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
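To see what the pattern returns, here is a small sketch run on a few sample links. The href values are assumptions written for this example; the real links on the list page may look different:

```python
import re

PATTERN = r"[s][hz]\d{6}"

# Hypothetical href values; the real links on the list page may differ.
hrefs = [
    "http://quote.eastmoney.com/sz300023.html",
    "http://quote.eastmoney.com/sh600000.html",
    "http://quote.eastmoney.com/center/",
]

for href in hrefs:
    matches = re.findall(PATTERN, href)
    print(href, "->", matches)
```

The last link contains no stock code, so re.findall returns an empty list, and indexing that list with [0] would raise an IndexError.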

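2. The HTML contains many a tags. Some of them have no href attribute, and others have an href that contains no stock code, so the code above raises an exception when it runs. We therefore wrap it in try...except to handle the exception: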

for i in a:
    try:
        href = i.attrs['href']
        lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
    except (KeyError, IndexError):
        # no href attribute, or no stock code in the link
        continue

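As the code above shows, when an exception occurs we use continue to skip that tag and move on to the next one. With this code we can collect all the stock codes listed on Eastmoney.
Wrapping the code above in a function gives the complete code for parsing the Eastmoney page: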

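# Collect stock codes such as 'sz300023' from the list page into lst
def getStockList(lst, stockURL):
    html = getHTMLText(stockURL)
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except (KeyError, IndexError):
            # no href attribute, or no stock code in the link
            continue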

Next we get the information for a single stock from Baidu Stock. First look at the page source, shown in the figure below:

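The stock's information sits in the HTML shown in the figure above, so we need to parse that HTML. The process is as follows.
1. The base URL of Baidu Stock is https://gupiao.baidu.com/stock/
The URL of one stock's page is https://gupiao.baidu.com/stock/sz300023.html
So each page URL is the base URL plus the stock code plus ".html". The codes have already been extracted from Eastmoney by getStockList, so we only need to iterate over the list that getStockList filled in: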

for stock in lst:
    url = stockURL + stock + ".html"

2. With the URL in hand, fetch the page's HTML:

html = getHTMLText(url)

3. With the HTML in hand, parse it. As the figure above shows, a stock's information is held in a div tag whose class is stock-bets, so we locate that element:

soup = BeautifulSoup(html, 'html.parser')
stockInfo = soup.find('div',attrs={'class':'stock-bets'})

4. The stock name is inside the element whose class is bets-name. Parse it and store it in the dict:

infoDict = {}
name = stockInfo.find_all(attrs={'class':'bets-name'})[0]
infoDict.update({'股票名稱': name.text.split()[0]})

split() splits the text on whitespace, and [0] keeps only the first piece, which drops whatever follows the stock name.
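A quick sketch of that behaviour, using an invented string in place of the real element text (the assumption here is that the name is followed by whitespace and further text such as a code in parentheses):

```python
# Hypothetical text of the bets-name element: a name, then a code in parentheses.
raw_name = "\n  ExampleCo (300023)\n"

parts = raw_name.split()
print(parts)       # ['ExampleCo', '(300023)']
print(parts[0])    # ExampleCo
```

Called with no arguments, split() also discards leading and trailing whitespace, so no separate strip() is needed.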

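5. The HTML also shows that the rest of the stock's information is held in dt and dd tags, where each dt holds a field name (the key) and each dd holds the corresponding value. Get all the keys and values: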

keyList = stockInfo.find_all('dt')
valueList = stockInfo.find_all('dd')

Then store the keys and values in the dict as key-value pairs:

for i in range(len(keyList)):
    key = keyList[i].text
    val = valueList[i].text
    infoDict[key] = val
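An alternative to indexing, offered as a sketch rather than as the article's own code, is to pair the two lists with zip. Plain lists with invented field names stand in for the dt and dd texts here:

```python
# Hypothetical field names and values standing in for the dt and dd texts.
keys = ["open", "high", "low"]
values = ["13.00", "13.20"]          # one value missing on purpose

info = {}
for key, val in zip(keys, values):
    info[key] = val

print(info)
```

zip stops at the shorter list, so a page with more dt tags than dd tags yields a partial dict instead of raising an IndexError. Whether a partial record is acceptable depends on how the output will be used.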

6. Finally, write the dict to an external file:

with open(fpath, 'a', encoding='utf-8') as f:
    f.write(str(infoDict) + '\n')
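Because each line of the file is the str() of a dict, the file can later be read back with ast.literal_eval from the standard library. The sketch below assumes that one-dict-per-line format; the records and the temporary file path are invented for the example:

```python
import ast
import os
import tempfile

# Hypothetical records, for illustration only.
records = [{"name": "ExampleCo", "open": "13.00"}, {"name": "OtherCo", "open": "7.50"}]

# Write one dict per line, the same way as the step above.
fpath = os.path.join(tempfile.mkdtemp(), "stocks.txt")
for record in records:
    with open(fpath, "a", encoding="utf-8") as f:
        f.write(str(record) + "\n")

# Read the file back: each non-empty line is a Python dict literal.
with open(fpath, encoding="utf-8") as f:
    loaded = [ast.literal_eval(line.strip()) for line in f if line.strip()]

print(loaded == records)
```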

Wrapping the steps above in a complete function gives:

def getStockInfo(lst, stockURL, fpath):
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html=="":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div',attrs={'class':'stock-bets'})
 
            name = stockInfo.find_all(attrs={'class':'bets-name'})[0]
            infoDict.update({'股票名稱': name.text.split()[0]})
             
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val
             
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write( str(infoDict) + '\n' )
        except Exception:
            # skip pages that lack the expected structure
            continue

Here try...except handles exceptions, so a page that cannot be parsed is skipped.
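The parsing steps can be tried without touching the network by running them on a hand-written fragment. In the sketch below, SAMPLE_HTML only mimics the structure described above (a stock-bets div, a bets-name element, dt/dd pairs) and is not copied from the site; the function name parse_stock_info and the "name" key are likewise hypothetical:

```python
from bs4 import BeautifulSoup

# Hand-written fragment that mimics the structure described above (an assumption).
SAMPLE_HTML = """
<div class="stock-bets">
  <a class="bets-name">ExampleCo (<span>300023</span>)</a>
  <dl><dt>open</dt><dd>13.00</dd></dl>
  <dl><dt>high</dt><dd>13.20</dd></dl>
</div>
"""


def parse_stock_info(html):
    """Return a dict of one stock's fields, or None if the block is missing."""
    soup = BeautifulSoup(html, "html.parser")
    stock_info = soup.find("div", attrs={"class": "stock-bets"})
    if stock_info is None:
        return None
    name = stock_info.find(attrs={"class": "bets-name"})
    info = {"name": name.text.split()[0]}
    for dt, dd in zip(stock_info.find_all("dt"), stock_info.find_all("dd")):
        info[dt.text] = dd.text
    return info


print(parse_stock_info(SAMPLE_HTML))
print(parse_stock_info("<html></html>"))
```

Keeping the parsing in a function that takes HTML and returns a dict separates it from downloading and file writing, which makes it easier to test on saved pages.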

Next, write the main function, which simply calls the functions above:

def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist=[]
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)

Complete Program

# -*- coding: utf-8 -*-

import requests
from bs4 import BeautifulSoup
import re
 
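# Fetch the HTML text of a page; return "" if the request fails
def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()                # raise on 4xx/5xx responses
        r.encoding = r.apparent_encoding    # guess the encoding from the content
        return r.text
    except requests.RequestException:
        return ""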
 
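# Collect stock codes such as 'sz300023' from the list page into lst
def getStockList(lst, stockURL):
    html = getHTMLText(stockURL)
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except (KeyError, IndexError):
            # no href attribute, or no stock code in the link
            continue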
 
# Fetch each stock's page, parse it, and append one dict per line to fpath
def getStockInfo(lst, stockURL, fpath):
    count = 0
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})

            # '股票名稱' means "stock name"
            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            infoDict.update({'股票名稱': name.text.split()[0]})

            # dt tags hold the field names, dd tags hold the values
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val

            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
        except Exception:
            # skip pages that lack the expected structure
            continue
        finally:
            # runs on every iteration, including skipped ones, so progress reaches 100%
            count = count + 1
            print("\rCurrent progress: {:.2f}%".format(count * 100 / len(lst)), end="")
 
def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist=[]
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)
 
if __name__ == "__main__":
    main()

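The print output in the code above shows the crawl's progress. After the program finishes, the file BaiduStockInfo.txt appears on drive D, containing the stock information.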
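The progress display relies on the carriage return character \r, which moves the cursor back to the start of the line so that each message overwrites the previous one. A minimal sketch, with a hypothetical helper name progress_line and an arbitrary total of 8 items:

```python
def progress_line(done: int, total: int) -> str:
    """Format a progress message that overwrites the current console line."""
    return "\rCurrent progress: {:.2f}%".format(done * 100 / total)


total = 8
for done in range(1, total + 1):
    print(progress_line(done, total), end="")
print()  # move to a new line once the loop finishes
```

Passing end="" keeps print from adding a newline, which is what lets the next message land on the same line.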