[Notes] Scraping stock data with Python + BeautifulSoup

[I have only just started writing these summaries; if you have any suggestions about the content, feel free to leave a comment or add me on QQ 1172617666. I look forward to the exchange.]

The code is posted first; after it I walk through the problems I ran into while writing it and how I solved them.

What the code does: it visits http://vip.stock.finance.sina.com.cn/corp/go.php/vMS_MarketHistory/stockid/600000.phtml and scrapes all the quarterly historical data for that stock code, saving each quarter as a .json file (which you can open in Notepad). I save the files under C:\Users\ZSH\Desktop\Python\DATA; replace that path with your own.
# coding: utf-8
'''
Created on 2014-03-20

@author: ZSH
'''
import os
import json
import urllib.request
from bs4 import BeautifulSoup

BASE_URL = 'http://vip.stock.finance.sina.com.cn/corp/go.php/vMS_MarketHistory/stockid/'
SAVE_DIR = r'C:\Users\ZSH\Desktop\Python\DATA'


def get_year_range(code):
    # Fetch the stock's history page and read the <select name="year">
    # dropdown to find which years of data are available.
    url = BASE_URL + str(code) + '.phtml'
    content = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(content, 'html.parser')
    year_select = soup.find('select', attrs={'name': 'year'})
    return [option.string for option in year_select.findAll('option')]


def get_data(code):
    yearlist = get_year_range(code)
    for year in yearlist:
        for season in range(1, 5):
            try:
                jidu = str(season)  # jidu = quarter number, 1-4
                url = (BASE_URL + str(code) + '.phtml?year=' + year
                       + '&jidu=' + jidu)
                html = urllib.request.urlopen(url).read()
                # The page is served as GB2312, so tell BeautifulSoup
                # the encoding explicitly.
                soup = BeautifulSoup(html, 'html.parser',
                                     from_encoding='GB2312')
                table = soup.find('table', attrs={'id': 'FundHoldSharesTable'})
                rows = table.findAll('tr')
                # Row 0 is the table title; row 1 holds the column headers.
                headers = [td.get_text(strip=True)
                           for td in rows[1].findAll('td')]
                d1 = {}
                for row in rows[2:]:
                    cells = row.findAll('td')
                    for header, cell in zip(headers, cells):
                        d1.setdefault(header, []).append(
                            cell.get_text(strip=True))
                filename = os.path.join(
                    SAVE_DIR,
                    rows[0].get_text(strip=True) + year + '年'
                    + jidu + '季度.json')
                with open(filename, 'w', encoding='utf-8') as f:
                    f.write(json.dumps(d1, ensure_ascii=False))
                print('Saved ' + filename)
            except Exception as e:
                print('Error for year %s quarter %s: %s' % (year, jidu, e))
                continue
    print('Scraping finished!')


get_data(600000)



1. Setting up a Python environment on Windows. My setup is MyEclipse + PyDev, following a Python environment setup tutorial. I find MyEclipse a very capable IDE that is easy to pick up. For the basics of Python functions, for statements and other core syntax, I recommend two documents: one is "A Byte of Python" (the Chinese edition, 「Python簡明教程」), which is plain and easy to follow; the other is the reference documentation under C:\Python34\Doc.

2. The third-party module this script depends on is beautifulsoup4, which the line `from bs4 import BeautifulSoup` pulls in. It is used to locate the table region in the HTML and then extract the data from it. For installing it I followed a guide on installing Beautiful Soup on Windows.
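To make the table-extraction step concrete, here is a minimal, self-contained sketch of how BeautifulSoup finds a table by its `id` and walks its rows; the tiny HTML snippet is made up, but the `id` matches the `FundHoldSharesTable` used by the Sina page.

```python
from bs4 import BeautifulSoup

# Made-up two-row table mimicking the structure of Sina's history table.
html = '''
<table id="FundHoldSharesTable">
  <tr><td>Date</td><td>Close</td></tr>
  <tr><td>2014-03-20</td><td>9.87</td></tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', attrs={'id': 'FundHoldSharesTable'})
# Collect the stripped text of every cell, row by row.
rows = [[td.get_text(strip=True) for td in tr.findAll('td')]
        for tr in table.findAll('tr')]
print(rows)  # [['Date', 'Close'], ['2014-03-20', '9.87']]
```

The same pattern (find the table, iterate `tr`, iterate `td`) is exactly what the script does on the real page, just with more columns.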

3. For the part that uses the urllib.request module, I learned a great deal from another blogger's posts (blog link); they are extremely detailed, easy to follow, and considerate of beginners.
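The only urllib piece the script really needs is building the quarterly URL and fetching it. A small sketch of the URL-building half, using `urllib.parse.urlencode` for the query string (the year/jidu parameter names come from the Sina page; the values here are just examples):

```python
from urllib.parse import urlencode

# Base history URL for one stock code, as used in the post.
base = ('http://vip.stock.finance.sina.com.cn/corp/go.php/'
        'vMS_MarketHistory/stockid/600000.phtml')

# year and jidu (quarter) select which slice of history the page returns.
params = {'year': '2013', 'jidu': '4'}
url = base + '?' + urlencode(params)
print(url)
```

`urlencode` also takes care of percent-escaping, which plain string concatenation does not; with these values the result is the same as the hand-built string in the script.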

4. Python string "formatting", i.e. substituting a value into the middle of a string. The various string operations are covered in detail in "Python基礎教程筆記——使用字符串" (tutorial notes on working with strings).
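For this script, string formatting is an alternative to the long chain of `+` concatenations used to build the URL. Two equivalent standard ways (values are illustrative):

```python
code, year, jidu = 600000, '2013', 2

# Old-style % formatting.
url1 = ('http://vip.stock.finance.sina.com.cn/corp/go.php/'
        'vMS_MarketHistory/stockid/%s.phtml?year=%s&jidu=%s'
        % (code, year, jidu))

# str.format, which reads more clearly with many placeholders.
url2 = ('http://vip.stock.finance.sina.com.cn/corp/go.php/'
        'vMS_MarketHistory/stockid/{}.phtml?year={}&jidu={}'
        .format(code, year, jidu))

print(url1 == url2)  # True
```

Both also convert non-string values (the integers here) automatically, so the explicit `str(code)` calls become unnecessary.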

5. Converting from Python 2 to Python 3. Because of character-encoding problems (Chinese text printing as ASCII escapes), some people suggest switching to Python 3, whose default source encoding is UTF-8 and whose `str` type is Unicode. The link "Python3.x和Python2.x的區別" explains the differences between Python 2 and Python 3.
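The encoding issue shows up in this script because the Sina page body arrives as GB2312-encoded bytes. In Python 3, bytes and text are distinct types, so the bytes must be decoded once and then everything downstream is plain Unicode text; this is what `from_encoding='GB2312'` does inside BeautifulSoup. A minimal round-trip (the sample string is illustrative):

```python
# Bytes as they would arrive from the GB2312-encoded page.
raw = '浦发银行'.encode('gb2312')

# Decode once at the boundary; from here on it is a normal Python 3 str.
text = raw.decode('gb2312')
print(text)  # 浦发银行
```

In Python 2 the same page content would often print as escaped bytes unless decoded by hand, which is the problem the post alludes to.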


