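[I have only just started writing these summaries. If you have any suggestions about the content, feel free to leave a comment, or add me on QQ at 1172617666. I look forward to hearing from you.]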
I'll post the code first, then describe in detail the problems I ran into while writing it and how I solved them.
#coding:utf-8
'''
Created on March 20, 2014
@author: ZSH
'''
import os
import urllib.request
import json
from bs4 import BeautifulSoup

BASE_URL = 'http://vip.stock.finance.sina.com.cn/corp/go.php/vMS_MarketHistory/stockid/'
DATA_DIR = r'C:\Users\ZSH\Desktop\Python\DATA'

def get_year_range(code):
    # The original post used an undefined `url` here. Assumption: the list of
    # years is read from the stock's history page without query parameters.
    url = BASE_URL + str(code) + '.phtml'
    content = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(content, 'html.parser')
    # The <select name="year"> element lists every year that has data.
    select = soup.find_all('select', attrs={'name': 'year'})[0]
    return [option.string for option in select.find_all('option')]

def get_data(code):
    codestr = str(code)
    for year in get_year_range(code):
        for season in range(1, 5):
            try:
                jidu = str(season)
                url = BASE_URL + codestr + '.phtml?year=' + year + '&jidu=' + jidu
                html = urllib.request.urlopen(url).read()
                soup = BeautifulSoup(html, 'html.parser', from_encoding='GB2312')
                table = soup.find_all('table', attrs={'id': 'FundHoldSharesTable'})[0]
                rows = table.find_all('tr')
                # rows[0] is the table title, rows[1] holds the column headers,
                # and rows[2:] are the data records.
                colume = rows[1].find_all('td')
                d1 = {}
                for row in rows[2:]:
                    data = row.find_all('td')
                    # Collect the first seven columns, keyed by header text.
                    for i in range(7):
                        d1.setdefault(colume[i].get_text(), []).append(data[i].get_text(strip=True))
                filename = rows[0].get_text(strip=True) + year + '年' + jidu + '季度.json'
                # Write as UTF-8 so the non-ASCII text stays readable in the file.
                with open(os.path.join(DATA_DIR, filename), 'w', encoding='utf-8') as f:
                    f.write(json.dumps(d1, ensure_ascii=False))
                print('Finished ' + filename)
            except Exception as e:
                # Skip a quarter that fails (for example, one with no data).
                print('An error occurred:', e)
                continue
    print('Scraping finished!')

get_data(600000)
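The table-parsing step of the script above can be tried offline, without hitting the Sina server. The following is only a minimal sketch: the HTML snippet, its column names, and the helper name `parse_table` are made up for illustration. Only the table id `FundHoldSharesTable` and the row layout (title row, header row, data rows) are taken from the script.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page has more columns and rows.
SAMPLE_HTML = """
<table id="FundHoldSharesTable">
  <tr><th colspan="3">Sample Stock</th></tr>
  <tr><td>Date</td><td>Open</td><td>Close</td></tr>
  <tr><td> 2014-03-20 </td><td>9.50</td><td>9.62</td></tr>
  <tr><td> 2014-03-19 </td><td>9.48</td><td>9.51</td></tr>
</table>
"""

def parse_table(html):
    """Return (title, {header: [values...]}) for the history table."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', attrs={'id': 'FundHoldSharesTable'})
    rows = table.find_all('tr')
    title = rows[0].get_text(strip=True)
    headers = [td.get_text(strip=True) for td in rows[1].find_all('td')]
    columns = {header: [] for header in headers}
    for row in rows[2:]:
        cells = row.find_all('td')
        # Pair each cell with its column header and strip the whitespace.
        for header, cell in zip(headers, cells):
            columns[header].append(cell.get_text(strip=True))
    return title, columns

title, columns = parse_table(SAMPLE_HTML)
print(title)
print(columns)
```

Pairing headers and cells with `zip` avoids hard-coding the number of columns, which is a design choice of this sketch rather than something the original script does.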
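1. Setting up the Python environment on Windows. My environment is MyEclipse + PyDev, set up by following the tutorial post "Python環境搭建" (Python environment setup). I personally find MyEclipse a very powerful IDE that is fairly easy to pick up. For basic syntax such as functions and for statements, I recommend two documents: "Python簡明教程" (a concise Python tutorial, in Chinese), which is easy to follow, and the documentation that ships with Python in C:\Python34\Doc.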
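2. The script uses one third-party module, beautifulsoup4, which is what the line from bs4 import BeautifulSoup pulls in. The module locates the table region in the HTML and then extracts the data from it. To install it I followed the post "Windows平臺安裝Beautiful Soup" (Installing Beautiful Soup on Windows).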
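3. For the parts that use the urllib.request module to fetch the pages, I learned a great deal from one blogger whose posts are extremely detailed, easy to follow, and friendly to beginners (blog address).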
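4. Python string "formatting", that is, substituting a value for a placeholder inside a string. The various string operations in Python are covered in detail in the post "Python基礎教程筆記: 使用字符串" (notes on the Python basics tutorial, the chapter on using strings).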
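5. Moving from Python 2 to Python 3. Because of character-encoding problems (Chinese text printed as ASCII escape codes instead of readable characters), I was advised to switch to Python 3, where strings are Unicode and UTF-8 is the default source encoding. The post "Python3.x和Python2.x的區別" (Differences between Python 3.x and Python 2.x) explains how the two versions differ.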
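Notes 4 and 5 can be illustrated with a short offline sketch. It builds the request URL with `str.format` instead of `+` concatenation, then shows how Python 3 separates bytes from text and what `ensure_ascii=False` changes in `json.dumps`. The helper name `build_url` and the sample dictionary are assumptions made for this example; only the URL pattern comes from the script.

```python
import json

# URL pattern taken from the script; the placeholders are filled by format().
URL_TEMPLATE = ('http://vip.stock.finance.sina.com.cn/corp/go.php/'
                'vMS_MarketHistory/stockid/{code}.phtml?year={year}&jidu={season}')

def build_url(code, year, season):
    """Hypothetical helper: substitute the three values into the template."""
    return URL_TEMPLATE.format(code=code, year=year, season=season)

url = build_url(600000, 2014, 1)
print(url)

# In Python 3, str is Unicode text; bytes read from the network must be
# decoded with the page's encoding (GB2312 in the script) to become text.
raw = '季度'.encode('gb2312')   # bytes, as they would arrive over HTTP
text = raw.decode('gb2312')     # back to a readable str
print(len(raw), text)

# By default json.dumps escapes non-ASCII characters as \uXXXX sequences;
# ensure_ascii=False keeps them readable, as the script does.
escaped = json.dumps({'name': text})
readable = json.dumps({'name': text}, ensure_ascii=False)
print(escaped)
print(readable)
```

When `ensure_ascii=False` is used, the output file should be opened with an explicit `encoding='utf-8'`, otherwise the platform default encoding decides how the characters are written.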