[Python] Scraping academician profiles from the Chinese Academy of Sciences with Python

Date: 2020-05-18


018/07/09 23:43
Project name: scraping the profile text of 871 Chinese Academy of Sciences academicians

1. Goal: collect the profile text of 871 academicians of the Chinese Academy of Sciences.

2. Final result: a UTF-8 encoded text file with one numbered entry per academician.

3. The code is as follows:

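import re        # standard library, no installation needed
import os        # file and directory operations
import time
import requests  # third-party HTTP client (compare urllib2)

# Index page listing the academicians
# (www.cae.cn is the Chinese Academy of Engineering website)
url = 'http://www.cae.cn/cae/html/main/col48/column_48_1.html'
html = requests.get(url)
# print(html.status_code)  # status code, e.g. 200, 404, 500, 502
html.encoding = 'utf-8'
# print(html.text)  # the page returned as text

# Extract the data
# + matches one or more times (at least once)
# Note: findall returns a list
number = re.findall(
    r'<a href="/cae/html/main/colys/(\d+)\.html" target="_blank">', html.text)

i = 1  # counter used to number the academicians scraped so far
for m in number[:871]:
# for m in number[:4]:  # use a slice like this to control how many pages are scraped
# for m in number[28:88]:
    nextUrl = 'http://www.cae.cn/cae/html/main/colys/{}.html'.format(m)
    # Request the individual profile page
    nexthtml = requests.get(nextUrl)
    nexthtml.encoding = 'utf-8'
    # Regular expression notes:
    # ()  captures the data to extract
    # .   matches any single character except the newline \n
    # *   matches the preceding expression any number of times (compare {1,5})
    # ?   after a quantifier, switches to non-greedy mode:
    #     match as few characters as possible
    text = re.findall('<div class="intro">(.*?)</div>', nexthtml.text, re.S)  # re.S lets . match newlines too
    if not text:
        # No intro block on this page: skip it instead of failing on text[0]
        time.sleep(1)
        continue
    text2 = re.sub(r'<p>|&ensp;|&nbsp;|</p>', '', text[0]).strip()  # .strip() removes surrounding whitespace

    # Save the data
    with open(r'E:\02中科院院士信息爬取結果.txt', mode='a+', encoding="utf-8") as f:  # be sure to open the file as utf-8
        f.write('{}. '.format(i) + text2 + '\n')
    i += 1

    # Do not download too fast: throttle the requests
    time.sleep(1)  # pause for 1 s before the next request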
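The regular expression notes in the script can be tried without touching the network. The following is a minimal sketch, not the author's code: `LIST_HTML` and `PROFILE_HTML` are hypothetical sample strings that only imitate the markup the script expects, and the helper names `extract_ids` and `extract_intro` are made up for illustration. It shows the ID capture, the non-greedy `(.*?)` with `re.S`, and what the greedy `(.*)` would capture instead.

```python
import re

# Hypothetical sample markup (an assumption, not copied from the real site).
LIST_HTML = (
    '<li><a href="/cae/html/main/colys/12345678.html" target="_blank">A</a></li>\n'
    '<li><a href="/cae/html/main/colys/87654321.html" target="_blank">B</a></li>\n'
)
PROFILE_HTML = (
    '<div class="intro">\n'
    '<p>&ensp;&ensp;First paragraph.</p>\n'
    '<p>&nbsp;Second paragraph.</p>\n'
    '</div>\n'
    '<div class="footer">ignored</div>\n'
)

# Same patterns as in the script, compiled once for reuse.
ID_PATTERN = re.compile(r'<a href="/cae/html/main/colys/(\d+)\.html" target="_blank">')
INTRO_PATTERN = re.compile(r'<div class="intro">(.*?)</div>', re.S)   # non-greedy
GREEDY_PATTERN = re.compile(r'<div class="intro">(.*)</div>', re.S)   # greedy, for comparison
TAG_PATTERN = re.compile(r'<p>|&ensp;|&nbsp;|</p>')


def extract_ids(html):
    """Return every numeric page ID captured by the () group, as a list of strings."""
    return ID_PATTERN.findall(html)


def extract_intro(html):
    """Return the cleaned intro text, or None when the page has no intro block."""
    match = INTRO_PATTERN.search(html)
    if match is None:
        return None
    return TAG_PATTERN.sub('', match.group(1)).strip()


ids = extract_ids(LIST_HTML)
intro = extract_intro(PROFILE_HTML)
# The greedy version runs on to the LAST </div>, so it swallows the footer as well.
greedy = GREEDY_PATTERN.search(PROFILE_HTML).group(1)

print(ids)
print(repr(intro))
print('footer' in greedy)
```

Returning `None` for a page without an intro block lets the caller decide whether to skip or log the page, rather than failing on an empty `findall` result.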
