Python Web Scraping and Text Data Analysis (video course)
Most scrapers collect text data. But what if what you need to collect is a large number of files — how do you download them in batch?
Today we use cninfo (巨潮資訊網) http://www.cninfo.com.cn as our example.
Before the hands-on part, let's first summarize the request methods used in scraping:
about 90% of scrapers use requests.get;
the remaining 10% use requests.post.
To determine which method to use, open the developer tools, find the relevant URL in the Network panel, and check its Request Method.
The request method in this tutorial is POST, so we use the requests.post function.
requests.post in detail
requests.post takes a params argument,
i.e. requests.post(url, params=<dict>), where
url is the address to POST to, and
params is used to construct the complete URL, like so:
```python
import requests

url = 'http://www.cninfo.com.cn/new/disclosure'
data = {'column': 'szse_latest', 'pageNum': 4, 'pageSize': 20,
        'sortName': '', 'sortType': '', 'clusterFlag': 'true'}
resp = requests.post(url, params=data)
resp.url
```
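The way params gets folded into the URL can be checked without hitting the server: requests exposes the same URL-encoding step through its PreparedRequest class (used here purely for illustration):

```python
from requests.models import PreparedRequest

# Build the query string exactly as requests.post(url, params=...) would
req = PreparedRequest()
req.prepare_url('http://www.cninfo.com.cn/new/disclosure',
                {'column': 'szse_latest', 'pageNum': 4, 'pageSize': 20})
print(req.url)
# → http://www.cninfo.com.cn/new/disclosure?column=szse_latest&pageNum=4&pageSize=20
```

So params is just a convenience for assembling the query string; the server sees the same URL either way.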
Video tutorial
I have uploaded the video to Bilibili as part of 【python網絡爬蟲快速入門】.
Video link: https://www.bilibili.com/video/av72010301?p=10
You can also click "Read the original" at the end of this post to jump to the video.
Code
```python
import requests
import csv

# Download a PDF announcement to the given file path
def downloadpdf(url, file):
    resp = requests.get(url)
    with open(file, 'wb') as f:
        f.write(resp.content)

# Create a CSV file to store the announcement details
csvf = open('data/巨潮資訊/深圳證券市場公告.csv', 'a+', encoding='gbk', newline='')
writer = csv.writer(csvf)
writer.writerow(('公司名', '股票代碼', '發佈時間', '公告標題', '公告pdf下載地址', '公告類型'))

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'}
cookies = {'Cookie': 'noticeTabClicks=%7B%22szse%22%3A1%2C%22sse%22%3A0%2C%22hot%22%3A0%2C%22myNotice%22%3A0%7D; tradeTabClicks=%7B%22financing%20%22%3A0%2C%22restricted%20%22%3A0%2C%22blocktrade%22%3A0%2C%22myMarket%22%3A0%2C%22financing%22%3Anull%7D; JSESSIONID=183467B85157E00A626B77D1E16CC580; insert_cookie=45380249; UC-JSESSIONID=A75421EA72188528B984B4166A86CAAA; _sp_ses.2141=*; _sp_id.2141=9063055c-7fc7-4b0c-a0ed-089886082fbd.1579084693.2.1579318110.1579084713.3f22b0aa-580d-4f25-9127-9734dd5647dc'}

# URL for the POST request
url = 'http://www.cninfo.com.cn/new/disclosure'

for page in range(39):
    # Construct the POST parameters for this page
    data = {'column': 'szse_latest', 'pageNum': page, 'pageSize': 20,
            'sortName': '', 'sortType': '', 'clusterFlag': 'true'}
    try:
        # Send the request and parse the JSON response
        resp = requests.post(url, params=data, headers=headers, cookies=cookies)
        pdfss = resp.json()['classifiedAnnouncements']
        print(page)
        for pdfs in pdfss:
            for pdf in pdfs:
                secName = pdf['secName']
                secCode = 'SZ' + str(pdf['secCode'])
                announcementTitle = pdf['announcementTitle']
                adjunctUrl = 'http://static.cninfo.com.cn/' + pdf['adjunctUrl']
                # Note: titles containing characters that are illegal in
                # filenames (e.g. '/') may need sanitizing first
                pdffile = 'data/巨潮資訊/pdf/' + announcementTitle + '.pdf'
                downloadpdf(url=adjunctUrl, file=pdffile)
                announcementTypeName = pdf['announcementTypeName']
                announcementTime = pdf['announcementTime']
                writer.writerow((secName, secCode, announcementTime,
                                 announcementTitle, adjunctUrl,
                                 announcementTypeName))
    except Exception:
        # resp may not exist if the request itself failed, so report the page
        print('Failed on page', page)

csvf.close()
```
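The nested loop in the scraper reflects the shape of the JSON response: classifiedAnnouncements is a list of announcement groups, each of which is a list of dicts. A minimal sketch of that flattening step, using made-up mock data (field names come from the scraper above; the values are invented for illustration):

```python
# Mock response body, shaped the way the scraper's loop assumes
mock_json = {
    'classifiedAnnouncements': [
        [
            {'secName': '示例公司', 'secCode': '000001',
             'announcementTitle': '年度報告',
             'adjunctUrl': 'finalpage/2020-01-20/12345.PDF',
             'announcementTypeName': '年報',
             'announcementTime': '2020-01-20'},
        ],
    ]
}

# Flatten groups → announcements, deriving the same fields the scraper writes
rows = []
for group in mock_json['classifiedAnnouncements']:
    for pdf in group:
        rows.append(('SZ' + str(pdf['secCode']),
                     'http://static.cninfo.com.cn/' + pdf['adjunctUrl']))

print(rows)
# → [('SZ000001', 'http://static.cninfo.com.cn/finalpage/2020-01-20/12345.PDF')]
```

Working against a mock like this is a convenient way to debug the parsing logic without repeatedly hitting the site.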
To get the Jupyter Notebook version of the code, reply with the keyword "20200120" in the official-account backend.
This article is shared from the WeChat official account 大鄧和他的Python (DaDengAndHisPython).