While browsing Zhihu, I thought it would be pretty fun to write a scraper and pull down the comments, so I got right to it.
First, a look at the page structure. Sure enough, under XHR I found a JSON-like file, and opening it up, all the data I wanted was hiding inside. Enough talk, straight to the code.
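(One quick aside before the script: here is the rough shape of one page of that response, trimmed to just the keys the scraper will read. This is inferred from the code below, not a verbatim capture.)

# Hypothetical sample, shaped like one page of the comments API
# (only the keys the scraper actually reads):
sample_page = {
    'data': [
        {
            'content': 'comment text ...',
            'author': {'member': {'name': "commenter's nickname"}},
        },
        # ... one dict like this per comment, 20 per page
    ]
}
print(sample_page['data'][0]['author']['member']['name'])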
__author__ = 'xyz'

import requests
from urllib.parse import urlencode
from multiprocessing.dummy import Pool as ThreadPool


def get_info(url):
    # The request headers are the key to beating the anti-scraping
    # check; the essence is the 'authorization' field.
    headers = {
        'authorization': 'Bearer 2|1:0|10:1517047037|4:z_c0|80:MS4xU0IxRkJ3QUFBQUFtQUFBQVlBSlZUZjJhV1Z1U2FhTURoSVdhb0V1dV9abTJFNkE4aGpBVnlBPT0=|7200c19ff1e6c1a18390b9af7392f8dcca584bcb3c31cd3d8161d661c18eaa76',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36',
        'cookie': '_zap=fa9bac27-b3bb-408f-8ea8-581d77a0cc7e;z_c0="2|1:0|10:1517047037|4:z_c0|80:MS4xU0IxRkJ3QUFBQUFtQUFBQVlBSlZUZjJhV1Z1U2FhTURoSVdhb0V1dV9abTJFNkE4aGpBVnlBPT0=|7200c19ff1e6c1a18390b9af7392f8dcca584bcb3c31cd3d8161d661c18eaa76";__utma=51854390.1871736191.1517047484.1517047484.1517047484.1;__utmz=51854390.1517047484.1.1.utmcsr=zhihu.com|utmccn=(referral)|utmcmd=referral|utmcct=/people/wa-jue-ji-xiao-wang-zi-45/activities;__utmv=51854390.100--|2=registration_date=20180111=1^3=entry_date=20180111=1;q_c1=cf2ffc48a71847e8804976b27342d96a|1519734001000|1517047027000;__DAYU_PP=FEmB3nFIebeie76Qvafi283c612c853b;_xsrf=45a278fa-78a9-48a5-997e-f8ef03822600;d_c0="APDslTIFUQ2PTnO_uZxNy2IraDAP4g1nffs=|1521555479"'
    }
    html = requests.get(url, headers=headers)
    data = html.json()['data']  # parse the response JSON once
    for i in range(len(data)):
        dic = {}  # a dict to hold one comment's data
        dic['content'] = data[i]['content']
        dic['author'] = data[i]['author']['member']['name']
        # Show download progress -- friendliness +1
        print('Downloaded {:.0f}%'.format((i + 1) / len(data) * 100))
        write(dic)  # write each comment out as it is parsed


def write(dic):
    # Append the info to a local file -- mind the encoding.
    with open('zhihu.text', 'a', encoding='utf-8') as f:
        f.write('Nickname: ' + dic['author'] + '\n')
        f.write('Content: ' + dic['content'] + '\n\n')


if __name__ == '__main__':
    pool = ThreadPool(4)  # 4 workers; despite the name, multiprocessing.dummy gives threads, not processes
    page = []  # stash the URLs in a list, heh heh heh
    base_url = 'https://www.zhihu.com/api/v4/answers/346473778/comments?'
    for i in range(0, 240, 20):
        data = {
            'include': 'data[*].author,collapsed,reply_to_author,disliked,content,voting,vote_count,is_parent_author,is_author',
            'order': 'normal',
            'limit': '20',
            'offset': i,
            'status': 'open'}
        queries = urlencode(data)  # assemble the query string
        url = base_url + queries
        page.append(url)
    results = pool.map(get_info, page)  # download all pages concurrently
    pool.close()
    pool.join()
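A quick sanity check on the pagination: range(0, 240, 20) produces the offsets 0, 20, ..., 220, that is, 12 pages of 20 comments apiece, 240 comments in total. If you want to eyeball what get_info actually receives, a one-off like this prints the offset=0 URL, using the same parameters as the script:

from urllib.parse import urlencode

params = {
    'include': 'data[*].author,collapsed,reply_to_author,disliked,content,voting,vote_count,is_parent_author,is_author',
    'order': 'normal',
    'limit': '20',
    'offset': 0,
    'status': 'open',
}
# Prints the first page URL exactly as it lands in the `page` list
print('https://www.zhihu.com/api/v4/answers/346473778/comments?' + urlencode(params))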
Zhihu did dig a little pit here, though: the first request came back with a 401. On the outside I was steady as an old dog; inside I was panicking hard. I took a sip of water, calmed down, and figured there had to be some little-known anti-scraping mechanism at work, so I studied the returned content.
{'error': {'message': '請求頭或參數封裝錯誤', 'code': 100, 'name': 'AuthenticationInvalidRequest'}}
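If you want to retrace that debugging step, here is a minimal sketch, deliberately omitting the headers; I am assuming the trimmed query below trips the same check as the full one:

import requests

# Same endpoint, requested WITHOUT the headers dict
url = 'https://www.zhihu.com/api/v4/answers/346473778/comments?limit=20&offset=0'
resp = requests.get(url)
print(resp.status_code)  # expect 401
print(resp.json())       # expect the 'AuthenticationInvalidRequest' body above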
The message roughly translates to "request headers or parameters packaged incorrectly". I smacked my forehead: what a clever little thing. All I had to do was add that 'authorization' field to the headers. Gave it a test, and sure enough, sweet success.
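For completeness, the "test" amounts to re-sending the same request with the headers dict from get_info, the one carrying 'authorization'. A minimal sketch, reusing url from the snippet above and headers from the script:

import requests

# headers: the dict defined in get_info(), 'authorization' included
resp = requests.get(url, headers=headers)
print(resp.status_code)            # expect 200 this time
print(len(resp.json()['data']))    # expect 20 comments per page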