Scraping posts

Date: 2019-11-07


Baidu Tieba: scrape each post's title, publication time, and link.

import os
import re
import threading

import requests

# Baidu Tieba: scrape each post's title, publication time, and link

# Name of the tieba (forum) to search
word = '文字控吧'
# Number of listing pages to scrape
num = 5
# Serializes file writes, because several threads append to the same file
write_lock = threading.Lock()


# Fetch one listing page and return (detail-page path, title) tuples
def parse(word, pn):
    r = requests.get('https://tieba.baidu.com/f', params={'kw': word, 'pn': pn}, timeout=10).content.decode()
    article_urls = re.findall(r'<a rel="noreferrer" href="(/p/\d+)" title="(.*?)" target=', r, re.S)
    print('Requesting...')
    return article_urls


# Request each detail page and extract the original poster and the post time
def parse_detail(article_urls):
    for article_url in article_urls:
        link = 'https://tieba.baidu.com' + article_url[0]
        article_req = requests.get(link, timeout=10).text
        authors = re.findall(r'"userName":"(.*?)"', article_req, re.S)
        # "1楼" (first floor) is written in Simplified Chinese here, assuming that is what the page source uses
        times = re.findall(r'<span class="tail-info">1楼</span><span class="tail-info">(.*?)</span>', article_req, re.S)
        if not authors or not times:
            print('No data matched; the regex does not fit this tieba and needs to be rewritten')
            continue
        # Original poster and post time
        author, create_time = authors[0], times[0]
        if author and create_time:
            content = 'Original poster: {}\nTitle: {}\nPosted: {}\nLink: {}\n'.format(author, article_url[1],
                                                                                      create_time, link)
            print(content)
            # Append to the output file
            with write_lock, open(word + '.txt', 'a', encoding='utf-8') as f:
                f.write('{}\n'.format(content))


# Create the output folder if it does not exist, then work inside it
os.makedirs('百度貼吧', exist_ok=True)
os.chdir('百度貼吧')

t_list = []
for pn in range(0, num * 50, 50):
    # First fetch the detail-page URLs and titles
    article_urls = parse(word, pn)
    # One thread per listing page requests that page's detail pages
    t = threading.Thread(target=parse_detail, args=(article_urls,))
    t_list.append(t)

# Start the threads
for t in t_list:
    t.start()
# Wait for all threads to finish
for t in t_list:
    t.join()
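
The script prints a warning when its regular expressions stop matching, so it can help to check the patterns offline before sending any requests. The following is a minimal sketch of that idea, not part of the original script: the HTML fragments, the user name, and the helper names `extract_posts` and `extract_detail` are all hypothetical, and the Simplified Chinese literal `1楼` is an assumption about what the real page source contains.

```python
import re

# Hypothetical listing-page fragment, shaped like the markup the listing regex expects.
SAMPLE_LISTING = (
    '<a rel="noreferrer" href="/p/1234567890" title="First post" target="_blank">First post</a>\n'
    '<a rel="noreferrer" href="/p/987654321" title="Second post" target="_blank">Second post</a>\n'
)

# Hypothetical detail-page fragment containing an author field and a first-floor timestamp.
SAMPLE_DETAIL = (
    '{"userName":"example_user","other":"x"}'
    '<span class="tail-info">1楼</span><span class="tail-info">2019-11-07 12:00</span>'
)

# The same patterns the script uses, compiled once.
LISTING_RE = re.compile(r'<a rel="noreferrer" href="(/p/\d+)" title="(.*?)" target=', re.S)
AUTHOR_RE = re.compile(r'"userName":"(.*?)"', re.S)
TIME_RE = re.compile(r'<span class="tail-info">1楼</span><span class="tail-info">(.*?)</span>', re.S)


def extract_posts(listing_html):
    """Return (path, title) tuples found in a listing page."""
    return LISTING_RE.findall(listing_html)


def extract_detail(detail_html):
    """Return (author, post_time), or None when either pattern fails to match."""
    author = AUTHOR_RE.search(detail_html)
    post_time = TIME_RE.search(detail_html)
    if author is None or post_time is None:
        return None
    return author.group(1), post_time.group(1)


print(extract_posts(SAMPLE_LISTING))
print(extract_detail(SAMPLE_DETAIL))
print(extract_detail('<html>no match here</html>'))
```

Returning `None` on a failed match, instead of indexing `[0]` into an empty list, lets the caller skip a post without raising `IndexError`. If the real markup differs from these samples, only the three compiled patterns need to change.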
