Scraping Baidu's Real-Time Hot Topics Ranking with Python

Date: 2019-12-14


Today's target is Baidu's real-time hot topics ranking.

As usual, the first step is to download the page content to a local file:

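import requests

def downhtml():
    url = 'http://top.baidu.com/buzz?b=1&fr=20811'
    headers = {'User-Agent': 'Mozilla/5.0'}
    # Pass the url variable, not the literal string 'url'
    r = requests.get(url, headers=headers)
    # Write the raw bytes so the page's original encoding is preserved
    with open('C:/Code/info_baidu.html', 'wb') as f:
        f.write(r.content)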

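I am in the habit of saving the whole page locally before analyzing the data, which is why this step exists. The code that fetches and parses the page directly is posted further down.

Now to analyze the data:

I want to extract three values: the rank, the keyword, and the search index.

Open the page source: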

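Each entry is a tr tag whose fields are wrapped in individual td tags, one tr per entry (note that the tr tags of the first three entries have class='hideline', while the later ones do not).

Rank: the first td, class='first'

Keyword: the second td, class='keyword'

Search index: the last td, class='last'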

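Now that I know where the data I need is located, I can start writing code.

First, a function that opens the local HTML file and returns its contents for BeautifulSoup to parse: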

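def send_html():  # hand the local HTML file to BeautifulSoup in get_pages
    path = 'C:/Code/info_baidu.html'
    # Read raw bytes so BeautifulSoup can detect the page encoding itself;
    # the with block also closes the file
    with open(path, 'rb') as htmlfile:
        htmlhandle = htmlfile.read()
    return htmlhandle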

This way I can test the code below against the local HTML file instead of sending a request to Baidu's server every time.
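The "download once, then test locally" idea can be sketched as a small cache helper. This is only an illustration under assumptions: `load_html` and its `fetch` callback are hypothetical names that do not appear in the original code, and `fetch` stands in for whatever function performs the real request.

```python
from pathlib import Path
from typing import Callable


def load_html(path: str, fetch: Callable[[], bytes]) -> bytes:
    """Return the cached page at `path`, calling `fetch` only if it is missing."""
    cache = Path(path)
    if not cache.exists():
        # First run: create the folder if needed and store the downloaded bytes
        cache.parent.mkdir(parents=True, exist_ok=True)
        cache.write_bytes(fetch())
    # Every later run reads the local copy and never touches the network
    return cache.read_bytes()
```

Passing something like `lambda: requests.get(url, headers=headers).content` as `fetch` would give the same behavior as running the download step once and then reading the saved file.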

def get_pages(html):
    soup = BeautifulSoup(html,'html.parser')
    all_topics=soup.find_all('tr')[1:]  # slice off the first tr

The slice is there because the first tr contains the table header:

<tr>
        <th width="50" class="first">排名</th>
        <th>關鍵詞</th>
        <th width="30%" class="tc">相關連接</th>
        <th width="20%" class="last">搜索指數</th>
    </tr>

It is not the first-ranked entry, so I filter it out with the slice.

Then assign the fields one by one:

def get_pages(html):
    soup = BeautifulSoup(html,'html.parser')
    all_topics=soup.find_all('tr')[1:]
    for each_topic in all_topics:
        #print(each_topic)
        topic_times = each_topic.find('td',class_='last').get_text()  # search index
        topic_rank = each_topic.find('td',class_='first').get_text()  # rank
        topic_name = each_topic.find('td',class_='keyword').get_text()  # title
        print('Rank: {}, Title: {}, Heat: {}'.format(topic_rank,topic_name,topic_times))

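In principle this should print the results, but Baidu wanted to make things a little harder for me.

Several problems showed up:

1: AttributeError: 'NoneType' object has no attribute 'get_text'

2: The output format

3: Only one value is printed

As usual, the first problem probably means that some of the results are not Tag objects, so let's test that: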

def get_pages(html):
    soup = BeautifulSoup(html,'html.parser')
    all_topics=soup.find_all('tr')[1:]
    for each_topic in all_topics:
        #print(each_topic)
        topic_times = each_topic.find('td',class_='last')  # search index
        print(type(topic_times))

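The output shows NoneType mixed in among the first few values. I looked at the page source and could not tell what causes it; I will come back to this when I find out. What is certain is that find() returns None when a row has no matching td.

So all we need to do is filter out the None values.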

def get_pages(html):
    soup = BeautifulSoup(html,'html.parser')
    all_topics=soup.find_all('tr')[1:]
    for each_topic in all_topics:
        #print(each_topic)
        topic_times = each_topic.find('td',class_='last')  # search index
        topic_rank = each_topic.find('td',class_='first')  # rank
        topic_name = each_topic.find('td',class_='keyword')  # title
        # Skip any row in which one of the three cells was not found
        if topic_rank is not None and topic_name is not None and topic_times is not None:
            # Reuse the tags found above instead of searching the row again
            topic_rank = topic_rank.get_text()
            topic_name = topic_name.get_text()
            topic_times = topic_times.get_text()
            print('Rank: {}, Title: {}, Heat: {}'.format(topic_rank,topic_name,topic_times))

This solves the first problem: the entries now print, and the third problem is solved along with it.
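To see the None filter at work without touching Baidu, here is a minimal sketch run against a made-up table. The `SAMPLE` markup and the `extract_rows` helper are assumptions for illustration only: the sample merely imitates the structure described earlier and adds one row that has none of the three cells.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modeled on the structure described earlier; the second
# data row lacks the three cells, so find() returns None for it.
SAMPLE = """
<table>
  <tr><th class="first">rank</th><th>keyword</th><th class="last">index</th></tr>
  <tr>
    <td class="first">1</td>
    <td class="keyword">topic A</td>
    <td class="last">100</td>
  </tr>
  <tr><td colspan="3">extra detail row</td></tr>
  <tr>
    <td class="first">2</td>
    <td class="keyword">topic B</td>
    <td class="last">90</td>
  </tr>
</table>
"""


def extract_rows(html):
    """Return (rank, keyword, index) tuples, skipping rows missing any cell."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for tr in soup.find_all('tr')[1:]:  # skip the header row
        cells = [tr.find('td', class_=name) for name in ('first', 'keyword', 'last')]
        if any(cell is None for cell in cells):
            continue  # no matching td in this row
        rows.append(tuple(cell.get_text().strip() for cell in cells))
    return rows


print(extract_rows(SAMPLE))  # [('1', 'topic A', '100'), ('2', 'topic B', '90')]
```

Returning tuples instead of printing inside the loop is a design choice of this sketch, not of the original code; it makes the filter easy to check in isolation.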

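The second problem is still there, though, and the messy formatting really bothers me. My guess is that get_text() also returns the spaces and newlines surrounding the text.

So replace() should be enough to fix it.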

        if topic_rank is not None and topic_name is not None and topic_times is not None:
            # Remove the spaces and newlines that get_text() returns with the text
            topic_rank = topic_rank.get_text().replace(' ','').replace('\n','')
            topic_name = topic_name.get_text().replace(' ','').replace('\n','')
            topic_times = topic_times.get_text().replace(' ','').replace('\n','')
            print('Rank: {}, Title: {}, Heat: {}'.format(topic_rank,topic_name,topic_times))

Oh, the output looks pretty good now.
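One thing to keep in mind with the replace() chain is that it removes only spaces and '\n'. As a hedged alternative, the sketch below uses a hypothetical helper named `squash` (not part of the original code) that removes every whitespace character, including tabs and carriage returns. Like the replace() version, it also removes spaces inside a keyword.

```python
def squash(text):
    """Remove every whitespace character (spaces, tabs, newlines, carriage returns)."""
    # split() with no arguments splits on any run of whitespace
    return ''.join(text.split())


raw = '\n\t  12345 \r\n'
print(repr(raw.replace(' ', '').replace('\n', '')))  # '\t12345\r' (tab and \r survive)
print(repr(squash(raw)))                             # '12345'
```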

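But the neat freak in me is still not happy: the heat (search index) column is too ragged.

After some searching (the power of the internet is strong, haha), I quickly found a fix.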

        if topic_rank is not None and topic_name is not None and topic_times is not None:
            topic_rank = topic_rank.get_text().replace(' ','').replace('\n','')
            topic_name = topic_name.get_text().replace(' ','').replace('\n','')
            topic_times = topic_times.get_text().replace(' ','').replace('\n','')
            # Center each field; the title is padded with chr(12288), the full-width space
            tplt = "Rank: {0:^4}\tTitle: {1:{3}^15}\tHeat: {2:^7}"
            print(tplt.format(topic_rank,topic_name,topic_times,chr(12288)))
With that output, the neat freak in me is finally satisfied, haha.
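For readers unfamiliar with the `{1:{3}^15}` field: it centers argument 1 in a width of 15 and uses argument 3 as the fill character. chr(12288) is U+3000, the ideographic (full-width) space, which is typically as wide as a Chinese character, so the padding lines up better than ordinary spaces. A minimal sketch, with `centre_cjk` as a hypothetical helper name:

```python
FULLWIDTH_SPACE = chr(12288)  # U+3000 IDEOGRAPHIC SPACE


def centre_cjk(text, width):
    """Center `text` in `width` characters, padding with full-width spaces."""
    # Nested fields: {1} supplies the fill character, {2} the width
    return '{0:{1}^{2}}'.format(text, FULLWIDTH_SPACE, width)


padded = centre_cjk('百度', 6)
print(repr(padded))
print(len(padded))  # 6
```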

Here is the full code:

from bs4 import BeautifulSoup


def send_html():  # hand the local HTML file to BeautifulSoup in get_pages
    path = 'C:/Code/info_baidu.html'
    # Read raw bytes so BeautifulSoup can detect the page encoding itself
    with open(path, 'rb') as htmlfile:
        htmlhandle = htmlfile.read()
    return htmlhandle

def get_pages(html):
    soup = BeautifulSoup(html,'html.parser')
    all_topics=soup.find_all('tr')[1:]  # skip the header row
    for each_topic in all_topics:
        topic_times = each_topic.find('td',class_='last')  # search index
        topic_rank = each_topic.find('td',class_='first')  # rank
        topic_name = each_topic.find('td',class_='keyword')  # title
        # Skip any row in which one of the three cells was not found
        if topic_rank is not None and topic_name is not None and topic_times is not None:
            topic_rank = topic_rank.get_text().replace(' ','').replace('\n','')
            topic_name = topic_name.get_text().replace(' ','').replace('\n','')
            topic_times = topic_times.get_text().replace(' ','').replace('\n','')
            # Center each field; the title is padded with chr(12288), the full-width space
            tplt = "Rank: {0:^4}\tTitle: {1:{3}^15}\tHeat: {2:^7}"
            print(tplt.format(topic_rank,topic_name,topic_times,chr(12288)))

if __name__ =='__main__':
    get_pages(send_html())


And here is the full code that scrapes the page directly, without saving it first:

import requests
from bs4 import BeautifulSoup

def get_html(url,headers):
    r = requests.get(url,headers=headers)
    # Use the encoding detected from the content so the Chinese text decodes correctly
    r.encoding = r.apparent_encoding
    return r.text


def get_pages(html):
    soup = BeautifulSoup(html,'html.parser')
    all_topics=soup.find_all('tr')[1:]  # skip the header row
    for each_topic in all_topics:
        topic_times = each_topic.find('td',class_='last')  # search index
        topic_rank = each_topic.find('td',class_='first')  # rank
        topic_name = each_topic.find('td',class_='keyword')  # title
        # Skip any row in which one of the three cells was not found
        if topic_rank is not None and topic_name is not None and topic_times is not None:
            topic_rank = topic_rank.get_text().replace(' ','').replace('\n','')
            topic_name = topic_name.get_text().replace(' ','').replace('\n','')
            topic_times = topic_times.get_text().replace(' ','').replace('\n','')
            # Center each field; the title is padded with chr(12288), the full-width space
            tplt = "Rank: {0:^4}\tTitle: {1:{3}^15}\tHeat: {2:^8}"
            print(tplt.format(topic_rank,topic_name,topic_times,chr(12288)))

def main():
    url = 'http://top.baidu.com/buzz?b=1&fr=20811'
    headers= {'User-Agent':'Mozilla/5.0'}
    html = get_html(url,headers)
    get_pages(html)

if __name__=='__main__':
    main()

That's it. Task complete, enjoy life!
