Scraping and Analyzing Baidu Tieba Data (Part 1): Scraping Post Information from a Given Forum

This tutorial uses the BeautifulSoup library to scrape post information from a given Tieba forum.

The code for this tutorial is hosted on GitHub: https://github.com/w392807287/spider_baidu_bar

The data-analysis part is covered in a follow-up post.

Python version: 3.5.2

Fetching page content with BeautifulSoup

Import the required libraries:

from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.error import HTTPError

 

We will use the python bar as the example. Its home page is http://tieba.baidu.com/f?ie=utf-8&kw=python&fr=search, or, trimmed down, http://tieba.baidu.com/f?kw=python

Get a BeautifulSoup object:

url = "http://tieba.baidu.com/f?kw=python"
html = urlopen(url).read()
bsObj = BeautifulSoup(html, "lxml")

Let us wrap this snippet into a function that takes a url and returns a BeautifulSoup object:

def get_bsObj(url):
    '''
    Return a BeautifulSoup object for the given url.
    :param url: target url
    :return: BeautifulSoup object, or None on error
    '''
    try:
        html = urlopen(url).read()
        bsObj = BeautifulSoup(html, "lxml")
        return bsObj
    except HTTPError as e:
        print(e)
        return None

This function takes a url and returns a BeautifulSoup object; if an error occurs, it prints the error and returns None.
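Note: Baidu sometimes serves an error page to requests that carry urllib's default User-Agent. If get_bsObj starts returning useless pages, a variant that sends a browser-like header may help (a sketch, not part of the original code; the header value is only an example):

from urllib.request import Request, urlopen

def get_bsObj_with_ua(url):
    '''Like get_bsObj, but sends a browser-like User-Agent header.'''
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        html = urlopen(req).read()
        return BeautifulSoup(html, "lxml")
    except HTTPError as e:
        print(e)
        return None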

Processing the forum home page

The forum home page carries summary information about the forum, such as the number of followers, the number of topics, and the number of posts. We will collect these figures into a file.

Get the BeautifulSoup object for the home page:

bsObj_mainpage = get_bsObj(url)

Get the total page count:

last_page = int(bsObj_mainpage.find("a",{"class":"last pagination-item "})['href'].split("=")[-1])

We grab the number of the last page so that we do not run past the end of the forum when crawling posts later. This uses BeautifulSoup's find method.
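The href of that last-page link ends in the pn offset of the last page, which is why splitting on "=" and taking the final piece works. Schematically (the offset value here is only an illustration):

# Hypothetical href taken from the "last page" link:
href = "//tieba.baidu.com/f?kw=python&ie=utf-8&pn=456650"
last_page = int(href.split("=")[-1])    # -> 456650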

Extract the fields we need and write them to a file:

red_text = bsObj_mainpage.findAll("span", {"class": "red_text"})
subject_sum = red_text[0].get_text()  # number of topics
post_sum = red_text[1].get_text()     # number of posts
follow_sum = red_text[2].get_text()   # number of followers

with open('main_info.txt', 'w+') as f:
    f.writelines("Collected at: " + str(datetime.datetime.now()) + "\n")
    f.writelines("Topics:    " + subject_sum + "\n")
    f.writelines("Posts:     " + post_sum + "\n")
    f.writelines("Followers: " + follow_sum + "\n")

Finally, wrap these steps into a function that takes the home-page url, writes the summary file, and returns the number of the last page:

def del_mainPage(url):
    '''
    Process the forum home page and return the number of the last page.
    :param url: home-page url
    :return: last page number (int), or None on error
    '''
    bsObj_mainpage = get_bsObj(url)
    last_page = int(bsObj_mainpage.find("a", {"class": "last pagination-item "})['href'].split("=")[-1])
    try:
        red_text = bsObj_mainpage.findAll("span", {"class": "red_text"})
        subject_sum = red_text[0].get_text()  # number of topics
        post_sum = red_text[1].get_text()     # number of posts
        follow_sum = red_text[2].get_text()   # number of followers
    except AttributeError as e:
        print("Error: " + str(e) + " at " + str(datetime.datetime.now()))
        return None
    with open('main_info.txt', 'w+') as f:
        f.writelines("Collected at: " + str(datetime.datetime.now()) + "\n")
        f.writelines("Topics:    " + subject_sum + "\n")
        f.writelines("Posts:     " + post_sum + "\n")
        f.writelines("Followers: " + follow_sum + "\n")
    return last_page

The resulting file:

Collected at: 2016-10-07 15:14:19.642933
Topics:    25083
Posts:     414831
Followers: 76511

 

Extracting post urls from a listing page

To get a post's details we have to open the post itself, so we need its url, for example: http://tieba.baidu.com/p/4700788764

In this url, http://tieba.baidu.com is the server address, /p is presumably the route for posts, and /4700788764 is the post's id.
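So given a post id (which, as we will see, is also available in the listing page's data-field attribute), the post's url can be assembled directly:

post_id = 4700788764                                    # id from the example above
post_url = "http://tieba.baidu.com/p/" + str(post_id)   # -> http://tieba.baidu.com/p/4700788764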

Looking at the forum's listing pages, each post sits in its own block. Pressing F12 in the browser shows that every post corresponds to one <li> tag, and each page holds 50 posts (the first page may differ because of ad threads).

A typical <li> tag looks like this:

<li class=" j_thread_list clearfix" data-field="{"id":4700788764,"author_name":"\u6768\u5175507","first_post_id":95008842757,"reply_num":2671,"is_bakan":null,"vid":"","is_good":null,"is_top":null,"is_protal":null,"is_membertop":null,"frs_tpoint":null}">
            <div class="t_con cleafix">
        
                    <div class="col2_left j_threadlist_li_left">
                <span class="threadlist_rep_num center_text" title="回覆">2671</span>
            </div>
                <div class="col2_right j_threadlist_li_right ">
            <div class="threadlist_lz clearfix">
                <div class="threadlist_title pull_left j_th_tit ">
    
    
    <a href="/p/4700788764" title="恭喜《從零開始學Python》進入百度閱讀平臺【首頁】新書推薦榜單" target="_blank" class="j_th_tit ">恭喜《從零開始學Python》進入百度閱讀平臺【首頁】新書推薦榜單</a>
</div><div class="threadlist_author pull_right">
    <span class="tb_icon_author " title="主題做者: 楊兵507" data-field="{"user_id":625543823}"><i class="icon_author"></i><span class="frs-author-name-wrap"><a data-field="{"un":"\u6768\u5175507"}" class="frs-author-name j_user_card " href="/home/main/?un=%E6%9D%A8%E5%85%B5507&ie=utf-8&fr=frs" target="_blank">楊兵507</a></span><span class="icon_wrap  icon_wrap_theme1 frs_bright_icons "></span>    </span>
    <span class="pull-right is_show_create_time" title="建立時間">7-29</span>
</div>
            </div>
                            <div class="threadlist_detail clearfix">
                    <div class="threadlist_text pull_left">
                                <div class="threadlist_abs threadlist_abs_onlyline ">
            http://yuedu.baidu.com/ebook/ec1aa9f7b90d6c85ec3ac6d7?fr=index
        </div>

                    </div>

                    
<div class="threadlist_author pull_right">
        <span class="tb_icon_author_rely j_replyer" title="最後回覆人: 楊兵小童鞋">
            <i class="icon_replyer"></i>
            <a data-field="{"un":"\u6768\u5175\u5c0f\u7ae5\u978b"}" class="frs-author-name j_user_card " href="/home/main/?un=%E6%9D%A8%E5%85%B5%E5%B0%8F%E7%AB%A5%E9%9E%8B&ie=utf-8&fr=frs" target="_blank">楊兵小童鞋</a>        </span>
        <span class="threadlist_reply_date pull_right j_reply_data" title="最後回覆時間">
            14:17        </span>
</div>
                </div>
                    </div>
    </div>
</li>

Note the two attributes on the <li> tag: class=" j_thread_list clearfix", plus a data-field attribute holding a JSON blob with the post's metadata (as shown above).

The class attribute lets us find every post on a single page:

posts = bsObj_page.findAll("li", {"class": "j_thread_list"})

The data-field attribute gives us the post id, author name, reply count, whether the thread is featured, and so on. The post id alone would let us build the post's url, but the <a> tag below provides it directly.
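BeautifulSoup decodes the &quot; entities when parsing, so data-field comes back as a plain JSON string that json.loads can handle. A quick sketch, assuming bsObj_page is the BeautifulSoup object of a listing page (as in the function below):

import json

li = bsObj_page.find("li", {"class": "j_thread_list"})   # first post block on the page
field = json.loads(li.attrs["data-field"])
print(field["id"])           # e.g. 4700788764
print(field["reply_num"])    # e.g. 2671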

We take the link and append it to a list:

post_info = post.find("a", {"class": "j_th_tit "})
urls.append("http://tieba.baidu.com" + post_info.attrs["href"])

Packaging the above into a function that takes a listing-page url and returns the urls of all posts on it:

def get_url_from_page(page_url):
    '''
    Process one listing page and return the urls of the posts on it.
    :param page_url: url of the listing page
    :return: list of post urls
    '''
    bsObj_page = get_bsObj(page_url)
    urls = []
    try:
        posts = bsObj_page.findAll("li", {"class": "j_thread_list"})
    except AttributeError as e:
        print("Error: " + str(e) + " at " + str(datetime.datetime.now()))
        return urls                     # page could not be fetched; return the empty list
    for post in posts:
        post_info = post.find("a", {"class": "j_th_tit "})
        urls.append("http://tieba.baidu.com" + post_info.attrs["href"])
    return urls
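For example, calling it on the first listing page (pn=0) should return up to 50 post urls:

urls = get_url_from_page("http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=0")
print(len(urls))    # typically 50
print(urls[0])      # e.g. http://tieba.baidu.com/p/4700788764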

 

Processing each post

We now have the url of every post; next we process the information inside each one, picking out the useful fields and storing them in a CSV file.

We will use the same post as the example: http://tieba.baidu.com/p/4700788764

When we open this link, the first things we notice are the post's title, the original poster's name, the posting time, the reply count, and so on.

Let us look at the page's markup:

<div class="l_post j_l_post l_post_bright noborder " data-field="{"author":{"user_id":625543823,"user_name":"\u6768\u5175507","name_u":"%E6%9D%A8%E5%85%B5507&ie=utf-8","user_sex":2,"portrait":"8f0ae69da8e585b53530374925","is_like":1,"level_id":7,"level_name":"\u8d21\u58eb","cur_score":445,"bawu":0,"props":null},"content":{"post_id":95008842757,"is_anonym":false,"open_id":"tieba","open_type":"","date":"2016-07-29 19:10","vote_crypt":"","post_no":1,"type":"0","comment_num":0,"ptype":"0","is_saveface":false,"props":null,"post_index":0,"pb_tpoint":null}}">

Again we see the two attributes class and data-field; data-field holds most of the post's metadata: the poster's id, the poster's nickname, sex, level id, level name, open_id, open_type, the posting date, and so on.

First we define a post class whose attributes hold the post's fields, with a method that appends them to a CSV file:

class PostInfo:
    def __init__(self,post_id,post_title,post_url,reply_num,post_date,open_id,open_type,
                 user_name,user_sex,level_id,level_name):
        self.post_id = post_id
        self.post_title = post_title
        self.post_url = post_url
        self.reply_num = reply_num
        self.post_date = post_date
        self.open_id = open_id
        self.open_type = open_type
        self.user_name = user_name
        self.user_sex = user_sex
        self.level_id = level_id
        self.level_name = level_name

    def dump_to_csv(self, filename):
        # Append one row per post; newline='' stops csv from inserting blank lines.
        csvFile = open(filename, "a+", newline="", encoding="utf-8")
        try:
            writer = csv.writer(csvFile)
            writer.writerow((self.post_id, self.post_title, self.post_url, self.reply_num, self.post_date, self.open_id,
                             self.open_type, self.user_name, self.user_sex, self.level_id, self.level_name))
        finally:
            csvFile.close()

Then we pull out the pieces we need with find:

obj1 = json.loads(
    bsObj.find("div", attrs={"class": "l_post j_l_post l_post_bright noborder "}).attrs['data-field'])
reply_num = bsObj.find("li", attrs={"class": "l_reply_num"}).span.get_text()
post_title = bsObj.find("h1", attrs={"class": "core_title_txt"}).get_text()

post_id = obj1.get('content').get('post_id')
post_url = url
post_date = obj1.get('content').get('date')
open_id = obj1.get('content').get('open_id')
open_type = obj1.get('content').get('open_type')
user_name = obj1.get('author').get('user_name')
user_sex = obj1.get('author').get('user_sex')
level_id = obj1.get('author').get('level_id')
level_name = obj1.get('author').get('level_name')

Create an instance and save it:

postinfo = PostInfo(post_id, post_title, post_url, reply_num, post_date, open_id, open_type, user_name,user_sex, level_id, level_name)
postinfo.dump_to_csv('post_info2.csv')

Strictly speaking, a class is not needed to save the rows; it is just my personal preference.
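For instance, a plain dict plus csv.DictWriter achieves the same thing without a class (a sketch of the alternative; dump_dict_to_csv is a hypothetical helper, and the field names mirror PostInfo):

import csv

FIELDS = ["post_id", "post_title", "post_url", "reply_num", "post_date",
          "open_id", "open_type", "user_name", "user_sex", "level_id", "level_name"]

def dump_dict_to_csv(row, filename):
    # row is a dict keyed by FIELDS
    with open(filename, "a+", newline="", encoding="utf-8") as f:
        csv.DictWriter(f, fieldnames=FIELDS).writerow(row)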

Wrapping the above into a function that processes a list of post urls:

def del_post(urls):
    '''
    Process the posts at the given urls.
    :param urls: list of post urls
    :return:
    '''
    for url in urls:
        bsObj = get_bsObj(url)
        try:
            obj1 = json.loads(
                bsObj.find("div", attrs={"class": "l_post j_l_post l_post_bright noborder "}).attrs['data-field'])
            reply_num = bsObj.find("li", attrs={"class": "l_reply_num"}).span.get_text()
            post_title = bsObj.find("h1", attrs={"class": "core_title_txt"}).get_text()
        except Exception:
            print("Error at " + str(datetime.datetime.now()) + " " + url)
            with open('error.txt', 'a+') as f:
                f.writelines("Error at " + str(datetime.datetime.now()) + " " + url + "\n")
            continue                    # skip this post and carry on with the rest
        post_id = obj1.get('content').get('post_id')
        post_url = url
        post_date = obj1.get('content').get('date')
        open_id = obj1.get('content').get('open_id')
        open_type = obj1.get('content').get('open_type')
        user_name = obj1.get('author').get('user_name')
        user_sex = obj1.get('author').get('user_sex')
        level_id = obj1.get('author').get('level_id')
        level_name = obj1.get('author').get('level_name')
        postinfo = PostInfo(post_id, post_title, post_url, reply_num, post_date, open_id, open_type, user_name,
                            user_sex, level_id, level_name)
        postinfo.dump_to_csv('post_info2.csv')
        del postinfo

The resulting rows look like:

98773024983,【轟動Python界】的學習速成高效大法,http://tieba.baidu.com/p/4811129571,2,2016-10-06 20:32,tieba,,openlabczx,0,7,貢士
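Note that the file has no header row. If you want one, write it once before any rows are appended (a sketch; ensure_header is a hypothetical helper, and the column names mirror the PostInfo fields):

import csv
import os

def ensure_header(filename):
    # Write the column names once, only if the file does not exist yet.
    if not os.path.exists(filename):
        with open(filename, "w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerow(("post_id", "post_title", "post_url", "reply_num", "post_date",
                                    "open_id", "open_type", "user_name", "user_sex", "level_id", "level_name"))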

 

Putting the functions together

First, ask the user for the home-page url of the forum to crawl:

home_page_url = input("Enter the home-page url of the forum to process: ")

Derive the per-page url prefix from it:

bar_name = home_page_url.split("=")[1].split("&")[0]
pre_page_url = "http://tieba.baidu.com/f?kw=" + bar_name + "&ie=utf-8&pn="      # page-url prefix, without the page offset

Process the home page:

all_post_num = del_mainPage(home_page_url)      # total number of posts in the forum (the last page's pn offset)

Ask the user how many posts (from the top of the forum) to process:

del_post_num = int(input("How many posts (from the top) should be processed: "))     # number of posts to process

Finally:

if del_post_num > all_post_num:
    print("The requested number of posts exceeds the forum's total!")
else:
    threads = []
    for page in range(0, del_post_num, 50):
        print("Processing page offset: " + str(page))
        page_url = pre_page_url + str(page)
        urls = get_url_from_page(page_url)
        t = threading.Thread(target=del_post, args=(urls,))
        t.start()
        threads.append(t)
    for t in threads:       # wait for every worker, not just the last one started
        t.join()
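One caveat: every worker thread appends to the same post_info2.csv, so concurrent writes can interleave. A minimal sketch of serializing the writes with a lock (dump_to_csv_locked is a hypothetical wrapper, not part of the original code):

import threading

csv_lock = threading.Lock()     # shared by all worker threads

def dump_to_csv_locked(postinfo, filename):
    # Hold the lock for the whole write so rows from different
    # threads cannot interleave mid-row.
    with csv_lock:
        postinfo.dump_to_csv(filename)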

 

The complete main block:

if __name__ == '__main__':
    #home_page_url = input("Enter the home-page url of the forum to process: ")
    home_page_url = test_url
    bar_name = home_page_url.split("=")[1].split("&")[0]
    pre_page_url = "http://tieba.baidu.com/f?kw=" + bar_name + "&ie=utf-8&pn="      # page-url prefix, without the page offset
    all_post_num = del_mainPage(home_page_url)      # total number of posts in the forum
    del_post_num = int(input("How many posts (from the top) should be processed: "))     # number of posts to process
    if del_post_num > all_post_num:
        print("The requested number of posts exceeds the forum's total!")
    else:
        threads = []
        for page in range(0, del_post_num, 50):
            print("Processing page offset: " + str(page))
            page_url = pre_page_url + str(page)
            urls = get_url_from_page(page_url)
            t = threading.Thread(target=del_post, args=(urls,))
            t.start()
            threads.append(t)
        for t in threads:       # wait for every worker, not just the last one started
            t.join()

The full code:

from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.error import HTTPError
import json
import datetime
import csv
import threading

class PostInfo:
    def __init__(self,post_id,post_title,post_url,reply_num,post_date,open_id,open_type,
                 user_name,user_sex,level_id,level_name):
        self.post_id = post_id
        self.post_title = post_title
        self.post_url = post_url
        self.reply_num = reply_num
        self.post_date = post_date
        self.open_id = open_id
        self.open_type = open_type
        self.user_name = user_name
        self.user_sex = user_sex
        self.level_id = level_id
        self.level_name = level_name

    def dump_to_csv(self, filename):
        # Append one row per post; newline='' stops csv from inserting blank lines.
        csvFile = open(filename, "a+", newline="", encoding="utf-8")
        try:
            writer = csv.writer(csvFile)
            writer.writerow((self.post_id, self.post_title, self.post_url, self.reply_num, self.post_date, self.open_id,
                             self.open_type, self.user_name, self.user_sex, self.level_id, self.level_name))
        finally:
            csvFile.close()

def get_bsObj(url):
    '''
    Return a BeautifulSoup object for the given url.
    :param url: target url
    :return: BeautifulSoup object, or None on error
    '''
    try:
        html = urlopen(url).read()
        bsObj = BeautifulSoup(html, "lxml")
        return bsObj
    except HTTPError as e:
        print(e)
        return None


def del_mainPage(url):
    '''
    Process the forum home page and return the number of the last page.
    :param url: home-page url
    :return: last page number (int), or None on error
    '''
    bsObj_mainpage = get_bsObj(url)
    last_page = int(bsObj_mainpage.find("a", {"class": "last pagination-item "})['href'].split("=")[-1])
    try:
        red_text = bsObj_mainpage.findAll("span", {"class": "red_text"})
        subject_sum = red_text[0].get_text()  # number of topics
        post_sum = red_text[1].get_text()     # number of posts
        follow_sum = red_text[2].get_text()   # number of followers
    except AttributeError as e:
        print("Error: " + str(e) + " at " + str(datetime.datetime.now()))
        return None
    with open('main_info.txt', 'w+') as f:
        f.writelines("Collected at: " + str(datetime.datetime.now()) + "\n")
        f.writelines("Topics:    " + subject_sum + "\n")
        f.writelines("Posts:     " + post_sum + "\n")
        f.writelines("Followers: " + follow_sum + "\n")
    return last_page

def get_url_from_page(page_url):
    '''
    Process one listing page and return the urls of the posts on it.
    :param page_url: url of the listing page
    :return: list of post urls
    '''
    bsObj_page = get_bsObj(page_url)
    urls = []
    try:
        posts = bsObj_page.findAll("li", {"class": "j_thread_list"})
    except AttributeError as e:
        print("Error: " + str(e) + " at " + str(datetime.datetime.now()))
        return urls                     # page could not be fetched; return the empty list
    for post in posts:
        post_info = post.find("a", {"class": "j_th_tit "})
        urls.append("http://tieba.baidu.com" + post_info.attrs["href"])
    return urls

def del_post(urls):
    '''
    Process the posts at the given urls.
    :param urls: list of post urls
    :return:
    '''
    for url in urls:
        bsObj = get_bsObj(url)
        try:
            obj1 = json.loads(
                bsObj.find("div", attrs={"class": "l_post j_l_post l_post_bright noborder "}).attrs['data-field'])
            reply_num = bsObj.find("li", attrs={"class": "l_reply_num"}).span.get_text()
            post_title = bsObj.find("h1", attrs={"class": "core_title_txt"}).get_text()
        except Exception:
            print("Error at " + str(datetime.datetime.now()) + " " + url)
            with open('error.txt', 'a+') as f:
                f.writelines("Error at " + str(datetime.datetime.now()) + " " + url + "\n")
            continue                    # skip this post and carry on with the rest
        post_id = obj1.get('content').get('post_id')
        post_url = url
        post_date = obj1.get('content').get('date')
        open_id = obj1.get('content').get('open_id')
        open_type = obj1.get('content').get('open_type')
        user_name = obj1.get('author').get('user_name')
        user_sex = obj1.get('author').get('user_sex')
        level_id = obj1.get('author').get('level_id')
        level_name = obj1.get('author').get('level_name')
        postinfo = PostInfo(post_id, post_title, post_url, reply_num, post_date, open_id, open_type, user_name,
                            user_sex, level_id, level_name)
        postinfo.dump_to_csv('post_info2.csv')
        # t = threading.Thread(target=postinfo.dump_to_csv, args=('post_info2.csv',))
        # t.start()
        del postinfo

test_url = "http://tieba.baidu.com/f?kw=python&ie=utf-8"

if __name__ == '__main__':
    #home_page_url = input("Enter the home-page url of the forum to process: ")
    home_page_url = test_url
    bar_name = home_page_url.split("=")[1].split("&")[0]
    pre_page_url = "http://tieba.baidu.com/f?kw=" + bar_name + "&ie=utf-8&pn="      # page-url prefix, without the page offset
    all_post_num = del_mainPage(home_page_url)      # total number of posts in the forum
    del_post_num = int(input("How many posts (from the top) should be processed: "))     # number of posts to process
    if del_post_num > all_post_num:
        print("The requested number of posts exceeds the forum's total!")
    else:
        threads = []
        for page in range(0, del_post_num, 50):
            print("Processing page offset: " + str(page))
            page_url = pre_page_url + str(page)
            urls = get_url_from_page(page_url)
            t = threading.Thread(target=del_post, args=(urls,))
            t.start()
            threads.append(t)
        for t in threads:       # wait for every worker, not just the last one started
            t.join()
        # del_post(urls)        # single-threaded alternative

 

That's all for this part.

More posts on my blog: http://liqiongyu.com/blog
