Crawler in Practice: Scraping Budejie (百思不得姐)

Date: 2019-11-18


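Having worked through an introduction to crawlers, I wanted some hands-on practice, so I picked a joke site, Budejie (百思不得姐), and set out to scrape its jokes: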

First, go to http://www.budejie.com/text/. Everything on that page is text jokes. For now only the jokes are scraped, not the images. Open the page and view its source:

The jokes all turn out to sit in a structure like <a href="/detail-3242432.html">joke text</a>,
so there is a way in: capture the link and the joke text with a regular expression, reg = re.compile(r'<a href="(/detail-.*?)">(.*?)</a>')
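To make it concrete what this pattern captures, here is a minimal sketch run against a made-up HTML fragment. The `sample_html` string and its joke texts are assumptions for illustration, not copied from the site, and the snippet uses Python 3 syntax. Because the pattern has two groups, `findall` returns a list of `(href, text)` tuples.

```python
import re

# Hypothetical fragment imitating the structure described above.
sample_html = (
    '<a href="/detail-3242432.html">First joke text</a>\n'
    '<a href="/detail-3242433.html">Second joke text</a>\n'
    '<a href="/about.html">About us</a>\n'
)

reg = re.compile(r'<a href="(/detail-.*?)">(.*?)</a>')
jokes = reg.findall(sample_html)

# Only links whose href starts with /detail- are captured.
for href, text in jokes:
    print(href, text)
```

The second element of each tuple is the joke text, which is why the script later indexes each result with `[1]`.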
The upvote counts are extracted by repeating the same process:

reg = re.compile(r'<i class="icon-up ui-icon-up"></i>&nbsp;&nbsp;<span>(.*?)</span>')
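A similar sketch for the upvote pattern, again on an invented fragment (the `sample_html` value is an assumption, and the snippet uses Python 3 syntax). With a single group, `findall` returns plain strings. The pattern here assumes the page source contains the literal `&nbsp;&nbsp;` entities between the icon and the `<span>`, as in the script's own pattern.

```python
import re

# Hypothetical fragment; the real page markup may differ.
sample_html = (
    '<i class="icon-up ui-icon-up"></i>&nbsp;&nbsp;<span>128</span>\n'
    '<i class="icon-down ui-icon-down"></i>&nbsp;&nbsp;<span>7</span>\n'
    '<i class="icon-up ui-icon-up"></i>&nbsp;&nbsp;<span>56</span>\n'
)

reg = re.compile(r'<i class="icon-up ui-icon-up"></i>&nbsp;&nbsp;<span>(.*?)</span>')
ups = reg.findall(sample_html)

# The icon-down entry is skipped because its class does not match.
print(ups)
```

Note that the captured values are strings. Converting them with `int()` would be needed before doing any arithmetic or sorting on them.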

The code is as follows:

# encoding: utf-8
import urllib2
import re


def getduan():
    url = 'http://www.budejie.com/text/'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'  # browser-like User-Agent string
    headers = {'User-Agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    res = response.read()
    reg = re.compile(r'<a href="(/detail-.*?)">(.*?)</a>')
    return re.findall(reg, res)


def up():
    url = 'http://www.budejie.com/text/'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    res = response.read()
    reg = re.compile(r'<i class="icon-up ui-icon-up"></i>&nbsp;&nbsp;<span>(.*?)</span>')
    return re.findall(reg, res)


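if __name__ == '__main__':
    # Pair each joke with its upvote count by position. Iterating over the
    # zipped list directly (instead of building a dict) keeps the page order
    # and does not drop entries that happen to be identical.
    for count, (duan, ups) in enumerate(zip(getduan(), up()), 1):
        print 'Joke', count, duan[1]
        print 'Upvotes:', ups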

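A browser-like User-Agent header is sent here to make the request less likely to be rejected by anti-crawler checks. The environment is Python 2.7, and the final output is shown in the figure.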
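Since the script above targets Python 2.7, here is a rough sketch of how the same idea might look on Python 3, under a few assumptions: `urllib.request` stands in for `urllib2`, the page is assumed to be UTF-8 encoded, and the helper names `fetch_html` and `parse_page` are hypothetical. It also downloads the page once and applies both patterns to the same HTML rather than requesting it twice. The parsing function is exercised on an invented fragment, so nothing in the snippet depends on the site being reachable.

```python
import re
import urllib.request

URL = 'http://www.budejie.com/text/'
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'

JOKE_RE = re.compile(r'<a href="(/detail-.*?)">(.*?)</a>')
UP_RE = re.compile(r'<i class="icon-up ui-icon-up"></i>&nbsp;&nbsp;<span>(.*?)</span>')


def fetch_html(url=URL):
    """Download the page with a browser-like User-Agent header."""
    request = urllib.request.Request(url, headers={'User-Agent': USER_AGENT})
    with urllib.request.urlopen(request, timeout=10) as response:
        # Assumption: the page is UTF-8; undecodable bytes are replaced.
        return response.read().decode('utf-8', errors='replace')


def parse_page(html):
    """Return (joke_text, upvotes) pairs in page order."""
    texts = [text for _href, text in JOKE_RE.findall(html)]
    ups = UP_RE.findall(html)
    # zip stops at the shorter list, so unmatched entries are dropped.
    return list(zip(texts, ups))


# Invented fragment for a quick offline check.
sample_html = (
    '<a href="/detail-1.html">First joke</a>'
    '<i class="icon-up ui-icon-up"></i>&nbsp;&nbsp;<span>12</span>'
    '<a href="/detail-2.html">Second joke</a>'
    '<i class="icon-up ui-icon-up"></i>&nbsp;&nbsp;<span>3</span>'
)
pairs = parse_page(sample_html)
for number, (text, ups) in enumerate(pairs, 1):
    print('Joke', number, text)
    print('Upvotes:', ups)
```

To try it against the live site, `parse_page(fetch_html())` would replace the sample fragment. Whether the patterns still match depends on the site's current markup.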

This is a very simple crawler that uses no framework at all. Later posts will tackle crawling with a framework, so stay tuned.
