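First, go to http://www.budejie.com/text/. The page consists entirely of text jokes. For now we only scrape the jokes and leave the images alone. Open the page and view its source: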
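Every joke sits in a structure like this:
<a href="/detail-3242432.html">joke text</a>
That gives us a way in: describe this structure with a regular expression that captures both the link and the joke text:
reg = re.compile(r'<a href="(/detail-.*?)">(.*?)</a>')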
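To see what this pattern returns, here is a minimal sketch that runs it on a small hand-written HTML string. The sample markup (`sample_html`, including the `/about.html` link) is an assumption made up for illustration, not copied from the real page.

```python
import re

# Hypothetical sample markup modelled on the structure described above;
# the real page has much more HTML around each link.
sample_html = (
    '<a href="/detail-3242432.html">First joke</a>\n'
    '<a href="/detail-3242433.html">Second joke</a>\n'
    '<a href="/about.html">About</a>\n'
)

reg = re.compile(r'<a href="(/detail-.*?)">(.*?)</a>')

# With two capture groups, findall returns a list of (href, text) tuples.
matches = reg.findall(sample_html)
for href, text in matches:
    print('%s -> %s' % (href, text))
```

Only links whose `href` starts with `/detail-` match, so the navigation link is skipped. Note that `.` does not match a newline by default, so a joke whose text spans several lines would need the `re.S` flag.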
The upvote counts are extracted by repeating the same process:
reg = re.compile(r'<i class="icon-up ui-icon-up"></i> <span>(.*?)</span>')
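A short sketch of the upvote pattern on a made-up HTML string. The `icon-down` line is a hypothetical element, included only to show that the pattern ignores markup with a different class.

```python
import re

# Hypothetical sample markup; only the icon-up structure comes from the page.
sample_html = (
    '<i class="icon-up ui-icon-up"></i> <span>128</span>\n'
    '<i class="icon-down ui-icon-down"></i> <span>7</span>\n'
    '<i class="icon-up ui-icon-up"></i> <span>56</span>\n'
)

reg = re.compile(r'<i class="icon-up ui-icon-up"></i> <span>(.*?)</span>')

# With a single capture group, findall returns a list of strings.
counts = reg.findall(sample_html)

# Convert to integers if the counts are needed for sorting or arithmetic.
up_counts = [int(c) for c in counts]
print(up_counts)
```

The pattern depends on the exact single space between `</i>` and `<span>`, so it would stop matching if the site changed that whitespace.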
The code is as follows:
# encoding: utf-8
import urllib2
import re
def getduan():
    """Download the page and return a list of (link, joke text) tuples."""
    url = 'http://www.budejie.com/text/'
    # Browser-like User-Agent header (a request header, not a proxy)
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    res = response.read()
    reg = re.compile(r'<a href="(/detail-.*?)">(.*?)</a>')
    return re.findall(reg, res)


def up():
    """Download the page and return the list of upvote counts (as strings)."""
    url = 'http://www.budejie.com/text/'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    res = response.read()
    reg = re.compile(r'<i class="icon-up ui-icon-up"></i> <span>(.*?)</span>')
    return re.findall(reg, res)


if __name__ == '__main__':
    # Loop over the pairs directly: a dict would lose the page order
    # in Python 2.7 and would merge entries with identical keys.
    count = 0
    for j, i in zip(getduan(), up()):
        print 'Joke', (count + 1), j[1]
        count = count + 1
        print 'Upvotes:', i
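The script downloads the same page once for the jokes and once for the upvotes. One possible variation, sketched here under the assumption that both patterns are applied to the same HTML, is to separate parsing from downloading so the page is fetched only once. The names `parse_page`, `JOKE_RE`, `UP_RE` and the sample markup are hypothetical, and the sketch avoids the network so it runs on Python 2.7 and 3 alike.

```python
import re

JOKE_RE = re.compile(r'<a href="(/detail-.*?)">(.*?)</a>')
UP_RE = re.compile(r'<i class="icon-up ui-icon-up"></i> <span>(.*?)</span>')


def parse_page(html):
    """Return a list of (joke_text, up_count) pairs in page order."""
    jokes = [text for _href, text in JOKE_RE.findall(html)]
    ups = UP_RE.findall(html)
    # zip stops at the shorter list, so a mismatch in lengths
    # silently drops the unmatched items.
    return list(zip(jokes, ups))


# Hypothetical sample markup standing in for the downloaded page.
sample_html = (
    '<a href="/detail-1.html">Joke A</a>'
    '<i class="icon-up ui-icon-up"></i> <span>10</span>'
    '<a href="/detail-2.html">Joke B</a>'
    '<i class="icon-up ui-icon-up"></i> <span>3</span>'
)

pairs = parse_page(sample_html)
for number, (text, up_count) in enumerate(pairs, 1):
    print('Joke %d: %s (upvotes: %s)' % (number, text, up_count))
```

Keeping the pairs in a list preserves the order in which jokes appear on the page. Pairing by position is still fragile: it assumes every joke has exactly one upvote element and that both appear in the same order.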
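The requests send a browser-like User-Agent header so that the site's anti-scraping checks do not reject them. The environment is Python 2.7, and the final result is shown in the figure.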
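For readers unsure what the header amounts to, the sketch below builds the request object without sending it and reads the header back. The `try`/`except` import is an addition for illustration: `urllib2` from Python 2.7 was split into `urllib.request` and `urllib.error` in Python 3, so the fallback lets the same snippet run on both.

```python
try:
    # Python 3 location of the Request class
    from urllib.request import Request
except ImportError:
    # Python 2.7, as used in this article
    from urllib2 import Request

url = 'http://www.budejie.com/text/'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'

# Building the Request does not open a connection; only urlopen() does.
request = Request(url, headers={'User-Agent': user_agent})

# Header names are stored capitalized, hence 'User-agent' here.
print(request.get_header('User-agent'))
```

This only changes how the client identifies itself. It does not route traffic through another machine, which is what a proxy would do.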
This is a very simple crawler that uses no framework at all. The next posts will tackle crawling with a framework, so stay tuned.