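1. Python version: 3.5, using the urllib library.
2. Scrape the first page of Qiushibaike's 24-hour hot jokes (page URL: http://www.qiushibaike.com/hot/1).
3. Extract the fields with regular-expression matching, using the re library.
4. Python 2's urllib and urllib2 were merged into the single urllib package in Python 3. In Python 3 the relevant modules are urllib.request, urllib.error, and urllib.parse.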
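As a rough illustration of point 4, the sketch below shows where a few common Python 2 names live in Python 3. The helper `build_request` and the placeholder User-Agent string `example-agent/0.1` are hypothetical names introduced only for this example. The request is built but never sent, so the snippet runs without network access.

```python
import urllib.error
import urllib.parse
import urllib.request

# Python 2 name          ->  Python 3 name
# urllib2.urlopen        ->  urllib.request.urlopen
# urllib2.Request        ->  urllib.request.Request
# urllib2.URLError       ->  urllib.error.URLError
# urllib.urlencode       ->  urllib.parse.urlencode

BASE_URL = 'http://www.qiushibaike.com/hot/page/'


def build_request(page, user_agent='example-agent/0.1'):
    """Build (but do not send) a Request for the given page number.

    Hypothetical helper; the User-Agent default is a placeholder.
    """
    url = urllib.parse.urljoin(BASE_URL, str(page))
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


req = build_request(1)
print(req.full_url)
print(req.get_header('User-agent'))  # Request stores header names capitalized
# HTTPError is a subclass of URLError, so catching URLError covers both.
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))
```

Passing `req` to `urllib.request.urlopen` would perform the actual download, as the full script does.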
# -*- coding:utf-8 -*-
# Scrape the first page of Qiushibaike's 24-hour hot jokes
# (user name, content, funny-vote count, comment count)
import re
import urllib.request
from urllib.error import URLError

url = 'http://www.qiushibaike.com/hot/page/1'

# Request headers (the site checks the User-Agent)
h = {
    'User-Agent': '(Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
}

try:
    requ = urllib.request.Request(url, headers=h)
    response = urllib.request.urlopen(requ)
    content = response.read().decode('utf-8')

    # Regular expression (matches the Qiushibaike page markup at the time of writing).
    # re.S lets '.' match newlines, so one pattern can span several lines of HTML.
    pattern = re.compile(
        '<div class="author clearfix">.*?<h2>(.*?)</h2>.*?<div class="content">(.*?)</div>.*?<div class="stats"'
        '.*?i class="number">(.*?)</i>(.*?)</span>.*?<span class="dash">.*?i class="number">(.*?)</i>(.*?)</a>',
        re.S
    )
    items = re.findall(pattern, content)

    # Skip posts whose content contains an image
    for item in items:
        img = re.search('img', item[1])
        if not img:
            print(item[0], item[1], item[2], item[3], item[4], item[5])
except URLError as e:
    print('error', e.reason)
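The matching step can be tried offline. The sketch below runs a simplified version of the pattern against a small hand-written HTML sample. `SAMPLE_HTML` is made up for this example and only imitates the structure the script's regular expression expects, not the site's real markup. `extract_jokes` is a hypothetical helper, and the pattern captures four groups (name, content, vote count, comment count) instead of the script's six.

```python
import re

# Hand-written sample that imitates the expected structure (an assumption,
# not a copy of the real page).
SAMPLE_HTML = '''
<div class="author clearfix">
<a href="/users/1/"><h2>alice</h2></a>
</div>
<div class="content">
first joke
</div>
<div class="stats">
<span class="stats-vote"><i class="number">120</i> funny</span>
<span class="dash">|</span>
<a href="/article/1"><i class="number">8</i> comments</a>
</div>
<div class="author clearfix">
<a href="/users/2/"><h2>bob</h2></a>
</div>
<div class="content">
second joke <img src="x.jpg">
</div>
<div class="stats">
<span class="stats-vote"><i class="number">45</i> funny</span>
<span class="dash">|</span>
<a href="/article/2"><i class="number">3</i> comments</a>
</div>
'''

# Simplified pattern: non-greedy '.*?' stops at the nearest following tag,
# and re.S lets '.' cross line breaks.
PATTERN = re.compile(
    r'<div class="author clearfix">.*?<h2>(.*?)</h2>'
    r'.*?<div class="content">(.*?)</div>'
    r'.*?<div class="stats".*?<i class="number">(\d+)</i>'
    r'.*?<span class="dash">.*?<i class="number">(\d+)</i>',
    re.S,
)


def extract_jokes(html):
    """Return (author, content, votes, comments) tuples for text-only posts."""
    jokes = []
    for author, content, votes, comments in PATTERN.findall(html):
        if '<img' in content:  # skip posts that contain an image tag
            continue
        jokes.append((author.strip(), content.strip(), int(votes), int(comments)))
    return jokes


jokes = extract_jokes(SAMPLE_HTML)
for joke in jokes:
    print(joke)
```

Checking for `'<img'` rather than the bare string `img` is a small variation on the script: it avoids dropping a joke whose text merely contains the letters "img". Without `re.S`, the same pattern finds nothing in this sample, because every field is separated from the next by a line break.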
Note: this code is based on a blog post I read; I modified it and ran it myself.