The project code has been uploaded to GitHub in full. The goal of this project is to crawl every article on V2EX; the final scraped output looks like the screenshot below.
Creating a Scrapy project is very simple: just run the following command in a shell.
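The original post doesn't show the exact command; assuming the project is named v2ex (which matches the module paths used later), it would be:

```shell
scrapy startproject v2ex
```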
Once the command finishes, Scrapy automatically creates all the files the project needs.
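For reference, the generated skeleton looks roughly like this (project name v2ex assumed):

```
v2ex/
├── scrapy.cfg        # deploy configuration
└── v2ex/
    ├── __init__.py
    ├── items.py      # item definitions
    ├── pipelines.py  # item pipelines
    ├── settings.py   # project settings
    └── spiders/      # spider code goes here
```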
During crawling every request came back with a 403 error: the site uses an anti-web-crawling technique (the same kind Amazon uses). The problem was solved by keeping a pool of user agents and randomly rotating the User-Agent header on each request.
The rotate_useragent.py code below swaps the request header. At the same time, the DOWNLOADER_MIDDLEWARES entry in settings.py has to be uncommented and changed to reference the new middleware:
```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'v2ex.rotate_useragent.RotateUserAgentMiddleware': 400,
}
```
The code of rotate_useragent.py:
```python
#!/usr/bin/python
# -*- coding: utf-8 -*-
import random

from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # pick a user agent at random for this request
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)

    # the default user_agent_list covers Chrome, IE, Firefox, Mozilla, Opera and Netscape;
    # more user agent strings can be found at http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
```
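Note that the scrapy.contrib package has since been deprecated; on Scrapy 1.0 and later the same base class is imported from scrapy.downloadermiddlewares.useragent instead.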
## Items configuration

items.py defines the fields we want to scrape:
```python
# items.py
from scrapy import Item, Field

class TencentItem(Item):
    title = Field()  # article title
    url = Field()    # article link
```
A Scrapy project also contains a pipelines.py file; the spider hands every scraped item to the functions in this file to be stored.
```python
# pipelines.py
import json
import codecs

class JsonWithEncodingTencentPipeline(object):
    def __init__(self):
        # set the encoding explicitly to avoid garbled output
        self.file = codecs.open('v2ex.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # with ensure_ascii=True the output would be ASCII escape sequences;
        # set it to False so Chinese text is written as-is
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # called automatically by Scrapy when the spider finishes
        self.file.close()
```
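For this pipeline to run it also has to be enabled in settings.py. A minimal sketch, assuming the class lives in v2ex/pipelines.py (the priority value 300 is an arbitrary choice):

```python
# settings.py
ITEM_PIPELINES = {
    'v2ex.pipelines.JsonWithEncodingTencentPipeline': 300,
}
```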
The spider follows V2EX's "recent" listing pages through a CrawlSpider rule (sle here is the link extractor imported under an alias):

```python
rules = [
    Rule(sle(allow=(r"recent\?p=\d{1,5}")), follow=True, callback='parse_item'),
]
```
Below is the relevant snippet from the Rule source code: if follow is not specified, it defaults to True only when there is no callback, so we pass follow=True explicitly to keep following pagination links while still invoking the parse_item callback.
```python
# from Rule.__init__ in the Scrapy source
if follow is None:
    self.follow = False if callback else True
else:
    self.follow = follow
```
Below is the HTML markup for a single post. We need to select every div whose class is 'cell item', iterate over them, and then extract the text and href of the item_title link.
<div class="cell item" style=""> <table cellpadding="0" cellspacing="0" border="0" width="100%"> <tr> <td width="48" valign="top" align="center">[ ](/member/jiangew)</td> <td width="10"></td> <td width="auto" valign="middle"><span class="item_title">[ 跳槽季:北京~Java~4 年~服務端碼農](/t/279762#reply6)</span> <div class="sep5"></div> <span class="small fade"><div class="votes"></div>[ Java](/go/java) **[jiangew](/member/jiangew)** 2 分鐘前 最後回覆來自 **[feiyang21687](/member/feiyang21687)**</span> </td> <td width="70" align="right" valign="middle"> [6](/t/279762#reply6) </td> </tr> </table> </div>
The extraction code:
```python
# every post on the listing page lives in a div with class "cell item"
sites_even = sel.css('div.cell.item')
for site in sites_even:
    item = TencentItem()
    # title text and relative link of the post's item_title anchor
    item['title'] = site.css('.item_title a').xpath('text()').extract()[0]
    item['url'] = 'http://v2ex.com' + site.css('.item_title a').xpath('@href').extract()[0]
    items.append(item)
```
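Putting it together, here is a minimal sketch of what the spider module might look like. The spider name, start_urls, file name, and import paths are assumptions not shown in the post (they match the contrib-era Scrapy used above; on Scrapy 1.0+ the imports come from scrapy.spiders and scrapy.linkextractors instead):

```python
# spiders/v2ex_spider.py -- a sketch, not the author's exact file
from scrapy.selector import Selector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor as sle

from v2ex.items import TencentItem


class V2exSpider(CrawlSpider):
    name = 'v2ex'                                    # assumed spider name
    allowed_domains = ['v2ex.com']
    start_urls = ['http://www.v2ex.com/recent?p=1']  # assumed entry point

    # follow all "recent?p=N" pagination links and parse each listing page
    rules = [
        Rule(sle(allow=(r"recent\?p=\d{1,5}")), follow=True, callback='parse_item'),
    ]

    def parse_item(self, response):
        sel = Selector(response)
        items = []
        for site in sel.css('div.cell.item'):
            item = TencentItem()
            item['title'] = site.css('.item_title a').xpath('text()').extract()[0]
            item['url'] = 'http://v2ex.com' + site.css('.item_title a').xpath('@href').extract()[0]
            items.append(item)
        return items
```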