Scrapy framework: a Python crawler

Date: 2020-01-08

A friend asked me to write a crawler for them, so I am recording the process here.

Project overview:

Scrapy framework, Anaconda (Python 3.6)

Development tool:

IDEA

Details:

[Figure: Scrapy architecture diagram]

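Scrapy consists mainly of the following components:

- Engine (Scrapy Engine)
  Handles communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.
- Scheduler
  Accepts the Requests sent by the engine, organizes them into a queue in a defined order, and hands them back to the engine when the engine asks for the next one to download.
- Downloader
  Downloads every Request the engine sends and returns the resulting Response to the engine, which passes it on to the spider for processing.
- Spiders
  Handle every Response and submit any URLs that need to be followed to the engine, which sends them back to the Scheduler.
- Item Pipeline
  Where the items obtained by the Spider are post-processed (detailed analysis, filtering, storage, and so on).
- Downloader Middlewares
  A layer between the Scrapy engine and the downloader; think of it as a component for customizing and extending the download behavior.
- Spider Middlewares
  A layer between the Scrapy engine and the spiders; its main job is to process the responses going into a spider and the requests coming out of it.
- Scheduler Middlewares
  Middleware between the Scrapy engine and the scheduler; a customizable hook on the requests and responses passing between the two.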
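
The way these components hand work to one another can be sketched in plain Python. This is only a toy model under stated assumptions: `FAKE_WEB`, `downloader`, `spider_parse`, `pipeline`, and `engine` are made-up names for illustration, not Scrapy classes, and the real engine is asynchronous.

```python
from collections import deque

# Toy stand-in for the web: each "page" has outgoing links and some data.
FAKE_WEB = {
    "page1": {"links": ["page2"], "data": "first"},
    "page2": {"links": [], "data": "second"},
}


def downloader(url):
    """Downloader: turn a request (here just a URL string) into a response."""
    return {"url": url, **FAKE_WEB[url]}


def spider_parse(response):
    """Spider: yield a scraped item, then follow-up requests."""
    yield {"item": response["data"]}
    for link in response["links"]:
        yield {"request": link}


def pipeline(item, storage):
    """Item pipeline: post-process an item and store it."""
    storage.append(item.upper())


def engine(start_urls):
    """Engine: move requests, responses, and items between the components."""
    scheduler = deque(start_urls)  # Scheduler: queue of pending requests
    seen = set(start_urls)         # simple duplicate filter
    storage = []
    while scheduler:
        url = scheduler.popleft()                # next request from the scheduler
        response = downloader(url)               # downloader fetches it
        for result in spider_parse(response):    # spider handles the response
            if "request" in result:
                if result["request"] not in seen:
                    seen.add(result["request"])
                    scheduler.append(result["request"])
            else:
                pipeline(result["item"], storage)
    return storage


crawled = engine(["page1"])
print(crawled)
```

The loop in `engine` is the part worth reading: requests always go back through the scheduler, and items always go on to the pipeline, which is the same division of labor the list above describes.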

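Steps for building a crawler:

1. Create the project (scrapy startproject xxx): create a new crawler project.

2. Define the target (write items.py): decide which data you want to scrape from the target site.

scrapy genspider <spider name> <domain> -- creates the spider file under the spiders package

scrapy crawl <spider name> -- runs the crawler

3. Write the spider (spiders/xxxspider.py): write the spider and start crawling pages.

4. Store the content (pipelines.py): design a pipeline to store the scraped content.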
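
Step 4 has no code in this post, so here is a minimal sketch of what a pipeline in pipelines.py might look like. The class name `JsonLinesPipeline` and the default file name `rooms.jsonl` are assumptions made for illustration; `open_spider`, `process_item`, and `close_spider` are the hook methods Scrapy calls on a pipeline.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline that writes each item as one JSON line."""

    def __init__(self, path="rooms.jsonl"):
        # path is an assumed default, not something from the original project
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # called for every item the spider yields
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        # return the item so any later pipeline still receives it
        return item

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()
```

To take effect, a pipeline like this would have to be listed in the `ITEM_PIPELINES` setting in settings.py, for example under a path such as `myproject.pipelines.JsonLinesPipeline` (the project name is hypothetical) with a priority like 300.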

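Saving data

There are four common output formats; the -o option writes the scraped items to a file in the format given by its extension.

# JSON format; non-ASCII text is Unicode-escaped by default

scrapy crawl itcast -o teachers.json

# JSON Lines format; non-ASCII text is Unicode-escaped by default

scrapy crawl itcast -o teachers.jsonl

# CSV (comma-separated values), can be opened in Excel

scrapy crawl itcast -o teachers.csv

# XML format

scrapy crawl itcast -o teachers.xml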
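
The "Unicode-escaped by default" remark means Chinese text shows up in the JSON file as `\uXXXX` sequences. The standard-library sketch below shows the same effect outside Scrapy; the sample item is made up. In a Scrapy project, setting `FEED_EXPORT_ENCODING = 'utf-8'` in settings.py is the usual way to get the readable form in the exported file.

```python
import json

item = {"name": "天津"}  # hypothetical scraped item

# Default behavior: non-ASCII characters are written as \uXXXX escapes
escaped = json.dumps(item)
# With ensure_ascii=False the characters are kept as they are
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same data, so the escaped form is not a bug, only harder to read by eye.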

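Here is the crawler I wrote for my friend:

1. First, the requirements. The information we need is what the red box highlights on the target site, which is

https://www.dankegongyu.com/room/tj

We crawl the detail information for every listing.

2. First we write the items: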

import scrapy


class ItcastItem(scrapy.Item):
    # name
    name = scrapy.Field()
    # location
    discount = scrapy.Field()
    # title
    title = scrapy.Field()
    # price
    price = scrapy.Field()
    # details
    list_box = scrapy.Field()

Then we write the spider:

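import scrapy

# Assumes the default project layout, where items.py sits one package above spiders/
from ..items import ItcastItem


class ItcastSpider(scrapy.Spider):
    # spider name
    name = "itcast"
    # domains the spider may crawl; acts as a filter on outgoing requests.
    # Entries must be bare domain names, not URLs.
    allowed_domains = ['www.dankegongyu.com']
    # URL the crawl starts from
    start_urls = ['https://www.dankegongyu.com/room/tj?page=1']

    # Flow: crawl the start (listing) page, collect the detail-page URLs and crawl each of them;
    # once the listing page is done, check for a next page and continue until every page is crawled.
    def parse(self, response):
        node_list = response.xpath("//div[@class='r_lbx']")
        for node in node_list:
            url = node.xpath("./a/@href").extract()[0]
            # crawl the detail page
            # callback names the method that handles the response; dont_filter=True skips
            # the duplicate filter, so a URL that was already seen is requested again
            # yield hands the request back like return does, but the method resumes
            # from this point when the next value is requested
            yield scrapy.Request(url, callback=self.parse_detail1, dont_filter=True)

        # pagination
        # 1. work out the next page number from the currently highlighted page link
        now_url = response.xpath("//div[@class='page']/a[@class='on']/text()").extract()
        next_url = int(now_url[0]) + 1
        print('next_url=', next_url)
        # 2. if a link to the next page exists, request it
        if response.xpath("//div[@class='page']/a[@href='https://www.dankegongyu.com/room/tj?page=" + str(next_url) + "']"):
            url2 = 'https://www.dankegongyu.com/room/tj?page=' + str(next_url)
            yield scrapy.Request(url2, callback=self.parse, dont_filter=True)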


    # parsing rules for the detail page
    def parse_detail1(self, response):
        item = ItcastItem()
        name = response.xpath("//div[@class='room-detail-right']/div[@class='room-name']/h1/text()").extract()
        discount = response.xpath("//div[@class='room-detail-right']/div[@class='room-name']/em/text()").extract()
        title = response.xpath("//div[@class='room-detail-right']/div[@class='room-title']/span/text()").extract()
        price = response.xpath("//div[@class='room-detail-right']//div[@class='room-price-sale']/text()").extract()
        list_box = response.xpath("//div[@class='room-detail-right']/div[@class='room-list-box']//label/text()").extract()

        # process the results and put them into the item
        item['name'] = name[0]
        item['discount'] = discount[0]
        item['title'] = ','.join(title)
        item['price'] = price[0].strip()
        item['list_box'] = ','.join(list_box).replace(' ', '').replace('\n', '').replace(',,,,,', ',')

        yield item
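
The chained `replace` calls used for `list_box` depend on the page producing exactly five commas in a row between labels. A slightly more forgiving alternative, sketched here as a hypothetical helper named `clean_labels` (not part of the original spider), is to clean each text node first and join only the non-empty ones. The sample input is invented.

```python
def clean_labels(labels):
    """Remove all whitespace from each text node and drop the empty ones."""
    # str.split() with no argument splits on any run of whitespace,
    # so joining the pieces back removes spaces, tabs, and newlines
    cleaned = ("".join(label.split()) for label in labels)
    return ",".join(text for text in cleaned if text)


# Made-up sample of what the label XPath might return
sample = [" area: 10 ", "\n", "floor:3\n"]
result = clean_labels(sample)
print(result)
```

In the spider this would replace the right-hand side of the `item['list_box']` assignment with `clean_labels(list_box)`. The trade-off is that it also removes whitespace inside a label, which is what the original `replace(' ', '')` did as well.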

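This crawler does not need any custom middleware, so with that it is finished. It is simple, but I still learned something from writing it.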