爬蟲之Scripy

Scrapy是一個爲了爬取網站數據,提取結構性數據而編寫的應用框架。 其能夠應用在數據挖掘,信息處理或存儲歷史數據等一系列的程序中。
其最初是爲了頁面抓取 (更確切來講, 網絡抓取 )所設計的, 也能夠應用在獲取API所返回的數據(例如 Amazon Associates Web Services ) 或者通用的網絡爬蟲。Scrapy用途普遍,能夠用於數據挖掘、監測和自動化測試。html

 

Scrapy 使用了 Twisted異步網絡庫來處理網絡通信。總體架構大體以下python

  • 引擎(Scrapy)
           用來處理整個系統的數據流處理, 觸發事務(框架核心)
  • 調度器(Scheduler)
           用來接受引擎發過來的請求, 壓入隊列中, 並在引擎再次請求的時候返回. 能夠想像成一個URL(抓取網頁的網址或者說是連接)的優先隊列, 由它來決定下一個要抓取的網址是什麼, 同時去除重複的網址
  • 下載器(Downloader)
           用於下載網頁內容, 並將網頁內容返回給蜘蛛(Scrapy下載器是創建在twisted這個高效的異步模型上的)
  • 爬蟲(Spiders)
           爬蟲是主要幹活的, 用於從特定的網頁中提取本身須要的信息, 即所謂的實體(Item)。用戶也能夠從中提取出連接,讓Scrapy繼續抓取下一個頁面
  • 項目管道(Pipeline)
           負責處理爬蟲從網頁中抽取的實體,主要的功能是持久化實體、驗證明體的有效性、清除不須要的信息。當頁面被爬蟲解析後,將被髮送到項目管道,並通過幾個特定的次序處理數據。
  • 下載器中間件(Downloader Middlewares)
          位於Scrapy引擎和下載器之間的框架,主要是處理Scrapy引擎與下載器之間的請求及響應。
  • 爬蟲中間件(Spider Middlewares)
          介於Scrapy引擎和爬蟲之間的框架,主要工做是處理蜘蛛的響應輸入和請求輸出。
  • 調度中間件(Scheduler Middewares)
      介於Scrapy引擎和調度之間的中間件,從Scrapy引擎發送到調度的請求和響應。

 

Scrapy運行流程大概以下:web

  1. 引擎從調度器中取出一個連接(URL)用於接下來的抓取
  2. 引擎把URL封裝成一個請求(Request)傳給下載器
  3. 下載器把資源下載下來,並封裝成應答包(Response)
  4. 爬蟲解析Response
  5. 解析出實體(Item),則交給實體管道進行進一步的處理
  6. 解析出的是連接(URL),則把URL交給調度器等待抓取

安裝:json

#scrapy 的一些依賴:pywin3二、pyOpenSSL、Twisted、lxml 、zope.interface。(安裝的時候,注意看報錯信息)

#安裝wheel
pip3 install wheel-i http://pypi.douban.com/simple --trusted-host pypi.douban.com

#安裝這個依賴包,纔有安裝上Twisted
pip3 install Incremental -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

#再pip3安裝Twisted,可是仍是安裝不成功,會報錯。(解決其它依賴問題)
pip3 install Twisted -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

#再進入軟件存放目錄,再安裝就能夠成功啦。
pip3 install Twisted-17.1.0-cp35-cp35m-win32.whl

#安裝scrapy
pip3 install scrapy -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

#pywin32
下載:https://sourceforge.net/projects/pywin32/files/

建立:api

#建立項目
scrapy startproject xiaohuar


#進入項目
cd xiaohuar


#建立爬蟲應用
scrapy genspider xiaohuar xiaohar.com


#運行爬蟲
scrapy crawl chouti --nolog

目錄:cookie

project_name/
   scrapy.cfg
   project_name/
       __init__.py
       items.py
       pipelines.py
       settings.py
       spiders/
           __init__.py

解釋:網絡

  • scrapy.cfg  項目的配置信息,主要爲Scrapy命令行工具提供一個基礎的配置信息。(真正爬蟲相關的配置信息在settings.py文件中)
  • items.py    設置數據存儲模板,用於結構化數據,如:Django的Model
  • pipelines    數據處理行爲,如:通常結構化的數據持久化
  • settings.py 配置文件,如:遞歸的層數、併發數,延遲下載等
  • spiders      爬蟲目錄,如:建立文件,編寫爬蟲規則

注意:通常建立爬蟲文件時,以網站域名命名架構

選擇器:併發

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from scrapy.selector import Selector, HtmlXPathSelector
from scrapy.http import HtmlResponse
html = """<!DOCTYPE html>
<html>
    <head lang="en">
        <meta charset="UTF-8">
        <title></title>
    </head>
    <body>
        <ul>
            <li class="item-"><a id='i1' href="link.html">first item</a></li>
            <li class="item-0"><a id='i2' href="llink.html">first item</a></li>
            <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
        </ul>
        <div><a href="llink2.html">second item</a></div>
    </body>
</html>
"""
response = HtmlResponse(url='http://example.com', body=html,encoding='utf-8')
# hxs = HtmlXPathSelector(response)
# print(hxs)
# hxs = Selector(response=response).xpath('//a')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[2]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id="i1"]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@href="link.html"][@id="i1"]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[contains(@href, "link")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[starts-with(@href, "link")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/text()').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/@href').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('/html/body/ul/li/a/@href').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('//body/ul/li/a/@href').extract_first()
# print(hxs)
 
# ul_list = Selector(response=response).xpath('//body/ul/li')
# for item in ul_list:
#     v = item.xpath('./a/span')
#     # 或
#     # v = item.xpath('a/span')
#     # 或
#     # v = item.xpath('*/a/span')
#     print(v)

自定義擴展:app

自定義擴展時,利用信號在指定位置註冊制定操做

from scrapy import signals


class MyExtension(object):
    def __init__(self, value):
        self.value = value

    @classmethod
    def from_crawler(cls, crawler):
        val = crawler.settings.getint('MMMM')
        ext = cls(val)

        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)

        return ext

    def spider_opened(self, spider):
        print('open')

    def spider_closed(self, spider):
        print('close')

自定義去重複:

scrapy默認使用 scrapy.dupefilter.RFPDupeFilter 進行去重,相關配置有:

DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'
DUPEFILTER_DEBUG = False
JOBDIR = "保存範文記錄的日誌路徑,如:/root/"  # 最終路徑爲 /root/requests.seen

自定義:

#偶合性低,給url去重使用
class RepeatFilter(object):
    def __init__(self):
        self.visited_set = set()
    @classmethod
    def from_settings(cls, settings):
        return cls()
    def request_seen(self, request):
        if request.url in self.visited_set:#先看當前url在不在visited_set
            return True
        self.visited_set.add(request.url) #若是不在就加進去
        return False
    def open(self):  # 每次開始的時候都會調用
        # print('open')
        pass
    def close(self, reason): #每次結束的時候都會調用
        # print('close')
        pass
    def log(self, request, spider):#每次捕捉到重複的url都會寫在log裏面
        # print('log....')
        pass

settings:

# 1. 爬蟲名稱
# BOT_NAME = 'step8_king'

# 2. 爬蟲應用路徑
# SPIDER_MODULES = ['step8_king.spiders']
# NEWSPIDER_MODULE = 'step8_king.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# 3. 客戶端 user-agent請求頭
# USER_AGENT = 'step8_king (+http://www.yourdomain.com)'

# Obey robots.txt rules
# 4. 禁止爬蟲配置
# ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# 5. 併發請求數
# CONCURRENT_REQUESTS = 4

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# 6. 延遲下載秒數
# DOWNLOAD_DELAY = 2


# The download delay setting will honor only one of:
# 7. 單域名訪問併發數,而且延遲下次秒數也應用在每一個域名
# CONCURRENT_REQUESTS_PER_DOMAIN = 2
# 單IP訪問併發數,若是有值則忽略:CONCURRENT_REQUESTS_PER_DOMAIN,而且延遲下次秒數也應用在每一個IP
# CONCURRENT_REQUESTS_PER_IP = 3

# Disable cookies (enabled by default)
# 8. 是否支持cookie,cookiejar進行操做cookie
# COOKIES_ENABLED = True
# COOKIES_DEBUG = True

# Disable Telnet Console (enabled by default)
# 9. Telnet用於查看當前爬蟲的信息,操做爬蟲等...
#    使用telnet ip port ,而後經過命令操做
# TELNETCONSOLE_ENABLED = True
# TELNETCONSOLE_HOST = '127.0.0.1'
# TELNETCONSOLE_PORT = [6023,]

# Override the default request headers:
# 10. 默認請求頭
# DEFAULT_REQUEST_HEADERS = {
#     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#     'Accept-Language': 'en',
# }


# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
# 11. 定義pipeline處理請求
# ITEM_PIPELINES = {
#    'step8_king.pipelines.CustomPipeline': 500,
# }


# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
# 12. 自定義擴展,基於信號進行調用
# EXTENSIONS = {
#     # 'step8_king.extensions.MyExtension': 500,
# }


# 13. 爬蟲容許的最大深度,能夠經過meta查看當前深度;0表示無深度
# DEPTH_LIMIT = 3

# 14. 爬取時,0表示深度優先Lifo(默認);1表示廣度優先FiFo

# 後進先出,深度優先
# DEPTH_PRIORITY = 0
# SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleLifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.LifoMemoryQueue'
# 先進先出,廣度優先

# DEPTH_PRIORITY = 1
# SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'

# 15. 調度器隊列
# SCHEDULER = 'scrapy.core.scheduler.Scheduler'
# from scrapy.core.scheduler import Scheduler


# 16. 訪問URL去重
# DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl'


# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
# 開始自動限速
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# 初始下載延遲
# AUTOTHROTTLE_START_DELAY = 10

# The maximum download delay to be set in case of high latencies
# 最大下載延遲
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to each remote server
# 平均每秒併發數
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Enable showing throttling stats for every response received:
# 是否顯示
# AUTOTHROTTLE_DEBUG = True

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
# 爬蟲中間件
SPIDER_MIDDLEWARES = {
   'step8_king.middlewares.MyCustomSpiderMiddleware': 543,
}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# 下載中間件
DOWNLOADER_MIDDLEWARES = {
   # 'step8_king.middlewares.MyCustomDownloaderMiddleware': 500,
}

自定義pipline

 

一個簡單的爬蟲:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import scrapy
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
import re
import urllib
import os
 
 
class XiaoHuarSpider(scrapy.spiders.Spider):
    name = "xiaohuar"
    allowed_domains = ["xiaohuar.com"]
    start_urls = [
        "http://www.xiaohuar.com/list-1-1.html",
    ]
 
    def parse(self, response):
        # 分析頁面
        # 找到頁面中符合規則的內容(校花圖片),保存
        # 找到全部的a標籤,再訪問其餘a標籤,一層一層的搞下去
 
        hxs = HtmlXPathSelector(response)
 
        # 若是url是 http://www.xiaohuar.com/list-1-\d+.html
        if re.match('http://www.xiaohuar.com/list-1-\d+.html', response.url):
            items = hxs.select('//div[@class="item_list infinite_scroll"]/div')
            for i in range(len(items)):
                src = hxs.select('//div[@class="item_list infinite_scroll"]/div[%d]//div[@class="img"]/a/img/@src' % i).extract()
                name = hxs.select('//div[@class="item_list infinite_scroll"]/div[%d]//div[@class="img"]/span/text()' % i).extract()
                school = hxs.select('//div[@class="item_list infinite_scroll"]/div[%d]//div[@class="img"]/div[@class="btns"]/a/text()' % i).extract()
                if src:
                    ab_src = "http://www.xiaohuar.com" + src[0]
                    file_name = "%s_%s.jpg" % (school[0].encode('utf-8'), name[0].encode('utf-8'))
                    file_path = os.path.join("/Users/wupeiqi/PycharmProjects/beauty/pic", file_name)
                    urllib.urlretrieve(ab_src, file_path)
 
        # 獲取全部的url,繼續訪問,並在其中尋找相同的url
        all_urls = hxs.select('//a/@href').extract()
        for url in all_urls:
            if url.startswith('http://www.xiaohuar.com/list-1-'):
                yield Request(url, callback=self.parse)

以上代碼將符合規則的頁面中的圖片保存在指定目錄,而且在HTML源碼中找到全部的其餘 a 標籤的href屬性,從而「遞歸」的執行下去,直到全部的頁面都被訪問過爲止。以上代碼之因此能夠進行「遞歸」的訪問相關URL,關鍵在於parse方法使用了 yield Request對象。

注:能夠修改settings.py 中的配置文件,以此來指定「遞歸」的層數,如: DEPTH_LIMIT = 1

獲取相應的cookie:

def parse(self, response):
    from scrapy.http.cookies import CookieJar
    cookieJar = CookieJar()
    cookieJar.extract_cookies(response, response.request)
    print(cookieJar._cookies)

格式化處理:

上述實例只是簡單的圖片處理,因此在parse方法中直接處理。若是對於想要獲取更多的數據(獲取頁面的價格、商品名稱、QQ等),則能夠利用Scrapy的items將數據格式化,而後統一交由pipelines來處理。

import scrapy
 
class JieYiCaiItem(scrapy.Item):
 
    company = scrapy.Field()
    title = scrapy.Field()
    qq = scrapy.Field()
    info = scrapy.Field()
    more = scrapy.Field()

上述定義模板,之後對於從請求的源碼中獲取的數據贊成按照此結構來獲取,因此在spider中須要有一下操做:

import scrapy
import hashlib
from beauty.items import JieYiCaiItem
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class JieYiCaiSpider(scrapy.spiders.Spider):
    count = 0
    url_set = set()

    name = "jieyicai"
    domain = 'http://www.jieyicai.com'
    allowed_domains = ["jieyicai.com"]

    start_urls = [
        "http://www.jieyicai.com",
    ]

    rules = [
        #下面是符合規則的網址,可是不抓取內容,只是提取該頁的連接(這裏網址是虛構的,實際使用時請替換)
        #Rule(SgmlLinkExtractor(allow=(r'http://test_url/test?page_index=\d+'))),
        #下面是符合規則的網址,提取內容,(這裏網址是虛構的,實際使用時請替換)
        #Rule(LinkExtractor(allow=(r'http://www.jieyicai.com/Product/Detail.aspx?pid=\d+')), callback="parse"),
    ]

    def parse(self, response):
        md5_obj = hashlib.md5()
        md5_obj.update(response.url)
        md5_url = md5_obj.hexdigest()
        if md5_url in JieYiCaiSpider.url_set:
            pass
        else:
            JieYiCaiSpider.url_set.add(md5_url)
            
            hxs = HtmlXPathSelector(response)
            if response.url.startswith('http://www.jieyicai.com/Product/Detail.aspx'):
                item = JieYiCaiItem()
                item['company'] = hxs.select('//span[@class="username g-fs-14"]/text()').extract()
                item['qq'] = hxs.select('//span[@class="g-left bor1qq"]/a/@href').re('.*uin=(?P<qq>\d*)&')
                item['info'] = hxs.select('//div[@class="padd20 bor1 comard"]/text()').extract()
                item['more'] = hxs.select('//li[@class="style4"]/a/@href').extract()
                item['title'] = hxs.select('//div[@class="g-left prodetail-text"]/h2/text()').extract()
                yield item

            current_page_urls = hxs.select('//a/@href').extract()
            for i in range(len(current_page_urls)):
                url = current_page_urls[i]
                if url.startswith('/'):
                    url_ab = JieYiCaiSpider.domain + url
                    yield Request(url_ab, callback=self.parse)

此處代碼的關鍵在於:

  • 將獲取的數據封裝在了Item對象中
  • yield Item對象 (一旦parse中執行yield Item對象,則自動將該對象交個pipelines的類來處理)
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json
from twisted.enterprise import adbapi
import MySQLdb.cursors
import re

mobile_re = re.compile(r'(13[0-9]|15[012356789]|17[678]|18[0-9]|14[57])[0-9]{8}')
phone_re = re.compile(r'(\d+-\d+|\d+)')

class JsonPipeline(object):

    def __init__(self):
        self.file = open('/Users/wupeiqi/PycharmProjects/beauty/beauty/jieyicai.json', 'wb')


    def process_item(self, item, spider):
        line = "%s  %s\n" % (item['company'][0].encode('utf-8'), item['title'][0].encode('utf-8'))
        self.file.write(line)
        return item

class DBPipeline(object):

    def __init__(self):
        self.db_pool = adbapi.ConnectionPool('MySQLdb',
                                             db='DbCenter',
                                             user='root',
                                             passwd='123',
                                             cursorclass=MySQLdb.cursors.DictCursor,
                                             use_unicode=True)

    def process_item(self, item, spider):
        query = self.db_pool.runInteraction(self._conditional_insert, item)
        query.addErrback(self.handle_error)
        return item

    def _conditional_insert(self, tx, item):
        tx.execute("select nid from company where company = %s", (item['company'][0], ))
        result = tx.fetchone()
        if result:
            pass
        else:
            phone_obj = phone_re.search(item['info'][0].strip())
            phone = phone_obj.group() if phone_obj else ' '

            mobile_obj = mobile_re.search(item['info'][1].strip())
            mobile = mobile_obj.group() if mobile_obj else ' '

            values = (
                item['company'][0],
                item['qq'][0],
                phone,
                mobile,
                item['info'][2].strip(),
                item['more'][0])
            tx.execute("insert into company(company,qq,phone,mobile,address,more) values(%s,%s,%s,%s,%s,%s)", values)

    def handle_error(self, e):
        print 'error',e

上述中的pipelines中有多個類,到底Scapy會自動執行那個?哈哈哈哈,固然須要先配置了,否則Scapy就蒙逼了。。。

在settings.py中作以下配置:

ITEM_PIPELINES = {
    'beauty.pipelines.DBPipeline': 300,
    'beauty.pipelines.JsonPipeline': 100,
}
# 每行後面的整型值,肯定了他們運行的順序,item按數字從低到高的順序,經過pipeline,一般將這些數字定義在0-1000範圍內。

一個小蜘蛛:

import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
from scrapy.http.cookies import CookieJar
from scrapy import FormRequest


class ChouTiSpider(scrapy.Spider):
    # 爬蟲應用的名稱,經過此名稱啓動爬蟲命令
    name = "chouti"
    # 容許的域名
    allowed_domains = ["chouti.com"]

    cookie_dict = {}
    has_request_set = {}

    def start_requests(self):
        url = 'http://dig.chouti.com/'
        # return [Request(url=url, callback=self.login)]
        yield Request(url=url, callback=self.login)

    def login(self, response):
        cookie_jar = CookieJar()
        cookie_jar.extract_cookies(response, response.request)
        for k, v in cookie_jar._cookies.items():
            for i, j in v.items():
                for m, n in j.items():
                    self.cookie_dict[m] = n.value

        req = Request(
            url='http://dig.chouti.com/login',
            method='POST',
            headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
            body='phone=8615131255089&password=pppppppp&oneMonth=1',
            cookies=self.cookie_dict,
            callback=self.check_login
        )
        yield req

    def check_login(self, response):
        req = Request(
            url='http://dig.chouti.com/',
            method='GET',
            callback=self.show,
            cookies=self.cookie_dict,
            dont_filter=True
        )
        yield req

    def show(self, response):
        # print(response)
        hxs = HtmlXPathSelector(response)
        news_list = hxs.select('//div[@id="content-list"]/div[@class="item"]')
        for new in news_list:
            # temp = new.xpath('div/div[@class="part2"]/@share-linkid').extract()
            link_id = new.xpath('*/div[@class="part2"]/@share-linkid').extract_first()
            yield Request(
                url='http://dig.chouti.com/link/vote?linksId=%s' %(link_id,),
                method='POST',
                cookies=self.cookie_dict,
                callback=self.do_favor
            )

        page_list = hxs.select('//div[@id="dig_lcpage"]//a[re:test(@href, "/all/hot/recent/\d+")]/@href').extract()
        for page in page_list:

            page_url = 'http://dig.chouti.com%s' % page
            import hashlib
            hash = hashlib.md5()
            hash.update(bytes(page_url,encoding='utf-8'))
            key = hash.hexdigest()
            if key in self.has_request_set:
                pass
            else:
                self.has_request_set[key] = page_url
                yield Request(
                    url=page_url,
                    method='GET',
                    callback=self.show
                )

    def do_favor(self, response):
        print(response.text)
自動登陸抽屜點贊
相關文章
相關標籤/搜索