Scrapy crawler framework

Date: 2020-04-11

Architecture overview

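Scrapy is an open-source, collaborative framework originally designed for page scraping (more precisely, web scraping). It lets you extract the data you need from websites in a fast, simple, and extensible way. Today Scrapy is used much more broadly: for data mining, monitoring, and automated testing, for retrieving data returned by APIs (for example, Amazon Associates Web Services), and as a general-purpose web crawler.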

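Scrapy is built on Twisted, a popular event-driven networking framework for Python. As a result, Scrapy achieves concurrency with non-blocking (asynchronous) code. The overall architecture is roughly as follows.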

I/O multiplexing
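To make the idea of I/O multiplexing concrete, here is a minimal sketch that uses only Python's standard `selectors` module rather than Twisted itself. The function name `multiplex_demo` and the sample messages are made up for illustration. One thread waits on several sockets at once and handles whichever becomes readable, which is the basic pattern an event-driven framework builds on.

```python
import selectors
import socket


def multiplex_demo(messages):
    """Read from several sockets in one thread using I/O multiplexing."""
    sel = selectors.DefaultSelector()
    writers = []
    for i, msg in enumerate(messages):
        reader, writer = socket.socketpair()
        reader.setblocking(False)
        # Register the read end; `data` carries a label for this connection.
        sel.register(reader, selectors.EVENT_READ, data=i)
        writer.sendall(msg)
        writers.append(writer)

    received = {}
    while len(received) < len(messages):
        # select() blocks until at least one registered socket is readable.
        for key, _events in sel.select(timeout=1):
            received[key.data] = key.fileobj.recv(1024)
            sel.unregister(key.fileobj)
            key.fileobj.close()

    for writer in writers:
        writer.close()
    sel.close()
    return received


result = multiplex_demo([b"page-a", b"page-b", b"page-c"])
print(result)
```

The socket pairs stand in for network connections to three different pages; a real crawler would register download sockets instead.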

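# Engine (ENGINE), the overall coordinator
The engine controls the data flow between all components of the system and triggers events when certain actions occur. For details, see the data flow section of the Scrapy documentation.
# Scheduler (SCHEDULER)
Accepts requests sent by the engine, pushes them onto a queue, and returns them when the engine asks for the next one. Think of it as a priority queue of URLs: it decides which URL is crawled next and drops duplicate URLs.
# Downloader (DOWNLOADER)
Downloads page content and returns it to the ENGINE. The downloader is built on Twisted's efficient asynchronous model.
# Spiders (SPIDERS)
Spiders are classes written by the developer to parse responses and either extract items or issue new requests.
# Item pipelines (ITEM PIPELINES)
Process items after they have been extracted, mainly cleaning, validation, and persistence (for example, saving to a database).


# Two kinds of middleware
- Spider middleware
- Downloader middleware (used most often: adding headers, proxies, and cookies, or integrating Selenium)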
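The scheduler's role as a de-duplicating priority queue of URLs can be sketched with the standard library. This is a toy model, not Scrapy's actual scheduler: the class `SimpleScheduler`, its method names, and the rule that a larger `priority` value is served first are all assumptions made for the example.

```python
import heapq
import itertools


class SimpleScheduler:
    """Toy scheduler: priority queue of URLs with duplicate filtering."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker keeps FIFO order

    def enqueue(self, url, priority=0):
        """Queue a URL; return False if it was already seen."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so negate priority: higher priority pops first.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))
        return True

    def next_url(self):
        """Return the next URL to crawl, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


scheduler = SimpleScheduler()
scheduler.enqueue("https://example.com/a")
scheduler.enqueue("https://example.com/b", priority=10)
duplicate_accepted = scheduler.enqueue("https://example.com/a")
order = [scheduler.next_url(), scheduler.next_url(), scheduler.next_url()]
print(duplicate_accepted, order)
```

The second attempt to queue `/a` is rejected, and `/b` is served first because of its higher priority.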

Installation, project creation, and startup

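# 1 Scrapy is a framework, not a module
# 2 Often called the Django of the crawler world (you will notice many similarities to Django)
# 3 Installation
  - macOS and Linux: pip3 install scrapy
  - Windows: pip3 install scrapy (works for most people)
    - If that fails:
      1. pip3 install wheel  # enables installing packages from wheel files; wheel files are available at https://www.lfd.uci.edu/~gohlke/pythonlibs
      3. pip3 install lxml
      4. pip3 install pyopenssl
      5. Download and install pywin32: https://sourceforge.net/projects/pywin32/files/pywin32/
      6. Download the Twisted wheel file: http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
      7. Run pip3 install <download directory>\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
      8. pip3 install scrapy
# 4 After installation, the Scripts folder contains the scrapy.exe executable
  - Create a Scrapy project: scrapy startproject <project name>   (like creating a Django project)
  - Create a spider: scrapy genspider <spider name> <site to crawl>   # you can create several spiders

# 5 Start a spider from the command line
  - scrapy crawl <spider name>
  - scrapy crawl <spider name> --nolog   # start without log output
# 6 Start a spider from a file (recommended)
  - Create main.py in the project directory, then right-click and run it:
    from scrapy.cmdline import execute
    # execute(['scrapy', 'crawl', 'chouti', '--nolog'])  # use this when no log level is set in settings.py
    execute(['scrapy', 'crawl', 'chouti'])  # use this when a log level is set in settings.py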

Project layout and settings

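- crawl_chouti            # project name
  - crawl_chouti          # package folder with the same name as the project
    - spiders             # spiders live here; everything created by genspider goes in this folder
      - __init__.py
      - chouti.py         # chouti spider
      - cnblogs.py        # cnblogs spider
    - items.py            # comparable to models.py in Django; define your model classes here
    - middlewares.py      # middleware (spider middleware and downloader middleware) goes here
    - pipelines.py        # persistence code (to files, MySQL, Redis, MongoDB)
    - settings.py         # configuration file
  - scrapy.cfg            # deployment-related; can be ignored for now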
  
  
  
  
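# settings.py
ROBOTSTXT_OBEY = False   # whether to obey robots.txt; False means crawl regardless
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'    # User-Agent request header; copy it from a browser or take one from a UA pool
LOG_LEVEL = 'ERROR'  # with this setting, only error messages are printed
  # so a plain `scrapy crawl <spider name>` produces no routine log output
  # and `scrapy crawl <spider name> --nolog` is no longer needed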



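# Spider file
import scrapy

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'   # spider name
    allowed_domains = ['dig.chouti.com']  # domains the spider may crawl (domain names, not URLs); comment out to crawl beyond them
    start_urls = ['https://dig.chouti.com/']   # starting point; the spider requests these URLs first

    def parse(self, response):  # called automatically when a response arrives; do the parsing here
        print('---------------------------', response)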
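The entries in `allowed_domains` are compared against the host name of each URL, which is why they should be bare domain names rather than full URLs. The following is a simplified model of that check, not Scrapy's own implementation; the helper `is_allowed` is hypothetical and uses only `urllib.parse`.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


# A bare domain name matches the host of the URL.
ok_domain = is_allowed("https://dig.chouti.com/link/1", ["dig.chouti.com"])
# A full URL never equals a host name, so nothing would be allowed.
ok_url = is_allowed("https://dig.chouti.com/link/1", ["https://dig.chouti.com/"])
# A different site is filtered out.
ok_other = is_allowed("https://example.com/", ["dig.chouti.com"])
print(ok_domain, ok_url, ok_other)
```

In this model a full URL in the list matches nothing, so follow-up requests would be treated as off-site.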

Crawling and parsing data

# 1 Parsing with bs4
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'lxml')
soup.find_all()  # bs4 search
soup.select()  # CSS selectors

# 2 Built-in selectors
response.css
response.xpath

# Using the built-in selectors
  # Everything selected with css or xpath comes back as a list
  # First match: extract_first()
  # All matches: extract()
# CSS selectors for text and attributes:
    # .link-title::text  # text; the value is in the selector's data
    # .link-title::attr(href)   # attribute; the value is in the selector's data
# XPath selectors for text and attributes:
    # .//a[contains(@class,"link-title")]/text()
    # .//a[contains(@class,"link-title")]/@href

# Built-in CSS selectors: get all matches
div_list = response.css('.link-con .link-item')
for div in div_list:
    content = div.css('.link-title').extract()
    print(content)
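The difference between "all matches" and "first match" can be tried without Scrapy installed. The sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's selectors; the helpers `extract` and `extract_first` only imitate the behaviour described above, and the HTML snippet is invented. ElementTree supports only a subset of XPath (no `contains()` or `/text()`), so the class is matched exactly and text and attributes are read from the matched elements.

```python
import xml.etree.ElementTree as ET

HTML = """
<div class="link-con">
  <div class="link-item"><a class="link-title" href="/link/1">First post</a></div>
  <div class="link-item"><a class="link-title" href="/link/2">Second post</a></div>
</div>
"""

root = ET.fromstring(HTML.strip())


def extract(nodes):
    """All matches as a list of text values (imitates extract())."""
    return [n.text for n in nodes]


def extract_first(nodes):
    """First match, or None when nothing matched (imitates extract_first())."""
    return nodes[0].text if nodes else None


links = root.findall(".//a[@class='link-title']")
titles = extract(links)
first_title = extract_first(links)
hrefs = [a.get("href") for a in links]
missing = extract_first(root.findall(".//a[@class='no-such-class']"))
print(titles, first_title, hrefs, missing)
```

Taking the first match returns `None` instead of raising when nothing matches, which is the main reason to prefer it over indexing into the list.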

Data persistence

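# Method 1 (not recommended)
  -1 The parse function returns a list of dicts
    # supported output formats: 'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'
    # the data is written to aa.json
  -2 scrapy crawl chouti -o aa.json
# Code (inside parse, with div_list selected via bs4):
lis = []
for div in div_list:
    content = div.select('.link-title')[0].text
    lis.append({'title': content})
return lis  # return after the loop so every item is included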
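To see what kind of data Method 1 hands to the exporter, this sketch builds the same list-of-dicts shape and serializes it with the standard `json` module. The helper `parse_titles` and the sample titles are made up; the exact formatting of the file Scrapy writes may differ from this output.

```python
import json


def parse_titles(titles):
    """Build the list of dicts that a parse function would return."""
    return [{"title": t} for t in titles]


rows = parse_titles(["First post", "Second post"])
# ensure_ascii=False keeps non-ASCII titles readable in the output.
exported = json.dumps(rows, ensure_ascii=False)
print(exported)
```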


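# Method 2: pipelines
   -1 Create a model class in items.py
   -2 In the spider (chouti.py), import it and put the parsed data into an item object (use square brackets, e.g. item['content'])
   -3 yield the item object
   -4 Register the pipelines in the settings file
       ITEM_PIPELINES = {
           # the number is the priority (smaller number = higher priority)
           'crawl_chouti.pipelines.CrawlChoutiPipeline': 300,
           'crawl_chouti.pipelines.CrawlChoutiRedisPipeline': 301,
       }
   -5 Write the persistence classes in pipelines.py
        open_spider   # method: open the file once at startup
        process_item  # method: write to the file
        close_spider  # method: close the file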
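The steps above can be simulated without Scrapy to show how the priority numbers and the three hook methods interact. This is only a sketch: `run_pipelines`, `UppercasePipeline`, and `CollectPipeline` are hypothetical names, and the registry maps classes to numbers directly instead of using dotted paths as `ITEM_PIPELINES` does.

```python
class UppercasePipeline:
    def open_spider(self, spider):
        self.seen = 0

    def process_item(self, item, spider):
        self.seen += 1
        item["title"] = item["title"].upper()
        return item  # hand the item to the next pipeline

    def close_spider(self, spider):
        pass


class CollectPipeline:
    def open_spider(self, spider):
        self.rows = []

    def process_item(self, item, spider):
        self.rows.append(dict(item))
        return item

    def close_spider(self, spider):
        pass


def run_pipelines(items, registry, spider=None):
    """Run items through pipelines in ascending priority-number order."""
    ordered = sorted(registry.items(), key=lambda kv: kv[1])
    pipelines = [cls() for cls, _ in ordered]
    for p in pipelines:
        p.open_spider(spider)
    for item in items:
        for p in pipelines:
            item = p.process_item(item, spider)
    for p in pipelines:
        p.close_spider(spider)
    return pipelines


registry = {CollectPipeline: 301, UppercasePipeline: 300}
upper, collect = run_pipelines([{"title": "first"}, {"title": "second"}], registry)
print(collect.rows)
```

Because 300 sorts before 301, the uppercase step runs first and the collecting step sees the already modified items. Returning the item from `process_item` is what passes it down the chain.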

Saving to a file

# choutiaa.py: spider file
import scrapy
from bs4 import BeautifulSoup
from chouti.items import ChoutiItem  # import the model class

class ChoutiaaSpider(scrapy.Spider):
    name = 'choutiaa'
    # allowed_domains = ['dig.chouti.com']   # domains the spider may crawl
    start_urls = ['https://dig.chouti.com/']   # starting URL

    # called automatically when a response arrives; parse it here
    def parse(self, response):
        print('----------------', response)
        soup = BeautifulSoup(response.text, 'lxml')
        div_list = soup.select('.link-con .link-item')

        for div in div_list:
            content = div.select('.link-title')[0].text
            href = div.select('.link-title')[0].attrs['href']
            item = ChoutiItem()  # create a model object
            item['content'] = content  # assign values
            item['href'] = href
            yield item  # must use yield
            
# items.py: model classes
import scrapy

class ChoutiItem(scrapy.Item):
    content = scrapy.Field()
    href = scrapy.Field()
    
# pipelines.py: data persistence
class ChoutiPipeline(object):
    def open_spider(self, spider):
        # open the file once, when the spider starts
        self.f = open('a.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # print(item)
        # write the item to the file
        self.f.write(item['content'])
        self.f.write(item['href'])
        self.f.write('\n')
        return item

    def close_spider(self, spider):
        # all writes are done; close the file
        self.f.close()
        
# settings.py
ITEM_PIPELINES = {
    # the number is the priority; smaller means higher priority
   'chouti.pipelines.ChoutiPipeline': 300,
   'chouti.pipelines.ChoutiRedisPipeline': 301,
}

Saving to Redis

# settings.py
ITEM_PIPELINES = {
    # the number is the priority; smaller means higher priority
   'chouti.pipelines.ChoutiPipeline': 300,
   'chouti.pipelines.ChoutiRedisPipeline': 301,
}

# pipelines.py
# save to Redis
import json
from redis import Redis

class ChoutiRedisPipeline(object):
    def open_spider(self, spider):
        # with no arguments, Redis() uses the default connection settings
        self.conn = Redis(password='123')  # get the Redis connection once at startup

    def process_item(self, item, spider):
        print(item)
        s = json.dumps({'content': item['content'], 'href': item['href']})
        self.conn.hset('choudi_article', item['id'], s)  # hash field = item id

        return item

    def close_spider(self, spider):
        pass
        # self.conn.close()

# chouti.py
# Note: item['id'] below requires an `id = scrapy.Field()` line in ChoutiItem (items.py)
import scrapy
from bs4 import BeautifulSoup
from chouti.items import ChoutiItem  # import the model class

class ChoutiaaSpider(scrapy.Spider):
    name = 'choutiaa'
    # allowed_domains = ['dig.chouti.com']   # domains the spider may crawl
    start_urls = ['https://dig.chouti.com/']   # starting URL

    # called automatically when a response arrives; parse it here
    def parse(self, response):
        print('----------------', response)
        soup = BeautifulSoup(response.text, 'lxml')
        div_list = soup.select('.link-con .link-item')

        for div in div_list:
            content = div.select('.link-title')[0].text
            href = div.select('.link-title')[0].attrs['href']
            id = div.attrs['data-id']
            item = ChoutiItem()  # create a model object
            item['content'] = content  # assign values
            item['href'] = href
            item['id'] = id
            yield item  # must use yield
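The Redis pipeline stores each item as a JSON string in a hash, using the item id as the field name, so a later item with the same id overwrites the earlier one. The sketch below shows that behaviour with an in-memory stand-in, so it runs without a Redis server. `FakeRedis` is hypothetical and models only the three calls used here; the sample items are invented.

```python
import json


class FakeRedis:
    """In-memory stand-in for a Redis client; only hset/hget/hlen are modeled."""

    def __init__(self):
        self._hashes = {}

    def hset(self, name, key, value):
        self._hashes.setdefault(name, {})[key] = value

    def hget(self, name, key):
        return self._hashes.get(name, {}).get(key)

    def hlen(self, name):
        return len(self._hashes.get(name, {}))


conn = FakeRedis()
items = [
    {"id": "1", "content": "First post", "href": "/link/1"},
    {"id": "2", "content": "Second post", "href": "/link/2"},
    {"id": "1", "content": "First post (edited)", "href": "/link/1"},
]
for item in items:
    payload = json.dumps({"content": item["content"], "href": item["href"]})
    conn.hset("choudi_article", item["id"], payload)

stored = json.loads(conn.hget("choudi_article", "1"))
count = conn.hlen("choudi_article")
print(count, stored)
```

Three items were written but the hash holds two fields, because the two items with id 1 share a field.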

Action chains: handling slider CAPTCHAs

# Written against the Selenium 3 API; Selenium 4 replaces find_element_by_xpath with find_element(By.XPATH, ...)
from selenium import webdriver
from selenium.webdriver import ActionChains
import time
bro = webdriver.Chrome(executable_path='./chromedriver')
bro.get('https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable')
bro.implicitly_wait(10)

# switch into the frame (rarely needed)
bro.switch_to.frame('iframeResult')
div = bro.find_element_by_xpath('//*[@id="draggable"]')

# 1 create an action chain object
action = ActionChains(bro)
# 2 click and hold an element
action.click_and_hold(div)
# 3 move (three ways)
# action.move_by_offset()  # by (x, y) offset
# action.move_to_element()  # to another element
# action.move_to_element_with_offset()  # to another element, plus an offset


for i in range(5):
    action.move_by_offset(10, 10)

# 4 release the element (let go of the mouse button)
action.release()

# 5 run the queued actions; nothing happens until perform() is called
action.perform()
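For a slider CAPTCHA the drag usually has to cover a specific horizontal distance rather than a fixed number of 10-pixel moves. As a minimal sketch, the hypothetical helper `split_distance` below divides a total distance into integer steps that add up exactly; it assumes a non-negative distance and at least one step, and does not touch Selenium itself.

```python
def split_distance(total, steps):
    """Split a horizontal distance into integer step offsets that sum to total."""
    base, extra = divmod(total, steps)
    # Spread the remainder over the first `extra` steps.
    return [base + 1 if i < extra else base for i in range(steps)]


track = split_distance(52, 5)
print(track)
```

Each value could then be queued on the action chain with `action.move_by_offset(dx, 0)` before calling `release()` and `perform()`.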