Learning the Python Scrapy Framework

The Scrapy Framework

Installing Scrapy

A straight install usually fails with errors, mainly for two reasons:

0x01 Upgrade pip

python -m pip install -U pip

0x02 Manually install dependencies

wheel, lxml, Twisted, and pywin32 need to be installed manually:

pip3 install wheel

pip3 install lxml

pip3 install Twisted

pip3 install pywin32

0x03 Install Scrapy

pip3 install scrapy

Managing Projects with Scrapy

0x01 Create a new spider project with scrapy

mkdir Scrapy 

scrapy startproject myfirstpjt

cd myfirstpjt


0x02 Scrapy commands

Commands come in two kinds: global commands and project commands.

Global commands can be run directly without a Scrapy project; project commands must be run from within a project.

Running scrapy -h outside a Scrapy project directory shows all the global commands:

C:\Users\LENOVO>scrapy -h
Scrapy 2.4.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  commands
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

fetch

The fetch command downloads a URL with the Scrapy downloader and is mainly used to inspect the fetch process.
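For example (example.com is only a placeholder URL), the following prints the downloaded page body; the --nolog option suppresses the crawl log:

scrapy fetch --nolog http://example.com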


0x03 Selectors

Both XPath and CSS selectors are supported.

Selectors also have a .re() method for extracting data with regular expressions.

Unlike .xpath() or .css(), .re() returns a list of unicode strings, so nested .re() calls cannot be constructed.
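A quick sketch of .re() in the scrapy shell (the span class "price" and the pattern are made up for illustration):

prices = response.xpath('//span[@class="price"]/text()').re(r'(\d+\.\d+)')   # returns a plain list of strings, so nothing can be chained after it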

Creating a Scrapy Project

Writing ordinary small scripts or projects is like writing an essay on a blank page; a framework bundles the commonly used pieces and turns the essay question into a fill-in-the-blank one, greatly reducing the workload.

scrapy startproject <project name> (e.g. todayMovie)

tree todayMovie

D:\pycharm\code\Scrapy>scrapy startproject todayMovie
New Scrapy project 'todayMovie', using template directory 'c:\python3.7\lib\site-packages\scrapy\templates\project', created in:
    D:\pycharm\code\Scrapy\todayMovie

You can start your first spider with:
    cd todayMovie
    scrapy genspider example example.com

D:\pycharm\code\Scrapy>tree todayMovie
Folder PATH listing
Volume serial number is 6858-7249
D:\PYCHARM\CODE\SCRAPY\TODAYMOVIE
└─todayMovie
    └─spiders

D:\pycharm\code\Scrapy>

0x01 Create a basic spider with genspider

Create a spider script named wuHanMovieSpider whose search domain is mtime.com:

scrapy genspider wuHanMovieSpider mtime.com


0x02 The files in the project

scrapy.cfg

This mainly declares that the default settings module is the settings file (settings.py) under the todayMovie module, and defines the project name as todayMovie.

items.py defines which fields the spider ultimately needs to collect.

pipelines.py does the clean-up work: after the spider has scraped content from the pages, what happens to that content depends on how pipelines.py is set up.

Only four files need to be modified or filled in: items.py, settings.py, pipelines.py, and wuHanMovieSpider.py.

items.py decides which items to scrape, wuHanMovieSpider.py decides how to crawl, settings.py decides who handles the scraped content, and pipelines.py decides how the scraped content is processed.
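For reference, the layout produced by startproject plus genspider typically looks like this (exact contents can vary slightly between Scrapy versions):

todayMovie/
    scrapy.cfg
    todayMovie/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            wuHanMovieSpider.py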

0x03 XPath selectors

selector = response.xpath('/html/body/div[@id="homeContentRegion"]//text()')[0].extract()

extract() returns the selected content as unicode strings.
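Since .xpath() returns a list of matches, extract() gives a list of strings; extract_first() (or .get() in newer Scrapy versions) returns just the first match, or None if nothing matched:

title = response.xpath('//title/text()').extract_first()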

How XPath traverses the document tree:

Symbol      Purpose
/           selects the document root, usually html
//          selects all descendant nodes starting from the current position
./          extracts relative to the current node; used when extracting data a second time
.           selects the current node (relative path)
..          selects the parent of the current node (relative path)
ELEMENT     selects all ELEMENT element nodes among the child nodes
//ELEMENT   selects all ELEMENT element nodes among the descendant nodes
*           selects all element child nodes
text()      selects all text child nodes
@ATTR       selects the attribute node named ATTR
@*          selects all attribute nodes
/@ATTR      gets the value of a node's attribute

Method      Purpose
contains    a[contains(@href,"test")] finds a tags whose href attribute contains "test"
starts-with a[starts-with(@href,"http")] finds a tags whose href attribute starts with "http"

Examples

response.xpath('//a/text()')    # select the text of all a elements

response.xpath('//div/*/img')   # select all img grandchildren of div

response.xpath('//p[contains(@class,"song")]')  # select p elements whose class attribute contains "song"

response.xpath('//a[contains(@data-pan,"M18_Index_review_short_movieName")]/text()')

response.xpath('//div/a | //div/p')     # union: the page may contain either a or p

selector = response.xpath('//a[contains(@href,"http://movie.mtime.com")]/text()').extract()

References

http://www.javashuo.com/article/p-yqgmlgqj-dy.html

http://www.javashuo.com/article/p-wmhvcwio-no.html

Example: Scraping a Weather Forecast

0x01 Create the weather project and a basic spider

cd Scrapy\code
scrapy startproject weather
scrapy genspider beiJingSpider www.weather.com.cn/weather/101010100.shtml
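genspider drops a skeleton into weather/spiders/beiJingSpider.py that looks roughly like this (details vary by Scrapy version; genspider normally expects a bare domain, so the full URL passed above lands verbatim in allowed_domains and gets edited later, since the final spider below targets tianqi.com instead):

import scrapy


class BeijingspiderSpider(scrapy.Spider):
    name = 'beiJingSpider'
    allowed_domains = ['www.weather.com.cn/weather/101010100.shtml']
    start_urls = ['http://www.weather.com.cn/weather/101010100.shtml/']

    def parse(self, response):
        pass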

0x02 Modify items.py

class WeatherItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    cityDate = scrapy.Field()       # city and date
    week = scrapy.Field()           # day of the week
    temperature = scrapy.Field()    # temperature
    weather = scrapy.Field()        # weather
    wind = scrapy.Field()           # wind force


0x03 scrapy shell

First use the scrapy shell command to test the selectors, and mainly to check whether the site has any anti-scraping measures.

scrapy shell https://www.tianqi.com/beijing/
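Inside the shell session you can quickly check whether the request went through, for example:

response.status               # 403 means the site rejected the request
response.request.headers      # the headers (including User-Agent) that were actually sent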


A 403 response, for example, means access is forbidden, not that the page does not exist.

A simple bypass is to add a User-Agent header and lower the request frequency.

0x04 A simple bypass

Prepare a pool of User-Agent strings in resource.py and use random to pick one of them each time.

step 1: put the prepared resource.py in the same directory as settings.py

resource.py

#-*- coding:utf-8 -*-
UserAgents = [
  "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
  "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
  "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
  "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
  "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
  "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
  "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
  "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
  "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
  "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
  "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
  "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
  "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
  "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
  "Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/76.0",
]


step 2: modify middlewares.py

Import random, UserAgents, and UserAgentMiddleware.
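The imports at the top of middlewares.py would look something like this (the weather.resource module path assumes resource.py sits next to settings.py as in step 1):

import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
from weather.resource import UserAgents   # the UA pool from step 1 (assumed module path)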


At the bottom, add a new class that inherits from the UserAgentMiddleware class.

Its job is to supply a randomly chosen UA header for requests:

class CustomUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent='Scrapy'):
        # ua = "Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/76.0"
        # pick one UA at random from the pool when the middleware is initialised
        ua = random.choice(UserAgents)
        self.user_agent = ua
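Note that this chooses a single UA when the middleware is instantiated. A sketch of a per-request variant would override process_request inside the same class, for example:

    def process_request(self, request, spider):
        # pick a fresh UA for every outgoing request instead of once at start-up
        request.headers['User-Agent'] = random.choice(UserAgents)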


step 3: modify settings.py

Use CustomUserAgentMiddleware in place of the built-in UserAgentMiddleware.

Find the DOWNLOADER_MIDDLEWARES option in settings.py and change it as follows:


DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    #'weather.middlewares.WeatherDownloaderMiddleware': 543,
    'weather.middlewares.CustomUserAgentMiddleware': 542,
}

step 4: adjust the request interval

The time Scrapy waits between two requests is set by DOWNLOAD_DELAY. If anti-scraping were not a concern, smaller would obviously be better; a value of 30 means the site is requested once every 30 seconds.
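In settings.py this is a single line; the value below is only illustrative, use whatever the target site tolerates:

DOWNLOAD_DELAY = 3   # seconds to wait between consecutive requests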


PS: for most sites, adding a UA header is enough to bypass the block.


0x05 Modify beiJingSpider.py

The content we want sits under the div with class="day7", so anchor the selectors at that div.


# note: the URL must end with a trailing /, otherwise nothing comes back
scrapy shell https://tianqi.com/beijing/

selector = response.xpath('//div[@class="day7"]')

selector1 = selector.xpath('ul[@class="week"]/li')


beiJingSpider.py

import scrapy
from weather.items import WeatherItem

class BeijingspiderSpider(scrapy.Spider):
    name = 'beiJingSpider'
    allowed_domains = ['tianqi.com']   # domains only; putting a full URL here triggers an offsite warning
    start_urls = ['https://www.tianqi.com/beijing/']


    def parse(self, response):
        items = []
        city = response.xpath('//dd[@class="name"]/h2/text()').extract()
        Selector = response.xpath('//div[@class="day7"]')
        date = Selector.xpath('ul[@class="week"]/li/b/text()').extract()
        week = Selector.xpath('ul[@class="week"]/li/span/text()').extract()
        wind = Selector.xpath('ul[@class="txt"]/li/text()').extract()
        weather = Selector.xpath('ul[@class="txt txt2"]/li/text()').extract()
        temperature1 = Selector.xpath('div[@class="zxt_shuju"]/ul/li/span/text()').extract()
        temperature2 = Selector.xpath('div[@class="zxt_shuju"]/ul/li/b/text()').extract()
        for i in range(7):
            item = WeatherItem()
            try:
                item['cityDate'] = city[0] + date[i]
                item['week'] = week[i]
                item['temperature'] = temperature1[i] + ',' + temperature2[i]
                item['weather'] = weather[i]
                item['wind'] = wind[i]
            except IndexError as e:
                exit()
            items.append(item)
        return items

0x06 Modify pipelines.py to process the Spider's results

import time
import codecs

class WeatherPipeline:
    def process_item(self, item, spider):
        today = time.strftime('%Y%m%d', time.localtime())
        fileName = today + '.txt'
        with codecs.open(fileName, 'a', 'utf-8') as fp:
            fp.write("%s \t %s \t %s \t %s \t %s \r\n"
                     %(item['cityDate'],
                       item['week'],
                       item['temperature'],
                       item['weather'],
                       item['wind']))
        return item

0x07 Modify settings.py

Find ITEM_PIPELINES and remove the comment in front of it.
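After uncommenting, it should look roughly like this (300 is the default priority in the generated template):

ITEM_PIPELINES = {
    'weather.pipelines.WeatherPipeline': 300,
}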


0x08 Run the crawl

Go back to the weather project directory and run:

scrapy crawl beiJingSpider
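The pipeline above writes its output to a <date>.txt file. As an aside, Scrapy can also export the returned items directly with the -o flag:

scrapy crawl beiJingSpider -o weather.json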

