Learning Python from Scratch: A First Look at Scrapy

Date: 2019-11-10

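Recently I needed to scrape some content from the web, so I decided to use a crawler as an excuse to learn Python along the way. I am recording the problems I ran into while learning so that I can look them up later.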

1. Preparing the development environment (I use Windows 10 x64)

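Python has quite a few crawler frameworks; I picked Scrapy simply because it is one of the best known. The first step is installation. Since I started from zero, with no Python installed at all, these notes also start with installing Python.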

1.1 Installing Python

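At first I installed Python 3.7, but installing Scrapy kept failing with the dependency error "Microsoft Visual C++ 14.0 is required", and nothing I tried fixed it. Then I read a line in the official Scrapy tutorial that I took to mean only Python 2.7 was supported, which cost me a lot of time. (Scrapy does in fact support Python 3; the error means pip is trying to compile Twisted from source, and a matching precompiled wheel, as described in section 1.3, should get around it there as well.) Fine, 2.7 it is: I downloaded python-2.7.15.amd64.msi from the official Python website. I do not remember whether the installer added Python to the environment variables automatically; if it did not, add the following two entries to PATH yourself, which is easy to do: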

$(Python install path)

$(Python install path)\Scripts

My paths are:

D:\softwares\Python27
D:\softwares\Python27\Scripts

After the installation finishes, press Win+R, run cmd, and type python. If the interpreter prompt appears, Python is installed correctly.
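If typing python does nothing, the usual cause is that the two directories are missing from PATH. Below is a minimal sketch for checking that; the helper name missing_path_entries is my own invention, and the directories are simply the ones from my machine, so substitute yours.

```python
import os


def missing_path_entries(required_dirs, path_value, sep=os.pathsep):
    """Return the directories from required_dirs that are not listed in path_value."""
    # Windows paths are case-insensitive and may end with a slash,
    # so normalise both sides before comparing.
    def norm(p):
        return p.strip().rstrip("\\/").lower()

    present = {norm(p) for p in path_value.split(sep) if p.strip()}
    return [d for d in required_dirs if norm(d) not in present]


if __name__ == "__main__":
    # Example directories taken from this post; adjust to your install path.
    wanted = ["D:\\softwares\\Python27", "D:\\softwares\\Python27\\Scripts"]
    print(missing_path_entries(wanted, os.environ.get("PATH", "")))
```

An empty list means both entries are already on PATH; anything printed still has to be added by hand.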

1.2 Installing a Python IDE: PyCharm

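PyCharm seems to be widely used, so that is what I installed. It looks and feels a lot like Visual Studio. PyCharm comes in a Professional edition and a Community edition; being on a tight budget, I downloaded the Community edition. Users in mainland China may not be able to open the website directly, but the download link itself seems to work, so, as with Python above, here is the version I used: pycharm2018.1.4.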

1.3 Installing Scrapy

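One nice thing about Python is its package manager, pip: the Scrapy package can be downloaded with a single command instead of being hunted down across the web. The Python 2.7.15 we installed earlier already includes pip, so we use it to install Scrapy. Enter the following command in cmd: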

pip install scrapy

Typically, the installation then stops with a missing-dependency error like this:

building 'twisted.test.raiser' extension error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

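Don't panic. Download the matching precompiled package from this website, so that nothing has to be compiled on your own machine. Here it is Twisted that fails to build, so based on my system and Python version I chose Twisted-18.7.0-cp27-cp27m-win_amd64.whl. Once it is downloaded, install it from cmd with the following command: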

pip install d:\Twisted-18.7.0-cp27-cp27m-win_amd64.whl
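Picking the right file is mostly a matter of reading the tags in its name. As a rough illustration, the sketch below splits a wheel filename into the fields defined by the wheel naming convention (PEP 427); the function parse_wheel_filename and the WheelTags tuple are names I made up for this example. In the file I chose, cp27 means CPython 2.7 and win_amd64 means 64-bit Windows.

```python
from collections import namedtuple

# Field order follows the wheel naming convention:
# name-version[-build]-python-abi-platform.whl
WheelTags = namedtuple("WheelTags", "name version python abi platform")


def parse_wheel_filename(filename):
    """Split a wheel filename into name, version, and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: %r" % filename)
    parts = filename[:-len(".whl")].split("-")
    if len(parts) == 6:
        del parts[2]  # drop the optional build tag
    if len(parts) != 5:
        raise ValueError("unexpected wheel filename: %r" % filename)
    return WheelTags(*parts)


tags = parse_wheel_filename("Twisted-18.7.0-cp27-cp27m-win_amd64.whl")
print("python tag: %s, platform tag: %s" % (tags.python, tags.platform))
```

If the python tag or platform tag does not match your interpreter, pip will refuse to install the file, so it is worth checking them before downloading.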

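Once that installation finishes, run pip install scrapy again and check for further dependency errors; if any appear, handle them the same way. At this point everything Scrapy needs is installed, and we can start using it to scrape.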
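To see quickly whether anything is still missing without rerunning pip, something like the following sketch can help. It just tries to import each package; missing_modules is a hypothetical helper of my own, and the package list is only the two that mattered in my case.

```python
import importlib


def missing_modules(names):
    """Return the module names from names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# An empty list means both packages can be imported.
print(missing_modules(["twisted", "scrapy"]))
```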

2. Scraping static images

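A Taobao product page is a good target to practise on, because it contains both static and dynamically loaded data. We start by scraping the images in this part of the page: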

2.1 Creating a Scrapy project

First, create an empty Scrapy project with the following command:

scrapy startproject taobao

Once the project has been generated, open it in PyCharm. First we edit items.py; the class in it temporarily holds the scraped information:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class taobaoItem(scrapy.Item):
    url = scrapy.Field()
    name = scrapy.Field()
    image_urls = scrapy.Field()

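Here we store the product's URL, its name, and the image URLs. Next we create a new spider and call it taobaoSpider. A spider requests pages and extracts the addresses of the things to scrape; put simply, it does the link handling.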

# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy_splash import SplashRequest

from taobao.items import taobaoItem


class taobaoSpider(scrapy.Spider):
    # Must match the name passed to process.crawl() in main.py
    name = "taobaoSpider"
    allowed_domains = ["taobao.com"]
    start_urls = []

    def start_requests(self):
        input_url = 'https://item.taobao.com/item.htm?spm=a1z10.1-c.w4023-18381915794.4.44d14551es5Ex7&id=556114290901'
        self.start_urls.append(input_url)
        for url in self.start_urls:
            # SplashRequest comes from the scrapy-splash package and needs a
            # running Splash instance configured in settings.py (SPLASH_URL)
            yield SplashRequest(url=url, callback=self.parse)

    def parse(self, response):
        # Wrap the response in a Selector so the page source can be queried with XPath
        sel = Selector(response)
        for link in sel.xpath('//*[@id="J_isku"]/div/dl[1]/dd/ul/li/a'):
            # The thumbnail URL is embedded in the inline style attribute
            style = link.xpath('@style').extract()[0]
            # Drop the 17-character prefix and the 28-character suffix around
            # the image path, then ask for the 400x400 version instead
            image_url = "http://" + style[17:-28] + "400x400.jpg"
            name = link.xpath('span/text()').extract()
            item = taobaoItem()
            item['url'] = response.url  # address of the product page
            item['name'] = name
            item['image_urls'] = [image_url]
            yield item  # hand the populated item to the pipeline
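The fixed slice offsets in parse are the fragile part: they only work if the style attribute has exactly the expected shape. The sketch below shows what the slice does and a regular-expression alternative that tolerates small variations. Note the assumptions: the sample style string is made up (it uses img.example.com, not a real Taobao value), I am guessing the attribute looks like "background:url(//host/path_30x30.jpg) center no-repeat;" because that is what the 17 and 28 offsets fit, and both function names are my own.

```python
import re

# Captures the image path between "url(//" and the "_WxH.jpg)" size suffix.
STYLE_URL = re.compile(r"url\((?:https?:)?//(.+?)_\d+x\d+\.jpg\)")


def image_url_by_slice(style):
    """Same arithmetic as the spider: cut 17 chars in front and 28 at the end."""
    return "http://" + style[17:-28] + "400x400.jpg"


def image_url_by_regex(style):
    """Extract the image path with a pattern; return None if it does not match."""
    match = STYLE_URL.search(style)
    if match is None:
        return None
    return "http://" + match.group(1) + "_400x400.jpg"


# Hypothetical style value, shaped the way the slice offsets assume.
sample = "background:url(//img.example.com/i1/abc.jpg_30x30.jpg) center no-repeat;"
print(image_url_by_slice(sample))
print(image_url_by_regex(sample))
```

For the sample both functions give the same address. The regex version keeps working if the thumbnail size or the trailing CSS changes, and returns None instead of a garbage URL when the attribute looks different.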

Next, modify settings.py. This is the project's configuration file, where various parameters are set:

# -*- coding: utf-8 -*-

# Scrapy settings for taobao project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'taobao'

SPIDER_MODULES = ['taobao.spiders']
NEWSPIDER_MODULE = 'taobao.spiders'

ITEM_PIPELINES = {
    'taobao.pipelines.taobaoPipeline': 1,
}
# Directory where downloaded images are saved
IMAGES_STORE = 'd:/download'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

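Note: the generated settings.py sets ROBOTSTXT_OBEY to True, which makes Scrapy obey each site's robots.txt crawling rules. If it stays True, Taobao's data cannot be crawled and a notice to that effect appears in the log. So change it to False: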

ROBOTSTXT_OBEY = False
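To get a feel for what obeying robots.txt means, here is a small sketch using the robots.txt parser from Python's standard library. It only illustrates the idea and is not how Scrapy implements it internally; the rules and the shop.example.com URLs are invented for the example and are not Taobao's actual robots.txt.

```python
try:
    from urllib.robotparser import RobotFileParser  # Python 3
except ImportError:
    from robotparser import RobotFileParser  # Python 2


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules: every crawler is barred from /private/.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(allowed_by_robots(EXAMPLE_ROBOTS, "mybot", "https://shop.example.com/private/item.htm"))
print(allowed_by_robots(EXAMPLE_ROBOTS, "mybot", "https://shop.example.com/public/item.htm"))
```

With ROBOTSTXT_OBEY = True, a check of this kind happens before each request and disallowed pages are skipped; setting it to False turns the check off, so it is up to you to crawl responsibly.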

Finally, set up pipelines.py, which persists the scraped data, that is, writes it to disk or to a database:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import requests
from taobao import settings
import os

class taobaoPipeline(object):
    def process_item(self, item, spider):
        if 'image_urls' in item:  # only handle items that carry image URLs
            images = []  # local paths of the images belonging to this item

            dir_path = '%s/%s' % (settings.IMAGES_STORE, spider.name)

            if not os.path.exists(dir_path):
                os.makedirs(dir_path)
            for image_url in item['image_urls']:
                us = image_url.split('/')[-1:]
                image_file_name = '_'.join(us)
                file_path = '%s/%s' % (dir_path, image_file_name)
                images.append(file_path)
                if os.path.exists(file_path):
                    continue

                headers = {
                    'user-agent': "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0",
                    'cookie': "user_trace_token=20170502200739-07d687303c1e44fa9c7f0259097266d6;"
                }
                # Request the image before opening the file, so that a failed
                # request does not leave an empty file that later runs would skip
                response = requests.get(image_url, stream=True, headers=headers)
                with open(file_path, 'wb') as handle:
                    # Write the image to disk in 1 KB chunks
                    for block in response.iter_content(1024):
                        if not block:
                            break
                        handle.write(block)
        return item
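The file naming in the pipeline boils down to "download directory / spider name / last segment of the image URL". The sketch below isolates just that step so it can be tried without running a crawl; local_image_path is a hypothetical helper name, and the URLs use the placeholder host img.example.com.

```python
def local_image_path(store_dir, spider_name, image_url):
    """Build the path the pipeline writes to: store_dir/spider_name/file_name."""
    # The file name is whatever follows the last "/" in the URL.
    file_name = image_url.split("/")[-1]
    return "%s/%s/%s" % (store_dir, spider_name, file_name)


path_a = local_image_path("d:/download", "taobaoSpider",
                          "http://img.example.com/i1/abc.jpg_400x400.jpg")
path_b = local_image_path("d:/download", "taobaoSpider",
                          "http://img.example.com/i2/abc.jpg_400x400.jpg")
print(path_a)
print(path_a == path_b)
```

One consequence of this scheme is visible in the example: two different URLs that end in the same file name map to the same local path, and because the pipeline skips files that already exist, the second image would never be downloaded.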

Finally, in the taobao directory, create a file named main.py to start the crawler (crawl):

# -*- coding: utf-8 -*-
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
# crawl() takes the value of the spider's `name` attribute
process.crawl("taobaoSpider")
process.start()  # the script will block here until the crawling is finished

The project directory now looks like this:

Click Edit Configurations in the top-right corner of PyCharm, select the main file, and then run the program. The scraped images are saved to the D:\download\taobaoSpider directory on disk.