1. Downloading and installing Scrapy on Windows (installing software really broke my heart)

Date: 2019-11-10

Tags: scrapy, download and install, Windows, Python

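Writing a blog, like keeping notes, really is useful: you can flip back through it at any time. I've finished studying crawler principles and data fetching, unstructured and structured data extraction, dynamic HTML handling, and simple image recognition; all that's left is to write them up as blog posts.

I've now started learning Scrapy, so I created a new category for it.

Getting Scrapy downloaded, installed, and actually running cost me three hours, so I'm writing the process down before I forget it.

I'm using Python 3.6. On Windows it takes four steps: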

1. pip3 install wheel

2. Install Twisted

    a. Go to http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted and download Twisted-17.9.0-cp36-cp36m-win_amd64.whl (cp36 matches Python 3.6; win_amd64 matches 64-bit Windows)

    b. Change into the directory containing the downloaded file

    c. pip3 install Twisted-17.9.0-cp36-cp36m-win_amd64.whl

3. pip3 install scrapy

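Then I opened cmd and typed scrapy, and it ran. The basic commands from there are:

    scrapy startproject myspider ----------- create a Scrapy project

    cd myspider ----------- enter the myspider directory

    scrapy genspider baidu baidu.com ----------- create a spider file

    scrapy crawl baidu ----------- run the spider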

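After that it failed with an error saying a module, win32, was missing. Searching online, I learned that Scrapy on Windows depends on pywin32, which can be downloaded from https://sourceforge.net/projects/pywin32/files/

I downloaded it, but the installer reported a problem.

The message said something about not being registered. I searched for a fix, but honestly I couldn't follow it. Painful; I despair at my own intelligence.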

4. In cmd, run python -m pip install pypiwin32

This is the method that worked for me. I found it after searching, in an answer at https://stackoverflow.com/questions/4863056/how-to-install-pywin32-module-in-windows-7
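To confirm that the four steps took effect, a quick check like the one below can help. It is only a sketch: the list of import names is my assumption based on the packages installed above (pypiwin32 is assumed to provide the win32api module), and find_missing is a hypothetical helper, not part of Scrapy.

```python
import importlib.util

# Import names to probe; this list is an assumption based on the steps above.
REQUIRED_MODULES = ["wheel", "twisted", "win32api", "scrapy"]


def find_missing(modules=REQUIRED_MODULES):
    """Return the module names that cannot be located on this interpreter."""
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Still missing:", ", ".join(missing))
    else:
        print("All modules found; scrapy should be runnable.")
```

Run it with the same python that pip installed into; if win32api shows up as missing, that points back to step 4.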

A small example every day: scraping videos. (In fact, once you have found a video's URL, calling urllib.request.urlretrieve(video_url, save_path) is all you need.)
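A minimal sketch of that urlretrieve shortcut, without Scrapy. The helper name download_video and its parameters are made up for illustration; only urllib.request.urlretrieve comes from the text above.

```python
import os
import urllib.request


def download_video(video_url, save_dir, name):
    """Save the file at video_url as <save_dir>/<name>.mp4 and return the path."""
    os.makedirs(save_dir, exist_ok=True)  # create the target folder if needed
    file_path = os.path.join(save_dir, "%s.mp4" % name)
    # urlretrieve copies the resource at the URL into file_path
    urllib.request.urlretrieve(video_url, file_path)
    return file_path
```

A call would look like download_video(video_url, "video", "clip"), where video_url is whatever MP4 link you extracted from the page.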

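The example I built is very simple, and using the Scrapy framework for it looks like overkill. I only downloaded one page; for multiple pages you would loop over the URLs. The main point is to walk through the Scrapy workflow once: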

# Step 1: open items.py in the myspider directory

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


# An Item defines structured data fields that hold the scraped data. It is a
# bit like a Python dict, but adds some extra protection against mistakes.
#
# You define an Item by subclassing scrapy.Item and declaring class attributes
# of type scrapy.Field (you can think of it as similar to an ORM mapping).
class MyspiderItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    mp4_url = scrapy.Field()

# Step 2: open the spider file you created; mine is baisi.py

# -*- coding: utf-8 -*-
import scrapy
from myspider.items import MyspiderItem


class BaisiSpider(scrapy.Spider):
    name = 'baisi'
    # allowed_domains takes bare domain names, not URLs with a scheme
    allowed_domains = ['budejie.com']
    start_urls = ['http://www.budejie.com/video/']

    def parse(self, response):
        # Extract the data
        mp4_links = response.xpath('//li[@class="j-r-list-tool-l-down f-tar j-down-video j-down-hide ipad-hide"]')
        for mp4_link in mp4_links:
            # extract_first() returns None instead of raising when nothing matches
            name = mp4_link.xpath('./@data-text').extract_first()
            video = mp4_link.xpath('./a/@href').extract_first()
            # Only keep entries that actually have an MP4 URL
            if video:
                # Wrap the data in a fresh MyspiderItem for each video
                item = MyspiderItem()
                item['name'] = name
                item['mp4_url'] = video
                # Hand the item over to the pipelines
                yield item

# Step 3: open pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os
import urllib.request


class MyspiderPipeline(object):
    def process_item(self, item, spider):
        # File name
        file_name = "%s.mp4" % item['name']
        # Path where the file is saved (the video folder must already exist)
        file_path = os.path.join("F:\\myspider\\myspider\\video", file_name)
        urllib.request.urlretrieve(item['mp4_url'], file_path)
        return item


# Step 4: run scrapy crawl baisi
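The pipeline only runs if it is enabled, as the comment in pipelines.py reminds us. A sketch of the relevant fragment of the project's settings.py, assuming the project and class names used above; the priority value 300 is just a commonly used default, not something taken from the original post.

```python
# settings.py (fragment)
# The dotted path assumes the project is named "myspider" as above.
ITEM_PIPELINES = {
    # Lower numbers run earlier; values conventionally range from 0 to 1000.
    "myspider.pipelines.MyspiderPipeline": 300,
}
```

Without this entry the spider still yields items, but process_item is never called and nothing is downloaded.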

Result:

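The Scrapy framework

Scrapy is an application framework, implemented in pure Python, for crawling websites and extracting structured data. It is used very widely.

That is the power of a framework: the user only needs to customize a few modules to get a working crawler that fetches page content and all kinds of images, which is very convenient.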

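Scrapy uses Twisted ['twɪstɪd], an asynchronous networking framework (its main rival is Tornado), to handle network communication. This speeds up downloads without our having to implement an asynchronous framework ourselves. Scrapy also exposes various middleware interfaces, so it can flexibly meet all kinds of requirements.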
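To see why asynchronous I/O speeds up downloading, here is a small sketch. Note that it uses the standard library's asyncio rather than Twisted, purely so it runs without extra installs, and the "downloads" are simulated with sleeps; fake_download, crawl, and timed_crawl are hypothetical names.

```python
import asyncio
import time


async def fake_download(url, delay=0.1):
    """Stand-in for a network request: wait without blocking other tasks."""
    await asyncio.sleep(delay)
    return "response for %s" % url


async def crawl(urls):
    # Start all downloads at once and wait for them together
    return await asyncio.gather(*(fake_download(u) for u in urls))


def timed_crawl(urls):
    """Run crawl() and return (responses, elapsed seconds)."""
    start = time.perf_counter()
    responses = asyncio.run(crawl(urls))
    return responses, time.perf_counter() - start
```

Five simulated downloads of 0.1 s each finish in roughly 0.1 s in total rather than 0.5 s, because the waits overlap. Scrapy gets the same kind of overlap from Twisted.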
Official Scrapy documentation: http://doc.scrapy.org/en/latest

Chinese-language Scrapy documentation: http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html

Scrapy architecture diagram (the green lines show the data flow):

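Scrapy Engine: handles communication among the Spider, Item Pipeline, Downloader, and Scheduler, passing signals and data between them.
Scheduler: accepts the Requests sent over by the engine, orders and queues them in a defined way, and hands them back when the engine asks for them.
Downloader: downloads every Request sent by the Scrapy Engine and returns the resulting Responses to the engine, which passes them on to the Spider.
Spider: processes all Responses, parsing them to extract the data needed for the Item fields, and submits any URLs that should be followed to the engine, which feeds them back into the Scheduler.
Item Pipeline: processes the Items produced by the Spider; this is where post-processing (detailed analysis, filtering, storage, and so on) happens.
Downloader Middlewares: think of these as components for customizing and extending the download behavior.
Spider Middlewares: think of these as components for customizing and extending the communication between the engine and the Spider (for example, Responses going into the Spider and Requests coming out of it).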
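The data flow among these components can be sketched as a toy loop in plain Python. This is not how Scrapy is implemented (there is no asynchrony and no middleware here); every name below is hypothetical, and the "web" is just a dict mapping URLs to page text.

```python
from collections import deque


def downloader(url, fake_web):
    """Downloader: turn a request (here just a URL) into a response body."""
    return fake_web.get(url, "")


def spider_parse(url, body):
    """Spider: return (items, follow_up_urls) parsed from one response."""
    items, links = [], []
    for token in body.split():
        if token.startswith("link:"):
            links.append(token[len("link:"):])
        else:
            items.append({"page": url, "word": token})
    return items, links


def run_engine(start_urls, fake_web):
    """Engine: move requests and responses between the other components."""
    scheduler = deque(start_urls)  # Scheduler: queue of pending requests
    seen = set(start_urls)
    stored = []  # stands in for what the Item Pipeline would store
    while scheduler:
        url = scheduler.popleft()               # engine takes the next request
        body = downloader(url, fake_web)        # downloader fetches it
        items, links = spider_parse(url, body)  # spider parses the response
        stored.extend(items)                    # items go to the pipeline
        for link in links:                      # new requests go back to the scheduler
            if link not in seen:
                seen.add(link)
                scheduler.append(link)
    return stored
```

For example, with fake_web = {"a": "hello link:b", "b": "world link:a"}, run_engine(["a"], fake_web) visits page a, follows the link to b, skips the link back to a because it was already seen, and returns one item per page.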