To use Scrapy, you first need to install it.
The Python environment used here is 3.6.
On Windows, activate the Python 3.6 environment:
activate python36
On macOS:
mac@macdeMacBook-Pro:~$ source activate python36
(python36) mac@macdeMacBook-Pro:~$
Install Scrapy, create the project, and generate the spider:
(python36) mac@macdeMacBook-Pro:~$ pip install scrapy
(python36) mac@macdeMacBook-Pro:~$ scrapy --version
Scrapy 1.8.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
(python36) mac@macdeMacBook-Pro:~$ scrapy startproject images
New Scrapy project 'images', using template directory '/Users/mac/anaconda3/envs/python36/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /Users/mac/images

You can start your first spider with:
    cd images
    scrapy genspider example example.com
(python36) mac@macdeMacBook-Pro:~$ cd images
(python36) mac@macdeMacBook-Pro:~/images$ scrapy genspider -t crawl pexels www.pexels.com
Created spider 'pexels' using template 'crawl' in module:
  images.spiders.pexels
(python36) mac@macdeMacBook-Pro:~/images$
In settings.py, turn off robots.txt compliance:
ROBOTSTXT_OBEY = False
Analyze the URL patterns of the target site, www.pexels.com:
https://www.pexels.com/photo/man-using-black-camera-3136161/
https://www.pexels.com/video/beach-waves-and-sunset-855633/
https://www.pexels.com/photo/white-vehicle-2569855/
https://www.pexels.com/photo/monochrome-photo-of-city-during-daytime-3074526/
From these, derive the rule for the pages to crawl (photo pages only):
rules = (
Rule(LinkExtractor(allow=r'^https://www.pexels.com/photo/.*/$'), callback='parse_item', follow=True),
)
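As a quick sanity check, the pattern can be tried against the sample URLs listed above without running the crawler. This is only a sketch using the standard `re` module; the helper name `is_photo_page` is made up for illustration.

```python
import re

# Same pattern as in the rule; here it is applied with a plain regex search.
PHOTO_PATTERN = re.compile(r'^https://www.pexels.com/photo/.*/$')

SAMPLE_URLS = [
    'https://www.pexels.com/photo/man-using-black-camera-3136161/',
    'https://www.pexels.com/video/beach-waves-and-sunset-855633/',
    'https://www.pexels.com/photo/white-vehicle-2569855/',
    'https://www.pexels.com/photo/monochrome-photo-of-city-during-daytime-3074526/',
]


def is_photo_page(url):
    """Return True if the URL looks like a photo detail page (hypothetical helper)."""
    return PHOTO_PATTERN.search(url) is not None


# Collect the URLs the rule would follow and show the decision for each one.
matched = [u for u in SAMPLE_URLS if is_photo_page(u)]
for url in SAMPLE_URLS:
    print(is_photo_page(url), url)
```

The video URL is the only one that should be rejected, which is the intent of the rule.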
The images pipeline requires two fields to be defined on the item:
class ImagesItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
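image_urls holds the scraped image URLs; the spider must fill it in and pass it along.
images is filled in by the images pipeline with the download results, which is how a completed download can be checked. I did not see this field when I printed the item in the spider, because at that point the pipeline has not run yet.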
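One way to look at the images field is to print it from a pipeline that runs after the images pipeline. The class below is a hypothetical helper, not part of the original project; it assumes each entry in images is a dict describing one download (in Scrapy 1.8 this includes keys such as url, path and checksum, as far as I know).

```python
class ShowImagesPipeline(object):
    """Hypothetical helper: print the download results stored on each item."""

    def process_item(self, item, spider):
        # 'images' is only present once the images pipeline has handled the item,
        # so fall back to an empty list if it is missing.
        for result in item.get('images', []):
            print(result)
        return item
```

To try it, it would need an entry in ITEM_PIPELINES with a larger number than scrapy.pipelines.images.ImagesPipeline, so that it runs afterwards.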
In pexels.py, import the item and create an instance of it:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from images.items import ImagesItem


class PexelsSpider(CrawlSpider):
    name = 'pexels'
    allowed_domains = ['www.pexels.com']
    start_urls = ['http://www.pexels.com/']

    rules = (
        Rule(LinkExtractor(allow=r'^https://www.pexels.com/photo/.*/$'), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        item = ImagesItem()
        item['image_urls'] = response.xpath('//img[contains(@src,"photos")]/@src').extract()
        print(item['image_urls'])
        return item
In settings.py, enable the images pipeline and set the storage path:
ITEM_PIPELINES = {
    #'images.pipelines.ImagesPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1
}
# Directory where downloaded images are stored
IMAGES_STORE = '/www/crawl'
# Item field that lists the URLs to download
IMAGES_URLS_FIELD = 'image_urls'
Start the spider:
scrapy crawl pexels --nolog
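The images are downloaded.
However, the downloaded images are not high resolution; the suffix (query string) on each image URL has to be dealt with.
In settings.py, enable the project's default pipeline and give it a higher priority (a lower number) so that it runs before the images pipeline: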
ITEM_PIPELINES = {
    'images.pipelines.ImagesPipeline': 1,
    'scrapy.pipelines.images.ImagesPipeline': 2
}
In the pipelines file, strip the suffix from each URL:
class ImagesPipeline(object):
    def process_item(self, item, spider):
        tmp = item['image_urls']
        item['image_urls'] = []
        for i in tmp:
            if '?' in i:
                item['image_urls'].append(i.split('?')[0])
            else:
                item['image_urls'].append(i)
        return item
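The same stripping step can also be written with the standard library's URL parser instead of splitting on '?'. This is an alternative sketch, not the code used above; `strip_query` is a made-up name, and unlike the split version it also drops any `#fragment`.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return the URL without its query string or fragment (hypothetical helper)."""
    parts = urlsplit(url)
    # Rebuild the URL from scheme, host and path only.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))


# Illustrative URLs only; the sizing parameters a real site appends may differ.
urls = [
    'https://example.com/photos/1/a.jpeg?auto=compress&w=500',
    'https://example.com/photos/2/b.jpeg',
]
cleaned = [strip_query(u) for u in urls]
print(cleaned)
```

Inside process_item this would amount to `item['image_urls'] = [strip_query(u) for u in item['image_urls']]`.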
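The downloaded images are now the large versions. However, the images pipeline still re-encodes images by default (it converts them to JPEG), so only the files pipeline downloads the completely untouched originals, which are very large.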
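For reference, switching to the files pipeline might look roughly like the settings below. This is a sketch under assumptions: it reuses the same storage directory and points the files pipeline at the existing image_urls and images fields, which the original setup does not show.

```python
# settings.py (sketch, not the original configuration)
ITEM_PIPELINES = {
    # The project's own pipeline that strips the URL suffix still runs first.
    'images.pipelines.ImagesPipeline': 1,
    # Files are stored as downloaded, without image re-encoding.
    'scrapy.pipelines.files.FilesPipeline': 2,
}
# Directory where downloaded files are stored (assumed to be the same as before)
FILES_STORE = '/www/crawl'
# Reuse the item fields already defined for the images pipeline
FILES_URLS_FIELD = 'image_urls'
FILES_RESULT_FIELD = 'images'
```

Without the two field settings, the files pipeline would look for fields named file_urls and files instead.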
If you do not want to download the images and only want to store the image URLs in MySQL, see:
http://www.javashuo.com/article/p-ttxpyvpy-q.html
The images pipeline can be configured with a minimum height and width in pixels:
IMAGES_MIN_HEIGHT=800
IMAGES_MIN_WIDTH=600
IMAGES_EXPIRES=90  # days; images downloaded within this period are not downloaded again
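To make the size settings concrete, here is a small sketch of the check as I understand it: an image is skipped when either dimension falls below its minimum. The function `is_large_enough` is illustrative only and is not Scrapy's own code.

```python
IMAGES_MIN_HEIGHT = 800
IMAGES_MIN_WIDTH = 600


def is_large_enough(width, height,
                    min_width=IMAGES_MIN_WIDTH, min_height=IMAGES_MIN_HEIGHT):
    """Return True if an image of the given pixel size would be kept (hypothetical helper)."""
    # Both dimensions must meet their minimum.
    return width >= min_width and height >= min_height


# A few example sizes as (width, height) and the decision for each.
for size in [(600, 800), (1920, 799), (599, 1200), (4000, 6000)]:
    print(size, is_large_enough(*size))
```

Note that with these values a landscape photo of 1920x799 would be dropped, so the two settings may need swapping depending on the orientation you expect.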
Generate thumbnails:
IMAGES_THUMBS={
    'small':(50,50),
    'big':(600,600)
}
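With thumbnails enabled, each image ends up in several places under IMAGES_STORE. The sketch below computes the paths I would expect, assuming Scrapy's default naming (the SHA-1 hash of the image URL, a full/ directory for the full-size image, and thumbs/<name>/ for each thumbnail); this may differ between Scrapy versions, and `expected_paths` is a made-up helper.

```python
import hashlib

IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (600, 600),
}


def expected_paths(url, thumbs=IMAGES_THUMBS):
    """Return the relative paths expected for one image URL (hypothetical helper)."""
    # Assumed default naming: file name is the SHA-1 hex digest of the URL.
    guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    paths = {'full': 'full/%s.jpg' % guid}
    for name in thumbs:
        paths[name] = 'thumbs/%s/%s.jpg' % (name, guid)
    return paths


# Illustrative URL only.
for label, path in expected_paths('https://example.com/photos/1/a.jpeg').items():
    print(label, path)
```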