Crawling all images on a site with scrapy

scrapy is a simple, easy-to-use crawler framework written in Python. See the official site for details: http://scrapy.org/

A while ago I wanted to grab some images to build a photo mosaic (see the photo mosaic algorithm post), but I couldn't find a crawler I liked, so I DIY'd one myself. Crawling with scrapy turns out to be very simple, because it even ships with a built-in image-downloading pipeline, ImagesPipeline. A few lines of code are enough for a complete image crawler.

Using scrapy is a bit like Ruby (on Rails): create a project and the framework skeleton is already there; you only need to fill in your own code in the corresponding files, as sketched below.
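For reference, a project is scaffolded with scrapy's startproject command (here assuming the project is named image_downloader, to match the module paths used below; the layout generated by older scrapy versions looks roughly like this):

scrapy startproject image_downloader

# image_downloader/
#     scrapy.cfg
#     image_downloader/
#         __init__.py
#         items.py
#         pipelines.py
#         settings.py
#         spiders/
#             __init__.py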

Add the crawling code to the spider file:

# scrapy 0.x-style import paths, matching the settings used later in this post
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from image_downloader.items import ImageDownloaderItem


class ImageDownloaderSpider(CrawlSpider):
    name = "image_downloader"
    allowed_domains = ["sina.com.cn"]
    start_urls = [
        "http://www.sina.com.cn/"
    ]
    # an empty allow list means: follow every link within allowed_domains
    rules = [Rule(SgmlLinkExtractor(allow=[]), 'parse_item')]

    def parse_item(self, response):
        self.log('page: %s' % response.url)
        hxs = HtmlXPathSelector(response)
        # collect the src of every <img> on the page
        images = hxs.select('//img/@src').extract()
        items = []
        for image in images:
            item = ImageDownloaderItem()
            item['image_urls'] = [image]
            items.append(item)
        return items

Add the fields to the item:

from scrapy.item import Item, Field


class ImageDownloaderItem(Item):
    image_urls = Field()  # filled in by the spider
    images = Field()      # populated by ImagesPipeline after download

Filter and save the images in pipelines:

from scrapy.http import Request
from scrapy.exceptions import DropItem
from scrapy.contrib.pipeline.images import ImagesPipeline


class ImageDownloaderPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # schedule one download request per collected image URL
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        # results is a list of (success, info) tuples; keep only successful downloads
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item
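If you also want to keep the stored file paths on the item, a common variation (a sketch only: it assumes an extra image_paths Field is added to ImageDownloaderItem) replaces item_completed with:

    def item_completed(self, results, item, info):
        # same pipeline class as above, only this method changes
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths  # assumes image_paths = Field() in items.py
        return item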

Add the project and image-filtering settings to the settings file:

# skip images smaller than 50x50 pixels
IMAGES_MIN_HEIGHT = 50
IMAGES_MIN_WIDTH = 50
# downloaded files are stored under this directory
IMAGES_STORE = 'image-downloaded/'
DOWNLOAD_TIMEOUT = 1200
ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline',
                  'image_downloader.pipelines.ImageDownloaderPipeline']
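With the settings in place, the crawl is started from the project directory (assuming the layout sketched earlier); ImagesPipeline stores each downloaded image under IMAGES_STORE in a full/ subdirectory, named by the SHA1 hash of its URL:

scrapy crawl image_downloader
# images end up under image-downloaded/full/<sha1-of-url>.jpg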

Code download: @github

scrapy's elegant data flow:

(figure: scrapy_architecture — the scrapy architecture diagram)