Scrapy is a simple, easy-to-use crawler framework written in Python. For details, see the official site: http://scrapy.org/

A while ago I wanted to grab some images to build a photomosaic (see the photomosaic algorithm post), but I couldn't find a crawler I liked, so I rolled my own. Crawling images with scrapy turns out to be very simple, because it even ships with a built-in image-downloading pipeline, ImagesPipeline. A few lines of code are enough for a working image crawler.

Using scrapy feels a bit like Ruby: you create a project, the framework scaffolding is generated for you, and you just fill in your own code in the corresponding files.
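For reference, creating a project scaffolds roughly the layout below (a sketch; the exact file set varies a little between scrapy versions):

```
$ scrapy startproject image_downloader
image_downloader/
    scrapy.cfg
    image_downloader/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
```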
Add the crawling code in the spider file:
```python
# Imports added for completeness (scrapy 0.x-era module paths).
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from image_downloader.items import ImageDownloaderItem


class ImageDownloaderSpider(CrawlSpider):
    name = "image_downloader"
    allowed_domains = ["sina.com.cn"]
    start_urls = [
        "http://www.sina.com.cn/"
    ]
    # Follow every link on the page and hand it to parse_item.
    rules = [Rule(SgmlLinkExtractor(allow=[]), 'parse_item')]

    def parse_item(self, response):
        self.log('page: %s' % response.url)
        hxs = HtmlXPathSelector(response)
        # Collect the src attribute of every <img> tag on the page.
        images = hxs.select('//img/@src').extract()
        items = []
        for image in images:
            item = ImageDownloaderItem()
            item['image_urls'] = [image]
            items.append(item)
        return items
```
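The heart of the spider is that one XPath query, `//img/@src`. As a hedged illustration of the same extraction step with nothing but the standard library (html.parser stands in for HtmlXPathSelector; `ImgSrcParser` is our own name, not a scrapy class):

```python
# Stdlib sketch of what the spider's '//img/@src' XPath does:
# collect the src attribute of every <img> tag in a page.
from html.parser import HTMLParser


class ImgSrcParser(HTMLParser):
    """Collects the src attribute of every <img> start tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


parser = ImgSrcParser()
parser.feed('<html><body><img src="/a.png"><img src="/b.jpg"></body></html>')
print(parser.sources)  # → ['/a.png', '/b.jpg']
```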
Add the fields in items.py:
```python
from scrapy.item import Item, Field


class ImageDownloaderItem(Item):
    image_urls = Field()  # URLs for ImagesPipeline to download
    images = Field()      # populated by the pipeline with download results
```
Filter and save the images in pipelines.py:
```python
# Imports added for completeness (scrapy 0.x-era module paths).
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request


class ImageDownloaderPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Schedule one download request per image URL in the item.
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        # Keep only the successfully downloaded images.
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item
```
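In `item_completed`, `results` arrives as a list of `(success, info)` pairs, one per requested URL; on success the info dict carries the stored file `path`. A hedged, scrapy-free sketch of that filtering logic (here `DropItem` is a plain stand-in for `scrapy.exceptions.DropItem`):

```python
# Plain-Python sketch of the item_completed filtering step.
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""


def item_completed(results, item):
    # results: list of (ok, info) pairs; keep paths of successes only.
    image_paths = [x['path'] for ok, x in results if ok]
    if not image_paths:
        raise DropItem("Item contains no images")
    return item


# One download succeeded, one failed: the item survives.
results = [(True, {'path': 'full/a.jpg'}), (False, Exception('timeout'))]
print(item_completed(results, {'image_urls': ['http://example.com/a.jpg']}))
```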
Add the pipeline and image-filter settings in settings.py:
```python
IMAGES_MIN_HEIGHT = 50
IMAGES_MIN_WIDTH = 50
IMAGES_STORE = 'image-downloaded/'
DOWNLOAD_TIMEOUT = 1200
ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline',
                  'image_downloader.pipelines.ImageDownloaderPipeline']
```
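The `IMAGES_MIN_WIDTH` / `IMAGES_MIN_HEIGHT` settings make the pipeline silently drop tiny images such as icons and spacers. A hedged sketch of that size check (the function name is ours, not scrapy's):

```python
# Sketch of the minimum-size filter ImagesPipeline applies when
# IMAGES_MIN_WIDTH / IMAGES_MIN_HEIGHT are set.
IMAGES_MIN_WIDTH = 50
IMAGES_MIN_HEIGHT = 50


def passes_size_filter(width, height,
                       min_width=IMAGES_MIN_WIDTH,
                       min_height=IMAGES_MIN_HEIGHT):
    """True if the image is at least min_width x min_height pixels."""
    return width >= min_width and height >= min_height


print(passes_size_filter(200, 120))  # → True
print(passes_size_filter(16, 16))    # → False (favicon-sized, dropped)
```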
Code download: @github
Scrapy's elegant data flow: