Installation
yum install libxslt-devel libffi-devel
pip install Scrapy
Create a project
scrapy startproject tutorial    # "tutorial" is the project name
Define an item (equivalent to one row in a database table)
vi tutorial/items.py
import scrapy

class myItem(scrapy.Item):
    title = scrapy.Field()  # equivalent to a column (field) of the table
    link = scrapy.Field()
    desc = scrapy.Field()
Write the spider
import scrapy

# Spider is one of several base classes, each for a different crawling style
class DmozSpider(scrapy.spiders.Spider):
    name = "dmoz"  # required
    allowed_domains = ["dmoz.org"]  # optional
    # required
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    # called with each downloaded page; here it saves the page body to a file
    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
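The file name used in parse comes from plain string splitting, so it can be checked without running a crawl. The sketch below applies the same expression to the two start URLs. The helper name `page_name` is hypothetical and is not part of Scrapy.

```python
def page_name(url):
    # Split on "/" and take the second-to-last piece. For a URL that ends
    # in "/", the last piece is an empty string, so [-2] is the final
    # path segment.
    return url.split("/")[-2]


start_urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]

names = [page_name(u) for u in start_urls]
print(names)  # ['Books', 'Resources']
```

With these start URLs the spider would write two files named Books and Resources in the directory where the crawl is run. The expression relies on the trailing slash: for a URL without one, [-2] picks the parent segment instead.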
Run the crawl
scrapy crawl dmoz
Official Chinese documentation: http://scrapy-chs.readthedocs.org/zh_CN/0.24/ (note: this is not the latest version)
References:
http://www.cnblogs.com/rwxwsblog/p/4572367.html
http://blog.csdn.net/HanTangSongMing/article/details/24454453