# Import the GPG key
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 627220E7
# Add the package repository
echo 'deb http://archive.scrapy.org/ubuntu scrapy main' | sudo tee /etc/apt/sources.list.d/scrapy.list
# Update the package list and install Scrapy
sudo apt-get update && sudo apt-get install scrapy-0.22
Once Scrapy is up and running, all you need to do is override the download handling.
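I read "override a download" as plugging custom download behaviour into Scrapy; the usual hook for that is a downloader middleware. The snippet below is only a minimal sketch of that idea under my own naming (MyDownloaderMiddleware, and putting it in a middlewares.py module, are assumptions, not something from the tutorial):

# middlewares.py -- minimal downloader-middleware sketch (hypothetical module/class name)
class MyDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # Called for every outgoing request before it is downloaded.
        # Returning None lets Scrapy download it normally; returning a
        # Response object instead would bypass the downloader entirely.
        request.headers.setdefault('User-Agent', 'my-crawler/0.1')
        return None

# settings.py -- register the middleware; the number sets its order
DOWNLOADER_MIDDLEWARES = {
    'tutorial.middlewares.MyDownloaderMiddleware': 543,
}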
Here is someone else's example that scrapes listings from a job site; its basic structure is worth referring to. But when I ran it, I hit a lot of errors: the misc directory was missing, and the pipelines configuration was never written out.
Bear with me for a moment; once I have tried it out I will share the whole thing with you.
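In the meantime, for anyone stuck on the pipelines configuration mentioned above: it normally lives in two places, a pipeline class in pipelines.py and an ITEM_PIPELINES entry in settings.py. This is only a rough sketch of the usual shape, not the configuration that example actually left out:

# pipelines.py -- a do-nothing pipeline, just to show the shape
class TutorialPipeline(object):
    def process_item(self, item, spider):
        # Clean, validate or store the item here; return it to pass it on.
        return item

# settings.py -- enable the pipeline; newer Scrapy takes a dict of
# path -> order, while older releases also accept a plain list of paths
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,
}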
Following the official getting-started tutorial, here is a quick start:
First, create a project from the template.
scrapy startproject tutorial
The directories and files inside look like this:
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
            ...
Open items.py and change it to this:
import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
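The spider below only saves raw HTML, so it never touches this item yet. Purely as a sketch of how the Item class is meant to be used, a parse callback would fill one in and yield it to the pipelines, roughly like this (the field values here are made up):

def parse(self, response):
    item = DmozItem()
    item['title'] = 'some page title'
    item['link'] = response.url
    item['desc'] = 'some description'
    yield item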
Then edit dmoz_spider.py (under the spiders directory) so it looks like this:
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
Then run this:
scrapy crawl dmoz
But it threw an error!
I found another example to consult and changed scrapy.Spider to scrapy.spider.Spider, like so:
class DmozSpider(scrapy.spider.Spider):
With that, it basically runs.
Presumably the version changed and the Spider class was placed under the spider namespace, but the example was never updated, which is a bit of a trap!
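If you want the same spider file to run on both old and new Scrapy versions, one workaround of my own (not from the tutorial) is to try the new import path first and fall back to the old one:

try:
    from scrapy import Spider            # newer Scrapy, as the current tutorial assumes
except ImportError:
    from scrapy.spider import Spider     # older releases such as 0.22

class DmozSpider(Spider):
    name = "dmoz"
    # ... rest of the spider unchanged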
The full tutorial is here: http://scrapy-chs.readthedocs.org/zh_CN/latest/intro/tutorial.html