1. First, activate the virtual environment.
2. Install from the Douban mirror in China, which is fast:
pip install -i https://pypi.douban.com/simple/ scrapy
3. Special case: if the install fails because a C++ compiler is missing, install one. In my case I installed the VS2015 C++ tools.
scrapy --help
Available commands:
  bench         Run quick benchmark test
  commands
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory
I'll come back to these commands when they are needed.
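As a quick sanity check after the steps above, the following minimal sketch reports whether the current interpreter is running inside a virtual environment and whether it can find the scrapy package. The helper names `in_virtualenv` and `scrapy_installed` are made up for illustration; only the standard library is used, so it runs whether or not the install succeeded.

```python
import importlib.util
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a virtual environment."""
    # venv and recent virtualenv leave sys.base_prefix pointing at the base
    # installation; older virtualenv releases set sys.real_prefix instead.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base_prefix


def scrapy_installed():
    """Return True when this interpreter can locate the scrapy package."""
    return importlib.util.find_spec("scrapy") is not None


print("virtual environment active:", in_virtualenv())
print("scrapy importable:", scrapy_installed())
```

If the second line prints False, pip most likely installed into a different interpreter than the one in the activated environment.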
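The project can only be created from the command line here: unlike Django, Scrapy is not integrated into PyCharm's project creation.
Command:
# Note: cd into the directory where the project should be created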
scrapy startproject projectname
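By default the new project contains no spider; one still has to be generated with a separate command.
Directory tree (main was added by me later):
Just as you create an app inside a Django project, here you create a spider.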
Command:
# Note: this must be run inside the project directory
# Create a spider named jobbole whose root domain is blog.jobbole.com; it crawls Jobbole (伯樂在線)
scrapy genspider jobbole blog.jobbole.com
Generated file:
# -*- coding: utf-8 -*-
import scrapy


class JobboleSpider(scrapy.Spider):
    # Spider name
    name = "jobbole"
    # Domains the spider is allowed to crawl
    allowed_domains = ["blog.jobbole.com"]
    # URLs the crawl starts from
    start_urls = ['http://blog.jobbole.com']

    # Parse callback
    def parse(self, response):
        # Use XPath to parse the response and extract data
        # //*[@id="post-110769"]/div[1]/h1
        re_selector = response.xpath('//*[@id="post-110769"]/div[1]/h1/text()')
        re2_selector = response.xpath('/html/body/div[3]/div[1]/h1/text()')
        re3_selector = response.xpath('//div[@class="entry-header"]/h1/text()')

        pass
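The three XPath expressions in parse are meant to reach the same h1 node by id, by absolute position, and by class. The sketch below illustrates that idea on a hypothetical, heavily simplified page, using the standard library's xml.etree.ElementTree rather than Scrapy's selectors so that it runs without a crawl. The markup and the title text are assumptions, not the real blog.jobbole.com page. ElementTree supports only a subset of XPath, so the paths are written relative to the root element and `.text` stands in for `/text()`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page; the real markup is much larger.
PAGE = """
<html>
  <body>
    <div id="nav"/>
    <div id="banner"/>
    <div id="post-110769">
      <div class="entry-header">
        <h1>Sample article title</h1>
      </div>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# Same three strategies as in the spider, adapted to ElementTree's XPath subset.
by_id = root.find(".//*[@id='post-110769']/div[1]/h1")     # anchored on the post id
by_position = root.find("./body/div[3]/div[1]/h1")         # absolute position in the tree
by_class = root.find(".//div[@class='entry-header']/h1")   # anchored on the class name

titles = [node.text for node in (by_id, by_position, by_class)]
print(titles)
```

Of the three, the class-based path is usually the one worth keeping: the id is tied to a single article and the positional path breaks as soon as the page layout shifts, while the class name is more likely to be shared by every article page.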
At this point, the spider project has been created.