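To debug a Scrapy project in PyCharm, create a main.py in the root directory of the crawler project and set its run path in PyCharm. You then no longer need to run the code from the command line every time; running main.py directly starts the crawler.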
Enter the following on the command line:
scrapy startproject project_name
Here project_name is the name of the project. For example, my project is named py_scrapyjobbole, and the generated directory is:
Enter the following on the command line:
scrapy genspider jobbole blog.jobbole.com
Here jobbole is the spider name and blog.jobbole.com is the domain where the crawl starts.
# -*- coding: utf-8 -*-
import scrapy


class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/111322/']

    def parse(self, response):
        re_select = response.xpath('//*[@id="post-111322"]/div[1]/h1')
        pass
BOT_NAME = 'py_scrapyjobbole'
SPIDER_MODULES = ['py_scrapyjobbole.spiders']
NEWSPIDER_MODULE = 'py_scrapyjobbole.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'py_scrapyjobbole (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
ROBOTSTXT_OBEY = False
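This must be set to False for breakpoint debugging to work properly. When it is True, Scrapy first checks the site's robots.txt and drops any request that the file disallows, so a breakpoint inside parse may never be reached.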
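To make the effect of robots.txt more concrete, here is a minimal sketch using only Python's standard library. The robots.txt content is hypothetical (the real site's rules may differ), the helper name is_allowed is made up for illustration, and Scrapy uses its own middleware and parser rather than urllib.robotparser, so treat this as an illustration of the idea only:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the real site's rules may differ.
ROBOTS_TXT = """\
User-agent: *
Disallow: /111322/
"""


def is_allowed(url, user_agent="py_scrapyjobbole"):
    """Return True if the hypothetical rules above permit fetching url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# The article page is disallowed by the sample rules, the home page is not.
print(is_allowed("http://blog.jobbole.com/111322/"))
print(is_allowed("http://blog.jobbole.com/"))
```

With rules like these and ROBOTSTXT_OBEY = True, the request for the start URL would be filtered out before any callback runs, which is why the breakpoint would appear to be ignored.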
from scrapy.cmdline import execute
import sys
import os

# Debug the .py files with breakpoints
# sys.path.append(r'D:\PyCharm\py_scrapyjobbole')
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
print(os.path.dirname(os.path.abspath(__file__)))
execute(['scrapy', 'crawl', 'jobbole'])
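The list passed to execute is simply the argument vector of the scrapy command, so anything you would type after scrapy crawl can go into it. A small sketch follows, assuming you want to pass settings overrides (-s NAME=VALUE) or spider arguments (-a NAME=VALUE) while debugging; the helper build_crawl_argv is hypothetical and not part of Scrapy:

```python
def build_crawl_argv(spider_name, settings=None, spider_args=None):
    """Build the argv list for `scrapy crawl` (hypothetical helper)."""
    argv = ['scrapy', 'crawl', spider_name]
    # Each settings override becomes "-s NAME=VALUE".
    for key, value in (settings or {}).items():
        argv += ['-s', f'{key}={value}']
    # Each spider argument becomes "-a NAME=VALUE".
    for key, value in (spider_args or {}).items():
        argv += ['-a', f'{key}={value}']
    return argv


argv = build_crawl_argv('jobbole', settings={'LOG_LEVEL': 'DEBUG'})
print(argv)
```

In main.py the result could then be handed to execute(argv) in place of the hard-coded list.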
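XPath basics
When scraping data with Scrapy you will probably need some XPath, so here is a simple overview in one figure:
The point worth noting here is the difference between "/" and "//":
/: selects child elements; the selected element must be a direct child of its parent
//: selects all descendant elements; the selected element need not be a direct child, any descendant will do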
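The difference between the two can be tried out without running a crawl. The sketch below uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath and expects relative paths starting with "."; the HTML snippet is made up for illustration. In Scrapy, response.xpath accepts full XPath, but "/" and "//" carry the same meaning:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a real page.
doc = ET.fromstring(
    "<body>"
    "<div id='post'>"
    "<h1>Direct title</h1>"
    "<div><h1>Nested title</h1></div>"
    "</div>"
    "</body>"
)

post = doc.find("./div[@id='post']")

# "/" step: only h1 elements that are direct children of the post div.
children = [h1.text for h1 in post.findall("./h1")]

# "//" step: h1 elements at any depth below the post div.
descendants = [h1.text for h1 in post.findall(".//h1")]

print(children)
print(descendants)
```

The first list contains only the heading directly under the post div, while the second also picks up the heading nested one level deeper.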
That said, if you find this difficult, you can also use Chrome's element inspection feature to copy the XPath of an element: