Learning Python: Scrapy Study Notes

Date: 2019-12-19



1. You can choose which pipelines process the data for each spider, though this takes a little code

First we need a decorator. Put it in the pipelines file, outside of any class, because several pipelines need to use it.

import functools
import logging


def check_spider_pipeline(process_item_method):
    """Decorator for a pipeline's process_item method.

    :param process_item_method: the process_item method to wrap
    :return: the wrapped method
    """

    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):
        # message template for debugging; {0} is filled in with the action
        msg = "{0} " + self.__class__.__name__ + " pipeline step"

        # if the class is in the spider's pipeline set, then use the
        # process_item method normally
        if self.__class__ in spider.pipeline:
            logging.info(msg.format("executing"))
            return process_item_method(self, item, spider)

        # otherwise, just return the untouched item (skip this step in
        # the pipeline)
        logging.info(msg.format("skipping"))
        return item

    return wrapper

The decorator checks whether the spider has enabled this pipeline. The key line is:

if self.__class__ in spider.pipeline:

Because of this check, we need to declare the pipelines in the spider:

pipeline = set([
    pipelines.RentMySQLPipeline, ])

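Add this code to the spider class to tie the two pieces together. Once a pipeline uses the decorator, it checks whether the spider has authorized that pipeline to process the item.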

Of course, the pipelines module must be imported into the spider file before it can be used.

With the two linked, apply the decorator like this:

@check_spider_pipeline
def process_item(self, item, spider):
    # ... process the item here ...
    return item

That is all it takes. Put this decorator on the process_item method of every pipeline, then authorize the pipelines you want in each spider.
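The whole mechanism can be tried without running a crawl. The sketch below is only an illustration: UppercasePipeline, RentSpider and OtherSpider are hypothetical stand-ins (plain classes, not real Scrapy classes), and the getattr fallback is an added safeguard for spiders that declare no pipeline set at all.

```python
import functools


def check_spider_pipeline(process_item_method):
    """Run process_item only if the spider lists this pipeline class."""

    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):
        # Assumption: spiders without a `pipeline` attribute enable nothing.
        if self.__class__ in getattr(spider, "pipeline", set()):
            return process_item_method(self, item, spider)
        return item  # skip this step, hand the item on untouched

    return wrapper


class UppercasePipeline:  # hypothetical pipeline
    @check_spider_pipeline
    def process_item(self, item, spider):
        return {key: value.upper() for key, value in item.items()}


class RentSpider:  # stand-in for a scrapy.Spider subclass
    pipeline = {UppercasePipeline}


class OtherSpider:  # stand-in for a spider that enables no pipeline
    pipeline = set()


step = UppercasePipeline()
item = {"title": "flat for rent"}
processed = step.process_item(item, RentSpider())
skipped = step.process_item(item, OtherSpider())
print(processed)
print(skipped)
```

The pipeline still has to be registered in settings; the decorator only decides, per spider, whether a registered pipeline does any work.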

2. With an ORM, we can get started on database operations quickly

ORM stands for object-relational mapping.

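First we need some background knowledge. I use MySQL and SQLAlchemy myself. If you are not familiar with them, see the MySQL tutorial on Runoob (菜鳥教程) and a SQLAlchemy tutorial.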

After the spider has scraped the data, all of it is handed to the pipelines for processing. Pipelines must be registered in settings:

ITEM_PIPELINES = {
    'spider.pipelines.SpiderPipeline': 300,
    'spider.pipelines.SpiderDetailPipeline': 300,
}
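The integer next to each pipeline is its order value: Scrapy passes items through pipelines from the lowest value to the highest, and values are conventionally chosen in the 0-1000 range. The snippet below only imitates that ordering with a plain sort. The value 400 is an assumption made for illustration, so that the two pipelines have a well-defined order.

```python
# Sketch of how the order values are read; 400 is an assumed value.
ITEM_PIPELINES = {
    "spider.pipelines.SpiderDetailPipeline": 400,
    "spider.pipelines.SpiderPipeline": 300,
}

# Lower values run first.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for position, path in enumerate(run_order, start=1):
    print(position, path)
```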

Then we need to create our own database and table in MySQL:

mysql -u root -p
create database xxx;
use xxx;
create table spider(id integer not null, primary key (id));

Once the database and table we need are in place, we create a mapping class for the table in our program:

from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

import settings  # project module, expected to define `engine`

Base = declarative_base()


class Topic(Base):
    __tablename__ = 'topic'

    id = Column(Integer, primary_key=True, unique=True, autoincrement=True)
    topic_title = Column(String(256))
    topic_author = Column(String(256))
    topic_author_img = Column(String(256))
    topic_class = Column(String(256))
    topic_reply_num = Column(Integer)
    spider_time = Column(String(256))

    def __init__(self, topic_title, topic_author, topic_class,
                 topic_reply_num, spider_time, topic_author_img):
        self.topic_title = topic_title
        self.topic_author = topic_author
        self.topic_author_img = topic_author_img
        self.topic_class = topic_class
        self.topic_reply_num = topic_reply_num
        self.spider_time = spider_time


# session factory bound to the engine from settings
DBSession = sessionmaker(bind=settings.engine)

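Base is the base class that every mapped class inherits from. DBSession is a session factory created with sessionmaker; the sessions it produces make it easy to work with the database. Next comes inserting the data. Since this is a crawler, we do not need to worry about deletes or updates. Here is the code: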

import datetime


class TesterhomeSpiderPipeline(object):
    def __init__(self):
        self.session = DBSession()

    @check_spider_pipeline
    def process_item(self, item, spider):
        # each item field is a list; take the first value and escape
        # non-ASCII characters
        my_topic = Topic(
            topic_title=item['topic_title'][0].encode('unicode-escape'),
            topic_author=item['topic_author'][0].encode('unicode-escape'),
            topic_author_img=item['topic_author_img'][0].encode('unicode-escape'),
            topic_class=item['topic_class'][0].encode('unicode-escape'),
            topic_reply_num=item['topic_reply_num'][0].encode('unicode-escape'),
            spider_time=datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
        try:
            self.session.add(my_topic)
            self.session.commit()
        except Exception:
            # undo the failed insert, then let the error propagate
            self.session.rollback()
            raise
        finally:
            self.session.close()
        return item
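The add, commit and rollback pattern can be tried without a MySQL server. The following is a minimal sketch under stated assumptions: an in-memory SQLite database stands in for MySQL, and DemoTopic and save_topic are hypothetical names, a trimmed-down model and helper written for this illustration. It requires SQLAlchemy to be installed.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

# Assumption: in-memory SQLite replaces the MySQL engine from settings.
engine = create_engine("sqlite:///:memory:")
Base = declarative_base()


class DemoTopic(Base):  # hypothetical, trimmed-down version of Topic
    __tablename__ = "demo_topic"

    id = Column(Integer, primary_key=True, autoincrement=True)
    topic_title = Column(String(256))
    topic_reply_num = Column(Integer)


Base.metadata.create_all(engine)  # create the table from the mapping
DBSession = sessionmaker(bind=engine)


def save_topic(title, reply_num):
    """Insert one row; roll back and re-raise if the commit fails."""
    session = DBSession()
    try:
        session.add(DemoTopic(topic_title=title, topic_reply_num=reply_num))
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()


save_topic("hello scrapy", 3)

# Read the row back with a fresh session to confirm it was stored.
session = DBSession()
rows = session.query(DemoTopic).all()
titles = [row.topic_title for row in rows]
session.close()
print(titles)
```

Here create_all builds the table from the mapping class, which can be an alternative to writing the create table statement by hand.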