Crawling Jobbole (伯樂在線) Articles, Part 4: Saving Crawl Results to MySQL

Item Pipeline

After an Item has been collected in a Spider, it is passed to the Item Pipeline, and the Item Pipeline components process it in the order in which they are defined.

Each Item Pipeline is a Python class implementing a few simple methods, which decide, for example, whether an Item is dropped or stored. Typical applications of an item pipeline include (a minimal skeleton follows the list):

  • Validating scraped data (checking that the item contains certain fields, e.g. a name field)
  • Checking for duplicates (and dropping them)
  • Saving crawl results to a file or a database
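To make this concrete, here is a minimal skeleton of an item pipeline; the class name and validation rule are hypothetical, not part of this project:

from scrapy.exceptions import DropItem

class ValidateTitlePipeline(object):
    def process_item(self, item, spider):
        # raising DropItem discards the item; returning it passes it on
        # to the next pipeline in ITEM_PIPELINES order
        if not item.get("title"):
            raise DropItem("missing title")
        return item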

Writing the Item

Write the following in items.py:

import scrapy

class JobBoleArticleItem(scrapy.Item):
    title = scrapy.Field()
    create_date = scrapy.Field()
    praise_num = scrapy.Field()
    collect_num = scrapy.Field()
    comment_num = scrapy.Field()
    front_image_url = scrapy.Field()

Once the item is written, instantiate it inside the article-extraction logic:

    def parse_detail(self, response):
        # requires `import re` and importing JobBoleArticleItem from the
        # project's items module at the top of the spider file
        print("Currently crawling URL: " + response.url)
        # concrete article-extraction logic
        article_item = JobBoleArticleItem()
        front_image_url = response.meta.get("front_image_url", "")
        # article title
        title = response.css('.entry-header h1::text').extract()[0]
        # publication date
        create_date = response.css('.entry-meta .entry-meta-hide-on-mobile::text').extract()[0].strip().replace("·", "")
        # praise (upvote) count
        praise_num = response.css('.vote-post-up h10::text').extract()[0]
        # bookmark (collect) count
        collect_num = response.css('.post-adds .bookmark-btn::text').extract()[0].split(" ")[1]
        collect_match_re = re.match(r'.*?(\d+).*', collect_num)
        if collect_match_re:
            collect_num = int(collect_match_re.group(1))
        else:
            collect_num = 0
        # comment count
        comment_num = response.css('.post-adds .hide-on-480::text').extract()[0]
        comment_match_re = re.match(r'.*?(\d+).*', comment_num)
        if comment_match_re:
            comment_num = int(comment_match_re.group(1))
        else:
            comment_num = 0

        # article body (extracted here but not yet stored on the item)
        content = response.css('div.entry').extract()[0]

        article_item["title"] = title
        article_item["create_date"] = create_date
        article_item["praise_num"] = praise_num
        article_item["collect_num"] = collect_num
        article_item["comment_num"] = comment_num
        article_item["front_image_url"] = front_image_url

        yield article_item
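For context, parse_detail reads front_image_url out of response.meta; the listing-page parse from the earlier posts in this series passes it in roughly like this (the CSS selectors and imports here are assumptions recalled from those posts, not part of this section):

# at the top of the spider module
from urllib import parse
from scrapy.http import Request

# inside the spider class, alongside parse_detail
def parse(self, response):
    # each post node on the listing page holds a cover image and a detail URL
    post_nodes = response.css('#archive .floated-thumb .post-thumb a')
    for post_node in post_nodes:
        image_url = post_node.css('img::attr(src)').extract_first("")
        post_url = post_node.css('::attr(href)').extract_first("")
        # hand the cover-image URL to parse_detail through meta
        yield Request(url=parse.urljoin(response.url, post_url),
                      meta={"front_image_url": image_url},
                      callback=self.parse_detail)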

 

Finally, once yield article_item is called, article_item is passed on to pipelines.py.

Writing the Pipeline

A template is already written in pipelines.py, but for it to take effect you must edit settings.py and uncomment the ITEM_PIPELINES setting.

Set a breakpoint in pipelines.py and debug to check whether article_item actually arrives, as in the sketch below.
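A sketch of the generated template with a place to break (the class name EnterprisespiderPipeline matches the ITEM_PIPELINES entry used later; the body is just the template's default):

class EnterprisespiderPipeline(object):
    def process_item(self, item, spider):
        # set a breakpoint on the next line; when the spider yields an
        # item, execution should stop here with the full article data
        return item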

Saving the Images Locally

Continuing on, Scrapy provides some ready-made facilities that speed up development. Modify settings.py:

# settings.py
import os

ITEM_PIPELINES = {
    'EnterpriseSpider.pipelines.EnterprisespiderPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_URLS_FIELD = "front_image_url"
project_dir = os.path.abspath(os.path.dirname(__file__))
IMAGES_STORE = os.path.join(project_dir, "images")

 

  • 'scrapy.pipelines.images.ImagesPipeline': 1 enables Scrapy's built-in image-saving pipeline. The number after the colon is the order in which items flow through the pipelines; smaller numbers run first.
  • IMAGES_URLS_FIELD = "front_image_url" tells the image pipeline which item field holds the image URLs; the setting name IMAGES_URLS_FIELD is fixed by Scrapy.
  • project_dir = os.path.abspath(os.path.dirname(__file__)) gets the current project's path.
  • IMAGES_STORE = os.path.join(project_dir, "images") sets the directory where downloaded images are stored.

Now run our main debug script (a sketch follows) and see whether the images get saved.
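This series drives the spider from a main script so breakpoints work in the IDE. A minimal sketch, assuming the spider is named jobbole (the name comes from the earlier posts, not from this section):

# main.py, at the project root
import os
import sys

from scrapy.cmdline import execute

# make the project importable, then launch the spider exactly as
# "scrapy crawl jobbole" would on the command line
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
execute(["scrapy", "crawl", "jobbole"])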

It fails: the PIL module is missing. PIL is the imaging library that the image pipeline depends on; since we haven't installed it, we get this error. Install it in the virtual environment (Pillow is the maintained fork of PIL):

(scrapyenv) E:\Python\Envs>pip install -i https://pypi.douban.com/simple pillow

 

After installing it, rerun the program; now a different error appears.

This happens because when the item reaches the pipeline, the field named by IMAGES_URLS_FIELD (front_image_url, below) is expected to be a list, but our business logic assigned it a single value:

IMAGES_URLS_FIELD = "front_image_url"

Modify the business logic:

        article_item["title"] = title
        article_item["create_date"] =create_date
        article_item["praise_num"] = praise_num
        article_item["collect_num"] = collect_num
        article_item["comment_num"] = comment_num
        article_item["front_image_url"] = [front_image_url] yield article_item

After this change, rerun the program; the crawled images are now saved under the images folder.

 

Now that the images are saved locally, can we extract the saved path and bind the item's front_image_url to that local path? To do that we define our own pipeline and override the item_completed method of ImagesPipeline.

 

from scrapy.pipelines.images import ImagesPipeline

class ArticleImagePipeline(ImagesPipeline):
    def item_completed(self, results, item, info):
        pass

Then edit settings.py again so that it points at our custom image pipeline:

ITEM_PIPELINES = {
    'EnterpriseSpider.pipelines.EnterprisespiderPipeline': 300,
   # 'scrapy.pipelines.images.ImagesPipeline': 1,
    'EnterpriseSpider.pipelines.ArticleImagePipeline': 1,
}

Set a breakpoint and step through it.

Then flesh out the item_completed method:

class ArticleImagePipeline(ImagesPipeline):
    def item_completed(self, results, item, info):
        # results is a list of (success, info) tuples; info["path"] holds
        # the saved file's path relative to IMAGES_STORE
        for ok, value in results:
            image_file_path = value["path"]
        item["front_image_url"] = image_file_path
        return item
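For orientation, results for one successfully downloaded image has roughly this shape (the keys url, path, and checksum come from Scrapy's image pipeline; the values here are made up):

# illustrative sample of the `results` argument for a single image
results = [
    (True, {
        "url": "http://example.com/cover.jpg",           # original image URL
        "path": "full/0a79c461d1f0.jpg",                  # saved path, relative to IMAGES_STORE
        "checksum": "b9628c4ab9b595f72f280b90c4fd093d",   # md5 of the downloaded file
    }),
]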

 

Saving to JSON

import codecs
import json

class JsonWithEncodingPipeline(object):
    def __init__(self):
        self.file = codecs.open('article.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        lines = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(lines)
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider on pipelines when the spider finishes;
        # a method named spider_closed would never be invoked here
        self.file.close()
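As with every pipeline, JsonWithEncodingPipeline only runs once it is registered. A sketch of the registration, keeping the custom image pipeline from above (the priority numbers are a choice, not a requirement):

ITEM_PIPELINES = {
    'EnterpriseSpider.pipelines.ArticleImagePipeline': 1,
    'EnterpriseSpider.pipelines.JsonWithEncodingPipeline': 2,
}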

Saving to MySQL

Synchronous saving

import MySQLdb

class MysqlPipeline(object):
    def __init__(self):
        self.conn = MySQLdb.connect('127.0.0.1', 'root', '123456', 'article', charset='utf8', use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # note: url and url_object_id must also be defined as Fields on
        # JobBoleArticleItem and filled in parse_detail
        insert_sql = '''
        insert into jobbole(title, create_date, front_image_url, praise_num, collect_num, comment_num, url, url_object_id)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
        '''
        self.cursor.execute(insert_sql, (item['title'], item['create_date'], item["front_image_url"],
                                         item["praise_num"], item["collect_num"], item["comment_num"],
                                         item["url"], item["url_object_id"]))
        self.conn.commit()
        return item
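For reference, here is a one-off helper that creates the jobbole table this insert assumes. The column list is taken from the INSERT statement; the types and lengths are my assumptions (create_date is kept as a string because parse_detail stores the raw text from the page):

import MySQLdb

# one-off setup script; run once, outside Scrapy
conn = MySQLdb.connect('127.0.0.1', 'root', '123456', 'article', charset='utf8')
cursor = conn.cursor()
cursor.execute('''
    CREATE TABLE IF NOT EXISTS jobbole (
        url_object_id VARCHAR(50) NOT NULL PRIMARY KEY,
        title VARCHAR(200) NOT NULL,
        create_date VARCHAR(30),
        url VARCHAR(300),
        front_image_url VARCHAR(300),
        praise_num INT DEFAULT 0,
        collect_num INT DEFAULT 0,
        comment_num INT DEFAULT 0
    )
''')
conn.commit()
conn.close()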

Asynchronous saving

Scrapy itself is asynchronous, so as the crawl volume grows, the blocking commit() of a synchronous pipeline becomes a bottleneck. Twisted's adbapi connection pool lets us turn the insert into an asynchronous operation:

import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi

class MysqlTwistedPipeline(object):

    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparams = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            password=settings['MYSQL_PASSWORD'],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool("MySQLdb", **dbparams)
        return cls(dbpool)

    def process_item(self, item, spider):
        # use twisted to turn the mysql insert into an asynchronous one
        query = self.dbpool.runInteraction(self.db_insert, item)
        query.addErrback(self.handle_error, item, spider)
        return item

    def handle_error(self, failure, item, spider):
        # handle exceptions raised by the asynchronous insert
        print(failure)

    def db_insert(self, cursor, item):
        insert_sql = '''
        insert into jobbole(title, create_date, front_image_url, praise_num, collect_num, comment_num, url, url_object_id)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
        '''
        cursor.execute(insert_sql, (item['title'], item['create_date'], item["front_image_url"],
                                    item["praise_num"], item["collect_num"], item["comment_num"],
                                    item["url"], item["url_object_id"]))