After Items have been collected in a Spider, they are passed to the Item Pipeline, whose components process them in the order in which they are defined.
Each Item Pipeline is a Python class that implements a few simple methods, for example deciding whether an Item should be dropped or stored. Typical uses of an item pipeline include cleansing scraped data, validating it, checking for duplicates, and storing items in a database.
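As a minimal sketch (not part of this project's code), a pipeline only needs a process_item method: returning the item passes it on to the next pipeline, while raising DropItem discards it.

from scrapy.exceptions import DropItem

class SketchPipeline(object):
    def process_item(self, item, spider):
        # Drop items that are missing a title; otherwise pass them downstream
        if not item.get("title"):
            raise DropItem("missing title")
        return item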
Start by defining the Item in items.py:
class JobBoleArticleItem(scrapy.Item):
    title = scrapy.Field()
    create_date = scrapy.Field()
    praise_num = scrapy.Field()
    collect_num = scrapy.Field()
    comment_num = scrapy.Field()
    front_image_url = scrapy.Field()
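Note that the MySQL pipelines later in this post also read item["url"] and item["url_object_id"], so the full project's Item presumably declares two more fields in the same way:

    url = scrapy.Field()
    url_object_id = scrapy.Field()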
After defining it, instantiate the Item inside the article-extraction logic:
# requires "import re" at the top of the spider module
def parse_detail(self, response):
    print("Currently crawling URL: " + response.url)
    # Extraction logic for the article details
    article_item = JobBoleArticleItem()
    front_image_url = response.meta.get("front_image_url", "")
    # Get the article title
    title = response.css('.entry-header h1::text').extract()[0]
    # Get the publish date
    create_date = response.css('.entry-meta .entry-meta-hide-on-mobile::text').extract()[0].strip().replace("·", "")
    # Get the number of up-votes
    praise_num = response.css('.vote-post-up h10::text').extract()[0]
    # Get the number of bookmarks
    collect_num = response.css('.post-adds .bookmark-btn::text').extract()[0].split(" ")[1]
    collect_match_re = re.match(r'.*?(\d+).*', collect_num)
    if collect_match_re:
        collect_num = int(collect_match_re.group(1))
    else:
        collect_num = 0
    # Get the number of comments
    comment_num = response.css('.post-adds .hide-on-480::text').extract()[0]
    comment_match_re = re.match(r'.*?(\d+).*', comment_num)
    if comment_match_re:
        comment_num = int(comment_match_re.group(1))
    else:
        comment_num = 0
    content = response.css('div.entry').extract()[0]
    article_item["title"] = title
    article_item["create_date"] = create_date
    article_item["praise_num"] = praise_num
    article_item["collect_num"] = collect_num
    article_item["comment_num"] = comment_num
    article_item["front_image_url"] = front_image_url
    yield article_item
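The front_image_url read from response.meta has to be put there by the listing-page callback. A sketch of how that handoff usually looks in this kind of spider (the parse method and selectors here are assumptions, not this post's code):

def parse(self, response):
    # For each article node, pass the cover-image URL along to parse_detail via meta
    for node in response.css('#archive .floated-thumb .post-thumb a'):
        image_url = node.css('img::attr(src)').extract_first("")
        post_url = node.css('::attr(href)').extract_first("")
        yield scrapy.Request(url=response.urljoin(post_url),
                             meta={"front_image_url": image_url},
                             callback=self.parse_detail)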
Finally, once yield article_item is executed, article_item is passed on to pipelines.py.
A pipeline template has already been generated in pipelines.py, but for it to take effect you have to edit settings.py and uncomment ITEM_PIPELINES, as shown below.
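Uncommented, the generated setting looks like this (300 is the pipeline's order number):

ITEM_PIPELINES = {
    'EnterpriseSpider.pipelines.EnterprisespiderPipeline': 300,
}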
Set a breakpoint in pipelines.py and debug to check that article_item actually arrives.
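The generated template is just a pass-through, so a breakpoint on its return line is enough to inspect the incoming item:

class EnterprisespiderPipeline(object):
    def process_item(self, item, spider):
        return item  # break here to inspect article_item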
Next, keep improving the item handling. Scrapy ships with some built-in pipelines that speed up development; modify settings.py:
import os  # needed for building the storage path

ITEM_PIPELINES = {
    'EnterpriseSpider.pipelines.EnterprisespiderPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_URLS_FIELD = "front_image_url"
project_dir = os.path.abspath(os.path.dirname(__file__))
IMAGES_STORE = os.path.join(project_dir, "images")
'scrapy.pipelines.images.ImagesPipeline': 1 — enables Scrapy's built-in image-saving pipeline; the number is the order in which items flow through the pipelines, and lower numbers run first.
IMAGES_URLS_FIELD = "front_image_url" — tells the ImagesPipeline which item field holds the image URLs; the setting name IMAGES_URLS_FIELD itself is fixed.
project_dir = os.path.abspath(os.path.dirname(__file__)) — gets the path of the current project.
IMAGES_STORE = os.path.join(project_dir, "images") — sets the directory where images are stored.
Now run our main entry point and see whether the images get saved.
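For reference, the debug entry point in this kind of project is usually a small main.py like the following; the spider name "jobbole" is an assumption here, not something stated above:

# main.py -- run the spider from inside the IDE so breakpoints work
import os
import sys
from scrapy.cmdline import execute

sys.path.append(os.path.dirname(os.path.abspath(__file__)))
execute(["scrapy", "crawl", "jobbole"])  # hypothetical spider name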
It fails: the PIL module is missing. PIL is the imaging library that image handling depends on, and since we never installed it the run errors out. Install it into the virtual environment (Pillow is the maintained PIL fork):
(scrapyenv) E:\Python\Envs>pip install -i https://pypi.douban.com/simple pillow
After installing it, rerun the program; this time a different error appears.
This happens because, by the time the item reaches the pipeline, front_image_url is expected to be a list (the ImagesPipeline iterates over it, typically failing with something like ValueError: Missing scheme in request url when handed a plain string), but our extraction logic assigned it a single string value.
IMAGES_URLS_FIELD = "front_image_url"
Fix the extraction logic by wrapping the URL in a list:
article_item["title"] = title article_item["create_date"] =create_date article_item["praise_num"] = praise_num article_item["collect_num"] = collect_num article_item["comment_num"] = comment_num article_item["front_image_url"] = [front_image_url] yield article_item
After this change, run the program again: the crawled images are now saved under the images folder (the ImagesPipeline stores them as images/full/<sha1-of-url>.jpg).
Now that the images are saved locally, can we also get hold of their paths and bind front_image_url in the item to the local file path? For that we define our own pipeline and override the item_completed method of ImagesPipeline.
from scrapy.pipelines.images import ImagesPipeline

class ArticleImagePipeline(ImagesPipeline):
    def item_completed(self, results, item, info):
        pass
Then edit settings.py again so that the image handling goes through our custom pipeline:
ITEM_PIPELINES = {
    'EnterpriseSpider.pipelines.EnterprisespiderPipeline': 300,
    # 'scrapy.pipelines.images.ImagesPipeline': 1,
    'EnterpriseSpider.pipelines.ArticleImagePipeline': 1,
}
Set a breakpoint, debug to see what results contains, and then implement the item_completed method:
class ArticleImagePipeline(ImagesPipeline):
    def item_completed(self, results, item, info):
        for ok, value in results:
            image_file_path = value["path"]  # each value dict holds the saved file's relative path
        item["front_image_url"] = image_file_path
        return item
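For orientation while debugging: results is a list of two-element tuples, one per requested image, where the first element says whether the download succeeded and the second is a dict with the saved file's details. Roughly (values illustrative):

# results ≈ [(True, {'url': 'http://…/cover.jpg',
#                     'path': 'full/0a79c461….jpg',
#                     'checksum': '2b00042f7481c7b056c4b410d28f33cf'})]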
Saving to JSON works the same way; note the imports it needs:

import codecs
import json

class JsonWithEncodingPipeline(object):
    def __init__(self):
        self.file = codecs.open('article.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        lines = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(lines)
        return item

    def spider_closed(self, spider):
        self.file.close()
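Each of these pipelines only runs if it is registered in ITEM_PIPELINES; the order numbers below are an assumption about a sensible priority, not values from the original settings:

ITEM_PIPELINES = {
    'EnterpriseSpider.pipelines.ArticleImagePipeline': 1,
    'EnterpriseSpider.pipelines.JsonWithEncodingPipeline': 2,
}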
Synchronous saving to MySQL:
import MySQLdb

class MysqlPipeline(object):
    def __init__(self):
        self.conn = MySQLdb.connect('127.0.0.1', 'root', '123456', 'article',
                                    charset='utf8', use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        insert_sql = '''
            insert into jobbole(title, create_date, front_image_url, praise_num,
                                collect_num, comment_num, url, url_object_id)
            values (%s, %s, %s, %s, %s, %s, %s, %s)
        '''
        self.cursor.execute(insert_sql, (item['title'], item['create_date'],
                                         item["front_image_url"], item["praise_num"],
                                         item["collect_num"], item["comment_num"],
                                         item["url"], item["url_object_id"]))
        self.conn.commit()
        return item  # pass the item on to any later pipelines
Asynchronous saving to MySQL: the synchronous commit() above blocks the spider while each insert completes, so we hand the inserts to Twisted's adbapi connection pool instead:
import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi

class MysqlTwistedPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparams = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            password=settings['MYSQL_PASSWORD'],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool("MySQLdb", **dbparams)
        return cls(dbpool)

    def process_item(self, item, spider):
        # Use Twisted to turn the blocking MySQL insert into an asynchronous one
        query = self.dbpool.runInteraction(self.db_insert, item)
        query.addErrback(self.handler_error, item, spider)
        return item

    def handler_error(self, failure, item, spider):
        # Handle exceptions raised by the asynchronous insert
        print(failure)

    def db_insert(self, cursor, item):
        insert_sql = '''
            insert into jobbole(title, create_date, front_image_url, praise_num,
                                collect_num, comment_num, url, url_object_id)
            values (%s, %s, %s, %s, %s, %s, %s, %s)
        '''
        cursor.execute(insert_sql, (item['title'], item['create_date'],
                                    item["front_image_url"], item["praise_num"],
                                    item["collect_num"], item["comment_num"],
                                    item["url"], item["url_object_id"]))
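from_settings() reads the connection parameters out of settings.py, so they have to be defined there; going by the hard-coded values in the synchronous MysqlPipeline above, they would look like this:

# settings.py -- values inferred from the synchronous pipeline above
MYSQL_HOST = "127.0.0.1"
MYSQL_DBNAME = "article"
MYSQL_USER = "root"
MYSQL_PASSWORD = "123456"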