1. Dependencies: virtualenv and virtualenvwrapper (the latter makes virtual environments easier to manage and use).
Installation: pip install virtualenv virtualenvwrapper, or install from source packages.
Common commands:
mkvirtualenv --python=/usr/local/python3.5.3/bin/python article_spider (if several Python versions are installed you can specify one; this creates the virtual environment article_spider)
workon: list all existing virtual environments
workon <env name>: activate that environment
Exit the virtual environment: deactivate
Delete a virtual environment: rmvirtualenv article_spider
2. Install the dependencies and the Scrapy framework: pip install scrapy (using the Douban mirror is much faster: pip install -i https://pypi.douban.com/simple scrapy)
On Windows you also need: pip install pypiwin32
Note: if installation fails, it may be a version mismatch; prebuilt packages for the matching version are available at https://www.lfd.uci.edu/~gohlke/pythonlibs/
3. Create a new Scrapy project (templates can be customized; the default one is used here):
scrapy startproject article_spider
Open the project in PyCharm; the structure is similar to Django's, and all crawlers live in the spiders folder.
Create a spider file: cd article_spider (enter the project directory), then:
scrapy genspider --list (show the templates genspider provides)
scrapy genspider -t <template> <spider name> <domain> (use a specific template)
scrapy genspider <spider name> <domain to crawl> (the default template is basic)
The generated jobbole.py is shown below (every URL in start_urls is fed through the parse method, so put the URLs to crawl into start_urls). Reading the Spider source shows that start_requests yields those URLs one by one; it is a generator.
# -*- coding: utf-8 -*-
import scrapy


class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/']

    def parse(self, response):
        pass
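For reference, this is roughly what the inherited start_requests does. A simplified sketch; the exact source differs between Scrapy versions:

    def start_requests(self):
        # iterate start_urls and yield a Request for each; a generator, not a list
        for url in self.start_urls:
            yield scrapy.Request(url, dont_filter=True)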
To run the spider under the PyCharm debugger, create a main.py in the project root:

from scrapy.cmdline import execute
import sys
import os

# add main.py's directory (the project root) to the module search path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
execute(["scrapy", "crawl", "jobbole"])

Modify settings.py:

# Obey robots.txt rules
# the default True would filter out URLs disallowed by robots.txt
ROBOTSTXT_OBEY = False

The debug result is shown below; response.body holds the entire page content.
XPath selects elements by the relationships between nodes:
parent node
child node
sibling node
ancestor node
descendant node
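As a quick illustration of those relationships, here is a standalone sketch using Scrapy's Selector on a made-up HTML fragment:

from scrapy.selector import Selector

# hypothetical HTML, only for demonstrating the axes
sel = Selector(text="<div><p><span>a</span></p><p>b</p></div>")
sel.xpath("//span/parent::p")             # parent node
sel.xpath("//div/child::p")               # child nodes
sel.xpath("//p[1]/following-sibling::p")  # sibling node
sel.xpath("//span/ancestor::div")         # ancestor node
sel.xpath("//div/descendant::span")       # descendant nodes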
# -*- coding: utf-8 -*-
import scrapy


class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/114405/']

    def parse(self, response):
        title = response.xpath('//*[@id="post-114405"]/div[1]/h1/text()')
        pass
response.xpath() returns a SelectorList, which makes nested XPath queries convenient.
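For instance, a SelectorList can be queried again with a relative XPath (a sketch assuming the response above):

# query the SelectorList again instead of writing one long absolute path
header = response.xpath('//div[@class="entry-header"]')
title = header.xpath('./h1/text()').extract_first()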
fav_nums = response.xpath("//span[contains(@class,'bookmark-btn')]/text()").extract()[0]
# match with a regex; the article may have no bookmarks, in which case nothing matches
# (non-greedy .*? so multi-digit counts are captured in full)
match_fav = re.match(".*?(\d+).*", fav_nums)
if match_fav:
    fav_nums = int(match_fav.group(1))
else:
    fav_nums = 0

comments_nums = response.xpath("//a[@href='#article-comment']/span/text()").extract()[0]
match_comments = re.match(".*?(\d+).*", comments_nums)
if match_comments:
    comments_nums = int(match_comments.group(1))
else:
    comments_nums = 0

tag_list = response.xpath('//*[@id="post-114405"]/div[2]/p/a/text()').extract()
# drop entries such as "1 評論" that are comment counts rather than tags
tag_list = [element for element in tag_list if not element.strip().endswith("評論")]
tags = ','.join(tag_list)
The full parse(), extracting the same fields first with XPath and then with CSS selectors:

# -*- coding: utf-8 -*-
import scrapy
import re


class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/114405/']

    def parse(self, response):
        # extract via XPath
        title = response.xpath('//div[@class="entry-header"]/h1/text()')
        create_date = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/text()').extract()[0].replace("·", "").strip()
        praise_nums = response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()").extract()[0]
        if praise_nums:
            praise_nums = int(praise_nums)
        else:
            praise_nums = 0
        fav_nums = response.xpath("//span[contains(@class,'bookmark-btn')]/text()").extract()[0]
        match_fav = re.match(".*?(\d+).*", fav_nums)
        if match_fav:
            fav_nums = int(match_fav.group(1))
        else:
            fav_nums = 0
        comments_nums = response.xpath("//a[@href='#article-comment']/span/text()").extract()[0]
        match_comments = re.match(".*?(\d+).*", comments_nums)
        if match_comments:
            comments_nums = int(match_comments.group(1))
        else:
            comments_nums = 0
        content = response.xpath('//div[@class="entry"]').extract()[0]
        tag_list = response.xpath('//div[@class="entry-meta"]/p/a/text()').extract()
        tag_list = [element for element in tag_list if not element.strip().endswith("評論")]
        tags = ','.join(tag_list)

        # extract via CSS selectors
        title = response.css(".entry-header > h1::text").extract()[0]
        create_time = response.css("p.entry-meta-hide-on-mobile::text").extract()[0].replace("·", "").strip()
        praise_nums = response.css("span.vote-post-up h10::text").extract()[0]
        if praise_nums:
            praise_nums = int(praise_nums)
        else:
            praise_nums = 0
        fav_nums = response.css(".bookmark-btn::text").extract()[0]
        match_fav = re.match(".*?(\d+).*", fav_nums)
        if match_fav:
            fav_nums = int(match_fav.group(1))
        else:
            fav_nums = 0
        comments_nums = response.css("a[href='#article-comment'] span::text").extract()[0]
        match_comments = re.match(".*?(\d+).*", comments_nums)
        if match_comments:
            comments_nums = int(match_comments.group(1))
        else:
            comments_nums = 0
        content = response.css("div.entry").extract()[0]
        tag_list = response.css("p.entry-meta-hide-on-mobile a::text").extract()
        tag_list = [element for element in tag_list if not element.strip().endswith("評論")]
        tags = ','.join(tag_list)
Crawling every article from the list page (jobbole.py):

import re
import scrapy
from scrapy.http import Request
# module used to join relative URLs with the domain
# Python 3
from urllib import parse
# Python 2
# import urlparse


class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/all-posts/']

    def parse(self, response):
        '''
        1. Extract the article URLs on the list page and hand them to Scrapy to download and parse;
        2. Extract the next-page URL and hand it back to parse.
        '''
        # extract all article URLs on the list page and hand them to Scrapy
        post_urls = response.css("div#archive div.floated-thumb div.post-meta p a.archive-title::attr(href)").extract()
        for post_url in post_urls:
            # if the extracted URL lacks the domain, join it with parse.urljoin:
            # Request(url=parse.urljoin(response.url, post_url), callback=self.parse_detail)
            # a generator, with a callback
            yield Request(post_url, callback=self.parse_detail)
        # extract the next page and hand it to Scrapy to download
        next_url = response.css(".next.page-numbers::attr(href)").extract_first()
        if next_url:
            yield Request(next_url, callback=self.parse)

    def parse_detail(self, response):
        # extract via XPath
        title = response.xpath('//div[@class="entry-header"]/h1/text()')
        create_date = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/text()').extract()[0].replace("·", "").strip()
        praise_nums = response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()").extract()[0]
        if praise_nums:
            praise_nums = int(praise_nums)
        else:
            praise_nums = 0
        fav_nums = response.xpath("//span[contains(@class,'bookmark-btn')]/text()").extract()[0]
        match_fav = re.match(".*?(\d+).*", fav_nums)
        if match_fav:
            fav_nums = int(match_fav.group(1))
        else:
            fav_nums = 0
        comments_nums = response.xpath("//a[@href='#article-comment']/span/text()").extract()[0]
        match_comments = re.match(".*?(\d+).*", comments_nums)
        if match_comments:
            comments_nums = int(match_comments.group(1))
        else:
            comments_nums = 0
        content = response.xpath('//div[@class="entry"]').extract()[0]
        tag_list = response.xpath('//div[@class="entry-meta"]/p/a/text()').extract()
        tag_list = [element for element in tag_list if not element.strip().endswith("評論")]
        tags = ','.join(tag_list)
Passing the cover-image URL to parse_detail through the Request's meta:

class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/all-posts/']

    def parse(self, response):
        '''
        1. Extract the article URLs on the list page and hand them to Scrapy to download and parse;
        2. Extract the next-page URL and hand it back to parse.
        '''
        # select the nodes that hold both the article URL and the cover image
        post_nodes = response.css("div#archive div.floated-thumb div.post-thumb a")
        for post_node in post_nodes:
            image_url = post_node.css("img::attr(src)").extract_first("")
            post_url = post_node.css("::attr(href)").extract_first("")
            # join with the domain if the URL is relative, and pass the
            # cover-image URL along in meta; a generator, with a callback
            yield Request(parse.urljoin(response.url, post_url),
                          meta={"front-image-url": image_url},
                          callback=self.parse_detail)
        # extract the next page and hand it to Scrapy to download
        next_url = response.css(".next.page-numbers::attr(href)").extract_first()
        if next_url:
            yield Request(next_url, callback=self.parse)
    def parse_detail(self, response):
        # extract via XPath
        front_image_url = response.meta.get("front-image-url", "")  # article cover image
The generated items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ArticleSpiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
......
class JobboleArticleSpider(scrapy.Item):
    # every field is declared as Field, which accepts any type
    title = scrapy.Field()
    create_date = scrapy.Field()
    url = scrapy.Field()
    # the MD5 of the url, giving it a fixed length
    url_object_id = scrapy.Field()
    front_image_url = scrapy.Field()
    front_image_path = scrapy.Field()
    praise_nums = scrapy.Field()
    fav_nums = scrapy.Field()
    comment_nums = scrapy.Field()
    tags = scrapy.Field()
    content = scrapy.Field()
......
    def parse_detail(self, response):
        # instantiate the JobboleArticleSpider item
        article_item = JobboleArticleSpider()
        # extract via XPath
        front_image_url = response.meta.get("front-image-url", "")  # article cover image
        title = response.xpath('//div[@class="entry-header"]/h1/text()').extract_first()
        create_date = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/text()').extract()[0].replace("·", "").strip()
        praise_nums = response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()").extract()[0]
        if praise_nums:
            praise_nums = int(praise_nums)
        else:
            praise_nums = 0
        fav_nums = response.xpath("//span[contains(@class,'bookmark-btn')]/text()").extract()[0]
        match_fav = re.match(".*?(\d+).*", fav_nums)
        if match_fav:
            fav_nums = int(match_fav.group(1))
        else:
            fav_nums = 0
        comments_nums = response.xpath("//a[@href='#article-comment']/span/text()").extract()[0]
        match_comments = re.match(".*?(\d+).*", comments_nums)
        if match_comments:
            comments_nums = int(match_comments.group(1))
        else:
            comments_nums = 0
        content = response.xpath('//div[@class="entry"]').extract()[0]
        tag_list = response.xpath('//div[@class="entry-meta"]/p/a/text()').extract()
        tag_list = [element for element in tag_list if not element.strip().endswith("評論")]
        tags = ','.join(tag_list)
        # fill the item
        article_item["title"] = title
        article_item["url"] = response.url
        article_item["create_date"] = create_date
        article_item["front_image_url"] = front_image_url
        article_item["praise_nums"] = praise_nums
        article_item["fav_nums"] = fav_nums
        article_item["comment_nums"] = comments_nums
        article_item["tags"] = tags
        article_item["content"] = content
        # hand the item over to the pipelines
        yield article_item
Register the pipeline in settings.py:

ITEM_PIPELINES = {
    'article_spider.pipelines.ArticleSpiderPipeline': 300,
}
Debugging confirms that each item flows into the pipelines, where any kind of processing can be applied.
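The generated pipeline is a pass-through; processing goes inside process_item, which must return the item:

class ArticleSpiderPipeline(object):
    def process_item(self, item, spider):
        # transform, validate, or persist the item here
        return item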
ITEM_PIPELINES = {
    'article_spider.pipelines.ArticleSpiderPipeline': 300,
    # the number is the order in which the item flows through the pipelines: 1 runs before 300
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
# the item field holding the image URLs to download; it is treated as a list,
# so front_image_url must be filled as a list in jobbole.py
IMAGES_URLS_FIELD = "front_image_url"
# where downloaded images are stored
import os
# parent directory of settings.py
project_dir = os.path.dirname(os.path.abspath(__file__))
IMAGES_STORE = os.path.join(project_dir, "images")
# minimum width/height an image must exceed to be downloaded;
# many more options can be found in the ImagesPipeline source
IMAGES_MIN_HEIGHT = 100
IMAGES_MIN_WIDTH = 100
front_image_url must be converted to a list before the images pipeline can download it.
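In jobbole.py that means wrapping the single URL in a list when filling the item:

article_item["front_image_url"] = [front_image_url]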
Part of the source of Scrapy's images.py (the screenshot is not reproduced here). Subclassing ImagesPipeline gives a customized pipeline:
from scrapy.pipelines.images import ImagesPipeline


class ArticleImagePipeline(ImagesPipeline):
    '''
    customized image pipeline
    '''
    # override item_completed() from ImagesPipeline
    def item_completed(self, results, item, info):
        for ok, value in results:
            image_path = value['path']
            item['front_image_path'] = image_path
        # remember to return the item
        return item
The structure of the results argument passed to item_completed():
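Since the original screenshot is not reproduced, this is roughly its shape: a list of (success, info) tuples, with illustrative values:

results = [
    (True, {'url': 'http://...0.jpg',    # original image url (truncated here)
            'path': 'full/0a79....jpg',  # path relative to IMAGES_STORE
            'checksum': '...'}),         # md5 checksum of the image
]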
The source of item_completed() itself can be read in scrapy/pipelines/images.py.
A helper that maps a URL to a fixed-length MD5 digest (used for url_object_id):

import hashlib


def get_md5(url):
    # if the url is a (unicode) str, encode it to utf-8 first
    if isinstance(url, str):
        url = url.encode('utf-8')
    m = hashlib.md5()
    m.update(url)
    return m.hexdigest()


if __name__ == "__main__":
    print(get_md5("https://jobbole.com"))
Import the module and call get_md5() to fill the field when building the item (jobbole.py):
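For instance (assuming get_md5 lives in a utils module of the project; the exact path is not shown in the original):

# hypothetical module path; adjust to wherever get_md5 is defined
from article_spider.utils.common import get_md5

article_item["url_object_id"] = get_md5(response.url)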
import codecs
import json


class JsonEncodingPipeline(object):
    # custom export of items to a json file
    def __init__(self):
        self.file = codecs.open("article.json", "w", encoding='utf-8')

    def process_item(self, item, spider):
        # called for every item the pipeline receives;
        # ensure_ascii=False keeps Chinese text from being escaped into \u sequences
        lines = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(lines)
        return item

    def spider_closed(self, spider):
        '''
        close the file on the spider_closed signal
        '''
        self.file.close()

Register the pipelines in settings.py:

ITEM_PIPELINES = {
    'article_spider.pipelines.ArticleSpiderPipeline': 300,
    # the number determines the order in which the item flows through the pipelines
    # 'scrapy.pipelines.images.ImagesPipeline': 1,
    'article_spider.pipelines.JsonEncodingPipeline': 2,
    'article_spider.pipelines.ArticleImagePipeline': 1,
}
Scrapy's exporters module ships several formats: JsonItemExporter, JsonLinesItemExporter, CsvItemExporter, XmlItemExporter, PickleItemExporter, PprintItemExporter, and MarshalItemExporter.
from scrapy.exporters import JsonItemExporter


class JsonExporterPipeline(object):
    # export json with the exporter Scrapy provides
    def __init__(self):
        self.file = open("articlexporter.json", "wb")
        self.exporter = JsonItemExporter(self.file, encoding="utf-8", ensure_ascii=False)
        # writes the opening "[\n"
        self.exporter.start_exporting()

    def close_spider(self, spider):
        # writes the closing "]\n"
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
Table design: there is only one table here, so it can be designed directly.
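A possible schema matching the insert statement used below. The column names come from that statement, while the types and lengths are assumptions:

CREATE TABLE articles (
    title         VARCHAR(200) NOT NULL,
    url           VARCHAR(300) NOT NULL,
    url_object_id VARCHAR(50)  NOT NULL PRIMARY KEY,  -- md5 of the url
    font_img_url  VARCHAR(300),
    font_img_path VARCHAR(200),
    create_time   DATE,
    fa_num        INT DEFAULT 0,
    sc_num        INT DEFAULT 0,
    pinglun_num   INT DEFAULT 0,
    tag           VARCHAR(200),
    content       LONGTEXT
);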
Converting the date format:

try:
    create_date = datetime.datetime.strptime(create_date, "%Y/%m/%d").date()
except Exception as e:
    create_date = datetime.datetime.now().date()
article_item["create_date"] = create_date
Installing the MySQL driver: pip install mysqlclient. The parameters used when connecting with MySQLdb:
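The original shows them in a screenshot; with the values used later in this post, the connection looks like this:

import MySQLdb

conn = MySQLdb.connect("localhost", "root", "112358", "bole_articles",
                       charset="utf8", use_unicode=True)
cursor = conn.cursor()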
Implementing MysqlPipeline, the first way of inserting into MySQL (remember to register the pipeline in settings.py). Parsing is faster than the database inserts, so this synchronous version may block:
......
import MySQLdb


class MysqlPipeline(object):
    # synchronous insert: crawling can far outpace the inserts and block
    def __init__(self):
        self.conn = MySQLdb.connect("localhost", "root", "112358", "bole_articles",
                                    charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        insert_sql = """
            insert into articles(title,url,url_object_id,font_img_url,font_img_path,create_time,fa_num,sc_num,pinglun_num,tag,content)
            VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
        """
        self.cursor.execute(insert_sql, (item["title"], item["url"], item["url_object_id"],
                                         item["front_image_url"], item["front_image_path"], item["create_date"],
                                         item["praise_nums"], item["fav_nums"], item["comment_nums"],
                                         item["tags"], item["content"]))
        self.conn.commit()
        return item
MysqlTwistedPipeline: asynchronous insertion based on the Twisted framework:
# MySQL settings in settings.py
MYSQL_HOST = "localhost"
MYSQL_DBNAME = "bole_articles"
MYSQL_USER = "root"
MYSQL_PASSWORD = "112358"

......
from twisted.enterprise import adbapi
import MySQLdb
import MySQLdb.cursors


class MysqlTwistedPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    # reads the settings; note the fixed classmethod name is from_settings, not from_setting
    @classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            password=settings["MYSQL_PASSWORD"],
            charset="utf8",
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )
        # Twisted's async connection pool, backed by the MySQLdb module
        dbpool = adbapi.ConnectionPool("MySQLdb", **dbparms)
        return cls(dbpool)

    def process_item(self, item, spider):
        # run the insert asynchronously through Twisted
        query = self.dbpool.runInteraction(self.do_insert, item)
        # handle errors
        query.addErrback(self.handle_error, item, spider)
        return item

    def handle_error(self, failure, item, spider):
        print(failure)

    def do_insert(self, cursor, item):
        # the actual insert
        insert_sql = """
            insert into articles(title,url,url_object_id,font_img_url,font_img_path,create_time,fa_num,sc_num,pinglun_num,tag,content)
            VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
        """
        cursor.execute(insert_sql, (item["title"], item["url"], item["url_object_id"],
                                    item["front_image_url"], item["front_image_path"], item["create_date"],
                                    item["praise_nums"], item["fav_nums"], item["comment_nums"],
                                    item["tags"], item["content"]))
from scrapy.loader import ItemLoader
......

    def parse_detail(self, response):
        # populate the item through an ItemLoader
        item_loader = ItemLoader(item=JobboleArticleSpider(), response=response)
        # the three key methods: item_loader.add_xpath(), item_loader.add_css(), item_loader.add_value()
        item_loader.add_css("title", ".entry-header > h1::text")
        item_loader.add_value("url", response.url)
        item_loader.add_value("url_object_id", get_md5(response.url))
        item_loader.add_css("create_date", "p.entry-meta-hide-on-mobile::text")
        item_loader.add_value("front_image_url", [front_image_url])
        item_loader.add_css("praise_nums", "span.vote-post-up h10::text")
        item_loader.add_css("fav_nums", ".bookmark-btn::text")
        item_loader.add_css("comment_nums", "a[href='#article-comment'] span::text")
        item_loader.add_css("content", "div.entry")
        item_loader.add_css("tags", "p.entry-meta-hide-on-mobile a::text")
        # nothing is populated until load_item() is called
        article_item = item_loader.load_item()
        yield article_item
from scrapy.loader.processors import MapCompose, TakeFirst
......

def date_convert(value):
    # convert the date string, falling back to today on failure
    try:
        create_date = datetime.datetime.strptime(value, "%Y/%m/%d").date()
    except Exception as e:
        create_date = datetime.datetime.now().date()
    return create_date


class JobboleArticleSpider(scrapy.Item):
    # every field is declared as Field, which accepts any type
    title = scrapy.Field(
        # several functions can be passed; each is applied to every element
        input_processor=MapCompose(lambda x: x + "hah")
    )
    create_date = scrapy.Field(
        # the value is still a list; convert each element
        input_processor=MapCompose(date_convert),
        # take only the first element; repeating this on every field is
        # tedious, so a custom ItemLoader can set it as the default
        output_processor=TakeFirst()
    )
    url = scrapy.Field()
    # the MD5 of the url, giving it a fixed length
    url_object_id = scrapy.Field()
    front_image_url = scrapy.Field()
    front_image_path = scrapy.Field()
    praise_nums = scrapy.Field()
    fav_nums = scrapy.Field()
    comment_nums = scrapy.Field()
    tags = scrapy.Field()
    content = scrapy.Field()
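To see what MapCompose does, here is a standalone sketch: each function is applied in turn to every element of the input list:

from scrapy.loader.processors import MapCompose

proc = MapCompose(lambda x: x.replace("·", ""), lambda x: x.strip())
print(proc([" 2018/11/18 ·"]))  # -> ['2018/11/18']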
items.py:

from scrapy.loader import ItemLoader
......

class ArticleItemLoader(ItemLoader):
    # custom ItemLoader: take the first element of each value list by default
    default_output_processor = TakeFirst()
jobbole.py:
......
    item_loader = ArticleItemLoader(item=JobboleArticleSpider(), response=response)
    # the three key methods: item_loader.add_xpath(), item_loader.add_css(), item_loader.add_value()
    item_loader.add_css("title", ".entry-header > h1::text")
    item_loader.add_value("url", response.url)
    item_loader.add_value("url_object_id", get_md5(response.url))
    item_loader.add_css("create_date", "p.entry-meta-hide-on-mobile::text")
    item_loader.add_value("front_image_url", [front_image_url])
    item_loader.add_css("praise_nums", "span.vote-post-up h10::text")
    item_loader.add_css("fav_nums", ".bookmark-btn::text")
    item_loader.add_css("comment_nums", "a[href='#article-comment'] span::text")
    item_loader.add_css("content", "div.entry")
    item_loader.add_css("tags", "p.entry-meta-hide-on-mobile a::text")
    article_item = item_loader.load_item()
    yield article_item
The item customized with ItemLoader processors (items.py):

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import datetime
import re

import scrapy
# TakeFirst takes the first element, Join concatenates
from scrapy.loader.processors import MapCompose, TakeFirst, Join
from scrapy.loader import ItemLoader


class ArticleSpiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


def date_convert(value):
    # convert the date string, falling back to today on failure
    try:
        create_date = datetime.datetime.strptime(value, "%Y/%m/%d").date()
    except Exception as e:
        create_date = datetime.datetime.now().date()
    return create_date


class ArticleItemLoader(ItemLoader):
    # custom ItemLoader: take the first element of each value list by default
    default_output_processor = TakeFirst()


def get_nums(value):
    # parse the praise / bookmark / comment counts
    match_num = re.match(".*?(\d+).*", value)
    if match_num:
        value = int(match_num.group(1))
    else:
        value = 0
    return value


def return_value(value):
    # do nothing; used to override the loader's default output processor
    return value


def remove_comment(value):
    # drop tag entries that are really comment counts
    if "評論" in value:
        return ""
    else:
        return value


class JobboleArticleSpider(scrapy.Item):
    # every field is declared as Field, which accepts any type
    title = scrapy.Field()
    create_date = scrapy.Field(
        # the value is still a list; convert each element
        input_processor=MapCompose(date_convert),
        # taking the first element is now the loader's default
        # output_processor=TakeFirst()
    )
    url = scrapy.Field()
    # the MD5 of the url, giving it a fixed length
    url_object_id = scrapy.Field()
    front_image_url = scrapy.Field(
        # override the custom loader's default: this field must stay a list
        output_processor=MapCompose(return_value)
    )
    front_image_path = scrapy.Field()
    praise_nums = scrapy.Field(
        input_processor=MapCompose(get_nums)
    )
    fav_nums = scrapy.Field(
        input_processor=MapCompose(get_nums)
    )
    comment_nums = scrapy.Field(
        input_processor=MapCompose(get_nums)
    )
    tags = scrapy.Field(
        # override the take-first default: join all tags instead
        output_processor=Join(",")
    )
    content = scrapy.Field()
With everything above in place, all the articles can be crawled quickly. Scrapy is built on the asynchronous Twisted engine rather than threads, which is what gives it its concurrency. Writing a crawler mostly comes down to: defining in the spider which URLs to crawl and how to fill the data (jobbole.py), defining the item template (items.py), and then customizing pipelines that process the item data: writing JSON files, inserting into the database, downloading images, recording image paths, and so on.