The 運維生存時間 blog (ttlsa.com) is fairly thorough in its content. As an ops engineer, it is a place I wander into from time to time, and I have learned quite a bit there. Thanks to the author for the contribution!
Recently I have been using Scrapy to collect data about the articles on this blog, so that later I have a data basis for picking out the articles I consider worth reading.
Today the job is to scrape the data; storing it comes later.
First, create a spider application named ttlsa with the command:
scrapy genspider ttlsa www.ttlsa.com
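(This assumes the Scrapy project itself already exists; judging from the import paths used below, it would have been created with something like scrapy startproject ScrapyProject.)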
Next, write the following code in ttlsa.py.
# -*- coding: utf-8 -*-
import re
from urllib import parse
from datetime import datetime

import scrapy
from scrapy.http import Request

from ScrapyProject.utils.common import get_object_id

'''
Collect article data from ttlsa.com
'''

class TtlsaSpider(scrapy.Spider):
    name = 'ttlsa'
    allowed_domains = ['www.ttlsa.com']
    start_urls = ['http://www.ttlsa.com/']

    def parse(self, response):
        post_nodes = response.css("article")
        for node in post_nodes:
            front_img_url = node.css("figure:nth-child(1) > a:nth-child(1) > img:nth-child(1)::attr(src)").extract_first("")
            #create_time = node.css("div:nth-child(3) > span:nth-child(3) > span:nth-child(1)::text").extract_first("")
            url = node.css("figure:nth-child(1) > a:nth-child(1)::attr(href)").extract_first("")
            url = parse.urljoin(response.url, url)
            if url != "http://www.ttlsa.com/":
                yield Request(url=url, meta={"front_img_url": front_img_url}, callback=self.parse_detail)
        next_page = response.css(".next ::attr(href)").extract_first("")
        if next_page:
            yield Request(url=parse.urljoin(response.url, next_page), callback=self.parse)

    def parse_detail(self, response):
        front_img_url = response.meta.get("front_img_url", "")
        try:
            create_time = response.css(".spostinfo ::text").extract()[3]
            pattern = r".*?(\d+/\d+/\d+)"
            m = re.match(pattern, create_time)
            create_time = datetime.strptime(m[1], "%d/%m/%Y").date()
        except IndexError:
            create_time = datetime.now().date()
        title = response.css(".entry-title::text").extract_first("")
        # comment count; the site shows "發表評論" when there are no comments yet
        comment_nums = response.css("div.entry-content li.comment a::text").extract_first("0")
        comment_nums = comment_nums.replace("發表評論", "0")
        # praise (like) count
        praise_nums = response.css("a.dingzan .count::text").extract_first("0").strip()
        # tags
        tags = ",".join(response.css("ul.wow li a::text").extract())
        content = response.css(".single-content").extract()

        from ScrapyProject.items import TtlsaItem
        ttlsa_item = TtlsaItem()
        ttlsa_item["title"] = title
        ttlsa_item["comment_nums"] = comment_nums
        ttlsa_item["praise_nums"] = praise_nums
        ttlsa_item["tags"] = tags
        ttlsa_item["content"] = content
        ttlsa_item["create_time"] = create_time
        ttlsa_item["front_img_url"] = [front_img_url]
        ttlsa_item["url"] = response.url
        ttlsa_item["url_object_id"] = get_object_id(response.url)
        # yield hands the item to the pipelines; ITEM_PIPELINES must be enabled and set correctly in settings.py
        yield ttlsa_item
items.py
import scrapy

class TtlsaItem(scrapy.Item):
    title = scrapy.Field()
    comment_nums = scrapy.Field()
    praise_nums = scrapy.Field()
    tags = scrapy.Field()
    content = scrapy.Field()
    create_time = scrapy.Field()
    front_img_url = scrapy.Field()
    # local path of the downloaded cover image
    front_img_path = scrapy.Field()
    url = scrapy.Field()
    # urls vary in length, so we also keep a fixed-length digest of the url;
    # on later crawls it lets us decide whether to insert a new row or update an existing one
    url_object_id = scrapy.Field()
pipelines.py
class TtlsaPipeline(object):
    def process_item(self, item, spider):
        return item
settings.py
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'ScrapyProject.pipelines.TtlsaPipeline': 300,  # note the number: lower values run earlier
}
Stepping through in the debugger, we can see that we are already getting the data we want.
With a breakpoint set inside the pipeline, the item shows up there as well.
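To drive the spider from an IDE debugger, a common approach (not shown in the original post; the file name main.py and its location are my choice) is a small entry script at the project root that invokes Scrapy's command line programmatically:

# main.py -- minimal debug entry point (a sketch; name and location are assumptions)
import os
import sys

from scrapy.cmdline import execute

# make sure the project root is on sys.path so ScrapyProject can be imported
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

# equivalent to running "scrapy crawl ttlsa" on the command line
execute(["scrapy", "crawl", "ttlsa"])

Running this file under the debugger lets you set breakpoints in parse_detail or in the pipelines.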
Next, use Scrapy's built-in images pipeline to download the images locally and save the local image path into the item.
1. Override the pipeline
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

class TtlsaImagesPipeline(ImagesPipeline):
    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['front_img_path'] = image_paths
        return item
2. Configure settings.py
import os

ITEM_PIPELINES = {
    'ScrapyProject.pipelines.TtlsaImagesPipeline': 1,
}
IMAGES_URLS_FIELD = "front_img_url"
project_dir = os.path.abspath(os.path.dirname(__file__))
IMAGES_STORE = os.path.join(project_dir, "images")
3. Create an images directory under the project directory.
Now, when the spider scrapes the site's images, they are downloaded into the project's images directory, and the local path of each downloaded image is stored in item['front_img_path'].
For the url_object_id = scrapy.Field() field, we can use the hashlib library to generate the value.
utils/common.py
import hashlib

def get_object_id(url):
    md5 = hashlib.md5()
    md5.update(url.encode('utf-8'))
    return md5.hexdigest()

if __name__ == '__main__':
    print(get_object_id("http://www.baidu.com"))
Then we wire it up in ttlsa.py like this:
from ScrapyProject.utils.common import get_object_id
ttlsa_item["url_object_id"] = get_object_id(response.url)
Good, all the data we wanted has been collected; next let's store it so we can analyse it later.
For storage, I plan to split it into two parts: one part goes to a file, the other goes to MySQL.
1. Storing to a file
1.1 Modify pipelines.py
import codecs
import json

class JsonWithEncodingPipeline(object):
    # custom JSON file export
    def __init__(self):
        self.file = codecs.open('ttlsa.json', 'w', encoding="utf-8")

    def process_item(self, item, spider):
        lines = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(lines)
        return item

    def spider_closed(self, spider):
        self.file.close()
1.2 Modify settings.py
ITEM_PIPELINES = {
'ScrapyProject.pipelines.TtlsaImagesPipeline': 1,
'ScrapyProject.pipelines.JsonWithEncodingPipeline':2,
}
With this, the data is written to the ttlsa.json file.
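With the pipeline enabled, running the spider (for example with scrapy crawl ttlsa) writes one JSON object per line to ttlsa.json in the directory the crawl was started from.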
2. Storing to MySQL
2.1 Design the article table in the ttlsa_spider MySQL database.
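The original table design was shown as a screenshot, which is not reproduced here. As a rough reconstruction, the column names below follow the INSERT statement used later, while the types, lengths and the choice of url_object_id as primary key are my assumptions:

# create_article_table.py -- a sketch of the article table; column names match the
# INSERT statement below, but types and lengths are assumptions, not the original design
import MySQLdb

DDL = """
CREATE TABLE IF NOT EXISTS article (
    url_object_id  VARCHAR(32)  NOT NULL PRIMARY KEY,  -- md5 of the url, fixed length
    title          VARCHAR(255) NOT NULL,
    url            VARCHAR(300) NOT NULL,
    comment_nums   INT          NOT NULL DEFAULT 0,
    praise_nums    INT          NOT NULL DEFAULT 0,
    tags           VARCHAR(255),
    content        LONGTEXT,
    create_time    DATE,
    front_img_url  VARCHAR(300),
    front_img_path VARCHAR(300)
) DEFAULT CHARSET=utf8
"""

conn = MySQLdb.connect('localhost', 'root', 'root', 'ttlsa_spider', charset="utf8", use_unicode=True)
cursor = conn.cursor()
cursor.execute(DDL)
conn.commit()
conn.close()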
Now let's write a MysqlPipeline so the scraped data ends up in MySQL.
import MySQLdb

class MysqlPipeline(object):
    def __init__(self):
        self.conn = MySQLdb.connect('localhost', 'root', 'root', 'ttlsa_spider', charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        insert_sql = """
            insert into article(title, url, url_object_id, comment_nums, praise_nums, tags, content,
                                create_time, front_img_url, front_img_path)
            values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
        """
        self.cursor.execute(insert_sql, (item["title"], item["url"], item["url_object_id"],
                                         int(item["comment_nums"]), int(item["praise_nums"]),
                                         item["tags"], item["content"], item["create_time"],
                                         item["front_img_url"], item["front_img_path"]))
        self.conn.commit()
        # return the item so that any later pipelines still receive it
        return item
Note: we cast the comment count and praise count to int.
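If the scraped text ever contains anything besides digits, int() will raise a ValueError. A small defensive helper like the following (my addition, not part of the original code) pulls out the first run of digits and falls back to 0:

import re

def to_int(value, default=0):
    """Extract the first run of digits from value (e.g. text such as "3 evaluations"); fall back to default."""
    m = re.search(r"\d+", str(value))
    return int(m.group()) if m else default

The pipeline could then call to_int(item["comment_nums"]) instead of int(item["comment_nums"]).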
Modify settings.py:
ITEM_PIPELINES = {
'ScrapyProject.pipelines.TtlsaImagesPipeline': 1,
#'ScrapyProject.pipelines.JsonWithEncodingPipeline': 2,
'ScrapyProject.pipelines.MysqlPipeline': 3,
}
Run it under the debugger and see whether anything goes wrong.
Pressing F8 over and over, the data streams steadily into the database.
There is one problem, though: when the data volume is large and we are writing heavily, synchronous inserts can block the crawler and may cause database errors. In that case we should insert into the database asynchronously. Below I rewrite the pipeline to do asynchronous inserts.
First, put the database configuration into settings.py.
#MYSQL
MYSQL_HOST="127.0.0.1"
MYSQL_USER="root"
MYSQL_PWD="4rfv%TGB^"
MYSQL_DB="ttlsa_spider"
Whenever we want to use the variables defined in settings.py, we can pick them up in a pipeline class through the from_settings(cls, settings) classmethod.
from twisted.enterprise import adbapi

class MysqlTwsitedPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparms = {
            'host': settings["MYSQL_HOST"],
            'db': settings["MYSQL_DB"],
            'user': settings["MYSQL_USER"],
            'passwd': settings["MYSQL_PWD"],
            'charset': 'utf8',
            'use_unicode': True
        }
        dbpool = adbapi.ConnectionPool("MySQLdb", cp_min=10, cp_max=20, **dbparms)
        return cls(dbpool)

    def process_item(self, item, spider):
        """
        Use twisted to turn the MySQL insert into an asynchronous operation
        """
        query = self.dbpool.runInteraction(self.doInsert, item)
        query.addErrback(self.handle_error, item, spider)  # handle errors from the asynchronous write
        return item

    def handle_error(self, failurer, item, spider):
        if failurer:
            print(failurer)

    def doInsert(self, cursor, item):
        insert_sql = """
            insert into article(title, url, url_object_id, comment_nums, praise_nums, tags, content,
                                create_time, front_img_url, front_img_path)
            values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
        """
        cursor.execute(insert_sql, (item["title"], item["url"], item["url_object_id"],
                                    int(item["comment_nums"]), int(item["praise_nums"]),
                                    item["tags"], item["content"], item["create_time"],
                                    item["front_img_url"], item["front_img_path"]))
Then register the MysqlTwsitedPipeline class in settings.py.
ITEM_PIPELINES = {
'ScrapyProject.pipelines.TtlsaImagesPipeline': 1,
#'ScrapyProject.pipelines.JsonWithEncodingPipeline': 2,
'ScrapyProject.pipelines.MysqlTwsitedPipeline': 3,
}
Debug it again.
Good, the data is flowing into the database once more.
Now let's look at the number of connections to the database:
Counting them, there are 12; the size of the connection pool is governed by cp_min=10 and cp_max=20.
With that, the data is stored in MySQL.