Tools used: Ubuntu, Python, PyCharm
1. Create a project in PyCharm: steps omitted.
2. Install the Scrapy framework:
pip install Scrapy
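To confirm the install succeeded, a quick check is to print the installed version from Python (your version number will differ):

import scrapy
print(scrapy.__version__)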
3. Create the Scrapy project
1. Create the crawler project:
scrapy startproject qidian
2. Create the spider; first change into the project directory:

cd qidian/
scrapy genspider book book.qidian.com
Once created, the project has the standard Scrapy layout; book.py under qidian/spiders/ is our spider file.
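At this point book.py contains the skeleton that genspider generates, which should look roughly like this (the exact template varies slightly across Scrapy versions):

# -*- coding: utf-8 -*-
import scrapy


class BookSpider(scrapy.Spider):
    name = 'book'
    allowed_domains = ['book.qidian.com']
    start_urls = ['http://book.qidian.com/']

    def parse(self, response):
        pass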
4. Open book.py and write the spider code
1. Browse to the catalog of the book to crawl, find the starting URL, and set start_urls:
# Ghost Blows Out the Light (鬼吹燈) catalog
start_urls = ['https://book.qidian.com/info/53269#Catalog']
2. When the project was created, the domain filter was set to:

allowed_domains = ['book.qidian.com']
Opening a chapter page shows that chapter URLs look like https://read.qidian.com/chapter/PNjTiyCikMo1/FzxWdm35gIE1, so read.qidian.com must also be added to allowed_domains (otherwise Scrapy's offsite filtering would drop the chapter requests):
allowed_domains = ['book.qidian.com', 'read.qidian.com']
All that remains is to pull the content we need out of each response with XPath. The complete code is as follows:
# -*- coding: utf-8 -*-
import scrapy
import logging

logger = logging.getLogger(__name__)


class BookSpider(scrapy.Spider):
    name = 'book'
    allowed_domains = ['book.qidian.com', 'read.qidian.com']
    start_urls = ['https://book.qidian.com/info/53269#Catalog']

    def parse(self, response):
        # Get the chapter list
        li_list = response.xpath('//div[@class="volume"][2]/ul/li')
        # Loop over the list, taking each chapter's name and URL
        for li in li_list:
            item = {}
            # Chapter name
            item['chapter_name'] = li.xpath('./a/text()').extract_first()
            # Chapter URL, returned as //read.qidian.com/chapter/PNjTiyCikMo1/TpiSLsyH5Hc1,
            # so it has to be rebuilt with a scheme
            item['chapter_url'] = li.xpath('./a/@href').extract_first()
            # Crawl each chapter's content (skip entries with no URL,
            # which would otherwise crash the string concatenation)
            if item['chapter_url'] is not None:
                item['chapter_url'] = 'https:' + item['chapter_url']
                # meta: pass the item along to the chapter callback
                yield scrapy.Request(item['chapter_url'],
                                     callback=self.parse_chapter,
                                     meta={'item': item})

    def parse_chapter(self, response):
        item = response.meta['item']
        # Get the chapter body text
        item['chapter_content'] = response.xpath(
            '//div[@class="read-content j_readContent"]/p/text()').extract()
        yield item
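Before adding storage, it can be worth checking the chapter-list XPath interactively and then doing a trial run; assuming the catalog is present in the initial HTML as the spider expects, something like this works (the output file name is just an example):

scrapy shell 'https://book.qidian.com/info/53269#Catalog'
>>> response.xpath('//div[@class="volume"][2]/ul/li/a/text()').extract_first()

scrapy crawl book -o chapters.json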
5. Save the crawled data to MongoDB
1. Edit settings.py: find ITEM_PIPELINES and uncomment it:

ITEM_PIPELINES = {
    'qidian.pipelines.QidianPipeline': 300,
}
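The number 300 is the pipeline's order: when several pipelines are enabled they run in ascending order of this value (conventionally in the 0-1000 range). A hypothetical second pipeline, CleanTextPipeline, that should run first would be registered like this:

ITEM_PIPELINES = {
    'qidian.pipelines.QidianPipeline': 300,
    # hypothetical example: would run before QidianPipeline because 200 < 300
    # 'qidian.pipelines.CleanTextPipeline': 200,
}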
2. Add the MongoDB-related settings:

# Host address
MONGODB_HOST = '127.0.0.1'
# Port
MONGODB_PORT = 27017
# Name of the database to save to
MONGODB_DBNAME = 'qidian'
# Name of the collection to save to
MONGODB_DOCNAME = 'dmbj'
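Before running the crawl it helps to confirm MongoDB is actually reachable on that host and port; a minimal pymongo check, assuming a local mongod on the default port:

import pymongo

# Fail fast (after ~2s) if mongod is not reachable on the configured host/port
client = pymongo.MongoClient('127.0.0.1', 27017, serverSelectionTimeoutMS=2000)
print(client.server_info()['version'])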
3. Save the data in pipelines.py; the final file contents are as follows:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
# scrapy.conf was removed in later Scrapy releases;
# get_project_settings() is the supported way to read settings here
from scrapy.utils.project import get_project_settings

settings = get_project_settings()


class QidianPipeline(object):
    def __init__(self):
        '''Configure MongoDB in __init__'''
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        db_name = settings['MONGODB_DBNAME']
        client = pymongo.MongoClient(host=host, port=port)
        db = client[db_name]
        self.post = db[settings['MONGODB_DOCNAME']]

    def process_item(self, item, spider):
        # insert_one() replaces the long-deprecated insert()
        self.post.insert_one(dict(item))
        return item
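After a crawl finishes, a quick query confirms the chapters landed in the configured database and collection (names taken from the settings above; count_documents needs a reasonably recent pymongo):

import pymongo

client = pymongo.MongoClient('127.0.0.1', 27017)
collection = client['qidian']['dmbj']

# How many chapters were stored, plus a peek at the first few
print(collection.count_documents({}))
for doc in collection.find().limit(3):
    print(doc['chapter_name'], doc['chapter_url'])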