Today I'm scraping a music site I'm particularly fond of, www.luoo.net.
First, set up the fields the item needs to store.
items.py
The fields are the volume (issue) number, volume title, volume creation date, the names of the tracks in a volume, the artist names, the music file URLs, and the file download results.
```python
import scrapy


class LuoWangSpiderItem(scrapy.Item):
    vol_num = scrapy.Field()
    vol_title = scrapy.Field()
    vol_time = scrapy.Field()
    music_name = scrapy.Field()
    music_author = scrapy.Field()
    music_urls = scrapy.Field()
    music_files = scrapy.Field()
```
Next, my spider file.
luowang.py
The main thing to watch in this module is that the volume number and volume title are single values, while each volume contains a dozen or so songs, so each scraped track URL must be appended to a list, which ultimately becomes the music_urls field; the music_author field works the same way.
Then, on a volume's detail page, the songs either play automatically or play when their URL is clicked. Since this content is loaded dynamically, I opened the browser's Network tab to find a track's real URL, then spliced the final URL for each song together from its parts and saved it in music_urls, to be requested later by the FilesPipeline subclass in the pipelines module.
```python
# -*- coding: utf-8 -*-
import scrapy
from LuoWangSpider.items import LuoWangSpiderItem


class LuowangSpider(scrapy.Spider):
    name = 'luowang'
    allowed_domains = ['luoo.net']
    offset = 1
    url = 'http://www.luoo.net/tag/?p='
    # Build the start page URL by string concatenation
    start_urls = [url + str(offset)]

    def parse(self, response):
        vol_list = response.xpath("//div[@class='vol-list']/div/a/@href").extract()
        total_page = response.xpath("//div[@class='paginator']/a[12]/text()").extract()[0]

        for vol in vol_list:
            yield scrapy.Request(url=vol, callback=self.parse_1)

        # Build the URLs of the remaining listing pages by concatenation
        self.offset += 1
        if self.offset < int(total_page):
            url = self.url + str(self.offset)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse_1(self, response):
        # Volume (issue) detail page
        item = LuoWangSpiderItem()
        item['vol_num'] = response.xpath("//div[@class='container ct-sm']/h1/span[1]/text()").extract()[0]
        item['vol_title'] = response.xpath("//div[@class='container ct-sm']/h1/span[2]/text()").extract()[0]
        item['vol_time'] = response.xpath("//div[@class='clearfix vol-meta']/span[2]/text()").extract()[0]

        music_list = response.xpath("//*[@id='luooPlayerPlaylist']/ul/li")
        # The vol_* fields are single values; the music_* fields are lists
        music_name_list = []
        music_author_list = []
        music_url_list = []
        for music in music_list:
            music_name = music.xpath('./div/a[1]/text()').extract()[0]
            music_author = music.xpath('./div/span[2]/text()').extract()[0]
            music_name_list.append(music_name)
            music_author_list.append(music_author)

        item['music_author'] = music_author_list
        item['music_name'] = music_name_list

        # The page loads audio via JS; the real file URLs were found by sniffing
        # the Network tab, then rebuilt from the volume number plus track number
        for name in item['music_name']:
            music_url = 'http://mp3-cdn2.luoo.net/low/luoo/radio' + item['vol_num'] + '/' + name[0:2] + '.mp3'
            music_url_list.append(music_url)
        item['music_urls'] = music_url_list

        yield item
```
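The URL-splicing step is pure string work, so it can be factored into a small helper and checked without running the spider. A sketch: the CDN host and the `low` quality segment are taken from the spider above; the volume number and song titles here are made up for illustration.

```python
def build_music_url(vol_num, music_name):
    """Rebuild a track's mp3 URL from the volume number and the song title.

    Track titles on the playlist start with a two-digit track number
    (e.g. '01. Song'), which is also the file name on the CDN.
    """
    track_no = music_name[:2]
    return 'http://mp3-cdn2.luoo.net/low/luoo/radio%s/%s.mp3' % (vol_num, track_no)


# Hypothetical volume 920 with two tracks:
urls = [build_music_url('920', name) for name in ['01. First', '02. Second']]
```

If the site changes its CDN layout, only this one function needs updating.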
pipelines.py
The pipeline file. Create a class inheriting from FilesPipeline and override the item_completed and file_path methods to customize the file download path and rename the downloaded files.
One thing worth noting: sometimes the program raises no errors yet the downloads don't happen as expected. In that case, print results, which holds the outcome of the requests sent by the get_media_requests method.
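As a reference for reading that printout: `item_completed` receives `results` as a list of `(success, info)` two-tuples, one per request yielded by `get_media_requests`. A minimal stand-in using plain stdlib types (in real Scrapy the failure element is a Twisted `Failure`, not an exception instance, and the values below are made up):

```python
# Shape of the `results` argument, mimicked with stdlib types
results = [
    (True, {'url': 'http://mp3-cdn2.luoo.net/low/luoo/radio920/01.mp3',
            'path': 'Vol.920_SomeTitle/01.mp3',
            'checksum': 'd41d8cd98f00b204e9800998ecf8427e'}),
    (False, IOError('download failed')),  # a Twisted Failure in real Scrapy
]

# The same filtering the pipeline does: keep paths of successful downloads only
file_paths = [info['path'] for ok, info in results if ok]
```

If every tuple starts with `False`, the downloads all failed, which is exactly the silent situation described above.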
```python
# -*- coding: utf-8 -*-
import os
import json

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.files import FilesPipeline

from LuoWangSpider.settings import FILES_STORE


class Mp3Pipeline(FilesPipeline):
    '''Custom file download pipeline.'''

    def get_media_requests(self, item, info):
        '''Send one request per file URL.'''
        for music_url in item['music_urls']:
            yield scrapy.Request(url=music_url, meta={'item': item})

    def item_completed(self, results, item, info):
        '''Handle the download results.'''
        print(results)  # debug: inspect the (success, info) tuples
        file_paths = [x['path'] for ok, x in results if ok]
        if not file_paths:
            raise DropItem("Item contains no files")
        for i in range(len(item['music_name'])):
            old_name = FILES_STORE + file_paths[i]
            new_name = (FILES_STORE + 'Vol.' + item['vol_num'] + '_'
                        + item['vol_title'] + '/' + item['music_name'][i] + '.mp3')
            # Rename the downloaded file to the song title
            os.rename(old_name, new_name)
        return item

    def file_path(self, request, response=None, info=None):
        '''Customize where each file is saved.'''
        vol_title = request.meta['item']['vol_title']
        vol_num = request.meta['item']['vol_num']
        file_name = request.url.split('/')[-1]
        folder_name = 'Vol.' + vol_num + '_' + vol_title
        return '%s/%s' % (folder_name, file_name)


# # Save items as JSON
# class LuoWangSpiderPipeline(object):
#     def __init__(self):
#         self.json_file = open(r'F:\luowang\luowang.json', 'wb')
#
#     def process_item(self, item, spider):
#         self.json_file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
#         return item
```
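The path logic in `file_path` depends only on the request URL and two item fields, so its string-building core can be verified in isolation. A sketch mirroring that method, with illustrative values:

```python
def file_path_for(url, vol_num, vol_title):
    # Mirrors Mp3Pipeline.file_path: '<Vol.num_title>/<file name from url>'
    file_name = url.split('/')[-1]
    return 'Vol.%s_%s/%s' % (vol_num, vol_title, file_name)


path = file_path_for('http://mp3-cdn2.luoo.net/low/luoo/radio920/01.mp3',
                     '920', 'SomeTitle')
```

Since `file_path` already places each file in a per-volume folder, the later `os.rename` in `item_completed` only changes the file name within that folder, from the track number to the full song title.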
settings.py
Below is the settings file; only the relevant part is shown.
```python
FILES_STORE = 'F:/luowang/'
# Item field holding the file URLs
FILES_URLS_FIELD = 'music_urls'
# Item field that receives the download results
FILES_RESULT_FIELD = 'music_files'

ITEM_PIPELINES = {
    # 'LuoWangSpider.pipelines.LuoWangSpiderPipeline': 200,
    # 'scrapy.pipelines.files.FilesPipeline': 300,
    'LuoWangSpider.pipelines.Mp3Pipeline': 1,
}
```
Final result:
Source code:
https://github.com/evolution707/LuoWangSpider