I had long struggled to find good music resources. Python crawlers are reputed to be extremely powerful, and since I have a bit of Python background, I decided to write a script to scrape some mp3 files. Without further ado, let's look at the result first.
/home/roland/PycharmProjects/Main.py
please input the artist name: billie eilish

Process finished with exit code 0
After running the script, you are prompted for the artist to scrape. Enter a favorite artist name and press Enter (Chinese artist names are supported). Nothing is printed to the log; instead, the mp3 files are saved directly under the save_path directory, as shown in the screenshot below:
The Python version used here is 3.6; in theory any 3.x version can run it directly, without installing an extra requests library.
Code analysis
The request is a POST. Taking Chrome as an example, press F12 to open the developer console and follow the figure below to locate the Form Data; that is what we need. Of course, not every request method carries a data payload; the point of doing this is to imitate the way a browser visits the page.
Then add the fields that need to go into the request one by one. Here pages and content are dynamic variables (the original page is loaded asynchronously via AJAX), and content is the artist to search for.
def resemble_data(content, index):
    data = {}
    data['types'] = 'search'
    data['count'] = '30'
    data['source'] = 'netease'
    data['pages'] = index
    data['name'] = content
    data = urllib.parse.urlencode(data).encode('utf-8')
    return data
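As a minimal usage sketch (the url value below is only a placeholder; in practice it is the Request URL captured from the browser, which appears in the complete code at the end), posting one page of search data looks like this:

import urllib.request
import urllib.parse

# Hypothetical usage of resemble_data: url is a placeholder for the API address
# captured in DevTools; passing a data body makes urlopen issue a POST request.
url = 'http://www.gequdaquan.net/gqss/api.php'
data = resemble_data('billie eilish', '1')    # search page 1 for the artist
response = urllib.request.urlopen(url, data)
print(response.read().decode('utf-8'))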
In addition, we also need a User-Agent. The corresponding code is:
opener.addheaders = [('User-Agent','Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36')]
Where do we get this value? Just copy it from the browser's developer console; the goal is to disguise the Python request as an ordinary browser visit.
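If you would rather not install a global opener, a minimal alternative sketch is to attach the header to a single request via urllib.request.Request (the URL below is a placeholder for illustration):

import urllib.request

# Sketch: set the User-Agent on one request instead of on a global opener.
# 'http://www.example.com' is a placeholder URL.
req = urllib.request.Request(
    'http://www.example.com',
    headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                           '(KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'})
response = urllib.request.urlopen(req)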
# set proxy agent
proxy_support = urllib.request.ProxyHandler({'http': '119.6.144.73:81'})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)
# set proxy agent
Here we also set a proxy IP to guard against the server's anti-crawler mechanism (an IP that visits too frequently will be treated as a crawler rather than a human visitor). This is only an example; we could also scrape proxy IP addresses and port numbers so that more IPs share the requests, further reducing the chance of being flagged as a crawler (see the rotation sketch after the example below).
Below is an example of scraping proxy IPs (feel free to skip it if you are not interested):
import urllib.request
import urllib.parse
import re

url = 'http://31f.cn/'
head = {}
head['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
request = urllib.request.Request(url, headers=head)
response = urllib.request.urlopen(request)
html_document = response.read().decode('utf-8')
pattern_ip = re.compile(r'<td>(\d+\.\d+\.\d+\.\d+)</td>[\s\S]*?<td>(\d{2,4})</td>')
ip_list = pattern_ip.findall(html_document)
print(len(ip_list))
for item in ip_list:
    print("IP address: %s  port: %s" % (item[0], item[1]))
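Building on that, a minimal sketch of rotating through the scraped list (it reuses the ip_list variable from the example above; whether any given proxy is actually alive is not checked here) might look like this:

import random
import urllib.request

# Sketch: install a randomly chosen proxy so requests do not all come from one IP.
# ip_list is assumed to hold (address, port) tuples from the scraping example above.
def install_random_proxy(ip_list):
    address, port = random.choice(ip_list)
    proxy_support = urllib.request.ProxyHandler({'http': '%s:%s' % (address, port)})
    opener = urllib.request.build_opener(proxy_support)
    urllib.request.install_opener(opener)

# Call install_random_proxy(ip_list) before each batch of requests to switch IPs.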
The response here actually returns the link address of a music file, in a format like xxxxuuid.mp3. We rename the default uuid.mp3 to songname.mp3 and then write the file out in binary mode.
data = {}
data['types'] = 'url'
data['id'] = id
data['source'] = 'netease'
data = urllib.parse.urlencode(data).encode('utf-8')
response = urllib.request.urlopen(url, data)
music_url_str = response.read().decode('utf-8')
music_url = pattern.findall(music_url_str)
result = urllib.request.urlopen(music_url[0])
file = open(save_path + name + '.mp3', 'wb')
file.write(result.read())
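Note that this snippet leaves the file handle open; the complete script below calls flush() and close() explicitly after writing, and wrapping the open() call in a with statement would close it automatically.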
As for the Request URL, it can be obtained here (of course, this is just an example; this URL is not the one used in the example):
Below is the complete code. Just change the music file save path save_path = '/home/roland/Spider/Img/' to your own save path and it will work.
import urllib.request
import urllib.parse
import json
import re


def resemble_data(content, index):
    # Build the search form data for one result page and encode it as the POST body.
    data = {}
    data['types'] = 'search'
    data['count'] = '30'
    data['source'] = 'netease'
    data['pages'] = index
    data['name'] = content
    data = urllib.parse.urlencode(data).encode('utf-8')
    return data


def request_music(url, content):
    # set proxy agent
    proxy_support = urllib.request.ProxyHandler({'http': '119.6.144.73:81'})
    opener = urllib.request.build_opener(proxy_support)
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36')]
    urllib.request.install_opener(opener)
    # set proxy agent
    total = []
    pattern = re.compile(r'\(([\s\S]*)\)')  # strip the jQuery JSONP callback wrapper
    for i in range(1, 10):
        data = resemble_data(content, str(i))
        response = urllib.request.urlopen(url, data)
        result = response.read().decode('unicode_escape')
        json_result = pattern.findall(result)
        total.append(json_result)
    return total


def save_music_file(id, name):
    save_path = '/home/roland/Spider/Img/'
    pattern = re.compile('http.*?mp3')
    url = 'http://www.gequdaquan.net/gqss/api.php?callback=jQuery111307210973120745481_1533280033798'
    data = {}
    data['types'] = 'url'
    data['id'] = id
    data['source'] = 'netease'
    data = urllib.parse.urlencode(data).encode('utf-8')
    response = urllib.request.urlopen(url, data)
    music_url_str = response.read().decode('utf-8')
    music_url = pattern.findall(music_url_str)
    result = urllib.request.urlopen(music_url[0])
    file = open(save_path + name + '.mp3', 'wb')
    file.write(result.read())
    file.flush()
    file.close()


def main():
    url = 'http://www.gequdaquan.net/gqss/api.php?callback=jQuery11130967955054499249_1533275477385'
    content = input('please input the artist name:')
    result = request_music(url, content)
    for group in result[0]:
        target = json.loads(group)
        for item in target:
            save_music_file(str(item['id']), str(item['name']))


main()