Python Web Scraping in Practice (3): Scraping Douban Music Top 250 Data (Very Detailed)

Date: 2019-11-30


### Preface

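First, let's recap the previous two articles in this scraping series:

Part 1 covered requests, bs4, and some basic web page operations.

Python Web Scraping in Practice (1): Scraping Rental Listings from 「房天下」 (Very Detailed)

Part 2 used regular expressions through the re module.

Python Web Scraping in Practice (2): Scraping the Novel 「斗羅大陸3龍王傳說」 (Very Detailed)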

Today we'll work through a scraping exercise using the lxml library and XPath syntax.

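**1.** Install the lxml library

Windows: install it directly with pip. Be sure to locate pip's installation path first, so that the command can be found when you run it.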

pip install lxml
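To confirm that the install worked, a quick check like the one below can help. It is only a minimal sketch: it imports `lxml.etree` and prints the installed version tuple, and assumes nothing beyond the package itself.

```python
from lxml import etree

# LXML_VERSION is a tuple of integers describing the installed lxml release.
print("lxml version:", ".".join(str(part) for part in etree.LXML_VERSION))
```

If the import fails with `ModuleNotFoundError`, pip most likely installed the package for a different Python interpreter than the one running the script.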


**2.** XPath syntax

If you're not familiar with XPath syntax, see the tutorial at the address below:

http://www.w3school.com.cn/xpath/index.asp
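For a feel of how XPath looks when used through lxml, here is a small sketch. The HTML snippet, its class names, and the variable names are all made up for illustration; only `etree.HTML` and `.xpath()` come from lxml itself.

```python
from lxml import etree

# Hypothetical sample markup, invented purely for this illustration.
SAMPLE_HTML = """
<html><body>
  <ul class="albums">
    <li class="item"><a href="/a/1">First Album</a><span>9.1</span></li>
    <li class="item"><a href="/a/2">Second Album</a><span>8.7</span></li>
  </ul>
</body></html>
"""

tree = etree.HTML(SAMPLE_HTML)

# //li[@class='item'] selects every <li> with class="item", anywhere in the document.
items = tree.xpath("//li[@class='item']")

# Relative paths are evaluated starting from each matched element.
names = [li.xpath("a/text()")[0] for li in items]     # text of the child <a>
links = [li.xpath("a/@href")[0] for li in items]      # value of the href attribute
second_score = tree.xpath("//li[2]/span/text()")[0]   # XPath positions start at 1, not 0

print(names, links, second_score)
```

Note that `.xpath()` returns a list for node selections such as `text()` and `@href`, which is why each result is indexed with `[0]`.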

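Hands-on scraping

First, a preview of part of the results (screenshot):

Today we'll scrape the "Douban Music Top 250" data.

**1.** Observe how the URL changes from page to page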

https://music.douban.com/top250?start=0

https://music.douban.com/top250?start=25

https://music.douban.com/top250?start=50

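The pattern is clear: the start parameter grows by 25 with each page, so the 250 entries span 10 pages with start = 0, 25, ..., 225.

**2.** Scrape the title, info line, and star rating of each entry. The complete scraper code is as follows: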
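The pattern in the three URLs above can be sketched in code as follows. The names `BASE_URL`, `PAGE_SIZE`, `TOTAL`, and `build_page_urls` are hypothetical helpers introduced here for illustration; the step of 25 and the total of 250 come from the URLs and the Top 250 list size.

```python
BASE_URL = "https://music.douban.com/top250?start={}"
PAGE_SIZE = 25   # the start parameter advances by 25 from one page to the next
TOTAL = 250      # the list is a Top 250

def build_page_urls(total=TOTAL, page_size=PAGE_SIZE):
    """Return one URL per page: start = 0, 25, 50, ... below total."""
    return [BASE_URL.format(start) for start in range(0, total, page_size)]

urls = build_page_urls()
print(len(urls))   # number of pages
print(urls[0])     # first page
print(urls[-1])    # last page
```

Because `range` excludes its upper bound, the last page starts at 225 and no request is made for a nonexistent page at start=250.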

import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'}

def getResult():
    # One URL per page: start = 0, 25, ..., 225
    urls = ["https://music.douban.com/top250?start={}".format(i) for i in range(0, 250, 25)]
    for url in urls:
        data = requests.get(url, headers=headers)
        html = etree.HTML(data.text)
        # Each <tr class="item"> row holds one entry
        rows = html.xpath("//tr[@class='item']")
        for info in rows:
            # normalize-space() returns a single string rather than a list, so it is
            # used directly; zipping it with the lists below would iterate over its
            # characters and truncate the title
            title = info.xpath("normalize-space(td[2]/div/a/text())")  # title
            star = info.xpath("td[2]/div/div/span[2]/text()")  # star rating (a list)
            brief_introduction = info.xpath("td[2]/div/p//text()")  # brief introduction (a list)
            # Skip rows where the rating or the introduction is missing
            if not star or not brief_introduction:
                continue
            # Build the result record from the first match of each field
            result = {
                "title": title,
                "star": star[0],
                "brief_introduction": brief_introduction[0],
            }
            print(result)

if __name__ == '__main__':
    getResult()
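The extraction logic can be tried offline before hitting the site. The sketch below runs the same three XPath expressions against a hypothetical table row that was shaped only to match those expressions; it is not Douban's actual markup, and the album data in it is invented. It also shows why the title needs special handling: `normalize-space()` returns a string, while `text()` returns a list, so passing the string straight to `zip()` pairs the other fields with its first character only.

```python
from lxml import etree

# Hypothetical row, built only to fit the XPath expressions; not real Douban markup.
SAMPLE_ROW = """
<table>
  <tr class="item">
    <td><img src="cover.jpg"/></td>
    <td>
      <div>
        <a href="#">
          Example Album
        </a>
        <p>Example Artist / 2000-01-01 / Album / CD / Pop</p>
        <div><span class="stars"></span><span>9.0</span></div>
      </div>
    </td>
  </tr>
</table>
"""

row = etree.HTML(SAMPLE_ROW).xpath("//tr[@class='item']")[0]

title = row.xpath("normalize-space(td[2]/div/a/text())")  # a string, whitespace collapsed
star = row.xpath("td[2]/div/div/span[2]/text()")          # a list of text nodes
brief = row.xpath("td[2]/div/p//text()")                  # a list of text nodes

# Zipping the string directly iterates over its characters, so the title is cut short.
truncated = next(zip(star, title, brief))

# Using the string as-is and indexing the lists gives the full record.
record = {"title": title, "star": star[0], "brief_introduction": brief[0]}

print(truncated)
print(record)
```

Wrapping the title in a one-element list before zipping, as the original script does, avoids the truncation too; indexing the lists directly is simply a shorter way to reach the same record.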
