Scraping the Douban Movie Top 250 and Extracting Genres for Data Analysis

Date: 2019-11-07

Tags (space-separated): python web-scraping

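1. Fetch the page and extract the content we need

Today's target is the Douban Movie Top 250 list.
The page looks like this:
[Screenshot: the Douban Movie Top 250 page]

What we want is the genre list of each movie. Viewing the page source shows where that text sits in the markup, so let's get straight to it.

Now that we know where the content is, we first download the page with Python's requests library, then parse it with lxml and pull out the text we need for the next step.
Here is the code that uses requests and lxml: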

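import requests
from lxml import etree

def get_page(i):
    url = 'https://movie.douban.com/top250?start={}&filter='.format(i)
    # Assumption: Douban tends to reject the default requests User-Agent,
    # so a browser-like one is sent here
    headers = {'User-Agent': 'Mozilla/5.0'}

    html = requests.get(url, headers=headers).content.decode('utf-8')    # download the page with requests

    selector = etree.HTML(html)    # parse the HTML with lxml
    # The text we need is in the <p> elements under <div class="info">
    content = selector.xpath('//div[@class="info"]/div[@class="bd"]/p/text()')
    print(content)

    # Take every second text node; these include each movie's "year / country / genres" line
    for i in content[1::2]:
        print(str(i).strip())
        i = str(i).split('/')
        i = i[len(i) - 1]
        key = i.strip().replace('\n', '').split(' ')  # strip surrounding whitespace and newlines, then split on spaces
        print(key)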
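
To see what that XPath returns without touching the network, it can be run against a small inline document. This is only a sketch: SAMPLE_HTML is a hypothetical, trimmed-down imitation of one list entry, not the real page source.

```python
from lxml import etree

# Hypothetical markup modelled on a single entry of the list page
SAMPLE_HTML = '''
<div class="info">
  <div class="bd">
    <p class="">
      导演: 弗兰克·德拉邦特<br>
      1994 / 美国 / 犯罪 剧情
    </p>
  </div>
</div>
'''

selector = etree.HTML(SAMPLE_HTML)
texts = selector.xpath('//div[@class="info"]/div[@class="bd"]/p/text()')

# The <br> splits the paragraph into two text nodes:
# texts[0] is the director line, texts[1] the "year / country / genres" line
for t in texts:
    print(repr(t.strip()))
```

Because the genre line is the second text node of the paragraph, slicing with [1::2] picks it out.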

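The downloaded text shows that the fields of each movie are separated by '/'. We only need the genres, so we use

i = str(i).split('/')

to split the line into its fields. The genres come last, so

i = i[len(i) - 1]

takes the last field, which is the genre string. One step remains: a movie usually carries several genre tags, and they are separated by a single space, so the following line splits them into individual genres: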

key = i.strip().replace('\n', '').split(' ')
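
The three steps can be folded into one small helper and tried on a sample line. This is a sketch, not the original code: extract_genres is a hypothetical name, the sample string is made up to match the page's format, and it uses split() with no argument instead of split(' ') so that empty strings never appear in the result.

```python
def extract_genres(line):
    """Return the list of genres from a 'year / country / genres' line."""
    last_field = line.split('/')[-1]   # the genres are the last '/'-separated field
    # split() with no argument splits on any run of whitespace and drops empties
    return last_field.split()


print(extract_genres('\n      1994 / 美国 / 犯罪 剧情\n    '))
print(extract_genres('   '))   # a blank line yields an empty list
```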

2. Save the results to MySQL

We store the genres in a MySQL database so that they can be analysed below, and we use pymysql to connect to it. First, the table has to be created in MySQL:
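
The table definition itself is not reproduced in the text. A minimal sketch of a compatible table follows; the column order (id, 類別, 數量) is an assumption inferred from the row[1] / row[2] indexing used later when the table is read back, and the column types and the create_table helper are hypothetical.

```python
# Hypothetical schema: an auto-increment id, then the genre name and its count
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS douban (
    id INT AUTO_INCREMENT PRIMARY KEY,
    `類別` VARCHAR(32) NOT NULL,
    `數量` INT NOT NULL
) DEFAULT CHARSET = utf8
"""


def create_table(cur):
    """Create the douban table through an open DB-API cursor (e.g. from pymysql)."""
    cur.execute(CREATE_TABLE_SQL)
```

The function only needs a cursor, so it can be called right after the connection is opened.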

Then we write the data to the database with pymysql. The code is below.
First, connect to the database:

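import pymysql

# Connect to the MySQL server.
# user is the MySQL account name and passwd its password. Set the charset to utf8,
# otherwise storing Chinese text easily runs into encoding problems.
conn = pymysql.connect(host = 'localhost', user = 'root', passwd = '2014081029', db = 'mysql', charset = 'utf8')
cur = conn.cursor()  # get a cursor
cur.execute('use douban')  # switch to the douban database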

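Before saving anything, one more step is needed: tallying the genres across all 250 movies. We define a dictionary that counts how often each genre occurs. This code is part of the get_page function: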

for i in content[1::2]:
    print(str(i).strip())
    i = str(i).split('/')
    i = i[len(i) - 1]
    key = i.strip().replace('\n', '').split(' ')
    print(key)
    # douban maps each genre to the number of movies tagged with it
    for genre in key:
        if genre not in douban:
            douban[genre] = 1
        else:
            douban[genre] += 1
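
The same tally can be written with collections.Counter from the standard library. This is an alternative sketch rather than the original approach; count_genres and the sample data are hypothetical.

```python
from collections import Counter


def count_genres(genre_lists):
    """Tally how often each genre occurs across all movies."""
    counts = Counter()
    for genres in genre_lists:
        # Skip the empty strings that blank lines leave behind
        counts.update(g for g in genres if g)
    return counts


# Hypothetical sample: one list of genres per movie, plus one blank entry
sample = [['犯罪', '剧情'], ['剧情', '爱情'], ['']]
print(count_genres(sample).most_common())
```

Filtering out empty strings here removes the need for a separate empty-key check when the counts are saved.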

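Next we define a save function that performs the inserts and rolls back if an insert fails. Remember to call cur.close() and conn.close() when everything is done, to release the cursor and the database connection. The code: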

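def save_mysql(douban):
    print(douban)  # douban is the dictionary defined in the main block
    for key in douban:
        print(key)
        print(douban[key])
        if key != '':    # skip the empty key produced by blank lines
            try:
                # Pass the values as parameters so the driver quotes them,
                # instead of building the statement by string concatenation
                sql = 'insert into douban(類別, 數量) values (%s, %s)'
                cur.execute(sql, (key, douban[key]))
                conn.commit()
            except Exception:
                print('Insert failed')
                conn.rollback()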
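
If all rows should succeed or fail together, the inserts can be sent as one batch inside a single transaction. The sketch below assumes a DB-API connection that uses %s placeholders, as pymysql does; save_counts is a hypothetical helper, not part of the original script.

```python
def save_counts(conn, counts):
    """Insert all (genre, count) pairs in one transaction and return the row count.

    conn is any open DB-API connection that uses %s placeholders,
    for example one returned by pymysql.connect().
    """
    rows = [(genre, n) for genre, n in counts.items() if genre]
    cur = conn.cursor()
    try:
        cur.executemany('insert into douban(類別, 數量) values (%s, %s)', rows)
        conn.commit()
    except Exception:
        conn.rollback()   # undo the whole batch if any insert fails
        raise
    finally:
        cur.close()
    return len(rows)
```

Compared with committing after every row, a failure here leaves the table untouched rather than partially filled.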

3. Visualize the data with matplotlib

First we read the genres and their counts from the database into two separate lists, then plot them with matplotlib:

def pylot_show():
    sql = 'select * from douban;'
    cur.execute(sql)
    rows = cur.fetchall()   # read every row of the table
    count = []   # number of occurrences of each genre
    category = []  # genre names

    for row in rows:
        count.append(int(row[2]))
        category.append(row[1])

    y_pos = np.arange(len(category))    # y coordinates of the bars
    plt.barh(y_pos, count, align='center', alpha=0.4)  # alpha is the fill opacity, between 0 and 1
    plt.yticks(y_pos, category)  # label each bar on the y axis with its genre name

    # Separate loop names so the count and y_pos lists are not overwritten
    for c, y in zip(count, y_pos):
        # write the count as a number at the end of each bar
        plt.text(c, y, str(c), horizontalalignment='center', verticalalignment='center', weight='bold')
    plt.ylim(len(category), -1.0) # visible y range; reversed so the first genre is at the top
    plt.title('Douban Movie Top 250')   # chart title
    plt.ylabel('Genre')     # y-axis label
    plt.subplots_adjust(bottom = 0.15)
    plt.xlabel('Number of occurrences')  # x-axis label
    plt.savefig('douban.png')   # save the chart to a file

A few notes on basic matplotlib usage follow. First, import matplotlib and numpy:

import numpy as np
import matplotlib.pyplot as plt

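The chart in this post is a horizontal bar chart, so here is the definition of the barh() function:

barh()
Purpose: draw a horizontal bar chart. Each bar is the rectangle that spans left to left + width horizontally and bottom to bottom + height vertically.
Signature: barh(bottom, width, height=0.8, left=0, **kwargs)
Returns: a container of matplotlib.patches.Rectangle instances, one per bar
(This is the signature from older Matplotlib releases. In current releases the first parameter is named y and align defaults to 'center'.)
Parameters:

bottom: vertical position of the bars
width: length of the bars
Optional parameters:
height: height (thickness) of the bars
left: x coordinate of the left edge of the bars
color: bar fill color
edgecolor: bar edge color
linewidth: bar edge width; None means the default width, 0 means no edge is drawn
xerr: if not None, horizontal error bars are drawn on the bars
yerr: if not None, vertical error bars are drawn on the bars
ecolor: error bar color
capsize: length of the error bar caps
align: 'edge' | 'center'; 'edge' aligns the bottom edge of each bar with its y position, 'center' centers the bar on it
log: [False|True], False by default; if True, a log scale is used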
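
A small standalone call shows how these parameters fit together. This is a sketch with hypothetical sample counts, not real scraping results; it uses the Agg backend so it can write the image without a display, and the output file name barh_example.png is made up.

```python
import matplotlib
matplotlib.use('Agg')  # render to a file without needing a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical sample data
genres = ['Drama', 'Romance', 'Comedy']
counts = [5, 3, 2]

y_pos = np.arange(len(genres))
fig, ax = plt.subplots()
bars = ax.barh(y_pos, counts, height=0.8, align='center', alpha=0.4)
ax.set_yticks(y_pos)
ax.set_yticklabels(genres)
ax.invert_yaxis()  # put the first genre at the top

for bar, n in zip(bars, counts):
    # write the count at the end of each bar, vertically centered on it
    ax.text(bar.get_width(), bar.get_y() + bar.get_height() / 2, str(n), va='center')

ax.set_xlabel('Occurrences')
fig.savefig('barh_example.png')
```

Calling invert_yaxis() has the same effect as passing a reversed range to ylim, without hard-coding the number of bars.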

With that, the chart can be generated.

The full source code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from lxml import etree
import requests
import pymysql
import matplotlib.pyplot as plt
import numpy as np

# Connect to the MySQL server and switch to the douban database
conn = pymysql.connect(host = 'localhost', user = 'root', passwd = '2014081029', db = 'mysql', charset = 'utf8')
cur = conn.cursor()
cur.execute('use douban')

def get_page(i):
    url = 'https://movie.douban.com/top250?start={}&filter='.format(i)
    # Assumption: Douban tends to reject the default requests User-Agent,
    # so a browser-like one is sent here
    headers = {'User-Agent': 'Mozilla/5.0'}

    html = requests.get(url, headers=headers).content.decode('utf-8')

    selector = etree.HTML(html)

    content = selector.xpath('//div[@class="info"]/div[@class="bd"]/p/text()')
    print(content)

    # Take every second text node; these include each movie's "year / country / genres" line
    for i in content[1::2]:
        print(str(i).strip())
        i = str(i).split('/')
        i = i[len(i) - 1]
        key = i.strip().replace('\n', '').split(' ')
        print(key)
        # Count how many movies carry each genre
        for genre in key:
            if genre not in douban:
                douban[genre] = 1
            else:
                douban[genre] += 1

def save_mysql():
    print(douban)
    for key in douban:
        print(key)
        print(douban[key])
        if key != '':    # skip the empty key produced by blank lines
            try:
                # Pass the values as parameters so the driver quotes them
                sql = 'insert into douban(類別, 數量) values (%s, %s)'
                cur.execute(sql, (key, douban[key]))
                conn.commit()
            except Exception:
                print('Insert failed')
                conn.rollback()


def pylot_show():
    sql = 'select * from douban;'
    cur.execute(sql)
    rows = cur.fetchall()
    count = []      # number of occurrences of each genre
    category = []   # genre names

    for row in rows:
        count.append(int(row[2]))
        category.append(row[1])
    print(count)
    y_pos = np.arange(len(category))
    print(y_pos)
    print(category)
    plt.barh(y_pos, count, align='center', alpha=0.4)
    plt.yticks(y_pos, category)
    # Separate loop names so the count and y_pos lists are not overwritten
    for c, y in zip(count, y_pos):
        plt.text(c, y, str(c), horizontalalignment='center', verticalalignment='center', weight='bold')
    plt.ylim(len(category), -1.0)   # reversed so the first genre is at the top
    plt.title('Douban Movie Top 250')
    plt.ylabel('Genre')
    plt.subplots_adjust(bottom = 0.15)
    plt.xlabel('Number of occurrences')
    plt.savefig('douban.png')


if __name__ == '__main__':
    douban = {}   # genre -> number of occurrences
    for i in range(0, 250, 25):   # 10 pages of 25 movies each
        get_page(i)
    # save_mysql()   # enable to write the counts to MySQL; pylot_show() reads them back from the table
    pylot_show()
    cur.close()
    conn.close()
