I. Open the IMDB Top 250 movie chart and you can see all 250 movie entries, including the title, rating, and other data.
Press F12 to open the browser's developer tools and locate the HTML structure that holds this data, as shown below.
Each entry contains a link; clicking it opens the movie's detail page, where the director, writers, and cast are listed.
Inspect the HTML structure in the same way to find the nodes holding this information.
The complete cast information can be viewed in the cast section (full credits) of the detail page.
HTML page structure
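The walk-through below only scrapes the chart page, but for reference, a minimal sketch of pulling cast names from a movie's full credits page could look like the following. The fullcredits URL and the table.cast_list selector are assumptions about IMDB's markup and may need adjusting if the page layout has changed:

    import requests
    from bs4 import BeautifulSoup

    def get_cast(movie_link):
        # movie_link comes from the chart page, e.g. '/title/tt0111161/'
        detail_url = 'https://www.imdb.com' + movie_link.split('?')[0] + 'fullcredits'
        response = requests.get(detail_url)
        if response.status_code != 200:
            return []
        soup = BeautifulSoup(response.text, 'lxml')
        # assumed markup: actor names sit in the cell after the photo cell
        # of each row in <table class="cast_list">
        return [a.get_text(strip=True)
                for a in soup.select('table.cast_list td.primary_photo + td a')]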
Having analyzed all of the data to be scraped, let's now fetch the 250 movie entries from the chart page.
1. Libraries used by the crawler
    import re
    import pymysql
    import json
    import requests
    from bs4 import BeautifulSoup
    from requests.exceptions import RequestException
2. Request the HTML of the chart page (if the request is blocked, you can add the appropriate headers; a variant is sketched after the code below) and return the page content.
    def get_html(url):
        response = requests.get(url)
        if response.status_code == 200:  # check whether the request succeeded
            return response.text
        else:
            return None
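If the bare request gets rejected, a variant of get_html with a User-Agent header and the RequestException handling that the imports above already anticipate could look like this (the header string is only an example value):

    def get_html(url):
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
        try:
            response = requests.get(url, headers=headers, timeout=10)
        except RequestException:
            return None
        if response.status_code == 200:  # request succeeded
            return response.text
        return None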
3. Parse the HTML
    def parse_html(html):
        # extract the movie data from the chart page
        soup = BeautifulSoup(html, 'lxml')
        movies = soup.select('tbody tr')
        for movie in movies:
            poster = movie.select_one('.posterColumn')
            score = poster.select_one('span[name="ir"]')['data-value']
            movie_link = movie.select_one('.titleColumn').select_one('a')['href']  # link to the movie detail page
            year_str = movie.select_one('.titleColumn').select_one('span').get_text()
            year_pattern = re.compile(r'\d{4}')
            year = int(year_pattern.search(year_str).group())
            id_pattern = re.compile(r'(?<=tt)\d+(?=/?)')
            # do not rely on an auto-generated id; extract the unique ID from the link itself
            movie_id = int(id_pattern.search(movie_link).group())
            movie_name = movie.select_one('.titleColumn').select_one('a').string
            # use a generator (yield) to emit one movie record at a time
            yield {
                'movie_id': movie_id,
                'movie_name': movie_name,
                'year': year,
                'movie_link': movie_link,
                'movie_rate': float(score)
            }
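To see what the id_pattern lookbehind actually captures, take the real chart link for The Shawshank Redemption as a quick check (reusing the re import from step 1):

    link = '/title/tt0111161/'                    # The Shawshank Redemption
    id_pattern = re.compile(r'(?<=tt)\d+(?=/?)')
    print(int(id_pattern.search(link).group()))   # prints 111161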
4. Save the results to a txt file
    def write_file(content):
        with open('movie12.txt', 'a', encoding='utf-8') as f:
            f.write(json.dumps(content, ensure_ascii=False) + '\n')

    def main():
        url = 'https://www.imdb.com/chart/top'
        html = get_html(url)
        for item in parse_html(html):
            write_file(item)

    if __name__ == '__main__':
        main()
5. The scraped data can now be seen in movie12.txt.
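Since every line of movie12.txt holds one standalone JSON object, reading the results back is a single json.loads per line; a quick sketch:

    import json

    with open('movie12.txt', encoding='utf-8') as f:
        movies = [json.loads(line) for line in f if line.strip()]
    print(len(movies), movies[0]['movie_name'])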
6. Once that works, modify the code to save the data to MySQL instead. Navicat makes the database side very convenient. First connect to MySQL:
    db = pymysql.connect(host="localhost", user="root", password="********", db="imdb_movie")
    cursor = db.cursor()
Create the data table:
    CREATE TABLE `top_250_movies` (
      `id` int(11) NOT NULL,
      `name` varchar(45) NOT NULL,
      `year` int(11) DEFAULT NULL,
      `rate` float NOT NULL,
      PRIMARY KEY (`id`)
    )
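If you would rather skip Navicat for this step, the same statement can be executed once from Python with the cursor opened above; adding IF NOT EXISTS keeps it safe to re-run (a sketch building on the connection from the previous step):

    create_sql = """
    CREATE TABLE IF NOT EXISTS `top_250_movies` (
      `id` int(11) NOT NULL,
      `name` varchar(45) NOT NULL,
      `year` int(11) DEFAULT NULL,
      `rate` float NOT NULL,
      PRIMARY KEY (`id`)
    )
    """
    cursor.execute(create_sql)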
Next, modify the code to insert the data into the table:
    def store_movie_data_to_db(movie_data):
        # use parameterized queries so titles with quotes (e.g. "Schindler's List") do not break the SQL
        sel_sql = "SELECT * FROM top_250_movies WHERE id = %s"
        try:
            cursor.execute(sel_sql, (movie_data['movie_id'],))
            result = cursor.fetchall()
        except Exception:
            print("Failed to fetch data")
            return

        if len(result) == 0:
            sql = "INSERT INTO top_250_movies (id, name, year, rate) VALUES (%s, %s, %s, %s)"
            try:
                cursor.execute(sql, (movie_data['movie_id'],
                                     movie_data['movie_name'],
                                     movie_data['year'],
                                     movie_data['movie_rate']))
                db.commit()
                print("movie data ADDED to DB table top_250_movies!")
            except Exception:
                # roll back on error
                db.rollback()
        else:
            print("This movie ALREADY EXISTED!!!")
Run it:
    def main():
        url = 'https://www.imdb.com/chart/top'
        html = get_html(url)
        for item in parse_html(html):
            store_movie_data_to_db(item)

    if __name__ == '__main__':
        main()
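The script above never explicitly closes the MySQL connection; one small addition (not in the original post) is to wrap the entry point so the connection is released once main finishes:

    if __name__ == '__main__':
        try:
            main()
        finally:
            cursor.close()  # release the cursor opened earlier
            db.close()      # close the MySQL connection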
Check Navicat and you can see the data saved to MySQL.