Python Web Crawler (11): Scraping Box Office or Popularity Data for Recent Films

Date: 2019-12-04

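Goal

To understand how some of the data on a dynamic website is obtained, this post works through a simple analysis.

Notes

The approach and the original code come from: https://book.douban.com/subject/27061630/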

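Building the downloader

The downloader fetches raw pages. It is used both for the original HTML page and, on dynamic pages, for the responses of the JS requests.

Imitating a browser with a reasonable request header is enough to retrieve the page content.

The code is as follows: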

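import requests
import chardet
class HtmlDownloader(object):
    def download(self,url):
        # Fetch a page and return its text, or None on failure.
        if url is None:
            return None
        # Send a browser-like User-Agent so the request looks like a normal visit.
        user_agent='Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0'
        headers={'User-Agent':user_agent}
        r=requests.get(url,headers=headers)
        # Compare with ==; `is` tests object identity, not numeric equality.
        if r.status_code == 200:
            # Let chardet guess the encoding from the raw bytes before decoding.
            r.encoding=chardet.detect(r.content)['encoding']
            return r.text
        return None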

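Building the parser

The parser is used to parse the downloaded data.

The box office figures, film titles and similar fields are all extracted by the parser.

The dynamic data being parsed comes from the JS responses.

The JS URL is found by opening the F12 developer console --> Network --> JS and observing the requests.

For example, the URL for a film currently showing: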

http://service.library.mtime.com/Movie.api?Ajax_CallBack=true&Ajax_CallBackType=Mtime.Library.Services&Ajax_CallBackMethod=GetMovieOverviewRating&Ajax_CrossDomain=1&Ajax_RequestUrl=http://movie.mtime.com/257982/&t=201907121611461266&Ajax_CallBackArgument0=257982
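
To make the structure of this URL easier to see, here is a minimal sketch that splits it into its query parameters with the standard library. It is not part of the original code; the names `js_url` and `params` are chosen only for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# The JS URL observed in the browser's network panel (copied from above).
js_url = ('http://service.library.mtime.com/Movie.api'
          '?Ajax_CallBack=true'
          '&Ajax_CallBackType=Mtime.Library.Services'
          '&Ajax_CallBackMethod=GetMovieOverviewRating'
          '&Ajax_CrossDomain=1'
          '&Ajax_RequestUrl=http://movie.mtime.com/257982/'
          '&t=201907121611461266'
          '&Ajax_CallBackArgument0=257982')

# parse_qs maps each parameter name to a list of its values.
params = parse_qs(urlsplit(js_url).query)
for name, values in params.items():
    print(name, '=', values[0])
```

Three of the parameters change from film to film: `Ajax_RequestUrl` (the film page), `Ajax_CallBackArgument0` (the film ID, here 257982) and `t`, which the main function further down fills with the current time plus a fixed suffix.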

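From the returned content, the JSON part is extracted, and the box office and other fields are then read out with the json module.

An online JSON viewer such as https://www.json.cn/ helps when inspecting the structure.

Films that have not been released yet are handled in the same way.

The two kinds of data are laid out differently, so separate function branches handle the different cases that can come up during parsing.

The code is as follows: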

import re
import json
class HtmlParser(object):
    def parser_url(self,page_url,response):
        # Return de-duplicated (film page URL, film ID) tuples found in the page.
        if response is None:
            return []
        pattern=re.compile(r'(http://movie\.mtime\.com/(\d+)/)')
        urls=pattern.findall(response)
        return list(set(urls))

    def parser_json(self,url,response):
        # url is the JS URL, response is the text it returned.
        # Capture the text between the first '=' and the next ';',
        # which is expected to be the JSON part.
        pattern=re.compile(r'=(.*?);')
        match=pattern.search(response)
        if match is None:
            return None
        value=json.loads(match.group(1))
        # Released and unreleased films carry different fields, so branch here.
        if value.get('value').get('isRelease'):
            return self.parser_json_release(value,url)
        return self.parser_json_notRelease(value,url)

    def parser_json_release(self,value,url):
        isRelease=1
        movieTitle=value.get('value').get('movieTitle')
        RatingFinal=value.get('value').get('movieRating').get('RatingFinal')
        try:
            TotalBoxOffice=value.get('value').get('boxOffice').get('TotalBoxOffice')
            TotalBoxOfficeUnit=value.get('value').get('boxOffice').get('TotalBoxOfficeUnit')
        except AttributeError:
            # No boxOffice entry: .get('boxOffice') returned None.
            TotalBoxOffice="None"
            TotalBoxOfficeUnit="None"
        return isRelease,movieTitle,RatingFinal,TotalBoxOffice,TotalBoxOfficeUnit,url

    def parser_json_notRelease(self,value,url):
        isRelease=0
        movieTitle=value.get('value').get('movieTitle')
        try:
            # Unreleased films have no rating yet, so the hot-value ranking is stored instead.
            RatingFinal=value.get('value').get('hotValue').get('Ranking')
        except AttributeError:
            RatingFinal=-1
        TotalBoxOffice='None'
        TotalBoxOfficeUnit='None'
        return isRelease,movieTitle,RatingFinal,TotalBoxOffice,TotalBoxOfficeUnit,url
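
The JSON extraction step can be tried offline. The sketch below runs the same regular expression on a made-up response. The `var result = {...};` wrapper and all the values are assumptions for illustration; only the field names are taken from the parser above.

```python
import json
import re

# A made-up response; the real service's exact wrapper text is not shown in this post.
sample = ('var result = {"value": {"isRelease": true, "movieTitle": "Example Film", '
          '"movieRating": {"RatingFinal": 7.5}, '
          '"boxOffice": {"TotalBoxOffice": "1.23", "TotalBoxOfficeUnit": "million"}}};')

# Non-greedy match: everything between the first '=' and the next ';'.
match = re.search(r'=(.*?);', sample)
data = json.loads(match.group(1))

info = data['value']
# Fall back to an empty dict when the film has no boxOffice entry.
box_office = info.get('boxOffice') or {}
print(info['movieTitle'], info['movieRating']['RatingFinal'],
      box_office.get('TotalBoxOffice'), box_office.get('TotalBoxOfficeUnit'))
```

Because the pattern stops at the first semicolon, a response whose JSON strings contain a `;` would be cut short and `json.loads` would raise an error. The crawler catches such failures in its main loop.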

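Building the storage component

Storage uses SQLite, which is why the parser records isRelease as 0 or 1.

Storing the data involves connecting through sqlite3, creating the database, obtaining a way to execute SQL statements, inserting rows, and so on.

Following the original author's approach, records are first buffered in memory. Once more than 10 have accumulated, the buffered records are inserted into the SQLite database.

The code is as follows: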

import sqlite3
class DataOutput(object):
    def __init__(self):
        self.cx=sqlite3.connect("MTime.db")
        self.create_table('MTime')
        self.datas=[]  # in-memory buffer of rows waiting to be written

    def create_table(self,table_name):
        values='''
        id integer primary key autoincrement,
        isRelease boolean not null,
        movieTitle varchar(50) not null,
        RatingFinal_HotValue real not null default 0.0,
        TotalBoxOffice varchar(20),
        TotalBoxOfficeUnit varchar(10),
        sourceUrl varchar(300)
        '''
        self.cx.execute('create table if not exists %s(%s)' %(table_name,values))

    def store_data(self,data):
        if data is None:
            return
        self.datas.append(data)
        # Flush to the database once more than 10 rows are buffered.
        if len(self.datas)>10:
            self.output_db('MTime')

    def output_db(self,table_name):
        # Use ? placeholders so sqlite3 handles quoting of titles and None values.
        # A table name cannot be a bound parameter, so it is formatted in.
        cmd="insert into %s (isRelease,movieTitle,RatingFinal_HotValue,TotalBoxOffice,TotalBoxOfficeUnit,sourceUrl) values (?,?,?,?,?,?)" % table_name
        self.cx.executemany(cmd,self.datas)
        self.cx.commit()
        # Empty the buffer in one step; removing items while iterating over the list skips rows.
        self.datas=[]

    def output_end(self):
        # Write any remaining rows, then close the connection.
        if len(self.datas)>0:
            self.output_db('MTime')
        self.cx.close()
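
The buffer-then-flush idea can be checked in isolation. The following is a minimal sketch, assuming an in-memory database and a two-column table named `demo`. `BATCH_SIZE`, `store` and `flush` are hypothetical names for illustration and do not appear in the original code.

```python
import sqlite3

BATCH_SIZE = 10  # flush once the buffer holds more than this many rows

conn = sqlite3.connect(':memory:')  # in-memory database, for illustration only
conn.execute('create table demo (isRelease integer, movieTitle text)')

buffer = []

def flush():
    # Placeholders let sqlite3 do the quoting, so titles containing quotes are safe.
    conn.executemany('insert into demo (isRelease, movieTitle) values (?, ?)', buffer)
    conn.commit()
    buffer.clear()

def store(row):
    buffer.append(row)
    if len(buffer) > BATCH_SIZE:
        flush()

for i in range(25):
    store((i % 2, "Film '%d'" % i))
flush()  # write whatever is left over at the end

count = conn.execute('select count(*) from demo').fetchone()[0]
print(count)  # 25
```

Clearing the buffer in one step after the commit matters: removing items from a list while looping over that same list skips every other element, so some rows would never be written.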

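Main function

The objects above are created during initialization.

The crawler then fetches the root URL and finds a hundred-odd film URLs on that page.

Each film URL is parsed in turn and the result is stored.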

# These imports assume each class is saved in a module file of the same name.
import HtmlDownloader
import HtmlParser
import DataOutput
import time
class Spider(object):
    def __init__(self):
        self.downloader=HtmlDownloader.HtmlDownloader()
        self.parser=HtmlParser.HtmlParser()
        self.output=DataOutput.DataOutput()

    def crawl(self,root_url):
        content=self.downloader.download(root_url)
        urls=self.parser.parser_url(root_url, content)
        for url in urls:
            print('.')
            # Timestamp parameter: the current time followed by a fixed suffix.
            t=time.strftime("%Y%m%d%H%M%S1266",time.localtime())
            # url is a (film page URL, film ID) tuple from parser_url.
            rank_url='http://service.library.mtime.com/Movie.api'\
            '?Ajax_CallBack=true'\
            '&Ajax_CallBackType=Mtime.Library.Services'\
            '&Ajax_CallBackMethod=GetMovieOverviewRating'\
            '&Ajax_CrossDomain=1'\
            '&Ajax_RequestUrl=%s'\
            '&t=%s'\
            '&Ajax_CallBackArgument0=%s' %(url[0],t,url[1])
            rank_content=self.downloader.download(rank_url)
            try:
                data=self.parser.parser_json(rank_url, rank_content)
            except Exception:
                # Report the failing URL and skip it, so no stale or undefined data is stored.
                print(rank_url)
                continue
            self.output.store_data(data)

        self.output.output_end()
        print('ed')
if __name__=='__main__':
    spider=Spider()
    spider.crawl('http://theater.mtime.com/China_Beijing/')
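
The first step of the crawl, finding film URLs on the root page, can also be tried without any network access. This is a minimal sketch on a made-up HTML fragment; the markup and the film ID 123456 are invented, and only the URL pattern follows the parser.

```python
import re

# A made-up fragment of a listing page, with one film linked twice.
html = '''
<a href="http://movie.mtime.com/257982/">Film A</a>
<a href="http://movie.mtime.com/123456/">Film B</a>
<a href="http://movie.mtime.com/257982/">Film A again</a>
'''

# Two groups: the whole film page URL, and the numeric film ID inside it.
pattern = re.compile(r'(http://movie\.mtime\.com/(\d+)/)')

# findall returns one (url, id) tuple per match; set() drops the duplicate.
found = sorted(set(pattern.findall(html)))
for page_url, film_id in found:
    print(page_url, film_id)
```

Each result is a two-element tuple, which is why the main loop reads `url[0]` for the film page and `url[1]` for the film ID.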