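In the previous part, "[Python Data Analysis in Practice] Movie Box-Office Data Analysis (1): Data Collection", we collected box-office data from 2011 to the present and stored it in MySQL.
This article works through how to extract that data from MySQL and turn it into dynamic visualizations.
For the first chart, we want to see the monthly box-office trend. A line chart is the obvious choice, with nearly 10 years of box-office data shown on a single chart.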
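Data extraction:
The collected box-office data is recorded per day, and we only look at regular showings and preview screenings (點映); other showings, such as re-releases, are excluded from these statistics.
So we first filter the MySQL data on the releaseInfo field, then group and aggregate by release year and month to get the monthly box office for the 10 years.
After fetching the data with SQL, we put each year's data into its own list. The raw values are str in units of 10,000 yuan (萬); here we convert them to float in units of 100 million yuan (億).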
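As a minimal sketch of the conversion step described above: the rows below are made-up values in the `(year, month, monthbox)` shape the query returns, and `group_by_year` is a hypothetical helper name, not part of the original code.

```python
from collections import defaultdict

# Hypothetical rows shaped like the query result: (year, month, monthbox).
# The figures are invented for illustration and are in units of 10,000 yuan.
rows = [
    ("2011", "01", "95000.50"),
    ("2011", "02", "88000.00"),
    ("2012", "01", "120000.25"),
]


def group_by_year(rows):
    """Return {year: [monthly box office in 100 million yuan, ...]}."""
    by_year = defaultdict(list)
    for year, _month, monthbox in rows:
        # float() accepts str as well as numeric input;
        # 10,000 units of 10,000 yuan make one unit of 100 million yuan.
        by_year[year].append(round(float(monthbox) / 10000, 2))
    return dict(by_year)


yearly = group_by_year(rows)
print(yearly)
```

Using `float()` rather than `int()` keeps the fractional part of the raw value before the division.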
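Building the chart:
the x-axis data is the month,
each year's box-office data is then added to the y-axis as its own series,
and finally we configure the chart's properties.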
import pymysql
from pyecharts import options as opts
from pyecharts.charts import Line

config = {...}  # DB config omitted
conn = pymysql.connect(**config)
cursor = conn.cursor()
# '上映' (in theaters) and '點映' (preview screening) are values stored in releaseInfo
sql = '''
select substr(`date`,1,4) year,
       substr(`date`,5,2) month,
       round(sum(`boxInfo`),2) monthbox
from movies_data
where (substr(`releaseInfo`,1,2) = '上映' or `releaseInfo`='點映')
group by year, month
order by year, month
'''
cursor.execute(sql)
data = cursor.fetchall()
cursor.close()
conn.close()

# x-axis: the distinct months, sorted numerically, as strings
x_data = [str(m) for m in sorted({int(i[1]) for i in data})]

# y-axis: one list per year, converted from 10,000 yuan to 100 million yuan
years = [str(y) for y in range(2011, 2020)]
y_data = {y: [round(float(i[2]) / 10000, 2) for i in data if i[0] == y] for y in years}


def line_base() -> Line:
    c = Line(init_opts=opts.InitOpts(height="600px", width="1300px"))
    c.add_xaxis(x_data)
    for year in years:
        c.add_yaxis(year, y_data[year])
    c.set_global_opts(
        title_opts=opts.TitleOpts(title="Monthly box-office trend"),
        legend_opts=opts.LegendOpts(
            type_="scroll", pos_top="55%", pos_left="95%", orient="vertical"),
        xaxis_opts=opts.AxisOpts(
            axistick_opts=opts.AxisTickOpts(is_align_with_label=True),
            boundary_gap=False,
        ),
    )
    c.set_series_opts(
        label_opts=opts.LabelOpts(is_show=False),  # hide the value labels on the points
        markline_opts=opts.MarkLineOpts(
            data=[opts.MarkLineItem(type_="max", name="Max")]),
    )
    # extra axes that carry the axis names
    c.extend_axis(
        yaxis=opts.AxisOpts(name="Box office (100 million yuan)", position='left'),
        xaxis=opts.AxisOpts(name="Month"),
    )
    return c


line_base().render("v1.html")
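One thing to watch in the code above: each year's list is matched to the x-axis purely by position, so a year that lacks a month in the middle would shift its later values to the left. The sketch below pads every year to 12 slots; `pad_months` and the sample rows are hypothetical, and it assumes the charting library draws `None` as a gap rather than as zero.

```python
def pad_months(rows, year):
    """Return 12 values for `year`, one per month, with None for missing months."""
    by_month = {
        int(month): round(float(box) / 10000, 2)
        for y, month, box in rows
        if y == year
    }
    return [by_month.get(month) for month in range(1, 13)]


# Hypothetical rows (year, month, monthbox in 10,000 yuan); February is missing.
rows = [("2019", "01", 330000.0), ("2019", "03", 410000.0)]
padded = pad_months(rows, "2019")
print(padded)
```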
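From this chart we can see:
1. Total box office has grown steadily over the past 10 years (which is stating the obvious).
2. From 2011 to 2013 the monthly box office fluctuated very little, with almost no distinct peak seasons. The peak seasons are most pronounced in the last two years, concentrated around Spring Festival, the summer holiday, and the National Day holiday (October 1).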
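For the second chart, we want to see how box office, the number of films released, and the number of admissions change year by year.
Data extraction:
first keep only the rows whose releaseInfo marks a regular showing or a preview screening,
then group by year, that is, by the first 4 characters of the date field.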
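The filter-then-group step can be tried without a MySQL server. The sketch below runs the same kind of query against an in-memory SQLite table; the table contents are invented, the `'上映2天'` and `'重映'` values are assumptions about what releaseInfo may contain, and SQLite stands in for MySQL only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table movies_data (date text, movieId text, releaseInfo text, boxInfo real)"
)
# Invented sample rows; boxInfo is in units of 10,000 yuan.
conn.executemany(
    "insert into movies_data values (?, ?, ?, ?)",
    [
        ("20180216", "m1", "上映首日", 30000.0),
        ("20180217", "m1", "上映2天", 25000.0),
        ("20180216", "m2", "點映", 5000.0),
        ("20180301", "m3", "重映", 800.0),  # re-release: filtered out
        ("20190205", "m4", "上映首日", 40000.0),
    ],
)
sql = """
select substr(date, 1, 4) as year,
       round(sum(boxInfo) / 10000, 2) as box,
       count(distinct movieId) as films
from movies_data
where substr(releaseInfo, 1, 2) = '上映' or releaseInfo = '點映'
group by year
order by year
"""
yearly_rows = conn.execute(sql).fetchall()
conn.close()
print(yearly_rows)
```

The re-release row is dropped by the `where` clause, so only films `m1` and `m2` count toward 2018.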
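Building the chart:
Since all three series share the year as their x-axis, they can be shown on a single chart. To make the chart easier to read, one series is drawn as a bar chart and the other two as line charts.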
import pymysql
from pyecharts import options as opts
from pyecharts.charts import Bar, Grid, Line

config = {...}  # DB config omitted
conn = pymysql.connect(**config)
cursor = conn.cursor()
sql2 = '''
select substr(date,1,4),
       round(sum(boxInfo)/10000,2),
       count(DISTINCT movieId),
       round(sum(avgShowView*showInfo)/100000000,2)
from movies_data
where (substr(`releaseInfo`,1,2) = '上映' or `releaseInfo`='點映')
group by substr(date,1,4)
'''
cursor.execute(sql2)
data2 = cursor.fetchall()
cursor.close()
conn.close()

x_data2 = [i[0] for i in data2]    # year
y_data2_1 = [i[1] for i in data2]  # total box office (100 million yuan)
y_data2_2 = [i[2] for i in data2]  # number of films released
y_data2_3 = [i[3] for i in data2]  # admissions (100 million)


def bar_base() -> Grid:
    c = (
        Line()
        .add_xaxis(x_data2)
        .add_yaxis("Total box office", y_data2_1, yaxis_index=0)
        .add_yaxis("Films released", y_data2_2, color='LimeGreen', yaxis_index=0)
        .set_global_opts(
            title_opts=opts.TitleOpts(
                title="Annual box office, films released, and admissions"),
            legend_opts=opts.LegendOpts(pos_left="40%"),
        )
        .extend_axis(
            yaxis=opts.AxisOpts(
                name="Box office / films (100 million yuan / count)", position='left'))
        .extend_axis(
            # name, type, and position of the right-hand y-axis
            yaxis=opts.AxisOpts(
                name="Admissions (100 million)", type_="value", position="right",
                axisline_opts=opts.AxisLineOpts(
                    linestyle_opts=opts.LineStyleOpts(color="#483D8B")),
            ))
    )
    bar = (
        Bar()
        .add_xaxis(x_data2)
        .add_yaxis("Admissions", y_data2_3, yaxis_index=2, category_gap="1%",
                   label_opts=opts.LabelOpts(position="inside"))
    )
    c.overlap(bar)
    # adjust the chart position inside the grid
    return Grid().add(c, opts.GridOpts(pos_left="10%", pos_top='20%'),
                      is_control_axis_index=True)


bar_base().render("v2.html")
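From this chart we can see:
(The 2019 figures drop because the statistics were taken in late October 2019, before a full year of data was available.)
1. The number of films released grew only slightly, while box office and admissions grew at similar rates. The main driver of the year-on-year box-office growth is therefore the growth in admissions, and the average annual ticket price probably changed little.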
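The ticket-price inference can be checked directly: dividing a year's box office (in 100 million yuan) by its admissions (in 100 million) gives the average price per ticket in yuan. The sketch below uses invented figures in the same column order as the yearly query above; `average_ticket_price` is a hypothetical helper.

```python
def average_ticket_price(box_yi, admissions_yi):
    """Average ticket price in yuan: box office (100M yuan) / admissions (100M)."""
    if not admissions_yi:
        return None  # avoid dividing by zero when a year has no admissions data
    return round(float(box_yi) / float(admissions_yi), 2)


# Invented rows in the query's column order: (year, box office, films, admissions).
sample = [("2017", 500.0, 400, 14.0), ("2018", 600.0, 420, 17.0)]
prices = {
    year: average_ticket_price(box, admissions)
    for year, box, _films, admissions in sample
}
print(prices)
```

If the values per year stay close to each other on the real data, that supports the reading that admissions, not prices, drive the growth.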
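Films stay in theaters for different lengths of time, which also affects their box office, so in this chart we look at each film's total box office and its average daily box office.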
import pymysql
from pyecharts import options as opts
from pyecharts.charts import Bar, Grid

config = {...}  # DB config omitted
conn = pymysql.connect(**config)
cursor = conn.cursor()
# a: per-film totals; b: release month taken from each film's opening-day row ('上映首日')
sql2 = '''
select a.*, b.releasemonth
from (select movieid,
             moviename,
             round(sum(boxinfo)/10000,2) sumBox,
             count(movieid) releasedays,
             round(sum(boxinfo)/count(movieid)/10000,2) avgdaybox
      from movies_data
      where (substr(`releaseInfo`,1,2) = '上映' or `releaseInfo`='點映')
      group by movieid, moviename) a,
     (select substr(date,5,2) releasemonth, movieId, movieName, releaseInfo
      from movies_data
      where releaseInfo='上映首日') b
where a.movieid = b.movieid
order by sumBox desc
'''
cursor.execute(sql2)
data3 = cursor.fetchall()
cursor.close()
conn.close()

top30 = data3[:30]
x_data3 = [i[1] for i in top30]         # film name
y_data3_1 = [i[2] for i in top30]       # total box office
y_data3_2 = [i[4] for i in top30]       # average daily box office
y_data3_3 = [int(i[5]) for i in top30]  # release month


def bar_base() -> Grid:
    c = (
        Bar(init_opts=opts.InitOpts(height="600px", width="1500px"))
        .add_xaxis(x_data3)
        .add_yaxis("Total box office", y_data3_1, yaxis_index=0)
        .set_global_opts(
            title_opts=opts.TitleOpts(title="Total and average daily box office per film"),
            xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=-45)),
            datazoom_opts=opts.DataZoomOpts(),
        )
        .set_series_opts(
            label_opts=opts.LabelOpts(is_show=False),  # hide the value labels on the bars
            markpoint_opts=opts.MarkPointOpts(
                data=[opts.MarkPointItem(type_="max", name="Max"),
                      opts.MarkPointItem(type_="min", name="Min")]),
        )
        .extend_axis(
            yaxis=opts.AxisOpts(name="100 million yuan", position='left'))
        .extend_axis(
            # name, type, and position of the right-hand y-axis
            yaxis=opts.AxisOpts(
                name="100 million yuan", type_="value", position="right",
                axisline_opts=opts.AxisLineOpts(
                    linestyle_opts=opts.LineStyleOpts(color="#483D8B")),
            ))
    )
    bar = (
        Bar(init_opts=opts.InitOpts(height="600px", width="1500px"))
        .add_xaxis(x_data3)
        .add_yaxis("Average daily box office", y_data3_2, yaxis_index=2, gap='-40%')
        .set_global_opts(
            title_opts=opts.TitleOpts(title="Total and average daily box office per film"))
        .set_series_opts(
            label_opts=opts.LabelOpts(is_show=False),  # hide the value labels on the bars
            markpoint_opts=opts.MarkPointOpts(
                data=[opts.MarkPointItem(type_="max", name="Max"),
                      opts.MarkPointItem(type_="min", name="Min")]),
            markline_opts=opts.MarkLineOpts(
                data=[opts.MarkLineItem(type_="average", name="Average")]),
        )
    )
    c.overlap(bar)
    # adjust the chart position inside the grid
    return Grid().add(c, opts.GridOpts(pos_left="5%", pos_right="20%"),
                      is_control_axis_index=True)


bar_base().render("v3.html")
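We can see that some films have only a modest total box office but a very high daily average, which means they were not in theaters for long but were very popular while they ran.
For films with a high total but a modest daily average, the likely reason is a long run, in which low attendance late in the run pulled the daily average down.
So when judging how popular a film is, total box office is only one side; how attendance changes over the same length of run matters as well.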
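A small worked example of why the two rankings can disagree, using invented films and a hypothetical `daily_average` helper that mirrors the `sum(boxinfo)/count(movieid)` expression in the query:

```python
def daily_average(total_box, release_days):
    """Average daily box office: total divided by the number of days in theaters."""
    return round(total_box / release_days, 2)


# Invented films: (name, total box office in 100 million yuan, days in theaters).
films = [("Film A", 10.0, 20), ("Film B", 30.0, 90)]

by_total = sorted(films, key=lambda f: f[1], reverse=True)
by_daily = sorted(films, key=lambda f: daily_average(f[1], f[2]), reverse=True)

print([f[0] for f in by_total])  # ranked by total box office
print([f[0] for f in by_daily])  # ranked by average daily box office
```

Film B leads on total box office, but Film A's shorter run gives it the higher daily average.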
This chart supplements the first chart; it mainly shows how high-grossing films relate to their release dates.
import pygal
import pymysql


def dayformat(i):
    """Map a row's release month and day to an x position on a month axis."""
    mm = int(i[-2])  # release month
    dd = int(i[-1])  # release day
    # the day becomes the fractional part, so dates within a month spread out along x
    return mm + dd / 100 * 3.3


config = {...}  # DB config omitted
conn = pymysql.connect(**config)
cursor = conn.cursor()
# a: per-film totals; b: release year/month/day from each film's opening-day row ('上映首日')
sql2 = '''
select a.*, b.releaseyear, b.releasemonth, b.releaseday
from (select movieid,
             moviename,
             round(sum(boxinfo)/10000,2) sumBox,
             count(movieid) releasedays,
             round(sum(boxinfo)/count(movieid)/10000,2) avgdaybox
      from movies_data
      where (substr(`releaseInfo`,1,2) = '上映' or `releaseInfo`='點映')
      group by movieid, moviename) a,
     (select substr(date,1,4) releaseyear,
             substr(date,5,2) releasemonth,
             substr(date,7,2) releaseday,
             movieId,
             movieName,
             releaseInfo
      from movies_data
      where releaseInfo='上映首日') b
where a.movieid = b.movieid
order by sumBox desc
'''
cursor.execute(sql2)
data4 = cursor.fetchall()
cursor.close()
conn.close()

my_config = pygal.Config()       # create a Config instance
my_config.show_y_guides = False  # hide the horizontal guide lines
my_config.show_x_guides = True
xy_chart = pygal.XY(stroke=False, config=my_config)
xy_chart.title = 'Box office per film vs. release month'
# one scatter series per year: (x position from the release date, total box office)
for year in [str(y) for y in range(2011, 2020)]:
    xy_chart.add(year, [(dayformat(i), i[2]) for i in data4 if i[-3] == year])
xy_chart.render_to_file("v4.svg")
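The `dayformat` helper in the code above is what places each film on the x-axis: the month is the integer part and the day is scaled into the fractional part. The sketch below restates the same formula with explicit arguments (`month_day_to_x` is a hypothetical name) to show where a few dates land.

```python
def month_day_to_x(month, day):
    """Same formula as dayformat: month plus the day scaled by 3.3 / 100."""
    return month + day / 100 * 3.3


x_feb_1 = month_day_to_x(2, 1)    # early February, just right of the 2 tick
x_feb_28 = month_day_to_x(2, 28)  # late February, just left of the 3 tick
x_jan_31 = month_day_to_x(1, 31)  # the 31st lands slightly past the next month's tick

print(round(x_feb_1, 3), round(x_feb_28, 3), round(x_jan_31, 3))
```

With this scaling a 31st maps to a fraction of 1.023, so it sits just beyond the following month's tick; for a scatter plot of this kind that small overshoot is unlikely to matter.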