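Note: this post is original work. Please credit the source when reposting.
An aside: after the Spring Festival I went back to school with nothing much to do. I felt rusty all over and had little motivation; call it "post-holiday syndrome" for now. I had obtained detailed NBA data from the Kesci website, and I also wanted to review systematically what I had learned in half a year of studying data mining, so I decided to write a data analysis report on Kevin Durant's playing style (I am a Durant fan). Call it learning through play. Enough talk, let's get started.
To talk about Durant's style we need comparisons; otherwise, how would we know what sets him apart? I picked players on three criteria: 1. Position: small forwards and guards. Durant is a small forward, though he also plays some guard. 2. Roughly the same era; a gap of a few years is fine (Kobe, for example). 3. Players who can be called superstars. The final comparison group is Kobe Bryant, LeBron James, Stephen Curry, Russell Westbrook, Paul George, Carmelo Anthony, James Harden, Chris Paul, and Kawhi Leonard. I left out rising stars and earlier generations: numbers from a different era mean something different, and rising stars have too little data to make a comparison worthwhile. The selection is not perfect, of course; it is my own subjective choice (haha).
Data source: https://www.kesci.com/apps/home/dataset/599a6e66c8d2787da4d1e21d/document
The playoffs are the best stage for superstars. They have given us so many classic moments, and the moments we still talk about are the ones that earned them their honors. So I will start the analysis with the playoffs (yes, I am being willful).
3.1 First, let's see what the playoff data contains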
>>> import pandas as pd
>>> # Raw string so that the backslashes in the Windows path are not read as escape sequences
>>> data_player_playoff = pd.read_csv(r'E:\Python\Program\NBA_Data\data\player_playoff.csv')
>>> data_player_playoff.head()
                 球員     賽季   球隊  結果             比分  時間     投籃  命中  出手   三分  ... \
0  Kelenna Azubuike  11-12  DAL    L    OKC95-79DAL     5  0.333     1     3  1.0  ...
1  Kelenna Azubuike  06-07  GSW    L  UTA115-101GSW     1    NaN     0     0  NaN  ...
2  Kelenna Azubuike  06-07  GSW    W  UTA105-125GSW     3  0.000     0     1  NaN  ...
3  Kelenna Azubuike  06-07  GSW    W   DAL86-111GSW     2  1.000     1     1  NaN  ...
4  Kelenna Azubuike  06-07  GSW    L  DAL118-112GSW     0    NaN     0     0  NaN  ...

   罰球出手  籃板  前場  後場  助攻  搶斷  蓋帽  失誤  犯規  得分
0        0     1     1     0     0     1     0     1     0     3
1        0     0     0     0     0     0     0     0     0     0
2        0     0     0     0     0     0     0     0     1     0
3        0     0     0     0     0     0     0     0     0     2
4        0     0     0     0     0     0     0     0     0     0

[5 rows x 24 columns]
DataFrame.head(n) prints the first n rows of the data (5 by default); DataFrame.tail(n) prints the last n rows.
3.2 Basic information about the data
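As a quick illustration of head and tail, here is a minimal sketch on a made-up frame; the column names and values are invented for the example and are not taken from the playoff file.

```python
import pandas as pd

# Toy frame standing in for the playoff table (values are made up)
df = pd.DataFrame({"player": list("abcdefg"), "points": [3, 0, 0, 2, 0, 7, 9]})

first_five = df.head()    # no argument: the first 5 rows
first_two = df.head(2)    # the first 2 rows
last_three = df.tail(3)   # the last 3 rows

print(first_two)
print(last_three)
```

Both methods return a new DataFrame, so the result can be stored or chained like any other frame.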
>>> data_player_playoff.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49743 entries, 0 to 49742
Data columns (total 24 columns):
球員      49615 non-null object
賽季      49743 non-null object
球隊      49743 non-null object
結果      49743 non-null object
比分      49743 non-null object
時間      49743 non-null int64
投籃      45767 non-null float64
命中      49743 non-null int64
出手      49743 non-null int64
三分      24748 non-null float64
三分命中    49743 non-null int64
三分出手    49743 non-null int64
罰球      29751 non-null float64
罰球命中    49743 non-null int64
罰球出手    49743 non-null int64
籃板      49743 non-null int64
前場      49743 non-null int64
後場      49743 non-null int64
助攻      49743 non-null int64
搶斷      49743 non-null int64
蓋帽      49743 non-null int64
失誤      49743 non-null int64
犯規      49743 non-null int64
得分      49743 non-null int64
dtypes: float64(3), int64(16), object(5)
memory usage: 9.1+ MB
3.3 The Chinese column names make later processing awkward, so rename the columns
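The non-null counts show that only the three percentage columns have gaps. In the five sample rows the percentage is NaN exactly when the attempt count is zero. Assuming that pattern holds for the whole file (not verified here), the sketch below reproduces it on made-up numbers and counts the missing values per column.

```python
import numpy as np
import pandas as pd

# Made-up makes and attempts for four games
games = pd.DataFrame({"hit": [1, 0, 0, 1], "shot": [3, 0, 1, 1]})

# With zero attempts the percentage is undefined, so it is stored as NaN rather than 0
games["shoot"] = (games["hit"] / games["shot"].replace(0, np.nan)).round(3)

# Number of missing values in each column
missing = games.isna().sum()
print(games)
print(missing)
```

Averages computed later skip these NaN games, because pandas ignores missing values in mean() by default.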
>>> data_player_playoff.columns = [
...     'player', 'season', 'team', 'result', 'team_score', 'time',
...     'shoot', 'hit', 'shot',
...     'three_pts', 'three_pts_hit', 'three_pts_shot',
...     'free_throw', 'free_throw_hit', 'free_throw_shot',
...     'backboard', 'front_court', 'back_court',
...     'assists', 'steals', 'block_shot', 'errors', 'foul', 'player_score']
3.4 Select the rows for Durant, Kobe, LeBron, Curry, Westbrook, George, Anthony, Harden, Paul, and Leonard from the table
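Assigning a full list to columns depends on the order of the 24 columns. A minimal sketch of an alternative, assuming the same Chinese headers as in the output above: DataFrame.rename with an explicit mapping, shown here on a three-column toy frame.

```python
import pandas as pd

# Toy frame with three of the original headers (the row is made up)
raw = pd.DataFrame({"球員": ["Kevin Durant"], "賽季": ["11-12"], "得分": [30]})

# Old name -> new name; columns not listed in the mapping would keep their names
column_map = {"球員": "player", "賽季": "season", "得分": "player_score"}
renamed = raw.rename(columns=column_map)

print(renamed.columns.tolist())
```

The mapping makes each rename explicit, so a reordered or extended file does not silently shift the labels.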
>>> kd_data_off = data_player_playoff[data_player_playoff.player == 'Kevin Durant']
>>> jh_data_off = data_player_playoff[data_player_playoff.player == 'James Harden']
>>> kb_data_off = data_player_playoff[data_player_playoff.player == 'Kobe Bryant']
>>> lj_data_off = data_player_playoff[data_player_playoff.player == 'LeBron James']
>>> kl_data_off = data_player_playoff[data_player_playoff.player == 'Kawhi Leonard']
>>> sc_data_off = data_player_playoff[data_player_playoff.player == 'Stephen Curry']
>>> rw_data_off = data_player_playoff[data_player_playoff.player == 'Russell Westbrook']
>>> pg_data_off = data_player_playoff[data_player_playoff.player == 'Paul George']
>>> ca_data_off = data_player_playoff[data_player_playoff.player == 'Carmelo Anthony']
>>> cp_data_off = data_player_playoff[data_player_playoff.player == 'Chris Paul']
>>> super_data_off = pd.concat([kd_data_off, kb_data_off, jh_data_off, lj_data_off, sc_data_off,
...                             kl_data_off, cp_data_off, rw_data_off, pg_data_off, ca_data_off])
>>> super_data_off.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1087 entries, 9721 to 904
Data columns (total 24 columns):
player             1087 non-null object
season             1087 non-null object
team               1087 non-null object
result             1087 non-null object
team_score         1087 non-null object
time               1087 non-null int64
shoot              1085 non-null float64
hit                1087 non-null int64
shot               1087 non-null int64
three_pts          1059 non-null float64
three_pts_hit      1087 non-null int64
three_pts_shot     1087 non-null int64
free_throw         1015 non-null float64
free_throw_hit     1087 non-null int64
free_throw_shot    1087 non-null int64
backboard          1087 non-null int64
front_court        1087 non-null int64
back_court         1087 non-null int64
assists            1087 non-null int64
steals             1087 non-null int64
block_shot         1087 non-null int64
errors             1087 non-null int64
foul               1087 non-null int64
player_score       1087 non-null int64
dtypes: float64(3), int64(16), object(5)
memory usage: 212.3+ KB
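The ten separate filters followed by concat can also be written as a single filter. A minimal sketch with Series.isin on a made-up frame; the names stars and selected are hypothetical and the scores are invented.

```python
import pandas as pd

# Toy game log (scores are made up)
games = pd.DataFrame({
    "player": ["Kevin Durant", "Kelenna Azubuike", "Kobe Bryant", "Kevin Durant"],
    "player_score": [30, 3, 25, 28],
})

stars = ["Kevin Durant", "Kobe Bryant"]

# Keep only the rows whose player is in the list
selected = games[games["player"].isin(stars)]
print(selected)
```

One difference from the concat version: isin keeps the rows in their original file order instead of grouping them player by player. That does not affect the groupby statistics used later.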
>>> super_data_off .to_csv('super_star_playoff.csv',index = False )
>>> super_data_off.player.value_counts()
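Kobe Bryant          220
LeBron James         217
Kevin Durant         106
James Harden          88
Kawhi Leonard         87
Russell Westbrook     87
Chris Paul            76
Stephen Curry         75
Carmelo Anthony       66
Paul George           65
Name: player, dtype: int64
This shows the dominance of LeBron, who reaches the Finals year after year: he is only three games behind Kobe and will pass him this year, and he still has a few more Finals runs in him. Durant is far behind LeBron in games played; my guess is that he will end up with about as many as Kobe.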
>>> super_data_off.groupby('player').player_score.describe()
                   count       mean        std   min   25%   50%    75%   max
player
Carmelo Anthony     66.0  25.651515   8.471658   2.0  21.0  25.0  31.00  42.0
Chris Paul          76.0  21.434211   7.691269   4.0  16.0  21.5  27.00  35.0
James Harden        88.0  20.681818  10.485398   0.0  13.0  19.0  28.00  45.0
Kawhi Leonard       87.0  16.459770   8.428640   2.0  11.0  16.0  21.00  43.0
Kevin Durant       106.0  28.754717   6.979987  10.0  25.0  29.0  33.75  41.0
Kobe Bryant        220.0  25.636364   9.856715   0.0  20.0  26.0  32.00  50.0
LeBron James       217.0  28.400922   7.826865   7.0  23.0  28.0  33.00  49.0
Paul George         65.0  18.984615   9.299685   2.0  12.0  19.0  26.00  39.0
Russell Westbrook   87.0  25.275862   8.187753   7.0  19.0  26.0  30.00  51.0
Stephen Curry       75.0  26.200000   8.109054   6.0  21.5  26.0  32.50  44.0
This shows that Durant is an elite scorer, and there are already hints that he is rock steady: he has the highest mean and the smallest standard deviation in the table.
Here comes the scoring bar chart, so hold on tight.
# coding: utf-8
import matplotlib.pyplot as plt
import pandas as pd
from pylab import mpl

# Font settings so that Chinese text in plots does not come out garbled
mpl.rcParams['font.sans-serif'] = ['FangSong']  # set the default font
mpl.rcParams['axes.unicode_minus'] = False  # keep the minus sign '-' from showing as a box in saved figures

super_data_off = pd.read_csv('super_star_playoff.csv')
kd_off_score = super_data_off[super_data_off.player == 'Kevin Durant'].player_score.describe()
# Select the column before aggregating: recent pandas versions raise an error
# when mean() is applied to the text columns of the whole frame
super_off_mean_score = super_data_off.groupby('player')['player_score'].mean()
labels = ['Games', 'Mean', 'Std', 'Min', '25%', '50%', '75%', 'Max']
print(super_off_mean_score.index)
# Short names, in the same alphabetical order as the index printed above
super_name = ['Anthony', 'Paul', 'Harden', 'Leonard', 'Durant',
              'Bryant', 'James', 'George', 'Westbrook', 'Curry']

# Plot
plt.bar(range(len(super_off_mean_score)), super_off_mean_score, align='center')
plt.ylabel('Points')
plt.title('Playoff scoring comparison of the superstars')
# plt.xticks(range(len(labels)), labels)
plt.xticks(range(len(super_off_mean_score)), super_name)
plt.ylim(15, 35)
for x, y in enumerate(super_off_mean_score):
    plt.text(x, y + 1, '%s' % round(y, 2), ha='center')
plt.show()
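In terms of scoring, Durant and LeBron form the top tier; Anthony, Kobe, Westbrook, and Curry form the second; Paul, Harden, Leonard, and George form the third. Harden's number should rise noticeably this year, since he began his playoff career as a sixth man. Durant's four scoring titles were not handed to him; as a scorer he really is one of the league's true superstars.
Next, look at the trend of each superstar's playoff scoring average by season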
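The bar chart labels its bars with a hand-written list that has to match the alphabetical order of the groupby index. A minimal sketch of a less fragile variant, with a hypothetical short_name mapping and only three players for brevity: the labels are derived from the index itself.

```python
import pandas as pd

# Mean scores keyed by full name, rounded from the describe() table above
means = pd.Series({
    "Kevin Durant": 28.75,
    "Carmelo Anthony": 25.65,
    "Chris Paul": 21.43,
}).sort_index()

# Hypothetical mapping from full name to the label shown on the axis
short_name = {
    "Carmelo Anthony": "Anthony",
    "Chris Paul": "Paul",
    "Kevin Durant": "Durant",
}

# Labels follow the index order automatically
labels = [short_name[name] for name in means.index]
print(labels)
```

The resulting list can be passed to plt.xticks in place of the hard-coded one.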
def season_mean_score(name):
    # Mean playoff points per season for one player
    rows = super_data_off[super_data_off.player == name]
    return rows.groupby('season')['player_score'].mean()


def plot_season_score(position, scores, title):
    # One panel: line plus markers, with the value printed above each point
    plt.subplot(position)
    plt.title(title, color='red')
    # plt.xlabel('Season')
    plt.ylabel('Points')
    plt.xticks(range(len(scores)), scores.index)
    plt.plot(list(scores), 'k', list(scores), 'bo')
    for x, y in enumerate(scores):
        plt.text(x, y + 0.2, '%s' % round(y, 2), ha='center')


season_kb_score = season_mean_score('Kobe Bryant')
# Season labels sort as text, so Kobe's four seasons before 2000 end up last; move them to the front
season_kb_score = pd.concat([season_kb_score.iloc[-4:], season_kb_score.iloc[:-4]])

plt.figure()
plot_season_score(321, season_mean_score('Kevin Durant'), 'Durant: playoff scoring average by season')
plot_season_score(322, season_mean_score('LeBron James'), 'James: playoff scoring average by season')
plot_season_score(323, season_kb_score, 'Bryant: playoff scoring average by season')
plot_season_score(324, season_mean_score('Russell Westbrook'), 'Westbrook: playoff scoring average by season')
plot_season_score(325, season_mean_score('Stephen Curry'), 'Curry: playoff scoring average by season')
plot_season_score(326, season_mean_score('Carmelo Anthony'), 'Anthony: playoff scoring average by season')
plt.show()
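The script reorders Kobe's seasons by hand because labels such as 96-97 sort after 00-01 when compared as text. A minimal sketch of a general fix, assuming every label has the form YY-YY and falls between 1950 and 2049; season_start_year is a hypothetical helper and the scores are made up.

```python
import pandas as pd


def season_start_year(season):
    """Turn '96-97' into 1996 and '00-01' into 2000 (assumes the years 1950-2049)."""
    yy = int(season.split("-")[0])
    return 1900 + yy if yy >= 50 else 2000 + yy


# Made-up averages, indexed the way groupby('season') would sort them (as text)
scores = pd.Series([25.0, 29.4, 8.2, 21.1], index=["00-01", "11-12", "96-97", "99-00"])

# Reorder chronologically by the derived start year
ordered = scores.reindex(sorted(scores.index, key=season_start_year))
print(ordered)
```

Applied to every player, this would put all six panels on a chronological axis without special-casing Kobe.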
Next, use pie charts to look at how their single-game scoring is distributed
super_name_E = ['Kevin Durant', 'LeBron James', 'Kobe Bryant', 'Russell Westbrook', 'Stephen Curry', 'Carmelo Anthony']
super_name_C = ['Durant', 'James', 'Bryant', 'Westbrook', 'Curry', 'Anthony']
plt.figure(facecolor='bisque')
colors = ['red', 'yellow', 'peru', 'springgreen']
player_labels = ['Under 20', '20-29', '30-39', '40+']
explode = [0, 0.1, 0, 0]  # pull out the 20-29 slice
for i in range(len(super_name_E)):
    score = super_data_off[super_data_off.player == super_name_E[i]].player_score
    # Share of games in each range. Boolean masks replace the original pd.merge calls,
    # which could count a game more than once when two rows are identical.
    player_score_range = [
        (score < 20).mean(),
        ((score > 19) & (score < 30)).mean(),
        ((score > 29) & (score < 40)).mean(),
        (score > 39).mean(),
    ]
    plt.subplot(231 + i)
    plt.title(super_name_C[i] + ' scoring distribution', color='blue')
    plt.pie(player_score_range, labels=player_labels, colors=colors, labeldistance=1.1,
            autopct='%.01f%%', shadow=False, startangle=90, pctdistance=0.8, explode=explode)
    plt.axis('equal')
plt.show()
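The four score ranges can also be produced in one step with pandas.cut. This is a sketch on made-up scores, not the code used for the charts; the bin edges are chosen so that the intervals match the ranges above (up to 19, 20 to 29, 30 to 39, 40 and over).

```python
import pandas as pd

# Made-up single-game scores, two per range
points = pd.Series([10, 19, 20, 29, 30, 39, 40, 51])

# Intervals are closed on the right: (-1, 19], (19, 29], (29, 39], (39, 1000]
bins = [-1, 19, 29, 39, 1000]
labels = ["Under 20", "20-29", "30-39", "40+"]

binned = pd.cut(points, bins=bins, labels=labels)

# Fraction of games in each range, in label order
share = binned.value_counts(normalize=True).reindex(labels)
print(share)
```

The values of share can be passed to plt.pie directly.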
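The pie charts show that Durant and LeBron are far ahead of the rest in scoring consistency: most of their games fall between 20 and 40 points, about 80% of the total. They not only score a lot but do so very consistently. LeBron has the highest share of 40+ games, followed by Curry and Durant. This also suggests, indirectly, that Durant is the steadiest scorer of the group. Rock steady! To check consistency against the numbers, here is a bar chart of the standard deviation of their scoring: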
# Select the column first so that std() only sees numeric data
std = super_data_off.groupby('player')['player_score'].std()
# Durant (fifth in alphabetical order) is drawn in blue
color = ['red', 'red', 'red', 'red', 'blue', 'red', 'red', 'red', 'red', 'red']
print(std)
plt.barh(range(10), std, align='center', color=color, alpha=0.8)
plt.xlabel('Standard deviation', color='blue')
plt.ylabel('Player', color='blue')
plt.yticks(range(len(super_name)), super_name)
plt.xlim(6, 11)
for x, y in enumerate(std):
    plt.text(y + 0.1, x, '%s' % round(y, 2), va='center')
plt.show()
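The standard-deviation bar chart clearly shows that Durant is highly consistent (the smaller the standard deviation, the steadier the data).
When evaluating a player, shooting zones and shooting percentages are often important indicators. Players can be grouped into sharpshooters, three-point shooters, mid-range masters, and those who attack the paint (good at driving); there are also experts at drawing fouls, such as Harden.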
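To make the measure concrete, here is a minimal sketch with two made-up players who have the same average but very different spread. pandas computes the sample standard deviation (ddof=1) by default, which the NumPy call reproduces.

```python
import numpy as np
import pandas as pd

# Two invented players, both averaging 30 points
games = pd.DataFrame({
    "player": ["A", "A", "A", "B", "B", "B"],
    "player_score": [28, 30, 32, 10, 30, 50],
})

grouped = games.groupby("player")["player_score"]
mean = grouped.mean()
std = grouped.std()   # sample standard deviation, ddof=1

# The same value for player A, computed directly with NumPy
manual_a = np.std([28, 30, 32], ddof=1)

print(mean)
print(std)
```

Player A stays within two points of 30 every night, while player B swings between 10 and 50. The means are identical; only the standard deviation separates them.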
import numpy as np

super_name_E = ['Carmelo Anthony', 'Chris Paul', 'James Harden', 'Kawhi Leonard', 'Kevin Durant',
                'Kobe Bryant', 'LeBron James', 'Paul George', 'Russell Westbrook', 'Stephen Curry']
bar_width = 0.25
# Average of the per-game percentages for each player (numeric columns selected before mean())
pct = super_data_off.groupby('player')[['shoot', 'three_pts', 'free_throw']].mean()
shoot = pct['shoot']
three_pts = pct['three_pts']
free_throw = pct['free_throw']
# Three bars per player, offset by one bar width each
plt.bar(np.arange(10), shoot, align='center', label='Field goal %', color='red', width=bar_width)
plt.bar(np.arange(10) + bar_width, three_pts, align='center', color='blue', label='Three-point %', width=bar_width)
plt.bar(np.arange(10) + 2 * bar_width, free_throw, align='center', color='green', label='Free throw %', width=bar_width)
for x, y in enumerate(shoot):
    plt.text(x, y + 0.01, '%s' % round(y, 2), ha='center')
for x, y in enumerate(three_pts):
    plt.text(x + bar_width, y + 0.01, '%s' % round(y, 2), ha='center')
for x, y in enumerate(free_throw):
    plt.text(x + 2 * bar_width, y + 0.01, '%s' % round(y, 2), ha='center')
plt.legend()
plt.ylim(0.3, 1.0)
plt.title('Shooting percentage comparison')
plt.xlabel('Player')
plt.xticks(np.arange(10) + bar_width, super_name)
plt.ylabel('Percentage')
plt.show()
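The highest field-goal, three-point, and free-throw percentages belong to Leonard, Curry, and Curry respectively, which shows how formidable Curry's three-point shooting is. Durant ranks third in all three, which shows how well-rounded his scoring is.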
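One caveat about the chart: it averages per-game percentages, so a 1-for-1 night counts as much as a 9-for-18 night. An alternative, which is not what the code above does, is to pool makes and attempts over all games. A minimal sketch on made-up numbers shows how far the two can differ.

```python
import pandas as pd

# Two invented games: 1-for-1 and 9-for-18
games = pd.DataFrame({"hit": [1, 9], "shot": [1, 18]})
games["shoot"] = games["hit"] / games["shot"]

# Method used in the chart: mean of the per-game percentages
mean_of_games = games["shoot"].mean()

# Alternative: total makes divided by total attempts
pooled = games["hit"].sum() / games["shot"].sum()

print(round(mean_of_games, 3), round(pooled, 3))
```

Here the per-game mean is 0.75 while the pooled figure is 10/19, about 0.526. The rankings in the text are based on the per-game means.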
import numpy as np

super_name_E = ['Carmelo Anthony', 'Chris Paul', 'James Harden', 'Kawhi Leonard', 'Kevin Durant',
                'Kobe Bryant', 'LeBron James', 'Paul George', 'Russell Westbrook', 'Stephen Curry']
bar_width = 0.25
# Career playoff totals per player (numeric columns selected before sum())
totals = super_data_off.groupby('player')[['three_pts_hit', 'free_throw_hit', 'player_score']].sum()
three_pts = totals['three_pts_hit']
free_throw_pts = totals['free_throw_hit']
sum_pts = totals['player_score']
# Share of points from threes (3 points each) and free throws (1 point each); the rest are twos
three_pts_rate = np.array(three_pts) * 3.0 / np.array(sum_pts)
free_throw_pts_rate = np.array(free_throw_pts) * 1.0 / np.array(sum_pts)
two_pts_rate = 1.0 - three_pts_rate - free_throw_pts_rate
print(two_pts_rate)
plt.bar(np.arange(10), two_pts_rate, align='center', label='Share from two-pointers', color='red', width=bar_width)
plt.bar(np.arange(10) + bar_width, three_pts_rate, align='center', color='blue', label='Share from three-pointers', width=bar_width)
plt.bar(np.arange(10) + 2 * bar_width, free_throw_pts_rate, align='center', color='green', label='Share from free throws', width=bar_width)
for x, y in enumerate(two_pts_rate):
    plt.text(x, y + 0.01, '%s' % round(y, 2), ha='center')
for x, y in enumerate(three_pts_rate):
    plt.text(x + bar_width, y + 0.01, '%s' % round(y, 2), ha='center')
for x, y in enumerate(free_throw_pts_rate):
    plt.text(x + 2 * bar_width, y + 0.01, '%s' % round(y, 2), ha='center')
plt.legend()
plt.title('How each player scores')
plt.xlabel('Player')
plt.xticks(np.arange(10) + bar_width, super_name)
plt.ylabel('Share of points')
plt.show()
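As the chart shows, the highest shares of points from two-pointers, three-pointers, and free throws belong to Anthony and Kobe, Curry, and Harden respectively. This matches our intuition: Anthony's signature mid-range jumper, Kobe's fadeaway, Curry's outrageous threes, and Harden's mastery of getting to the line (he is not called the king of foul-baiting for nothing). LeBron's two-point share is also high, which goes hand in hand with his physical gifts. Durant is middle-of-the-road in all three while keeping his mid-range game, which again shows how varied and well-rounded his offense is.
A star's ability shows not only on offense; defense is an important measure too. Greats like Jordan, Kobe, and LeBron were regulars on the All-Defensive Team, so here are the players' numbers on both ends of the floor.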
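The shares in the chart follow from simple arithmetic: every made three is worth 3 points, every made free throw 1 point, and whatever is left of the total came from two-pointers. A worked sketch for a single made-up game (the helper score_shares is hypothetical):

```python
def score_shares(three_pts_hit, free_throw_hit, player_score):
    """Return the (two-point, three-point, free-throw) shares of the total points."""
    three_share = 3.0 * three_pts_hit / player_score
    free_throw_share = 1.0 * free_throw_hit / player_score
    two_share = 1.0 - three_share - free_throw_share
    return two_share, three_share, free_throw_share


# Invented stat line: 30 points with 2 threes and 8 free throws,
# so 30 - 6 - 8 = 16 points came from two-pointers
two, three, ft = score_shares(three_pts_hit=2, free_throw_hit=8, player_score=30)
print(round(two, 3), round(three, 3), round(ft, 3))
```

The script above applies the same formula to each player's career playoff totals rather than to a single game.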
import seaborn as sns
import numpy as np

player_adavance = pd.read_csv('super_star_advance_data.csv')
player_labels = ['Rebound %', 'Assist %', 'Steal %', 'Block %', 'Turnover %',
                 'Usage %', 'Win shares', 'PER']
# Per-player mean of each advanced statistic (rows come out in alphabetical order by player)
player_data = player_adavance[['player', 'total_rebound_rate', 'assist_rate', 'steals_rate', 'cap_rate',
                               'error_rate', 'usage_rate', 'ws', 'per']].groupby('player').mean()
# Scale the six rate columns by 100; leave ws and per unchanged
num = [100, 100, 100, 100, 100, 100, 1, 1]
np_num = np.array(player_data) * np.array(num)
plt.title('Heatmap of player ability on both ends of the floor')
sns.heatmap(np_num, annot=True, xticklabels=player_labels, yticklabels=super_name, cmap='YlGnBu')
plt.show()
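The scaling step relies on NumPy broadcasting: a row of per-column factors is multiplied against every row of the table. A minimal sketch with made-up values, assuming (as the factor of 100 suggests) that the rate columns are stored as fractions:

```python
import numpy as np

# Two invented players; columns are rebound rate, assist rate, and PER
rates = np.array([
    [0.11, 0.20, 8.5],
    [0.07, 0.45, 10.1],
])

# One factor per column: rates become percentages, PER is left as it is
scale = np.array([100, 100, 1])

# Shape (2, 3) times shape (3,): the factors are applied to every row
scaled = rates * scale
print(scaled)
```

Even after scaling, the columns sit on different ranges, so colors in a single heatmap are easiest to compare within a column rather than across columns.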
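On rebounding, the small forwards are all about the same, at around 11. Among the guards the best rebounder is Westbrook, who after all averaged a triple-double last season (only the second player in history to do so). The highest assist rate naturally belongs to Paul, followed by Westbrook and LeBron; Durant's assist rate is ordinary, though good for a small forward. In steal rate Paul and Leonard have a clear edge, which shows the effect of Leonard's smothering defense. The highest block rate belongs to Durant, where his physical advantages show clearly; this season his shot blocking has risen another level and he ranks in the league's top five (Durant the center, haha). Turnover rates are higher for guards than for forwards, with Westbrook the highest. Usage is highest for Westbrook, followed by LeBron, which shows how much of the ball they command; Leonard is only at 22 (Coach Popovich's hold on team basketball really is strong). Win shares are highest for LeBron, since his teams are built around him and he now carries the team on his own; Paul is second, being the brain of his team; Durant is third, which fits his killer reputation. The top three in efficiency rating are the same as in win shares. LeBron really is that good; you have to admit it.
Looking at the data, we can conclude that on offense Durant is the strongest, as shown by his high scoring, high consistency, well-rounded scoring methods, and high scoring efficiency. On defense Durant is good at blocking shots, while as a playmaker who connects the team he is still clearly behind LeBron. Durant's defense has been getting better over the last two years, and I hope he makes the All-Defensive Team this season. What the data show is close to what we already know about Durant from watching him, so the data can be said to confirm the subjective impression. That is all for the playoff analysis. It gave me a systematic run through the pandas, numpy, seaborn, and matplotlib modules, and it also served as a warm-up for the new semester. I will not analyze the regular-season data for now; I will get to it when I feel like it.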