A Data-Driven Analysis of Durant's Playing Style (with a Walkthrough of the Python Implementation)

Date: 2020-07-10

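Note: this post is original work; please credit the source when reposting.

Off topic: after the Spring Festival I came back to campus with nothing much to do, feeling rusty all over and short on motivation. Call it "post-holiday syndrome". I had obtained detailed NBA data from the Kesci website, and I also wanted to consolidate what I had learned in half a year of studying data mining, so I decided to write a data analysis report on Durant's playing style (I am a Durant fan). Learning through play, you might say. Enough talk, let's get to work.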

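1 Durant vs. Who?

To talk about Durant's playing style we need someone to compare him with; otherwise, how would we know what is distinctive about him? I chose the comparison group on three criteria: 1. Position: small forwards and guards; Durant is a small forward, although he plays some guard as well. 2. Roughly the same era; a gap of a few years either way is acceptable (Kobe, for example). 3. Players who can be called superstars. The final list is: Kobe, James, Curry, Westbrook, George, Anthony, Harden, Paul, and Leonard. I leave out rising stars and earlier generations: statistics mean different things in different eras, and rising stars have too little data to make a comparison worthwhile. The selection is admittedly imperfect; it is my own subjective pick.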

2 Data

Data source: https://www.kesci.com/apps/home/dataset/599a6e66c8d2787da4d1e21d/document

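3 The First Cut into the Data

The playoffs are the best stage for superstars. They have given us so many classic moments, and the moments we love to talk about are the ones that earned them their honors. So I will start the analysis with the playoffs (yes, just because I feel like it).

3.1 First, let's see what the playoff data contains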

>>> import pandas as pd
>>> data_player_playoff = pd.read_csv(r'E:\Python\Program\NBA_Data\data\player_playoff.csv')
>>> data_player_playoff.head()

                球員     賽季   球隊 結果             比分  時間     投籃  命中  出手   三分 ...  \
0  Kelenna Azubuike  11-12  DAL  L    OKC95-79DAL   5  0.333   1   3  1.0 ...   
1  Kelenna Azubuike  06-07  GSW  L  UTA115-101GSW   1    NaN   0   0  NaN ...   
2  Kelenna Azubuike  06-07  GSW  W  UTA105-125GSW   3  0.000   0   1  NaN ...   
3  Kelenna Azubuike  06-07  GSW  W   DAL86-111GSW   2  1.000   1   1  NaN ...   
4  Kelenna Azubuike  06-07  GSW  L  DAL118-112GSW   0    NaN   0   0  NaN ...   

   罰球出手  籃板  前場  後場  助攻  搶斷  蓋帽  失誤  犯規  得分  
0     0   1   1   0   0   1   0   1   0   3  
1     0   0   0   0   0   0   0   0   0   0  
2     0   0   0   0   0   0   0   0   1   0  
3     0   0   0   0   0   0   0   0   0   2  
4     0   0   0   0   0   0   0   0   0   0  

[5 rows x 24 columns]

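DataFrame.head(n) returns the first n rows of the data (5 by default); DataFrame.tail(n) returns the last n rows. Both are methods of the DataFrame, not functions of the pd module.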
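
A quick way to see both calls in action is a small made-up table. The frame below and its column names are purely illustrative and are not taken from the dataset:

```python
import pandas as pd

# Hypothetical toy frame standing in for the playoff table
games = pd.DataFrame({
    "player": ["A", "B", "C", "D", "E", "F", "G"],
    "points": [3, 0, 0, 2, 0, 11, 7],
})

first_five = games.head()    # no argument: the first 5 rows
first_two = games.head(2)    # the first 2 rows
last_three = games.tail(3)   # the last 3 rows

print(first_two)
print(last_three)
```

Both methods return a new DataFrame and leave the original untouched, so they are safe to call while exploring.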

3.2 Basic information about the data

>>> data_player_playoff.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49743 entries, 0 to 49742
Data columns (total 24 columns):
球員      49615 non-null object
賽季      49743 non-null object
球隊      49743 non-null object
結果      49743 non-null object
比分      49743 non-null object
時間      49743 non-null int64
投籃      45767 non-null float64
命中      49743 non-null int64
出手      49743 non-null int64
三分      24748 non-null float64
三分命中    49743 non-null int64
三分出手    49743 non-null int64
罰球      29751 non-null float64
罰球命中    49743 non-null int64
罰球出手    49743 non-null int64
籃板      49743 non-null int64
前場      49743 non-null int64
後場      49743 non-null int64
助攻      49743 non-null int64
搶斷      49743 non-null int64
蓋帽      49743 non-null int64
失誤      49743 non-null int64
犯規      49743 non-null int64
得分      49743 non-null int64
dtypes: float64(3), int64(16), object(5)
memory usage: 9.1+ MB
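
The non-null counts above show gaps in the three percentage columns. In the head() output, the rows with zero attempts are exactly the rows whose shooting percentage is NaN, so a plausible explanation (an assumption on my part, not something the dataset documents) is that a percentage is simply undefined when there were no attempts. A minimal sketch of how one might check this, using hypothetical rows that mimic that pattern and the renamed column names from 3.3:

```python
import numpy as np
import pandas as pd

# Hypothetical rows mimicking the pattern in head(): zero attempts -> NaN percentage
df = pd.DataFrame({
    "hit": [1, 0, 0, 1, 0],
    "shot": [3, 0, 1, 1, 0],
    "shoot": [0.333, np.nan, 0.0, 1.0, np.nan],
})

# Number of missing values in each column
missing_per_column = df.isna().sum()

# Check the assumption: a percentage is missing exactly when there were no attempts
nan_only_when_no_attempts = bool((df["shoot"].isna() == (df["shot"] == 0)).all())

print(missing_per_column)
print(nan_only_when_no_attempts)
```

If the check came back False on the real table, the missing values would need a different explanation.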

3.3 The Chinese column names get in the way of later processing, so rename the columns

>>> data_player_playoff.columns = ['player','season','team','result','team_score','time','shoot','hit','shot','three_pts','three_pts_hit','three_pts_shot','free_throw','free_throw_hit','free_throw_shot','backboard','front_court','back_court','assists','steals','block_shot','errors','foul','player_score']
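
Assigning a full list to .columns depends on the list being in exactly the same order as the file. An alternative sketch, not what the post does, is to rename by an explicit mapping with DataFrame.rename, which still works if the column order changes. The three-column frame below is a hypothetical excerpt of the 24-column table:

```python
import pandas as pd

# Hypothetical three-column excerpt; the real table has 24 columns
df = pd.DataFrame({"球員": ["Kevin Durant"], "賽季": ["16-17"], "得分": [38]})

# Old name -> new name; columns missing from the mapping would be left unchanged
column_map = {"球員": "player", "賽季": "season", "得分": "player_score"}
df = df.rename(columns=column_map)

print(list(df.columns))
```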

3.4 Select the data for Durant, Kobe, James, Curry, Westbrook, George, Anthony, Harden, Paul, and Leonard from the table

>>> kd_data_off = data_player_playoff[data_player_playoff.player == 'Kevin Durant']
>>> jh_data_off = data_player_playoff[data_player_playoff.player == 'James Harden']
>>> kb_data_off = data_player_playoff[data_player_playoff.player == 'Kobe Bryant']
>>> lj_data_off = data_player_playoff[data_player_playoff.player == 'LeBron James']
>>> kl_data_off = data_player_playoff[data_player_playoff.player == 'Kawhi Leonard']
>>> sc_data_off = data_player_playoff[data_player_playoff.player == 'Stephen Curry']
>>> rw_data_off = data_player_playoff[data_player_playoff.player == 'Russell Westbrook']
>>> pg_data_off = data_player_playoff[data_player_playoff.player == 'Paul George']
>>> ca_data_off = data_player_playoff[data_player_playoff.player == 'Carmelo Anthony']
>>> cp_data_off = data_player_playoff[data_player_playoff.player == 'Chris Paul']
>>> super_data_off = pd.concat([kd_data_off, kb_data_off, jh_data_off, lj_data_off, sc_data_off, kl_data_off, cp_data_off, rw_data_off, pg_data_off, ca_data_off])
>>> super_data_off.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1087 entries, 9721 to 904
Data columns (total 24 columns):
player             1087 non-null object
season             1087 non-null object
team               1087 non-null object
result             1087 non-null object
team_score         1087 non-null object
time               1087 non-null int64
shoot              1085 non-null float64
hit                1087 non-null int64
shot               1087 non-null int64
three_pts          1059 non-null float64
three_pts_hit      1087 non-null int64
three_pts_shot     1087 non-null int64
free_throw         1015 non-null float64
free_throw_hit     1087 non-null int64
free_throw_shot    1087 non-null int64
backboard          1087 non-null int64
front_court        1087 non-null int64
back_court         1087 non-null int64
assists            1087 non-null int64
steals             1087 non-null int64
block_shot         1087 non-null int64
errors             1087 non-null int64
foul               1087 non-null int64
player_score       1087 non-null int64
dtypes: float64(3), int64(16), object(5)
memory usage: 212.3+ KB
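
The ten separate filters followed by a concat can also be written as one membership test with Series.isin. This is only a sketch on a hypothetical miniature table; note that, unlike the concat above, it keeps rows in their original file order instead of grouping them player by player:

```python
import pandas as pd

# Hypothetical miniature of the playoff table
data = pd.DataFrame({
    "player": ["Kevin Durant", "Kelenna Azubuike", "Kobe Bryant", "Kevin Durant"],
    "player_score": [29, 3, 26, 33],
})

stars = ["Kevin Durant", "Kobe Bryant"]

# Keep only the rows whose player name appears in the list
star_rows = data[data["player"].isin(stars)]

print(star_rows)
```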

3.5 Save the data for these ten players to a separate file

>>> super_data_off.to_csv('super_star_playoff.csv', index=False)

4 Data Analysis

4.1 First, how many playoff games each of them has played

>>> super_data_off.player.value_counts()

Kobe Bryant          220
LeBron James         217
Kevin Durant         106
James Harden          88
Kawhi Leonard         87
Russell Westbrook     87
Chris Paul            76
Stephen Curry         75
Carmelo Anthony       66
Paul George           65
Name: player, dtype: int64

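This shows the dominance of James's yearly Finals runs: he trails Kobe by only three games (217 to 220) and will pass him this year, and LeBron probably has a few more Finals trips ahead of him. Durant is well behind James in games played; my guess is that he will end up close to Kobe's total.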

4.2 Plain and simple: their playoff scoring

>>> super_data_off.groupby('player').player_score.describe()

                   count       mean        std   min   25%   50%    75%   max
player                                                                       
Carmelo Anthony     66.0  25.651515   8.471658   2.0  21.0  25.0  31.00  42.0
Chris Paul          76.0  21.434211   7.691269   4.0  16.0  21.5  27.00  35.0
James Harden        88.0  20.681818  10.485398   0.0  13.0  19.0  28.00  45.0
Kawhi Leonard       87.0  16.459770   8.428640   2.0  11.0  16.0  21.00  43.0
Kevin Durant       106.0  28.754717   6.979987  10.0  25.0  29.0  33.75  41.0
Kobe Bryant        220.0  25.636364   9.856715   0.0  20.0  26.0  32.00  50.0
LeBron James       217.0  28.400922   7.826865   7.0  23.0  28.0  33.00  49.0
Paul George         65.0  18.984615   9.299685   2.0  12.0  19.0  26.00  39.0
Russell Westbrook   87.0  25.275862   8.187753   7.0  19.0  26.0  30.00  51.0
Stephen Curry       75.0  26.200000   8.109054   6.0  21.5  26.0  32.50  44.0

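The table shows that Durant is an elite scorer: his 28.75 points per game is the highest mean of the ten, and his standard deviation of 6.98 is the lowest, an early hint of how rock-steady he is.

Here comes the bar chart of scoring averages. Hold on tight.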

# coding: utf-8

import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd

# Font settings for Chinese text (harmless when all labels are ASCII)
mpl.rcParams['font.sans-serif'] = ['FangSong']  # set the default font
mpl.rcParams['axes.unicode_minus'] = False  # keep the minus sign '-' from showing as a box in saved figures

super_data_off = pd.read_csv('super_star_playoff.csv')
# Mean playoff points per player; groupby sorts the index alphabetically by full name
super_off_mean_score = super_data_off.groupby('player')['player_score'].mean()
print(super_off_mean_score.index)
# Short display names, in the same alphabetical order as the index printed above
super_name = ['Anthony', 'Paul', 'Harden', 'Leonard', 'Durant', 'Kobe', 'James', 'George', 'Westbrook', 'Curry']
# Draw the bar chart
plt.bar(range(len(super_off_mean_score)), super_off_mean_score, align='center')

plt.ylabel('Points')
plt.title('Playoff scoring averages of the stars')
plt.xticks(range(len(super_off_mean_score)), super_name)
plt.ylim(15, 35)
# Print each average above its bar
for x, y in enumerate(super_off_mean_score):
    plt.text(x, y + 1, '%s' % round(y, 2), ha='center')
plt.show()

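In terms of scoring, Durant and James form the top tier; Anthony, Kobe, Westbrook, and Curry the second; and Paul, Harden, Leonard, and George the third. Harden's numbers should improve noticeably this year, since he started his playoff career as a sixth man. Durant did not win his four scoring titles for nothing; as a scorer he truly is one of the league's superstars.

Next, the season-by-season trend of the stars' playoff scoring averages.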

def season_mean_score(name):
    # Mean playoff points per game for one player, grouped by season
    return super_data_off[super_data_off.player == name].groupby('season')['player_score'].mean()


def plot_season_trend(position, title, scores):
    # One subplot: a black line with blue markers, each point labeled with its value
    plt.subplot(position)
    plt.title(title, color='red')
    plt.ylabel('Points')
    plt.xticks(range(len(scores)), scores.index)
    plt.plot(list(scores), 'k', list(scores), 'bo')
    for x, y in enumerate(scores):
        plt.text(x, y + 0.2, '%s' % round(y, 2), ha='center')


season_kb_score = season_mean_score('Kobe Bryant')
# Season labels sort as strings, so Kobe's four 1990s seasons ('96-97' to '99-00')
# come last; move them to the front to restore chronological order
season_kb_score = pd.concat([season_kb_score[-4:], season_kb_score[:-4]])

plt.figure()
plot_season_trend(321, 'Durant: playoff scoring average by season', season_mean_score('Kevin Durant'))
plot_season_trend(322, 'James: playoff scoring average by season', season_mean_score('LeBron James'))
plot_season_trend(323, 'Kobe: playoff scoring average by season', season_kb_score)
plot_season_trend(324, 'Westbrook: playoff scoring average by season', season_mean_score('Russell Westbrook'))
plot_season_trend(325, 'Curry: playoff scoring average by season', season_mean_score('Stephen Curry'))
plot_season_trend(326, 'Anthony: playoff scoring average by season', season_mean_score('Carmelo Anthony'))
plt.show()
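
The Kobe series above is reordered by hand because season labels such as '96-97' sort as plain strings, which puts the 1990s after the 2000s. A more general sketch is to sort by the starting year. The helper name season_start_year and the cutoff of 50 for deciding the century are my own assumptions, not something the dataset specifies:

```python
def season_start_year(label):
    """Map a label such as '96-97' to the calendar year the season started.

    Assumption: two-digit years of 50 and above belong to the 1900s,
    everything below 50 to the 2000s.
    """
    start = int(label.split("-")[0])
    return 1900 + start if start >= 50 else 2000 + start


# Labels in plain string order, as a groupby on 'season' would return them
seasons = ["00-01", "01-02", "11-12", "96-97", "99-00"]
chronological = sorted(seasons, key=season_start_year)

print(chronological)
```

With a pandas Series indexed by season, the sorted labels could then be passed to reindex to put the values in chronological order.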

Next, pie charts to examine how their scoring is distributed.

super_name_E = ['Kevin Durant', 'LeBron James', 'Kobe Bryant', 'Russell Westbrook', 'Stephen Curry', 'Carmelo Anthony']
super_name_C = ['Durant', 'James', 'Kobe', 'Westbrook', 'Curry', 'Anthony']
plt.figure(facecolor='bisque')
colors = ['red', 'yellow', 'peru', 'springgreen']
player_labels = ['Under 20', '20-29', '30-39', '40+']
explode = [0, 0.1, 0, 0]  # pull out the 20-29 slice
for i in range(len(super_name_E)):
    scores = super_data_off[super_data_off.player == super_name_E[i]].player_score
    # Share of this player's games that fall in each scoring range
    player_score_range = [
        (scores < 20).mean(),
        ((scores >= 20) & (scores < 30)).mean(),
        ((scores >= 30) & (scores < 40)).mean(),
        (scores >= 40).mean(),
    ]
    plt.subplot(231 + i)
    plt.title(super_name_C[i] + ' scoring distribution', color='blue')
    plt.pie(player_score_range, labels=player_labels, colors=colors, labeldistance=1.1,
            autopct='%.01f%%', shadow=False, startangle=90, pctdistance=0.8, explode=explode)
    plt.axis('equal')
plt.show()
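
As an aside on the code above, the four scoring ranges can also be produced with pandas.cut, which assigns each value to a labeled bin. This is a sketch on a hypothetical scoring log; the bin edges are my own choice, picked so that integer scores land in the same four ranges used for the pie charts:

```python
import pandas as pd

# Hypothetical single-player scoring log
scores = pd.Series([12, 19, 20, 25, 29, 30, 33, 41])

# Right-inclusive edges: (-1, 19], (19, 29], (29, 39], (39, 200]
bins = [-1, 19, 29, 39, 200]
labels = ["Under 20", "20-29", "30-39", "40+"]
buckets = pd.cut(scores, bins=bins, labels=labels)

# Fraction of games in each bucket, listed in bucket order
shares = buckets.value_counts(normalize=True).sort_index()

print(shares)
```

The resulting fractions could be handed directly to plt.pie in place of a hand-built list.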

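From these pie charts, Durant and James stand far ahead of the field in scoring consistency: their games cluster between 20 and 40 points (the 20-29 and 30-39 slices), which together account for roughly 80% of their games. They not only score a lot, they do so with great consistency. James has the highest share of 40+ games, followed by Curry and Durant. This also suggests, indirectly, that Durant is the steadiest scorer in the group. Rock-steady indeed! To look at consistency directly in the numbers, here is a bar chart of the standard deviation of their scoring: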

std = super_data_off.groupby('player')['player_score'].std()
# Highlight Durant (position 4 in the alphabetical index) in blue
color = ['red', 'red', 'red', 'red', 'blue', 'red', 'red', 'red', 'red', 'red']
print(std)
plt.barh(range(10), std, align='center', color=color, alpha=0.8)
plt.xlabel('Standard deviation', color='blue')
plt.ylabel('Player', color='blue')
plt.yticks(range(len(super_name)), super_name)
plt.xlim(6, 11)
# Print each value at the end of its bar
for x, y in enumerate(std):
    plt.text(y + 0.1, x, '%s' % round(y, 2), va='center')
plt.show()

The standard-deviation bar chart makes Durant's consistency plain (the smaller the standard deviation, the steadier the data).
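
Because the players' averages differ, one optional cross-check (not part of the original analysis) is the coefficient of variation, the standard deviation divided by the mean, which expresses spread relative to scoring level. The sketch below uses the mean and std values for four of the players, copied from the describe() output in 4.2 and rounded to two decimals:

```python
# (mean, std) of playoff points, rounded from the describe() table in section 4.2
stats = {
    "Kevin Durant": (28.75, 6.98),
    "LeBron James": (28.40, 7.83),
    "Kobe Bryant": (25.64, 9.86),
    "James Harden": (20.68, 10.49),
}

# Coefficient of variation = std / mean; lower means steadier relative to scoring level
cv = {name: std / mean for name, (mean, std) in stats.items()}

# Print from steadiest to least steady
for name, value in sorted(cv.items(), key=lambda item: item[1]):
    print(name, round(value, 3))
```

On these four rows Durant still comes out lowest, in line with the standard-deviation chart.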

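4.3 Shot types and efficiency

When evaluating a player, where he shoots from and how often he makes it are key indicators. Players can be classified as sharpshooters, three-point shooters, mid-range masters, or slashers who attack the paint; and of course there are experts at drawing fouls, such as Harden.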

import numpy as np

bar_width = 0.25
# Average of each player's per-game percentages
shoot = super_data_off.groupby('player')['shoot'].mean()
three_pts = super_data_off.groupby('player')['three_pts'].mean()
free_throw = super_data_off.groupby('player')['free_throw'].mean()
# Three bars per player, drawn side by side
plt.bar(np.arange(10), shoot, align='center', label='Field-goal %', color='red', width=bar_width)
plt.bar(np.arange(10) + bar_width, three_pts, align='center', color='blue', label='Three-point %', width=bar_width)
plt.bar(np.arange(10) + 2 * bar_width, free_throw, align='center', color='green', label='Free-throw %', width=bar_width)
for x, y in enumerate(shoot):
    plt.text(x, y + 0.01, '%s' % round(y, 2), ha='center')
for x, y in enumerate(three_pts):
    plt.text(x + bar_width, y + 0.01, '%s' % round(y, 2), ha='center')
for x, y in enumerate(free_throw):
    plt.text(x + 2 * bar_width, y + 0.01, '%s' % round(y, 2), ha='center')
plt.legend()
plt.ylim(0.3, 1.0)
plt.title('Comparison of shooting percentages')
plt.xlabel('Player')
plt.xticks(np.arange(10) + bar_width, super_name)
plt.ylabel('Percentage')
plt.show()

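The leaders in field-goal percentage, three-point percentage, and free-throw percentage are Leonard, Curry, and Curry, respectively, which shows just how formidable Curry's three-point shooting is. Durant ranks third in all three categories, a sign of how well-rounded his scoring is.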
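
One caveat worth keeping in mind: the percentages charted above are averages of per-game percentages, which weight every game equally and skip games with no attempts. That is not the same as total makes divided by total attempts. A small sketch with a hypothetical three-game log (the numbers are made up) shows the gap:

```python
import numpy as np
import pandas as pd

# Hypothetical three-game log for one player
games = pd.DataFrame({
    "hit": [1, 10, 0],
    "shot": [2, 25, 0],
})
# Per-game percentage; NaN when there were no attempts
games["shoot"] = games["hit"] / games["shot"].replace(0, np.nan)

# Mean of the per-game percentages (the NaN game is skipped)
mean_of_game_pct = games["shoot"].mean()
# Makes divided by attempts over all games
aggregate_pct = games["hit"].sum() / games["shot"].sum()

print(round(mean_of_game_pct, 3), round(aggregate_pct, 3))
```

Either definition can be defended; the point is only that rankings may shift slightly depending on which one is used.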

import numpy as np

bar_width = 0.25
three_pts = super_data_off.groupby('player')['three_pts_hit'].sum()
free_throw_pts = super_data_off.groupby('player')['free_throw_hit'].sum()
sum_pts = super_data_off.groupby('player')['player_score'].sum()
# Share of total points from threes (3 points each) and free throws (1 point each);
# whatever remains comes from two-pointers
three_pts_rate = np.array(three_pts) * 3.0 / np.array(sum_pts)
free_throw_pts_rate = np.array(free_throw_pts) * 1.0 / np.array(sum_pts)
two_pts_rate = 1.0 - three_pts_rate - free_throw_pts_rate
print(two_pts_rate)
plt.bar(np.arange(10), two_pts_rate, align='center', label='Share from two-pointers', color='red', width=bar_width)
plt.bar(np.arange(10) + bar_width, three_pts_rate, align='center', color='blue', label='Share from three-pointers', width=bar_width)
plt.bar(np.arange(10) + 2 * bar_width, free_throw_pts_rate, align='center', color='green', label='Share from free throws', width=bar_width)
for x, y in enumerate(two_pts_rate):
    plt.text(x, y + 0.01, '%s' % round(y, 2), ha='center')
for x, y in enumerate(three_pts_rate):
    plt.text(x + bar_width, y + 0.01, '%s' % round(y, 2), ha='center')
for x, y in enumerate(free_throw_pts_rate):
    plt.text(x + 2 * bar_width, y + 0.01, '%s' % round(y, 2), ha='center')
plt.legend()
plt.title('Comparison of scoring methods')
plt.xlabel('Player')
plt.xticks(np.arange(10) + bar_width, super_name)
plt.ylabel('Share of points')
plt.show()

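As the chart shows, the leaders in share of points from two-pointers, three-pointers, and free throws are, respectively, Anthony and Kobe; Curry; and Harden. This matches our intuition: Anthony's signature mid-range jumper, Kobe's fadeaway, Curry's outrageous threes, and Harden's mastery of getting to the line (he is not called the king of drawing fouls for nothing). James's two-point share is also high, which has a lot to do with his physical gifts. Durant's numbers are middle-of-the-pack in all three while still reflecting his mid-range game, which again shows the variety and balance of his offense.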
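
The arithmetic behind the chart is simple enough to work through by hand: each made three is worth 3 points, each made free throw 1 point, and the two-point share is whatever remains. A minimal sketch, where the function name scoring_shares and the totals are hypothetical:

```python
def scoring_shares(points, threes_made, free_throws_made):
    """Split total points into the shares from twos, threes, and free throws."""
    three_share = 3.0 * threes_made / points
    free_throw_share = 1.0 * free_throws_made / points
    two_share = 1.0 - three_share - free_throw_share
    return two_share, three_share, free_throw_share


# Hypothetical totals: 100 points, with 10 threes and 20 free throws made
two, three, ft = scoring_shares(100, 10, 20)

print(round(two, 2), round(three, 2), round(ft, 2))
```

In this made-up case, 30 points come from threes and 20 from free throws, leaving half of the points to two-pointers.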

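4.4 Defensive-end data

A star's ability shows not only on offense; defense is also an important measure. Greats such as Jordan, Kobe, and James have all been regulars on the All-Defensive Team, so here are their numbers on both ends of the floor.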

import numpy as np
import seaborn as sns

player_advance = pd.read_csv('super_star_advance_data.csv')
player_labels = ['Rebound %', 'Assist %', 'Steal %', 'Block %', 'Turnover %', 'Usage %', 'Win Shares', 'PER']
player_data = player_advance[['player', 'total_rebound_rate', 'assist_rate', 'steals_rate', 'cap_rate', 'error_rate',
                              'usage_rate', 'ws', 'per']].groupby('player').mean()
# Multiply the six rate columns by 100; win shares and PER keep their own scale
num = [100, 100, 100, 100, 100, 100, 1, 1]
np_num = np.array(player_data) * np.array(num)
plt.title('Heatmap of player ability on both ends of the floor')
sns.heatmap(np_num, annot=True, xticklabels=player_labels, yticklabels=super_name, cmap='YlGnBu')
plt.show()

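On rebounding, the small forwards are all about the same, around 11, while the best rebounder among the guards is Westbrook, who after all averaged a triple-double last season (the second player in history to do so). Paul has the highest assist rate, naturally, followed by Westbrook and James; Durant's assist rate is fairly ordinary, though decent for a small forward. In steal rate, Paul and Leonard hold a clear edge, which reflects Leonard's suffocating on-ball defense. Durant has the highest block rate, where his physical advantages show clearly; this season his shot-blocking has stepped up another level, ranking in the league's top five (Durant the center, haha). Turnover rates are higher for guards than for forwards, with Westbrook the highest. Westbrook also has the highest usage rate, followed by James, so both clearly dominate the ball; Leonard's is only 22 (Coach Popovich's team-basketball discipline is really something). James leads in win shares; his teams have always been built around him, and nowadays he is carrying the team on his own. Paul is second, being the brain of his team, and Durant is third, which fits his "killer" nickname. The top three in efficiency rating (PER) are the same as in win shares. LeBron really is that good; you have to hand it to him.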
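
The scaling step in the heatmap code relies on NumPy broadcasting: multiplying a 2-D array by a 1-D array with one entry per column applies each factor to its whole column. A minimal sketch with hypothetical values (two players, two rates stored as fractions, and one metric already on its natural scale):

```python
import numpy as np

# Hypothetical values: rows are players, columns are two rates and one rating
values = np.array([
    [0.11, 0.20, 24.5],
    [0.08, 0.45, 25.1],
])
scale = np.array([100, 100, 1])  # one factor per column

# Shape (2, 3) times shape (3,): the factors are broadcast across every row
scaled = values * scale

print(scaled)
```

This puts the rates on a 0 to 100 scale next to the rating, although the heatmap colors are still driven by whichever column has the largest values.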

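5 Summary

The data supports the following conclusions. On offense, Durant is the strongest of the group, as shown by his high scoring, strong consistency, varied ways of scoring, and high efficiency. On defense, Durant excels at shot-blocking, but as a playmaker who ties a team together he still trails James by a clear margin. His defense has been getting better and better over the past two years, and I hope he makes the All-Defensive Team this season. These findings differ little from my everyday impression of Durant; you could say the data confirms the subjective view. That wraps up the playoff analysis. It also served as a systematic review of the pandas, numpy, seaborn, and matplotlib modules and as a warm-up for the new semester. I will skip the regular-season data for now and come back to it when the mood strikes.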
