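[Translation] How the Warriors Became Title Favorites for the New Season: Analyzing the Golden State Warriors' Passing Network with Spark GraphFrames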

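Databricks recently released GraphFrames, a Spark package that wraps graph processing in DataFrames.

I tried out its network-analysis features and used the rich data on NBA.com to visualize the Golden State Warriors' passing network.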

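The Golden State Warriors' passing network

Passes received and made

League MVP Stephen Curry received the most passes, while the team's own MVP, Draymond Green, made the most passes.

We have all seen how many of the Warriors' possessions start with Curry and Green passing to each other.

Image via GIPHY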

In-degree (inDegree): passes received

id inDegree
CurryStephen 3993
GreenDraymond 3123
ThompsonKlay 2276
LivingstonShaun 1925
IguodalaAndre 1814
BarnesHarrison 1241
BogutAndrew 1062
BarbosaLeandro 946
SpeightsMarreese 826
ClarkIan 692
RushBrandon 685
EzeliFestus 559
McAdooJames Michael 182
VarejaoAnderson 67
LooneyKevon 22

Out-degree (outDegree): passes made

id outDegree
GreenDraymond 3841
CurryStephen 3300
IguodalaAndre 1896
LivingstonShaun 1878
BogutAndrew 1660
ThompsonKlay 1460
BarnesHarrison 1300
SpeightsMarreese 795
RushBrandon 772
EzeliFestus 765
BarbosaLeandro 758
ClarkIan 597
McAdooJames Michael 261
VarejaoAnderson 94
LooneyKevon 36
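
As a rough illustration of what the two tables above measure, the sketch below counts in-degree and out-degree from a small edge list in plain Python. The edge list is made-up toy data, not the real passing data; each tuple stands for a single pass from a passer to a receiver.

```python
from collections import Counter

# Hypothetical toy data: one (passer, receiver) tuple per pass.
edges = [
    ("GreenDraymond", "CurryStephen"),
    ("GreenDraymond", "CurryStephen"),
    ("CurryStephen", "GreenDraymond"),
    ("CurryStephen", "ThompsonKlay"),
    ("ThompsonKlay", "CurryStephen"),
]

# outDegree: number of edges leaving a node (passes made).
out_degree = Counter(src for src, _ in edges)
# inDegree: number of edges arriving at a node (passes received).
in_degree = Counter(dst for _, dst in edges)

# Print players ordered by passes received, with passes made alongside.
for player, received in in_degree.most_common():
    print(player, received, out_degree[player])
```

Every pass is counted once as an outgoing edge and once as an incoming edge, so the two totals over a closed set of players are equal.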

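Label Propagation Algorithm

Label propagation is an algorithm for finding communities in a graph network.
Even without any pre-existing labels, it does a good job of splitting the players into backcourt and frontcourt.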

name label
Thompson, Klay 3
Barbosa, Leandro 3
Curry, Stephen 3
Clark, Ian 3
Livingston, Shaun 3
Rush, Brandon 7
Green, Draymond 7
Speights, Marreese 7
Bogut, Andrew 7
McAdoo, James Michael 7
Iguodala, Andre 7
Varejao, Anderson 7
Ezeli, Festus 7
Looney, Kevon 7
Barnes, Harrison 7
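
To make the idea concrete, here is a minimal sketch of textbook label propagation on a made-up weighted graph. It is not GraphFrames' implementation: the in-place update order, the "smallest label wins" tie-break, and the toy player names and weights are all assumptions of this sketch.

```python
def label_propagation(weights, max_iter=5):
    """weights maps each node to {neighbor: weight} (symmetric, undirected)."""
    labels = {node: node for node in weights}  # every node starts as its own community
    for _ in range(max_iter):
        changed = False
        for node, adjacent in weights.items():
            if not adjacent:
                continue  # isolated nodes keep their own label
            # Sum the edge weight behind each label seen among the neighbors.
            totals = {}
            for neighbor, weight in adjacent.items():
                label = labels[neighbor]
                totals[label] = totals.get(label, 0) + weight
            best = max(totals.values())
            # Assumed tie-break: the alphabetically smallest label wins.
            new_label = min(l for l, t in totals.items() if t == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels

# Hypothetical toy graph: two tightly connected trios joined by one weak link.
passes = {
    "guard1": {"guard2": 10, "guard3": 10},
    "guard2": {"guard1": 10, "guard3": 10},
    "guard3": {"guard1": 10, "guard2": 10, "big1": 1},
    "big1": {"guard3": 1, "big2": 10, "big3": 10},
    "big2": {"big1": 10, "big3": 10},
    "big3": {"big1": 10, "big2": 10},
}
labels = label_propagation(passes)
print(labels)
```

The label values themselves carry no meaning (just as 3 and 7 do not in the table above); only which nodes end up sharing a label matters.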

PageRank Algorithm

PageRank measures how important each node is within a network.
Unsurprisingly, Stephen Curry, Draymond Green and Klay Thompson are the top three.
The algorithm also reveals that Shaun Livingston and Andre Iguodala play key roles in the Warriors' passing.

name pagerank
Curry, Stephen 2.17
Green, Draymond 1.99
Thompson, Klay 1.34
Livingston, Shaun 1.29
Iguodala, Andre 1.21
Barnes, Harrison 0.86
Bogut, Andrew 0.77
Barbosa, Leandro 0.72
Speights, Marreese 0.66
Clark, Ian 0.59
Rush, Brandon 0.57
Ezeli, Festus 0.48
McAdoo, James Michael 0.27
Varejao, Anderson 0.19
Looney, Kevon 0.16
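
For intuition about where such scores come from, the following is a small sketch of the textbook PageRank power iteration on a made-up weighted passing graph. It is not the GraphX/GraphFrames implementation, and it normalizes the scores to sum to 1, so the scale differs from the table above. The pass counts are hypothetical.

```python
def pagerank(out_edges, reset_probability=0.15, tol=1e-6, max_iter=100):
    """out_edges maps each node to {target: weight}; every node needs an out-edge."""
    nodes = list(out_edges)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(max_iter):
        # Every node gets the "random jump" share first.
        new_rank = {n: reset_probability / len(nodes) for n in nodes}
        for src, targets in out_edges.items():
            total = sum(targets.values())
            for dst, weight in targets.items():
                # A node passes on its rank in proportion to its outgoing weights.
                new_rank[dst] += (1 - reset_probability) * rank[src] * weight / total
        delta = sum(abs(new_rank[n] - rank[n]) for n in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank

# Hypothetical pass counts between three players.
toy_passes = {
    "CurryStephen": {"GreenDraymond": 3, "ThompsonKlay": 1},
    "GreenDraymond": {"CurryStephen": 4},
    "ThompsonKlay": {"CurryStephen": 1},
}
ranks = pagerank(toy_passes)
for name in sorted(ranks, key=ranks.get, reverse=True):
    print(name, round(ranks[name], 3))
```

In this toy graph the player who receives passes from everyone else ends up with the highest score, which mirrors why a heavily targeted player ranks first.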

Example

  • Node size: PageRank value

  • Node color: community

  • Link width: number of passes (received and made)

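Workflow

Calling the API

I used the playerdashptpass endpoint and saved the data for every player on the team to local JSON files.
The data covers passes made during the 2015-16 regular season.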

import os

# Golden State Warriors player IDs
playerids = [201575, 201578, 2738, 202691, 101106, 2760, 2571, 203949, 203546,
             203110, 201939, 203105, 2733, 1626172, 203084]

# Call the API once per player and save each response as <playerid>.json
for playerid in playerids:
    os.system('curl "http://stats.nba.com/stats/playerdashptpass?'
        'DateFrom=&'
        'DateTo=&'
        'GameSegment=&'
        'LastNGames=0&'
        'LeagueID=00&'
        'Location=&'
        'Month=0&'
        'OpponentTeamID=0&'
        'Outcome=&'
        'PerMode=Totals&'
        'Period=0&'
        'PlayerID={playerid}&'
        'Season=2015-16&'
        'SeasonSegment=&'
        'SeasonType=Regular+Season&'
        'TeamID=0&'
        'VsConference=&'
        'VsDivision=" > {playerid}.json'.format(playerid=playerid))
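
The long query string can also be assembled from key/value pairs, which makes the parameters easier to read and change. The sketch below only builds the URL and makes no network request; build_url is a hypothetical helper name, and whether the endpoint still answers such a request is not something this sketch checks.

```python
from urllib.parse import urlencode

BASE_URL = "http://stats.nba.com/stats/playerdashptpass"

def build_url(playerid, season="2015-16"):
    """Hypothetical helper: rebuild the query string used in the curl call."""
    params = [
        ("DateFrom", ""), ("DateTo", ""), ("GameSegment", ""),
        ("LastNGames", 0), ("LeagueID", "00"), ("Location", ""),
        ("Month", 0), ("OpponentTeamID", 0), ("Outcome", ""),
        ("PerMode", "Totals"), ("Period", 0), ("PlayerID", playerid),
        ("Season", season), ("SeasonSegment", ""),
        ("SeasonType", "Regular Season"), ("TeamID", 0),
        ("VsConference", ""), ("VsDivision", ""),
    ]
    # urlencode encodes the space in "Regular Season" as "+".
    return BASE_URL + "?" + urlencode(params)

url = build_url(201939)
print(url)
```

Keeping the parameters as a list of pairs preserves their order, so the resulting URL matches the hand-written one character for character.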

JSON -> pandas DataFrame

Next, I combined the individual JSON files into a single DataFrame.

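import json
import pandas as pd

# Read each player's JSON file and keep the first result set
frames = []
for playerid in playerids:
    with open("{playerid}.json".format(playerid=playerid)) as json_file:
        parsed = json.load(json_file)['resultSets'][0]
        frames.append(
            pd.DataFrame(parsed['rowSet'], columns=parsed['headers']))

# DataFrame.append was removed in pandas 2.0, so concatenate once instead
raw = pd.concat(frames, ignore_index=True)

raw = raw.rename(columns={'PLAYER_NAME_LAST_FIRST': 'PLAYER'})

# Build an id without the comma, e.g. "Curry, Stephen" -> "CurryStephen"
raw['id'] = raw['PLAYER'].str.replace(', ', '', regex=False)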
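
The code above relies on each result set carrying a list of column names under headers and a list of rows under rowSet. The sketch below shows that shape with a made-up, heavily trimmed payload and the standard library only; the three column names are taken from the code in this article, while the row values are invented for illustration.

```python
import json

# Hypothetical, trimmed payload; a real response has many more columns and rows.
sample = json.loads("""
{"resultSets": [{
  "headers": ["PLAYER_NAME_LAST_FIRST", "PASS_TO", "PASS"],
  "rowSet": [["Curry, Stephen", "Green, Draymond", 3],
             ["Curry, Stephen", "Thompson, Klay", 1]]
}]}
""")

result = sample["resultSets"][0]
# Pair every row with the header names to get one dict per row.
records = [dict(zip(result["headers"], row)) for row in result["rowSet"]]
for record in records:
    # Same id convention as above: drop the ", " between last and first name.
    record["id"] = record["PLAYER_NAME_LAST_FIRST"].replace(", ", "")
print(records[0])
```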

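Preparing vertices and edges

GraphFrames in Spark needs its input in a specific format: a table of vertices and a table of edges. Vertices list the nodes of the graph, here the players and their IDs; edges describe the relationships between nodes. You can attach extra attributes such as a weight, but there was no obvious way to put such attributes to good use in the later analysis. The workaround used here is brute force: write one edge row for every single pass, so that the number of passes is encoded as repeated edges. (Comments and discussion are welcome.)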

# Build the vertex table: one row per player
pandas_vertices = raw[['PLAYER', 'id']].drop_duplicates()
pandas_vertices.columns = ['name', 'id']

# Build the edge table: one row per pass between two players in the data
edge_frames = []
for passer in raw['id'].drop_duplicates():
    made = raw[(raw['id'] == passer) & (raw['PASS_TO'].isin(raw['PLAYER']))]
    for receiver in made['PASS_TO'].drop_duplicates():
        # One row per (passer, receiver) pair is expected; sum() guards against more
        n_passes = int(made.loc[made['PASS_TO'] == receiver, 'PASS'].sum())
        edge_frames.append(pd.DataFrame(
            {'src': passer, 'dst': receiver.replace(', ', '')},
            index=range(n_passes)))

pandas_edges = pd.concat(edge_frames, ignore_index=True)
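
The one-row-per-pass trick can be looked at in isolation. The sketch below, in plain Python with made-up pass counts, expands weighted edges into repeated edges and then shows that counting the repeats recovers the original weights, which is why no information is lost by the expansion.

```python
from collections import Counter

# Hypothetical (src, dst, passes) rows.
weighted_edges = [
    ("CurryStephen", "GreenDraymond", 3),
    ("GreenDraymond", "CurryStephen", 2),
    ("CurryStephen", "ThompsonKlay", 1),
]

# Brute-force expansion: repeat each edge once per pass.
expanded = [(src, dst) for src, dst, n in weighted_edges for _ in range(n)]

# Counting the repeated edges gives back the weights.
recovered = Counter(expanded)
print(len(expanded), recovered)
```

The cost of this approach is size: the edge table grows with the total number of passes rather than with the number of player pairs.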

Graph analysis

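from graphframes import GraphFrame

# sqlContext is assumed to be the SQLContext predefined by the PySpark shell
vertices = sqlContext.createDataFrame(pandas_vertices)
edges = sqlContext.createDataFrame(pandas_edges)

# Analysis part
g = GraphFrame(vertices, edges)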
print("vertices")
g.vertices.show()
print("edges")
g.edges.show()
print("inDegrees")
g.inDegrees.sort('inDegree', ascending=False).show()
print("outDegrees")
g.outDegrees.sort('outDegree', ascending=False).show()
print("degrees")
g.degrees.sort('degree', ascending=False).show()
print("labelPropagation")
g.labelPropagation(maxIter=5).show()
print("pageRank")
g.pageRank(resetProbability=0.15, tol=0.01).vertices.sort(
    'pagerank', ascending=False).show()

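Network visualization

After running gsw_passing_network.py from the GitHub repository, check that the working directory contains the three files passes.csv, groups.csv and size.csv. I used the networkD3 package in R to build the cool, interactive D3 chart.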

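library(networkD3)

# Adjust this path to the directory that holds the three CSV files
setwd('/Users/yuki/Documents/code_for_blog/gsw_passing_network')
passes <- read.csv("passes.csv")
groups <- read.csv("groups.csv")
size <- read.csv("size.csv")

# forceNetwork expects zero-based node indices for link sources and targets
passes$source <- as.numeric(as.factor(passes$PLAYER))-1
passes$target <- as.numeric(as.factor(passes$PASS_TO))-1
# Scale the pass counts down so that link widths stay readable
passes$PASS <- passes$PASS/50

groups$nodeid <- groups$name
groups$name <- as.numeric(as.factor(groups$name))-1
groups$group <- as.numeric(as.factor(groups$label))-1
nodes <- merge(groups,size[-1],by="id")
# Exaggerate PageRank differences so that they show up in the node size
nodes$pagerank <- nodes$pagerank^2*100

# d3.scale.category10() is D3 v3 syntax; networkD3 releases built on D3 v4 or
# later may need JS("d3.scaleOrdinal(d3.schemeCategory10);") instead
forceNetwork(Links = passes,
             Nodes = nodes,
             Source = "source",
             fontFamily = "Arial",
             colourScale = JS("d3.scale.category10()"),
             Target = "target",
             Value = "PASS",
             NodeID = "nodeid",
             Nodesize = "pagerank",
             linkDistance = 350,
             Group = "group",
             opacity = 0.8,
             fontSize = 16,
             zoom = TRUE,
             opacityNoHover = TRUE)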

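References

This article was translated by HarryZhu with the permission of the original author, YUKI KATOH.
Original English article: http://opiateforthemass.es/ar...

As an advocate of sharism, I publish all of my text and images on the internet under a CC license. When reposting, please keep the author information and credit Harry Zhu's FinanceR column: https://segmentfault.com/blog.... If source code is involved, please cite the GitHub address: https://github.com/harryprince. WeChat: harryzhustudio. For commercial use, please contact the author.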
