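Databricks recently released GraphFrames, a Spark package that wraps graph processing in DataFrames.
I tried it out on network analysis, using the rich data on NBA.com to visualize the Golden State Warriors' passing network.
League MVP Stephen Curry received the most passes, while the team's MVP Draymond Green made the most passes.
We have seen that much of the offense starts with Curry and Green passing to each other.
Image from GIPHY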
Passes received per player (in-degree):

id | inDegree |
---|---|
CurryStephen | 3993 |
GreenDraymond | 3123 |
ThompsonKlay | 2276 |
LivingstonShaun | 1925 |
IguodalaAndre | 1814 |
BarnesHarrison | 1241 |
BogutAndrew | 1062 |
BarbosaLeandro | 946 |
SpeightsMarreese | 826 |
ClarkIan | 692 |
RushBrandon | 685 |
EzeliFestus | 559 |
McAdooJames Michael | 182 |
VarejaoAnderson | 67 |
LooneyKevon | 22 |

Passes made per player (out-degree):

id | outDegree |
---|---|
GreenDraymond | 3841 |
CurryStephen | 3300 |
IguodalaAndre | 1896 |
LivingstonShaun | 1878 |
BogutAndrew | 1660 |
ThompsonKlay | 1460 |
BarnesHarrison | 1300 |
SpeightsMarreese | 795 |
RushBrandon | 772 |
EzeliFestus | 765 |
BarbosaLeandro | 758 |
ClarkIan | 597 |
McAdooJames Michael | 261 |
VarejaoAnderson | 94 |
LooneyKevon | 36 |
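
To make the two tables concrete: with one directed edge per pass (passer to receiver), a player's out-degree is the number of passes made and the in-degree is the number of passes received. The following is a minimal sketch in plain Python; the player names `A`, `B`, `C` and the pass counts are made-up toy values, not NBA data.

```python
from collections import Counter

# Toy (passer, receiver, number_of_passes) triples; values are invented for illustration
toy_passes = [("A", "B", 3), ("B", "A", 2), ("A", "C", 1)]

out_degree = Counter()  # passes made by each player
in_degree = Counter()   # passes received by each player
for passer, receiver, n_passes in toy_passes:
    out_degree[passer] += n_passes
    in_degree[receiver] += n_passes

# Sort by degree, highest first, similar to the tables above
for player, degree in in_degree.most_common():
    print(player, degree)
```
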
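Label propagation is an algorithm for finding communities in a graph network.
Even without any pre-existing labels, it separates the players into frontcourt and backcourt quite well.

name | label |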
---|---|
Thompson, Klay | 3 |
Barbosa, Leandro | 3 |
Curry, Stephen | 3 |
Clark, Ian | 3 |
Livingston, Shaun | 3 |
Rush, Brandon | 7 |
Green, Draymond | 7 |
Speights, Marreese | 7 |
Bogut, Andrew | 7 |
McAdoo, James Michael | 7 |
Iguodala, Andre | 7 |
Varejao, Anderson | 7 |
Ezeli, Festus | 7 |
Looney, Kevon | 7 |
Barnes, Harrison | 7 |
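
The idea behind label propagation can be sketched in a few lines of plain Python. This is a simplified variant and not the GraphFrames implementation: it updates nodes one at a time in a fixed order, breaks ties alphabetically, and uses explicit edge weights in place of repeated pass edges. The node names (`g1`..`g3`, `f1`..`f3`) and weights are hypothetical.

```python
from collections import defaultdict

def label_propagation(weighted_edges, max_iter=5):
    """Asynchronous label propagation on an undirected, weighted graph."""
    neighbors = defaultdict(dict)
    for u, v, w in weighted_edges:
        neighbors[u][v] = neighbors[u].get(v, 0) + w
        neighbors[v][u] = neighbors[v].get(u, 0) + w

    # Every node starts in its own community
    labels = {node: node for node in neighbors}
    for _ in range(max_iter):
        changed = False
        for node in sorted(neighbors):
            votes = defaultdict(int)
            for other, w in neighbors[node].items():
                votes[labels[other]] += w
            # Heaviest label wins; ties go to the alphabetically smallest label
            best = min(votes, key=lambda lab: (-votes[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels

# Two tightly connected triangles joined by one weak link (toy data)
toy_edges = [
    ("g1", "g2", 5), ("g1", "g3", 5), ("g2", "g3", 5),
    ("f1", "f2", 5), ("f1", "f3", 5), ("f2", "f3", 5),
    ("g3", "f1", 1),
]
communities = label_propagation(toy_edges)
print(communities)
```

Each node repeatedly adopts the label that carries the most weight among its neighbors, so densely connected groups end up sharing one label while the weak link between them is outvoted.
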
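PageRank measures the importance of each node in a network.
Unsurprisingly, Stephen Curry, Draymond Green, and Klay Thompson are the top three.
The algorithm also reveals that Shaun Livingston and Andre Iguodala play key roles in the Warriors' passing game.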
name | pagerank |
---|---|
Curry, Stephen | 2.17 |
Green, Draymond | 1.99 |
Thompson, Klay | 1.34 |
Livingston, Shaun | 1.29 |
Iguodala, Andre | 1.21 |
Barnes, Harrison | 0.86 |
Bogut, Andrew | 0.77 |
Barbosa, Leandro | 0.72 |
Speights, Marreese | 0.66 |
Clark, Ian | 0.59 |
Rush, Brandon | 0.57 |
Ezeli, Festus | 0.48 |
McAdoo, James Michael | 0.27 |
Varejao, Anderson | 0.19 |
Looney, Kevon | 0.16 |
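
For intuition, PageRank can be approximated with a simple power iteration. The sketch below is an illustration under stated assumptions, not the GraphFrames code: it takes weighted edges, uses a reset probability of 0.15, and normalizes scores to sum to 1 (the values in the table above are not normalized that way). The node names and weights are toy values.

```python
def pagerank(weighted_edges, reset_probability=0.15, tol=1e-8, max_iter=200):
    """Power-iteration PageRank on a directed, weighted edge list."""
    nodes = sorted({n for u, v, _ in weighted_edges for n in (u, v)})
    out_weight = {n: 0.0 for n in nodes}
    for u, _, w in weighted_edges:
        out_weight[u] += w

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is spread evenly
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        base = (reset_probability + (1 - reset_probability) * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        for u, v, w in weighted_edges:
            new_rank[v] += (1 - reset_probability) * rank[u] * w / out_weight[u]
        delta = sum(abs(new_rank[n] - rank[n]) for n in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank

# Toy passing graph: "c" receives from both "a" and "b"; nobody passes to "b"
toy_edges = [("a", "c", 3), ("b", "c", 2), ("c", "a", 1)]
scores = pagerank(toy_edges)
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(node, round(score, 3))
```

A node scores highly when it receives passes from nodes that themselves score highly, which is why PageRank can surface players whose raw pass counts alone look unremarkable.
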
The interactive network chart (drawn with the networkD3 code near the end of this post) encodes the results as follows:

Node size: PageRank value
Node color: community
Link width: number of passes (received and made)
I used the playerdashptpass endpoint and saved the data for every player on the team to local JSON files.
The data covers passing records from the 2015-16 season.

    import os

    # Golden State Warriors player IDs
    playerids = [201575, 201578, 2738, 202691, 101106, 2760, 2571, 203949, 203546,
                 203110, 201939, 203105, 2733, 1626172, 203084]

    # Call the API and store each result as a JSON file
    for playerid in playerids:
        os.system('curl "http://stats.nba.com/stats/playerdashptpass?'
                  'DateFrom=&'
                  'DateTo=&'
                  'GameSegment=&'
                  'LastNGames=0&'
                  'LeagueID=00&'
                  'Location=&'
                  'Month=0&'
                  'OpponentTeamID=0&'
                  'Outcome=&'
                  'PerMode=Totals&'
                  'Period=0&'
                  'PlayerID={playerid}&'
                  'Season=2015-16&'
                  'SeasonSegment=&'
                  'SeasonType=Regular+Season&'
                  'TeamID=0&'
                  'VsConference=&'
                  'VsDivision=" > {playerid}.json'.format(playerid=playerid))

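
As an alternative to concatenating the query string by hand, the same URL can be assembled with the standard library. This is only a sketch: `build_pass_url` is a hypothetical helper name, the parameters are copied from the curl command above, and no request is actually sent.

```python
from urllib.parse import urlencode

BASE_URL = "http://stats.nba.com/stats/playerdashptpass"

def build_pass_url(player_id, season="2015-16"):
    """Build the playerdashptpass URL for one player (hypothetical helper)."""
    params = {
        "DateFrom": "", "DateTo": "", "GameSegment": "",
        "LastNGames": 0, "LeagueID": "00", "Location": "",
        "Month": 0, "OpponentTeamID": 0, "Outcome": "",
        "PerMode": "Totals", "Period": 0, "PlayerID": player_id,
        "Season": season, "SeasonSegment": "",
        "SeasonType": "Regular Season",  # urlencode turns the space into "+"
        "TeamID": 0, "VsConference": "", "VsDivision": "",
    }
    return BASE_URL + "?" + urlencode(params)

url = build_pass_url(201939)
print(url)
```

Keeping the parameters in a dictionary makes it easier to change the season or season type without touching the string formatting.
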
Next, I combined the JSON files into a single DataFrame.
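
    import json
    import pandas as pd

    frames = []
    for playerid in playerids:
        with open("{playerid}.json".format(playerid=playerid)) as json_file:
            # The first result set holds the passes made by this player
            parsed = json.load(json_file)['resultSets'][0]
            frames.append(pd.DataFrame(parsed['rowSet'], columns=parsed['headers']))

    # DataFrame.append was removed in pandas 2.0, so concatenate once instead
    raw = pd.concat(frames, ignore_index=True)
    raw = raw.rename(columns={'PLAYER_NAME_LAST_FIRST': 'PLAYER'})
    # Build an id without the comma, e.g. "Curry, Stephen" -> "CurryStephen"
    raw['id'] = raw['PLAYER'].str.replace(', ', '')
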
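
The parsing step relies on each result set carrying a `headers` list and a `rowSet` list of rows. A minimal sketch of that shape in plain Python follows; the payload is a hand-written stand-in with invented names and counts, and a real response presumably carries more columns than the three used here.

```python
import json

# Hand-written stand-in for one saved response; names and counts are invented
sample = json.loads("""
{"resultSets": [{"headers": ["PLAYER_NAME_LAST_FIRST", "PASS_TO", "PASS"],
                 "rowSet": [["Doe, Jane", "Roe, Richard", 12],
                            ["Doe, Jane", "Poe, Alex", 7]]}]}
""")

parsed = sample["resultSets"][0]
# Pair each row with the header names to get one dict per passer/receiver pair
rows = [dict(zip(parsed["headers"], row)) for row in parsed["rowSet"]]
for row in rows:
    # Same id convention as above: drop the ", " between last and first name
    row["id"] = row["PLAYER_NAME_LAST_FIRST"].replace(", ", "")

print(rows[0])
```
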
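GraphFrames in Spark needs the data in a specific format: a table of vertices and a table of edges. The vertices list the nodes of the graph together with their IDs (here, the players), and the edges describe the relationships between nodes. You can attach extra attributes such as a weight, but it is not obvious how to make the later analyses take advantage of them. The workaround used below is brute force: write one edge row for every single pass. (Suggestions are welcome in the comments.)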

    # Build the vertex table
    pandas_vertices = raw[['PLAYER', 'id']].drop_duplicates()
    pandas_vertices.columns = ['name', 'id']

    # Build the edge table: one row per pass (brute force)
    edge_frames = []
    for passer in raw['id'].drop_duplicates():
        # Only keep receivers who are also in the player list
        on_roster = raw[(raw['PASS_TO'].isin(raw['PLAYER'])) & (raw['id'] == passer)]
        for receiver in on_roster['PASS_TO'].drop_duplicates():
            n_passes = int(raw[(raw['id'] == passer) &
                               (raw['PASS_TO'] == receiver)]['PASS'].values[0])
            edge_frames.append(pd.DataFrame(
                {'passer': passer, 'receiver': receiver.replace(', ', '')},
                index=range(n_passes)))

    # DataFrame.append was removed in pandas 2.0, so concatenate once instead
    pandas_edges = pd.concat(edge_frames, ignore_index=True)
    pandas_edges.columns = ['src', 'dst']

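
Stripped of the pandas details, the brute-force step turns each (passer, receiver, count) record into `count` identical edge rows, so that plain edge counting reproduces the pass totals. A small sketch with invented ids and counts; `expand_edges` is a hypothetical helper name.

```python
def expand_edges(pass_counts):
    """Repeat each (src, dst) pair once per pass."""
    edges = []
    for (src, dst), n_passes in pass_counts.items():
        edges.extend([(src, dst)] * n_passes)
    return edges

# Invented pass counts keyed by (passer id, receiver id)
toy_counts = {("DoeJane", "RoeRichard"): 3, ("RoeRichard", "DoeJane"): 2}
edges = expand_edges(toy_counts)
print(len(edges), edges[:3])
```

The cost is that the edge table grows with the total number of passes rather than the number of player pairs, which is fine for one team but scales poorly.
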

    from graphframes import GraphFrame

    # sqlContext is assumed to be provided by the PySpark session
    vertices = sqlContext.createDataFrame(pandas_vertices)
    edges = sqlContext.createDataFrame(pandas_edges)

    # Analysis part
    g = GraphFrame(vertices, edges)
    print("vertices")
    g.vertices.show()
    print("edges")
    g.edges.show()
    print("inDegrees")
    g.inDegrees.sort('inDegree', ascending=False).show()
    print("outDegrees")
    g.outDegrees.sort('outDegree', ascending=False).show()
    print("degrees")
    g.degrees.sort('degree', ascending=False).show()
    print("labelPropagation")
    g.labelPropagation(maxIter=5).show()
    print("pageRank")
    g.pageRank(resetProbability=0.15, tol=0.01).vertices.sort(
        'pagerank', ascending=False).show()

When you run gsw_passing_network.py from the GitHub repository, check that passes.csv, groups.csv, and size.csv are in your working directory. I used the networkD3 package in R to build the cool, interactive D3 chart.

    library(networkD3)

    # Adjust this path to the directory that holds the three CSV files
    setwd('/Users/yuki/Documents/code_for_blog/gsw_passing_network')
    passes <- read.csv("passes.csv")
    groups <- read.csv("groups.csv")
    size <- read.csv("size.csv")

    # Convert player names to zero-based integer indices
    passes$source <- as.numeric(as.factor(passes$PLAYER))-1
    passes$target <- as.numeric(as.factor(passes$PASS_TO))-1
    # Scale the pass counts down; Value sets the link width
    passes$PASS <- passes$PASS/50

    groups$nodeid <- groups$name
    groups$name <- as.numeric(as.factor(groups$name))-1
    groups$group <- as.numeric(as.factor(groups$label))-1
    nodes <- merge(groups,size[-1],by="id")
    # Rescale PageRank for use as the node size
    nodes$pagerank <- nodes$pagerank^2*100

    forceNetwork(Links = passes,
                 Nodes = nodes,
                 Source = "source",
                 fontFamily = "Arial",
                 colourScale = JS("d3.scale.category10()"),
                 Target = "target",
                 Value = "PASS",
                 NodeID = "nodeid",
                 Nodesize = "pagerank",
                 linkDistance = 350,
                 Group = "group",
                 opacity = 0.8,
                 fontSize = 16,
                 zoom = TRUE,
                 opacityNoHover = TRUE)

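
The `as.numeric(as.factor(...))-1` idiom in the R code maps each name to a zero-based integer, because the link table refers to nodes by position. The sketch below shows the same mapping in Python with toy names; `zero_based_ids` is a hypothetical helper. One assumption worth flagging: the R code factors the source and target columns separately, which only lines up when both columns contain the same set of names, so this sketch builds a single shared mapping instead.

```python
def zero_based_ids(names):
    """Map each distinct name to its position in sorted order, starting at 0."""
    return {name: index for index, name in enumerate(sorted(set(names)))}

# Toy passer and receiver columns (invented names)
sources = ["B", "A", "A"]
targets = ["A", "C", "B"]

# One shared mapping keeps source and target indices consistent
ids = zero_based_ids(sources + targets)
links = [(ids[s], ids[t]) for s, t in zip(sources, targets)]
print(ids, links)
```
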
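This article was translated by Harry Zhu with the permission of the original author, Yuki Katoh.
English original: http://opiateforthemass.es/ar...

As a believer in sharism, I publish all of my text and images on the internet under a CC license. When reposting, please keep the author information and credit Harry Zhu's FinanceR column: https://segmentfault.com/blog... If source code is involved, please cite the GitHub address: https://github.com/harryprince. WeChat: harryzhustudio. For commercial use, please contact the author.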