Machine Learning Competition Write-up: NFL Big Data Bowl (Part 1)

Date: 2020-01-31


Kaggle competition write-up: NFL Big Data Bowl, Part 1

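Competition overview

The NFL Big Data Bowl is held once a year. This year's task is to use a static snapshot of the players on both teams to predict how many yards a given play will gain, and to express that prediction as a probability distribution.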

Competition link

https://www.kaggle.com/c/nfl-big-data-bowl-2020

Project link. The notebook is public, so you can copy it and run it directly.

https://www.kaggle.com/holoong9291/nfl-big-data-bowl

GitHub repository link. More of my notes and the problems I ran into along the way are in my GitHub repository.

https://github.com/NemoHoHaloAi/Competition/tree/master/kaggle/Top61%25-0.01404-zzz-NFL-Big-Data-Bowl

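Some basic concepts

American football: the offense tries to advance the ball by running and passing until it reaches the opponent's end zone, which scores a touchdown. The defense does the opposite: it does everything it can to stop the advance and, where possible, take the ball away.
The field is 120 yards long (109.728 m) and 53 1/3 yards wide (48.768 m), so its perimeter is 316.992 m.
Players: 22 on the field in total, 11 on offense and 11 on defense. The offense has the ball.
Downs: the offense gets four attempts (downs) to advance at least ten yards.
Offense: its job is to gain 10 yards or score a touchdown within those four downs, which earns a fresh set of four downs. Otherwise it has to give up possession.
Defense: the reverse. It tries to stop the advance, and a takeaway is even better because possession changes immediately.
handoff: the ball is handed directly to a teammate, usually from the QB to the RB (this is not a forward pass).
snap: the center passes the ball backward to start the play.
Football basics: the original post linked a primer here.
QB: quarterback, usually the player who receives the snap and the center of the pocket formation. There are also dual-threat QBs who both run and pass, such as Lamar Jackson. The current model of a classic pocket QB is Tom Brady of the New England Patriots (NE).
RB: running back, who sprints and evades tacklers after the snap, trying to take the ball from his own QB and carry it as far as possible.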

Figure: field yard-line diagram

Figure: a typical pre-snap alignment

Data fields and plots

Field descriptions:

GameId - a unique game identifier
PlayId - a unique play identifier
Team - home or away
X - player position along the long axis of the field. See figure below.
Y - player position along the short axis of the field. See figure below.
S - speed in yards/second
A - acceleration in yards/second^2
Dis - distance traveled from prior time point, in yards
Orientation - orientation of player (deg), i.e. the direction the player is facing
Dir - angle of player motion (deg)
NflId - a unique identifier of the player
DisplayName - player's name
JerseyNumber - jersey number
Season - year of the season
YardLine - the yard line of the line of scrimmage
Quarter - game quarter (1-5, 5 == overtime)
GameClock - time on the game clock
PossessionTeam - team with possession
Down - the down (1-4), i.e. which of the four attempts this play is
Distance - yards needed for a first down
FieldPosition - which side of the field the play is happening on
HomeScoreBeforePlay - home team score before play started
VisitorScoreBeforePlay - visitor team score before play started
NflIdRusher - the NflId of the rushing player
OffenseFormation - offense formation
OffensePersonnel - offensive team positional grouping
DefendersInTheBox - number of defenders lined up near the line of scrimmage, spanning the width of the offensive line
DefensePersonnel - defensive team positional grouping
PlayDirection - direction the play is headed
TimeHandoff - UTC time of the handoff
TimeSnap - UTC time of the snap
Yards - the yardage gained on the play (you are predicting this); this is the target
PlayerHeight - player height (ft-in)
PlayerWeight - player weight (lbs)
PlayerBirthDate - birth date (mm/dd/yyyy), from which age can be derived
PlayerCollegeName - where the player attended college
Position - the player's position (the specific role on the field that they typically play)
HomeTeamAbbr - home team abbreviation
VisitorTeamAbbr - visitor team abbreviation
Week - week into the season
Stadium - stadium where the game is being played
Location - city where the game is being played
StadiumType - description of the stadium environment
Turf - description of the field surface
GameWeather - description of the game weather
Temperature - temperature (deg F)
Humidity - humidity
WindSpeed - wind speed in miles/hour
WindDirection - wind direction

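Problem definition

This is a regression problem whose target is the yards gained, but the final prediction has to be converted into a probability distribution over yardage (the submission asks for the cumulative probability of each integer yardage from -99 to 99).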

Evaluation Function

Continuous Ranked Probability Score (CRPS)
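
To make the metric concrete: as I understand the competition's definition, the score is the mean squared difference between the predicted cumulative distribution P(Yards <= n) and the step function that jumps from 0 to 1 at the true yardage, averaged over all plays and over the 199 integer values n = -99..99. Lower is better. The following is a minimal NumPy sketch of that calculation. The function name `crps` and the array layout are chosen here for illustration and are not taken from the original project.

```python
import numpy as np

# The 199 integer yardage values scored by the competition
YARDS = np.arange(-99, 100)

def crps(cdf_pred, y_true):
    """Mean squared gap between predicted CDFs and the true step-function CDFs.

    cdf_pred: shape (n_plays, 199); row m holds P(Yards <= n) for n = -99..99
    y_true:   shape (n_plays,); the yards actually gained on each play
    """
    cdf_pred = np.asarray(cdf_pred, dtype=float)
    y_true = np.asarray(y_true).reshape(-1, 1)
    # Step function: 0 below the true yardage, 1 at and above it
    cdf_true = (YARDS.reshape(1, -1) >= y_true).astype(float)
    return float(np.mean((cdf_pred - cdf_true) ** 2))
```

A prediction that puts all its mass exactly on the true yardage scores 0, and any probability placed on the wrong side of the true value is penalised quadratically.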

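Project walkthrough

Converting model output into a probability distribution

The competition does not ask for a single yardage value but for the probability distribution over yardages, that is, the probability of every possible yardage on a play, so a class is needed to convert the model's output into that form.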
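
The author's own conversion class is not reproduced in this text, so the following is only one possible sketch of such a converter, not the original implementation. It assumes the model outputs a single predicted yardage per play and spreads that point prediction into a cumulative distribution by adding the residuals seen on training data. The class name `YardsToCdf` and its methods are hypothetical.

```python
import numpy as np

class YardsToCdf:
    """Hypothetical helper: turns point predictions of yards into 199-step CDFs."""

    def __init__(self, low=-99, high=99):
        self.grid = np.arange(low, high + 1)   # yardage values n = -99..99
        self.residuals = np.array([0.0])       # unfitted: a pure step function

    def fit(self, y_true, y_pred):
        # Keep the training residuals; they describe how far off the model tends to be
        self.residuals = np.asarray(y_true, float) - np.asarray(y_pred, float)
        return self

    def transform(self, y_pred):
        y_pred = np.asarray(y_pred, float).reshape(-1, 1, 1)
        # Candidate outcomes for each play: prediction + every stored residual
        outcomes = y_pred + self.residuals.reshape(1, 1, -1)   # (n_plays, 1, n_resid)
        grid = self.grid.reshape(1, -1, 1)                     # (1, 199, 1)
        # P(Yards <= n) = share of candidate outcomes at or below n
        return (outcomes <= grid).mean(axis=2)                 # (n_plays, 199)
```

Each output row is non-decreasing and reaches 1 once every candidate outcome has been passed. The intermediate array grows with plays x 199 x residuals, so for a full training set the residuals would need to be subsampled or binned first.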

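Missing value handling

Missing values are not a serious problem in the training data. The affected fields are the ones listed below.

Each type of field is handled differently: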

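Weather-related fields: weather is continuous over time, so forward fill is a reasonable choice.
StadiumType: strictly speaking, the correct values should be looked up via Baidu, Google, and so on, but Baidu has very little NFL information and I could not find them on Google either, so the most frequent value is used.
FieldPosition: this case differs from the two above. Analysis of the data shows that the field is missing when the ball is snapped at midfield, where neither half can be named, so it is filled with a special value, "Middle".
OffenseFormation: the offensive formation. Only 5 rows are missing, so they are filled with the most frequent value.
DefendersInTheBox: the number of defenders near the line of scrimmage. Inspecting the data shows that it can be filled using the team, the opponent, and the defensive personnel grouping.
Orientation (the direction the player faces, in degrees) and Dir (the direction of motion, in degrees): only one row is missing, and that player took the field normally, so this is most likely a recording gap. Fill with the mean.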
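
The rules above can be sketched with pandas roughly as follows. This is an illustration under stated assumptions, not the author's code: the function name `fill_missing` is hypothetical, the list of weather columns and the grouping keys for DefendersInTheBox are guesses based on the description, and the rows are assumed to already be in chronological order so that forward fill makes sense.

```python
import pandas as pd

def fill_missing(df):
    """Per-field imputation following the rules described above; returns a copy."""
    df = df.copy()

    # Weather fields: forward-fill from the previous row
    # (assumption: these three columns; a leading gap would stay empty)
    weather_cols = ["GameWeather", "Temperature", "Humidity"]
    df[weather_cols] = df[weather_cols].ffill()

    # Categorical fields with few gaps: use the most frequent value
    for col in ["StadiumType", "OffenseFormation"]:
        df[col] = df[col].fillna(df[col].mode().iloc[0])

    # FieldPosition is missing at midfield: use a dedicated label
    df["FieldPosition"] = df["FieldPosition"].fillna("Middle")

    # DefendersInTheBox: median of rows with the same team and defensive
    # personnel (assumed grouping keys), then the overall median as a fallback
    keys = ["PossessionTeam", "DefensePersonnel"]
    group_median = df.groupby(keys)["DefendersInTheBox"].transform("median")
    df["DefendersInTheBox"] = (
        df["DefendersInTheBox"]
        .fillna(group_median)
        .fillna(df["DefendersInTheBox"].median())
    )

    # Orientation / Dir: a single recording gap, filled with the column mean
    for col in ["Orientation", "Dir"]:
        df[col] = df[col].fillna(df[col].mean())
    return df
```

Working on a copy keeps the raw frame available for comparison, which is handy for checking how many values each rule actually filled.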

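Handling anomalies and duplicates

StadiumType: some values are spelled differently but mean the same thing. They need to be cleaned up and merged so they do not confuse the model.
PossessionTeam is sometimes neither HomeTeamAbbr nor VisitorTeamAbbr. This occurs in 120 games.
Turf: the turf field is cleaned up as well.
Location: this field also has different values with the same meaning, which need to be merged.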
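
A sketch of what this kind of normalization might look like is given below. Everything in it is an assumption for illustration: the abbreviation pairs in `TEAM_FIXES` are the mismatches I believe exist in this dataset and should be verified against the data, and the keyword rules in the hypothetical `clean_stadium_type` cover only a few obvious spellings.

```python
import pandas as pd

# Assumed mismatches between PossessionTeam and the Home/Visitor abbreviations
TEAM_FIXES = {"ARZ": "ARI", "BLT": "BAL", "CLV": "CLE", "HST": "HOU"}

def clean_stadium_type(value):
    """Collapse spelling variants of StadiumType into a few labels (illustrative rules)."""
    if pd.isna(value):
        return value
    text = str(value).strip().lower()
    # Open-air variants, including common misspellings of "outdoor"
    if any(key in text for key in ("out", "open", "oudoor", "ourdoor")):
        return "outdoor"
    if any(key in text for key in ("indoor", "dome", "closed")):
        return "indoor"
    return "other"

def normalize(df):
    df = df.copy()
    # Use one abbreviation per team in every column that names a team
    for col in ["PossessionTeam", "HomeTeamAbbr", "VisitorTeamAbbr", "FieldPosition"]:
        df[col] = df[col].replace(TEAM_FIXES)
    df["StadiumType"] = df["StadiumType"].map(clean_stadium_type)
    return df
```

After normalization, a quick check is that PossessionTeam equals either HomeTeamAbbr or VisitorTeamAbbr on every row. Turf and Location can be handled with the same pattern of a small mapping applied to lower-cased text.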

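EDA: exploratory data analysis

The figure below, drawn with matplotlib, shows several offensive and defensive plays from one game. The black triangle is the QB, red marks the offense, and light blue marks the defense:

You can clearly see how the alignment differs from play to play and how the drive progresses. For comparison, I also wrote up my notes on an NFL game, Patriots vs. Ravens, a head-on meeting between a veteran QB and a young one that was great to watch.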

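Feature engineering

My own knowledge of football is limited (I strongly recommend the film The Blind Side), so the feature engineering is not particularly strong, and the Top 61% finish reflects that. I still think it is useful as a reference, so below I explain the purpose of each new feature and my reasoning.

WindSpeed, WindDirection: intuitively these should have little effect on the play. Some passers may prefer a tailwind or a headwind, but the effect should be small, so I dropped both.
PlayerHeight: the ft-in string is converted to a numeric height. Height clearly matters in this game.
PlayerBirthDate: the birth date is converted to age, which indicates whether a player is at his physical peak.
Snap-to-handoff time (TimeHandoff - TimeSnap): I believe the length of this interval partly reflects the choice of tactics, and tactics certainly affect the yards gained.
Elapsed game time (15 - GameClock + Quarter*15, in minutes): how long the game has been going strongly affects the players' stamina and the choice of tactics. Because Quarter starts at 1, this value is the elapsed time shifted by a constant 15 minutes.
Position_XX: the number of players at each position on the field for the current play, which is also closely tied to the choice of tactics.
Goal zone: the line of scrimmage is at or inside the opponent's 10-yard line, so a touchdown is no more than 10 yards away. In this situation the tactics of both the offense and the defense usually change.
First-down danger: this is my own definition. When the offense has only one attempt left and needs more than 5 yards, I consider the first down to be in danger, because possession is likely to be lost. That is, Down is 4 and Distance is greater than 5.
Yards to the end zone: the defensive strategy usually changes with this distance. It tends to be more conservative when the end zone is far away and more aggressive when it is close.
All remaining object-typed features are label encoded.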
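
Most of the features above can be derived with a few lines of pandas. The sketch below is an illustration, not the author's code. The new column names and helper functions are hypothetical, GameClock is assumed to be an "mm:ss:00" string giving the time left in the quarter, YardLine is assumed to be counted from the goal line of the side named in FieldPosition, and the elapsed time uses (Quarter - 1) so that it starts at zero, which differs from the formula in the text by a constant.

```python
import pandas as pd

def height_to_inches(ft_in):
    """Convert a 'feet-inches' string such as '6-2' to total inches."""
    feet, inches = ft_in.split("-")
    return int(feet) * 12 + int(inches)

def clock_to_minutes(clock):
    """Convert 'mm:ss:00' to minutes left in the quarter (assumed format)."""
    minutes, seconds = clock.split(":")[:2]
    return int(minutes) + int(seconds) / 60

def add_features(df):
    df = df.copy()
    df["HeightInches"] = df["PlayerHeight"].map(height_to_inches)

    snap = pd.to_datetime(df["TimeSnap"], utc=True)
    handoff = pd.to_datetime(df["TimeHandoff"], utc=True)
    birth = pd.to_datetime(df["PlayerBirthDate"], format="%m/%d/%Y", utc=True)
    df["Age"] = (snap - birth).dt.days / 365.25            # age in years at the snap
    df["SnapToHandoff"] = (handoff - snap).dt.total_seconds()

    # Minutes played so far: full quarters completed + time used in this quarter
    used = 15 - df["GameClock"].map(clock_to_minutes)
    df["ElapsedMinutes"] = (df["Quarter"] - 1) * 15 + used

    # Distance to the opponent's end zone, depending on which half the ball is in
    own_half = df["FieldPosition"] == df["PossessionTeam"]
    df["YardsToEndZone"] = df["YardLine"].where(~own_half, 100 - df["YardLine"])
    df["GoalZone"] = (df["YardsToEndZone"] <= 10).astype(int)
    df["FirstDownDanger"] = ((df["Down"] == 4) & (df["Distance"] > 5)).astype(int)

    # Position_XX: how many players of each position are on the field per play
    counts = pd.crosstab(df["PlayId"], df["Position"]).add_prefix("Position_")
    return df.join(counts, on="PlayId")
```

The position counts are computed once per play and then joined back onto every player row, so all 22 rows of a play share the same Position_XX values.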