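The NFL Big Data Bowl is held once a year. This year's task is to use static data on the players of both teams to predict how many yards a given play will gain, and to express that prediction as a probability distribution.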
https://www.kaggle.com/c/nfl-big-data-bowl-2020
https://www.kaggle.com/holoong9291/nfl-big-data-bowl
Field information:

GameId - a unique game identifier
PlayId - a unique play identifier
Team - home or away
X - player position along the long axis of the field. See figure below.
Y - player position along the short axis of the field. See figure below.
S - speed in yards/second
A - acceleration in yards/second^2
Dis - distance traveled from prior time point, in yards
Orientation - orientation of player (deg), i.e. the direction the player is facing
Dir - angle of player motion (deg), i.e. the direction the player is moving
NflId - a unique identifier of the player
DisplayName - player's name
JerseyNumber - jersey number
Season - year of the season
YardLine - the yard line of the line of scrimmage
Quarter - game quarter (1-5, 5 == overtime)
GameClock - time on the game clock
PossessionTeam - team with possession
Down - the down (1-4)
Distance - yards needed for a first down
FieldPosition - which side of the field the play is happening on
HomeScoreBeforePlay - home team score before play started
VisitorScoreBeforePlay - visitor team score before play started
NflIdRusher - the NflId of the rushing player
OffenseFormation - offense formation
OffensePersonnel - offensive team positional grouping
DefendersInTheBox - number of defenders lined up near the line of scrimmage, spanning the width of the offensive line
DefensePersonnel - defensive team positional grouping
PlayDirection - direction the play is headed
TimeHandoff - UTC time of the handoff
TimeSnap - UTC time of the snap
Yards - the yardage gained on the play (you are predicting this)
PlayerHeight - player height (ft-in)
PlayerWeight - player weight (lbs)
PlayerBirthDate - birth date (mm/dd/yyyy), from which age can be derived
PlayerCollegeName - where the player attended college
Position - the player's position (the specific role on the field that they typically play)
HomeTeamAbbr - home team abbreviation
VisitorTeamAbbr - visitor team abbreviation
Week - week into the season
Stadium - stadium where the game is being played
Location - city where the game is being played
StadiumType - description of the stadium environment
Turf - description of the field surface
GameWeather - description of the game weather
Temperature - temperature (deg F)
Humidity - humidity
WindSpeed - wind speed in miles/hour
WindDirection - wind direction

This is a regression problem: the target is the yardage, but the final result has to be converted into a conditional probability distribution.
The evaluation metric is the Continuous Ranked Probability Score (CRPS).
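The competition does not ask for a single yardage value. It asks for the probability distribution over yardages, that is, the probability of every possible yardage on a play, so a conversion class is needed.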
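A minimal sketch of such a conversion, together with the CRPS it is scored by. It assumes the competition's yardage grid of -99 to 99 (199 values) and turns each point prediction into a 0/1 step function for P(Y <= n). The names `YARDS_GRID`, `yards_to_step_cdf` and `crps` are made up for illustration and are not the conversion class from the original post.

```python
import numpy as np

# Yardage grid assumed from the competition format: -99 ... 99 (199 values).
YARDS_GRID = np.arange(-99, 100)


def yards_to_step_cdf(yards):
    """Turn point predictions into step CDFs: P(Y <= n) for every n in the grid."""
    yards = np.asarray(yards, dtype=float).reshape(-1, 1)
    # Entry (m, n) is 1.0 when grid value n is at or above prediction m.
    return (YARDS_GRID.reshape(1, -1) >= yards).astype(float)


def crps(pred_cdf, true_yards):
    """Mean squared gap between predicted CDFs and the true step functions."""
    true_cdf = yards_to_step_cdf(true_yards)
    return float(np.mean((np.asarray(pred_cdf) - true_cdf) ** 2))


pred = yards_to_step_cdf([5])   # model predicted 5 yards
score = crps(pred, [3])         # the play actually gained 3 yards
print(pred.shape, score)
```

A hard step is the simplest choice. Spreading the probability mass over neighbouring yardages is a common refinement that this sketch does not attempt.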
Judging from the training data, missing values are not a serious problem and are confined to a subset of fields.
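One way to list the affected fields with pandas is sketched below. The helper name `missing_summary` and the tiny demo frame are invented for illustration; only the column names come from the field list.

```python
import pandas as pd


def missing_summary(df):
    """Return the missing-value count for each column that has any, largest first."""
    counts = df.isnull().sum()
    return counts[counts > 0].sort_values(ascending=False)


# Made-up rows, only to show the shape of the output.
missing_demo = pd.DataFrame({
    "Temperature": [63.0, None, 70.0],
    "WindSpeed": [None, None, 8.0],
    "Yards": [3, 5, -1],
})
summary = missing_summary(missing_demo)
print(summary)
```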
Missing values are handled differently depending on the type of field.
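The specific rules used in the original post are not shown, so the following is only one plausible type-based scheme, stated as an assumption: numeric columns get the column median, and all other columns get an `"unknown"` placeholder. The function name `fill_missing_by_type` is hypothetical.

```python
import pandas as pd


def fill_missing_by_type(df):
    """Fill numeric columns with the median and other columns with a placeholder."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isnull().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            # Assumed placeholder for categorical/text fields.
            out[col] = out[col].fillna("unknown")
    return out


# Made-up rows using column names from the field list.
fill_demo = pd.DataFrame({
    "Temperature": [60.0, None, 70.0],
    "StadiumType": ["Outdoor", None, "Dome"],
})
filled = fill_missing_by_type(fill_demo)
print(filled)
```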
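Several offensive and defensive plays from a single game were plotted with matplotlib: the black triangle is the QB, red is the offense, and light blue is the defense.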
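A plot of this kind could be drawn roughly as follows. This is a sketch, not the original plotting code: the boolean `OnOffense` column is a hypothetical helper (in practice it would have to be derived from `Team` and `PossessionTeam`), the demo rows are invented, and the axis limits assume a 120 by 53.3 yard field.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def plot_play(play, ax):
    """Scatter one play: offense red, defense light blue, QB as a black triangle."""
    is_qb = play["Position"] == "QB"
    offense = play[play["OnOffense"] & ~is_qb]
    defense = play[~play["OnOffense"]]
    qb = play[is_qb]
    ax.scatter(offense["X"], offense["Y"], c="red", label="offense")
    ax.scatter(defense["X"], defense["Y"], c="lightblue", label="defense")
    ax.scatter(qb["X"], qb["Y"], c="black", marker="^", s=80, label="QB")
    ax.set_xlim(0, 120)   # assumed field length in yards, end zones included
    ax.set_ylim(0, 53.3)  # assumed field width in yards
    ax.set_xlabel("X (yards)")
    ax.set_ylabel("Y (yards)")
    ax.legend()
    return ax


# Four made-up players instead of the real 22.
play_demo = pd.DataFrame({
    "X": [30.0, 32.0, 35.0, 36.0],
    "Y": [26.0, 20.0, 25.0, 30.0],
    "Position": ["QB", "RB", "LB", "CB"],
    "OnOffense": [True, True, False, False],
})
fig, ax = plt.subplots()
plot_play(play_demo, ax)
```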
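The different alignments on each play, and the way the drive advances, are clearly visible. I also wrote up notes on one NFL game, Patriots vs. Ravens, a head-to-head duel between a veteran QB and a young one. It was a great game, and the notes are worth reading alongside the plots.
Because I do not know a great deal about American football myself (I strongly recommend the film The Blind Side), the feature engineering is not very good, and the result, top 61%, reflects that. I still think the work has some reference value, so below I go through the purpose of each new feature and my reasoning.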
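Note that each row of the training data describes one player on one play, whereas the prediction is made per play. Every 22 rows therefore have to be aggregated into one, and this step produces a number of summary-statistic features. A brief outline of the whole process follows.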
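The 22-rows-to-1 aggregation can be sketched with a pandas `groupby` on `PlayId`. Which fields are kept and which statistics are computed are assumptions made for illustration, and the demo uses two rows per play instead of 22 to stay short.

```python
import pandas as pd


def aggregate_plays(df):
    """Collapse per-player rows into one row per PlayId."""
    grouped = df.groupby("PlayId")
    # Play-level fields are identical on every row of a play, so take the first.
    play_level = grouped[["Yards", "Down", "Distance"]].first()
    # Player-level fields become summary statistics across the players.
    player_stats = grouped[["S", "A"]].agg(["mean", "std", "max"])
    player_stats.columns = [f"{col}_{stat}" for col, stat in player_stats.columns]
    return play_level.join(player_stats).reset_index()


# Made-up rows: two plays, two players each.
rows_demo = pd.DataFrame({
    "PlayId": [1, 1, 2, 2],
    "Yards": [3, 3, -1, -1],
    "Down": [1, 1, 2, 2],
    "Distance": [10, 10, 7, 7],
    "S": [2.0, 4.0, 1.0, 3.0],
    "A": [1.0, 1.0, 0.5, 1.5],
})
plays = aggregate_plays(rows_demo)
print(plays)
```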
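In most cases the success of a play depends on how the quarterback performs. Apart from the quarterback himself, what matters most to that performance is the number of teammates and opponents around him, which to some extent determines how much room he has to choose from.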
The processing code for this step is long, so only part of it was excerpted.
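As a rough illustration of the idea of counting players around the quarterback (a sketch, not the original code), the function below counts teammates and opponents within a fixed radius of the QB on one play. The 5-yard radius and the name `count_players_near_qb` are assumptions, and the sketch expects exactly the rows of a single play with a player whose `Position` is `QB`.

```python
import numpy as np
import pandas as pd


def count_players_near_qb(play, radius=5.0):
    """Count teammates and opponents within `radius` yards of the QB on one play."""
    # Raises IndexError if no QB is on the field; real data would need a fallback.
    qb = play[play["Position"] == "QB"].iloc[0]
    others = play[play["NflId"] != qb["NflId"]]
    dist = np.hypot(others["X"] - qb["X"], others["Y"] - qb["Y"])
    near = others[dist <= radius]
    teammates = int((near["Team"] == qb["Team"]).sum())
    opponents = int((near["Team"] != qb["Team"]).sum())
    return teammates, opponents


# Five made-up players instead of the real 22.
qb_demo = pd.DataFrame({
    "NflId": [1, 2, 3, 4, 5],
    "Position": ["QB", "RB", "WR", "LB", "CB"],
    "Team": ["home", "home", "home", "away", "away"],
    "X": [30.0, 33.0, 30.0, 33.0, 50.0],
    "Y": [26.0, 26.0, 40.0, 28.0, 10.0],
})
near_counts = count_players_near_qb(qb_demo)
print(near_counts)
```

Computed once per play, the two counts become two columns in the aggregated one-row-per-play table.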
The test data only needs to be processed in the same way as the training data.
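That completes the data processing. What remains is modeling, hyperparameter tuning, model combination, and similar optimization. I did not put much effort into this step: the model is ExtraTreesRegressor, and because it was fitted with out-of-bag (OOB) scoring, I did not run a separate cross-validation.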
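A minimal sketch of fitting scikit-learn's `ExtraTreesRegressor` with OOB scoring follows. The data is synthetic and the hyperparameters are placeholders, not the settings from the original post. Note that `oob_score=True` requires `bootstrap=True`, which is not the default for extra-trees.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic stand-in for the aggregated per-play feature table.
rng = np.random.RandomState(0)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

model = ExtraTreesRegressor(
    n_estimators=200,   # placeholder value
    bootstrap=True,     # required: OOB samples exist only with bootstrap sampling
    oob_score=True,
    random_state=0,
)
model.fit(X, y)

# R^2 measured on the samples each tree did not see during fitting.
print(model.oob_score_)
```

The OOB score is an R^2 on point predictions, so it is a convenient sanity check rather than a substitute for the competition's CRPS.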
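Feel free to check my GitHub for anything else you might need. At the moment it mainly holds my own machine learning projects, assorted Python scripts and tools, data analysis and data mining projects, plus the developers I follow and the projects I have forked: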
https://github.com/NemoHoHaloAi