基於球員和裁判數據進行探索性數據分析實踐-大數據ML樣本集案例實戰

時間 2019-12-02

標籤基於球員裁判數據進行探索性分析實踐樣本案例實戰简体版

原文原文鏈接

1 數據簡介

數據包含球員和裁判的信息，2012-2013年的比賽數據，總共設計球員2053名，裁判3147名，特徵列表以下：ide

Variable Name:	Variable Description:
playerShort	short player ID
player	player name
club	player club
leagueCountry	country of player club (England, Germany, France, and Spain)
height	player height (in cm)
weight	player weight (in kg)
position	player position
games	number of games in the player-referee dyad
goals	number of goals in the player-referee dyad
yellowCards	number of yellow cards player received from the referee
yellowReds	number of yellow-red cards player received from the referee
redCards	number of red cards player received from the referee
photoID	ID of player photo (if available)
rater1	skin rating of photo by rater 1
rater2	skin rating of photo by rater 2
refNum	unique referee ID number (referee name removed for anonymizing purposes)
refCountry	unique referee country ID number
meanIAT	mean implicit bias score (using the race IAT) for referee country
nIAT	sample size for race IAT in that particular country
seIAT	standard error for mean estimate of race IAT
meanExp	mean explicit bias score (using a racial thermometer task) for referee country
nExp	sample size for explicit bias in that particular country
seExp	standard error for mean estimate of explicit bias measure

2 數據預處理

數據基本特徵挖掘this

# Uncomment one of the following lines and run the cell:
  df = pd.read_csv("redcard.csv.gz", compression='gzip')
  df.shape
  (146028, 28)
  
  df.head()
複製代碼

df.describe().T 
複製代碼

df.dtypesspa

playerShort       object
  player            object
  club              object
  leagueCountry     object
  birthday          object
  height           float64
  weight           float64
  position          object
  games              int64
  victories          int64
  ties               int64
  defeats            int64
  goals              int64
  yellowCards        int64
  yellowReds         int64
  redCards           int64
  photoID           object
  rater1           float64
  rater2           float64
  refNum             int64
  refCountry         int64
  Alpha_3           object
  meanIAT          float64
  nIAT             float64
  seIAT            float64
  meanExp          float64
  nExp             float64
  seExp            float64
  dtype: object
複製代碼

all_columns = df.columns.tolist()設計

all_columns

 ['playerShort',
  'player',
  'club',
  'leagueCountry',
  'birthday',
  'height',
  'weight',
  'position',
  'games',
  'victories',
  'ties',
  'defeats',
  'goals',
  'yellowCards',
  'yellowReds',
  'redCards',
  'photoID',
  'rater1',
  'rater2',
  'refNum',
  'refCountry',
  'Alpha_3',
  'meanIAT',
  'nIAT',
  'seIAT',
  'meanExp',
  'nExp',
  'seExp']
複製代碼

df['height'].mean()code
```
181.93593798236887
複製代碼
```
df['height'].mean()cdn
```
181.93593798236887
複製代碼
```
np.mean(df.groupby('playerShort').height.mean())blog
```
181.74372848007872
複製代碼
```

Tidy Dataip

df2 = pd.DataFrame({'key1':['a', 'a', 'b', 'b', 'a'],
       'key2':['one', 'two', 'one', 'two', 'one'],
       'data1':np.random.randn(5),
       'data2':np.random.randn(5)})
複製代碼

分組聚合ci

grouped = df2['data1'].groupby(df['key1'])
  grouped.mean()

  key1
  a   -0.093686
  b   -0.322711
  Name: data1, dtype: float64  
      
   player_index = 'playerShort'
   player_cols = [#'player', # drop player name, we have unique identifier
                 'birthday',
                 'height',
                 'weight',
                 'position',
                 'photoID',
                 'rater1',
                 'rater2',
                ]   

  all_cols_unique_players = df.groupby(' ').agg({col:'nunique' for col in player_cols})
  all_cols_unique_players.head()
複製代碼

all_cols_unique_players[all_cols_unique_players > 1].dropna().shape[0] == 0
    True
複製代碼

去重

def get_subgroup(dataframe, g_index, g_columns):
      g = dataframe.groupby(g_index).agg({col:'nunique' for col in g_columns})
      if g[g > 1].dropna().shape[0] != 0:
          print("Warning: you probably assumed this had all unique values but it doesn't.")
      return dataframe.groupby(g_index).agg({col:'max' for col in g_columns})

  players = get_subgroup(df, player_index, player_cols)
  players.head()
複製代碼