Exploratory Data Analysis in Practice with Player and Referee Data: A Big Data ML Sample Set Case Study

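Copyright notice: This technical column series is the author's (Qin Kaixin, 秦凱新) summary and distillation of everyday work. It shares cases drawn from real commercial environments and offers tuning advice for commercial applications, along with cluster capacity planning and related topics. Please keep following this blog series. QQ email: 1120746959@qq.com. Feel free to get in touch for any academic exchange.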

1 Data Overview

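  • The data contains information on players and referees from matches in the 2012-2013 season, covering 2,053 players and 3,147 referees in total. The features are listed below: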

    Variable Name    Variable Description
    playerShort      short player ID
    player           player name
    club             player club
    leagueCountry    country of player club (England, Germany, France, and Spain)
    height           player height (in cm)
    weight           player weight (in kg)
    position         player position
    games            number of games in the player-referee dyad
    goals            number of goals in the player-referee dyad
    yellowCards      number of yellow cards player received from the referee
    yellowReds       number of yellow-red cards player received from the referee
    redCards         number of red cards player received from the referee
    photoID          ID of player photo (if available)
    rater1           skin rating of photo by rater 1
    rater2           skin rating of photo by rater 2
    refNum           unique referee ID number (referee name removed for anonymizing purposes)
    refCountry       unique referee country ID number
    meanIAT          mean implicit bias score (using the race IAT) for referee country
    nIAT             sample size for race IAT in that particular country
    seIAT            standard error for mean estimate of race IAT
    meanExp          mean explicit bias score (using a racial thermometer task) for referee country
    nExp             sample size for explicit bias in that particular country
    seExp            standard error for mean estimate of explicit bias measure

2 Data Preprocessing

  • Exploring the basic characteristics of the data

    import numpy as np
    import pandas as pd

    # Load the gzip-compressed dataset
    df = pd.read_csv("redcard.csv.gz", compression='gzip')
    df.shape
    (146028, 28)

    df.head()

    # Summary statistics, transposed so that each column becomes a row
    df.describe().T
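To try the same loading and summary steps without the real file, here is a minimal sketch on a toy frame. The rows and the file name `toy_redcard.csv.gz` are invented for illustration and are not part of the original dataset.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the real dataset; the values are invented for illustration.
toy = pd.DataFrame({
    "playerShort": ["player-a", "player-a", "player-b"],
    "refNum": [1, 2, 1],
    "height": [180.0, 180.0, 175.0],
    "redCards": [0, 1, 0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_redcard.csv.gz")  # hypothetical file name
    toy.to_csv(path, index=False, compression="gzip")
    # Same call shape as in the post: read a gzip-compressed CSV.
    loaded = pd.read_csv(path, compression="gzip")

n_rows, n_cols = loaded.shape
# describe() summarizes numeric columns by default; .T puts one column per row.
summary = loaded.describe().T
print(n_rows, n_cols)
print(summary[["count", "mean", "min", "max"]])
```

Note that `describe()` skips the non-numeric `playerShort` column by default, which is why the summary has fewer rows than the frame has columns.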

  • df.dtypes

    playerShort       object
    player            object
    club              object
    leagueCountry     object
    birthday          object
    height           float64
    weight           float64
    position          object
    games              int64
    victories          int64
    ties               int64
    defeats            int64
    goals              int64
    yellowCards        int64
    yellowReds         int64
    redCards           int64
    photoID           object
    rater1           float64
    rater2           float64
    refNum             int64
    refCountry         int64
    Alpha_3           object
    meanIAT          float64
    nIAT             float64
    seIAT            float64
    meanExp          float64
    nExp             float64
    seExp            float64
    dtype: object

  • all_columns = df.columns.tolist()

    all_columns

    ['playerShort',
     'player',
     'club',
     'leagueCountry',
     'birthday',
     'height',
     'weight',
     'position',
     'games',
     'victories',
     'ties',
     'defeats',
     'goals',
     'yellowCards',
     'yellowReds',
     'redCards',
     'photoID',
     'rater1',
     'rater2',
     'refNum',
     'refCountry',
     'Alpha_3',
     'meanIAT',
     'nIAT',
     'seIAT',
     'meanExp',
     'nExp',
     'seExp']

  • df['height'].mean()

    181.93593798236887

  • np.mean(df.groupby('playerShort').height.mean())

    181.74372848007872

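The two means differ because each row is a player-referee dyad: a player who met many referees contributes his height many times to the row-level mean, while the grouped version counts each player once. A minimal sketch on invented toy data (not the real dataset) makes the effect visible:

```python
import pandas as pd

# Toy player-referee dyads; names and heights are invented for illustration.
dyads = pd.DataFrame({
    "playerShort": ["tall-player"] * 3 + ["short-player"],
    "refNum": [1, 2, 3, 1],
    "height": [190.0, 190.0, 190.0, 170.0],
})

# Row-level mean: the tall player counts three times, once per dyad.
row_mean = dyads["height"].mean()

# Per-player mean: collapse to one height per player first, then average.
player_mean = dyads.groupby("playerShort")["height"].mean().mean()

print(row_mean)     # 185.0
print(player_mean)  # 180.0
```

This is one motivation for splitting player-level attributes into their own table before analyzing them.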
  • Tidy Data

    df2 = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                        'key2': ['one', 'two', 'one', 'two', 'one'],
                        'data1': np.random.randn(5),
                        'data2': np.random.randn(5)})

  • Grouping and aggregation

    # Group data1 by the key1 column of the same frame
    grouped = df2['data1'].groupby(df2['key1'])
    grouped.mean()

    key1
    a   -0.093686
    b   -0.322711
    Name: data1, dtype: float64

    player_index = 'playerShort'
    player_cols = [#'player', # drop player name, we have unique identifier
                   'birthday',
                   'height',
                   'weight',
                   'position',
                   'photoID',
                   'rater1',
                   'rater2',
                  ]

    # Count the distinct values of each player-level column per player
    all_cols_unique_players = df.groupby(player_index).agg({col:'nunique' for col in player_cols})
    all_cols_unique_players.head()

    # Sanity check that these columns do not vary within a player
    all_cols_unique_players[all_cols_unique_players > 1].dropna().shape[0] == 0
    True

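One caveat about the check above is worth illustrating. Masking with `> 1` turns every count of 1 or less into NaN, and `dropna()` by default drops a row if any cell is NaN, so only players whose columns all vary survive the filter. The sketch below uses invented toy data, and `(counts > 1).any(axis=1)` is offered as one possible stricter variant, not as the author's method.

```python
import pandas as pd

# Toy data: p2 has a single height but two different positions.
toy = pd.DataFrame({
    "playerShort": ["p1", "p1", "p2", "p2"],
    "height": [180.0, 180.0, 175.0, 175.0],
    "position": ["Goalkeeper", "Goalkeeper", "Left Winger", "Center Back"],
})

counts = toy.groupby("playerShort").agg({"height": "nunique", "position": "nunique"})

# Check in the style of the post: keeps a player only if EVERY column varies.
all_columns_vary = counts[counts > 1].dropna()

# Stricter variant: flag a player if ANY column has more than one value.
any_column_varies = counts[(counts > 1).any(axis=1)]

print(all_columns_vary.shape[0] == 0)    # True, although p2 is inconsistent
print(any_column_varies.index.tolist())  # ['p2']
```

On the real data the original check returns True either way if the player columns are truly constant; the stricter form simply makes that conclusion safer.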
  • Deduplication

    def get_subgroup(dataframe, g_index, g_columns):
        """Return one row per g_index value, keeping only g_columns."""
        g = dataframe.groupby(g_index).agg({col:'nunique' for col in g_columns})
        if g[g > 1].dropna().shape[0] != 0:
            print("Warning: you probably assumed this had all unique values but it doesn't.")
        return dataframe.groupby(g_index).agg({col:'max' for col in g_columns})

    players = get_subgroup(df, player_index, player_cols)
    players.head()
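To see what the deduplicated table looks like, the following sketch runs the same helper on a small invented frame. The toy rows are assumptions for illustration; the helper body repeats the logic above so the block runs on its own.

```python
import numpy as np
import pandas as pd

def get_subgroup(dataframe, g_index, g_columns):
    """Return one row per g_index value, keeping only g_columns."""
    g = dataframe.groupby(g_index).agg({col: "nunique" for col in g_columns})
    if g[g > 1].dropna().shape[0] != 0:
        print("Warning: you probably assumed this had all unique values but it doesn't.")
    return dataframe.groupby(g_index).agg({col: "max" for col in g_columns})

# Toy dyads: p2 has no skin rating at all (all NaN).
toy = pd.DataFrame({
    "playerShort": ["p1", "p1", "p2", "p2", "p2"],
    "refNum": [1, 2, 1, 2, 3],
    "height": [180.0, 180.0, 175.0, 175.0, 175.0],
    "rater1": [0.25, 0.25, np.nan, np.nan, np.nan],
})

toy_players = get_subgroup(toy, "playerShort", ["height", "rater1"])
print(toy_players)
```

The result is indexed by `playerShort` with one row per player. Because the values are constant within each player, `max` just picks that value, and a player whose rating is missing everywhere keeps NaN.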

3 Visualizing Missing-Value Metrics

To be continued.
