Pandas: The Road to Advanced Data Analysis Competitions with DataFrame (Part 1)

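The dataset used in this article is a CSV table of football players' skill ratings and market values, with more than 60 fields. Dataset download link: dataset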

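1. DataFrame.info()

This function prints a summary of the loaded table: the number of rows, and for each column its name, non-null count, and dtype, followed by the memory usage. This is very helpful for speeding up data preprocessing.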

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('dataset/soccer/train.csv')
print(data.info())  # info() prints the summary itself and returns None, hence the trailing "None"
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10441 entries, 0 to 10440
Data columns (total 65 columns):
id                          10441 non-null int64
club                        10441 non-null int64
league                      10441 non-null int64
birth_date                  10441 non-null object
height_cm                   10441 non-null int64
weight_kg                   10441 non-null int64
nationality                 10441 non-null int64
potential                   10441 non-null int64
                   ...
dtypes: float64(12), int64(50), object(3)
memory usage: 5.2+ MB
None
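
Because `info()` writes its report to standard output and returns `None`, it can be convenient to capture the report as a string, for example to write it to a log. The following is a minimal sketch using the `buf` parameter; the small table and its values are invented for illustration and are not taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical miniature table; the values are invented for illustration.
df = pd.DataFrame({
    'height_cm': [180, 175, 190],
    'birth_date': ['01/02/96', '03/04/84', None],
})

buffer = io.StringIO()
returned = df.info(buf=buffer)  # the report goes into the buffer, not to stdout
report = buffer.getvalue()      # the same text that info() would have printed

print(report)
```

The non-null counts in the report show at a glance which columns contain missing values, here `birth_date`.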

2. DataFrame.query()

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('dataset/soccer/train.csv')
print(data.query('lw > cf'))    # filter rows with a string expression
print(data[data.lw > data.cf])  # equivalent filter with a boolean mask
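
To make the equivalence concrete, here is a small self-contained sketch. The four rows are made up; `lw` and `cf` only mimic the two columns of the same name in the dataset.

```python
import pandas as pd

# Hypothetical rows; lw and cf mimic two position-rating columns.
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'lw': [70, 55, 62, 80],
    'cf': [65, 60, 62, 75],
})

by_query = df.query('lw > cf')     # the expression is parsed from a string
by_mask = df[df['lw'] > df['cf']]  # explicit boolean mask

print(by_query.equals(by_mask))    # both approaches select the same rows
print(by_query['id'].tolist())
```

`query()` tends to read better when several conditions are combined, while the boolean mask works with column names that are not valid Python identifiers.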

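3. DataFrame.value_counts()

This function counts how many times each distinct value occurs in a column. In the example below it is called on a single column, that is, on a Series.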

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('dataset/soccer/train.csv')
print(data.work_rate_att.value_counts())
Medium    7155
High      2762
Low        524
Name: work_rate_att, dtype: int64
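
If relative frequencies are wanted instead of raw counts, `value_counts()` accepts `normalize=True`. A minimal sketch on an invented Series (the six values below are illustrative, not from the dataset):

```python
import pandas as pd

# Hypothetical column values, invented for illustration.
work_rate = pd.Series(
    ['Medium', 'High', 'Medium', 'Low', 'Medium', 'High'],
    name='work_rate_att',
)

counts = work_rate.value_counts()                # absolute counts, most frequent first
shares = work_rate.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```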

4. DataFrame.sort_values()

Sorts the rows by the values of a column (in ascending order by default) and returns the sorted table.

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('dataset/soccer/train.csv')
print(data.sort_values(['sho']).head(5))
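
Two common variations are sorting in descending order and sorting by several columns at once. The sketch below uses invented rows; only the column names `sho` and `potential` come from the dataset.

```python
import pandas as pd

# Hypothetical rows, invented for illustration.
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'sho': [60, 85, 60, 72],
    'potential': [70, 80, 75, 68],
})

# Highest shooting score first, keep the top two rows.
top_shooters = df.sort_values('sho', ascending=False).head(2)

# Sort by sho ascending; break ties by potential descending.
tie_broken = df.sort_values(['sho', 'potential'], ascending=[True, False])

print(top_shooters)
print(tie_broken)
```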

5. DataFrame.groupby()

  • Group by the nationality column, then compute the mean potential for each nationality (only the first five groups are shown).
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('dataset/soccer/train.csv')
potential_mean = data['potential'].groupby(data['nationality']).mean().head(5)
print(potential_mean)
nationality
1    74.945338
2    72.914286
3    67.892857
4    69.000000
5    70.024242
Name: potential, dtype: float64
  • Group by the two columns nationality and club, then compute the mean potential of the players in each group (here over the first 20 rows only).
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('dataset/soccer/train.csv')
potential_mean = data['potential'].head(20).groupby([data['nationality'], data['club']]).mean()
print(potential_mean)
nationality  club
1            148     76
             461     72
5            83      64
29           593     68
43           213     67
51           258     62
52           112     68
54           604     81
63           415     70
64           359     74
78           293     73
90           221     70
96           80      72
101          458     67
111          365     64
             379     83
             584     65
138          9       72
155          543     72
163          188     71
Name: potential, dtype: int64

Note that calling size() after the groupby returns the number of rows in each group.

group_sizes = data['potential'].head(200).groupby([data['nationality'], data['club']]).size()
print(group_sizes)
nationality  club
1            148     1
43           213     1
51           258     1
52           112     1
54           604     1
78           293     1
96           80      1
101          458     1
155          543     1
163          188     1
Name: potential, dtype: int64
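
The same groupings can also be written by calling `groupby()` on the DataFrame with column names and then selecting the column to aggregate. This is a sketch on a small invented table, not the article's data:

```python
import pandas as pd

# Hypothetical rows, invented for illustration.
df = pd.DataFrame({
    'nationality': [1, 1, 2, 2, 2],
    'club': [10, 10, 20, 30, 30],
    'potential': [70, 80, 60, 65, 75],
})

# Mean potential per nationality.
mean_by_nationality = df.groupby('nationality')['potential'].mean()

# Mean potential per (nationality, club) pair; the result has a MultiIndex.
mean_by_pair = df.groupby(['nationality', 'club'])['potential'].mean()

# Number of rows in each (nationality, club) group.
size_by_pair = df.groupby(['nationality', 'club']).size()

print(mean_by_nationality)
print(mean_by_pair)
print(size_by_pair)
```

Passing column names keeps the grouping keys and the aggregated column in the same table, which avoids misaligned rows when the data has been filtered beforehand.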

6. DataFrame.agg()

This function is usually called after groupby to compute one or more aggregates for each group.

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('dataset/soccer/train.csv')
potential_mean = data['potential'].head(10).groupby(data['nationality']).agg(['max', 'min'])
print(potential_mean)
             max  min
nationality          
1             76   76
43            67   67
51            62   62
52            68   68
54            81   81
78            73   73
96            72   72
101           67   67
155           72   72
163           71   71
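
When the result columns should have descriptive names, pandas also supports named aggregation, where each keyword argument maps an output column to an (input column, function) pair. A minimal sketch on invented rows; the output names `best` and `worst` are hypothetical:

```python
import pandas as pd

# Hypothetical rows, invented for illustration.
df = pd.DataFrame({
    'nationality': [1, 1, 2, 2],
    'potential': [76, 70, 67, 81],
})

# List form: one output column per function name.
summary = df.groupby('nationality')['potential'].agg(['max', 'min'])

# Named aggregation: output column = (input column, function).
named = df.groupby('nationality').agg(
    best=('potential', 'max'),
    worst=('potential', 'min'),
)

print(summary)
print(named)
```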

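7. DataFrame.apply()

Applies a function to every element of a column, or to every row or column of a table, which can greatly speed up data processing work.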

import pandas as pd
import matplotlib.pyplot as plt


# Return the year part of a player's birth date
def birth_date_deal(birth_date):
    year = birth_date.split('/')[2]
    return year

data = pd.read_csv('dataset/soccer/train.csv')
result = data['birth_date'].apply(birth_date_deal).head() 
print(result)
0    96
1    84
2    99
3    88
4    80
Name: birth_date, dtype: object

Of course, using a lambda function makes the code even more concise:

data = pd.read_csv('dataset/soccer/train.csv')
result = data['birth_date'].apply(lambda x: x.split('/')[2]).head()
print(result)
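
For simple string operations like this one, the `.str` accessor gives the same values without a Python-level function. The sketch below compares the two on an invented Series; the date strings are made up and only assume the slash-separated format implied by the output above.

```python
import pandas as pd

# Hypothetical birth dates in a slash-separated format, invented for illustration.
birth_dates = pd.Series(['02/29/96', '11/03/84', '07/15/99'], name='birth_date')

# Element-wise Python function, as in the article.
via_apply = birth_dates.apply(lambda x: x.split('/')[2])

# Vectorized string methods: split each value, then take the third piece.
via_str = birth_dates.str.split('/').str[2]

# The pieces are still strings; convert them if numbers are needed.
years = via_str.astype(int)

print(via_apply.tolist())
print(via_str.tolist())
print(years.tolist())
```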