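The dataset used in this article is a CSV table of soccer players' skill ratings and market values, with more than 60 fields. Dataset download link: dataset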
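info() prints a summary of the loaded table: the number of rows, the non-null count and dtype of every column, and the memory usage. This is a great help in speeding up data preprocessing.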
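    import pandas as pd
    import matplotlib.pyplot as plt

    data = pd.read_csv('dataset/soccer/train.csv')
    # info() prints the summary itself and returns None,
    # so wrapping it in print() also emits a trailing "None".
    print(data.info())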
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 10441 entries, 0 to 10440
    Data columns (total 65 columns):
    id             10441 non-null int64
    club           10441 non-null int64
    league         10441 non-null int64
    birth_date     10441 non-null object
    height_cm      10441 non-null int64
    weight_kg      10441 non-null int64
    nationality    10441 non-null int64
    potential      10441 non-null int64
    ...
    dtypes: float64(12), int64(50), object(3)
    memory usage: 5.2+ MB
    None
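As a small sketch of how the info() summary can feed into preprocessing, the snippet below captures the summary as text and follows it with a per-column missing-value count. The toy table and its values are made up for illustration and only stand in for the soccer CSV; the column names are borrowed from the output above.

```python
import io

import pandas as pd

# Hypothetical toy data standing in for the soccer CSV (values are made up).
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "birth_date": ["2/29/96", "8/10/84", None],
    "height_cm": [180, 175, 190],
})

# info() writes to stdout by default; buf= redirects the text so it can be kept.
buffer = io.StringIO()
toy.info(buf=buffer)
summary = buffer.getvalue()
print(summary)

# A common follow-up check: how many values are missing in each column.
missing = toy.isna().sum()
print(missing)
```

Columns whose non-null count is below the number of rows are the ones that need cleaning before any modelling.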
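query() selects the rows that satisfy a condition written as a string expression.

    import pandas as pd
    import matplotlib.pyplot as plt

    data = pd.read_csv('dataset/soccer/train.csv')
    # The two statements below are equivalent:
    # both keep the rows where lw is greater than cf.
    print(data.query('lw > cf'))
    print(data[data.lw > data.cf])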
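The equivalence of the two forms can be checked on a tiny table. This is only a sketch: the values are made up, and the threshold variable is a hypothetical addition that shows how query() can refer to a Python variable with the @ prefix.

```python
import pandas as pd

# Made-up ratings for three players; lw and cf mirror the column names used above.
toy = pd.DataFrame({"lw": [70, 55, 62], "cf": [65, 60, 62]})

# Same filter written two ways.
by_query = toy.query("lw > cf")
by_mask = toy[toy["lw"] > toy["cf"]]
print(by_query.equals(by_mask))

# query() can also reference local variables through the @ prefix.
threshold = 60  # hypothetical cut-off, chosen only for this example
above = toy.query("lw > @threshold")
print(above)
```

The string form tends to read better when several conditions are combined, while the boolean mask is easier to build programmatically.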
value_counts() counts how often each distinct value occurs in a column.
    import pandas as pd
    import matplotlib.pyplot as plt

    data = pd.read_csv('dataset/soccer/train.csv')
    print(data.work_rate_att.value_counts())
    Medium    7155
    High      2762
    Low        524
    Name: work_rate_att, dtype: int64
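When the proportions matter more than the raw counts, value_counts() accepts normalize=True. A minimal sketch on a made-up column (the five values below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up sample of the work_rate_att column.
work_rate = pd.Series(
    ["Medium", "High", "Medium", "Low", "Medium"], name="work_rate_att"
)

# Absolute frequencies, sorted from most to least common.
counts = work_rate.value_counts()
print(counts)

# Relative frequencies: each count divided by the number of non-null values.
shares = work_rate.value_counts(normalize=True)
print(shares)
```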
sort_values() sorts the rows by the values of a given column before they are printed.
    import pandas as pd
    import matplotlib.pyplot as plt

    data = pd.read_csv('dataset/soccer/train.csv')
    # Sort by sho in ascending order and show the first five rows
    print(data.sort_values(['sho']).head(5))
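sort_values() sorts in ascending order by default. The sketch below, on made-up numbers, shows a descending sort and a sort on two keys; pas is assumed here to be another rating column and is used purely as a tie-breaker.

```python
import pandas as pd

# Made-up ratings; sho follows the column used above, pas is an assumed second column.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "sho": [55, 80, 55, 72],
    "pas": [60, 70, 65, 50],
})

# Highest sho first: the two best shooters.
top = toy.sort_values("sho", ascending=False).head(2)
print(top)

# Sort by sho ascending, breaking ties by pas descending.
multi = toy.sort_values(["sho", "pas"], ascending=[True, False])
print(multi)
```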
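groupby() splits the rows into groups by the values of a column, so that a statistic can be computed per group. Here it gives the mean potential for each nationality.

    import pandas as pd
    import matplotlib.pyplot as plt

    data = pd.read_csv('dataset/soccer/train.csv')
    # Mean potential per nationality; head(5) shows the first five groups
    potential_mean = data['potential'].groupby(data['nationality']).mean().head(5)
    print(potential_mean)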
    nationality
    1    74.945338
    2    72.914286
    3    67.892857
    4    69.000000
    5    70.024242
    Name: potential, dtype: float64
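The same per-group mean is more commonly written by grouping the DataFrame by column name and then selecting the column. A short sketch on made-up values, showing that both spellings give the same result:

```python
import pandas as pd

# Made-up players: two of nationality 1, three of nationality 2.
toy = pd.DataFrame({
    "nationality": [1, 1, 2, 2, 2],
    "potential": [70, 80, 60, 66, 72],
})

# Form used above: group a Series by another Series.
series_form = toy["potential"].groupby(toy["nationality"]).mean()

# Equivalent form: group the DataFrame by column name, then pick the column.
frame_form = toy.groupby("nationality")["potential"].mean()

print(frame_form)
print(series_form.equals(frame_form))
```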
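Grouping by several columns works the same way: pass a list of keys.

    import pandas as pd
    import matplotlib.pyplot as plt

    data = pd.read_csv('dataset/soccer/train.csv')
    # Only the first 20 rows are used, to keep the output short
    potential_mean = data['potential'].head(20).groupby([data['nationality'], data['club']]).mean()
    print(potential_mean)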
    nationality  club
    1            148     76
                 461     72
    5            83      64
    29           593     68
    43           213     67
    51           258     62
    52           112     68
    54           604     81
    63           415     70
    64           359     74
    78           293     73
    90           221     70
    96           80      72
    101          458     67
    111          365     64
                 379     83
                 584     65
    138          9       72
    155          543     72
    163          188     71
    Name: potential, dtype: int64
Note that calling size() after the grouping function returns the number of rows in each group.
    # The output below has ten groups of one row each, which matches the first 10 rows
    group_sizes = data['potential'].head(10).groupby([data['nationality'], data['club']]).size()
    print(group_sizes)
    nationality  club
    1            148     1
    43           213     1
    51           258     1
    52           112     1
    54           604     1
    78           293     1
    96           80      1
    101          458     1
    155          543     1
    163          188     1
    Name: potential, dtype: int64
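size() is easy to confuse with count(). As a sketch on made-up data with one missing value: size() counts every row in a group, while count() counts only the non-null values of the selected column.

```python
import pandas as pd

# Made-up rows; the second player has no potential value recorded.
toy = pd.DataFrame({
    "nationality": [1, 1, 2],
    "club": [10, 10, 20],
    "potential": [70, None, 65],
})

grouped = toy.groupby(["nationality", "club"])

# Number of rows per (nationality, club) pair, missing values included.
sizes = grouped.size()
print(sizes)

# Number of non-null potential values per pair.
counts = grouped["potential"].count()
print(counts)
```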
agg() is usually called after groupby(), to compute one or more aggregates for each group.
    import pandas as pd
    import matplotlib.pyplot as plt

    data = pd.read_csv('dataset/soccer/train.csv')
    # Maximum and minimum potential per nationality, over the first 10 rows
    potential_mean = data['potential'].head(10).groupby(data['nationality']).agg(['max', 'min'])
    print(potential_mean)
                 max  min
    nationality
    1             76   76
    43            67   67
    51            62   62
    52            68   68
    54            81   81
    78            73   73
    96            72   72
    101           67   67
    155           72   72
    163           71   71
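agg() also supports named aggregation, where each keyword argument becomes an output column. The following is a minimal sketch on made-up values; the output names highest, lowest and average are arbitrary labels chosen for this example.

```python
import pandas as pd

# Made-up players: two per nationality.
toy = pd.DataFrame({
    "nationality": [1, 1, 2, 2],
    "potential": [76, 72, 64, 83],
})

# Each keyword names an output column and maps it to an aggregation.
stats = toy.groupby("nationality")["potential"].agg(
    highest="max",
    lowest="min",
    average="mean",
)
print(stats)
```

Compared with passing a list such as ['max', 'min'], this gives control over the column names of the result.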
apply() applies a function to a column or to a row. It saves writing explicit loops and can speed up data processing work considerably.
    import pandas as pd
    import matplotlib.pyplot as plt


    # Return the year part of a player's birth date
    def birth_date_deal(birth_date):
        year = birth_date.split('/')[2]
        return year


    data = pd.read_csv('dataset/soccer/train.csv')
    result = data['birth_date'].apply(birth_date_deal).head()
    print(result)
    0    96
    1    84
    2    99
    3    88
    4    80
    Name: birth_date, dtype: object
Of course, using a lambda function makes the code more concise:
    data = pd.read_csv('dataset/soccer/train.csv')
    result = data['birth_date'].apply(lambda x: x.split('/')[2]).head()
    print(result)
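For simple string operations like this one, the vectorized .str accessor is an alternative to apply() and is usually faster on large columns. A sketch on made-up dates, assuming only that the year is the third slash-separated field as in the output above:

```python
import pandas as pd

# Made-up birth dates with the year as the third "/"-separated field.
dates = pd.Series(["2/29/96", "8/10/84", "1/5/99"], name="birth_date")

# Element-wise Python function, as in the apply() examples above.
via_apply = dates.apply(lambda x: x.split("/")[2])

# Vectorized equivalent: split every string, then take field 2 of each result.
via_str = dates.str.split("/").str[2]

# The extracted years are still strings; convert them when numbers are needed.
years = via_str.astype(int)
print(years)
```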