Using pandas

Date: 2019-12-11

Tags: pandas, usage


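1. a = pandas.read_csv(filepath): reads the .csv file at the path filepath into a DataFrame a.

pandas.core.frame.DataFrame is the core data structure of pandas.

b = a.head(n): b holds the first n rows of the data; n defaults to 5.

b = a.tail(n): b holds the last n rows of the data; n defaults to 5.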

import pandas as pd

food_info = pd.read_csv("C:/Users/婁斌/Desktop/.ipynb_checkpoints/food_info.csv")
print(type(food_info))
a = food_info.head()
b = food_info.tail(3)
print(a)
print(b)
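
For readers without the CSV file at hand, the same calls can be tried on an in-memory CSV. This is a minimal sketch: the column names and values are made up for illustration, and it relies on read_csv accepting a file-like object as well as a path.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = (
    "name,water_g\n"
    "apple,85.6\nbread,36.4\ncheese,37.1\nrice,68.4\n"
    "milk,88.1\negg,76.2\nbutter,15.9\n"
)

df = pd.read_csv(io.StringIO(csv_text))  # file-like objects work like paths
first_rows = df.head()    # n defaults to 5
last_rows = df.tail(3)    # the last 3 rows

print(type(df))
print(first_rows)
print(last_rows)
```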

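2. Indexing and computation in pandas.

Let a be a DataFrame.

a.loc[n] selects the row labeled n; a.loc[n:m] selects rows n through m, with both ends included because loc slices by label. A list of labels can also be used as the index.
a['name'] selects the column named "name".
Addition, subtraction, multiplication, and division work element-wise, as with NumPy vectors.
a.columns.tolist() returns all column names as a list.
a['name'].max() returns the maximum value of that column.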

import pandas as pd

food_info = pd.read_csv("C:/Users/婁斌/Desktop/.ipynb_checkpoints/food_info.csv")
print(food_info.loc[3:5])
print(food_info["NDB_No"].head(3))
print(food_info["Water_(g)"].max())

[Screenshot: program output]
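
The indexing rules in this section can be checked on a small hand-built DataFrame. The sketch below uses invented column names and values; the point to notice is that a loc slice includes its end label, unlike an ordinary Python slice.

```python
import pandas as pd

# Made-up data, for illustration only.
df = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "water_g": [10.0, 40.0, 25.0, 5.0, 30.0],
})

rows_1_to_3 = df.loc[1:3]         # label slice: rows 1, 2 and 3
picked = df.loc[[0, 4]]           # a list of labels also works
doubled = df["water_g"] * 2       # element-wise arithmetic
col_names = df.columns.tolist()   # column names as a plain list
largest = df["water_g"].max()     # maximum of one column

print(rows_1_to_3)
print(col_names, largest)
```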

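The code below finds every column whose unit is g, converts it to mg, then sums those columns and adds the sum to the DataFrame as a new column.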

import pandas as pd

food_info = pd.read_csv("C:/Users/婁斌/Desktop/.ipynb_checkpoints/food_info.csv")
columns = food_info.columns.tolist()
gram_c = []
for c in columns:
    if c.endswith("(g)"):
        gram_c.append(c)
print(food_info[gram_c].head(3))
food_info[gram_c] = food_info[gram_c] * 1000  # g -> mg: multiply by 1000
print(food_info[gram_c].head(3))

food_info["sum(mg)"] = 0
for c in gram_c:
    food_info["sum(mg)"] += food_info[c]
print(food_info.head(3))
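
A more compact variant of the same idea, sketched on invented data: a list comprehension picks the gram columns, multiplying by 1000 converts grams to milligrams, and sum(axis=1) adds across each row. "Protein_(g)" and "Energ_Kcal" are assumed column names used only for illustration. Unlike the += loop, sum(axis=1) skips NaN values by default.

```python
import pandas as pd

# Made-up rows; "Protein_(g)" and "Energ_Kcal" are hypothetical columns.
df = pd.DataFrame({
    "food": ["x", "y"],
    "Water_(g)": [1.5, 2.0],
    "Protein_(g)": [0.5, 1.0],
    "Energ_Kcal": [100, 200],
})

gram_cols = [c for c in df.columns if c.endswith("(g)")]
in_mg = df[gram_cols] * 1000          # 1 g = 1000 mg
df["sum(mg)"] = in_mg.sum(axis=1)     # row-wise sum, NaN skipped by default

print(df)
```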

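3. Sorting in pandas, using the Titanic dataset

sort_values() sorts the data. With inplace=False (the default) the original dataset is left unchanged and a sorted copy is returned; with inplace=True the original dataset is replaced by the sorted result.

The code below reads titanic_train.csv and sorts it by the "Fare" column, then selects all records whose age is missing and counts them.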

import pandas as pd

titanic = pd.read_csv("C:/學習/python/hello/titanic_train.csv")
print(titanic.head(5))
titanic.sort_values("Fare", inplace=True)
print(titanic.head(5))

# Show all records whose age is missing and count them
age = titanic['Age']
age_is_null_judge = pd.isnull(age)
age_is_null = age[age_is_null_judge]
print(age_is_null)
print(len(age_is_null))
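
The difference between the two inplace settings is easy to see on a three-row example. This is a sketch with made-up values; only the column names "Fare" and "Age" come from the text.

```python
import numpy as np
import pandas as pd

# Made-up rows, for illustration only.
df = pd.DataFrame({"Fare": [30.0, 7.5, 15.0], "Age": [22.0, np.nan, 40.0]})

sorted_copy = df.sort_values("Fare")       # inplace=False: a new DataFrame
fares_after_copy = df["Fare"].tolist()     # df itself is still unsorted

df.sort_values("Fare", inplace=True)       # inplace=True: df is reordered
fares_after_inplace = df["Fare"].tolist()

missing_age = df[df["Age"].isnull()]       # rows whose Age is NaN
print(len(missing_age))
```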

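4. Data preprocessing methods

To compute the mean of an attribute, missing values have to be excluded first. The code below computes the mean of the Age attribute in the dataset.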

import pandas as pd

titanic = pd.read_csv("C:/學習/python/hello/titanic_train.csv")
age = titanic['Age']
age_is_null_judge = pd.isnull(age)    # True where Age is NaN, False otherwise
new_age = titanic['Age'][age_is_null_judge == False]  # two brackets: the first selects the column, the second applies the boolean mask
mean = sum(new_age) / len(new_age)    # built-in sum and len
print(mean)

The dropna function can also be used to drop the records in which an attribute is missing.

import pandas as pd

titanic = pd.read_csv("C:/學習/python/hello/titanic_train.csv")

new_titanic = titanic.dropna(axis=0, subset=['Age'])   # subset is a list and may name several attributes
new_age = new_titanic['Age']
print(sum(new_age)/len(new_age))

Both snippets above print

　　29.69911764705882
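
As a cross-check, the sketch below computes a mean three ways on a small made-up Series: by masking out NaN manually, with dropna(), and with the built-in Series.mean(), which skips NaN by default. The values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing value.
age = pd.Series([20.0, np.nan, 40.0, 30.0])

known = age[age.notnull()]               # boolean mask keeps non-NaN entries
manual_mean = sum(known) / len(known)    # same approach as the first snippet
dropna_mean = age.dropna().mean()        # drop NaN, then average
builtin_mean = age.mean()                # skipna=True by default

print(manual_mean, dropna_mean, builtin_mean)
```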

The following lines access the element at a given row and column of a DataFrame.

import pandas as pd
titanic = pd.read_csv("C:/學習/python/hello/titanic_train.csv")
print(titanic.loc[24, 'Age'])

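pivot_table(index='Pclass', values='Age', aggfunc=np.mean) groups the data and computes a statistic per group. The index parameter tells the function to group all records by their Pclass value, values='Age' says that Age is the attribute to aggregate within each group, and aggfunc=np.mean says to take its mean.
The code below computes the mean age of the passengers in 1st, 2nd, and 3rd class.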

import pandas as pd
import numpy as np
titanic = pd.read_csv("C:/學習/python/hello/titanic_train.csv")
mean_age = titanic.pivot_table(index='Pclass', values='Age', aggfunc=np.mean)
print(mean_age)

[Screenshot: program output]
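
A small self-contained sketch of the same grouping, on invented rows. It passes the aggregation as the string "mean" (an equivalent spelling) and compares the result with groupby, which is another way to express the same computation. Missing ages are ignored when the mean is taken.

```python
import numpy as np
import pandas as pd

# Made-up passengers; only the column names come from the text.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Age": [40.0, 50.0, 30.0, 20.0, np.nan],
})

mean_age = df.pivot_table(index="Pclass", values="Age", aggfunc="mean")
by_group = df.groupby("Pclass")["Age"].mean()   # equivalent grouping

print(mean_age)
print(by_group)
```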

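The sort_values function in pandas sorts the records, but the sorted records keep their original index values, so the rows are no longer labeled 0 through n in order. [Figure: sorted records with their original index values]

To renumber the index from 0 again, use reset_index(). By default the old index is kept as a new column named "index"; pass drop=True to discard it.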

import pandas as pd

titanic = pd.read_csv("C:/學習/python/hello/titanic_train.csv")
new_titanic = titanic.sort_values('Age')
new_titanic1 = new_titanic.reset_index()
print(new_titanic.head(5))
print(new_titanic1.head(5))

[Screenshot: program output]
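
The effect of reset_index() can be seen on three made-up rows. The sketch also shows the drop=True variant, which discards the old index instead of keeping it as a column.

```python
import pandas as pd

# Made-up ages, for illustration only.
df = pd.DataFrame({"Age": [30.0, 10.0, 20.0]})

by_age = df.sort_values("Age")              # index is now 1, 2, 0
kept = by_age.reset_index()                 # old index saved in column "index"
dropped = by_age.reset_index(drop=True)     # old index discarded

print(by_age)
print(kept)
print(dropped)
```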

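apply(func, axis) applies a user-defined function to the data. The first argument, func, is the function; with axis=0, func is called on each column, and with axis=1 it is called on each row.

The following code counts the missing values in each column.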

import pandas as pd

# Count the missing values of each attribute
titanic = pd.read_csv("C:/學習/python/hello/titanic_train.csv")


def nul_count(column):
    is_null_judge = pd.isnull(column)
    is_null = column[is_null_judge]
    return len(is_null)


column_null_count = titanic.apply(nul_count, axis=0)
print(column_null_count)

[Screenshot: program output]
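
The column-wise apply can be tried on a tiny made-up table. The sketch also shows a vectorized alternative, isnull().sum(), which should give the same counts without a custom function; the helper name null_count is chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows with some missing entries.
df = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan],
    "Cabin": [None, "C85", None],
    "Fare": [7.25, 71.28, 8.05],
})


def null_count(column):
    # column is one column of df, passed in as a Series
    return int(column.isnull().sum())


via_apply = df.apply(null_count, axis=0)   # one call per column
vectorized = df.isnull().sum()             # same counts, no custom function

print(via_apply)
```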

The following code discretizes age into adults and minors.

import pandas as pd

# Label each passenger as a minor, an adult, or unknown
titanic = pd.read_csv("C:/學習/python/hello/titanic_train.csv")


def generate_age_label(row):
    age = row['Age']
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"


age_labels = titanic.apply(generate_age_label, axis=1)
print(age_labels)

[Screenshot: program output]
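
Row-wise labeling with axis=1 can be checked on three made-up passengers, one of them with a missing age. The helper name age_label and the data are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up passengers, for illustration only.
df = pd.DataFrame({"Name": ["A", "B", "C"], "Age": [12.0, np.nan, 35.0]})


def age_label(row):
    # row is one row of df, passed in as a Series
    age = row["Age"]
    if pd.isnull(age):
        return "unknown"
    return "minor" if age < 18 else "adult"


labels = df.apply(age_label, axis=1)   # one call per row
print(labels)
```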

5. The Series structure

Let a be a DataFrame and let b be a single row or a single column of a; then b is a Series. If c = b.values, then c is a NumPy ndarray, as the code below shows.

import pandas as pd

fandango = pd.read_csv("C:/學習/python/hello/fandango_score_comparison.csv")
series_film = fandango["FILM"]
series_rt = fandango["RottenTomatoes"]
print(type(series_film))
print(series_film.head(5))
film_name = series_film.values
rt_scores = series_rt.values
print(type(rt_scores))
print(rt_scores)

[Screenshot: program output]
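
The type relationships can be confirmed with isinstance on a two-row example. The film names and scores below are placeholders, not real data.

```python
import numpy as np
import pandas as pd

# Placeholder films and scores, for illustration only.
df = pd.DataFrame({"FILM": ["Film A", "Film B"], "RottenTomatoes": [74, 85]})

column = df["RottenTomatoes"]   # one column of a DataFrame
row = df.loc[0]                 # one row of a DataFrame
scores = column.values          # underlying data of the numeric column

print(type(column), type(row), type(scores))
```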

pandas.Series(value, index) combines two ndarrays into one Series, where index supplies the labels and value supplies the data. The code below pairs each film name with its RottenTomatoes score.

import pandas as pd

fandango = pd.read_csv("C:/學習/python/hello/fandango_score_comparison.csv")
series_film = fandango["FILM"]
series_rt = fandango["RottenTomatoes"]

film_name = series_film.values
rt_scores = series_rt.values

series_custom = pd.Series(rt_scores, index=film_name)
print(type(series_custom))
print(series_custom)

[Screenshot: program output]
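
Once a Series is indexed by film name, scores can be looked up by label. A minimal sketch with placeholder names and scores:

```python
import numpy as np
import pandas as pd

# Placeholder names and scores, for illustration only.
film_name = np.array(["Film C", "Film A", "Film B"])
rt_scores = np.array([90, 97, 86])

series_custom = pd.Series(rt_scores, index=film_name)

one_score = series_custom["Film A"]                 # lookup by label
two_scores = series_custom[["Film B", "Film C"]]    # lookup by list of labels

print(one_score)
print(two_scores)
```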

A new Series can be built from an existing one by sorting its index and reindexing, as the code below shows.

import pandas as pd

fandango = pd.read_csv("C:/學習/python/hello/fandango_score_comparison.csv")
series_film = fandango["FILM"]
series_rt = fandango["RottenTomatoes"]

film_name = series_film.values
rt_scores = series_rt.values

series_custom = pd.Series(rt_scores, index=film_name)

# Sort by film name
original_index = series_custom.index.tolist()  # original_index is a list
sorted_index = sorted(original_index)
new_series_custom = series_custom.reindex(sorted_index)
print(new_series_custom)

[Screenshot: program output]
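
Sorting the labels and calling reindex can be compared with sort_index(), which does both steps in one call. The sketch uses placeholder names and scores and assumes the labels are unique.

```python
import pandas as pd

# Placeholder names and scores, for illustration only.
series_custom = pd.Series([90, 97, 86], index=["Film C", "Film A", "Film B"])

sorted_labels = sorted(series_custom.index.tolist())
reindexed = series_custom.reindex(sorted_labels)   # reorder to match the labels
via_sort_index = series_custom.sort_index()        # one-step equivalent

print(reindexed)
```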

The following code keeps only the float64 columns of the fandango table and builds a new table from them.

import pandas as pd

fandango = pd.read_csv("C:/學習/python/hello/fandango_score_comparison.csv")
types = fandango.dtypes     # a Series: the index holds column names, the values hold each column's dtype

float_column = types[types.values == 'float64'].index   # names of the float64 columns
print(type(float_column))
print(float_column)
float_df = fandango[float_column]
print(float_df.head(5))

[Screenshot: program output]
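
The dtype filter can be reproduced on a small made-up table. The sketch also shows select_dtypes, a DataFrame method that performs the same kind of selection directly; the column names are illustrative.

```python
import pandas as pd

# Made-up table with one text, one integer and one float column.
df = pd.DataFrame({
    "FILM": ["Film A", "Film B"],
    "votes": [10, 20],
    "score": [3.5, 4.0],
})

types = df.dtypes
float_cols = types[types == "float64"].index      # names of float64 columns
float_df = df[float_cols]
same = df.select_dtypes(include="float64")        # built-in equivalent

print(float_df)
```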

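Let fandango be a DataFrame. Then fandango.columns is an Index, fandango.columns.values is an ndarray, and fandango.columns.values.tolist() is a list. These type relationships are important. Likewise, a single row or column of a DataFrame is a Series, the values attribute of a Series is an ndarray, and calling tolist() on an ndarray gives a list.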

import pandas as pd

fandango = pd.read_csv("C:/學習/python/hello/fandango_score_comparison.csv")
columns = fandango.columns
columns_values = columns.values
columns_value_list = columns_values.tolist()
print(type(columns))
print(columns)
print(type(columns_values))
print(columns_values)
print(type(columns_value_list))
print(columns_value_list)

[Screenshot: program output]

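The following code computes the standard deviation of all the ratings of each film and prints it. Note that np.std returns the standard deviation; square it, or use np.var, to get the variance.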

import pandas as pd
import numpy as np

fandango = pd.read_csv("C:/學習/python/hello/fandango_score_comparison.csv")
columns = fandango.columns  # all attribute names, as an Index
columns_values = columns.values  # all attribute names, as an ndarray
new_fandango = fandango[columns_values[columns_values != 'FILM']]  # drop the film-name column so the remaining columns are numeric
result = new_fandango.apply(lambda x: np.std(x), axis=1)  # standard deviation per row; axis=1 processes rows, and x is a Series
film_name = fandango['FILM']    # film names, as a Series
result_std = pd.Series(data=result.values, index=film_name.values)
print(result_std)
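
The row-wise spread can also be computed without apply. The sketch below, on placeholder ratings, uses DataFrame.std(axis=1) and makes the ddof choice explicit: ddof=0 gives the population standard deviation (the NumPy default), while the pandas default ddof=1 gives the sample standard deviation. The column names s1, s2, s3 are hypothetical.

```python
import pandas as pd

# Placeholder films and ratings, for illustration only.
df = pd.DataFrame({
    "FILM": ["Film A", "Film B"],
    "s1": [1.0, 2.0],
    "s2": [3.0, 2.0],
    "s3": [5.0, 2.0],
})

scores = df.drop(columns="FILM")          # keep only the numeric columns
pop_std = scores.std(axis=1, ddof=0)      # population std per row
sample_std = scores.std(axis=1)           # pandas default, ddof=1
result_std = pd.Series(pop_std.values, index=df["FILM"].values)

print(result_std)
```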