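1. a = pandas.read_csv(filepath): reads the CSV file at the path filepath into a DataFrame a.
pandas.core.frame.DataFrame is the core data structure of pandas.
b = a.head(n): b holds the first n rows of a; n defaults to 5.
b = a.tail(n): b holds the last n rows of a; n defaults to 5.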
import pandas as pd

food_info = pd.read_csv("C:/Users/婁斌/Desktop/.ipynb_checkpoints/food_info.csv")
print(type(food_info))
a = food_info.head()   # first 5 rows (default)
b = food_info.tail(3)  # last 3 rows
print(a)
print(b)
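To try head and tail without the CSV file, a minimal sketch can parse an in-memory string instead. The rows here are invented for illustration and are not taken from food_info.csv; only the two column names follow the notes.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = "\n".join([
    "NDB_No,Water_(g)",
    "1001,15.87",
    "1002,15.87",
    "1003,0.24",
    "1004,42.41",
    "1005,41.11",
    "1006,48.42",
    "1007,51.8",
])

# read_csv accepts a file-like object as well as a path.
sample = pd.read_csv(io.StringIO(csv_text))
first_rows = sample.head()   # n defaults to 5
last_rows = sample.tail(3)   # last 3 rows

print(type(sample))
print(first_rows.shape)
print(last_rows.shape)
```

Both head and tail return new DataFrame objects, so the original table is left untouched.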
2. pandas indexing and computation.
Let a be a DataFrame.
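import pandas as pd

food_info = pd.read_csv("C:/Users/婁斌/Desktop/.ipynb_checkpoints/food_info.csv")
print(food_info.loc[3:5])             # rows labelled 3 through 5, both ends included
print(food_info["NDB_No"].head(3))    # first 3 values of the NDB_No column
print(food_info["Water_(g)"].max())   # largest value in the Water_(g) column
Output: [screenshot not preserved]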
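One detail worth checking by hand is that loc slices by label and includes both endpoints, while iloc slices by position and excludes the end. The sketch below uses a small invented table (only the column names come from the notes) to show the difference.

```python
import pandas as pd

# Hypothetical data; only the column names follow the original example.
df = pd.DataFrame({
    "NDB_No": [1001, 1002, 1003, 1004, 1005, 1006],
    "Water_(g)": [15.87, 15.87, 0.24, 42.41, 41.11, 48.42],
})

by_label = df.loc[3:5]       # label-based: rows 3, 4 and 5
by_position = df.iloc[3:5]   # position-based: rows 3 and 4 only
water_max = df["Water_(g)"].max()

print(len(by_label), len(by_position), water_max)
```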
The code below finds every column whose unit is g, converts it to mg, then sums those columns row by row and stores the result in a new column of the DataFrame.
import pandas as pd

food_info = pd.read_csv("C:/Users/婁斌/Desktop/.ipynb_checkpoints/food_info.csv")
columns = food_info.columns.tolist()
gram_c = []
for c in columns:
    if c.endswith("(g)"):
        gram_c.append(c)
print(food_info[gram_c].head(3))
food_info[gram_c] = food_info[gram_c] * 1000  # 1 g = 1000 mg, so multiply rather than divide
print(food_info[gram_c].head(3))

food_info["sum(mg)"] = 0
for c in gram_c:
    food_info["sum(mg)"] += food_info[c]
print(food_info.head(3))
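The same conversion can be sketched without explicit loops. The table below is invented for illustration; the list comprehension and sum(axis=1) are a swapped-in alternative, not the approach used in the notes. One difference to be aware of: sum(axis=1) skips NaN by default, whereas the += loop propagates NaN into the total.

```python
import pandas as pd

# Hypothetical nutrient table with two gram columns and one milligram column.
df = pd.DataFrame({
    "NDB_No": [1001, 1002],
    "Water_(g)": [15.87, 0.24],
    "Protein_(g)": [0.85, 0.28],
    "Calcium_(mg)": [24.0, 4.0],
})

gram_cols = [c for c in df.columns if c.endswith("(g)")]
df[gram_cols] = df[gram_cols] * 1000        # grams -> milligrams
df["sum(mg)"] = df[gram_cols].sum(axis=1)   # row-wise total over the converted columns

print(df)
```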
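3. Sorting in pandas, using the Titanic dataset
The sort_values() function sorts the data. With inplace=False the original dataset is left unchanged and a sorted copy is returned; with inplace=True the original dataset itself becomes the sorted result.
The code below reads titanic_train.csv, sorts it by the "Fare" column, then selects every record whose age is missing and counts those records.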
import pandas as pd

titanic = pd.read_csv("C:/學習/python/hello/titanic_train.csv")
print(titanic.head(5))
titanic.sort_values("Fare", inplace=True)
print(titanic.head(5))

# Show every record whose age is missing and count them
age = titanic['Age']
age_is_null_judge = pd.isnull(age)
age_is_null = age[age_is_null_judge]
print(age_is_null)
print(len(age_is_null))
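The effect of inplace is easy to verify on a tiny table. This is a minimal sketch with invented fares and ages, not rows from titanic_train.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers.
df = pd.DataFrame({"Fare": [71.3, 7.25, 53.1], "Age": [38.0, np.nan, 35.0]})

sorted_copy = df.sort_values("Fare")   # inplace=False is the default
fares_before = df["Fare"].tolist()     # df itself is still in its original order

result = df.sort_values("Fare", inplace=True)  # sorts df itself and returns None
fares_after = df["Fare"].tolist()

missing_age_count = df["Age"].isnull().sum()   # number of rows with no age

print(fares_before, fares_after, result, missing_age_count)
```

Because the in-place call returns None, assigning its result back to the variable would discard the table.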
4. Data preprocessing methods
To compute the mean of an attribute, the code below averages the Age attribute of the dataset.
import pandas as pd

titanic = pd.read_csv("C:/學習/python/hello/titanic_train.csv")
age = titanic['Age']
age_is_null_judge = pd.isnull(age)  # True where Age is NaN, False otherwise
new_age = titanic['Age'][age_is_null_judge == False]  # two pairs of brackets: the first selects the column, the second filters with the boolean mask
mean = sum(new_age) / len(new_age)  # built-in sum and len
print(mean)
The dropna function can also be used to remove the records in which the attribute is missing.
import pandas as pd

titanic = pd.read_csv("C:/學習/python/hello/titanic_train.csv")

new_titanic = titanic.dropna(axis=0, subset=['Age'])  # subset is a list and may name several columns
new_age = new_titanic['Age']
print(sum(new_age) / len(new_age))
Both snippets above print
29.69911764705882
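A short sketch shows why the missing values have to be removed first, and that Series.mean() does this on its own because it skips NaN by default. The ages here are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
age = pd.Series([22.0, np.nan, 38.0, 26.0, np.nan])

naive = sum(age) / len(age)          # NaN propagates through the built-in sum
known = age[~age.isnull()]           # keep only the non-missing ages
manual_mean = sum(known) / len(known)
dropna_mean = age.dropna().mean()
builtin_mean = age.mean()            # skipna=True by default

print(naive, manual_mean, dropna_mean, builtin_mean)
```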
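The following lines access the element at a given row and column of a DataFrame.
import pandas as pd

titanic = pd.read_csv("C:/學習/python/hello/titanic_train.csv")
print(titanic.loc[24, 'Age'])  # value in the row labelled 24, column 'Age'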
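pivot_table(index='Pclass', values='Age', aggfunc=np.mean) computes grouped statistics. Here index='Pclass' tells the function to first group all records by their Pclass value, values='Age' says that Age is the attribute to summarize within each group, and aggfunc=np.mean says to take the mean of Age.
The code below computes the average age of the passengers in classes 1, 2 and 3.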
import pandas as pd
import numpy as np

titanic = pd.read_csv("C:/學習/python/hello/titanic_train.csv")
mean_age = titanic.pivot_table(index='Pclass', values='Age', aggfunc=np.mean)  # mean Age per Pclass
print(mean_age)
Output: [screenshot not preserved]
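As a rough check of what pivot_table does here, the sketch below runs it on a few invented passengers and compares the result with a groupby, which is a swapped-in equivalent rather than something the notes use. The string "mean" is passed as aggfunc in place of np.mean; both name the same aggregation.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers; one age in class 3 is missing.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Age": [38.0, 54.0, 27.0, 22.0, np.nan, 26.0],
})

mean_age = df.pivot_table(index="Pclass", values="Age", aggfunc="mean")
same_via_groupby = df.groupby("Pclass")["Age"].mean()

print(mean_age)
print(same_via_groupby)
```

The missing age is ignored when the class 3 mean is computed, so no separate dropna step is needed.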

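The pandas sort_values function sorts the records, but the sorted records keep their original index values, so the rows are no longer labelled 0 through n-1 in order. [Figure: sorted rows still carrying their original index values]
To renumber the index from 0 to n-1, just call reset_index().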
import pandas as pd

titanic = pd.read_csv("C:/學習/python/hello/titanic_train.csv")
new_titanic = titanic.sort_values('Age')
new_titanic1 = new_titanic.reset_index()  # the old index is kept as a column named "index"; pass drop=True to discard it
print(new_titanic.head(5))
print(new_titanic1.head(5))
Output: [screenshot not preserved]
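A small sketch makes the index behaviour concrete, including the drop=True option of reset_index. The ages are invented for illustration.

```python
import pandas as pd

# Hypothetical ages.
df = pd.DataFrame({"Age": [38.0, 22.0, 26.0]})

sorted_df = df.sort_values("Age")
kept = sorted_df.reset_index()               # old index saved in a column named "index"
dropped = sorted_df.reset_index(drop=True)   # old index discarded

print(sorted_df.index.tolist())
print(kept.columns.tolist())
print(dropped.index.tolist())
```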

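apply(func, axis) applies a user-defined function: the first argument func is the function, and the second argument controls what func receives. With axis=0, func is called once per column; with axis=1, once per row.
The code below counts the missing values in each column.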
import pandas as pd

# Count the missing values in each attribute
titanic = pd.read_csv("C:/學習/python/hello/titanic_train.csv")


def nul_count(column):
    is_null_judge = pd.isnull(column)
    is_null = column[is_null_judge]
    return len(is_null)


column_null_count = titanic.apply(nul_count, axis=0)
print(column_null_count)
Output: [screenshot not preserved]
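For comparison, the column-wise apply can be checked against the built-in isnull().sum(), which is a swapped-in shortcut and not the method the notes use. The table and the helper name null_count are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical table with missing values in two columns.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})


def null_count(column):
    # column is a Series holding one column of df
    return len(column[pd.isnull(column)])


via_apply = df.apply(null_count, axis=0)   # func called once per column
via_builtin = df.isnull().sum()            # True counts as 1 when summed

print(via_apply)
print(via_builtin)
```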
The code below discretizes age into adults and minors.
import pandas as pd

# Label each passenger as a minor, an adult, or unknown
titanic = pd.read_csv("C:/學習/python/hello/titanic_train.csv")


def generate_age_label(row):
    age = row['Age']
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"


age_labels = titanic.apply(generate_age_label, axis=1)
print(age_labels)
Output: [screenshot not preserved]
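The row-wise labelling can be sketched on a few invented ages and compared with a vectorized version built on numpy.select. The latter is an alternative introduced here for illustration, not part of the notes, and the helper name age_label is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value and one exactly at the boundary.
df = pd.DataFrame({"Age": [22.0, np.nan, 4.0, 18.0]})


def age_label(row):
    age = row["Age"]
    if pd.isnull(age):
        return "unknown"
    return "minor" if age < 18 else "adult"


via_apply = df.apply(age_label, axis=1)   # func called once per row

# Conditions are tested in order; rows matching none of them get the default.
conditions = [df["Age"].isnull(), df["Age"] < 18]
labels = np.select(conditions, ["unknown", "minor"], default="adult")
via_select = pd.Series(labels, index=df.index)

print(via_apply.tolist())
print(via_select.tolist())
```

Comparisons against NaN evaluate to False, so the missing age has to be caught by the isnull condition first; otherwise it would fall through to "adult".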
5. The Series structure
Let a be a DataFrame and b be one row or one column of a; then b is a Series. If c = b.values, then c is a NumPy ndarray, as the code below shows.
import pandas as pd

fandango = pd.read_csv("C:/學習/python/hello/fandango_score_comparison.csv")
series_film = fandango["FILM"]
series_rt = fandango["RottenTomatoes"]
print(type(series_film))
print(series_film.head(5))
film_name = series_film.values
rt_scores = series_rt.values
print(type(rt_scores))
print(rt_scores)
Output: [screenshot not preserved]
pandas.Series(value, index) combines two ndarray values into one Series, where index supplies the labels and value supplies the data. The code below pairs each film title with its RottenTomatoes score.
import pandas as pd

fandango = pd.read_csv("C:/學習/python/hello/fandango_score_comparison.csv")
series_film = fandango["FILM"]
series_rt = fandango["RottenTomatoes"]

film_name = series_film.values
rt_scores = series_rt.values

series_custom = pd.Series(rt_scores, index=film_name)  # scores labelled by film title
print(type(series_custom))
print(series_custom)
Output: [screenshot not preserved]
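Once the titles are the index, scores can be looked up by title. The sketch below uses invented titles and scores in place of the fandango data.

```python
import numpy as np
import pandas as pd

film_name = np.array(["Film B", "Film A", "Film C"])   # hypothetical titles
rt_scores = np.array([74, 85, 80])                     # hypothetical scores

series_custom = pd.Series(rt_scores, index=film_name)

one_score = series_custom["Film A"]                 # lookup by a single label
two_scores = series_custom[["Film C", "Film B"]]    # lookup by a list of labels

print(one_score)
print(two_scores)
```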
A new Series can be built by reordering an existing one by its sorted index, as the code below shows.
import pandas as pd

fandango = pd.read_csv("C:/學習/python/hello/fandango_score_comparison.csv")
series_film = fandango["FILM"]
series_rt = fandango["RottenTomatoes"]

film_name = series_film.values
rt_scores = series_rt.values

series_custom = pd.Series(rt_scores, index=film_name)

# Sort by film title
original_index = series_custom.index.tolist()  # original_index is a list
sorted_index = sorted(original_index)
new_series_custom = series_custom.reindex(sorted_index)
print(new_series_custom)
Output: [screenshot not preserved]
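For a Series with unique labels, reindexing by the sorted label list gives the same result as sort_index(). The sketch below checks this on invented titles and scores, and adds sort_values() for contrast; both are alternatives shown for illustration.

```python
import pandas as pd

# Hypothetical titles and scores.
series_custom = pd.Series([74, 85, 80], index=["Film B", "Film A", "Film C"])

via_reindex = series_custom.reindex(sorted(series_custom.index.tolist()))
via_sort_index = series_custom.sort_index()   # orders by label
by_value = series_custom.sort_values()        # orders by score instead

print(via_reindex)
print(by_value)
```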
The code below keeps the columns of the fandango table whose type is float64 and builds a new table from them.
import pandas as pd

fandango = pd.read_csv("C:/學習/python/hello/fandango_score_comparison.csv")
types = fandango.dtypes  # a Series: the index holds column names, the values hold each column's dtype

float_column = types[types.values == 'float64'].index  # names of the float64 columns
print(type(float_column))
print(float_column)
float_df = fandango[float_column]
print(float_df.head(5))
Output: [screenshot not preserved]
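The dtype filter can be tried on a small invented table and compared with select_dtypes, a built-in alternative that the notes do not use. The column names below are illustrative.

```python
import pandas as pd

# Hypothetical table with one text, one integer and two float columns.
df = pd.DataFrame({
    "FILM": ["Film A", "Film B"],
    "RottenTomatoes": [74, 85],
    "Fandango_Stars": [5.0, 4.5],
    "IMDB": [7.8, 7.1],
})

types = df.dtypes
float_columns = types[types.values == "float64"].index   # mask over the dtype Series
via_mask = df[float_columns]
via_select = df.select_dtypes(include="float64")

print(float_columns.tolist())
print(via_select)
```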
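Let fandango be a DataFrame. Then fandango.columns is an Index, fandango.columns.values is an ndarray, and fandango.columns.values.tolist() is a list. These type relationships are important. Likewise, a single row or column of a DataFrame is a Series, the values attribute of a Series is an ndarray, and calling tolist() on an ndarray gives a list.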
import pandas as pd

fandango = pd.read_csv("C:/學習/python/hello/fandango_score_comparison.csv")
columns = fandango.columns
columns_values = columns.values
columns_value_list = columns_values.tolist()
print(type(columns))
print(columns)
print(type(columns_values))
print(columns_values)
print(type(columns_value_list))
print(columns_value_list)
Output: [screenshot not preserved]
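The chain of types can be confirmed with isinstance on a small invented table. This sketch uses to_numpy(), the explicit conversion method, for the column labels; that is an assumption made here so the result is an ndarray regardless of how the labels are stored, and it is not the attribute used in the notes.

```python
import numpy as np
import pandas as pd

# Hypothetical two-film table.
df = pd.DataFrame({"FILM": ["Film A", "Film B"], "RottenTomatoes": [74, 85]})

columns = df.columns                   # pandas Index
column_array = columns.to_numpy()      # NumPy ndarray
column_list = column_array.tolist()    # plain Python list

one_column = df["RottenTomatoes"]      # a column is a Series
one_row = df.loc[0]                    # a row is a Series too
column_values = one_column.values      # ndarray for a numeric column

print(type(columns), type(column_array), type(column_list))
print(type(one_column), type(one_row), type(column_values))
```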
The code below computes the standard deviation of all ratings for each film and prints it.
import pandas as pd
import numpy as np

fandango = pd.read_csv("C:/學習/python/hello/fandango_score_comparison.csv")
columns = fandango.columns  # all column names, as an Index
columns_values = columns.values  # all column names, as an ndarray
new_fandango = fandango[columns_values[columns_values != 'FILM']]  # drop the film-title column so only numeric columns remain
result = new_fandango.apply(lambda x: np.std(x), axis=1)  # np.std gives the standard deviation; axis=1 processes row by row, and x is a Series
film_name = fandango['FILM']  # film titles, as a Series
result_std = pd.Series(data=result.values, index=film_name.values)
print(result_std)
Output: [screenshot not preserved]
Note that NumPy's std function accepts a Series argument, provided all of its values are numeric.
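Since np.std returns the standard deviation and np.var the variance, a small sketch can show both side by side. The table is invented, and each row is converted with to_numpy() before the NumPy call; that is a choice made here so the arithmetic is NumPy's population formula (ddof=0), not something the notes do.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings for two films.
df = pd.DataFrame({
    "FILM": ["Film A", "Film B"],
    "RottenTomatoes": [74, 85],
    "Metacritic": [66, 67],
    "IMDB_norm": [70, 82],
})

scores = df.drop(columns="FILM")   # keep only the numeric columns
std_per_film = scores.apply(lambda row: np.std(row.to_numpy()), axis=1)
var_per_film = scores.apply(lambda row: np.var(row.to_numpy()), axis=1)

print(std_per_film)
print(var_per_film)
```

The variance is the square of the standard deviation, which gives a quick consistency check on the two results.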