Ways to Create a pandas DataFrame

Date: 2019-11-12


This post is part of a series summarizing create, delete, query, and update operations on pandas DataFrames.

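DataFrame is the most commonly used data structure in pandas. This post summarizes how to create one and how to add data to it:
① loading data from other formats into a DataFrame;
② inserting N columns or N rows into an existing DataFrame.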

1. Reading a dict into a DataFrame (dict to DataFrame)

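Suppose the data from an experiment comes as a dict, and we want to convert it to a DataFrame so that later statistics and computation are easier. There are many ways to write this; here are a few common ones.
Method 1: call pd.DataFrame(data=test_dict) directly. The data= keyword is optional: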

import pandas as pd

test_dict = {'id':[1,2,3,4,5,6],'name':['Alice','Bob','Cindy','Eric','Helen','Grace'],'math':[90,89,99,78,97,93],'english':[89,94,80,94,94,90]}
# [1] Pass test_dict positionally
test_dict_df = pd.DataFrame(test_dict)
# [2] Pass test_dict through the data keyword
test_dict_df = pd.DataFrame(data=test_dict)

This gives us a DataFrame with six rows and four columns: id, name, math, and english.
Method 2: use the from_dict method:

test_dict_df = pd.DataFrame.from_dict(test_dict)

The result is the same, so it is not shown again.
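Where from_dict differs from the plain constructor is its orient parameter, which lets the dict keys become row labels instead of column names. The sketch below illustrates this; the row_dict data and the scores_df name are made up for the example.

```python
import pandas as pd

# Hypothetical data: each key is a row label, each value is one row.
row_dict = {'Alice': [90, 89], 'Bob': [89, 94]}

# orient='index' treats the keys as rows; columns names the values in each row.
scores_df = pd.DataFrame.from_dict(row_dict, orient='index', columns=['math', 'english'])

print(scores_df)
```

With the default orient='columns', from_dict behaves like pd.DataFrame(test_dict) above.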
Other approaches: if your dict is small, for example {'id':1,'name':'Alice'}, you may want to write it inline:

test_dict_df = pd.DataFrame({'id':1,'name':'Alice'}) # raises ValueError

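This does not work. It fails with ValueError: If using all scalar values, you must pass an index. When every value is a scalar, pandas cannot tell how many rows to create, so you must also supply an index. You can write it like this: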

test_dict_df = pd.DataFrame({'id':1,'name':'Alice'},pd.Index(range(1)))

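You can pass a longer index such as pd.Index(range(3)), which produces three identical rows, because the dict holds only one set of scalar values. If the dict values are lists, the length of the index must match the length of those lists, otherwise an error is raised: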

pd.DataFrame({'id':[1,2],'name':['Alice','Bob']},pd.Index(range(2)))  # the index length must be 2 to match the lists
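A minimal sketch of the all-scalar case, showing the error and two ways around it. The variable names are illustrative; wrapping the dict in a list is an alternative not mentioned above, but it is a standard constructor form.

```python
import pandas as pd

record = {'id': 1, 'name': 'Alice'}

# All-scalar values without an index raise ValueError.
try:
    pd.DataFrame(record)
    raised = False
except ValueError:
    raised = True

# Option 1: pass an explicit index of length 1.
with_index_df = pd.DataFrame(record, index=[0])

# Option 2: wrap the dict in a list, so it is read as a single row.
from_list_df = pd.DataFrame([record])

print(raised)
print(with_index_df)
print(from_list_df)
```

Both options give a one-row DataFrame with the columns id and name.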

Selecting columns: sometimes we only need some of the dict keys as DataFrame columns. The columns parameter does this. For example, to keep only the 'id' and 'name' columns:

test_dict_df = pd.DataFrame(data=test_dict,columns=['id','name']) #only choose 'id' and 'name' columns
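As a small sketch of how columns behaves with dict input: it selects keys, sets their order, and a name that is not a key in the dict becomes an all-NaN column. The data here is a shortened, made-up version of test_dict.

```python
import pandas as pd

data = {'id': [1, 2], 'name': ['Alice', 'Bob'], 'math': [90, 89]}

# columns both selects and orders the keys that become columns.
subset_df = pd.DataFrame(data, columns=['name', 'id'])

# 'physics' is not a key in data, so that column is filled with NaN.
extra_df = pd.DataFrame(data, columns=['id', 'physics'])

print(subset_df)
print(extra_df)
```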

I will stop here for now; later additions will be marked in a different color.

2. Building a DataFrame from a csv file (csv to DataFrame)

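Experimental data is usually fairly large, and csv is a plain-text format that takes little storage, so the data often arrives as csv files. How do we build a DataFrame from one? The same approach generally works for txt files too.
Method 1: the most common choice is pd.read_csv('filename.csv'). Use sep to specify the delimiter; the default is ','.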

df = pd.read_csv('./xxx.csv')

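If the csv has no header row, pass the header parameter as header=None, so that the first line is read as data rather than as column names.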
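A minimal sketch of the header and sep parameters. To keep it self-contained, it reads from in-memory strings through io.StringIO instead of a file on disk; the contents and column names are made up.

```python
import io
import pandas as pd

# Stand-in for a csv file with no header row.
csv_text = "1,Alice,90\n2,Bob,89\n"

# header=None marks the first line as data; names supplies the column labels.
df = pd.read_csv(io.StringIO(csv_text), header=None, names=['id', 'name', 'math'])

# Stand-in for a tab-separated txt file that does have a header row.
tsv_text = "id\tname\n1\tAlice\n"
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep='\t')

print(df)
print(tsv_df)
```

Without names, header=None labels the columns 0, 1, 2.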

3. Adding N columns or N rows to an existing DataFrame

Suppose we already have a DataFrame, here the test_dict_df with the 'id' and 'name' columns from section 1.

3.1 Adding columns
Now there is a new course, physics, and we need to add each person's score for it, in index order. We can use the insert method:

new_columns = [92,94,89,77,87,91]
test_dict_df.insert(2,'physics',new_columns)
#test_dict_df.insert(2,'physics',new_columns,allow_duplicates=True)

This gives the DataFrame with the new column. Note that insert does not allow a duplicate column by default. Setting allow_duplicates=True in insert lets you add one, and the column name is duplicated as well.
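The duplicate-column behavior can be sketched as follows, on a small made-up DataFrame rather than test_dict_df.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})

# insert modifies df in place and returns None; position 2 puts the column last here.
df.insert(2, 'physics', [92, 94])

# Inserting the same column name again raises ValueError by default.
try:
    df.insert(0, 'physics', [0, 0])
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

# With allow_duplicates=True the second 'physics' column is accepted.
df.insert(0, 'physics', [0, 0], allow_duplicates=True)

print(duplicate_rejected)
print(list(df.columns))
```

Duplicate column names make df['physics'] return a DataFrame instead of a Series, so they are usually worth avoiding.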

3.2 Adding rows
A new student, Iric, has joined, and we need to add this student's record to the DataFrame. We can use loc:

new_line = [7,'Iric',99]
test_dict_df.loc[6]= new_line

Be careful, though: this is really an assignment. If the index in loc[index] already exists, the new values overwrite the existing row.
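A short sketch of the difference between adding and overwriting with loc, using a made-up two-row DataFrame. Using len(df) as the label is one common way to avoid an accidental overwrite, assuming the index is the default 0..n-1.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})

# Label 2 does not exist yet, so this adds a new row.
df.loc[2] = [3, 'Cindy']

# Label 0 already exists, so this overwrites the first row.
df.loc[0] = [9, 'Zed']

# len(df) is the next free label when the index is the default 0..n-1.
df.loc[len(df)] = [4, 'Eric']

print(df)
```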

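Alternatively, build the new data as a separate DataFrame and join the two DataFrames. The append method used to do this, but DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so pd.concat is the safer choice. I am not very practiced with it, but here is one way:

# pd.concat returns a new DataFrame and leaves test_dict_df unchanged
pd.concat([test_dict_df, pd.DataFrame([new_line],columns=['id','name','physics'])])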
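A minimal sketch of stacking two DataFrames with pd.concat, focusing on how the row index is handled. The data and variable names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})
new_rows = pd.DataFrame([[3, 'Cindy']], columns=['id', 'name'])

# Without ignore_index the new row keeps its own label 0, so label 0 appears twice.
kept_df = pd.concat([df, new_rows])

# ignore_index=True renumbers the result 0..n-1.
renumbered_df = pd.concat([df, new_rows], ignore_index=True)

print(list(kept_df.index))
print(list(renumbered_df.index))
```

Neither call changes df; assign the result to a name if you want to keep it.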

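I meant to cover all of CRUD in one go, but writing this much was already tiring. The rest will go into follow-up posts.
Related code: (https://github.com/dataSnail/blogCode/blob/master/python_curd/python_curd_create.ipynb). Other posts in the series: (Deleting N columns or N rows from a DataFrame) (Querying N columns or N rows in a DataFrame) (Modifying data in a DataFrame)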
