pandas.read_csv(filepath_or_buffer, na_values='NAN', parse_dates=['Last Update'])
Reads data from a CSV file and creates a DataFrame object; na_values specifies which strings are treated as missing values, and parse_dates parses the given columns into datetime format.
dataframe.to_csv("xxx.csv", mode='a', header=False)
Exports DataFrame data to a CSV file.
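A minimal sketch of these parameters (the file names data.csv and combined.csv and the 'Last Update' column are made up for illustration):

import pandas as pd

# Hypothetical input: data.csv uses the string 'NAN' for missing values and has a 'Last Update' column
df = pd.read_csv("data.csv", na_values='NAN', parse_dates=['Last Update'])
print(df.dtypes)  # 'Last Update' should now be datetime64[ns]

# Append the rows to an existing CSV; header=False avoids repeating the header line
df.to_csv("combined.csv", mode='a', header=False)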
import pandas as pd

if __name__ == "__main__":
    df = pd.read_csv("temp.csv")
    print(df)
    print(df.info())
    df.to_csv("temp2.csv")

# output:
# S.No Name Age City Salary
# 0 1 Tom 28 Toronto 20000
# 1 2 Lee 32 HongKong 3000
# 2 3 Steven 43 Bay Area 8300
# 3 4 Ram 38 Hyderabad 3900
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 4 entries, 0 to 3
# Data columns (total 5 columns):
# S.No 4 non-null int64
# Name 4 non-null object
# Age 4 non-null int64
# City 4 non-null object
# Salary 4 non-null int64
# dtypes: int64(3), object(2)
# memory usage: 240.0+ bytes
# None
A column of the CSV file can be designated as the custom index with index_col.
import pandas as pd

if __name__ == "__main__":
    df = pd.read_csv("temp.csv", index_col=['S.No'])
    print(df)

# output:
# Name Age City Salary
# S.No
# 1 Tom 28 Toronto 20000
# 2 Lee 32 HongKong 3000
# 3 Steven 43 Bay Area 8300
# 4 Ram 38 Hyderabad 3900
For larger text files, reading everything into memory at once can be slow, may not fit into memory at all, or may fit but leave no room for further computation. In such cases, use the chunksize or iterator parameter of read_csv to read the file in parts, and after processing each part append the results to a file with to_csv(mode='a').
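A minimal sketch of this chunked-read pattern (the file names big.csv and out.csv, the chunk size, and the 'value' column used in the per-chunk step are made up for illustration):

import pandas as pd

# Read the large CSV 100000 rows at a time instead of loading it all at once
reader = pd.read_csv("big.csv", chunksize=100000)

for i, chunk in enumerate(reader):
    result = chunk[chunk["value"] > 0]  # placeholder per-chunk processing
    # Write the header only with the first chunk, then keep appending
    result.to_csv("out.csv", mode='a', header=(i == 0), index=False)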
When writing output, a large file is written much faster as CSV than as Excel. The xls format only supports around 60,000 records; xlsx supports more rows, but Chinese content is often lost. So for small amounts of data xls is acceptable, while larger amounts are better written to CSV.
HDF5 (Hierarchical Data Format) is a near-ideal format for storing large-scale numerical data. Files use the .h5 suffix, reading and writing are very fast, and data can be stored inside the file in an explicit hierarchy, so a single HDF5 file can be seen as a highly integrated folder that holds data of different types. There are two main ways to work with HDF5 files in Python: use pandas' built-in HDF5-related methods to save pandas data structures into an HDF5 file, or use the h5py module to save native Python data structures in HDF5 format.
pandas.HDFStore()
pandas.HDFStore() creates an object that manages IO operations on an HDF5 file. Its main parameters are as follows:
path: string; the path of the h5 file.
mode: the IO mode. The default is 'a': if the file already exists, its data is kept and new data is written to it; if it does not exist, a new file is created. 'r' is read-only; 'w' creates a new file (overwriting an existing file of the same name); 'r+' behaves like 'a' but requires the file to already exist.
complevel: int; the compression level of the h5 file, between 0 and 9. The higher the value, the more the file is compressed and the less space it takes, at the cost of more decompression time when reading. The default is 0, i.e. no compression.
Data can be stored in the store object either through key-value assignment or with the put method. The main parameters of the store object's put() method are:
key: the key under which the data is written in the h5 file.
value: the data to write for that key.
format: string; the write mode. 'fixed' is fast, but supports neither appending nor searching; 'table' writes in a tabular format, which is slightly slower but supports appending and table queries directly through the store object.
import pandas as pd
import numpy as np

if __name__ == "__main__":
    store = pd.HDFStore("demo.h5")
    s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
    df = pd.DataFrame(np.random.randn(8, 3), columns=['A', 'B', 'C'])
    store['s'] = s
    store.put(key='df', value=df)
    print(store.items)
    print(store.keys())
    store.close()

# output:
# <bound method HDFStore.items of <class 'pandas.io.pytables.HDFStore'>
# File path: demo.h5
# >
# ['/df', '/s']
There are two ways to delete a given piece of data from a store object: call the remove() method with the key of the data to delete, or use Python's del keyword.
import pandas as pd
import numpy as np

if __name__ == "__main__":
    store = pd.HDFStore("demo.h5")
    s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
    df = pd.DataFrame(np.random.randn(8, 3), columns=['A', 'B', 'C'])
    store['s'] = s
    store.put(key='df', value=df)
    print(store.keys())
    store.remove('s')
    print(store.keys())
    store.close()

# output:
# ['/df', '/s']
# ['/df']
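The same deletion done with the del keyword, as a minimal sketch assuming the store currently holds a key 's':

import pandas as pd
import numpy as np

store = pd.HDFStore("demo.h5")
store['s'] = pd.Series(np.random.randn(5))
del store['s']  # equivalent to store.remove('s')
print(store.keys())
store.close()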
To persist the current store object to disk, simply close it with the close() method.
Pandas also provides convenience methods for writing pandas data structures directly to a local h5 file and reading them back.
pd.read_hdf('demo.h5', key='df')
Reads the value stored under the given key from an hdf file.
df.to_hdf(path_or_buf='demo.h5', key='df')
Saves df to an hdf file.
import pandas as pd
import numpy as np

if __name__ == "__main__":
    # Create a new DataFrame
    df_ = pd.DataFrame(np.random.randn(5, 5))
    # Export it to the existing h5 file
    df_.to_hdf(path_or_buf='demo.h5', key='df')
    # Create a store object connected to the local demo.h5
    store = pd.HDFStore('demo.h5')
    # List all keys in the h5 file
    print(store.keys())
    store.close()
    print(store.is_open)
    df = pd.read_hdf('demo.h5', key='df')
    print(df)

# output:
# ['/df']
# False
# 0 1 2 3 4
# 0 0.262806 -0.146832 -0.219655 0.553608 -0.278420
# 1 -0.057369 -1.662138 -0.757119 -2.000140 1.659584
# 2 1.030621 0.421785 -0.239423 0.814709 -1.596752
# 3 -1.538354 0.988993 -1.460490 0.846775 1.073998
# 4 0.092367 -0.042897 -0.034253 0.299312 0.970190
HDF5 has a clear advantage for storing larger data sets; both its read/write speed and its compression efficiency are much higher than CSV's.
pd.read_excel(io, sheet_name=0, header=0, names=None, index_col=None, usecols=None)
Imports data from an Excel file.
io: the path of the Excel file, or an IO object.
sheet_name: which sheet to return. If sheet_name is set to None, all sheets are returned; to return several sheets, pass a list.
header: which row to use as the header; the default is 0, i.e. the first row.
index_col: column number or column name to use as the row index; passing a sequence produces a MultiIndex. Set index_col=False to stop pandas from using the first column as the row index.
usecols: read only the specified columns, given by name or index.
import pandas as pd

if __name__ == "__main__":
    df = pd.read_excel("test.xls", sheet_name=None)
    print(df['Sheet1'])
    print(df['Sheet2'])

# output:
# No Name Age Score
# 0 1 Bauer 26 89
# 1 2 Bob 24 87
# 2 3 Jack 25 80
# 3 4 Alex 30 90
# No Name Age
# 0 1 Bauer 26
# 1 2 Bob 24
# 2 3 Jack 25
# 3 4 Alex 30
Reading Excel is done mainly with the read_excel function; besides pandas, the third-party xlrd library must also be installed.
data.to_excel(io, sheet_name='Sheet1', index=False, header=True)
Exports data to an Excel file.
Using the to_excel function requires the xlwt library.
import pandas as pd
import numpy as np

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(3, 3), columns=['A', 'B', 'C'])
    print(df)
    df.to_excel("test1.xls", sheet_name='Sheet3', index=False)
    df = pd.read_excel("test1.xls")
    print(df)

# output:
# A B C
# 0 1.066504 0.807083 -0.213006
# 1 0.247025 -1.129131 -0.130942
# 2 0.090071 -0.358951 0.266514
# A B C
# 0 1.066504 0.807083 -0.213006
# 1 0.247025 -1.129131 -0.130942
# 2 0.090071 -0.358951 0.266514
pandas.read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None)
Reads an SQL query or database table into a DataFrame. It is a wrapper around read_sql_table and read_sql_query and delegates to the appropriate function depending on the input: an SQL query is routed to read_sql_query, a database table name to read_sql_table.
pandas.read_sql_table(table_name, con, schema=None, index_col=None, coerce_float=True, parse_dates=None, columns=None, chunksize=None)
Reads an SQL database table into a DataFrame.
sql: the SQL query to execute or a table name; a string or SQLAlchemy object.
con: SQLAlchemy connectable (engine/connection), database string URI, or DBAPI2 connection. Any database supported by SQLAlchemy can be used; if a DBAPI2 object is given, only sqlite3 is supported.
index_col: string or list of strings, optional, default None; the column(s) to set as the index (MultiIndex).
coerce_float: boolean, default True; attempts to convert values of non-string, non-numeric objects (such as decimal.Decimal) to floating point.
params: list, tuple or dict, optional, default None; the parameters to pass to the execute method. The syntax used to pass parameters depends on the database driver.
parse_dates: list or dict, default None; the column names to parse as dates.
columns: list, default None; the column names to select from the SQL table.
chunksize: int, default None; if specified, an iterator is returned in which chunksize is the number of rows per chunk.
import MySQLdb
import pandas as pd

mysql_cn = MySQLdb.connect(host='host', port=3306, user='username', passwd='password', db='information_schema')
df_mysql = pd.read_sql('select * from VIEWS;', con=mysql_cn)
print('loaded dataframe from MySQL. records:', len(df_mysql))
mysql_cn.close()
DataFrame.to_sql(name, con, schema=None, if_exists='fail', index=True, index_label=None, chunksize=None, dtype=None)
Exports a DataFrame to an SQL database.
name: the name of the SQL table.
con: sqlalchemy.engine.Engine or sqlite3.Connection. Any database supported by SQLAlchemy can be used; legacy support is provided for sqlite3.Connection objects.
schema: optional; the schema to use (if the database supports schemas). If None, the default schema is used.
if_exists: {'fail', 'replace', 'append'}, default 'fail'; what to do if the table already exists. fail: raise a ValueError; replace: drop the table before inserting the new values; append: insert the new values into the existing table.
index: boolean, default True; writes the DataFrame index as a column, using index_label as the column name in the table.
index_label: string or sequence, default None; the column label(s) for the index column(s). If None (the default) and index is True, the index names are used. A sequence should be given if the DataFrame uses a MultiIndex.
chunksize: int, optional; the number of rows written in each batch. By default, all rows are written at once.
dtype: dict, optional; the data types of columns. Keys should be column names and values should be SQLAlchemy types, or strings in sqlite3 legacy mode.
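A minimal sketch of to_sql using a standard-library sqlite3 connection (the demo.db file and the 'scores' table are made up for illustration):

import sqlite3
import pandas as pd

df = pd.DataFrame({'name': ['Tom', 'Lee'], 'score': [88, 92]})

conn = sqlite3.connect("demo.db")
# Drop and recreate the table if it already exists; do not write the index as a column
df.to_sql('scores', con=conn, if_exists='replace', index=False)
print(pd.read_sql('select * from scores', con=conn))
conn.close()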
pandas.read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, convert_axes=True, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, date_unit=None, encoding=None, lines=False, chunksize=None, compression='infer')
Imports data from a JSON file or a JSON-formatted string.
path_or_buf: the path of the JSON file, or a JSON-formatted string.
orient: indicates the layout of the JSON string. For a Series the allowed values are 'split', 'records', 'index' and 'table', with 'index' as the default. For a DataFrame the allowed values are 'split', 'records', 'index', 'columns', 'values' and 'table', with 'columns' as the default.
'split': the JSON string is a dict of the form {index -> [index], columns -> [columns], data -> [values]}.
Each record in the JSON file looks like the above, and the only allowed key names are index, columns and data.
'records': the JSON string is a list of the form [{column -> value}, ... , {column -> value}].
'index': the JSON string is a dict of the form {index -> {column -> value}}.
'columns': the JSON string is a dict of the form {column -> {index -> value}}.
'values': the JSON string is just the array of values.
typ: the type of object to recover, 'series' or 'frame', default 'frame'.
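To make a few of the orient values concrete, a small sketch that uses to_json (described just below) to print the same toy DataFrame in three layouts (the data is made up):

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])

# default DataFrame orient is 'columns': {column -> {index -> value}}
print(df.to_json())
# 'split' keeps index, columns and data as separate lists
print(df.to_json(orient='split'))
# 'records' is a list of row dicts: [{column -> value}, ...]
print(df.to_json(orient='records'))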
data.to_json(path_or_buf=None, orient=None, date_format=None, double_precision=10, force_ascii=True, date_unit='ms', default_handler=None, lines=False, compression='infer', index=True)
Exports DataFrame data to a JSON file.
import pandas as pd
import numpy as np

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'])
    print(df)
    df.to_json("test.json")
    df = pd.read_json("test.json")
    print(df)

# output:
# 0 1 2 ... 5 6 7
# A -0.305526 -0.696618 0.796365 ... -0.195769 -1.669797 0.548616
# B -1.598829 1.104907 -1.969812 ... 1.590904 1.372927 0.766009
# C -1.424199 0.717892 0.728426 ... 0.358646 0.742373 -0.820586
#
# [3 rows x 8 columns]
# 0 1 2 ... 5 6 7
# A -0.305526 -0.696618 0.796365 ... -0.195769 -1.669797 0.548616
# B -1.598829 1.104907 -1.969812 ... 1.590904 1.372927 0.766009
# C -1.424199 0.717892 0.728426 ... 0.358646 0.742373 -0.820586
#
# [3 rows x 8 columns]
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 3), index=dates, columns=list('ABC'))
    print(df)
    print(df.head(3))
    print(df.tail(3))

# output:
# A B C
# 2013-01-01 0.768917 -0.963290 -0.159038
# 2013-01-02 -0.023267 -0.292786 0.652954
# 2013-01-03 0.176760 0.137241 1.301041
# 2013-01-04 -0.071628 -1.371969 0.774005
# 2013-01-05 -0.793016 -0.178345 0.035532
# 2013-01-06 0.407762 0.241827 1.170372
# A B C
# 2013-01-01 0.768917 -0.963290 -0.159038
# 2013-01-02 -0.023267 -0.292786 0.652954
# 2013-01-03 0.176760 0.137241 1.301041
# A B C
# 2013-01-04 -0.071628 -1.371969 0.774005
# 2013-01-05 -0.793016 -0.178345 0.035532
# 2013-01-06 0.407762 0.241827 1.170372
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 3), index=dates, columns=list('ABC'))
    print(df)
    print(df.index)
    print(df.columns)
    print(list(df))
    print(df.values)

# output:
# A B C
# 2013-01-01 0.971426 0.403905 0.304562
# 2013-01-02 -2.404873 -0.222086 0.444464
# 2013-01-03 -0.144014 -0.513883 -0.468732
# 2013-01-04 0.065060 0.460675 -0.633609
# 2013-01-05 -1.322018 2.128932 1.099606
# 2013-01-06 -0.220413 -0.086348 -0.289723
# DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
# '2013-01-05', '2013-01-06'],
# dtype='datetime64[ns]', freq='D')
# Index(['A', 'B', 'C'], dtype='object')
# ['A', 'B', 'C']
# [[ 0.97142634 0.40390521 0.30456152]
# [-2.4048735 -0.22208588 0.44446443]
# [-0.14401362 -0.51388305 -0.46873214]
# [ 0.06505955 0.46067507 -0.63360907]
# [-1.32201785 2.12893236 1.09960613]
# [-0.22041327 -0.08634845 -0.28972288]]
View the number of rows and columns of a DataFrame.
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 3), index=dates, columns=list('ABC'))
    print(df)
    print(df.shape)
    print(df.shape[0])
    print(df.shape[1])

# output:
# A B C
# 2013-01-01 1.571635 0.740456 -0.789674
# 2013-01-02 0.534758 0.372924 1.139897
# 2013-01-03 0.419329 0.097288 -0.061034
# 2013-01-04 0.292189 -0.805046 -0.512478
# 2013-01-05 2.293956 -0.310201 -0.661519
# 2013-01-06 0.890370 0.190517 0.306458
# (6, 3)
# 6
# 3
View the DataFrame's index, data types, and memory information.
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 3), index=dates, columns=list('ABC'))
    print(df)
    print(df.info())

# output:
# A B C
# 2013-01-01 0.145529 -0.299115 -0.360462
# 2013-01-02 2.203913 -0.619418 2.478992
# 2013-01-03 -1.106605 1.114359 -0.653225
# 2013-01-04 1.409313 2.198673 -1.663985
# 2013-01-05 -0.917697 0.645962 -1.323553
# 2013-01-06 0.729082 0.043500 -1.932772
# <class 'pandas.core.frame.DataFrame'>
# DatetimeIndex: 6 entries, 2013-01-01 to 2013-01-06
# Freq: D
# Data columns (total 3 columns):
# A 6 non-null float64
# B 6 non-null float64
# C 6 non-null float64
# dtypes: float64(3)
# memory usage: 192.0 bytes
# None
Count the non-null values in each column with df.count().
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 3), index=dates, columns=list('ABC'))
    print(df)
    print(df.count())

# output:
# A B C
# 2013-01-01 0.160293 0.298212 0.572019
# 2013-01-02 1.046787 0.559711 -0.259907
# 2013-01-03 0.208801 1.018917 -1.165052
# 2013-01-04 -0.080998 1.268477 -1.038384
# 2013-01-05 -0.413563 0.101436 0.215154
# 2013-01-06 0.266813 0.945366 1.726588
# A 6
# B 6
# C 6
# dtype: int64
To count how many distinct values a column contains, use nunique() or len(set()); to count how many times each distinct value occurs, use value_counts().
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 3), index=dates, columns=list('ABC'))
    print(df)
    print(df.A.nunique())
    print(len(set(df.A)))

# output:
# A B C
# 2013-01-01 0.256037 -0.096629 -0.224575
# 2013-01-02 0.220131 0.460777 -0.191140
# 2013-01-03 0.957422 0.584076 -1.548418
# 2013-01-04 -0.913387 -1.056598 0.201946
# 2013-01-05 -0.076716 0.337379 2.560821
# 2013-01-06 1.244448 1.241131 0.232319
# 6
# 6
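The example above only covers nunique(); a minimal sketch of value_counts() on a column with repeated values (the toy data is made up):

import pandas as pd

s = pd.Series(['red', 'blue', 'red', 'green', 'red'])
# Occurrences of each distinct value, most frequent first
print(s.value_counts())
# 'red' appears 3 times, 'blue' and 'green' once each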
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 3), index=dates, columns=list('ABC'))
    print(df)
    print(df.T)

# output:
# A B C
# 2013-01-01 -0.622806 1.461436 -1.133845
# 2013-01-02 1.408834 -1.117877 0.922919
# 2013-01-03 -0.492947 -1.063588 1.702908
# 2013-01-04 -0.401612 -0.206524 0.843514
# 2013-01-05 0.064999 0.106151 0.733977
# 2013-01-06 -2.219718 -0.972984 0.466263
# 2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
# A -0.622806 1.408834 -0.492947 -0.401612 0.064999 -2.219718
# B 1.461436 -1.117877 -1.063588 -0.206524 0.106151 -0.972984
# C -1.133845 0.922919 1.702908 0.843514 0.733977 0.466263
df.idxmax(axis=0, skipna=True)
df.idxmax(0)
Returns, for every column, the index of its maximum value.
df.A.idxmax(0)
Returns the index of the maximum value in column A.
df.idxmax(1)
Returns, for every row, the column name of its maximum value.
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(4, 3), index=['rank2', 'rank1', 'rank4', 'rank3'], columns=['col3', 'col2', 'col1'])
    print(df)
    print(df.idxmax(0))
    print(df.col2.idxmax(0))
    print(df.idxmax(1))
    print(df.idxmin(0))
    print(df.col2.idxmin(0))
    print(df.idxmin(1))

# output:
# col3 col2 col1
# rank2 -0.139445 -1.239773 -0.280064
# rank1 0.170190 1.093101 1.697052
# rank4 -0.174857 -0.526127 -1.197490
# rank3 -0.190417 0.241660 1.206216
# col3 rank1
# col2 rank1
# col1 rank1
# dtype: object
# rank1
# rank2 col3
# rank1 col1
# rank4 col3
# rank3 col1
# dtype: object
# col3 rank3
# col2 rank2
# col1 rank4
# dtype: object
# rank2
# rank2 col2
# rank1 col3
# rank4 col1
# rank3 col3
# dtype: object
Format specifiers (the syntax is a ':' inside '{}') can be used to print data in a given format.
import pandas as pd
import numpy as np

if __name__ == "__main__":
    # percentage
    print('{:.2%}'.format(0.12354))
    # thousands separator for amounts
    print('{:,}'.format(123456789))
    # decimal precision
    print('{:.2f}'.format(31.31412))

# output:
# 12.35%
# 123,456,789
# 31.31
pandas.set_option('display.expand_frame_repr', False)
True allows the printed frame to wrap onto multiple lines; False does not allow wrapping.
pandas.set_option('display.max_rows', 10)
pandas.set_option('display.max_columns', 10)
The maximum number of rows and columns to display; anything beyond that is replaced with an ellipsis.
pandas.set_option('display.precision', 5)
The number of digits shown after the decimal point, i.e. the floating-point display precision.
pandas.set_option('display.large_repr', 'truncate')
'truncate' truncates the display, 'info' shows a summary instead; the default is 'truncate'.
pandas.set_option('display.max_colwidth', 5)
Sets the maximum display width of each column.
pandas.set_option('display.chop_threshold', 0.5)
Values whose absolute value is below 0.5 are displayed as 0.0.
pandas.set_option('display.colheader_justify', 'left')
Controls whether column headers are left- or right-aligned.
pandas.set_option('display.width', 200)
The maximum display width in characters per line; 80 is usually too narrow for wide screens, so 200 is commonly used.
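A minimal sketch showing two of these options together (the toy DataFrame is made up):

import pandas as pd
import numpy as np

pd.set_option('display.max_rows', 10)  # show at most 10 rows, the rest as '...'
pd.set_option('display.precision', 3)  # 3 digits after the decimal point

df = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))
print(df)  # only the first and last few rows are printed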
Pandas supports three types of multi-axis indexing: label-based, integer-based, and a mix of label- and integer-based.
Pandas provides several ways to perform label-based indexing. A label can be:
(1) a single scalar label
(2) a list of labels
(3) a slice object; when slicing with labels, both boundaries are included
(4) a boolean array
loc takes two labels separated by ',': the first selects rows, the second selects columns.
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(4, 3), index=['rank2', 'rank1', 'rank4', 'rank3'], columns=['col3', 'col2', 'col1'])
    print(df)
    print(df.loc['rank1', 'col2'])
    print(df.loc[:, 'col3'])
    print(df.loc[:, ['col1', 'col3']])
    print(df.loc['rank1':'rank3', :])

# output:
# col3 col2 col1
# rank2 1.113696 -1.412935 -0.806799
# rank1 0.107469 1.086778 -0.971733
# rank4 -0.135899 -0.753419 -0.569671
# rank3 1.416578 1.230413 0.795368
# 1.086777931461885
# rank2 1.113696
# rank1 0.107469
# rank4 -0.135899
# rank3 1.416578
# Name: col3, dtype: float64
# col1 col3
# rank2 -0.806799 1.113696
# rank1 -0.971733 0.107469
# rank4 -0.569671 -0.135899
# rank3 0.795368 1.416578
# col3 col2 col1
# rank1 0.107469 1.086778 -0.971733
# rank4 -0.135899 -0.753419 -0.569671
# rank3 1.416578 1.230413 0.795368
The advantage of label indexing is that it allows cross-selection on multiple axes: DataFrame data can be located by row index labels and column labels. Note, however, that label slices are closed intervals (both endpoints are included).
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20190101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 3), index=dates, columns=list('ABC'))
    print(df)
    print(df.loc[dates[0]])
    print(df.loc[:, ['A', 'B']])
    print(df.loc['2019-01-03':'2019-01-05', ['A', 'B']])
    print(df.loc['2019-01-03', ['A', 'B']])
    print(df.loc['2019-01-03', 'A'])

# output:
# A B C
# 2019-01-01 -0.640586 0.296498 0.758321
# 2019-01-02 -0.219330 0.377097 0.353152
# 2019-01-03 0.857294 1.255778 1.797687
# 2019-01-04 -1.271955 -1.675781 0.484156
# 2019-01-05 1.223988 1.200979 1.074488
# 2019-01-06 -0.722830 -0.525681 0.294155
# A -0.640586
# B 0.296498
# C 0.758321
# Name: 2019-01-01 00:00:00, dtype: float64
# A B
# 2019-01-01 -0.640586 0.296498
# 2019-01-02 -0.219330 0.377097
# 2019-01-03 0.857294 1.255778
# 2019-01-04 -1.271955 -1.675781
# 2019-01-05 1.223988 1.200979
# 2019-01-06 -0.722830 -0.525681
# A B
# 2019-01-03 0.857294 1.255778
# 2019-01-04 -1.271955 -1.675781
# 2019-01-05 1.223988 1.200979
# A 0.857294
# B 1.255778
# Name: 2019-01-03 00:00:00, dtype: float64
# 0.8572941113047045
Pandas provides several ways to index by pure integer position, such as an integer, a list of integers, or a Series of values.
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(4, 3), index=['rank2', 'rank1', 'rank4', 'rank3'], columns=['col3', 'col2', 'col1'])
    print(df)
    print(df.iloc[0:3])
    print(df.iloc[[1, 2], 0:2])

# output:
# col3 col2 col1
# rank2 -0.483500 -1.073882 -1.081589
# rank1 -0.753271 -1.434796 -0.946916
# rank4 0.125635 0.570554 -2.454738
# rank3 1.949820 -1.464900 -0.171653
# col3 col2 col1
# rank2 -0.483500 -1.073882 -1.081589
# rank1 -0.753271 -1.434796 -0.946916
# rank4 0.125635 0.570554 -2.454738
# col3 col2
# rank1 -0.753271 -1.434796
# rank4 0.125635 0.570554
Selection by position is done by passing integer positions; positional indexing also supports slicing.
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20190101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 3), index=dates, columns=list('ABC'))
    print(df)
    print(df.iloc[3])
    # select all columns except the last two
    print(df.iloc[:, :-2])
    print(df.iloc[1:4, 1:3])
    print(df.iloc[:, [1, 2]])
    # get a scalar value
    print(df.iloc[1, 2])

# output:
# A B C
# 2019-01-01 -1.348715 -0.184542 -0.290333
# 2019-01-02 0.177905 0.876349 0.371486
# 2019-01-03 1.368759 1.399392 -0.000577
# 2019-01-04 1.855882 0.564528 -0.089876
# 2019-01-05 0.530389 -1.292908 0.681160
# 2019-01-06 -0.286435 -0.461200 0.864096
# A 1.855882
# B 0.564528
# C -0.089876
# Name: 2019-01-04 00:00:00, dtype: float64
# A
# 2019-01-01 -1.348715
# 2019-01-02 0.177905
# 2019-01-03 1.368759
# 2019-01-04 1.855882
# 2019-01-05 0.530389
# 2019-01-06 -0.286435
# B C
# 2019-01-02 0.876349 0.371486
# 2019-01-03 1.399392 -0.000577
# 2019-01-04 0.564528 -0.089876
# B C
# 2019-01-01 -0.184542 -0.290333
# 2019-01-02 0.876349 0.371486
# 2019-01-03 1.399392 -0.000577
# 2019-01-04 0.564528 -0.089876
# 2019-01-05 -1.292908 0.681160
# 2019-01-06 -0.461200 0.864096
# 0.3714863793190553
Used to get an entire row or column of data.
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(4, 3), index=['rank2', 'rank1', 'rank4', 'rank3'], columns=['col3', 'col2', 'col1'])
    print(df)
    print(df['col2'])
    print(df.col2)

# output:
# col3 col2 col1
# rank2 -0.010866 -1.438301 1.008284
# rank1 -0.633372 0.951618 0.190146
# rank4 -0.158926 -2.016063 0.456099
# rank3 -1.028975 -0.144202 -0.077525
# rank2 -1.438301
# rank1 0.951618
# rank4 -2.016063
# rank3 -0.144202
# Name: col2, dtype: float64
# rank2 -1.438301
# rank1 0.951618
# rank4 -2.016063
# rank3 -0.144202
# Name: col2, dtype: float64
Selecting multiple columns.
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(4, 3), index=['rank2', 'rank1', 'rank4', 'rank3'], columns=['col3', 'col2', 'col1'])
    print(df)
    print(df[['col2', 'col3']])

# output:
# col3 col2 col1
# rank2 -0.190013 0.775020 -2.243045
# rank1 0.884000 1.347191 -0.388117
# rank4 -1.401332 0.228368 -1.475148
# rank3 0.369793 0.813368 -0.428450
# col2 col3
# rank2 0.775020 -0.190013
# rank1 1.347191 0.884000
# rank4 0.228368 -1.401332
# rank3 0.813368 0.369793
Getting row data by slicing.
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(4, 3), index=['rank2', 'rank1', 'rank4', 'rank3'], columns=['col3', 'col2', 'col1'])
    print(df)
    print(df[0:3])
    print(df['rank1':'rank4'])

# output:
# col3 col2 col1
# rank2 -0.868999 0.852147 0.346300
# rank1 1.975817 0.633193 -0.157873
# rank4 0.271203 -0.681425 0.227320
# rank3 0.173491 -0.225134 -0.750217
# col3 col2 col1
# rank2 -0.868999 0.852147 0.346300
# rank1 1.975817 0.633193 -0.157873
# rank4 0.271203 -0.681425 0.227320
# col3 col2 col1
# rank1 1.975817 0.633193 -0.157873
# rank4 0.271203 -0.681425 0.227320
Selecting data using the values of a single column.
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20190101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 3), index=dates, columns=list('ABC'))
    print(df)
    print(df[df.A > 0])

# output:
# A B C
# 2019-01-01 -0.419116 0.370122 -2.026854
# 2019-01-02 -1.041050 0.356879 1.166706
# 2019-01-03 -0.853631 -0.115552 -0.859882
# 2019-01-04 -0.725505 -0.424321 0.218010
# 2019-01-05 1.087608 1.135607 -0.191611
# 2019-01-06 -0.630319 1.033699 -0.153894
# A B C
# 2019-01-05 1.087608 1.135607 -0.191611
Selecting data by value: values that do not satisfy the condition are filled with NaN.
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20190101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 3), index=dates, columns=list('ABC'))
    print(df)
    print(df[df > 0])

# output:
# A B C
# 2019-01-01 -0.562408 0.394501 0.516874
# 2019-01-02 -0.589820 -0.902871 -0.395223
# 2019-01-03 0.009566 -0.817079 1.620771
# 2019-01-04 0.307311 0.392733 0.090025
# 2019-01-05 0.469306 -0.563045 -1.402386
# 2019-01-06 0.554762 -0.023549 1.889080
# A B C
# 2019-01-01 NaN 0.394501 0.516874
# 2019-01-02 NaN NaN NaN
# 2019-01-03 0.009566 NaN 1.620771
# 2019-01-04 0.307311 0.392733 0.090025
# 2019-01-05 0.469306 NaN NaN
# 2019-01-06 0.554762 NaN 1.889080
Setting new values by label.
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20190101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 3), index=dates, columns=list('ABC'))
    print(df)
    df.loc['2019-01-04', 'B'] = 3.1415
    print(df)

# output:
# A B C
# 2019-01-01 0.950116 0.147263 1.049792
# 2019-01-02 0.305393 -0.235960 -0.385073
# 2019-01-03 -0.024728 -0.581566 -0.343492
# 2019-01-04 2.384613 0.256359 0.422368
# 2019-01-05 -0.941046 0.259252 0.559688
# 2019-01-06 -0.138191 -1.055116 -1.268404
# A B C
# 2019-01-01 0.950116 0.147263 1.049792
# 2019-01-02 0.305393 -0.235960 -0.385073
# 2019-01-03 -0.024728 -0.581566 -0.343492
# 2019-01-04 2.384613 3.141500 0.422368
# 2019-01-05 -0.941046 0.259252 0.559688
# 2019-01-06 -0.138191 -1.055116 -1.268404
If the label being assigned to does not exist, a new column (or row) is created, and the positions that were not assigned are filled with NaN.
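A minimal sketch of this enlargement behavior (the new column 'D' and the single assigned row are made up for illustration):

import pandas as pd
import numpy as np

dates = pd.date_range('20190101', periods=3)
df = pd.DataFrame(np.random.randn(3, 2), index=dates, columns=list('AB'))

# 'D' does not exist yet: assigning to one cell creates the column,
# and the rows that were not assigned are filled with NaN
df.loc['2019-01-02', 'D'] = 1.0
print(df)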
Setting new values by position.
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20190101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 3), index=dates, columns=list('ABC'))
    print(df)
    df.iloc[0, 0] = 3.1415
    print(df)

# output:
# A B C
# 2019-01-01 1.141077 0.102785 -1.243796
# 2019-01-02 -0.100035 -0.468026 -1.230186
# 2019-01-03 -1.361605 0.603181 0.009779
# 2019-01-04 0.094592 0.377274 -0.743773
# 2019-01-05 0.756191 0.254951 -0.032884
# 2019-01-06 1.029874 0.377550 -1.061605
# A B C
# 2019-01-01 3.141500 0.102785 -1.243796
# 2019-01-02 -0.100035 -0.468026 -1.230186
# 2019-01-03 -1.361605 0.603181 0.009779
# 2019-01-04 0.094592 0.377274 -0.743773
# 2019-01-05 0.756191 0.254951 -0.032884
# 2019-01-06 1.029874 0.377550 -1.061605
Setting the values of an entire column.
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20190101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 3), index=dates, columns=list('ABC'))
    print(df)
    df.loc[:, 'D'] = np.array([3] * len(df))
    print(df)

# output:
# A B C
# 2019-01-01 -0.377629 -0.792364 -0.030633
# 2019-01-02 0.034738 -0.121923 0.159174
# 2019-01-03 0.288188 2.671207 -0.670135
# 2019-01-04 0.626814 0.669742 0.017105
# 2019-01-05 -0.127686 -0.643768 0.000738
# 2019-01-06 0.524352 -0.228057 -0.896196
# A B C D
# 2019-01-01 -0.377629 -0.792364 -0.030633 3
# 2019-01-02 0.034738 -0.121923 0.159174 3
# 2019-01-03 0.288188 2.671207 -0.670135 3
# 2019-01-04 0.626814 0.669742 0.017105 3
# 2019-01-05 -0.127686 -0.643768 0.000738 3
# 2019-01-06 0.524352 -0.228057 -0.896196 3
Assigning values through boolean indexing.
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20190101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 3), index=dates, columns=list('ABC'))
    print(df)
    df2 = df.copy()
    # convert positive values to negative
    df2[df2 > 0] = -df2
    print(df2)

# output:
# A B C
# 2019-01-01 0.691983 0.489286 -1.632002
# 2019-01-02 1.212439 0.854812 -0.292094
# 2019-01-03 -0.365872 0.738098 -0.494800
# 2019-01-04 0.548706 0.066543 0.242601
# 2019-01-05 0.656829 0.155872 0.262424
# 2019-01-06 -0.085094 1.392970 -0.214890
# A B C
# 2019-01-01 -0.691983 -0.489286 -1.632002
# 2019-01-02 -1.212439 -0.854812 -0.292094
# 2019-01-03 -0.365872 -0.738098 -0.494800
# 2019-01-04 -0.548706 -0.066543 -0.242601
# 2019-01-05 -0.656829 -0.155872 -0.262424
# 2019-01-06 -0.085094 -1.392970 -0.214890