pandas is a powerful Python data-analysis toolkit, built on top of NumPy.
(1) Data structures with alignment support: DataFrame and Series
(2) Integrated time-series functionality
(3) A rich set of mathematical operations and manipulations
(4) Flexible handling of missing data
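As a quick taste of these features, here is a minimal sketch (the `price` column name and the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# A DataFrame with a time-series index and one missing value
df = pd.DataFrame(
    {"price": [1.0, np.nan, 3.0]},                 # np.nan marks missing data
    index=pd.date_range("2020-01-01", periods=3),  # built-in time-series index
)
total = df["price"].fillna(0).sum()  # flexible missing-data handling
print(total)  # 4.0
```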
# Installation:
# pip install pandas
# Import convention:
import pandas as pd
A Series is a one-dimensional-array-like object made up of a set of values and an associated set of labels (the index).
# Creating a Series
>>> import pandas as pd
>>> pd.Series([2,3,4,5])
0    2
1    3
2    4
3    5
dtype: int64
>>> pd.Series([2,3,4,5], index=['a','b','c','d'])
a    2
b    3
c    4
d    5
dtype: int64
Get the value array and the index array through the values and index attributes.
A Series behaves like a cross between a list (array) and a dict.
# Creating a Series from an ndarray: Series(arr)
>>> import numpy as np
>>> pd.Series(np.arange(5))
0    0
1    1
2    2
3    3
4    4
dtype: int64
# Arithmetic with a scalar: sr*2
>>> sr = pd.Series([2,3,4,5], index=['a','b','c','d'])
>>> sr
a    2
b    3
c    4
d    5
dtype: int64
>>> sr*2
a     4
b     6
c     8
d    10
dtype: int64
>>> sr+2
a    4
b    5
c    6
d    7
dtype: int64
# Arithmetic between two Series: sr1+sr2
>>> sr + sr
a     4
b     6
c     8
d    10
dtype: int64
# Indexing: sr[0], sr[[1,2,3]]
>>> sr[0]
2
>>> sr[[1,2,3]]
b    3
c    4
d    5
dtype: int64
# Slicing: sr[0:2]
>>> sr[0:2]
a    2
b    3
dtype: int64
# Universal functions (max, abs, etc.), e.g. np.abs(sr)
>>> sr.max()
5
>>> np.abs(sr)
a    2
b    3
c    4
d    5
dtype: int64
# Boolean filtering: sr[sr>0]
>>> sr>4
a    False
b    False
c    False
d     True
dtype: bool
>>> sr[sr>4]
d    5
dtype: int64
# Creating a Series from a dict: Series(dic)
>>> sr = pd.Series({'a':3, 'b':2, 'c':4})
>>> sr
a    3
b    2
c    4
dtype: int64
# The in operator: 'a' in sr
>>> 'a' in sr
True
>>> 'e' in sr
False
>>> for i in sr:
...     print(i)            # iteration yields the values, not the keys
3
2
4
# Key indexing: sr['a'], sr[['a','b','c']]
>>> sr['a']
3
>>> sr[['a','b','c']]
a    3
b    2
c    4
dtype: int64
# Getting the index and the corresponding values
>>> sr.index
Index(['a', 'b', 'c'], dtype='object')
>>> sr.index[0]
'a'
>>> sr.values
array([3, 2, 4])
>>> sr = pd.Series([1,2,3,4], index=['a','b','c','d'])
>>> sr
a    1
b    2
c    3
d    4
dtype: int64
>>> sr[['a','c']]
a    1
c    3
dtype: int64
>>> sr['a':'c']             # label-based slicing (both endpoints inclusive)
a    1
b    2
c    3
dtype: int64
Integer-indexed pandas objects often trip up newcomers.
>>> sr = pd.Series(np.arange(4.))
>>> sr
0    0.0
1    1.0
2    2.0
3    3.0
dtype: float64
>>> sr[-1]                  # raises an error
KeyError: -1
>>> sr = pd.Series(np.arange(10))
>>> sr
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64
>>> sr2 = sr[5:].copy()     # copy after slicing
>>> sr2                     # note that the original index labels are kept
5    5
6    6
7    7
8    8
9    9
dtype: int64
If the index is of integer type, an integer key in square brackets is always treated as label-based. (That is, when the index values are integers, a plain integer key is always interpreted as a label.)
The solution:
# loc: interpret the key as a label
>>> sr2.loc[7]
7
# iloc: interpret the key as a position (offset)
>>> sr2.iloc[3]
8
So whenever integers are involved, use loc or iloc to make explicit whether the key in brackets is a label or a position.
When pandas performs arithmetic on two Series objects, it aligns them by index before computing.
>>> sr1 = pd.Series([12,23,34], index=['c','a','d'])
>>> sr2 = pd.Series([11,20,10], index=['d','c','a'])
>>> sr1 + sr2
a    33      # 23+10
c    32      # 12+20
d    45      # 34+11
dtype: int64
>>> sr1 = pd.Series([12,23,34], index=['c','a','d'])
>>> sr2 = pd.Series([11,20,10,21], index=['d','c','a','b'])
>>> sr1 + sr2               # adding Series of different lengths
a    33.0
b     NaN                   # pandas uses NaN to mark missing data
c    32.0
d    45.0
dtype: float64
>>> sr1 = pd.Series([12,23,34], index=['c','a','d'])
>>> sr2 = pd.Series([11,20,10], index=['b','c','a'])
>>> sr1 + sr2
a    33.0
b     NaN
c    32.0
d     NaN
dtype: float64
If the indexes of the two Series are not identical, the result's index is the union of the two operands' indexes.
If only one operand has a value at a given index, the result at that index is NaN (missing).
Flexible arithmetic methods: add, sub, mul, div (addition, subtraction, multiplication, division).
>>> sr1 = pd.Series([12,23,34], index=['c','a','d'])
>>> sr2 = pd.Series([11,20,10], index=['b','c','a'])
>>> sr1.add(sr2)
a    33.0
b     NaN
c    32.0
d     NaN
dtype: float64
>>> sr1.add(sr2, fill_value=0)  # where a label exists on only one side, treat the missing side as 0
a    33.0
b    11.0
c    32.0
d    34.0
dtype: float64
Missing data is represented as NaN (Not a Number), whose value is np.nan.
The built-in None is also treated as NaN.
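A short sketch of this: both None and np.nan show up as missing values.

```python
import numpy as np
import pandas as pd

sr = pd.Series([1.0, None, np.nan])  # None is converted to NaN on construction
print(sr.isnull().tolist())  # [False, True, True]
```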
>>> sr = sr1 + sr2
>>> sr
a    33.0
b     NaN
c    32.0
d     NaN
dtype: float64
# dropna(): filter out rows whose value is NaN
# fillna(): fill in missing data
# isnull(): return a boolean array, True where a value is missing
>>> sr.isnull()
a    False
b     True      # True marks NaN
c    False
d     True
dtype: bool
# notnull(): return a boolean array, False where a value is missing
>>> sr.notnull()
a     True
b    False      # False marks NaN
c     True
d    False
dtype: bool
# sr.dropna()
>>> sr.dropna()
a    33.0
c    32.0
dtype: float64
# sr[sr.notnull()]
>>> sr[sr.notnull()]        # drop every row with a missing value
a    33.0
c    32.0
dtype: float64
# fillna()
>>> sr.fillna(0)            # set missing values to 0
a    33.0
b     0.0
c    32.0
d     0.0
dtype: float64
>>> sr.mean()               # the mean is computed with NaN excluded
32.5
>>> sr.fillna(sr.mean())    # fill missing values with the mean
a    33.0
b    32.5
c    32.0
d    32.5
dtype: float64
A Series is a hybrid of an array and a dict, and can be accessed by position (offset) or by label.
When the index values are integers, a plain integer key is always interpreted as a label; use loc and iloc to state explicitly whether a key is a label or a position.
If the indexes of two Series objects are not identical, the result's index is the union of the two operands' indexes.
If only one operand has a value at a given index, the result at that index is NaN (missing).
Missing-data handling: dropna (filter out) and fillna (fill in).
A DataFrame is a tabular data structure containing an ordered collection of columns.
A DataFrame can be thought of as a dict of Series that share a common index.
# Creation:
# Method 1: from a dict
>>> pd.DataFrame({'one':[1,2,3], 'two':[4,5,6]})
   one  two
0    1    4
1    2    5
2    3    6
>>> pd.DataFrame({'one':[1,2,3], 'two':[4,5,6]}, index=['a','b','c'])   # index sets the row labels
   one  two
a    1    4
b    2    5
c    3    6
# Method 2: from a dict of Series
>>> pd.DataFrame({'one':pd.Series([1,2,3],index=['a','b','c']), 'two':pd.Series([1,2,3,4],index=['b','a','c','d'])})
   one  two
a  1.0    2
b  2.0    1
c  3.0    3
d  NaN    4
# MacBook-Pro:pandas hqs$ vi test.csv       # create and populate a csv file
# a,b,c
# 1,2,3
# 2,4,6
# 3,6,9
# Reading and writing csv files:
>>> pd.read_csv('test.csv')                 # read_csv(): read a csv file
   a  b  c
0  1  2  3
1  2  4  6
2  3  6  9
>>> df = pd.DataFrame({'one':pd.Series([1,2,3],index=['a','b','c']), 'two':pd.Series([1,2,3,4],index=['b','a','c','d'])})
>>> df
   one  two
a  1.0    2
b  2.0    1
c  3.0    3
d  NaN    4
>>> df.to_csv('test2.csv')                  # to_csv(): write to a csv file
# MacBook-Pro:pandas hqs$ vi test2.csv      # inspect the file; missing values are written as empty fields
# ,one,two
# a,1.0,2
# b,2.0,1
# c,3.0,3
# d,,4
>>> df = pd.DataFrame({'one':pd.Series([1,2,3],index=['a','b','c']), 'two':pd.Series([1,2,3,4],index=['b','a','c','d'])})
>>> df
   one  two
a  1.0    2
b  2.0    1
c  3.0    3
d  NaN    4
# index: the row index
>>> df.index
Index(['a', 'b', 'c', 'd'], dtype='object')
# columns: the column index
>>> df.columns
Index(['one', 'two'], dtype='object')
# values: the values as a (usually two-dimensional) array
>>> df.values
array([[ 1.,  2.],
       [ 2.,  1.],
       [ 3.,  3.],
       [nan,  4.]])
# T: transpose
>>> df.T                    # rows become columns and columns become rows
       a    b    c    d
one  1.0  2.0  3.0  NaN
two  2.0  1.0  3.0  4.0
# describe(): quick summary statistics
>>> df.describe()
       one       two
count  3.0  4.000000   # count of non-missing values per column
mean   2.0  2.500000   # mean of each column
std    1.0  1.290994   # standard deviation of each column
min    1.0  1.000000   # minimum of each column
25%    1.5  1.750000   # first quartile
50%    2.0  2.500000   # median
75%    2.5  3.250000   # third quartile
max    3.0  4.000000   # maximum of each column
A DataFrame is two-dimensional, so it has both a row index and a column index.
>>> df
   one  two
a  1.0    2
b  2.0    1
c  3.0    3
d  NaN    4
>>> df['one']['a']          # column first, then row; each column is a Series
1.0
Like Series, a DataFrame supports both label-based and position-based indexing and slicing.
loc and iloc: loc selects data by label, iloc selects data by position (offset).
# Usage: a comma separates the row part from the column part
>>> df.loc['a','one']           # row first, then column
1.0
# The row/column parts may freely mix plain keys, slices, boolean arrays and fancy indexing
>>> df.loc['a',:]               # row 'a', all columns
one    1.0
two    2.0
Name: a, dtype: float64
>>> df.loc['a',]                # same as above
one    1.0
two    2.0
Name: a, dtype: float64
>>> df.loc[['a','c'],:]         # rows 'a' and 'c', all columns
   one  two
a  1.0    2
c  3.0    3
>>> df.loc[['a','c'],'two']
a    2
c    3
Name: two, dtype: int64
>>> df.apply(lambda x: x+1)
   one  two
a  2.0    3
b  3.0    2
c  4.0    4
d  NaN    5
>>> df.apply(lambda x: x.mean())
one    2.0
two    2.5
dtype: float64
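The session above uses loc throughout; here is a short sketch of the equivalent position-based access with iloc (the same df is rebuilt so the snippet stands alone):

```python
import pandas as pd

df = pd.DataFrame({"one": pd.Series([1, 2, 3], index=["a", "b", "c"]),
                   "two": pd.Series([1, 2, 3, 4], index=["b", "a", "c", "d"])})
print(df.iloc[0, 0])        # row 0, column 0 -> 1.0
print(df.iloc[[0, 2], :])   # rows 0 and 2 (labels 'a' and 'c'), all columns
print(df.iloc[:, 1])        # every row of the second column ('two')
```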
When DataFrame objects are combined, data alignment again takes place, on the row index and the column index separately.
>>> df = pd.DataFrame({'two':[1,2,3,4], 'one':[4,5,6,7]}, index=['c','d','b','a'])
>>> df2 = pd.DataFrame({'one':pd.Series([1,2,3],index=['a','b','c']), 'two':pd.Series([1,2,3,4],index=['b','a','c','d'])})
>>> df
   two  one
c    1    4
d    2    5
b    3    6
a    4    7
>>> df2
   one  two
a  1.0    2
b  2.0    1
c  3.0    3
d  NaN    4
>>> df + df2
   one  two
a  8.0    6
b  8.0    4
c  7.0    4
d  NaN    6
DataFrame methods for handling missing data:
# df.fillna(x): replace every missing value in the DataFrame with x
>>> df2.fillna(0)
   one  two
a  1.0    2
b  2.0    1
c  3.0    3
d  0.0    4
>>> df2.loc['d','two'] = np.nan     # introduce some missing values into df2
>>> df2.loc['c','two'] = np.nan
>>> df2
   one  two
a  1.0  2.0
b  2.0  1.0
c  3.0  NaN
d  NaN  NaN
# df.dropna(): drop every row containing a missing value; how defaults to 'any'
>>> df2.dropna()
   one  two
a  1.0  2.0
b  2.0  1.0
>>> df2.dropna(how='all')           # drop only rows in which every value is missing
   one  two
a  1.0  2.0
b  2.0  1.0
c  3.0  NaN
>>> df2.dropna(how='any')
   one  two
a  1.0  2.0
b  2.0  1.0
# df.dropna(axis=1): drop every column containing a missing value; axis defaults to 0
>>> df.loc['c','one'] = np.nan      # introduce a missing value into df
>>> df
   two  one
c    1  NaN
d    2  5.0
b    3  6.0
a    4  7.0
>>> df.dropna(axis=1)
   two
c    1
d    2
b    3
a    4
# df.dropna(axis=1, thresh=n): drop columns with fewer than n non-missing values
>>> df.dropna(axis=1, thresh=4)
   two
c    1
d    2
b    3
a    4
# df.isnull(): boolean DataFrame, True where a value is missing
>>> df2.isnull()
     one    two
a  False  False
b  False  False
c  False   True
d   True   True
# df.notnull(): boolean DataFrame, True where a value is present
>>> df2.notnull()
     one    two
a   True   True
b   True   True
c   True  False
d  False  False
>>> df
   two  one
c    1  NaN
d    2  5.0
b    3  6.0
a    4  7.0
# mean(axis=0, skipna=False): mean per column (or per row)
>>> df.mean()               # missing values are skipped; per-column mean by default
two    2.5
one    6.0
dtype: float64
>>> df.mean(axis=1)         # missing values are skipped; per-row mean
c    1.0
d    3.5
b    4.5
a    5.5
dtype: float64
# sum(axis=1): sum per column (or per row)
>>> df.sum()                # per-column sum
two    10.0
one    18.0
dtype: float64
>>> df.sum(axis=1)          # per-row sum
c     1.0
d     7.0
b     9.0
a    11.0
dtype: float64
# sort_index(axis, ..., ascending): sort by row (or column) labels
>>> df.sort_index()                         # default: sort the row index ascending
   two  one
a    4  7.0
b    3  6.0
c    1  NaN
d    2  5.0
>>> df.sort_index(ascending=False)          # sort the row index descending
   two  one
d    2  5.0
c    1  NaN
b    3  6.0
a    4  7.0
>>> df.sort_index(axis=1)                   # sort the column index ascending
   one  two      # 'o' sorts before 't'
c  NaN    1
d  5.0    2
b  6.0    3
a  7.0    4
>>> df.sort_index(ascending=False, axis=1)  # sort the column index descending
   two  one
c    1  NaN
d    2  5.0
b    3  6.0
a    4  7.0
# sort_values(by, axis, ascending): sort by the values of a column (or row)
>>> df.sort_values(by='two')                # sort by column 'two'
   two  one
c    1  NaN
d    2  5.0
b    3  6.0
a    4  7.0
>>> df.sort_values(by='two', ascending=False)   # ascending defaults to True; False sorts descending
   two  one
a    4  7.0
b    3  6.0
d    2  5.0
c    1  NaN
>>> df.sort_values(by='a', ascending=False, axis=1)     # sort columns by the values in row 'a'
   one  two
c  NaN    1
d  5.0    2
b  6.0    3
a  7.0    4
# When sorting by a column, rows with missing values go last by default
>>> df.sort_values(by='one')
   two  one
d    2  5.0
b    3  6.0
a    4  7.0
c    1  NaN
>>> df.sort_values(by='one', ascending=False)
   two  one
a    4  7.0
b    3  6.0
d    2  5.0
c    1  NaN
Note: NumPy's universal functions also work on pandas objects.
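A quick sketch of this: applying NumPy ufuncs to a Series returns a Series with the index preserved.

```python
import numpy as np
import pandas as pd

sr = pd.Series([-1, 4, -9], index=["a", "b", "c"])
print(np.abs(sr).tolist())           # [1, 4, 9]; index labels 'a','b','c' are kept
print(np.sqrt(np.abs(sr)).tolist())  # [1.0, 2.0, 3.0]
```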
Time-series types:
(1) Timestamp: a specific moment in time
(2) Period: a fixed span, e.g. July 2017
(3) Interval: a start time through an end time
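The three types can be sketched with pandas objects (a minimal illustration; the dates are made up):

```python
import pandas as pd

ts = pd.Timestamp("2017-07-15 08:00")  # (1) timestamp: a specific moment
per = pd.Period("2017-07", freq="M")   # (2) period: the whole of July 2017
span = pd.Timestamp("2017-07-31") - pd.Timestamp("2017-07-01")  # (3) interval
print(ts, per, span.days)
```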
The Python standard library handles time objects with the datetime module. The datetime class's strptime() method parses a string into a datetime object.
>>> import datetime
>>> datetime.datetime.strptime('2010-01-01', '%Y-%m-%d')
datetime.datetime(2010, 1, 1, 0, 0)
>>> import dateutil.parser
>>> dateutil.parser.parse('2001-01-01')     # dash-separated
datetime.datetime(2001, 1, 1, 0, 0)
>>> dateutil.parser.parse('2001/01/01')     # slash-separated
datetime.datetime(2001, 1, 1, 0, 0)
>>> dateutil.parser.parse('02/03/2001')     # the year may come last
datetime.datetime(2001, 2, 3, 0, 0)
>>> dateutil.parser.parse('2001-JAN-01')    # English month names are recognized
datetime.datetime(2001, 1, 1, 0, 0)
一般被用來作索引。
>>> pd.to_datetime(['2001-01-01', '2010/Feb/02'])   # mixed formats are all converted to a DatetimeIndex
DatetimeIndex(['2001-01-01', '2010-02-02'], dtype='datetime64[ns]', freq=None)
The signature of pandas' date_range function:
def date_range(start=None, end=None, periods=None, freq=None, tz=None,
               normalize=False, name=None, closed=None, **kwargs):
    """
    Return a fixed frequency DatetimeIndex.

    Parameters
    ----------
    start : str or datetime-like, optional
        Left bound for generating dates (the start time).
    end : str or datetime-like, optional
        Right bound for generating dates (the end time).
    periods : integer, optional
        Number of periods to generate (the length).
    freq : str or DateOffset, default 'D'
        Frequency. Common aliases: H (hour), W (week), B (business day),
        SM (semi-month), T or min (minute), S (second), A (year end), ...
        Frequency strings can have multiples, e.g. '5H'. See
        :ref:`here <timeseries.offset_aliases>` for a list of frequency aliases.
    tz : str or tzinfo, optional
        Time zone name for returning localized DatetimeIndex, for example
        'Asia/Hong_Kong'. By default, the resulting DatetimeIndex is
        timezone-naive.
    normalize : bool, default False
        Normalize start/end dates to midnight before generating date range.
    name : str, default None
        Name of the resulting DatetimeIndex.
    closed : {None, 'left', 'right'}, optional
        Make the interval closed with respect to the given frequency to the
        'left', 'right', or both sides (None, the default).
    **kwargs
        For compatibility. Has no effect on the result.
    """
Usage examples:
>>> pd.date_range('2010-01-01', '2010-5-1')         # start and end dates
DatetimeIndex(['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04',
               ...
               '2010-04-26', '2010-04-27', '2010-04-28', '2010-04-29',
               '2010-04-30', '2010-05-01'],
              dtype='datetime64[ns]', length=121, freq='D')
>>> pd.date_range('2010-01-01', periods=10)         # start date and length
DatetimeIndex(['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04',
               '2010-01-05', '2010-01-06', '2010-01-07', '2010-01-08',
               '2010-01-09', '2010-01-10'],
              dtype='datetime64[ns]', freq='D')
>>> pd.date_range('2010-01-01', periods=10, freq='H')       # hourly frequency
DatetimeIndex(['2010-01-01 00:00:00', '2010-01-01 01:00:00',
               '2010-01-01 02:00:00', '2010-01-01 03:00:00',
               '2010-01-01 04:00:00', '2010-01-01 05:00:00',
               '2010-01-01 06:00:00', '2010-01-01 07:00:00',
               '2010-01-01 08:00:00', '2010-01-01 09:00:00'],
              dtype='datetime64[ns]', freq='H')
>>> pd.date_range('2010-01-01', periods=10, freq='W-MON')   # every Monday
DatetimeIndex(['2010-01-04', '2010-01-11', '2010-01-18', '2010-01-25',
               '2010-02-01', '2010-02-08', '2010-02-15', '2010-02-22',
               '2010-03-01', '2010-03-08'],
              dtype='datetime64[ns]', freq='W-MON')
>>> pd.date_range('2010-01-01', periods=10, freq='B')       # business days
DatetimeIndex(['2010-01-01', '2010-01-04', '2010-01-05', '2010-01-06',
               '2010-01-07', '2010-01-08', '2010-01-11', '2010-01-12',
               '2010-01-13', '2010-01-14'],
              dtype='datetime64[ns]', freq='B')
>>> pd.date_range('2010-01-01', periods=10, freq='1h20min') # every 1 hour 20 minutes
DatetimeIndex(['2010-01-01 00:00:00', '2010-01-01 01:20:00',
               '2010-01-01 02:40:00', '2010-01-01 04:00:00',
               '2010-01-01 05:20:00', '2010-01-01 06:40:00',
               '2010-01-01 08:00:00', '2010-01-01 09:20:00',
               '2010-01-01 10:40:00', '2010-01-01 12:00:00'],
              dtype='datetime64[ns]', freq='80T')
# Converting back to datetime objects
>>> dt = pd.date_range('2010-01-01', periods=10, freq='B')
>>> dt[0]
Timestamp('2010-01-01 00:00:00', freq='B')
>>> dt[0].to_pydatetime()       # convert to a Python datetime object
datetime.datetime(2010, 1, 1, 0, 0)
A time series is a Series or DataFrame indexed by time objects.
When datetime objects are used as an index, they are stored in a DatetimeIndex.
Special features of time series:
>>> sr = pd.Series(np.arange(1000), index=pd.date_range('2017-01-01', periods=1000))
>>> sr
2017-01-01      0
2017-01-02      1
2017-01-03      2
2017-01-04      3
...
2019-09-26    998
2019-09-27    999
Freq: D, Length: 1000, dtype: int64
# Feature 1: slicing by a year or a year-month string
>>> sr['2017']              # slice by year
2017-01-01      0
2017-01-02      1
...
2017-12-30    363
2017-12-31    364
Freq: D, Length: 365, dtype: int64
>>> sr['2017-05']           # slice by year and month
2017-05-01    120
2017-05-02    121
...
2017-05-30    149
2017-05-31    150
Freq: D, dtype: int64
# Feature 2: slicing by a date range
>>> sr['2017-10-25':'2018-03']      # 25 Oct 2017 through March 2018
2017-10-25    297
2017-10-26    298
...
2018-03-30    453
2018-03-31    454
Freq: D, Length: 158, dtype: int64
# Feature 3: rich function support: resample(), truncate(), ...
# resample(): change the sampling frequency
>>> sr.resample('W').sum()          # weekly sum
2017-01-01       0
2017-01-08      28
...
2019-09-22    6937
2019-09-29    4985
Freq: W-SUN, Length: 144, dtype: int64
>>> sr.resample('M').sum()          # monthly sum
2017-01-31      465
2017-02-28     1246
...
2019-08-31    29667
2019-09-30    26622
Freq: M, dtype: int64
>>> sr.resample('M').mean()         # mean of the daily values within each month
2017-01-31     15.0
2017-02-28     44.5
2017-03-31     74.0
...
2019-08-31    957.0
2019-09-30    986.0
Freq: M, dtype: float64
# truncate(): cut off part of the series
>>> sr.truncate(before='2018-04-01')    # drop everything before 1 Apr 2018
2018-04-01    455
2018-04-02    456
...
2019-09-26    998
2019-09-27    999
Freq: D, Length: 545, dtype: int64
>>> sr.truncate(after='2018-01-01')     # drop everything after 1 Jan 2018
2017-01-01      0
2017-01-02      1
2017-01-03      2
...
2017-12-31    364
2018-01-01    365
Freq: D, Length: 366, dtype: int64
Common data-file format: CSV (values separated by some delimiter).
Besides CSV, pandas also supports other file types: JSON, XML, HTML, databases, pickle, Excel, ...
Data can be loaded from a file name, a URL, or a file object.
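A minimal sketch of the "file object" case: read_csv accepts any file-like object, so an in-memory io.StringIO can stand in for a file on disk.

```python
import io
import pandas as pd

# read_csv accepts a path, a URL, or a file-like object alike;
# here a StringIO buffer plays the role of a small csv file
buf = io.StringIO("a,b,c\n1,2,3\n2,4,6\n")
df = pd.read_csv(buf)
print(df.shape)  # (2, 3)
```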
>>> pd.read_csv('601318.csv')       # the old index becomes an 'Unnamed' column; a fresh integer index is generated
      Unnamed: 0        date    open  ...     low      volume    code
0              0  2007-03-01  21.878  ...  20.040  1977633.51  601318
1              1  2007-03-02  20.565  ...  20.075   425048.32  601318
2              2  2007-03-05  20.119  ...  19.047   419196.74  601318
...          ...         ...     ...  ...     ...         ...     ...
2561        2561  2017-12-14  72.120  ...  70.600   676186.00  601318
2562        2562  2017-12-15  70.690  ...  70.050   735547.00  601318
[2563 rows x 8 columns]
>>> pd.read_csv('601318.csv', index_col=0)          # use column 0 as the index
            date    open   close    high     low      volume    code
0     2007-03-01  21.878  20.473  22.302  20.040  1977633.51  601318
1     2007-03-02  20.565  20.307  20.758  20.075   425048.32  601318
...          ...     ...     ...     ...     ...         ...     ...
2561  2017-12-14  72.120  71.010  72.160  70.600   676186.00  601318
2562  2017-12-15  70.690  70.380  71.440  70.050   735547.00  601318
[2563 rows x 7 columns]
>>> pd.read_csv('601318.csv', index_col='date')     # use the 'date' column as the index
            Unnamed: 0    open   close    high     low      volume    code
date
2007-03-01           0  21.878  20.473  22.302  20.040  1977633.51  601318
2007-03-02           1  20.565  20.307  20.758  20.075   425048.32  601318
...                ...     ...     ...     ...     ...         ...     ...
2017-12-14        2561  72.120  71.010  72.160  70.600   676186.00  601318
2017-12-15        2562  70.690  70.380  71.440  70.050   735547.00  601318
[2563 rows x 7 columns]
# Note: although the index above looks like dates, the values are strings, not time objects
>>> df = pd.read_csv('601318.csv', index_col='date')
>>> df.index
Index(['2007-03-01', '2007-03-02', '2007-03-05', '2007-03-06', '2007-03-07',
       '2007-03-08', '2007-03-09', '2007-03-12', '2007-03-13', '2007-03-14',
       ...
       '2017-12-04', '2017-12-05', '2017-12-06', '2017-12-07', '2017-12-08',
       '2017-12-11', '2017-12-12', '2017-12-13', '2017-12-14', '2017-12-15'],
      dtype='object', name='date', length=2563)
# Converting to time objects:
# Method 1:
>>> df = pd.read_csv('601318.csv', index_col='date', parse_dates=True)  # parse every column that looks like dates
>>> df
            Unnamed: 0    open   close    high     low      volume    code
date
2007-03-01           0  21.878  20.473  22.302  20.040  1977633.51  601318
2007-03-02           1  20.565  20.307  20.758  20.075   425048.32  601318
...                ...     ...     ...     ...     ...         ...     ...
2017-12-14        2561  72.120  71.010  72.160  70.600   676186.00  601318
2017-12-15        2562  70.690  70.380  71.440  70.050   735547.00  601318
[2563 rows x 7 columns]
>>> df.index        # the index is now a DatetimeIndex
DatetimeIndex(['2007-03-01', '2007-03-02', '2007-03-05', '2007-03-06',
               '2007-03-07', '2007-03-08', '2007-03-09', '2007-03-12',
               ...
               '2017-12-08', '2017-12-11', '2017-12-12', '2017-12-13',
               '2017-12-14', '2017-12-15'],
              dtype='datetime64[ns]', name='date', length=2563, freq=None)
# Method 2:
>>> df = pd.read_csv('601318.csv', index_col='date', parse_dates=['date'])  # parse_dates may also be a list naming the columns to parse
>>> df.index
DatetimeIndex(['2007-03-01', '2007-03-02', '2007-03-05', '2007-03-06',
               '2007-03-07', '2007-03-08', '2007-03-09', '2007-03-12',
               ...
               '2017-12-08', '2017-12-11', '2017-12-12', '2017-12-13',
               '2017-12-14', '2017-12-15'],
              dtype='datetime64[ns]', name='date', length=2563, freq=None)
# header=None: declare that the file has no header row; numeric column names are generated
>>> pd.read_csv('601318.csv', header=None)
           0           1       2       3       4       5           6       7   # generated column names
0        NaN        date    open   close    high     low      volume    code
1        0.0  2007-03-01  21.878  20.473  22.302   20.04  1977633.51  601318
2        1.0  2007-03-02  20.565  20.307  20.758  20.075   425048.32  601318
...      ...         ...     ...     ...     ...     ...         ...     ...
2562  2561.0  2017-12-14   72.12   71.01   72.16    70.6    676186.0  601318
2563  2562.0  2017-12-15   70.69   70.38   71.44   70.05    735547.0  601318
[2564 rows x 8 columns]
# The names parameter can supply the column names instead
>>> pd.read_csv('601318.csv', header=None, names=list('abcdefgh'))
           a           b       c       d       e       f           g       h
0        NaN        date    open   close    high     low      volume    code
1        0.0  2007-03-01  21.878  20.473  22.302   20.04  1977633.51  601318
2        1.0  2007-03-02  20.565  20.307  20.758  20.075   425048.32  601318
...      ...         ...     ...     ...     ...     ...         ...     ...
2562  2561.0  2017-12-14   72.12   71.01   72.16    70.6    676186.0  601318
2563  2562.0  2017-12-15   70.69   70.38   71.44   70.05    735547.0  601318
[2564 rows x 8 columns]
read_table works essentially the same way as read_csv.
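A short sketch: read_table defaults to tab-separated input but is otherwise interchangeable with read_csv (again using an in-memory buffer instead of a real file).

```python
import io
import pandas as pd

# read_table defaults to sep='\t'; everything else matches read_csv
buf = io.StringIO("a\tb\n1\t2\n3\t4\n")
df = pd.read_table(buf)
print(df["b"].sum())  # 6
```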
# sep: the delimiter; regular expressions such as '\s+' are allowed
# header=None: declare that the file has no header row
# names: supply the column names
# index_col: use a given column as the index
# skiprows: skip the listed rows
>>> pd.read_csv('601318.csv', header=None, skiprows=[1,2,3])    # skip rows 1, 2 and 3
           0           1       2       3       4       5           6       7
0        NaN        date    open   close    high     low      volume    code
1        3.0  2007-03-06  19.253    19.8  20.128  19.143   297727.88  601318
2        4.0  2007-03-07  19.817  20.338  20.522  19.651   287463.78  601318
3        5.0  2007-03-08  20.171  20.093  20.272  19.988   130983.83  601318
...      ...         ...     ...     ...     ...     ...         ...     ...
2559  2561.0  2017-12-14   72.12   71.01   72.16    70.6    676186.0  601318
2560  2562.0  2017-12-15   70.69   70.38   71.44   70.05    735547.0  601318
[2561 rows x 8 columns]
# na_values: strings to interpret as missing values
# NaN is recognized as missing automatically, but the string 'None' would otherwise be read as text
>>> pd.read_csv('601318.csv', header=None, na_values=['None'])  # treat the string 'None' as missing
           0           1       2       3       4       5           6       7
0        NaN        date    open   close    high     low      volume    code
1        0.0  2007-03-01  21.878     NaN  22.302   20.04  1977633.51  601318
2        1.0  2007-03-02  20.565     NaN  20.758  20.075   425048.32  601318
...      ...         ...     ...     ...     ...     ...         ...     ...
2561  2560.0  2017-12-13   71.21   72.12   72.62    70.2    865117.0  601318
2562  2561.0  2017-12-14   72.12   71.01   72.16    70.6    676186.0  601318
2563  2562.0  2017-12-15   70.69   70.38   71.44   70.05    735547.0  601318
[2564 rows x 8 columns]
# parse_dates: which columns to parse as dates; a boolean or a list
Writing to a CSV file: the to_csv method.
>>> df = pd.read_csv('601318.csv', index_col=0)
>>> df.iloc[0,0] = np.nan           # set row 0, column 0 to NaN
# Write to a new file
>>> df.to_csv('test.csv')
# Main parameters of the writing functions:
# sep: the output delimiter
# header=False: don't write the column-name row
>>> df.to_csv('test.csv', header=False)
# index=False: don't write the row index
>>> df.to_csv('test.csv', index=False)
# na_rep: the string to write for missing values; default is the empty string
>>> df.to_csv('test.csv', header=False, index=False, na_rep='null')     # write 'null' in the blanks
# columns: which columns to write, as a list
>>> df.to_csv('test.csv', header=False, index=False, na_rep='null', columns=['date','open','close','high'])  # write the first four columns
>>> df.to_html('test.html')         # write in html format
>>> df.to_json('test.json')         # write in json format
>>> pd.read_json('test.json')       # read a json file
     Unnamed: 0        date    open   close    high     low      volume    code
0             0  2007-03-01  21.878    None  22.302  20.040  1977633.51  601318
1             1  2007-03-02  20.565    None  20.758  20.075   425048.32  601318
...         ...         ...     ...     ...     ...     ...         ...     ...
998         998  2011-07-07  22.438  21.985  22.465  21.832   230480.00  601318
999         999  2011-07-08  22.076  21.936  22.212  21.850   141415.00  601318
[2563 rows x 8 columns]
>>> pd.read_html('test.html')       # read an html file; returns a list of DataFrames
[      Unnamed: 0  Unnamed: 0.1        date  ...     low      volume    code
 0              0             0  2007-03-01  ...  20.040  1977633.51  601318
 1              1             1  2007-03-02  ...  20.075   425048.32  601318
 ...          ...           ...         ...  ...     ...         ...     ...
 2561        2561          2561  2017-12-14  ...  70.600   676186.00  601318
 2562        2562          2562  2017-12-15  ...  70.050   735547.00  601318
 [2563 rows x 9 columns]]