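To apply a custom function, or a function from another library, to a Pandas object, there are three approaches: pipe() applies a function to the whole table, apply() applies a function along rows or columns, and applymap() applies a function to each element.
Passing a function object and its arguments to pipe() runs a custom operation on the entire DataFrame.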
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

def adder(x, y):
    return x + y

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])
    print(df)
    df = df.pipe(adder, 1)
    print(df)

# output:
#        col1      col2      col3
# 0  0.390803  0.940306 -1.300635
# 1 -0.349588 -1.290132  0.415693
# 2 -0.079585 -0.083825  0.262867
# 3  0.582377  0.171701 -1.011748
# 4 -0.466655  1.746269  1.281538
#        col1      col2      col3
# 0  1.390803  1.940306 -0.300635
# 1  0.650412 -0.290132  1.415693
# 2  0.920415  0.916175  1.262867
# 3  1.582377  1.171701 -0.011748
# 4  0.533345  2.746269  2.281538
apply() applies a function along an axis of a DataFrame (or of a Panel in pandas versions before 0.25, where Panel still existed), selected with the optional axis parameter. By default, the operation is performed column by column.
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

def adder(x, y):
    return x + y

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])
    print(df)
    # Apply column by column (the default, axis=0)
    result = df.apply(np.sum)
    print(result)
    # Apply row by row
    result = df.apply(np.sum, axis=1)
    print(result)

# output:
#        col1      col2      col3
# 0 -1.773775 -0.608478  0.602059
# 1 -0.208412  0.969435 -0.292108
# 2  0.776864 -0.768559 -0.389092
# 3 -2.088412  1.133090  1.006486
# 4  0.693241  1.808845  0.772191
# col1   -2.600494
# col2    2.534332
# col3    1.699536
# dtype: float64
# 0   -1.780194
# 1    0.468915
# 2   -0.380788
# 3    0.051164
# 4    3.274277
# dtype: float64
DataFrame.applymap() accepts any Python function that takes a single value and returns a single value, and applies it to every element.
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])
    print(df)
    df = df.applymap(lambda x: x + 1)
    print(df)

# output:
#        col1      col2      col3
# 0  2.396185 -0.263581 -0.090799
# 1  1.718716  0.876074 -1.067746
# 2 -1.033945 -0.078448  1.036566
# 3  0.553849  0.251312 -0.422640
# 4 -0.896062  1.605349 -0.089430
#        col1      col2      col3
# 0  3.396185  0.736419  0.909201
# 1  2.718716  1.876074 -0.067746
# 2 -0.033945  0.921552  2.036566
# 3  1.553849  1.251312  0.577360
# 4  0.103938  2.605349  0.910570
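In pandas 2.1 and later, applymap() is deprecated in favor of DataFrame.map(). The following is a minimal sketch of an element-wise call that should work on either side of that change; the helper name elementwise is hypothetical and exists only for this illustration.

```python
import pandas as pd


def elementwise(df, func):
    """Apply func to every element of df (hypothetical helper for this sketch)."""
    # DataFrame.map is available from pandas 2.1; older versions only have applymap.
    mapper = getattr(df, "map", None) or df.applymap
    return mapper(func)


df = pd.DataFrame({"col1": [1.0, 2.0], "col2": [3.0, 4.0]})
result = elementwise(df, lambda x: x + 1)
print(result)
```

For simple arithmetic such as this, the vectorized expression df + 1 gives the same result without calling a Python function per element.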
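Data cleaning is complex and tedious work, and it is also one of the most important stages of data analysis. It has two goals: making the data usable, and making it better suited to the analysis at hand. For that reason, apparently clean data needs cleaning just as dirty data does. In real projects, data cleaning is commonly estimated to take around 70% of the project time.
Count the missing values in each column: df.isnull().sum()
Count the complete (non-missing) values in each column: df.shape[0] - df.isnull().sum()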
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20190101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 3), index=dates, columns=list('ABC'))
    print(df)
    print(df.isnull().sum())
    print(df.shape[0] - df.isnull().sum())

# output:
#                    A         B         C
# 2019-01-01  1.138325  0.981597  1.359580
# 2019-01-02 -1.622074  0.812393 -0.946351
# 2019-01-03  0.049815  1.194241  0.807209
# 2019-01-04  1.500074 -0.570367 -0.328529
# 2019-01-05  0.465869  1.049651 -0.112453
# 2019-01-06 -1.399495  0.492769  1.961198
# A    0
# B    0
# C    0
# dtype: int64
# A    6
# B    6
# C    6
# dtype: int64
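The example above contains no missing values, so both counts are trivial. As a small sketch with made-up data that does contain NaN, the two counts can be compared directly; df.count() is used here as an equivalent way to count non-missing values.

```python
import numpy as np
import pandas as pd

# Made-up data with a few missing values
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, np.nan, 6.0],
    "C": [7.0, 8.0, 9.0],
})

missing_per_column = df.isnull().sum()                  # NaN count per column
complete_per_column = df.shape[0] - df.isnull().sum()   # non-NaN count per column

print(missing_per_column)
print(complete_per_column)
print(df.count())  # same numbers as complete_per_column
```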
Dropping columns
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20190101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    # Delete column D in place
    del df['D']
    print(df)
    # Drop the column at position 2 (zero-based), which is column C
    df.drop(df.columns[2], axis=1, inplace=True)
    # Drop column B by label
    df.drop('B', axis=1, inplace=True)
    print(df)

# output:
#                    A         B         C
# 2019-01-01 -0.703151  0.753482 -0.624376
# 2019-01-02 -0.396221 -0.832279 -1.419897
# 2019-01-03 -0.179341 -0.368501 -0.300810
# 2019-01-04  0.464156  0.117461  1.502114
# 2019-01-05 -1.022012 -1.612456  1.611377
# 2019-01-06 -0.677521  0.001020 -0.342290
#                    A
# 2019-01-01 -0.703151
# 2019-01-02 -0.396221
# 2019-01-03 -0.179341
# 2019-01-04  0.464156
# 2019-01-05 -1.022012
# 2019-01-06 -0.677521
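Dropping NaN values
df.dropna(self, axis=0, how='any', thresh=None, subset=None, inplace=False)
axis: the axis to operate on; 0 drops rows, 1 drops columns.
how: 'any' drops every row or column that contains at least one NaN; 'all' drops only rows or columns in which every value is NaN.
thresh: the minimum number of non-NaN values a row or column must have to be kept; rows or columns with fewer are dropped.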
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20190101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    df.iloc[1, 3] = None
    df.iloc[2, 2] = None
    print(df)
    print(df.dropna(axis=1))
    print(df.dropna(how='any'))

# output:
#                    A         B         C         D
# 2019-01-01 -0.152239 -2.315100 -0.504998 -0.987549
# 2019-01-02 -1.884801  1.046506 -1.618871       NaN
# 2019-01-03  0.976682 -1.043107       NaN  0.391338
# 2019-01-04  0.143389  0.951518  0.040632 -0.443944
# 2019-01-05  3.092766  0.787921 -2.408260 -1.111238
# 2019-01-06 -0.179249  0.573734 -0.912023  0.261517
#                    A         B
# 2019-01-01 -0.152239 -2.315100
# 2019-01-02 -1.884801  1.046506
# 2019-01-03  0.976682 -1.043107
# 2019-01-04  0.143389  0.951518
# 2019-01-05  3.092766  0.787921
# 2019-01-06 -0.179249  0.573734
#                    A         B         C         D
# 2019-01-01 -0.152239 -2.315100 -0.504998 -0.987549
# 2019-01-04  0.143389  0.951518  0.040632 -0.443944
# 2019-01-05  3.092766  0.787921 -2.408260 -1.111238
# 2019-01-06 -0.179249  0.573734 -0.912023  0.261517
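The example above does not exercise thresh or subset. A minimal sketch with made-up data shows how they differ from how: thresh counts the non-NaN values required to keep a row, and subset restricts the check to the listed columns.

```python
import numpy as np
import pandas as pd

# Made-up data: row 0 is complete, row 1 has one NaN, row 2 is all NaN
df = pd.DataFrame({
    "A": [1.0, np.nan, np.nan],
    "B": [2.0, 5.0, np.nan],
    "C": [3.0, 6.0, np.nan],
})

at_least_two = df.dropna(thresh=2)    # keep rows with at least 2 non-NaN values
a_present = df.dropna(subset=["A"])   # only look at column A when deciding
not_all_nan = df.dropna(how="all")    # drop rows where every value is NaN

print(at_least_two)
print(a_present)
print(not_all_nan)
```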
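Filling NaN values
df.fillna(self, value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
value: the fill value; it can be a dict whose keys are column names.
inplace: whether to modify the source data; defaults to False.
By default fillna returns a new object, but it can also modify the existing object in place.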
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20190101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    df.iloc[1, 3] = None
    df.iloc[2, 2] = None
    print(df)
    # Fill each column with its own value, given as a dict
    print(df.fillna({'C': 3.14, 'D': 0.0}))
    # Fill every NaN with one specified value, in place
    df.fillna(value=3.14, inplace=True)
    print(df)

# output:
#                    A         B         C         D
# 2019-01-01  0.490727 -0.603079  0.202922  2.012060
# 2019-01-02 -0.855106  0.305557  0.851141       NaN
# 2019-01-03 -0.324215  0.629637       NaN -0.174930
# 2019-01-04  0.085996  0.173265  0.416938 -0.903989
# 2019-01-05  0.009368  0.410056 -1.297822 -2.202893
# 2019-01-06  0.021892 -0.359749 -0.608556 -0.859454
#                    A         B         C         D
# 2019-01-01  0.490727 -0.603079  0.202922  2.012060
# 2019-01-02 -0.855106  0.305557  0.851141  0.000000
# 2019-01-03 -0.324215  0.629637  3.140000 -0.174930
# 2019-01-04  0.085996  0.173265  0.416938 -0.903989
# 2019-01-05  0.009368  0.410056 -1.297822 -2.202893
# 2019-01-06  0.021892 -0.359749 -0.608556 -0.859454
#                    A         B         C         D
# 2019-01-01  0.490727 -0.603079  0.202922  2.012060
# 2019-01-02 -0.855106  0.305557  0.851141  3.140000
# 2019-01-03 -0.324215  0.629637  3.140000 -0.174930
# 2019-01-04  0.085996  0.173265  0.416938 -0.903989
# 2019-01-05  0.009368  0.410056 -1.297822 -2.202893
# 2019-01-06  0.021892 -0.359749 -0.608556 -0.859454
Building a boolean mask of the missing values
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20190101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    df.iloc[1, 3] = None
    df.iloc[2, 2] = None
    print(df)
    print(pd.isnull(df))

# output:
#                    A         B         C         D
# 2019-01-01 -1.337471  0.154446  0.493862  1.278946
# 2019-01-02  2.853301 -0.151376  0.318281       NaN
# 2019-01-03  1.094465  0.059063       NaN  0.216805
# 2019-01-04 -0.983091 -1.052905  0.416604 -1.431156
# 2019-01-05 -1.421142  1.015465 -1.851315 -0.680514
# 2019-01-06  0.224378 -0.636699 -0.749040 -0.728368
#                 A      B      C      D
# 2019-01-01  False  False  False  False
# 2019-01-02  False  False  False   True
# 2019-01-03  False  False   True  False
# 2019-01-04  False  False  False  False
# 2019-01-05  False  False  False  False
# 2019-01-06  False  False  False  False
A single column of a DataFrame can be selected with dictionary-style key access, for example df['A'].
Specifying index and columns when creating a DataFrame
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    print(df)

# output:
#                    A         B         C         D
# 2013-01-01  1.116914 -0.221035 -0.577299 -0.328831
# 2013-01-02  1.764656  1.462838 -0.360678  1.176134
# 2013-01-03  0.144396 -0.594359 -0.548543  1.281829
# 2013-01-04  0.632378  0.895123 -0.757924 -1.325917
# 2013-01-05  0.219125 -1.247446  0.335363 -0.676052
# 2013-01-06  0.963715 -0.131331  0.326482 -0.718461
index and columns can also be assigned after the DataFrame has been created
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    print(df)
    df.index = pd.date_range('20130201', periods=df.shape[0])
    df.columns = list('abcd')
    print(df)
    df.index = pd.date_range('20130301', periods=len(df))
    df.columns = list('ABCD')
    print(df)

# output:
#                    A         B         C         D
# 2013-01-01  1.588442  1.548420  0.132539  0.410512
# 2013-01-02  0.200415  1.515354  2.275575 -1.533603
# 2013-01-03  0.838294  0.067409 -1.157181  0.401973
# 2013-01-04  0.551363 -0.749296  0.343762 -1.558969
# 2013-01-05 -0.799507 -1.343379 -0.006312  1.091014
# 2013-01-06  0.012188 -0.382384  0.280008 -2.333430
#                    a         b         c         d
# 2013-02-01  1.588442  1.548420  0.132539  0.410512
# 2013-02-02  0.200415  1.515354  2.275575 -1.533603
# 2013-02-03  0.838294  0.067409 -1.157181  0.401973
# 2013-02-04  0.551363 -0.749296  0.343762 -1.558969
# 2013-02-05 -0.799507 -1.343379 -0.006312  1.091014
# 2013-02-06  0.012188 -0.382384  0.280008 -2.333430
#                    A         B         C         D
# 2013-03-01  1.588442  1.548420  0.132539  0.410512
# 2013-03-02  0.200415  1.515354  2.275575 -1.533603
# 2013-03-03  0.838294  0.067409 -1.157181  0.401973
# 2013-03-04  0.551363 -0.749296  0.343762 -1.558969
# 2013-03-05 -0.799507 -1.343379 -0.006312  1.091014
# 2013-03-06  0.012188 -0.382384  0.280008 -2.333430
An existing column can be set as the index
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
    df['date'] = dates
    print(df)
    df = df.set_index('date', drop=True)
    print(df)

# output:
#           A         B         C         D       date
# 0  0.910416 -0.378195  0.332562 -0.194766 2013-01-01
# 1  0.533733  0.888629 -0.358143  1.583278 2013-01-02
# 2  0.482362 -0.905558  1.045753 -0.874653 2013-01-03
# 3  0.901622 -0.535862 -0.439763 -0.640594 2013-01-04
# 4 -1.273577 -0.746785  1.448309 -0.368285 2013-01-05
# 5  0.191289 -1.246213  0.184757 -1.143074 2013-01-06
#                    A         B         C         D
# date
# 2013-01-01  0.910416 -0.378195  0.332562 -0.194766
# 2013-01-02  0.533733  0.888629 -0.358143  1.583278
# 2013-01-03  0.482362 -0.905558  1.045753 -0.874653
# 2013-01-04  0.901622 -0.535862 -0.439763 -0.640594
# 2013-01-05 -1.273577 -0.746785  1.448309 -0.368285
# 2013-01-06  0.191289 -1.246213  0.184757 -1.143074
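Starting from an existing DataFrame, you can build a new DataFrame from its per-column summary statistics, or compute row-wise summary statistics and add them to the original DataFrame as new columns.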
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 3), index=dates, columns=list('ABC'))
    print(df)
    df1 = pd.DataFrame()
    df1['min'] = df.min()
    df1['max'] = df.max()
    df1['std'] = df.std()
    print(df1)
    df['min'] = df.min(axis=1)
    df['max'] = df.max(axis=1)
    df['std'] = df.std(axis=1)
    print(df)

# output:
#                    A         B         C
# 2013-01-01  0.901073  1.706925 -0.503194
# 2013-01-02  0.379870  0.729674  0.579337
# 2013-01-03 -1.285323 -0.665951 -0.161148
# 2013-01-04 -0.714282  0.423376  0.586061
# 2013-01-05 -0.895171 -0.413328  0.485803
# 2013-01-06  1.926472 -0.718467  1.113522
#         min       max       std
# A -1.285323  1.926472  1.234084
# B -0.718467  1.706925  0.955797
# C -0.503194  1.113522  0.582913
#                    A         B         C       min       max       std
# 2013-01-01  0.901073  1.706925 -0.503194 -0.503194  1.706925  1.113132
# 2013-01-02  0.379870  0.729674  0.579337  0.379870  0.729674  0.175247
# 2013-01-03 -1.285323 -0.665951 -0.161148 -1.285323 -0.161148  0.562671
# 2013-01-04 -0.714282  0.423376  0.586061 -0.714282  0.586061  0.685749
# 2013-01-05 -0.895171 -0.413328  0.485803 -0.895171  0.485803  0.696763
# 2013-01-06  1.926472 -0.718467  1.113522 -0.718467  1.926472  1.341957
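axis=0: the statistic is computed over each column of the DataFrame, giving one row of results. axis=1: the statistic is computed over each row of the DataFrame, giving one column of results.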
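The direction of axis is easy to mix up, so here is a small sketch with made-up integer data where the results can be checked by hand.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

per_column = df.sum(axis=0)  # one result per column: A -> 6, B -> 60
per_row = df.sum(axis=1)     # one result per row: 11, 22, 33

print(per_column)
print(per_row)
```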
The index and columns of a DataFrame can be renamed.
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 3), index=dates, columns=list('ABC'))
    print(df)
    # Shift every index label forward by 5 days and rename columns A and B.
    # Adding a plain integer to a Timestamp no longer works in pandas 1.0+,
    # so an explicit Timedelta is used.
    df = df.rename(index=lambda x: x + pd.Timedelta(days=5),
                   columns={'A': 'newA', 'B': 'newB'})
    print(df)

# output:
#                    A         B         C
# 2013-01-01  0.834910  0.652175  0.537611
# 2013-01-02  1.083902  0.836208 -1.466876
# 2013-01-03 -0.044256  0.932547  1.843682
# 2013-01-04  1.610113 -0.705734 -0.145042
# 2013-01-05  1.114897  0.273569 -0.047725
# 2013-01-06 -0.541942 -0.112752  1.644338
#                 newA      newB         C
# 2013-01-06  0.834910  0.652175  0.537611
# 2013-01-07  1.083902  0.836208 -1.466876
# 2013-01-08 -0.044256  0.932547  1.843682
# 2013-01-09  1.610113 -0.705734 -0.145042
# 2013-01-10  1.114897  0.273569 -0.047725
# 2013-01-11 -0.541942 -0.112752  1.644338
Unifying the units of a column
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20190101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    df['D'] = [10000, 34000, 60000, 34000, 56000, 80000]
    print(df)
    for i in range(len(df['D'])):
        weight = float(df.iloc[i, 3]) / 10000
        df.iloc[i, 3] = '{}萬'.format(weight)
    print(df)

# output:
#                    A         B         C      D
# 2019-01-01 -0.889533 -0.411451  0.563969  10000
# 2019-01-02 -0.573239  0.264805 -0.058530  34000
# 2019-01-03  1.224993 -1.815338 -2.075301  60000
# 2019-01-04  0.266483  1.841926 -0.759681  34000
# 2019-01-05 -0.167595  0.432617  0.533577  56000
# 2019-01-06 -0.973877  0.700821  1.093101  80000
#                    A         B         C     D
# 2019-01-01 -0.889533 -0.411451  0.563969  1.0萬
# 2019-01-02 -0.573239  0.264805 -0.058530  3.4萬
# 2019-01-03  1.224993 -1.815338 -2.075301  6.0萬
# 2019-01-04  0.266483  1.841926 -0.759681  3.4萬
# 2019-01-05 -0.167595  0.432617  0.533577  5.6萬
# 2019-01-06 -0.973877  0.700821  1.093101  8.0萬
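The row-by-row loop above can also be written as a single vectorized expression. This is a minimal sketch, assuming a hypothetical new column named D_wan so that the numeric column D keeps its integer dtype (writing strings into an integer column triggers warnings or errors in recent pandas versions).

```python
import pandas as pd

df = pd.DataFrame({"D": [10000, 34000, 60000]})

# Divide the whole column at once, then format each value with the unit suffix.
# D_wan is a hypothetical column name chosen for this sketch.
df["D_wan"] = (df["D"] / 10000).map("{}萬".format)

print(df)
```

Keeping the original numeric column alongside the formatted one means later calculations do not have to parse the strings back into numbers.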
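df.duplicated(self, subset=None, keep='first')
Checks whether the DataFrame contains duplicate rows and returns a boolean Series that marks them.
subset: a column label or a sequence of column labels; only these columns are considered when identifying duplicates.
keep: one of 'first', 'last', False. 'first' marks every duplicate except the first occurrence; 'last' marks every duplicate except the last occurrence; False marks all duplicated rows.
df.drop_duplicates(self, subset=None, keep='first', inplace=False)
Removes duplicate rows from the DataFrame.
subset: a column label or a sequence of column labels; only these columns are considered when identifying duplicates.
keep: one of 'first', 'last', False. 'first' keeps the first occurrence; 'last' keeps the last occurrence; False drops every duplicated row.
inplace: True modifies the source data; False leaves the source data unchanged and returns a new object.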
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data = [['Alex', np.nan, 80], ['Bob', 25, 90], ['Bob', 25, 90]]
    df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])
    print(df)
    # Filter with the boolean mask to pick out the duplicated rows
    print(df[df.duplicated(keep=False)])
    # Drop duplicates, modifying the source data
    df.drop_duplicates(keep='last', inplace=True)
    print(df)

# output:
#    Name   Age  Score
# 0  Alex   NaN     80
# 1   Bob  25.0     90
# 2   Bob  25.0     90
#   Name   Age  Score
# 1  Bob  25.0     90
# 2  Bob  25.0     90
#    Name   Age  Score
# 0  Alex   NaN     80
# 2   Bob  25.0     90
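The subset parameter is not used in the example above. A small sketch with made-up data, where two rows share a Name but differ in Score, shows how subset and keep interact.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alex", "Bob", "Bob"], "Score": [80, 90, 95]})

# No full-row duplicates exist, but Name alone repeats
full_row_mask = df.duplicated()
name_mask = df.duplicated(subset=["Name"], keep=False)   # marks both Bob rows

keep_first = df.drop_duplicates(subset=["Name"], keep="first")
keep_last = df.drop_duplicates(subset=["Name"], keep="last")
keep_none = df.drop_duplicates(subset=["Name"], keep=False)  # drops both Bob rows

print(name_mask)
print(keep_first)
print(keep_last)
print(keep_none)
```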
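Outliers fall into two kinds. The first is invalid data, such as Chinese characters or symbols mixed into a numeric column. The second is anomalous data: values that are unusually large or unusually small.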
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

def swap(x):
    # Strip a trailing unit character (歲 = years old, 分 = points)
    # from string values and convert the rest to int
    if isinstance(x, str):
        if x[-1] == '歲':
            x = int(x[:-1])
        elif x[-1] == '分':
            x = int(x[:-1])
    return x

if __name__ == "__main__":
    data = [['Alex', np.nan, '89分'], ['Bob', '25歲', '90分'], ['Bob', '28歲', '90分']]
    df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])
    print(df)
    df = df.applymap(swap)
    print(df)

# output:
#    Name  Age Score
# 0  Alex  NaN   89分
# 1   Bob  25歲   90分
# 2   Bob  28歲   90分
#    Name   Age  Score
# 0  Alex   NaN     89
# 1   Bob  25.0     90
# 2   Bob  28.0     90
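The example above handles the first kind of outlier (invalid data). For the second kind, unusually large or small values, the text does not name a method; the sketch below assumes the common 1.5 × IQR rule purely as an illustration, on made-up data.

```python
import pandas as pd

# Made-up scores with one obviously extreme value
scores = pd.Series([10, 12, 11, 13, 12, 300])

q1 = scores.quantile(0.25)
q3 = scores.quantile(0.75)
iqr = q3 - q1

# Assumption: values beyond 1.5 * IQR from the quartiles count as anomalous
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (scores < lower) | (scores > upper)

outliers = scores[is_outlier]
cleaned = scores[~is_outlier]
print(outliers)
print(cleaned)
```

Whether flagged values should be dropped, capped, or kept depends on the data; the rule only identifies candidates for review.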
Strip leading and trailing whitespace from a string column: df['city'] = df['city'].map(str.strip)
Convert a string column to lower case: df['city'] = df['city'].str.lower()
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data = [['Alex', np.nan, 80], [' Bob ', 25, 90], [' Bob', 25, 90]]
    df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])
    print(df)
    # Strip leading and trailing whitespace
    print(df['Name'].map(str.strip))
    # Convert to lower case
    print(df['Name'].str.lower())

# output:
#     Name   Age  Score
# 0   Alex   NaN     80
# 1   Bob   25.0     90
# 2    Bob  25.0     90
# 0    Alex
# 1     Bob
# 2     Bob
# Name: Name, dtype: object
# 0     alex
# 1     bob
# 2      bob
# Name: Name, dtype: object
Change the data type of a column: df['price'].astype('int')
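astype() has no worked example above, so here is a minimal sketch on made-up data. It also shows one caveat: a column containing NaN cannot be cast to plain int, and the nullable 'Int64' dtype is used here as one possible workaround.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": ["10", "20", "30"], "qty": [1.0, 2.0, np.nan]})

# Numeric strings convert directly to integers
df["price"] = df["price"].astype("int")

# qty contains NaN, so astype('int') would raise; the nullable Int64 dtype keeps the gap
df["qty"] = df["qty"].astype("Int64")

print(df.dtypes)
print(df)
```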
Replace values in a column: df['city'].replace('sh', 'shanghai')
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data = [['Alex', np.nan, 80], ['Bob', 25, 90], ['Bob', 25, 90]]
    df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])
    print(df)
    print(df['Name'].replace('Bob', 'Bauer'))

# output:
#    Name   Age  Score
# 0  Alex   NaN     80
# 1   Bob  25.0     90
# 2   Bob  25.0     90
# 0     Alex
# 1    Bauer
# 2    Bauer
# Name: Name, dtype: object
replace() matches whole values exactly, so the stored strings must not have leading or trailing spaces.
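A short sketch on made-up data illustrates the exact-match requirement: values padded with spaces are left untouched until they are stripped.

```python
import pandas as pd

names = pd.Series(["Alex", " Bob ", " Bob"])

# The padded values do not equal 'Bob', so nothing is replaced
unchanged = names.replace("Bob", "Bauer")

# Stripping first makes the exact match succeed
cleaned = names.str.strip().replace("Bob", "Bauer")

print(unchanged.tolist())
print(cleaned.tolist())
```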
(1) Sorting by label
sort_index(self, axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, by=None)
sort_index() sorts a DataFrame by its labels; the axis and the sort order are passed as arguments. By default, the row labels are sorted in ascending order.
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(4, 3),
                      index=['rank2', 'rank1', 'rank4', 'rank3'],
                      columns=['col1', 'col2', 'col3'])
    print(df)
    df = df.sort_index()
    print(df)

# output:
#            col1      col2      col3
# rank2 -0.627700 -0.361006 -1.126366
# rank1 -1.997538  1.569461  0.454773
# rank4 -0.598688  1.348594  0.777791
# rank3 -0.190794 -1.209312  0.830699
#            col1      col2      col3
# rank1 -1.997538  1.569461  0.454773
# rank2 -0.627700 -0.361006 -1.126366
# rank3 -0.190794 -1.209312  0.830699
# rank4 -0.598688  1.348594  0.777791
Pass a boolean to the ascending parameter to control the sort order, and pass axis=1 to sort the column labels. By default axis=0, which sorts the row labels.
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(4, 3),
                      index=['rank2', 'rank1', 'rank4', 'rank3'],
                      columns=['col3', 'col2', 'col1'])
    print(df)
    # Sort by column labels
    df = df.sort_index(ascending=True, axis=1)
    print(df)

# output:
#            col3      col2      col1
# rank2 -0.715319 -0.245760 -1.282737
# rank1  0.046705 -0.202133  0.185576
# rank4 -1.608270 -0.491281  0.047686
# rank3 -1.013456 -0.020197  1.184151
#            col1      col2      col3
# rank2 -1.282737 -0.245760 -0.715319
# rank1  0.185576 -0.202133  0.046705
# rank4  0.047686 -0.491281 -1.608270
# rank3  1.184151 -0.020197 -1.013456
(2) Sorting by value
sort_values(self, by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
sort_values() sorts by value. Its by parameter takes the name of the DataFrame column to sort on, or a list of column names.
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(4, 3),
                      index=['rank2', 'rank1', 'rank4', 'rank3'],
                      columns=['col3', 'col2', 'col1'])
    print(df)
    df = df.sort_values(by="col2")
    print(df)

# output:
#            col3      col2      col1
# rank2 -0.706054 -2.135880  1.066836
# rank1  0.290660 -2.214451 -1.724394
# rank4  1.211874  0.475177 -0.711855
# rank3 -0.253331  1.211301 -0.208633
#            col3      col2      col1
# rank1  0.290660 -2.214451 -1.724394
# rank2 -0.706054 -2.135880  1.066836
# rank4  1.211874  0.475177 -0.711855
# rank3 -0.253331  1.211301 -0.208633
sort_values() offers three sorting algorithms, mergesort, heapsort and quicksort, selected through the kind parameter. Of these three, only mergesort is stable.
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(4, 3),
                      index=['rank2', 'rank1', 'rank4', 'rank3'],
                      columns=['col3', 'col2', 'col1'])
    print(df)
    df = df.sort_values(by="col2", kind='mergesort')
    print(df)

# output:
#            col3      col2      col1
# rank2 -0.243768 -0.344846  0.535481
# rank1 -1.491950  0.690749 -2.023808
# rank4 -0.656292 -0.704788  0.655129
# rank3  0.468007 -0.250702  0.079670
#            col3      col2      col1
# rank4 -0.656292 -0.704788  0.655129
# rank2 -0.243768 -0.344846  0.535481
# rank3  0.468007 -0.250702  0.079670
# rank1 -1.491950  0.690749 -2.023808
Sorting by several columns in order (the example sorts ascending; pass ascending=False for descending order)
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(4, 3),
                      index=['rank2', 'rank1', 'rank4', 'rank3'],
                      columns=['col3', 'col2', 'col1'])
    print(df)
    df = df.sort_values(by=['col1', 'col3'], ascending=True, axis=0)
    print(df)

# output:
#            col3      col2      col1
# rank2  1.035965  1.048124 -0.341586
# rank1  2.391899 -1.575462  0.616940
# rank4  0.968523 -0.932288 -0.553498
# rank3  0.585521  1.907344 -0.264500
#            col3      col2      col1
# rank4  0.968523 -0.932288 -0.553498
# rank2  1.035965  1.048124 -0.341586
# rank3  0.585521  1.907344 -0.264500
# rank1  2.391899 -1.575462  0.616940
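With random floats the second sort key never comes into play, because the first key has no ties. The sketch below uses made-up integer data with ties in col1 to show a fully descending sort and a mixed-direction sort, where ascending takes one boolean per column in by.

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [1, 1, 2, 2], "col3": [5, 7, 6, 8]},
    index=["rank1", "rank2", "rank3", "rank4"],
)

# Both keys descending
descending = df.sort_values(by=["col1", "col3"], ascending=False)

# col1 ascending, ties broken by col3 descending
mixed = df.sort_values(by=["col1", "col3"], ascending=[True, False])

print(descending)
print(mixed)
```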
Pandas can split a DataFrame with groupby(), which returns a grouped object.
df.groupby(self, by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, observed=False, **kwargs)
by: how to group; it can be a dict, a function, a label, or a list of labels.
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data = [['Alex', 24, 80], ['Bob', 25, 90], ['Bauer', 25, 90], ['Jack', 26, 80]]
    df = pd.DataFrame(data, index=['a', 'b', 'c', 'd'], columns=['Name', 'Age', 'A'])
    print(df)
    group_obj1 = df.groupby('Name')
    print(group_obj1.groups)
    print('===================================')
    # Iterate over a single-level grouping
    for key, data in group_obj1:
        print(key)
        print(data)
    group_obj2 = df.groupby(['Name', 'A'])
    # Inspect the group information
    print(group_obj2.groups)
    print('===================================')
    # Iterate over a multi-level grouping
    for key, data in group_obj2:
        print(key)
        print(data)

# output:
#     Name  Age   A
# a   Alex   24  80
# b    Bob   25  90
# c  Bauer   25  90
# d   Jack   26  80
# {'Alex': Index(['a'], dtype='object'), 'Bauer': Index(['c'], dtype='object'), 'Bob': Index(['b'], dtype='object'), 'Jack': Index(['d'], dtype='object')}
# ===================================
# Alex
#    Name  Age   A
# a  Alex   24  80
# Bauer
#     Name  Age   A
# c  Bauer   25  90
# Bob
#   Name  Age   A
# b  Bob   25  90
# Jack
#    Name  Age   A
# d  Jack   26  80
# {('Alex', 80): Index(['a'], dtype='object'), ('Bauer', 90): Index(['c'], dtype='object'), ('Bob', 90): Index(['b'], dtype='object'), ('Jack', 80): Index(['d'], dtype='object')}
# ===================================
# ('Alex', 80)
#    Name  Age   A
# a  Alex   24  80
# ('Bauer', 90)
#     Name  Age   A
# c  Bauer   25  90
# ('Bob', 90)
#   Name  Age   A
# b  Bob   25  90
# ('Jack', 80)
#    Name  Age   A
# d  Jack   26  80
filter() keeps or drops whole groups according to a condition.
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data = [['Alex', 24, 80], ['Bob', 25, 92], ['Bauer', 25, 90], ['Jack', 26, 80]]
    df = pd.DataFrame(data, index=['a', 'b', 'c', 'd'], columns=['Name', 'Age', 'A'])
    print(df)
    group_obj1 = df.groupby('Age')
    print(group_obj1.groups)
    # Keep only the people who share their age with someone else
    group = group_obj1.filter(lambda x: len(x) > 1)
    print(group)

# output:
#     Name  Age   A
# a   Alex   24  80
# b    Bob   25  92
# c  Bauer   25  90
# d   Jack   26  80
# {24: Index(['a'], dtype='object'), 25: Index(['b', 'c'], dtype='object'), 26: Index(['d'], dtype='object')}
#     Name  Age   A
# b    Bob   25  92
# c  Bauer   25  90
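pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False)
Merges two DataFrame objects.
left: the left DataFrame.
right: the right DataFrame.
on: the column name(s) to join on; they must be present in both the left and the right DataFrame.
left_on: columns of the left DataFrame to use as keys; either column names or arrays whose length equals the length of the DataFrame.
right_on: columns of the right DataFrame to use as keys; either column names or arrays whose length equals the length of the DataFrame.
left_index: if True, the index (row labels) of the left DataFrame is used as its join key. For a DataFrame with a MultiIndex (hierarchical index), the number of levels must match the number of join keys from the right DataFrame.
right_index: same usage as left_index, applied to the right DataFrame.
how: one of left, right, outer, inner; defaults to inner.
sort: sort the result DataFrame by the join keys in lexicographical order. Older tutorials list the default as True, but in current pandas releases it is False; leaving sorting off can improve performance considerably.
An example of merging two DataFrames on a single key: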
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data1 = [['Alex', 24, 80], ['Bob', 25, 90], ['Bauer', 25, 90]]
    left = pd.DataFrame(data1, columns=['Name', 'Age', 'A'])
    data2 = [['Alex', 87, 78], ['Bob', 67, 87], ['Bauer', 98, 78]]
    right = pd.DataFrame(data2, columns=['Name', 'B', 'C'])
    print(left)
    print('==================================')
    print(right)
    print('==================================')
    df = pd.merge(left, right, on='Name')
    print(df)

# output:
#     Name  Age   A
# 0   Alex   24  80
# 1    Bob   25  90
# 2  Bauer   25  90
# ==================================
#     Name   B   C
# 0   Alex  87  78
# 1    Bob  67  87
# 2  Bauer  98  78
# ==================================
#     Name  Age   A   B   C
# 0   Alex   24  80  87  78
# 1    Bob   25  90  67  87
# 2  Bauer   25  90  98  78
An example of merging two DataFrames on multiple keys:
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data1 = [[1, 'Alex', 24, 80], [2, 'Bob', 25, 90], [3, 'Bauer', 25, 90]]
    left = pd.DataFrame(data1, columns=['ID', 'Name', 'Age', 'A'])
    data2 = [[1, 'Alex', 87, 78], [4, 'Bob', 67, 87], [3, 'Bauer', 98, 78]]
    right = pd.DataFrame(data2, columns=['ID', 'Name', 'B', 'C'])
    print(left)
    print('==================================')
    print(right)
    print('==================================')
    df = pd.merge(left, right, on=['ID', 'Name'])
    print(df)

# output:
#    ID   Name  Age   A
# 0   1   Alex   24  80
# 1   2    Bob   25  90
# 2   3  Bauer   25  90
# ==================================
#    ID   Name   B   C
# 0   1   Alex  87  78
# 1   4    Bob  67  87
# 2   3  Bauer  98  78
# ==================================
#    ID   Name  Age   A   B   C
# 0   1   Alex   24  80  87  78
# 1   3  Bauer   25  90  98  78
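The how parameter determines which keys are included in the resulting table. If a key combination does not appear in the left or the right table, the corresponding values in the joined table are NA.
left: LEFT OUTER JOIN, uses the keys from the left object.
right: RIGHT OUTER JOIN, uses the keys from the right object.
outer: FULL OUTER JOIN, uses the union of the keys.
inner: INNER JOIN, uses the intersection of the keys.
Left Join example: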
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data1 = [[1, 'Alex', 24, 80], [2, 'Bob', 25, 90], [3, 'Bauer', 25, 90]]
    left = pd.DataFrame(data1, columns=['ID', 'Name', 'Age', 'A'])
    data2 = [[1, 'Alex', 87, 78], [4, 'Bob', 67, 87], [3, 'Bauer', 98, 78]]
    right = pd.DataFrame(data2, columns=['ID', 'Name', 'B', 'C'])
    print(left)
    print('==================================')
    print(right)
    print('==================================')
    df = pd.merge(left, right, on='ID', how='left')
    print(df)

# output:
#    ID   Name  Age   A
# 0   1   Alex   24  80
# 1   2    Bob   25  90
# 2   3  Bauer   25  90
# ==================================
#    ID   Name   B   C
# 0   1   Alex  87  78
# 1   4    Bob  67  87
# 2   3  Bauer  98  78
# ==================================
#    ID Name_x  Age   A Name_y     B     C
# 0   1   Alex   24  80   Alex  87.0  78.0
# 1   2    Bob   25  90    NaN   NaN   NaN
# 2   3  Bauer   25  90  Bauer  98.0  78.0
Right Join example:
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data1 = [[1, 'Alex', 24, 80], [2, 'Bob', 25, 90], [3, 'Bauer', 25, 90]]
    left = pd.DataFrame(data1, columns=['ID', 'Name', 'Age', 'A'])
    data2 = [[1, 'Alex', 87, 78], [4, 'Bob', 67, 87], [3, 'Bauer', 98, 78]]
    right = pd.DataFrame(data2, columns=['ID', 'Name', 'B', 'C'])
    print(left)
    print('==================================')
    print(right)
    print('==================================')
    df = pd.merge(left, right, on='ID', how='right')
    print(df)

# output:
#    ID   Name  Age   A
# 0   1   Alex   24  80
# 1   2    Bob   25  90
# 2   3  Bauer   25  90
# ==================================
#    ID   Name   B   C
# 0   1   Alex  87  78
# 1   4    Bob  67  87
# 2   3  Bauer  98  78
# ==================================
#    ID Name_x   Age     A Name_y   B   C
# 0   1   Alex  24.0  80.0   Alex  87  78
# 1   3  Bauer  25.0  90.0  Bauer  98  78
# 2   4    NaN   NaN   NaN    Bob  67  87
Outer Join example:
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data1 = [[1, 'Alex', 24, 80], [2, 'Bob', 25, 90], [3, 'Bauer', 25, 90]]
    left = pd.DataFrame(data1, columns=['ID', 'Name', 'Age', 'A'])
    data2 = [[1, 'Alex', 87, 78], [4, 'Bob', 67, 87], [3, 'Bauer', 98, 78]]
    right = pd.DataFrame(data2, columns=['ID', 'Name', 'B', 'C'])
    print(left)
    print('==================================')
    print(right)
    print('==================================')
    df = pd.merge(left, right, on='ID', how='outer')
    print(df)

# output:
#    ID   Name  Age   A
# 0   1   Alex   24  80
# 1   2    Bob   25  90
# 2   3  Bauer   25  90
# ==================================
#    ID   Name   B   C
# 0   1   Alex  87  78
# 1   4    Bob  67  87
# 2   3  Bauer  98  78
# ==================================
#    ID Name_x   Age     A Name_y     B     C
# 0   1   Alex  24.0  80.0   Alex  87.0  78.0
# 1   2    Bob  25.0  90.0    NaN   NaN   NaN
# 2   3  Bauer  25.0  90.0  Bauer  98.0  78.0
# 3   4    NaN   NaN   NaN    Bob  67.0  87.0
Inner Join example:
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data1 = [[1, 'Alex', 24, 80], [2, 'Bob', 25, 90], [3, 'Bauer', 25, 90]]
    left = pd.DataFrame(data1, columns=['ID', 'Name', 'Age', 'A'])
    data2 = [[1, 'Alex', 87, 78], [4, 'Bob', 67, 87], [3, 'Bauer', 98, 78]]
    right = pd.DataFrame(data2, columns=['ID', 'Name', 'B', 'C'])
    print(left)
    print('==================================')
    print(right)
    print('==================================')
    df = pd.merge(left, right, on='ID', how='inner')
    print(df)

# output:
#    ID   Name  Age   A
# 0   1   Alex   24  80
# 1   2    Bob   25  90
# 2   3  Bauer   25  90
# ==================================
#    ID   Name   B   C
# 0   1   Alex  87  78
# 1   4    Bob  67  87
# 2   3  Bauer  98  78
# ==================================
#    ID Name_x  Age   A Name_y   B   C
# 0   1   Alex   24  80   Alex  87  78
# 1   3  Bauer   25  90  Bauer  98  78
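concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True)
Concatenates objects along an axis.
objs: a sequence or dict of Series, DataFrame, or Panel objects.
axis: {0, 1, ...}, default 0. axis=0 concatenates along the index; axis=1 concatenates along the columns.
join: {'inner', 'outer'}, default 'outer'. Specifies how to handle the indexes on the other axis.
ignore_index: boolean, default False. If True, the index values on the concatenation axis are not used, and the resulting axis is labeled 0, ..., n-1.
join_axes: a list of Index objects giving specific indexes to use for the other (n-1) axes instead of performing inner/outer set logic. This parameter was removed in pandas 1.0.
sort: whether to sort the non-concatenation axis; True sorts it, False does not.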
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data1 = [['Alex', 24, 80], ['Bob', 25, 90], ['Bauer', 25, 90]]
    one = pd.DataFrame(data1, columns=['Name', 'Age', 'A'])
    data2 = [['Alex', 87, 78], ['Bob', 67, 87], ['Bauer', 98, 78]]
    two = pd.DataFrame(data2, columns=['Name', 'B', 'C'])
    print(one)
    print('==================================')
    print(two)
    print('==================================')
    df = pd.concat([one, two], axis=1, sort=False)
    print(df)

# output:
#     Name  Age   A
# 0   Alex   24  80
# 1    Bob   25  90
# 2  Bauer   25  90
# ==================================
#     Name   B   C
# 0   Alex  87  78
# 1    Bob  67  87
# 2  Bauer  98  78
# ==================================
#     Name  Age   A   Name   B   C
# 0   Alex   24  80   Alex  87  78
# 1    Bob   25  90    Bob  67  87
# 2  Bauer   25  90  Bauer  98  78
If the concatenated result would have duplicate index labels and the new object should get its own index, set ignore_index=True.
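A minimal sketch on made-up data shows the difference: without ignore_index the row labels of the inputs are repeated, and with it the result is relabeled from 0.

```python
import pandas as pd

one = pd.DataFrame({"Name": ["Alex", "Bob"], "Age": [24, 25]})
two = pd.DataFrame({"Name": ["Bauer", "Jack"], "Age": [25, 26]})

kept = pd.concat([one, two])                           # index: 0, 1, 0, 1
renumbered = pd.concat([one, two], ignore_index=True)  # index: 0, 1, 2, 3

print(kept)
print(renumbered)
```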
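Pandas provides the append method for concatenating DataFrames along axis=0. Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, where pd.concat takes its place.
df.append(self, other, ignore_index=False, verify_integrity=False, sort=None)
Adds new rows to a DataFrame. Columns of the appended data that are not in the DataFrame are added as new columns.
other: DataFrame, Series, dict, or list.
ignore_index: default False; if True, the index labels are not used.
verify_integrity: default False; if True, a ValueError is raised when the result would contain duplicate index labels.
sort: boolean, default None.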
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data1 = [['Alex', 24, 80], ['Bob', 25, 90], ['Bauer', 25, 90]]
    one = pd.DataFrame(data1, columns=['Name', 'Age', 'A'])
    data2 = [['Alex', 87, 78], ['Bob', 67, 87], ['Bauer', 98, 78]]
    two = pd.DataFrame(data2, columns=['Name', 'B', 'C'])
    print(one)
    print('==================================')
    print(two)
    print('==================================')
    df = one.append(two, sort=False)
    print(df)

# output:
#     Name  Age   A
# 0   Alex   24  80
# 1    Bob   25  90
# 2  Bauer   25  90
# ==================================
#     Name   B   C
# 0   Alex  87  78
# 1    Bob  67  87
# 2  Bauer  98  78
# ==================================
#     Name   Age     A     B     C
# 0   Alex  24.0  80.0   NaN   NaN
# 1    Bob  25.0  90.0   NaN   NaN
# 2  Bauer  25.0  90.0   NaN   NaN
# 0   Alex   NaN   NaN  87.0  78.0
# 1    Bob   NaN   NaN  67.0  87.0
# 2  Bauer   NaN   NaN  98.0  78.0
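The same row-wise combination can be written with pd.concat, which also works in pandas versions that no longer have DataFrame.append. This is a small sketch on made-up data, not a drop-in rewrite of the example above.

```python
import pandas as pd

one = pd.DataFrame({"Name": ["Alex"], "Age": [24], "A": [80]})
two = pd.DataFrame({"Name": ["Bob"], "B": [67], "C": [87]})

# Rows are stacked; columns missing from one input are filled with NaN
combined = pd.concat([one, two], sort=False)
print(combined)
```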
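Pandas provides the join method for combining DataFrames along axis=1; it merges the differently named columns of two DataFrames into a single DataFrame, aligning rows on the index.
df.join(self, other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
join offers SQL-style join operations; the default is a left outer join (how='left').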
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data1 = [['Alex', 24, 80], ['Bob', 25, 90], ['Bauer', 25, 90], ['Jack', 26, 80]]
    one = pd.DataFrame(data1, index=['a', 'b', 'c', 'd'], columns=['Name', 'Age', 'A'])
    data2 = [[87, 78], [67, 87], [98, 78]]
    two = pd.DataFrame(data2, index=['a', 'b', 'c'], columns=['B', 'C'])
    print(one)
    print('==================================')
    print(two)
    print('==================================')
    df = one.join(two)
    print(df)

# output:
#     Name  Age   A
# a   Alex   24  80
# b    Bob   25  90
# c  Bauer   25  90
# d   Jack   26  80
# ==================================
#     B   C
# a  87  78
# b  67  87
# c  98  78
# ==================================
#     Name  Age   A     B     C
# a   Alex   24  80  87.0  78.0
# b    Bob   25  90  67.0  87.0
# c  Bauer   25  90  98.0  78.0
# d   Jack   26  80   NaN   NaN
Iterating over a DataFrame directly yields its column names.
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20190101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    print(df)
    for col in df:
        print(col, end=' ')

# output:
#                    A         B         C         D
# 2019-01-01 -0.415754 -1.214340 -0.103952  1.232414
# 2019-01-02 -0.367888  0.257199 -1.615029 -0.335322
# 2019-01-03  0.552697  0.202993 -1.000219 -0.530897
# 2019-01-04  0.503410 -1.610091  1.660362  0.649700
# 2019-01-05  0.575416 -1.962578 -1.681379 -0.425239
# 2019-01-06  1.075917 -0.499081  1.886878 -0.073895
# A B C D
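df.iteritems() iterates over (key, value) pairs, where each key is a column label and each value is that column as a Series. Note that iteritems() was deprecated in pandas 1.5 and removed in 2.0; df.items() behaves the same way.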
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20190101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    print(df)
    for key, value in df.iteritems():
        print(key, value)

# output:
#                    A         B         C         D
# 2019-01-01 -0.302021  1.343811 -0.070351 -0.409479
# 2019-01-02 -0.365564  0.743572 -0.475075  1.026054
# 2019-01-03  0.025748  1.395340 -0.987686  0.141003
# 2019-01-04 -0.291348 -1.173600 -2.286905  0.528416
# 2019-01-05 -1.844523 -0.052567  0.575980  0.260001
# 2019-01-06  0.271046 -0.583334 -0.596251  0.772095
# A 2019-01-01   -0.302021
# 2019-01-02   -0.365564
# 2019-01-03    0.025748
# 2019-01-04   -0.291348
# 2019-01-05   -1.844523
# 2019-01-06    0.271046
# Freq: D, Name: A, dtype: float64
# B 2019-01-01    1.343811
# 2019-01-02    0.743572
# 2019-01-03    1.395340
# 2019-01-04   -1.173600
# 2019-01-05   -0.052567
# 2019-01-06   -0.583334
# Freq: D, Name: B, dtype: float64
# C 2019-01-01   -0.070351
# 2019-01-02   -0.475075
# 2019-01-03   -0.987686
# 2019-01-04   -2.286905
# 2019-01-05    0.575980
# 2019-01-06   -0.596251
# Freq: D, Name: C, dtype: float64
# D 2019-01-01   -0.409479
# 2019-01-02    1.026054
# 2019-01-03    0.141003
# 2019-01-04    0.528416
# 2019-01-05    0.260001
# 2019-01-06    0.772095
# Freq: D, Name: D, dtype: float64
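On pandas versions without iteritems(), the same column-by-column iteration can be written with df.items(). The sketch below uses made-up data and collects one number per column so the result is easy to check.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

column_sums = {}
for key, value in df.items():
    # key is the column label, value is the column as a Series
    column_sums[key] = int(value.sum())

print(column_sums)
```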
df.iterrows() returns an iterator that yields each index label together with a Series holding the data of that row.
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
    print(df)
    # Each iteration yields (index label, row as a Series)
    for index, value in df.iterrows():
        print(index, value)

# output:
#           A         B         C         D
# 0 -1.097851  0.785749 -1.727198 -1.120925
# 1 -1.420429  0.094384 -1.566202  0.237084
# 2 -0.761957  0.552395  0.680884 -0.290955
# 3  0.357713 -0.323331  1.438013 -1.334616
# 4  0.015467 -2.431556 -0.717285 -0.094409
# 5 -1.198224 -1.370170  0.201725  0.258093
# 0 A   -1.097851
# B    0.785749
# C   -1.727198
# D   -1.120925
# Name: 0, dtype: float64
# 1 A   -1.420429
# B    0.094384
# C   -1.566202
# D    0.237084
# Name: 1, dtype: float64
# 2 A   -0.761957
# B    0.552395
# C    0.680884
# D   -0.290955
# Name: 2, dtype: float64
# 3 A    0.357713
# B   -0.323331
# C    1.438013
# D   -1.334616
# Name: 3, dtype: float64
# 4 A    0.015467
# B   -2.431556
# C   -0.717285
# D   -0.094409
# Name: 4, dtype: float64
# 5 A   -1.198224
# B   -1.370170
# C    0.201725
# D    0.258093
# Name: 5, dtype: float64
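df.itertuples() returns an iterator that yields one named tuple per row of the DataFrame. The first element of each tuple is the row's index, and the remaining elements are the row's values.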
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
    print(df)
    # Each row is a namedtuple: Pandas(Index=..., A=..., B=..., C=..., D=...)
    for row in df.itertuples():
        print(row)

# output:
#           A         B         C         D
# 0  0.681324  1.047734 -1.909570 -0.845900
# 1 -0.879077 -0.897085 -0.795461 -0.634519
# 2  0.484502 -0.061608  0.605827 -0.321721
# 3 -0.051974  1.533112 -1.011544 -0.922280
# 4 -0.634157 -0.173692  1.228584 -1.229581
# 5  0.236769 -0.933609  0.111948  1.048215
# Pandas(Index=0, A=0.6813238552921729, B=1.0477343302788706, C=-1.909570436815022, D=-0.8459001766064564)
# Pandas(Index=1, A=-0.8790771200969485, B=-0.8970849190216943, C=-0.7954606477323869, D=-0.6345188867416923)
# Pandas(Index=2, A=0.48450157948338324, B=-0.061608014575315506, C=0.6058267522125123, D=-0.32172144100965605)
# Pandas(Index=3, A=-0.05197447447575398, B=1.5331115391025778, C=-1.0115444345763995, D=-0.9222798204619236)
# Pandas(Index=4, A=-0.6341570074338677, B=-0.173692444412635, C=1.2285839004083785, D=-1.2295807166909738)
# Pandas(Index=5, A=0.23676890089548117, B=-0.9336090868233837, C=0.11194794444517034, D=1.0482154173833818)
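Because the rows are named tuples, their fields can be read by attribute. The sketch below uses a small made-up DataFrame (the column names Name and Age are only for illustration) and also shows the index and name parameters of itertuples(), which switch to plain tuples without the index.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alex', 'Bob'], 'Age': [24, 25]})

# Fields of each named tuple are available as attributes
labels = []
for row in df.itertuples():
    labels.append(f"{row.Index}:{row.Name}:{row.Age}")
print(labels)

# index=False drops the index field; name=None yields plain tuples
plain = list(df.itertuples(index=False, name=None))
print(plain)
```

Plain tuples are a reasonable choice when only the values are needed, since no namedtuple class has to be built.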
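Iteration is meant for reading. The iterator generally returns a copy of the original data rather than a view, so changes made to the yielded objects while iterating are not reflected in the original object.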
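A minimal sketch of this behaviour, assuming a small made-up DataFrame with mixed column types (with mixed types, each row yielded by iterrows() is built as a fresh copy). Assigning to the row inside the loop leaves the DataFrame untouched, while assigning through df.loc changes it.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alex', 'Bob', 'Bauer'], 'Age': [24, 25, 25]})

# The row is a temporary Series; writing to it does not write back to df
for index, row in df.iterrows():
    row['Age'] = 0

ages_after_loop = df['Age'].tolist()
print(ages_after_loop)

# To really change a value, assign through the DataFrame itself
df.loc[df['Name'] == 'Bob', 'Age'] = 30
ages_after_loc = df['Age'].tolist()
print(ages_after_loc)
```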
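In SQL, SELECT takes a comma-separated list of columns (or * to select all columns). SELECT ID, Name FROM tablename LIMIT 5;
In Pandas, columns are selected by passing a list of column names to the DataFrame. df[['ID', 'Name']].head(5)
SELECT example: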
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data = [[1, 'Alex', 24, 80], [2, 'Bob', 25, 90], [3, 'Bauer', 25, 90]]
    df = pd.DataFrame(data, columns=['ID', 'Name', 'Age', 'A'])
    print(df)
    # Equivalent of: SELECT ID, Name FROM tablename LIMIT 5;
    print(df[['ID', 'Name']].head(5))

# output:
#    ID   Name  Age   A
# 0   1   Alex   24  80
# 1   2    Bob   25  90
# 2   3  Bauer   25  90
#    ID   Name
# 0   1   Alex
# 1   2    Bob
# 2   3  Bauer
In SQL, conditional filtering is done with WHERE. SELECT * FROM tablename WHERE Name = 'Bauer' LIMIT 5;
In Pandas, filtering is usually done with boolean indexing.
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data = [[1, 'Alex', 24, 80], [2, 'Bob', 25, 90], [3, 'Bauer', 25, 90]]
    df = pd.DataFrame(data, columns=['ID', 'Name', 'Age', 'A'])
    print(df)
    print('===========================')
    # Equivalent of: SELECT * FROM tablename WHERE Name = 'Bauer' LIMIT 5;
    print(df[df['Name'] == 'Bauer'].head(5))

# output:
#    ID   Name  Age   A
# 0   1   Alex   24  80
# 1   2    Bob   25  90
# 2   3  Bauer   25  90
# ===========================
#    ID   Name  Age   A
# 2   3  Bauer   25  90
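Boolean masks can also be combined, which roughly corresponds to AND and OR in a WHERE clause. The following sketch reuses the same sample data; the SQL statements in the comments are illustrative equivalents, not part of the original example. Each condition needs its own parentheses because & and | bind more tightly than comparisons.

```python
import pandas as pd

data = [[1, 'Alex', 24, 80], [2, 'Bob', 25, 90], [3, 'Bauer', 25, 90]]
df = pd.DataFrame(data, columns=['ID', 'Name', 'Age', 'A'])

# SELECT * FROM tablename WHERE Age = 25 AND Name != 'Bob';
both = df[(df['Age'] == 25) & (df['Name'] != 'Bob')]
print(both)

# SELECT * FROM tablename WHERE Age < 25 OR Name = 'Bauer';
either = df[(df['Age'] < 25) | (df['Name'] == 'Bauer')]
print(either)
```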
(1)sum
Returns the sum of the values along the requested axis. By default, the axis is the index (axis=0), so each column is summed.
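import pandas as pd

if __name__ == "__main__":
    data = [['Alex', 25, 80], ['Bob', 26, 90], ['Bauer', 24, 87]]
    df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])
    print(df)
    # Column-wise sum: strings in Name are concatenated
    print(df.sum())
    # Row-wise sum: numeric_only=True skips the string column Name,
    # which recent pandas versions no longer drop automatically
    print(df.sum(axis=1, numeric_only=True))

# output:
#     Name  Age  Score
# 0   Alex   25     80
# 1    Bob   26     90
# 2  Bauer   24     87
# Name     AlexBobBauer
# Age                75
# Score             257
# dtype: object
# 0    105
# 1    116
# 2    111
# dtype: int64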
(2)mean
Returns the mean.
import pandas as pd

if __name__ == "__main__":
    data = [['Alex', 25, 80], ['Bob', 26, 90], ['Bauer', 24, 87]]
    df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])
    print(df)
    # numeric_only=True restricts the mean to Age and Score;
    # recent pandas versions raise an error on the string column otherwise
    print(df.mean(numeric_only=True))

# output:
#     Name  Age  Score
# 0   Alex   25     80
# 1    Bob   26     90
# 2  Bauer   24     87
# Age      25.000000
# Score    85.666667
# dtype: float64
(3)std
Returns the Bessel-corrected (sample) standard deviation of the numeric columns.
import pandas as pd

if __name__ == "__main__":
    data = [['Alex', 25, 80], ['Bob', 26, 90], ['Bauer', 24, 87]]
    df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])
    print(df)
    # numeric_only=True restricts the calculation to Age and Score
    print(df.std(numeric_only=True))

# output:
#     Name  Age  Score
# 0   Alex   25     80
# 1    Bob   26     90
# 2  Bauer   24     87
# Age      1.000000
# Score    5.131601
# dtype: float64
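Bessel's correction means dividing by n - 1 instead of n. As a small sketch, the ddof parameter of std() controls this divisor: the default ddof=1 gives the sample standard deviation shown above, and ddof=0 gives the population standard deviation. The Age values are taken from the example; the variable names are illustrative.

```python
import pandas as pd

age = pd.Series([25, 26, 24], name='Age')

# Default ddof=1: divide the squared deviations by n - 1 (Bessel's correction)
sample_std = age.std()
# ddof=0: divide by n instead
population_std = age.std(ddof=0)

print(sample_std, population_std)
```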
(4)median
Returns the median of the values.
import pandas as pd

if __name__ == "__main__":
    data = [['Alex', 25, 80], ['Bob', 26, 90], ['Bauer', 24, 87]]
    df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])
    print(df)
    # numeric_only=True restricts the calculation to Age and Score
    print(df.median(numeric_only=True))

# output:
#     Name  Age  Score
# 0   Alex   25     80
# 1    Bob   26     90
# 2  Bauer   24     87
# Age      25.0
# Score    87.0
# dtype: float64
(5)min
Returns the minimum of the values. For string columns, the comparison is lexicographic.
import pandas as pd

if __name__ == "__main__":
    data = [['Alex', 25, 80], ['Bob', 26, 90], ['Bauer', 24, 87]]
    df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])
    print(df)
    print(df.min())

# output:
#     Name  Age  Score
# 0   Alex   25     80
# 1    Bob   26     90
# 2  Bauer   24     87
# Name     Alex
# Age        24
# Score      80
# dtype: object
(6)max
Returns the maximum of the values. For string columns, the comparison is lexicographic.
import pandas as pd

if __name__ == "__main__":
    data = [['Alex', 25, 80], ['Bob', 26, 90], ['Bauer', 24, 87]]
    df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])
    print(df)
    print(df.max())

# output:
#     Name  Age  Score
# 0   Alex   25     80
# 1    Bob   26     90
# 2  Bauer   24     87
# Name     Bob
# Age       26
# Score     90
# dtype: object
(7)describe
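Produces a summary of statistics for the columns of the DataFrame. def describe(self, percentiles=None, include=None, exclude=None)
The include parameter specifies which columns are taken into account in the summary. It accepts a list of values; by default, only numeric columns are summarized.
object - summarize string columns
number - summarize numeric columns
all - summarize all columns together (should not be passed as a list value)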
import pandas as pd

if __name__ == "__main__":
    data = [['Alex', 25, 80], ['Bob', 26, 90], ['Bauer', 24, 87]]
    df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])
    print(df)
    # include="all" summarizes string and numeric columns together
    print(df.describe(include="all"))

# output:
#     Name  Age  Score
# 0   Alex   25     80
# 1    Bob   26     90
# 2  Bauer   24     87
#         Name   Age      Score
# count      3   3.0   3.000000
# unique     3   NaN        NaN
# top     Alex   NaN        NaN
# freq       1   NaN        NaN
# mean     NaN  25.0  85.666667
# std      NaN   1.0   5.131601
# min      NaN  24.0  80.000000
# 25%      NaN  24.5  83.500000
# 50%      NaN  25.0  87.000000
# 75%      NaN  25.5  88.500000
# max      NaN  26.0  90.000000
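To make the effect of include more concrete, the sketch below compares the default call with include=['number'] and include='all' on the same sample data. The variable names are illustrative.

```python
import pandas as pd

data = [['Alex', 25, 80], ['Bob', 26, 90], ['Bauer', 24, 87]]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])

# With mixed columns, the default summarizes only the numeric ones
default_summary = df.describe()
# Passing 'number' in a list selects the numeric columns explicitly
numeric_summary = df.describe(include=['number'])
# 'all' is passed as a plain string, not inside a list
full_summary = df.describe(include='all')

print(numeric_summary)
print(list(full_summary.columns))
```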
abs: absolute value of each element
prod: product of the values
cumsum: cumulative sum
cumprod: cumulative product
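These four functions can be tried on a small Series. The values below are made up for illustration.

```python
import pandas as pd

s = pd.Series([1, -2, 3, -4])

absolute = s.abs()        # element-wise absolute value
product = s.prod()        # product of all values
running_sum = s.cumsum()  # cumulative sum, same length as s
running_prod = s.cumprod()  # cumulative product, same length as s

print(absolute.tolist())
print(product)
print(running_sum.tolist())
print(running_prod.tolist())
```

abs, cumsum and cumprod return an object of the same shape as the input, while prod reduces it to a single value per column (or one scalar for a Series).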
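Series and DataFrame (and Panel, in older pandas versions that still provide it) all have a pct_change() function, which compares each element with the previous one and computes the fractional change (multiply by 100 to get a percentage). By default, pct_change() works down the columns; to apply it across each row instead, pass axis=1.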
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(4, 3),
                      index=['rank2', 'rank1', 'rank4', 'rank3'],
                      columns=['col3', 'col2', 'col1'])
    print(df)
    # Change of each row relative to the previous row; the first row is NaN
    print(df.pct_change())

# output:
#            col3      col2      col1
# rank2  0.988739  2.062798  1.400892
# rank1  0.394663 -0.988307  1.583098
# rank4 -0.768109 -0.163727 -1.801323
# rank3  0.999816 -1.224068  1.470020
#            col3      col2      col1
# rank2       NaN       NaN       NaN
# rank1 -0.600842 -1.479110  0.130064
# rank4 -2.946241 -0.834336 -2.137846
# rank3 -2.301659  6.476294 -1.816078
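The difference between the default direction and axis=1 is easier to see with fixed numbers. The sketch below uses made-up quarterly values; the column names q1 to q3 are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'q1': [100.0, 200.0],
                   'q2': [110.0, 150.0],
                   'q3': [121.0, 300.0]})

# Default: each value is compared with the value in the previous row
by_column = df.pct_change()
# axis=1: each value is compared with the value in the previous column
by_row = df.pct_change(axis=1)

print(by_column)
print(by_row)
```

With axis=1 the first column is all NaN, since it has no previous column to compare against.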
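Covariance applies to Series data. A Series has a cov method that computes the covariance between two Series, with NA values excluded automatically. When applied to a DataFrame, cov computes the pairwise covariance between all columns.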
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(3, 5), columns=['a', 'b', 'c', 'd', 'e'])
    print(df)
    # Covariance between two columns
    print(df['a'].cov(df['b']))
    # Pairwise covariance matrix of all columns
    print(df.cov())

# output:
#           a         b         c         d         e
# 0  1.168443 -0.343905  2.254448  0.269765 -0.928009
# 1  0.542551 -1.303205 -1.767313 -0.349884 -0.352578
# 2 -2.028410 -1.176339  0.156047  1.426468 -1.338805
# 0.48923631972868176
#           a         b         c         d         e
# a  2.870241  0.489236  0.713430 -1.312818  0.581441
# b  0.489236  0.271550  0.974811 -0.023849 -0.055862
# c  0.713430  0.974811  4.046193  0.580236 -0.558184
# d -1.312818 -0.023849  0.580236  0.812892 -0.430603
# e  0.581441 -0.055862 -0.558184 -0.430603  0.245420
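Correlation measures the relationship between any two numeric arrays (Series). Several methods are available: pearson (the default, which measures linear association), spearman, and kendall. In older pandas versions, any non-numeric columns in the DataFrame are excluded automatically; from pandas 2.0 on, pass numeric_only=True to get that behaviour.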
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(3, 5), columns=['a', 'b', 'c', 'd', 'e'])
    print(df)
    # Pearson correlation between two columns
    print(df['a'].corr(df['b']))
    # Pairwise correlation matrix of all columns
    print(df.corr())

# output:
#           a         b         c         d         e
# 0 -2.110756  0.693665  0.405701 -0.628349 -1.062029
# 1 -1.331364  1.283434  1.619166 -0.025866  1.742287
# 2 -1.159944  0.435840 -0.251710 -0.347102 -0.026825
# 0.052396578025987336
#           a         b         c         d         e
# a  1.000000  0.052397 -0.000006  0.743940  0.664845
# b  0.052397  1.000000  0.998626  0.706309  0.780790
# c -0.000006  0.998626  1.000000  0.668242  0.746977
# d  0.743940  0.706309  0.668242  1.000000  0.993772
# e  0.664845  0.780790  0.746977  0.993772  1.000000
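The choice of method matters when the relationship is monotonic but not linear. In the sketch below, column b is the cube of column a (made-up data): the Pearson coefficient stays below 1, while the rank-based Spearman coefficient is 1 because the ordering of the two columns is identical.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5]})
df['b'] = df['a'] ** 3   # monotonic, but not linear, in a

# method is passed to DataFrame.corr; 'pearson' is the default
pearson = df.corr(method='pearson').loc['a', 'b']
spearman = df.corr(method='spearman').loc['a', 'b']

print(pearson, spearman)
```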
Ranking assigns a rank to each element of an array. In the case of ties, the tied elements receive the average of their ranks.
# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    s = pd.Series(np.random.randn(5), index=list('abcde'))
    print(s)
    # Create a tie so that 'a' and 'c' share the average rank 2.5
    s['a'] = s['c']
    print(s.rank())

# output:
# a    1.597684
# b    1.107413
# c   -0.298296
# d   -0.281076
# e   -0.667954
# dtype: float64
# a    2.5
# b    5.0
# c    2.5
# d    4.0
# e    1.0
# dtype: float64
rank takes an ascending parameter that defaults to True. When it is False, the ranking is reversed, so larger values receive smaller ranks.
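The effect of ascending can be sketched on a small Series that contains a tie. The values are made up for illustration.

```python
import pandas as pd

s = pd.Series([3.0, 1.0, 3.0, 2.0], index=list('abcd'))

# Default ascending=True: the smallest value gets rank 1,
# and the tied values 'a' and 'c' share the average of ranks 3 and 4
ascending_rank = s.rank()
# ascending=False: the largest values get the smallest ranks
descending_rank = s.rank(ascending=False)

print(ascending_rank.tolist())
print(descending_rank.tolist())
```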