pandas主要有兩種數據對象php
注: 後面代碼使用pandas版本0.20.1,經過import pandas as pd引入css
Series是一種帶有索引的序列對象html
簡單建立以下git
# 經過傳入一個序列給pd.Series初始化一個Series對象, 好比list s1 = pd.Series(list("1234")) print(s1) 0 1 1 2 2 3 3 4 dtype: object
相似與數據庫table有行列的數據對象github
# 經過傳入一個numpy的二維數組或者dict對象給pd.DataFrame初始化一個DataFrame對象 # 經過numpy二維數組 import numpy as np df1 = pd.DataFrame(np.random.randn(6,4)) print(df1) 0 1 2 3 0 -0.646340 -1.249943 0.393323 -1.561873 1 0.371630 0.069426 1.693097 0.907419 2 -0.328575 -0.256765 0.693798 -0.787343 3 1.875764 -0.416275 -1.028718 0.158259 4 1.644791 -1.321506 -0.337425 0.820689 5 0.006391 -1.447894 0.506203 0.977295 # 經過dict字典 df2 = pd.DataFrame({ 'A' : 1., 'B' : pd.Timestamp('20130102'), 'C' : pd.Series(1,index=list(range(4)),dtype='float32'), 'D' : np.array([3] * 4,dtype='int32'), 'E' : pd.Categorical(["test","train","test","train"]), 'F' : 'foo' }) print(df2) A B C D E F 0 1.0 2013-01-02 1.0 3 test foo 1 1.0 2013-01-02 1.0 3 train foo 2 1.0 2013-01-02 1.0 3 test foo 3 1.0 2013-01-02 1.0 3 train foo
不論是Series對象仍是DataFrame對象都有一個對對象相對應的索引, Series的索引相似於每一個元素, DataFrame的索引對應着每一行web
在建立對象的時候,每一個對象都會初始化一個起始值爲0,自增的索引列表, DataFrame同理正則表達式
# 打印對象的時候,第一列就是索引 print(s1) 0 1 1 2 2 3 3 4 dtype: object # 或者只查看索引, DataFrame同理 print(s1.index)
這裏的增刪查改主要基於DataFrame對象數據庫
爲了有足夠數據用於展現,這裏選擇tushare的數據json
tushare安裝數組
pip install tushare
建立數據對象以下
import tushare as ts df = ts.get_k_data("000001")
DataFrame 行列,axis 圖解
查看每列的數據類型
# 查看df數據類型 df.dtypes date object open float64 close float64 high float64 low float64 volume float64 code object dtype: object
查看指定指定數量的行
head函數默認查看前5行,tail函數默認查看後5行,能夠傳遞指定的數值用於查看指定行數
查看前5行 df.head() date open close high low volume code 0 2015-12-23 9.927 9.935 10.174 9.871 1039018.0 000001 1 2015-12-24 9.919 9.823 9.998 9.744 640229.0 000001 2 2015-12-25 9.855 9.879 9.927 9.815 399845.0 000001 3 2015-12-28 9.895 9.537 9.919 9.537 822408.0 000001 4 2015-12-29 9.545 9.624 9.632 9.529 619802.0 000001 # 查看後5行 df.tail() date open close high low volume code 636 2018-08-01 9.42 9.15 9.50 9.11 814081.0 000001 637 2018-08-02 9.13 8.94 9.15 8.88 931401.0 000001 638 2018-08-03 8.93 8.91 9.10 8.91 476546.0 000001 639 2018-08-06 8.94 8.94 9.11 8.89 554010.0 000001 640 2018-08-07 8.96 9.17 9.17 8.88 690423.0 000001 # 查看前10行 df.head(10) date open close high low volume code 0 2015-12-23 9.927 9.935 10.174 9.871 1039018.0 000001 1 2015-12-24 9.919 9.823 9.998 9.744 640229.0 000001 2 2015-12-25 9.855 9.879 9.927 9.815 399845.0 000001 3 2015-12-28 9.895 9.537 9.919 9.537 822408.0 000001 4 2015-12-29 9.545 9.624 9.632 9.529 619802.0 000001 5 2015-12-30 9.624 9.632 9.640 9.513 532667.0 000001 6 2015-12-31 9.632 9.545 9.656 9.537 491258.0 000001 7 2016-01-04 9.553 8.995 9.577 8.940 563497.0 000001 8 2016-01-05 8.972 9.075 9.210 8.876 663269.0 000001 9 2016-01-06 9.091 9.179 9.202 9.067 515706.0 000001
查看某一行或多行,某一列或多列
# 查看第一行 df[0:1] date open close high low volume code 0 2015-12-23 9.927 9.935 10.174 9.871 1039018.0 000001 # 查看 10到20行 df[10:21] date open close high low volume code 10 2016-01-07 9.083 8.709 9.083 8.685 174761.0 000001 11 2016-01-08 8.924 8.852 8.987 8.677 747527.0 000001 12 2016-01-11 8.757 8.566 8.820 8.502 732013.0 000001 13 2016-01-12 8.621 8.605 8.685 8.470 561642.0 000001 14 2016-01-13 8.669 8.526 8.709 8.518 391709.0 000001 15 2016-01-14 8.430 8.574 8.597 8.343 666314.0 000001 16 2016-01-15 8.486 8.327 8.597 8.295 448202.0 000001 17 2016-01-18 8.231 8.287 8.406 8.199 421040.0 000001 18 2016-01-19 8.319 8.526 8.582 8.287 501109.0 000001 19 2016-01-20 8.518 8.390 8.597 8.311 603752.0 000001 20 2016-01-21 8.343 8.215 8.558 8.215 606145.0 000001 # 查看看Date列前5個數據 df["date"].head() # 或者df.date.head() 0 2015-12-23 1 2015-12-24 2 2015-12-25 3 2015-12-28 4 2015-12-29 Name: date, dtype: object # 查看看Date列,code列, open列前5個數據 df[["date","code", "open"]].head() date code open 0 2015-12-23 000001 9.927 1 2015-12-24 000001 9.919 2 2015-12-25 000001 9.855 3 2015-12-28 000001 9.895 4 2015-12-29 000001 9.545
使用行列組合條件查詢
# 查看date, code列的第10行 df.loc[10, ["date", "code"]] date 2016-01-07 code 000001 Name: 10, dtype: object # 查看date, code列的第10行到20行 df.loc[10:20, ["date", "code"]] date code 10 2016-01-07 000001 11 2016-01-08 000001 12 2016-01-11 000001 13 2016-01-12 000001 14 2016-01-13 000001 15 2016-01-14 000001 16 2016-01-15 000001 17 2016-01-18 000001 18 2016-01-19 000001 19 2016-01-20 000001 20 2016-01-21 000001 # 查看第一行,open列的數據 df.loc[0, "open"] 9.9269999999999996
經過==位置==查詢
值得注意的是上面的索引值就是特定的位置
# 查看第1行() df.iloc[0] date 2015-12-24 open 9.919 close 9.823 high 9.998 low 9.744 volume 640229 code 000001 Name: 0, dtype: object # 查看最後一行 df.iloc[-1] date 2018-08-08 open 9.16 close 9.12 high 9.16 low 9.1 volume 29985 code 000001 Name: 640, dtype: object # 查看第一列,前5個數值 df.iloc[:,0].head() 0 2015-12-24 1 2015-12-25 2 2015-12-28 3 2015-12-29 4 2015-12-30 Name: date, dtype: object # 查看前2到4行,第1,3列 df.iloc[2:4,[0,2]] date close 2 2015-12-28 9.537 3 2015-12-29 9.624
經過條件篩選
查看open列大於10的前5行 df[df.open > 10].head() date open close high low volume code 378 2017-07-14 10.483 10.570 10.609 10.337 1722570.0 000001 379 2017-07-17 10.619 10.483 10.987 10.396 3273123.0 000001 380 2017-07-18 10.425 10.716 10.803 10.299 2349431.0 000001 381 2017-07-19 10.657 10.754 10.851 10.551 1933075.0 000001 382 2017-07-20 10.745 10.638 10.880 10.580 1537338.0 000001 # 查看open列大於10且open列小於10.6的前五行 df[(df.open > 10) & (df.open < 10.6)].head() date open close high low volume code 378 2017-07-14 10.483 10.570 10.609 10.337 1722570.0 000001 380 2017-07-18 10.425 10.716 10.803 10.299 2349431.0 000001 387 2017-07-27 10.550 10.422 10.599 10.363 1194490.0 000001 388 2017-07-28 10.441 10.569 10.638 10.412 819195.0 000001 390 2017-08-01 10.471 10.865 10.904 10.432 2035709.0 000001 # 查看open列大於10或open列小於10.6的前五行 df[(df.open > 10) | (df.open < 10.6)].head() date open close high low volume code 0 2015-12-24 9.919 9.823 9.998 9.744 640229.0 000001 1 2015-12-25 9.855 9.879 9.927 9.815 399845.0 000001 2 2015-12-28 9.895 9.537 9.919 9.537 822408.0 000001 3 2015-12-29 9.545 9.624 9.632 9.529 619802.0 000001 4 2015-12-30 9.624 9.632 9.640 9.513 532667.0 000001
在前面已經簡單的說明Series, DataFrame的建立,這裏說一些經常使用有用的建立方式
# 建立2018-08-08到2018-08-15的時間序列,默認時間間隔爲Day s2 = pd.date_range("20180808", periods=7) print(s2) DatetimeIndex(['2018-08-08', '2018-08-09', '2018-08-10', '2018-08-11', '2018-08-12', '2018-08-13', '2018-08-14'], dtype='datetime64[ns]', freq='D') # 指定2018-08-08 00:00 到2018-08-09 00:00 時間間隔爲小時 # freq參數可以使用參數, 參考: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases s3 = pd.date_range("20180808", "20180809", freq="H") print(s2) DatetimeIndex(['2018-08-08 00:00:00', '2018-08-08 01:00:00', '2018-08-08 02:00:00', '2018-08-08 03:00:00', '2018-08-08 04:00:00', '2018-08-08 05:00:00', '2018-08-08 06:00:00', '2018-08-08 07:00:00', '2018-08-08 08:00:00', '2018-08-08 09:00:00', '2018-08-08 10:00:00', '2018-08-08 11:00:00', '2018-08-08 12:00:00', '2018-08-08 13:00:00', '2018-08-08 14:00:00', '2018-08-08 15:00:00', '2018-08-08 16:00:00', '2018-08-08 17:00:00', '2018-08-08 18:00:00', '2018-08-08 19:00:00', '2018-08-08 20:00:00', '2018-08-08 21:00:00', '2018-08-08 22:00:00', '2018-08-08 23:00:00', '2018-08-09 00:00:00'], dtype='datetime64[ns]', freq='H') # 經過已有序列建立時間序列 s4 = pd.to_datetime(df.date.head()) print(s4) 0 2015-12-24 1 2015-12-25 2 2015-12-28 3 2015-12-29 4 2015-12-30 Name: date, dtype: datetime64[ns]
# 將df 的索引修改成date列的數據,而且將類型轉換爲datetime類型 df.index = pd.to_datetime(df.date) df.head() date open close high low volume code date 2015-12-24 2015-12-24 9.919 9.823 9.998 9.744 640229.0 000001 2015-12-25 2015-12-25 9.855 9.879 9.927 9.815 399845.0 000001 2015-12-28 2015-12-28 9.895 9.537 9.919 9.537 822408.0 000001 2015-12-29 2015-12-29 9.545 9.624 9.632 9.529 619802.0 000001 2015-12-30 2015-12-30 9.624 9.632 9.640 9.513 532667.0 000001 # 修改列的字段 df.columns = ["Date", "Open","Close","High","Low","Volume","Code"] print(df.head()) Date Open Close High Low Volume Code date 2015-12-24 2015-12-24 9.919 9.823 9.998 9.744 640229.0 000001 2015-12-25 2015-12-25 9.855 9.879 9.927 9.815 399845.0 000001 2015-12-28 2015-12-28 9.895 9.537 9.919 9.537 822408.0 000001 2015-12-29 2015-12-29 9.545 9.624 9.632 9.529 619802.0 000001 2015-12-30 2015-12-30 9.624 9.632 9.640 9.513 532667.0 000001 # 將Open列每一個數值加1, apply方法並不直接修改源數據,因此須要將新值複製給df df.Open = df.Open.apply(lambda x: x+1) df.head() Date Open Close High Low Volume Code date 2015-12-24 2015-12-24 10.919 9.823 9.998 9.744 640229.0 000001 2015-12-25 2015-12-25 10.855 9.879 9.927 9.815 399845.0 000001 2015-12-28 2015-12-28 10.895 9.537 9.919 9.537 822408.0 000001 2015-12-29 2015-12-29 10.545 9.624 9.632 9.529 619802.0 000001 2015-12-30 2015-12-30 10.624 9.632 9.640 9.513 532667.0 000001 # 將Open,Close列都數值上加1,若是多列,apply接收的對象是整個列 df[["Open", "Close"]].head().apply(lambda x: x.apply(lambda x: x+1)) Open Close date 2015-12-24 11.919 10.823 2015-12-25 11.855 10.879 2015-12-28 11.895 10.537 2015-12-29 11.545 10.624 2015-12-30 11.624 10.632
經過drop方法drop指定的行或者列
注意: drop方法並不直接修改源數據,若是須要使源dataframe對象被修改,須要傳入inplace=True
經過以前的axis圖解,知道行的值(或者說label)在axis=0,列的值(或者說label)在axis=1
# 刪除指定列,刪除Open列 df.drop("Open", axis=1).head() #或者df.drop(df.columns[1]) Date Close High Low Volume Code date 2015-12-24 2015-12-24 9.823 9.998 9.744 640229.0 000001 2015-12-25 2015-12-25 9.879 9.927 9.815 399845.0 000001 2015-12-28 2015-12-28 9.537 9.919 9.537 822408.0 000001 2015-12-29 2015-12-29 9.624 9.632 9.529 619802.0 000001 2015-12-30 2015-12-30 9.632 9.640 9.513 532667.0 000001 # 刪除第1,3列. 即Open,High列 df.drop(df.columns[[1,3]], axis=1).head() # 或df.drop(["Open", "High], axis=1).head() Date Close Low Volume Code date 2015-12-24 2015-12-24 9.823 9.744 640229.0 000001 2015-12-25 2015-12-25 9.879 9.815 399845.0 000001 2015-12-28 2015-12-28 9.537 9.537 822408.0 000001 2015-12-29 2015-12-29 9.624 9.529 619802.0 000001 2015-12-30 2015-12-30 9.632 9.513 532667.0 000001
當數值很大的時候pandas默認會使用科學計數法
# float數據類型以{:.4f}格式顯示,即顯示完整數據且保留後四位 pd.options.display.float_format = '{:.4f}'.format
# descibe方法會計算每列數據對象是數值的count, mean, std, min, max, 以及必定比率的值 df.describe() Open Close High Low Volume count 641.0000 641.0000 641.0000 641.0000 641.0000 mean 10.7862 9.7927 9.8942 9.6863 833968.6162 std 1.5962 1.6021 1.6620 1.5424 607731.6934 min 8.6580 7.6100 7.7770 7.4990 153901.0000 25% 9.7080 8.7180 8.7760 8.6500 418387.0000 50% 10.0770 9.0960 9.1450 8.9990 627656.0000 75% 11.8550 10.8350 10.9920 10.7270 1039297.0000 max 15.9090 14.8600 14.9980 14.4470 4262825.0000 # 單獨統計Open列的平均值 df.Open.mean() 10.786248049922001 # 查看居於95%的值, 默認線性擬合 df.Open.quantile(0.95) 14.187 # 查看Open列每一個值出現的次數 df.Open.value_counts().head() 9.8050 12 9.8630 10 9.8440 10 9.8730 10 9.8830 8 Name: Open, dtype: int64
刪除或者填充缺失值
# 刪除含有NaN的任意行 df.dropna(how='any') # 刪除含有NaN的任意列 df.dropna(how='any', axis=1) # 將NaN的值改成5 df.fillna(value=5)
按行或者列排序, 默認也不修改源數據
# 按列排序 df.sort_index(axis=1).head() Close Code Date High Low Open Volume date 2015-12-24 9.8230 000001 2015-12-24 9.9980 9.7440 10.9190 640229.0000 2015-12-25 1.0000 000001 2015-12-25 1.0000 9.8150 10.8550 399845.0000 2015-12-28 1.0000 000001 2015-12-28 1.0000 9.5370 10.8950 822408.0000 2015-12-29 9.6240 000001 2015-12-29 9.6320 9.5290 10.5450 619802.0000 2015-12-30 9.6320 000001 2015-12-30 9.6400 9.5130 10.6240 532667.0000 # 按行排序,不遞增 df.sort_index(ascending=False).head() Date Open Close High Low Volume Code date 2018-08-08 2018-08-08 10.1600 9.1100 9.1600 9.0900 153901.0000 000001 2018-08-07 2018-08-07 9.9600 9.1700 9.1700 8.8800 690423.0000 000001 2018-08-06 2018-08-06 9.9400 8.9400 9.1100 8.8900 554010.0000 000001 2018-08-03 2018-08-03 9.9300 8.9100 9.1000 8.9100 476546.0000 000001 2018-08-02 2018-08-02 10.1300 8.9400 9.1500 8.8800 931401.0000 000001
安裝某一列的值排序
# 按照Open列的值從小到大排序 df.sort_values(by="Open") Date Open Close High Low Volume Code date 2016-03-01 2016-03-01 8.6580 7.7220 7.7770 7.6260 377910.0000 000001 2016-02-15 2016-02-15 8.6900 7.7930 7.8410 7.6820 278499.0000 000001 2016-01-29 2016-01-29 8.7540 7.9610 8.0240 7.7140 544435.0000 000001 2016-03-02 2016-03-02 8.7620 8.0400 8.0640 7.7380 676613.0000 000001 2016-02-26 2016-02-26 8.7770 7.7930 7.8250 7.6900 392154.0000 000001
concat, 按照行方向或者列方向合併
# 分別取0到2行,2到4行,4到9行組成一個列表,經過concat方法按照axis=0,行方向合併, axis參數不指定,默認爲0 split_rows = [df.iloc[0:2,:],df.iloc[2:4,:], df.iloc[4:9]] pd.concat(split_rows) Date Open Close High Low Volume Code date 2015-12-24 2015-12-24 10.9190 9.8230 9.9980 9.7440 640229.0000 000001 2015-12-25 2015-12-25 10.8550 1.0000 1.0000 9.8150 399845.0000 000001 2015-12-28 2015-12-28 10.8950 1.0000 1.0000 9.5370 822408.0000 000001 2015-12-29 2015-12-29 10.5450 9.6240 9.6320 9.5290 619802.0000 000001 2015-12-30 2015-12-30 10.6240 9.6320 9.6400 9.5130 532667.0000 000001 2015-12-31 2015-12-31 10.6320 9.5450 9.6560 9.5370 491258.0000 000001 2016-01-04 2016-01-04 10.5530 8.9950 9.5770 8.9400 563497.0000 000001 2016-01-05 2016-01-05 9.9720 9.0750 9.2100 8.8760 663269.0000 000001 2016-01-06 2016-01-06 10.0910 9.1790 9.2020 9.0670 515706.0000 000001 # 分別取2到3列,3到5列,5列及之後列數組成一個列表,經過concat方法按照axis=1,列方向合併 split_columns = [df.iloc[:,1:2], df.iloc[:,2:4], df.iloc[:,4:]] pd.concat(split_columns, axis=1).head() Open Close High Low Volume Code date 2015-12-24 10.9190 9.8230 9.9980 9.7440 640229.0000 000001 2015-12-25 10.8550 1.0000 1.0000 9.8150 399845.0000 000001 2015-12-28 10.8950 1.0000 1.0000 9.5370 822408.0000 000001 2015-12-29 10.5450 9.6240 9.6320 9.5290 619802.0000 000001 2015-12-30 10.6240 9.6320 9.6400 9.5130 532667.0000 000001
追加行, 相應的還有insert, 插入插入到指定位置
# 將第一行追加到最後一行 df.append(df.iloc[0,:], ignore_index=True).tail() Date Open Close High Low Volume Code 637 2018-08-03 9.9300 8.9100 9.1000 8.9100 476546.0000 000001 638 2018-08-06 9.9400 8.9400 9.1100 8.8900 554010.0000 000001 639 2018-08-07 9.9600 9.1700 9.1700 8.8800 690423.0000 000001 640 2018-08-08 10.1600 9.1100 9.1600 9.0900 153901.0000 000001 641 2015-12-24 10.9190 9.8230 9.9980 9.7440 640229.0000 000001
因爲dataframe是引用對象,因此須要顯示調用copy方法用以複製整個dataframe對象
pandas的繪圖是使用matplotlib,若是想要畫的更細緻, 可使用matplotplib,不過簡單的畫一些圖仍是不錯的
由於上圖太麻煩,這裏就不配圖了,能夠在資源文件裏面查看pandas-blog.ipynb文件或者本身敲一遍代碼。
# 這裏使用notbook,爲了直接在輸出中顯示,須要如下配置 %matplotlib inline # 繪製Open,Low,Close.High的線性圖 df[["Open", "Low", "High", "Close"]].plot() # 繪製面積圖 df[["Open", "Low", "High", "Close"]].plot(kind="area")
讀寫常見文件格式,如csv,excel,json等, 甚至是讀取==系統的剪切板==.這個功能有時候頗有用。直接將鼠標選中複製的內容讀取建立dataframe對象。
# 將df數據保存到當前工做目錄的stock.csv文件 df.to_csv("stock.csv") # 查看stock.csv文件前5行 with open("stock.csv") as rf: print(rf.readlines()[:5]) ['date,Date,Open,Close,High,Low,Volume,Code\n', '2015-12-24,2015-12-24,9.919,9.823,9.998,9.744,640229.0,000001\n', '2015-12-25,2015-12-25,9.855,9.879,9.927,9.815,399845.0,000001\n', '2015-12-28,2015-12-28,9.895,9.537,9.919,9.537,822408.0,000001\n', '2015-12-29,2015-12-29,9.545,9.624,9.632,9.529,619802.0,000001\n'] # 讀取stock.csv文件並將第一行做爲index df2 = pd.read_csv("stock.csv", index_col=0) df2.head() Date Open Close High Low Volume Code date 2015-12-24 2015-12-24 9.9190 9.8230 9.9980 9.7440 640229.0000 1 2015-12-25 2015-12-25 9.8550 9.8790 9.9270 9.8150 399845.0000 1 2015-12-28 2015-12-28 9.8950 9.5370 9.9190 9.5370 822408.0000 1 2015-12-29 2015-12-29 9.5450 9.6240 9.6320 9.5290 619802.0000 1 2015-12-30 2015-12-30 9.6240 9.6320 9.6400 9.5130 532667.0000 1 # 讀取stock.csv文件並將第一行做爲index,而且將000001做爲str類型讀取, 否則會被解析成整數 df2 = pd.read_csv("stock.csv", index_col=0, dtype={"Code": str}) df2.head()
這裏以處理web日誌爲例,也許不太實用 ,由於ELK處理這些綽綽有餘,不過喜歡什麼本身來也何嘗不可
日誌文件: https://raw.githubusercontent.com/Apache-Labor/labor/master/labor-04/labor-04-example-access.log
# 日誌格式 # 字段說明, 參考:https://ru.wikipedia.org/wiki/Access.log %h%l%u%t \「%r \」%> s%b \「%{Referer} i \」\「%{User-Agent} i \」 # 具體示例 75.249.65.145 US - [2015-09-02 10:42:51.003372] "GET /cms/tina-access-editor-for-download/ HTTP/1.1" 200 7113 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" www.example.com 124.165.3.7 443 redirect-handler - + "-" Vea2i8CoAwcAADevXAgAAAAB TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 701 12118 -% 88871 803 0 0 0 0
解析日誌文件
HOST = r'^(?P<host>.*?)' SPACE = r'\s' IDENTITY = r'\S+' USER = r"\S+" TIME = r'\[(?P<time>.*?)\]' # REQUEST = r'\"(?P<request>.*?)\"' REQUEST = r'\"(?P<method>.+?)\s(?P<path>.+?)\s(?P<http_protocol>.*?)\"' STATUS = r'(?P<status>\d{3})' SIZE = r'(?P<size>\S+)' REFER = r"\S+" USER_AGENT = r'\"(?P<user_agent>.*?)\"' REGEX = HOST+SPACE+IDENTITY+SPACE+USER+SPACE+TIME+SPACE+REQUEST+SPACE+STATUS+SPACE+SIZE+SPACE+IDENTITY+USER_AGENT+SPACE line = '79.81.243.171 - - [30/Mar/2009:20:58:31 +0200] "GET /exemples.php HTTP/1.1" 200 11481 "http://www.facades.fr/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.0.3705; .NET CLR 1.1.4322; Media Center PC 4.0; .NET CLR 2.0.50727)" "-"' reg = re.compile(REGEX) reg.match(line).groups()
將數據注入DataFrame對象
COLUMNS = ["Host", "Time", "Method", "Path", "Protocol", "status", "size", "User_Agent"] field_lis = [] with open("access.log") as rf: for line in rf: # 因爲一些記錄不能匹配,因此須要捕獲異常, 不能捕獲的數據格式以下 # 80.32.156.105 - - [27/Mar/2009:13:39:51 +0100] "GET HTTP/1.1" 400 - "-" "-" "-" # 因爲重點不在寫正則表達式這裏就略過了 try: fields = reg.match(line).groups() except Exception as e: #print(e) #print(line) pass field_lis.append(fields) log_df = pd.DataFrame(field_lis) # 修改列名 log_df.columns = COLUMNS def parse_time(value): try: return pd.to_datetime(value) except Exception as e: print(e) print(value) # 將Time列的值修改爲pandas可解析的時間格式 log_df.Time = log_df.Time.apply(lambda x: x.replace(":", " ", 1)) log_df.Time = log_df.Time.apply(parse_time) # 修改index, 將Time列做爲index,並drop掉在Time列 log_df.index = pd.to_datetime(log_df.Time) log_df.drop("Time", inplace=True) log_df.head() Host Time Method Path Protocol status size User_Agent Time 2009-03-22 06:00:32 88.191.254.20 2009-03-22 06:00:32 GET / HTTP/1.0 200 8674 "- 2009-03-22 06:06:20 66.249.66.231 2009-03-22 06:06:20 GET /popup.php?choix=-89 HTTP/1.1 200 1870 "Mozilla/5.0 (compatible; Googlebot/2.1; +htt... 2009-03-22 06:11:20 66.249.66.231 2009-03-22 06:11:20 GET /specialiste.php HTTP/1.1 200 10743 "Mozilla/5.0 (compatible; Googlebot/2.1; +htt... 2009-03-22 06:40:06 83.198.250.175 2009-03-22 06:40:06 GET / HTTP/1.1 200 8714 "Mozilla/4.0 (compatible; MSIE 7.0; Windows N... 2009-03-22 06:40:06 83.198.250.175 2009-03-22 06:40:06 GET /style.css HTTP/1.1 200 1692 "Mozilla/4.0 (compatible; MSIE 7.0; Windows N...
查看數據類型
# 查看數據類型 log_df.dtypes Host object Time datetime64[ns] Method object Path object Protocol object status object size object User_Agent object dtype: object
由上可知, 除了Time字段是時間類型,其餘都是object,可是Size, Status應該爲數字
def parse_number(value): try: return pd.to_numeric(value) except Exception as e: pass return 0 # 將Size,Status字段值改成數值類型 log_df[["Status","Size"]] = log_df[["Status","Size"]].apply(lambda x: x.apply(parse_number)) log_df.dtypes Host object Time datetime64[ns] Method object Path object Protocol object Status int64 Size int64 User_Agent object dtype: object
統計status數據
# 統計不一樣status值的次數 log_df.Status.value_counts() 200 5737 304 1540 404 1186 400 251 302 37 403 3 206 2 Name: Status, dtype: int64
繪製pie圖
log_df.Status.value_counts().plot(kind="pie", figsize=(10,8))
查看日誌文件時間跨度
log_df.index.max() - log_df.index.min() Timedelta('15 days 11:12:03')
分別查看起始,終止時間
print(log_df.index.max()) print(log_df.index.min()) 2009-04-06 17:12:35 2009-03-22 06:00:32
按照此方法還能夠統計Method, User_Agent字段 ,不過User_Agent還須要額外清洗如下數據
統計top 10 IP地址
91.121.31.184 745 88.191.254.20 441 41.224.252.122 420 194.2.62.185 255 86.75.35.144 184 208.89.192.106 170 79.82.3.8 161 90.3.72.207 157 62.147.243.132 150 81.249.221.143 141 Name: Host, dtype: int64
繪製請求走勢圖
log_df2 = log_df.copy() # 爲每行加一個request字段,值爲1 log_df2["Request"] = 1 # 每一小時統計一次request數量,並將NaN值替代爲0,最後繪製線性圖,尺寸爲16x9 log_df2.Request.resample("H").sum().fillna(0).plot(kind="line",figsize=(16,10))
分別繪圖
分別對202,304,404狀態從新取樣,並放在一個列表裏面 req_df_lis = [ log_df2[log_df2.Status == 200].Request.resample("H").sum().fillna(0), log_df2[log_df2.Status == 304].Request.resample("H").sum().fillna(0), log_df2[log_df2.Status == 404].Request.resample("H").sum().fillna(0) ] # 將三個dataframe組合起來 req_df = pd.concat(req_df_lis,axis=1) req_df.columns = ["200", "304", "404"] # 繪圖 req_df.plot(figsize=(16,10))
https://pandas.pydata.org/pandas-docs/stable/index.html