Lesson 7: Methods for Calculating Outliers

Outliers

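Pandas provides a large number of functions for data exploration. These summary-statistic functions describe the overall distribution of the data and are mainly available as methods of the pandas objects DataFrame and Series.
sum(): sum of the samples (computed per column)

mean(): arithmetic mean of the samples

var(): variance of the samples

std(): standard deviation of the samples

corr(): correlation matrix of the samples (Pearson by default; Spearman with method='spearman')

cov(): covariance matrix of the samples

skew(): skewness of the sample values (third moment)

kurt(): kurtosis of the sample values (fourth moment)
describe(): basic descriptive statistics of the samples (mean, standard deviation, and so on)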
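
As a quick illustration of the methods listed above, here is a minimal sketch on a small made-up DataFrame. The name `toy` and its values are assumptions for illustration only and are not part of the lesson's data.

```python
import pandas as pd

# Hypothetical toy data (not from the lesson): column 'b' is exactly 2 * 'a'
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                    'b': [2.0, 4.0, 6.0, 8.0]})

total = toy.sum()                        # per-column sums
average = toy.mean()                     # per-column arithmetic means
variance = toy.var()                     # sample variance (ddof=1 by default)
spread = toy.std()                       # sample standard deviation
pearson = toy.corr()                     # Pearson correlation matrix (default)
spearman = toy.corr(method='spearman')   # rank-based alternative
summary = toy.describe()                 # count, mean, std, min, quartiles, max

print(summary)
```

Because 'b' is a linear function of 'a', both correlation matrices should contain values of 1 (up to floating-point rounding).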

In [1]:

import pandas as pd
import sys
In [2]:
print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
Python version 3.5.1 |Anaconda custom (64-bit)| (default, Feb 16 2016, 09:49:46) [MSC v.1900 64 bit (AMD64)]
Pandas version 0.20.1
In [3]:
# Create a DataFrame indexed by date
States = ['NY', 'NY', 'NY', 'NY', 'FL', 'FL', 'GA', 'GA', 'FL', 'FL']
data = [1.0, 2, 3, 4, 5, 6, 7, 8, 9, 10]
idx = pd.date_range('1/1/2012', periods=10, freq='MS')
df1 = pd.DataFrame(data, index=idx, columns=['Revenue'])
df1['State'] = States
# Create a second DataFrame
data2 = [10.0, 10.0, 9, 9, 8, 8, 7, 7, 6, 6]
idx2 = pd.date_range('1/1/2013', periods=10, freq='MS')
df2 = pd.DataFrame(data2, index=idx2, columns=['Revenue'])
df2['State'] = States

For details, see the pandas time-series function date_range.
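
To make the `freq='MS'` argument concrete, the following small sketch builds a three-period index. The variable name `month_starts` is an assumption for illustration; the call mirrors the one used in the cell above.

```python
import pandas as pd

# 'MS' stands for "month start": each timestamp is the first day of a month
month_starts = pd.date_range('1/1/2012', periods=3, freq='MS')

for stamp in month_starts:
    print(stamp.date())
```

With `periods=10`, as in the lesson, the index runs from 2012-01-01 through 2012-10-01, which matches the row labels shown in the output below.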
In [4]:
# Combine the DataFrames
df = pd.concat([df1, df2])
df
Out[4]:
  Revenue State
2012-01-01 1.0 NY
2012-02-01 2.0 NY
2012-03-01 3.0 NY
2012-04-01 4.0 NY
2012-05-01 5.0 FL
2012-06-01 6.0 FL
2012-07-01 7.0 GA
2012-08-01 8.0 GA
2012-09-01 9.0 FL
2012-10-01 10.0 FL
2013-01-01 10.0 NY
2013-02-01 10.0 NY
2013-03-01 9.0 NY
2013-04-01 9.0 NY
2013-05-01 8.0 FL
2013-06-01 8.0 FL
2013-07-01 7.0 GA
2013-08-01 7.0 GA
2013-09-01 6.0 FL
2013-10-01 6.0 FL
 

Ways to Calculate Outliers

Note: the mean and standard deviation are only valid for Gaussian distributions.
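
The methods that follow flag a value when it lies more than 1.96 standard deviations from the mean. The lesson does not explain the constant; a common reading, assumed here, is that about 95% of a Gaussian distribution falls within 1.96 standard deviations of its mean. A minimal check using only the standard library (`statistics.NormalDist`, Python 3.8+):

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
std_normal = NormalDist(mu=0.0, sigma=1.0)

# Probability mass between -1.96 and +1.96 standard deviations
coverage = std_normal.cdf(1.96) - std_normal.cdf(-1.96)

# Going the other way: the z value that leaves 2.5% in the upper tail
z_975 = std_normal.inv_cdf(0.975)

print(coverage, z_975)
```

Under that reading, roughly 5% of perfectly normal data would still be flagged, so a flag is a prompt to look closer rather than proof of an error.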

In [5]:

# Method 1
# Make a copy of df
newdf = df.copy()
newdf['x-Mean'] = abs(newdf['Revenue'] - newdf['Revenue'].mean())
newdf['1.96*std'] = 1.96*newdf['Revenue'].std()
newdf['Outlier'] = abs(newdf['Revenue'] - newdf['Revenue'].mean()) > 1.96*newdf['Revenue'].std()
newdf

Out[5]:
  Revenue State x-Mean 1.96*std Outlier
2012-01-01 1.0 NY 5.75 5.200273 True
2012-02-01 2.0 NY 4.75 5.200273 False
2012-03-01 3.0 NY 3.75 5.200273 False
2012-04-01 4.0 NY 2.75 5.200273 False
2012-05-01 5.0 FL 1.75 5.200273 False
2012-06-01 6.0 FL 0.75 5.200273 False
2012-07-01 7.0 GA 0.25 5.200273 False
2012-08-01 8.0 GA 1.25 5.200273 False
2012-09-01 9.0 FL 2.25 5.200273 False
2012-10-01 10.0 FL 3.25 5.200273 False
2013-01-01 10.0 NY 3.25 5.200273 False
2013-02-01 10.0 NY 3.25 5.200273 False
2013-03-01 9.0 NY 2.25 5.200273 False
2013-04-01 9.0 NY 2.25 5.200273 False
2013-05-01 8.0 FL 1.25 5.200273 False
2013-06-01 8.0 FL 1.25 5.200273 False
2013-07-01 7.0 GA 0.25 5.200273 False
2013-08-01 7.0 GA 0.25 5.200273 False
2013-09-01 6.0 FL 0.75 5.200273 False
2013-10-01 6.0 FL 0.75 5.200273 False
In [6]:
# Method 2
# Group by item
# Make a copy of df
newdf = df.copy()
State = newdf.groupby('State')
# Select 'Revenue' explicitly so that each transform returns a Series
newdf['Outlier'] = State['Revenue'].transform(lambda x: abs(x-x.mean()) > 1.96*x.std())
newdf['x-Mean'] = State['Revenue'].transform(lambda x: abs(x-x.mean()))
newdf['1.96*std'] = State['Revenue'].transform(lambda x: 1.96*x.std())
newdf
Out[6]:
  Revenue State Outlier x-Mean 1.96*std
2012-01-01 1.0 NY False 5.00 7.554813
2012-02-01 2.0 NY False 4.00 7.554813
2012-03-01 3.0 NY False 3.00 7.554813
2012-04-01 4.0 NY False 2.00 7.554813
2012-05-01 5.0 FL False 2.25 3.434996
2012-06-01 6.0 FL False 1.25 3.434996
2012-07-01 7.0 GA False 0.25 0.980000
2012-08-01 8.0 GA False 0.75 0.980000
2012-09-01 9.0 FL False 1.75 3.434996
2012-10-01 10.0 FL False 2.75 3.434996
2013-01-01 10.0 NY False 4.00 7.554813
2013-02-01 10.0 NY False 4.00 7.554813
2013-03-01 9.0 NY False 3.00 7.554813
2013-04-01 9.0 NY False 3.00 7.554813
2013-05-01 8.0 FL False 0.75 3.434996
2013-06-01 8.0 FL False 0.75 3.434996
2013-07-01 7.0 GA False 0.25 0.980000
2013-08-01 7.0 GA False 0.25 0.980000
2013-09-01 6.0 FL False 1.25 3.434996
2013-10-01 6.0 FL False 1.25 3.434996
In [7]:
# Method 2
# Group by multiple items
# Make a copy of the original df
newdf = df.copy()
StateMonth = newdf.groupby(['State', lambda x: x.month])
# Select 'Revenue' explicitly so that each transform returns a Series
newdf['Outlier'] = StateMonth['Revenue'].transform(lambda x: abs(x-x.mean()) > 1.96*x.std())
newdf['x-Mean'] = StateMonth['Revenue'].transform(lambda x: abs(x-x.mean()))
newdf['1.96*std'] = StateMonth['Revenue'].transform(lambda x: 1.96*x.std())
newdf
Out[7]:
  Revenue State Outlier x-Mean 1.96*std
2012-01-01 1.0 NY False 4.5 12.473364
2012-02-01 2.0 NY False 4.0 11.087434
2012-03-01 3.0 NY False 3.0 8.315576
2012-04-01 4.0 NY False 2.5 6.929646
2012-05-01 5.0 FL False 1.5 4.157788
2012-06-01 6.0 FL False 1.0 2.771859
2012-07-01 7.0 GA False 0.0 0.000000
2012-08-01 8.0 GA False 0.5 1.385929
2012-09-01 9.0 FL False 1.5 4.157788
2012-10-01 10.0 FL False 2.0 5.543717
2013-01-01 10.0 NY False 4.5 12.473364
2013-02-01 10.0 NY False 4.0 11.087434
2013-03-01 9.0 NY False 3.0 8.315576
2013-04-01 9.0 NY False 2.5 6.929646
2013-05-01 8.0 FL False 1.5 4.157788
2013-06-01 8.0 FL False 1.0 2.771859
2013-07-01 7.0 GA False 0.0 0.000000
2013-08-01 7.0 GA False 0.5 1.385929
2013-09-01 6.0 FL False 1.5 4.157788
2013-10-01 6.0 FL False 2.0 5.543717
In [8]:
# Method 3
# Group by item
# Make a copy of the original df
newdf = df.copy()
State = newdf.groupby('State')

def s(group):
    group['x-Mean'] = abs(group['Revenue'] - group['Revenue'].mean())
    group['1.96*std'] = 1.96*group['Revenue'].std()
    group['Outlier'] = abs(group['Revenue'] - group['Revenue'].mean()) > 1.96*group['Revenue'].std()
    return group

Newdf2 = State.apply(s)
Newdf2
Out[8]:
  Revenue State x-Mean 1.96*std Outlier
2012-01-01 1.0 NY 5.00 7.554813 False
2012-02-01 2.0 NY 4.00 7.554813 False
2012-03-01 3.0 NY 3.00 7.554813 False
2012-04-01 4.0 NY 2.00 7.554813 False
2012-05-01 5.0 FL 2.25 3.434996 False
2012-06-01 6.0 FL 1.25 3.434996 False
2012-07-01 7.0 GA 0.25 0.980000 False
2012-08-01 8.0 GA 0.75 0.980000 False
2012-09-01 9.0 FL 1.75 3.434996 False
2012-10-01 10.0 FL 2.75 3.434996 False
2013-01-01 10.0 NY 4.00 7.554813 False
2013-02-01 10.0 NY 4.00 7.554813 False
2013-03-01 9.0 NY 3.00 7.554813 False
2013-04-01 9.0 NY 3.00 7.554813 False
2013-05-01 8.0 FL 0.75 3.434996 False
2013-06-01 8.0 FL 0.75 3.434996 False
2013-07-01 7.0 GA 0.25 0.980000 False
2013-08-01 7.0 GA 0.25 0.980000 False
2013-09-01 6.0 FL 1.25 3.434996 False
2013-10-01 6.0 FL 1.25 3.434996 False
In [9]:
# Method 3
# Group by multiple items
# Make a copy of the original df
newdf = df.copy()
StateMonth = newdf.groupby(['State', lambda x: x.month])

def s(group):
    group['x-Mean'] = abs(group['Revenue'] - group['Revenue'].mean())
    group['1.96*std'] = 1.96*group['Revenue'].std()
    group['Outlier'] = abs(group['Revenue'] - group['Revenue'].mean()) > 1.96*group['Revenue'].std()
    return group

Newdf2 = StateMonth.apply(s)
Newdf2
Out[9]:
  Revenue State x-Mean 1.96*std Outlier
2012-01-01 1.0 NY 4.5 12.473364 False
2012-02-01 2.0 NY 4.0 11.087434 False
2012-03-01 3.0 NY 3.0 8.315576 False
2012-04-01 4.0 NY 2.5 6.929646 False
2012-05-01 5.0 FL 1.5 4.157788 False
2012-06-01 6.0 FL 1.0 2.771859 False
2012-07-01 7.0 GA 0.0 0.000000 False
2012-08-01 8.0 GA 0.5 1.385929 False
2012-09-01 9.0 FL 1.5 4.157788 False
2012-10-01 10.0 FL 2.0 5.543717 False
2013-01-01 10.0 NY 4.5 12.473364 False
2013-02-01 10.0 NY 4.0 11.087434 False
2013-03-01 9.0 NY 3.0 8.315576 False
2013-04-01 9.0 NY 2.5 6.929646 False
2013-05-01 8.0 FL 1.5 4.157788 False
2013-06-01 8.0 FL 1.0 2.771859 False
2013-07-01 7.0 GA 0.0 0.000000 False
2013-08-01 7.0 GA 0.5 1.385929 False
2013-09-01 6.0 FL 1.5 4.157788 False
2013-10-01 6.0 FL 2.0 5.543717 False
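
One reason the grouped results above contain no True values may be group size rather than the data itself. With the sample standard deviation, the largest possible distance from the mean in a group of n values is (n-1)/sqrt(n) standard deviations, which stays below 1.96 for very small groups. The sketch below checks this numerically; the helper name `max_abs_z` and the extreme value 1e6 are assumptions for illustration.

```python
import pandas as pd

def max_abs_z(values):
    """Largest |x - mean| / std in the sample (std uses ddof=1)."""
    s = pd.Series(values, dtype='float64')
    return ((s - s.mean()).abs() / s.std()).max()

# Two points per group, as in the State/month grouping
pair = max_abs_z([0.0, 1e6])

# Four points per group, as for GA when grouping by State only
quad = max_abs_z([7.0, 7.0, 7.0, 1e6])

print(pair, quad)
```

Even with a wildly extreme value, neither group can exceed the 1.96 threshold, so the rule cannot flag anything in groups this small. Groups of eight, such as NY and FL, are large enough for a flag to be possible.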
 

Assuming a non-Gaussian distribution (if you plot the data, it does not look like a normal distribution)

In [10]:
# Make a copy of the original df
newdf = df.copy()
State = newdf.groupby('State')
newdf['Lower'] = State['Revenue'].transform(lambda x: x.quantile(q=.25) - (1.5*(x.quantile(q=.75)-x.quantile(q=.25))))
newdf['Upper'] = State['Revenue'].transform(lambda x: x.quantile(q=.75) + (1.5*(x.quantile(q=.75)-x.quantile(q=.25))))
newdf['Outlier'] = (newdf['Revenue'] < newdf['Lower']) | (newdf['Revenue'] > newdf['Upper'])
newdf
Out[10]:
  Revenue State Lower Upper Outlier
2012-01-01 1.0 NY -7.000 19.000 False
2012-02-01 2.0 NY -7.000 19.000 False
2012-03-01 3.0 NY -7.000 19.000 False
2012-04-01 4.0 NY -7.000 19.000 False
2012-05-01 5.0 FL 2.625 11.625 False
2012-06-01 6.0 FL 2.625 11.625 False
2012-07-01 7.0 GA 6.625 7.625 False
2012-08-01 8.0 GA 6.625 7.625 True
2012-09-01 9.0 FL 2.625 11.625 False
2012-10-01 10.0 FL 2.625 11.625 False
2013-01-01 10.0 NY -7.000 19.000 False
2013-02-01 10.0 NY -7.000 19.000 False
2013-03-01 9.0 NY -7.000 19.000 False
2013-04-01 9.0 NY -7.000 19.000 False
2013-05-01 8.0 FL 2.625 11.625 False
2013-06-01 8.0 FL 2.625 11.625 False
2013-07-01 7.0 GA 6.625 7.625 False
2013-08-01 7.0 GA 6.625 7.625 False
2013-09-01 6.0 FL 2.625 11.625 False
2013-10-01 6.0 FL 2.625 11.625 False
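
The quartile-based rule used above can be wrapped in a small reusable function. This is a minimal sketch, not the tutorial's own code: the function name `iqr_outliers` and the parameter `k` are assumptions, and the sample values are the four GA revenues from the lesson's data.

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return (values < lower) | (values > upper)

# The four GA revenue values from the lesson's DataFrame
ga_revenue = pd.Series([7.0, 7.0, 7.0, 8.0])
flags = iqr_outliers(ga_revenue)
print(flags.tolist())
```

For these values the quartiles are 7.0 and 7.25, giving fences of 6.625 and 7.625, so only 8.0 is flagged. That matches the GA rows in the output above, and it also shows how sensitive the rule is when a group has only a handful of nearly identical values.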
 

This tutorial was rewritten by CDS
