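During data mining we often run into data points that deviate from the predicted trend; these are usually called outliers.
Such points are generally attributed to error. Errors can arise in many ways, and each case has to be handled on its own terms: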
Sensor malfunction -> ignore
Data entry error -> ignore
Freak event -> pay attention
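1. Train on the data
2. Detect outliers: find the points in the training set with the largest residual error and remove them (usually about 10% of the data)
3. Retrain
Steps 2 and 3 may need to be repeated several times.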
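The train / remove / retrain loop above can be sketched in a few lines. This is a minimal illustration, not the code used later in this post: it fits the line with `numpy.polyfit` instead of scikit-learn, the names `fit_line` and `trimmed_fit` are made up for the example, and the data is synthetic (a line of slope 6 with ten points deliberately pushed far below it).

```python
import numpy as np


def fit_line(x, y):
    """Least-squares straight line; returns (slope, intercept)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept


def trimmed_fit(x, y, drop_fraction=0.1, rounds=1):
    """Fit, drop the points with the largest residuals, refit; repeat `rounds` times."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = fit_line(x, y)
    for _ in range(rounds):
        residuals = np.abs(y - (slope * x + intercept))
        n_drop = int(np.ceil(len(x) * drop_fraction))
        # Indices of the points with the smallest residuals, i.e. the ones we keep.
        keep = np.argsort(residuals)[:len(x) - n_drop]
        x, y = x[keep], y[keep]
        slope, intercept = fit_line(x, y)
    return slope, intercept, x, y


# Synthetic data (an assumption for this sketch): true slope 6, Gaussian noise,
# and the last ten points shifted far below the trend to act as outliers.
rng = np.random.RandomState(42)
x_demo = np.linspace(20, 65, 100)
y_demo = 6.0 * x_demo + rng.normal(0.0, 5.0, size=100)
y_demo[-10:] -= 300.0

slope_raw, intercept_raw = fit_line(x_demo, y_demo)
slope_clean, intercept_clean, x_kept, y_kept = trimmed_fit(x_demo, y_demo)

print("slope before cleaning:", slope_raw)
print("slope after cleaning: ", slope_clean)
print("points kept:", len(x_kept))
```

With the outliers in place the fitted slope is pulled well away from 6; after one round of trimming it should land close to it again. Raising `rounds` repeats steps 2 and 3, at the cost of discarding more data each time.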
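Example: the fit obtained from the first regression on the data
The outliers pull the fitted line away from the main trend. After removing them, run the regression again:
With the outliers removed, the regression line fits the data well.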
Environment: macOS Mojave 10.14.3
Python 3.7.0
Library: scikit-learn 0.19.2
Original dataset:
First regression on the original data:
Regression after removing the 10% of points treated as outliers:
outlier_removal_regression.py (main program)

    #!/usr/bin/python

    import pickle

    import matplotlib.pyplot as plt
    import numpy
    from sklearn import linear_model
    # sklearn.cross_validation is deprecated (removed in scikit-learn 0.20);
    # model_selection provides the same train_test_split and exists in 0.19.2.
    from sklearn.model_selection import train_test_split

    from outlier_cleaner import outlierCleaner


    class StrToBytes:
        """Wrap a text-mode file so pickle.load() receives bytes under Python 3."""

        def __init__(self, fileobj):
            self.fileobj = fileobj

        def read(self, size):
            return self.fileobj.read(size).encode()

        def readline(self, size=-1):
            return self.fileobj.readline(size).encode()


    ### load up some practice data with outliers in it
    ages = pickle.load(StrToBytes(open("practice_outliers_ages.pkl", "r")))
    net_worths = pickle.load(StrToBytes(open("practice_outliers_net_worths.pkl", "r")))

    ### ages and net_worths need to be reshaped into 2D numpy arrays
    ### second argument of reshape command is a tuple of integers: (n_rows, n_columns)
    ### by convention, n_rows is the number of data points
    ### and n_columns is the number of features
    ages = numpy.reshape(numpy.array(ages), (len(ages), 1))
    net_worths = numpy.reshape(numpy.array(net_worths), (len(net_worths), 1))

    ages_train, ages_test, net_worths_train, net_worths_test = train_test_split(
        ages, net_worths, test_size=0.1, random_state=42)

    ### fill in a regression here!  Name the regression object reg so that
    ### the plotting code below works, and you can see what your regression looks like
    reg = linear_model.LinearRegression()
    reg.fit(ages_train, net_worths_train)
    print(reg.coef_)
    print(reg.intercept_)
    print(reg.score(ages_test, net_worths_test))

    try:
        plt.plot(ages, reg.predict(ages), color="blue")
    except NameError:
        pass
    plt.scatter(ages, net_worths)
    plt.show()

    ### identify and remove the most outlier-y points
    cleaned_data = []
    try:
        predictions = reg.predict(ages_train)
        cleaned_data = outlierCleaner(predictions, ages_train, net_worths_train)
    except NameError:
        print("your regression object doesn't exist, or isn't named reg")
        print("can't make predictions to use in identifying outliers")

    ### only run this code if cleaned_data is returning data
    if len(cleaned_data) > 0:
        ages, net_worths, errors = zip(*cleaned_data)
        ages = numpy.reshape(numpy.array(ages), (len(ages), 1))
        net_worths = numpy.reshape(numpy.array(net_worths), (len(net_worths), 1))

        ### refit your cleaned data!
        try:
            reg.fit(ages, net_worths)
            plt.plot(ages, reg.predict(ages), color="blue")
            print(reg.coef_)
            print(reg.intercept_)
            print(reg.score(ages_test, net_worths_test))
        except NameError:
            print("you don't seem to have regression imported/created,")
            print("   or else your regression object isn't named reg")
            print("   either way, only draw the scatter plot of the cleaned data")
        plt.scatter(ages, net_worths)
        plt.xlabel("ages")
        plt.ylabel("net worths")
        plt.show()
    else:
        print("outlierCleaner() is returning an empty list, no refitting to be done")
outlier_cleaner.py (removes the 10% of points with the largest errors)

    import math


    def outlierCleaner(predictions, ages, net_worths):
        """
        Clean away the 10% of points that have the largest
        residual errors (difference between the prediction
        and the actual net worth).

        Return a list of tuples named cleaned_data where
        each tuple is of the form (age, net_worth, error).
        """
        # Flatten the (n, 1) column arrays into 1-D arrays.
        ages = ages.reshape((1, len(ages)))[0]
        net_worths = net_worths.reshape((1, len(ages)))[0]
        predictions = predictions.reshape((1, len(ages)))[0]

        # zip() packs the corresponding elements of its iterable arguments
        # into tuples; in Python 3 it returns an iterator, not a list.
        cleaned_data = zip(ages, net_worths, abs(net_worths - predictions))

        # Sort by error, smallest first.
        cleaned_data = sorted(cleaned_data, key=lambda x: x[2])

        # math.ceil() rounds up; this gives the number of points to drop,
        # negated so it can serve as the end index of a slice.
        cleaned_num = int(-1 * math.ceil(len(cleaned_data) * 0.1))

        # Slice off the points with the largest errors.
        cleaned_data = cleaned_data[:cleaned_num]

        return cleaned_data
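The sort followed by a slice with a negative end index is the step that actually discards points, and it is easy to misread. Here is a small worked example with made-up numbers (ten points, so exactly one is dropped); the values and the 6.25 slope are assumptions chosen only so the arithmetic can be checked by hand.

```python
import math

# Hypothetical toy data: nine points close to the line 6.25 * age,
# and one point (age 65) far below it.
ages = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
net_worths = [125.0, 156.0, 188.0, 219.0, 250.0, 281.0, 313.0, 344.0, 375.0, 100.0]
predictions = [6.25 * age for age in ages]

# Build (age, net_worth, error) tuples and sort by error, smallest first.
rows = sorted(
    ((age, worth, abs(worth - pred))
     for age, worth, pred in zip(ages, net_worths, predictions)),
    key=lambda row: row[2],
)

n_drop = math.ceil(len(rows) * 0.1)          # ceil(1.0) = 1 point to drop
kept = rows[:-n_drop] if n_drop else rows    # everything except the last n_drop rows
dropped = rows[-n_drop:] if n_drop else []   # the rows with the largest errors

print("dropped:", dropped)
print("kept:", len(kept), "points")
```

The explicit `if n_drop` guard is a choice made for this sketch: a slice ending at `-0` is the same as a slice ending at `0`, which returns an empty list rather than the whole list.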
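The goodness of fit for the two regressions (R² on the test set, as printed by reg.score):
First: 0.8782624703664675
Second: 0.983189455395532
As these scores show, removing the outliers noticeably improves the model's predictions on this dataset.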
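The scores reported above come from `reg.score`, which for scikit-learn's `LinearRegression` returns the coefficient of determination R². As a rough sketch of what that number measures, it can be computed by hand; the helper name `r_squared` and the four-point sample are assumptions made for illustration only.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

# Predictions that match the targets exactly leave no residual.
perfect = r_squared(y_true, [1.0, 2.0, 3.0, 4.0])

# Always predicting the mean explains none of the variance.
mean_only = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])

print(perfect, mean_only)
```

A score of 1 means the predictions reproduce the targets exactly, and 0 means the model does no better than predicting the mean, so the move from roughly 0.88 to roughly 0.98 reflects a substantially smaller residual on the test points.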