Python Machine Learning Notes: Outlier Detection Algorithms: LOF (Local Outlier Factor)

Date: 2020-12-03


The complete code and data are available on the author's GitHub:

https://github.com/LeBron-Jian/MachineLearningNote

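In data mining, data usually has to be cleaned before feature engineering and model training, removing invalid and anomalous records. Anomaly detection is also a research direction of data mining in its own right, used in areas such as anti-cheating, fake base station detection, and financial fraud detection.

Earlier posts covered two anomaly detection algorithms, One Class SVM and Isolation Forest:

Python Machine Learning Notes: Outlier Detection Algorithms: One Class SVM

Python Machine Learning Notes: Outlier Detection Algorithms: Isolation Forest

This post covers another anomaly detection algorithm: Local Outlier Factor.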

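Preface: anomaly detection algorithms

Anomaly detection methods differ with the form of the data. A common choice is the distribution-based approach, which treats values beyond the upper and lower α quantiles as outliers (see the figure below); it is typically applied to single attribute values. Distance-based approaches suit outlier identification in two-dimensional or higher-dimensional coordinate systems, for example points in a 2-D plane or in latitude/longitude coordinates.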
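The distribution-based rule can be sketched in a few lines. This is only an illustration: the helper name `quantile_outliers`, the synthetic data, and the choice alpha = 0.01 are assumptions, not part of the original post.

```python
import numpy as np

def quantile_outliers(values, alpha=0.05):
    """Flag values that fall outside the [alpha, 1 - alpha] quantile band."""
    values = np.asarray(values, dtype=float)
    lower, upper = np.quantile(values, [alpha, 1.0 - alpha])
    return (values < lower) | (values > upper)

# Synthetic attribute: 200 standard-normal values plus two extreme values at the end.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 1.0, size=200), [8.0, -9.0]])

mask = quantile_outliers(data, alpha=0.01)
print(data[mask])
```

A rule like this always flags roughly a 2·alpha share of the values, whether or not they are truly anomalous, which is one reason a density-based score can be more informative.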

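The algorithm studied below is the Local Outlier Factor (LOF), a distance-based anomaly detection algorithm (more precisely, one that derives local densities from distances). It can perform outlier detection on moderately high-dimensional datasets.

Local Outlier Factor (LOF) is a classic density-based algorithm (Breunig et al., 2000). The paper was published at SIGMOD 2000 and had more than 3000 citations at the time of writing. Before LOF, most anomaly detection algorithms were based on statistical methods, or borrowed clustering algorithms (for example DBSCAN and OPTICS) to identify outliers. Statistical methods usually need to assume that the data follow a particular probability distribution, an assumption that often does not hold. Clustering methods usually yield only a 0/1 decision (outlier or not) and cannot quantify how anomalous each point is. By comparison, the density-based LOF is simpler and more intuitive: it makes few demands on the data distribution and quantifies the degree of abnormality (outlierness) of every data point.

Before studying LOF it may help to know the K-Means algorithm; the related post is:

Python Machine Learning Notes: K-Means Algorithm, DBSCAN Algorithm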

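1. LOF (Local Outlier Factor) algorithm theory

(Source for this part: https://blog.csdn.net/wangyibo0201/article/details/51705966/)

1.1 Introduction to the LOF algorithm

LOF is a density-based algorithm, and its core is how it characterizes the density around a data point. If you have some familiarity with distance-based or density-based clustering algorithms, you will notice that several concepts LOF uses to define density closely resemble concepts in K-Means.

Start with a visual intuition. In the figure below, the points in set C1 have fairly uniform spacing, density, and dispersion, so they can be regarded as one cluster; the points in set C2 likewise form one cluster. Points o1 and o2 are relatively isolated and can be regarded as outliers. The question is how to make the algorithm general enough to identify outliers for sets such as C1 and C2 whose densities differ widely. LOF achieves this: it does not misclassify normal points as outliers merely because the data density varies.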

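1.2 Steps of the LOF algorithm

The definitions used by LOF are as follows:

(1) d(p, o): the distance between two points p and o

(2) k-distance: the k-th distance

Among the points closest to a data point p, the distance between p and its k-th nearest point is called the k-distance of p, written k-distance(p).

The k-distance $d_k(p)$ of a point p is defined as follows: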

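$d_k(p) = d(p, o)$, where the point $o$ in the data set $C$ satisfies:

(a) at least $k$ points $o' \in C \setminus \{p\}$ satisfy $d(p, o') \le d(p, o)$;

(b) at most $k - 1$ points $o' \in C \setminus \{p\}$ satisfy $d(p, o') < d(p, o)$.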

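In other words, the k-distance of p is the distance from p to its k-th nearest point, not counting p itself, as shown in the figure below:

(3) k-distance neighborhood of p: the k-th distance neighborhood

The k-distance neighborhood $N_k(p)$ of a point p is the set of all points within the k-distance of p, including those exactly at the k-distance.

Hence the number of points in the k-distance neighborhood of p satisfies $|N_k(p)| \ge k$.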
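Definitions (2) and (3) can be checked on a tiny data set. The sketch below is a brute-force illustration with Euclidean distance; the function name `k_distance_and_neighbors` and the four 1-D points are assumptions made for the example.

```python
import numpy as np

def k_distance_and_neighbors(X, k):
    """Return d_k(p) and the index set N_k(p) for every point p in X."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances d(p, o).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # A point is never its own neighbour, so mask the diagonal.
    np.fill_diagonal(dist, np.inf)
    # d_k(p): the k-th smallest distance in each row.
    k_dist = np.sort(dist, axis=1)[:, k - 1]
    # N_k(p): every point within d_k(p), ties included.
    neighbors = [np.flatnonzero(row <= d) for row, d in zip(dist, k_dist)]
    return k_dist, neighbors

X = [[0.0], [1.0], [2.0], [10.0]]
k_dist, neighbors = k_distance_and_neighbors(X, k=1)
print(k_dist)
print([n.tolist() for n in neighbors])
```

For the point at 1.0 the two nearest points are tied at distance 1, so its neighborhood contains two points even though k = 1, which is why the neighborhood size is at least k rather than exactly k.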

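(4) reach-distance: reachability distance

Reachability distance: this definition depends on the k-distance. Given the parameter k, the reachability distance reach-dist(p, o) from data point p to data point o is the maximum of the k-distance of o and the direct distance between p and o.

The k-th reachability distance from point o to point p is defined as:

$\text{reach-dist}_k(p, o) = \max\{d_k(o),\ d(p, o)\}$

That is, the k-th reachability distance from o to p is at least the k-distance of o, and otherwise it is the true distance between o and p. It also means that for the k points nearest to o, the reachability distance from o to each of them is treated as equal, namely $d_k(o)$. In the figure below, the 5th reachability distance from o1 to p is $d(p, o_1)$, and the 5th reachability distance from o2 to p is $d_5(o_2)$.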
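A small sketch makes the asymmetry of the reachability distance visible. It assumes Euclidean distance and brute-force computation; `reach_dist_matrix` and the sample points are illustrative names and data, not from the original post.

```python
import numpy as np

def reach_dist_matrix(X, k):
    """R[i, j] = reach-dist_k(p_i, o_j) = max(d_k(o_j), d(p_i, o_j))."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    off_diag = dist.copy()
    np.fill_diagonal(off_diag, np.inf)              # exclude each point from its own ranking
    k_dist = np.sort(off_diag, axis=1)[:, k - 1]    # d_k(o) for every point o
    # Broadcast d_k(o_j) along the columns; diagonal entries are not meaningful.
    return np.maximum(k_dist[None, :], dist)

X = [[0.0], [1.0], [2.0], [10.0]]
R = reach_dist_matrix(X, k=2)
print(R)
```

Here `R[0, 1]` and `R[1, 0]` differ although the plain distance between the two points is the same in both directions, because each entry uses the k-distance of the second point.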

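(5) local reachability density

Local reachability density: this definition builds on the reachability distance. For a data point p, the data points whose distance to p is at most k-distance(p) are called its k-nearest neighbors, written $N_k(p)$. The local reachability density of p is the reciprocal of the average reachability distance between p and its neighbors.

The local reachability density of p is:

$\text{lrd}_k(p) = 1 \Big/ \left( \dfrac{\sum_{o \in N_k(p)} \text{reach-dist}_k(p, o)}{|N_k(p)|} \right)$

that is, the reciprocal of the average reachability distance from the points in the k-distance neighborhood of p to p.

Note that this uses the reachability distance from each neighbor $o \in N_k(p)$ to p, not from p to the points in $N_k(p)$, so the k-distance inside the maximum is $d_k(o)$ and not $d_k(p)$; be sure to keep the direction straight. Also, if there are duplicate points, the sum of reachability distances in the denominator can be 0, which makes lrd infinite; this point comes up again below.

The value can be understood as follows. It represents a density: the higher the density, the more likely the point belongs to a cluster; the lower the density, the more likely it is an outlier. If p and its neighbors belong to the same cluster, the reachability distances are more likely to take the smaller value $d_k(o)$, so their sum is small and the density is high. If p is far from its neighbors, the reachability distances tend to take the larger value $d(p, o)$, so the density is low and p is more likely an outlier.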
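The local reachability density can be computed directly from the definitions. The following is a brute-force sketch assuming Euclidean distance; the function name and the toy data are illustrative assumptions.

```python
import numpy as np

def local_reachability_density(X, k):
    """lrd_k(p) = 1 / mean over o in N_k(p) of max(d_k(o), d(p, o))."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)                  # a point is not its own neighbour
    k_dist = np.sort(dist, axis=1)[:, k - 1]        # d_k for every point
    lrd = np.empty(len(X))
    for p in range(len(X)):
        nbrs = np.flatnonzero(dist[p] <= k_dist[p])        # N_k(p), ties included
        reach = np.maximum(k_dist[nbrs], dist[p, nbrs])    # uses d_k(o), not d_k(p)
        lrd[p] = 1.0 / reach.mean()
    return lrd

X = [[0.0], [1.0], [2.0], [10.0]]
lrd = local_reachability_density(X, k=2)
print(lrd)
```

The isolated point at 10.0 ends up with a much lower density than the three points that sit close together, which is the behaviour the paragraph above describes.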

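(6) local outlier factor

Local outlier factor: by the definition of local reachability density, a data point that lies far from the other points clearly has a low local reachability density. However, LOF does not measure how anomalous a point is by its absolute local density, but by its density relative to its neighboring points. The benefit is that the data may be unevenly distributed, with regions of different density. The local outlier factor is defined through this local relative density: the local relative density (local outlier factor) of p is the ratio of the average local reachability density of p's neighbors to the local reachability density of p.

The local outlier factor of p is:

$\text{LOF}_k(p) = \dfrac{\sum_{o \in N_k(p)} \dfrac{\text{lrd}_k(o)}{\text{lrd}_k(p)}}{|N_k(p)|} = \dfrac{\sum_{o \in N_k(p)} \text{lrd}_k(o)}{|N_k(p)|} \Big/ \text{lrd}_k(p)$

that is, the average of the ratios between the local reachability density of each point in $N_k(p)$ and the local reachability density of p.

LOF reflects how anomalous a sample is by computing a single score. Roughly, the score is the average density at the locations of the samples surrounding a point, divided by the density at the point's own location. The closer the ratio is to 1, the more similar the density of p is to that of its neighbors, and p likely belongs to the same cluster as they do. The further the ratio falls below 1, the higher the density of p is relative to its neighbors, and p is a dense point. The further the ratio rises above 1, the lower the density of p is relative to its neighbors, and the more likely p is an outlier.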
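A quick worked example may help with reading the score. The densities below are made-up numbers chosen for easy arithmetic (they are not from the post): a point p with three neighbors, first with a low density of its own, then with a density matching its neighbors.

```latex
% Case 1: p is much sparser than its neighbours
\mathrm{lrd}_k(p) = 0.1,\quad
\mathrm{lrd}_k(o_1) = 0.5,\ \mathrm{lrd}_k(o_2) = 0.4,\ \mathrm{lrd}_k(o_3) = 0.6
\;\Rightarrow\;
\mathrm{LOF}_k(p) = \frac{(0.5 + 0.4 + 0.6)/3}{0.1} = \frac{0.5}{0.1} = 5

% Case 2: p is as dense as its neighbours
\mathrm{lrd}_k(p) = 0.5
\;\Rightarrow\;
\mathrm{LOF}_k(p) = \frac{(0.5 + 0.4 + 0.6)/3}{0.5} = 1
```

In the first case the score is well above 1, so p looks like an outlier; in the second it is exactly 1, so p blends in with its neighborhood.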

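With the definitions above in hand, the algorithm can be summarized briefly:

1. For each data point, compute its distance to all other points and sort them from nearest to farthest.
2. For each data point, find its k-nearest neighbors and compute its LOF score.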
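The two steps can be sketched end to end and compared with scikit-learn. This is a minimal sketch, not the library's implementation: it takes exactly k neighbors per point (ties are not expanded, unlike the set definition of the k-distance neighborhood), and the function name `lof_scores` is an assumption.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

def lof_scores(X, k):
    """LOF score of every row of X, using exactly k nearest neighbours."""
    X = np.asarray(X, dtype=float)
    # Step 1: sorted neighbour distances. Calling kneighbors() without
    # arguments queries the training points and excludes each point itself.
    dist, idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors()
    k_dist = dist[:, -1]                       # d_k(p)
    # Step 2: reachability distance, local reachability density, LOF.
    reach = np.maximum(k_dist[idx], dist)      # max(d_k(o), d(p, o))
    lrd = 1.0 / reach.mean(axis=1)
    return lrd[idx].mean(axis=1) / lrd

X = [[-1.1], [0.2], [100.1], [0.3]]
scores = lof_scores(X, k=2)
reference = -LocalOutlierFactor(n_neighbors=2).fit(X).negative_outlier_factor_
print(scores)
print(np.allclose(scores, reference, rtol=1e-6))
```

On these four numbers the sketch gives the same values, up to sign, as the `negative_outlier_factor_` output listed in Example 1 of section 2.2.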

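1.3 Applying the algorithm

The definition of local reachability density in LOF implicitly assumes that there are no k or more duplicate points. When such duplicates exist, their average reachability distance is zero and the local reachability density becomes infinite, which complicates the computation. In practice, to avoid this, k-distance can be replaced by k-distinct-distance, which ignores duplicates. Alternatively, a very small value can be added to every reachability distance so that it never equals zero.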
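The duplicate-point problem and the "add a small value" workaround can be reproduced on a toy set. The sketch below is an assumption-laden illustration: it uses exactly k neighbors per point, the helper name `lrd` and the value `eps=1e-10` are arbitrary choices, and it does not implement the k-distinct-distance variant.

```python
import numpy as np

def lrd(X, k, eps=0.0):
    """Local reachability density with exactly k neighbours; eps is added to every reach-dist."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)                      # a point is not its own neighbour
    order = np.argsort(dist, axis=1)[:, :k]             # indices of the k nearest neighbours
    nn_dist = np.take_along_axis(dist, order, axis=1)   # their distances, ascending
    k_dist = nn_dist[:, -1]                             # d_k(p)
    reach = np.maximum(k_dist[order], nn_dist) + eps    # reach-dist_k(p, o), plus eps
    with np.errstate(divide="ignore"):                  # the division by zero is expected here
        return 1.0 / reach.mean(axis=1)

X = [[1.0], [1.0], [1.0], [2.0], [4.0]]   # three copies of one point, k = 2
raw = lrd(X, k=2)
fixed = lrd(X, k=2, eps=1e-10)
print(raw)
print(fixed)
```

Without the offset the three duplicated points get an infinite density, and any LOF ratio involving them becomes infinite or undefined; with the offset every density stays finite, at the cost of very large values for the duplicates.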

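LOF needs the pairwise distances between data points, which makes the overall time complexity $O(n^2)$. Later work has tried to improve efficiency. FastLOF (Goldstein, 2012) first splits the data randomly into several subsets and computes LOF values within each subset. Points whose LOF score is less than or equal to 1 are removed from the data set, and the remaining points look for better nearest neighbors in the next round and have their LOF values updated. This idea of first partitioning the data coarsely and then using local results to filter the data and cut computation is not unusual. For example, Canopy Clustering takes a similar approach to speed up K-Means.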

2. Applying LOF (sklearn implementation)

2.1 The LOF class in sklearn

Unsupervised Outlier Detection using Local Outlier Factor (LOF).

The anomaly score of each sample is called Local Outlier Factor. It measures the local deviation of density of a given sample with respect to its neighbors. It is local in that the anomaly score depends on how isolated the object is with respect to the surrounding neighborhood. More precisely, locality is given by k-nearest neighbors, whose distance is used to estimate the local density. By comparing the local density of a sample to the local densities of its neighbors, one can identify samples that have a substantially lower density than their neighbors. These are considered outliers.

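The local outlier factor is the anomaly score of each sample. It decides whether a point p is an outlier mainly by comparing the density of p with that of its neighboring points: the lower the density of p relative to its neighbors, the more likely it is flagged as an outlier. Density is computed from the distances between points: the farther apart the points, the lower the density; the closer, the higher. And because LOF computes density over the k-distance neighborhood of a point rather than globally, it is called a "local" outlier factor.

In sklearn, LOF lives in the neighbors module; its source is as follows:

Main parameters of LOF:

n_neighbors: sets k, default=20
contamination: sets the proportion of outliers in the samples, default=auto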
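To see what `contamination` changes, the sketch below fits the same data with three settings and prints the fitted threshold `offset_` together with the number of samples labelled -1. The synthetic data and the particular contamination values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
# 100 clustered points plus 10 scattered ones (synthetic data, assumed for illustration)
X = np.r_[0.3 * rng.standard_normal((100, 2)), rng.uniform(-4, 4, size=(10, 2))]

results = {}
for contamination in ("auto", 0.05, 0.2):
    clf = LocalOutlierFactor(n_neighbors=20, contamination=contamination)
    labels = clf.fit_predict(X)
    # offset_ is the threshold applied to negative_outlier_factor_
    results[contamination] = (clf.offset_, int((labels == -1).sum()))
    print(contamination, results[contamination])
```

With a numeric `contamination` the threshold is taken from the score distribution, so the share of flagged samples follows the value passed in; with `"auto"` a fixed threshold of -1.5 is used and the flagged share depends on the data.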

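Main attributes of LOF:

A note on negative_outlier_factor_: it is the opposite (negated) LOF value, so the smaller it is, the more likely the sample is an outlier. (The closer LOF is to 1, the more likely the sample is normal; the further LOF exceeds 1, the more likely it is an outlier.)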
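A short sketch of how the attribute is typically read: flip the sign to get LOF back, and sort ascending to rank samples from most to least anomalous. The four sample values are the ones used in Example 1 of this post; the variable names are illustrative.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = [[-1.1], [0.2], [100.1], [0.3]]
clf = LocalOutlierFactor(n_neighbors=2).fit(X)

lof = -clf.negative_outlier_factor_                   # flip the sign to recover LOF
ranking = np.argsort(clf.negative_outlier_factor_)    # most anomalous sample first
print(lof)
print(ranking)
```

The first index in `ranking` points at the value 100.1, the sample whose LOF is far above 1.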

Main methods of LOF:

2.2 LOF in practice

Example 1: finding outliers in a set of numbers

The code is as follows:

import numpy as np
from sklearn.neighbors import LocalOutlierFactor as LOF

X = [[-1.1], [0.2], [100.1], [0.3]]
clf = LOF(n_neighbors=2)
res = clf.fit_predict(X)
print(res)
print(clf.negative_outlier_factor_)

'''
If X = [[-1.1], [0.2], [100.1], [0.3]]
[ 1  1 -1  1]
[ -0.98214286  -1.03703704 -72.64219576  -0.98214286]

If X = [[-1.1], [0.2], [0.1], [0.3]]
[-1  1  1  1]
[-7.29166666 -1.33333333 -0.875      -0.875     ]

If X = [[0.15], [0.2], [0.1], [0.3]]
[ 1  1  1 -1]
[-1.33333333 -0.875      -0.875      -1.45833333]
'''

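As the numbers change, the detected outlier changes as well; in every case the decision rests on the ratio of neighborhood densities.

Example 2: Outlier detection

(Outlier detection: the training data contain outliers, and the model should fit the central mass of the training data while ignoring the other, anomalous training samples.)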

The Local Outlier Factor (LOF) algorithm is an unsupervised anomaly detection method which computes the local density deviation of a given data point with respect to its neighbors. It considers as outliers the samples that have a substantially lower density than their neighbors.

This example shows how to use LOF for outlier detection, which is the default use case of this estimator in sklearn. Note that when LOF is used for outlier detection it has no predict, decision_function and score_samples methods.

The number of neighbors considered (parameter n_neighbors) is typically set 1) greater than the minimum number of samples a cluster has to contain, so that other samples can be local outliers relative to this cluster, and 2) smaller than the maximum number of close-by samples that can potentially be local outliers. In practice, such information is generally not available and taking n_neighbors=20 appears to work well in general.

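In other words, the number of neighbors considered (parameter n_neighbors) is typically set:

1) greater than the minimum number of samples a cluster must contain, so that other samples can be local outliers relative to that cluster;
2) smaller than the maximum number of nearby samples that could themselves be local outliers.

In practice this information is usually unavailable, and n_neighbors=20 appears to work well.

Code: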

#_*_coding:utf-8_*_
import numpy as np
from sklearn.neighbors import LocalOutlierFactor as LOF
import matplotlib.pyplot as plt

# generate train data
X_inliers = 0.3 * np.random.randn(100, 2)
X_inliers = np.r_[X_inliers + 2, X_inliers - 2]


# generate some outliers
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X_inliers, X_outliers]

n_outliers = len(X_outliers)  # 20
ground_truth = np.ones(len(X), dtype=int)
ground_truth[-n_outliers:] = -1

# fit the model for outlier detection
clf = LOF(n_neighbors=20, contamination=0.1)

# use fit_predict to compute the predicted labels of the training samples
y_pred = clf.fit_predict(X)
n_errors = (y_pred != ground_truth).sum()
X_scores = clf.negative_outlier_factor_


plt.title('Local Outlier Factor (LOF)')
plt.scatter(X[:, 0], X[:, 1], color='k', s=3., label='Data points')
# plot circles with radius proportional to the outlier scores
radius = (X_scores.max() - X_scores) / (X_scores.max() - X_scores.min())
plt.scatter(X[:, 0], X[:, 1], s=1000*radius, edgecolors='r',
    facecolors='none', label='Outlier scores')
plt.axis('tight')
plt.xlim((-5, 5))
plt.ylim((-5, 5))
plt.xlabel("prediction errors: %d"%(n_errors))
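legend = plt.legend(loc='upper left')
# newer matplotlib releases expose legend_handles; older ones use legendHandles
handles = getattr(legend, 'legend_handles', None) or legend.legendHandles
handles[0]._sizes = [10]
handles[1]._sizes = [20]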
plt.show()

The result is as follows:

This plot may look a bit cluttered. Setting the number of outliers to 2 gives the following result:
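The post does not show how the number of outliers was set to 2. One possible way, sketched here under assumptions, is to pass `contamination=2 / len(X)` or simply to take the two lowest `negative_outlier_factor_` values; the two hand-placed outlier coordinates and the fixed random seed below are made up for the example.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
X_inliers = 0.3 * rng.randn(100, 2)
X_inliers = np.r_[X_inliers + 2, X_inliers - 2]
# two hand-placed outliers (assumed coordinates, far from both clusters)
X = np.r_[X_inliers, [[-3.5, 3.5], [4.0, -4.0]]]

# Option 1: ask for a contamination share that corresponds to two samples.
clf = LocalOutlierFactor(n_neighbors=20, contamination=2 / len(X))
y_pred = clf.fit_predict(X)

# Option 2: rank by score and keep the two most anomalous samples.
top2 = np.argsort(clf.negative_outlier_factor_)[:2]
print(np.flatnonzero(y_pred == -1))
print(np.sort(top2))
```

Ranking by score avoids depending on how the contamination threshold is interpolated, so it is the more direct choice when a fixed number of outliers is wanted.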

Example 3: Novelty detection

(Novelty detection: the training data contain no outliers, and the goal is to use the trained model to check newly observed samples.)

This example shows how to use LOF for novelty detection. Note that when LOF is used for novelty detection you MUST NOT use predict, decision_function and score_samples on the training set, as this would lead to wrong results. You must only use these methods on new unseen data (which are not in the training set).

The code is as follows:
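The difference between the two modes shows up in which methods an estimator exposes. The sketch below is a small check under assumed synthetic data; the variable names and the two query points are illustrative.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(100, 2)             # clean training data around the origin
X_new = np.array([[0.0, 0.1], [4.0, 4.0]])    # one typical and one far-away new point

outlier_mode = LocalOutlierFactor(n_neighbors=20).fit(X_train)
novelty_mode = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)

# predict is only exposed when novelty=True; fit_predict only when novelty=False.
print(hasattr(outlier_mode, "predict"))
print(hasattr(novelty_mode, "fit_predict"))

# In novelty mode, score only data that was not part of the training set.
print(novelty_mode.predict(X_new))
```

Keeping the training set and the scored set separate matters because, in novelty mode, a training point queried again would count itself among its own neighbors and look denser than it is.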

#_*_coding:utf-8_*_
import numpy as np
from sklearn.neighbors import LocalOutlierFactor as LOF
import matplotlib.pyplot as plt
import matplotlib


# np.meshgrid() builds the grid of coordinate points
xx, yy = np.meshgrid(np.linspace(-5, 5, 500), np.linspace(-5, 5, 500))

# generate normal  (not abnormal) training observations  
X = 0.3*np.random.randn(100, 2)
X_train = np.r_[X+2, X-2]

# generate new normal (not abnormal) observations
X = 0.3*np.random.randn(20, 2)
X_test = np.r_[X+2, X-2]

# generate some abnormal novel observations
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))


# fit the model for novelty detection  (novelty=True)
clf = LOF(n_neighbors=20, contamination=0.1, novelty=True)
clf.fit(X_train)

# do not use predict, decision_function and score_samples on X_train
# as this would give wrong results but only on new unseen data(not 
# used in X_train , eg: X_test, X_outliers or the meshgrid)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)
'''
### contamination=0.1
X_test: [ 1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1 -1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1 -1  1  1 -1  1  1]

### contamination=0.01
X_test: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1]

y_pred_outliers: [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
'''

n_error_test = y_pred_test[y_pred_test == -1].size
n_error_outliers = y_pred_outliers[y_pred_outliers == 1].size

# plot the learned frontier, the points, and the nearest vectors to the plane
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.title('Novelty Detection with LOF')
plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), 0, 7), cmap=plt.cm.PuBu)
a = plt.contour(xx, yy, Z, levels=[0], linewidths=2, colors='darkred')
plt.contourf(xx, yy, Z, levels=[0, Z.max()], colors='palevioletred')

s = 40
b1 = plt.scatter(X_train[:, 0], X_train[:, 1], c='white', s=s, edgecolors='k')
b2 = plt.scatter(X_test[:, 0], X_test[:, 1], c='blueviolet', s=s, edgecolors='k')

c = plt.scatter(X_outliers[:, 0], X_outliers[:, 1], c='gold', s=s, edgecolors='k')

plt.axis('tight')
plt.xlim((-5, 5))
plt.ylim((-5, 5))
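# legend_elements() supplies a proxy handle for the contour line
# (ContourSet.collections is deprecated in newer matplotlib releases)
plt.legend([a.legend_elements()[0][0], b1, b2, c],
            ["learned frontier", "training observations",
            "new regular observations", "new abnormal observations"],
            loc='upper left',
            prop=matplotlib.font_manager.FontProperties(size=11))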

# X_test holds 40 points, X_outliers holds 20
plt.xlabel("errors novel regular:%d/40; errors novel abnormal: %d/20"
    %(n_error_test, n_error_outliers))
plt.show()

The result is as follows:

After tuning the model above and setting the number of outliers to 2, the result is as follows:

References:

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html?highlight=lof

https://blog.csdn.net/YE1215172385/article/details/79766906

https://blog.csdn.net/bbbeoy/article/details/80301211