
Introduction

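This paper $\color{red}^{[1]}$ makes the following contributions:
1) It proposes a new data-dependent kernel, the Isolation kernel. Unlike existing data-dependent kernels, it neither uses nor learns class information.
2) It evaluates the partitioning mechanism of the Isolation kernel: the mechanism must produce large isolating partitions in sparse regions and small isolating partitions in dense regions. This requirement gives the Isolation kernel the following characteristic: two points in a sparse region are more similar than two points separated by the same distance in a dense region.
3) It explains why the Isolation kernel is suitable for SVM and why it improves predictive accuracy.
4) It compares the Isolation kernel with the RBF kernel, the Laplacian kernel, multiple kernel learning, and distance metric learning.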

[1] Kai Ming Ting, Yue Zhu, and Zhi-Hua Zhou. 2018. Isolation Kernel and Its Effect on SVM. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18). Association for Computing Machinery, New York, NY, USA, 2329–2337. DOI: https://doi.org/10.1145/3219819.3219990

1 Isolation Kernel: Definition

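Some of the notation is listed below:

Symbol	Meaning
$D = \{ \mathbf{x}_1, \cdots, \mathbf{x}_n \}, \mathbf{x}_i \in \mathbb{R}^d$	Samples drawn from an unknown probability density function, $\mathbf{x}_i \sim F$
$\mathcal{H}_\psi (D)$	The set of all partitionings $H$
$\mathcal{D} \subset D, \mid \mathcal{D} \mid = \psi$	A random subset of $D$
$\theta \in H$	An isolating partition, which separates one point from the other points in $\mathcal{D}$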

Definition 1.1. Given any two points $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$ , their Isolation kernel with respect to $D$ is defined as the expectation, taken over all partitionings $H$ , that $\mathbf{x}$ and $\mathbf{y}$ fall into the same isolating partition $\theta$ :
$\tag{1} K_\psi (\mathbf{x}, \mathbf{y} \mid D) = \mathbb{E}_{\mathcal{H}_\psi (D)} \left[ \mathbb{I} (\mathbf{x}, \mathbf{y} \in \theta \mid \theta \in H) \right]$ where $\mathbb{I} (B)$ is the indicator function:
$\mathbb{I} (B) = \left \{ \begin{matrix} 1 & B \text{ is true};\\ 0 & \text{otherwise} \end{matrix} \right.$ In practice, the Isolation kernel is estimated from a finite number of partitionings $H_i \in \mathcal{H}_\psi (D), i = 1, \cdots, t$ :
$\tag{2} K_\psi (\mathbf{x}, \mathbf{y} \mid D) = \frac{1}{t} \sum_{i = 1}^t \mathbb{I} (\mathbf{x}, \mathbf{y} \in \theta \mid \theta \in H_i)$

Lemma 1.2. $K_\psi (\mathbf{x}, \mathbf{y} \mid D)$ is a valid kernel (see the original paper for the proof).
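The finite-sample estimate in Eq. (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each partitioning $H_i$ is built from a random subset of $\psi$ points, and a point is assigned to the cell of its nearest subset point (Voronoi cells). That nearest-neighbour partitioning is a stand-in chosen for brevity, not the iForest mechanism the paper uses; the function name and the default values of `psi`, `t`, and `seed` are hypothetical.

```python
import numpy as np

def isolation_kernel_voronoi(x, y, D, psi=8, t=200, seed=0):
    """Estimate K_psi(x, y | D) as the fraction of t partitionings in which
    x and y share a cell, following Eq. (2).

    Assumption: each partitioning is the Voronoi diagram of psi points
    sampled from D, used here as a simple stand-in isolation mechanism.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    D = np.asarray(D, dtype=float)
    same = 0
    for _ in range(t):
        # Random subset of size psi; each of its points owns one cell.
        centers = D[rng.choice(len(D), size=psi, replace=False)]
        cell_x = np.argmin(np.linalg.norm(centers - x, axis=1))
        cell_y = np.argmin(np.linalg.norm(centers - y, axis=1))
        # Indicator I(x, y in the same theta) for this partitioning.
        same += int(cell_x == cell_y)
    return same / t
```

Calling `isolation_kernel_voronoi(D[0], D[1], D)` on an `(n, d)` array `D` returns a value in $[0, 1]$; a larger `t` reduces the variance of the estimate, and a larger `psi` produces smaller cells and therefore a sharper kernel.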

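For now, assume that $H$ satisfies the requirement stated in contribution 2).
Let $\mathcal{X}_S$ and $\mathcal{X}_T$ denote the subsets of points in a sparse region and in a dense region, respectively, so that their probability densities satisfy $P (\mathcal{X}_S) < P (\mathcal{X}_T)$ , and let $\| \mathbf{x} - \mathbf{y} \|$ denote the distance between two points.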

Property of $K_\psi$ : $\forall \mathbf{x}, \mathbf{y} \in \mathcal{X}_S$ and $\forall \mathbf{x}', \mathbf{y}' \in \mathcal{X}_T$ such that $\| \mathbf{x} - \mathbf{y} \| = \| \mathbf{x}' - \mathbf{y}' \|$ , the kernel satisfies:
$\tag{3} K_\psi (\mathbf{x}, \mathbf{y}) > K_\psi (\mathbf{x}', \mathbf{y}')$

1.1 Partitioning Mechanism

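The isolation mechanism used is iForest $\color{red}^{[1]}$ . The accompanying figure in the original post contrasts the Laplacian kernel, the Isolation kernel, and the RBF kernel under a uniform density distribution.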

[1] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In Proceedings of the IEEE International Conference on Data Mining, pages 413–422, 2008.
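An iForest-style partitioning can be sketched as follows. This is a minimal illustration rather than the authors' implementation: each tree is grown on a random subset of $\psi$ points with random axis-parallel cuts until every subset point sits alone in a leaf (growing without a height limit is an assumption made here), and the kernel value is the fraction of trees in which two query points reach the same leaf. All function names and default parameter values are hypothetical.

```python
import numpy as np

def build_itree(S, rng):
    """Split the subset S with random axis-parallel cuts until each point
    is isolated. A leaf is represented by None; an internal node is the
    tuple (split_dimension, split_value, left_child, right_child)."""
    if len(S) <= 1:
        return None  # leaf: one isolating partition
    spans = S.max(axis=0) - S.min(axis=0)
    dims = np.flatnonzero(spans > 0)
    if len(dims) == 0:
        return None  # identical points cannot be separated further
    q = rng.choice(dims)                               # random dimension
    p = rng.uniform(S[:, q].min(), S[:, q].max())      # random cut point
    left, right = S[S[:, q] < p], S[S[:, q] >= p]
    return (q, p, build_itree(left, rng), build_itree(right, rng))

def same_leaf(node, x, y):
    """Return True if x and y follow the same path down to a leaf."""
    while node is not None:
        q, p, left, right = node
        x_left, y_left = x[q] < p, y[q] < p
        if x_left != y_left:
            return False  # the cut separates x from y
        node = left if x_left else right
    return True

def isolation_kernel_iforest(x, y, D, psi=16, t=100, seed=0):
    """Estimate K_psi(x, y | D) with t isolation trees, as in Eq. (2)."""
    rng = np.random.default_rng(seed)
    x, y, D = (np.asarray(a, dtype=float) for a in (x, y, D))
    hits = 0
    for _ in range(t):
        subset = D[rng.choice(len(D), size=psi, replace=False)]
        hits += same_leaf(build_itree(subset, rng), x, y)
    return hits / t
```

Because the cuts are drawn between the minimum and maximum of the subset in each node, dense regions receive more cuts and end up with smaller leaves, which is the behaviour contribution 2) asks of the partitioning mechanism.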

1.2 $K_\psi$ under a Uniform Density Distribution

1.2.1 Completely Random Trees in Breiman's Analysis

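Breiman's analysis $\color{red}^{[1]}$ is based on completely random trees, which can be generated without any data. For tree depth $\geq 5$ and a number of leaf nodes $\leq \exp(d / 2)$ , the resulting kernel is approximated by the Laplacian kernel:
$\tag{4} L (\mathbf{x}, \mathbf{y}) = \exp \left (- \lambda \sum_{J = 1}^d | x_J - y_J | \right)$ where $\mathbf{x} = \langle x_1, \cdots, x_J, \cdots, x_d \rangle$ and $\lambda$ determines the sharpness of the kernel.
Under a uniform density distribution, this kernel is equivalent to the Isolation kernel implemented with iForest.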
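For reference, Eq. (4) translates directly into code. The sketch below assumes NumPy and treats `lam` as the sharpness parameter $\lambda$ ; the function name is hypothetical.

```python
import numpy as np

def laplacian_kernel(x, y, lam):
    """Laplacian kernel of Eq. (4): exp(-lam * L1 distance between x and y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.exp(-lam * np.abs(x - y).sum()))
```

Unlike the Isolation kernel, this value depends only on the difference between the two points, so it is the same in sparse and dense regions.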

[1] Leo Breiman. Some infinity theory for predictor ensembles. Technical Report 577. Statistics Dept. UCB., 2000.

1.2.2 A New Insight into the Laplacian Kernel

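Let $\psi$ denote the number of leaf nodes of a data-independent completely random tree. Breiman's analysis shows that the Laplacian kernel has $\lambda = \frac{\log(\psi)}{d}$ .
Substituting this value into Eq. (4), the Laplacian kernel can be re-expressed as:
$\tag{5} L (\mathbf{x}, \mathbf{y}) = \exp \left (- \frac{\log(\psi)}{d} \sum_{J = 1}^d | x_J - y_J | \right) = \psi^{- \frac{1}{d} \sum_{J = 1}^d | x_J - y_J |}$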
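Setting $\lambda = \log(\psi) / d$ in Eq. (4) turns the exponential into a power of $\psi$ , because $\exp(-\frac{\log \psi}{d} s) = \psi^{-s / d}$ for any $s$ . The short sketch below checks this identity numerically; both function names are hypothetical and NumPy is assumed.

```python
import numpy as np

def laplacian_exp_form(x, y, psi):
    """Eq. (4) with lambda = log(psi) / d."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    lam = np.log(psi) / len(diff)
    return float(np.exp(-lam * diff.sum()))

def laplacian_power_form(x, y, psi):
    """The same kernel written as a power of psi."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return float(psi ** (-diff.sum() / len(diff)))
```

In the power form, $\psi$ plays the role of the sharpness parameter: a larger number of leaf nodes gives a kernel that decays faster with distance.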

This article is also published on the blog 「因吉」 (CSDN).

Paper Reading (20): Isolation Kernel and Its Effect on SVM (2018)