Sparse Coding

Sparse coding is a class of unsupervised methods for learning sets of over-complete bases to represent data efficiently. The aim of sparse coding is to find a set of basis vectors \mathbf{\phi}_i such that we can represent an input vector \mathbf{x} as a linear combination of these basis vectors:

\begin{align}
\mathbf{x} = \sum_{i=1}^k a_i \mathbf{\phi}_{i} 
\end{align}
 
While techniques such as Principal Component Analysis (PCA) allow us to learn a complete set of basis vectors efficiently, we wish to learn an over-complete set of basis vectors to represent input vectors \mathbf{x}\in\mathbb{R}^n (i.e. such that k > n). The advantage of having an over-complete basis is that our basis vectors are better able to capture structures and patterns inherent in the input data. However, with an over-complete basis, the coefficients a_i are no longer uniquely determined by the input vector \mathbf{x}. Therefore, in sparse coding, we introduce the additional criterion of sparsity to resolve the degeneracy introduced by over-completeness.
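To make this non-uniqueness concrete, here is a minimal numerical sketch (in NumPy, with dimensions and random data chosen purely for illustration): with k > n, two different coefficient vectors can reconstruct the same input exactly.

import numpy as np

n, k = 8, 16                        # k > n, i.e. an over-complete basis
rng = np.random.default_rng(0)
Phi = rng.standard_normal((n, k))   # columns are the basis vectors phi_i
x = rng.standard_normal(n)          # an input vector

# One valid coefficient vector: the minimum-norm (pseudoinverse) solution.
a1 = np.linalg.pinv(Phi) @ x
# Another valid one: add any direction from the null space of Phi.
null_dir = np.linalg.svd(Phi)[2][-1]        # Phi @ null_dir is (numerically) zero
a2 = a1 + 5.0 * null_dir

print(np.allclose(Phi @ a1, x), np.allclose(Phi @ a2, x))   # both print True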

 

 


Here, we define sparsity as having few non-zero components or having few components not close to zero. The requirement that our coefficients a_i be sparse means that given an input vector, we would like as few of our coefficients to be far from zero as possible. The choice of sparsity as a desired characteristic of our representation of the input data can be motivated by the observation that most sensory data such as natural images may be described as the superposition of a small number of atomic elements such as surfaces or edges. Other justifications, such as comparisons to the properties of the primary visual cortex, have also been advanced.

We define the sparse coding cost function on a set of m input vectors as

\begin{align}
\text{minimize}_{a^{(j)}_i,\mathbf{\phi}_{i}} \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i)
\end{align}

where S(.) is a sparsity cost function which penalizes a_i for being far from zero.


We can interpret the first term of the sparse coding objective as a reconstruction term which tries to force the algorithm to provide a good representation of \mathbf{x}, and the second term as a sparsity penalty which forces our representation of \mathbf{x} to be sparse. The constant λ is a scaling constant that determines the relative importance of these two contributions.

Although the most direct measure of sparsity is the "L0" norm (S(a_i) = \mathbf{1}(|a_i|>0)), it is non-differentiable and difficult to optimize in general. In practice, common choices for the sparsity cost S(.) are the L1 penalty S(a_i)=\left|a_i\right|_1 and the log penalty S(a_i)=\log(1+a_i^2).
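For concreteness, the three sparsity measures mentioned above can be written as follows (a small NumPy sketch; the function names are ours, chosen only for this example):

import numpy as np

def l0_penalty(a):
    # "L0" measure: the number of non-zero coefficients (non-differentiable).
    return np.count_nonzero(a)

def l1_penalty(a):
    # L1 penalty: sum of absolute values of the coefficients.
    return np.sum(np.abs(a))

def log_penalty(a):
    # Log penalty: sum of log(1 + a_i^2).
    return np.sum(np.log1p(a ** 2))

a = np.array([0.0, 0.01, -2.0, 0.0, 3.0])
print(l0_penalty(a), l1_penalty(a), log_penalty(a))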

 

In addition, it is also possible to make the sparsity penalty arbitrarily small by scaling down a_i and scaling \mathbf{\phi}_i up by some large constant. To prevent this from happening, we will constrain \left|\left|\mathbf{\phi}_i\right|\right|^2 to be less than some constant C. The full sparse coding cost function including our constraint on \mathbf{\phi} is

\begin{array}{rc}
\text{minimize}_{a^{(j)}_i,\mathbf{\phi}_{i}} & \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i) 
\\
\text{subject to}  &  \left|\left|\mathbf{\phi}_i\right|\right|^2 \leq C, \forall i = 1,...,k 
\\
\end{array}
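The full objective can be evaluated directly; the sketch below (NumPy, with the L1 penalty as S(.) and randomly generated placeholder data) computes the cost and checks the norm constraint.

import numpy as np

def sparse_coding_cost(X, Phi, A, lam):
    # X: (n, m) input vectors, Phi: (n, k) basis vectors, A: (k, m) coefficients a_i^(j).
    reconstruction = np.sum((X - Phi @ A) ** 2)   # sum_j ||x^(j) - sum_i a_i^(j) phi_i||^2
    sparsity = lam * np.sum(np.abs(A))            # lambda * sum of S(a_i^(j)), with S = L1
    return reconstruction + sparsity

def satisfies_constraint(Phi, C):
    # Check that ||phi_i||^2 <= C for every basis vector (every column of Phi).
    return bool(np.all(np.sum(Phi ** 2, axis=0) <= C))

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 100))     # m = 100 inputs in R^64
Phi = rng.standard_normal((64, 128))   # k = 128 basis vectors
A = rng.standard_normal((128, 100))
print(sparse_coding_cost(X, Phi, A, lam=0.1), satisfies_constraint(Phi, C=1.0))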

 

Probabilistic Interpretation

 

So far, we have considered sparse coding as a way of finding a sparse, over-complete set of basis vectors that spans our input space. We can also approach it from a different angle and view sparse coding from a probabilistic standpoint, as a generative model.

We model natural images as a linear superposition of k independent source features \mathbf{\phi}_i with additive noise ν:

\begin{align}
\mathbf{x} = \sum_{i=1}^k a_i \mathbf{\phi}_{i} + \nu(\mathbf{x})
\end{align}

Our goal is to find a set of feature basis vectors \mathbf{\phi} such that the distribution of images P(\mathbf{x}\mid\mathbf{\phi}) is as close as possible to the empirical distribution of the input data P^*(\mathbf{x}). One way to achieve this is to minimize the KL divergence between P^*(\mathbf{x}) and P(\mathbf{x}\mid\mathbf{\phi}), where the KL divergence is given by:

\begin{align}
D(P^*(\mathbf{x})||P(\mathbf{x}\mid\mathbf{\phi})) = \int P^*(\mathbf{x}) \log \left(\frac{P^*(\mathbf{x})}{P(\mathbf{x}\mid\mathbf{\phi})}\right)d\mathbf{x}
\end{align}

Since the empirical distribution P^*(\mathbf{x}) is constant no matter how we choose \mathbf{\phi}, this is equivalent to maximizing the log-likelihood of P(\mathbf{x}\mid\mathbf{\phi}). Assuming that ν is Gaussian white noise with variance σ², we have:

\begin{align}
P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) = \frac{1}{Z} \exp\left(- \frac{(\mathbf{x}-\sum^{k}_{i=1} a_i \mathbf{\phi}_{i})^2}{2\sigma^2}\right)
\end{align}
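Taking the negative logarithm of this likelihood makes the link to the earlier cost function explicit: up to the additive constant \log Z and the scale 1/(2\sigma^2), it is exactly the squared reconstruction error.

\begin{align}
-\log P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) = \frac{1}{2\sigma^2}\left(\mathbf{x}-\sum^{k}_{i=1} a_i \mathbf{\phi}_{i}\right)^2 + \log Z
\end{align}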

To determine the distribution P(\mathbf{x}\mid\mathbf{\phi}), we also need to specify the prior distribution P(\mathbf{a}). Assuming that our feature variables are independent, we can factorize the prior probability as:

\begin{align}
P(\mathbf{a}) = \prod_{i=1}^{k} P(a_i)
\end{align}

At this point we bring in the sparsity assumption: any given image is built up from a relatively small number of source features. We therefore want the probability distribution of a_i to be peaked around zero, with a high peak at zero. A convenient parameterized prior distribution is:

\begin{align}
P(a_i) = \frac{1}{Z}\exp(-\beta S(a_i))
\end{align}

where S(a_i) is a function that determines the shape of the prior distribution.

Having defined P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) and P(\mathbf{a}), we can write the probability of the data \mathbf{x} under the model defined by \mathbf{\phi} as:

\begin{align}
P(\mathbf{x} \mid \mathbf{\phi}) = \int P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) P(\mathbf{a}) d\mathbf{a}
\end{align}

Our problem then reduces to finding:

\begin{align}
\mathbf{\phi}^*=\text{argmax}_{\mathbf{\phi}} < \log(P(\mathbf{x} \mid \mathbf{\phi})) >
\end{align}

where < . > denotes the expectation over the input data.

Unfortunately, the integration over \mathbf{a} required to obtain P(\mathbf{x} \mid \mathbf{\phi}) is generally intractable. We note, however, that if the distribution P(\mathbf{x} \mid \mathbf{\phi}) is sufficiently peaked (with respect to \mathbf{a}), we can approximate the integral above by its maximum value, obtaining:

\begin{align}
\mathbf{\phi}^{*'}=\text{argmax}_{\mathbf{\phi}} < \max_{\mathbf{a}} \log(P(\mathbf{x} \mid \mathbf{\phi}, \mathbf{a}) P(\mathbf{a})) >
\end{align}

As before, we can increase this estimated probability by scaling down a_i or scaling up \mathbf{\phi} (since P(a_i) peaks sharply around zero), so we again place a constraint on the feature vectors \mathbf{\phi} to prevent this.

Finally, we can define the energy function of this linear generative model and use it to re-express the original cost function:

\begin{array}{rl}
E\left( \mathbf{x} , \mathbf{a} \mid \mathbf{\phi} \right) & := -\log \left( P(\mathbf{x}\mid \mathbf{\phi},\mathbf{a})P(\mathbf{a})\right) \\
 &= \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i) 
\end{array}

where λ = 2σ²β, and constants that do not affect the optimization have been hidden. Since maximizing the log-likelihood is equivalent to minimizing the energy function, we can re-express the original optimization problem as:

\begin{align}
\mathbf{\phi}^{*},\mathbf{a}^{*}=\text{argmin}_{\mathbf{\phi},\mathbf{a}} \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i) 
\end{align}

From this probabilistic analysis, we see that choosing the L1 penalty or the \log(1+a_i^2) penalty as the function S(.) corresponds to using a Laplacian prior P(a_i) \propto \exp\left(-\beta|a_i|\right) or a Cauchy prior P(a_i) \propto \frac{\beta}{1+a_i^2}, respectively.
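As a quick check, taking the negative logarithm of these priors (and dropping the normalization constants) indeed recovers the corresponding penalty terms:

\begin{align}
-\log P(a_i) = \beta \left|a_i\right| + \text{const}
\qquad \text{and} \qquad
-\log P(a_i) = \log(1+a_i^2) + \text{const}
\end{align}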

 

Learning

Learning a set of basis vectors with sparse coding consists of performing two separate optimizations: the first is optimizing the coefficients a_i for each individual training example \mathbf{x}; the second is optimizing the basis vectors \mathbf{\phi} across many examples at once.

If the L1 norm is used as the sparsity penalty, learning a^{(j)}_i reduces to solving an L1-regularized least squares problem, which is convex in a^{(j)}_i, and many techniques exist for this problem (convex optimization software such as CVX can be used to solve the L1-regularized least squares problem). If S(.) is differentiable, such as the log penalty, gradient-based methods such as conjugate gradient can also be used.
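As an illustration, here is a minimal sketch of solving the coefficient sub-problem with iterative soft-thresholding (ISTA), one standard approach to L1-regularized least squares; the step size, iteration count, and random data are assumptions made only for this example.

import numpy as np

def soft_threshold(v, t):
    # Proximal operator of the L1 norm: shrink every entry toward zero by t.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_coefficients(x, Phi, lam, n_iter=200):
    # Minimize ||x - Phi @ a||^2 + lam * ||a||_1 over a, with Phi held fixed.
    step = 0.5 / np.linalg.norm(Phi, 2) ** 2        # 1/L, with L = 2 * ||Phi||_2^2 (gradient Lipschitz constant)
    a = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * Phi.T @ (Phi @ a - x)          # gradient of the reconstruction term
        a = soft_threshold(a - step * grad, step * lam)
    return a

rng = np.random.default_rng(0)
Phi = rng.standard_normal((64, 128))
x = rng.standard_normal(64)
a = ista_coefficients(x, Phi, lam=1.0)
print(np.count_nonzero(np.abs(a) > 1e-8), "non-zero coefficients out of", a.size)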

Learning the basis vectors under the L2 norm constraint likewise reduces to a least squares problem with quadratic constraints, which is convex in \mathbf{\phi}. Standard convex optimization software (such as CVX) or other iterative methods can be used to solve for \mathbf{\phi}, although more efficient methods also exist, such as solving the Lagrange dual.
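A corresponding sketch of the basis-vector step, with the coefficients held fixed; rather than the Lagrange-dual method mentioned above, it uses a simple heuristic, an unconstrained least-squares solve followed by rescaling any basis vector whose squared norm exceeds the bound C.

import numpy as np

def update_basis(X, A, C, ridge=1e-6):
    # Minimize ||X - Phi @ A||^2 over Phi with A fixed, then enforce ||phi_i||^2 <= C.
    k = A.shape[0]
    Phi = X @ A.T @ np.linalg.inv(A @ A.T + ridge * np.eye(k))   # least-squares solution
    norms_sq = np.sum(Phi ** 2, axis=0)                          # squared norm of each column phi_i
    scale = np.minimum(1.0, np.sqrt(C / np.maximum(norms_sq, 1e-12)))
    return Phi * scale                                           # rescale only the violating columns

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 100))    # m = 100 training vectors in R^64
A = rng.standard_normal((128, 100))   # coefficients for k = 128 basis vectors
Phi = update_basis(X, A, C=1.0)
print(np.max(np.sum(Phi ** 2, axis=0)))   # at most 1.0 (up to rounding)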

As the description above makes clear, sparse coding has an obvious limitation: even after a set of basis vectors has been learned, the optimization must be run again in order to "encode" a new data example, that is, to obtain its coefficients. This significant "runtime" cost means that sparse coding is computationally expensive even at test time, especially compared with typical feedforward architectures.
