[Scikit-learn] 1.1 Generalized Linear Models - Lasso Regression

Ref: http://blog.csdn.net/daunxx/article/details/51596877

Ref: https://www.youtube.com/watch?v=ipb2MhSRGdw

Ref: nullege.com/codes

 

 

OUTLINE


Overview

1. Lasso regression algorithms

Lasso Regression (L1)

  |-- Coordinate descent [fastest algorithm]

  |-- Least Angle Regression [best algorithm]

  |-- ElasticNet [hybrid algorithm]

  |-- Compressive sensing [ultimate application]

 

2. The Lasso regression model

Lasso is a linear model for estimating sparse parameters, and is particularly useful when the number of non-zero parameters should be reduced. For this reason, the Lasso regression model is widely used in compressed sensing. Mathematically, Lasso adds an L1 regularization term to a linear model.

For an intuition of what the L1 regularization term does, see: Sparsity and Some Basics of L1 Regularization

Lasso and "sparse solutions"

The figure plots the original least-squares solution, the LASSO solution, and the ridge regression solution. Using the same derivation (the ridge regularizer is smooth, so the derivation is much simpler), the ridge regression solution takes the following form:

  • Ridge regression only applies a global rescaling to the coefficients.
  • LASSO instead applies a soft thresholding (see the sketch below).
    • Coefficients whose absolute value falls below the threshold are set exactly to zero, which gives a convincing explanation of why LASSO produces sparse solutions.
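A small numerical sketch of the two update rules under an orthonormal design, where both solutions have closed forms; the coefficient values and alpha below are illustrative:

import numpy as np

# Least-squares coefficients under an orthonormal design X^T X = I (illustrative values)
beta_ols = np.array([3.0, -0.4, 0.05, -2.0, 0.6])
alpha = 0.5                                   # regularization strength

# Ridge: one global rescaling of every coefficient
beta_ridge = beta_ols / (1.0 + alpha)

# Lasso: soft thresholding -- shrink by alpha and zero out anything smaller
beta_lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - alpha, 0.0)

print("OLS  :", beta_ols)
print("Ridge:", beta_ridge)   # every entry shrunk, none exactly zero
print("Lasso:", beta_lasso)   # small entries become exactly zero -> sparsity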

 

  

Algorithm details

1. Coordinate descent

The fastest solver for Lasso regression

Coordinate descent is very fast on sparse matrices, and it is the fastest solver for Lasso regression (fastest does not necessarily mean best).

Full code: http://blog.csdn.net/daunxx/article/details/51596877

#!/usr/bin/python
# -*- coding: utf-8 -*-


"""
author : duanxxnj@163.com
time : 2016-06-06_15-41

Lasso 迴歸應用於稀疏信號

"""
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
import time

from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score

# Generate sparse data
np.random.seed(int(time.time()))
# 50 samples, 200 features
n_samples, n_features = 50, 200
# Design matrix drawn from a Gaussian distribution
X = np.random.randn(n_samples, n_features)
# Coefficient for each feature
coef = 3 * np.random.randn(n_features)
# Feature indices
inds = np.arange(n_features)
# Shuffle the indices
np.random.shuffle(inds)
# Keep the coefficients of only 10 features and set all others to zero,
# producing a sparse parameter vector
coef[inds[10:]] = 0
# Compute the target values y
y = np.dot(X, coef)
# Add noise to y
y += 0.01 * np.random.normal(size=n_samples)

# Split the data into training and test sets
n_samples = X.shape[0]
X_train, y_train = X[:n_samples // 2], y[:n_samples // 2]
X_test, y_test = X[n_samples // 2:], y[n_samples // 2:]

# Lasso regularization strength
alpha = 0.1
lasso = Lasso(max_iter=10000, alpha=alpha)

# Fit on the training data, then predict on the test data
# (this uses the coordinate descent algorithm)
y_pred_lasso = lasso.fit(X_train, y_train).predict(X_test)

# R^2 is the coefficient of determination: the proportion of the total
# variation explained by the regression, used as an overall measure of
# goodness of fit. The larger R^2 is, the better the model explains the
# data; a small R^2 means the model fits the observations poorly.
# The best possible R^2 score is 1.
r2_score_lasso = r2_score(y_test, y_pred_lasso)

print("R^2 on the test set : %f" % r2_score_lasso)

plt.plot(lasso.coef_, label='Lasso coefficients')
plt.plot(coef, '--', label='original coefficients')
plt.legend(loc='best')

plt.show()

 

The final result for the 200 coefficients is shown below:

 

Viewing the coefficient trace plot (trace colors can be configured via http://matplotlib.org/api/colors_api.html):
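A minimal sketch of setting the trace colors explicitly when plotting a Lasso path; the color list and the synthetic data below are just examples:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path

# Small synthetic problem, only to produce a few coefficient traces
rng = np.random.RandomState(0)
X = rng.randn(40, 8)
y = X[:, :3].sum(axis=1) + 0.1 * rng.randn(40)

alphas, coefs, _ = lasso_path(X, y)

# Set the trace colors explicitly (see the matplotlib colors API linked above)
plt.gca().set_prop_cycle(color=['b', 'r', 'g', 'c', 'm', 'y', 'k', 'orange'])
plt.plot(-np.log10(alphas), coefs.T)
plt.xlabel('-log10(alpha)')
plt.ylabel('coefficients')
plt.title('Lasso coefficient traces')
plt.show()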

 

Standardize data
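A minimal sketch of standardizing the features before fitting; X_train, y_train, X_test, y_test and alpha are assumed to come from the script above:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score

# Standardize the features, then fit Lasso on the scaled data
scaled_lasso = make_pipeline(StandardScaler(), Lasso(alpha=alpha, max_iter=10000))
y_pred_scaled = scaled_lasso.fit(X_train, y_train).predict(X_test)
print("R^2 on the test set with standardized data : %f" % r2_score(y_test, y_pred_scaled))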

  

A comparison of the run results:

 

 

2. Least Angle Regression (LARS)

Least Angle Regression is a regression algorithm for high-dimensional data, developed by Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani.

The LARS model can be used through the estimator Lars, or through its low-level implementation lars_path.

LassoLars is a lasso model implemented using the LARS algorithm; unlike the implementation based on coordinate descent, it yields the exact solution, which is piecewise linear as a function of the norm of its coefficients.
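A minimal sketch of both interfaces on the diabetes dataset; the parameter values n_nonzero_coefs=6 and alpha=0.1 are just examples:

from sklearn import datasets, linear_model

X, y = datasets.load_diabetes(return_X_y=True)

# High-level estimators
lars = linear_model.Lars(n_nonzero_coefs=6).fit(X, y)
lasso_lars = linear_model.LassoLars(alpha=0.1).fit(X, y)
print("Lars coefficients      :", lars.coef_)
print("LassoLars coefficients :", lasso_lars.coef_)

# Low-level path function
alphas, _, coefs = linear_model.lars_path(X, y, method='lasso')
print("path computed at %d alpha values" % len(alphas))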

 

Advantages of LARS

    • It is numerically efficient when p >> n (i.e., when the number of dimensions is much larger than the number of samples).
    • It is computationally about as fast as forward selection and has the same order of complexity as ordinary least squares.
    • It produces a full piecewise-linear solution path, which is useful for cross-validation or similar attempts at tuning the model (see the sketch after this list).
    • If two variables are almost equally correlated with the response, their coefficients should increase at roughly the same rate. The algorithm therefore behaves as intuition would expect, and is also more stable.
    • It is easily modified to produce solutions for other estimators, like the Lasso.
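As noted in the cross-validation point above, the piecewise-linear path makes model selection cheap; a minimal sketch with LassoLarsCV (cv=5 is just an example):

from sklearn import datasets
from sklearn.linear_model import LassoLarsCV

X, y = datasets.load_diabetes(return_X_y=True)

# Cross-validate over the full LARS path to pick alpha
model = LassoLarsCV(cv=5).fit(X, y)
print("chosen alpha                   : %f" % model.alpha_)
print("number of non-zero coefficients: %d" % (model.coef_ != 0).sum())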

 

Disadvantages of LARS

    • Because LARS is based on repeatedly refitting the residuals, it is particularly sensitive to noise. This problem is discussed in detail in the discussion section of the Efron et al. (2004) Annals of Statistics article.

 

Basic idea

Before describing the Least Angle Regression algorithm, we first introduce two simpler and more intuitive forward algorithms, on which least angle regression is built (a toy sketch follows the list):

    • Forward selection
    • Forward stagewise (forward gradient) algorithm
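A toy sketch of the forward stagewise idea (an illustrative implementation, not the one used inside scikit-learn; the step size eps and iteration count are arbitrary):

import numpy as np

def forward_stagewise(X, y, eps=0.01, n_steps=2000):
    """Toy forward stagewise regression: at each step, nudge the coefficient
    of the feature most correlated with the current residual."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize the features
    y = y - y.mean()
    beta = np.zeros(X.shape[1])
    residual = y.copy()
    for _ in range(n_steps):
        corr = X.T @ residual                  # correlation with the residual
        j = np.argmax(np.abs(corr))            # most correlated feature
        step = eps * np.sign(corr[j])          # small step in that direction
        beta[j] += step
        residual -= step * X[:, j]
    return beta

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = X @ np.array([2.0, 0.0, -1.5, 0.0, 0.5]) + 0.1 * rng.randn(100)
print(forward_stagewise(X, y).round(2))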

 

Code in practice

A well-written post: http://blog.csdn.net/xbinworld/article/details/44284293

In fact, both Lasso and the Stagewise method are variants of Least Angle Regression (LARS).

LARS does not need to go through so many small iterations; each step moves as far as possible in the chosen direction, so it is computationally very fast. The example below computes the Lasso path on the diabetes dataset with LARS.

#!/usr/bin/env python
"""
=====================
Lasso path using LARS
=====================

Computes Lasso Path along the regularization parameter using the LARS
algorithm on the diabetes dataset. Each color represents a different
feature of the coefficient vector, and this is displayed as a function
of the regularization parameter.
"""
print(__doc__)

# Author: Fabian Pedregosa <fabian.pedregosa@inria.fr>
#         Alexandre Gramfort <alexandre.gramfort@inria.fr>
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt

from sklearn import linear_model
from sklearn import datasets

# Get data
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

# Training: compute the Lasso path with the LARS algorithm
print("Computing regularization path using the LARS ...")
alphas, _, coefs = linear_model.lars_path(X, y, method='lasso', verbose=True)

# Normalize the x-axis to |coef| / max|coef|
xx = np.sum(np.abs(coefs.T), axis=1)
xx /= xx[-1]

plt.plot(xx, coefs.T)
ymin, ymax = plt.ylim()
plt.vlines(xx, ymin, ymax, linestyle='dashed')
plt.xlabel('|coef| / max|coef|')
plt.ylabel('Coefficients')
plt.title('LASSO Path')
plt.axis('tight')
plt.show()

Result:

 

 

3. ElasticNet

When it is useful

ElasticNet is a linear regression model that uses both L1 and L2 priors as its regularizer.

This combination allows for a sparse model in which only a few weights are non-zero, like Lasso, while still retaining the regularization properties of Ridge.

 

How it works

We can control the convex combination (a special kind of linear combination) of L1 and L2 with the l1_ratio parameter.

Elastic-Net is useful when several features are correlated with one another: Lasso tends to pick one of them at random, whereas Elastic-Net tends to pick both.

In practice, one advantage of trading off between Lasso and Ridge is that it allows Elastic-Net to inherit some of Ridge's stability under rotation.
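A small illustration of that behaviour, a minimal sketch with two identical (perfectly correlated) copies of one feature; the data and parameter values are arbitrary:

import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.RandomState(0)
x = rng.randn(100)
X = np.column_stack([x, x, rng.randn(100)])   # columns 0 and 1 are identical
y = 3 * x + 0.1 * rng.randn(100)

# Lasso tends to put all the weight on one of the duplicated columns,
# while Elastic-Net tends to spread it across both.
print("Lasso       :", Lasso(alpha=0.1).fit(X, y).coef_.round(2))
print("Elastic-Net :", ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_.round(2))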

The objective function minimized by Elastic-Net is:

\min_{w} \; \frac{1}{2 n_{\text{samples}}} \|X w - y\|_2^2 + \alpha \rho \|w\|_1 + \frac{\alpha (1 - \rho)}{2} \|w\|_2^2
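In scikit-learn's parametrization, alpha plays the role of α and l1_ratio the role of ρ in the formula above; a minimal sketch of the mapping (the values are arbitrary):

alpha, l1_ratio = 0.1, 0.7                    # ElasticNet(alpha=..., l1_ratio=...)
l1_weight = alpha * l1_ratio                  # multiplies ||w||_1
l2_weight = alpha * (1 - l1_ratio) / 2.0      # multiplies ||w||_2^2
print(l1_weight, l2_weight)
# l1_ratio=1.0 recovers the Lasso penalty; l1_ratio=0.0 recovers a (scaled) Ridge penalty.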

 

代碼展現

###############################################################################
# ElasticNet
from sklearn.linear_model import ElasticNet

enet = ElasticNet(alpha=alpha, l1_ratio=0.7)

y_pred_enet = enet.fit(X_train, y_train).predict(X_test)
r2_score_enet = r2_score(y_test, y_pred_enet)
print(enet)
print("r^2 on test data : %f" % r2_score_enet)

plt.plot(enet.coef_, label='Elastic net coefficients')
plt.plot(lasso.coef_, label='Lasso coefficients')
plt.plot(coef, '--', label='original coefficients')
plt.legend(loc='best')
plt.title("Lasso R^2: %f, Elastic Net R^2: %f"
          % (r2_score_lasso, r2_score_enet))
plt.show()

 

print("Computing regularization path using the elastic net...") alphas_enet, coefs_enet, _ = enet_path( X, y, eps=eps, l1_ratio=0.8, return_models=False, fit_intercept=False) print("Computing regularization path using the positve elastic net...") alphas_positive_enet, coefs_positive_enet, _ = enet_path( X, y, eps=eps, l1_ratio=0.8, positive=True, return_models=False, fit_intercept=False) 
pl.figure(
3) ax = pl.gca() ax.set_color_cycle(2 * ['b', 'r', 'g', 'c', 'k']) l1 = pl.plot(-np.log10(alphas_enet), coefs_enet.T) l2 = pl.plot(-np.log10(alphas_positive_enet), coefs_positive_enet.T, linestyle='--') pl.xlabel('-Log(alpha)') pl.ylabel('coefficients') pl.title('Elastic-Net and positive Elastic-Net') pl.legend((l1[-1], l2[-1]), ('Elastic-Net', 'positive Elastic-Net'), loc='lower left') pl.axis('tight') pl.show()

Result:

 

Comparison of the three models

 

 

 

4. Compressive sensing

Ref: A first look at compressive sensing (初識壓縮感知 Compressive Sensing)

Ref: Compressive sensing, continued: on sparse matrices (壓縮感知進階——有關稀疏矩陣)

If we want to collect only a small fraction of the data and still "decompress" a large amount of information from it, two things must hold:

First: the small amount of collected data contains global information about the original signal.

Interestingly, in certain settings this first condition is satisfied automatically. The most typical example is medical imaging, such as computed tomography (CT) and magnetic resonance imaging (MRI). Anyone with some familiarity with these techniques knows that in both cases the instrument does not acquire image pixels directly, but data obtained after a global Fourier transform of the image. In other words, every single measurement contains, to some degree, information about the whole image. In that situation, discarding part of the collected data does not permanently destroy part of the image information (it is still contained in the other measurements). This is exactly the situation we want.

Second: there exists an algorithm that can recover the original information from this small amount of data.

The second condition is due to the work of Terence Tao and Candès. They showed that if the signal (whether an image, audio, or any other kind of signal) satisfies a certain kind of "sparsity", then it is indeed possible to recover the original, much larger signal from these few measurements; the computation required is a fairly involved iterative optimization procedure, the so-called "L1 minimization" algorithm.

 

Compressive sensing:

The rationale is: if only 100,000 components are needed to reconstruct the bulk of the image, why bother making all 2,000,000 measurements; wouldn't 100,000 be enough? (In practice we leave a safety margin, say measuring 300,000 pixels, to cope with everything that can go wrong, from interference to quantization noise to failures of the recovery algorithm.) This gives roughly an order of magnitude in energy savings, which means little for consumer photography but is a real benefit for sensor networks.

However, as noted above, the camera does not know in advance which 100,000 of the two million wavelet coefficients need to be recorded. What if the camera picks a different 100,000 (or 300,000) and throws away precisely the useful information in the picture?

 

The solution is simple but not very intuitive: make the 300,000 measurements with something other than wavelets, even though wavelets were described above as the best way to observe and compress images. In fact, the best measurements turn out to be (pseudo-)random ones:

For example, randomly generate 300,000 mask (filter) images and measure how correlated the true image is with each mask. These measurements between the image and the masks (i.e. the "correlations") are then likely to be very small and very random.

But, and this is the key point, each of the 2 million candidate wavelet functions that make up the image produces its own distinctive "signature" under these random masks: each wavelet is positively correlated with some masks, negatively correlated with others, and uncorrelated with most.

Yet (with overwhelming probability) all 2 million signatures are different; moreover, linear combinations of any 100,000 of them are still all different (from a linear-algebra point of view, this is because any two 100,000-dimensional subspaces of a 300,000-dimensional space are almost certainly disjoint except at the origin).

Therefore it is in principle possible to recover the image (or at least its 100,000 most important details) from these 300,000 random mask measurements. In short, what we are describing is a linear-algebra version of a hash function.

 

PS: The idea can be illustrated in low dimensions. If two constraints in a low-dimensional space are linearly dependent, one of them is redundant. For example, among

① 3x + 2y = 8

② 6x + 4y = 16

③ 4x + 5y = 10

① and ② are linearly dependent, while ① and ③ are not. Reducing the 2 million wavelet functions to the 300,000 mask basis is analogous to removing the redundant constraint from the pair ① and ② (a quick numeric check follows).
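A minimal sketch of that check using numpy's matrix rank:

import numpy as np

# Augmented rows for equations ① 3x+2y=8, ② 6x+4y=16, ③ 4x+5y=10
eq1 = np.array([3, 2, 8])
eq2 = np.array([6, 4, 16])
eq3 = np.array([4, 5, 10])

# ① and ② are linearly dependent (rank 1); ① and ③ are independent (rank 2)
print(np.linalg.matrix_rank(np.vstack([eq1, eq2])))   # -> 1
print(np.linalg.matrix_rank(np.vstack([eq1, eq3])))   # -> 2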


However, this approach still has two technical problems.

  • The first is noise: the superposition of the 100,000 wavelet coefficients does not represent the whole image exactly; the other 1.9 million coefficients also make small contributions. Those small contributions may interfere with the signatures of the 100,000 important wavelets, which is the so-called "distortion" problem.
  • The second is how to use the 300,000 measurements obtained to reconstruct the image.


Let us first look at the second problem (how to recover the image). If we knew which 100,000 of the 2 million wavelets were the relevant ones, we could reconstruct the signal with standard linear algebra (Gaussian elimination, least squares, and so on). (This is one of the biggest advantages of linear encodings: they are much easier to invert than non-linear ones. Most hash transforms are practically impossible to invert, which is a great advantage in cryptography but not in signal recovery.) But, as noted above, we do not know in advance which wavelets are the useful ones. How do we find them?

A naive least-squares fit would produce a dreadful answer involving all 2 million coefficients, and the resulting image would be full of granular noise. One could instead run a brute-force search, solving a linear-algebra problem for every possible set of 100,000 key coefficients, but that would take an absurd amount of time (roughly 10^170,000 combinations to consider!), and such searches are in general NP-complete (some special cases are the so-called "subset sum" problem). Fortunately, there are two practical ways to recover the data:

 

•  Matching pursuit:

Find a wavelet whose signature seems to correlate with the collected data;

remove all traces of that signature from the data;

and repeat until all of the collected data has been "explained" in terms of wavelet signatures.

Matching pursuit: locate a wavelet whose signature seems to correlate with the data collected; remove all traces of that signature from the data; and repeat until we have totally 「explained」 the data collected in terms of wavelet signatures.

In other words, first find a wavelet that looks like part of the basis, then remove that wavelet's contribution from the image, and iterate until all 100,000 wavelets have been found.
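A minimal sketch of this greedy idea using scikit-learn's OrthogonalMatchingPursuit (a common refinement of plain matching pursuit) on a synthetic sparse-recovery problem; the problem sizes and values below are arbitrary:

import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.RandomState(0)
n_measurements, n_atoms, n_nonzero = 300, 1000, 10

# Random measurement matrix (the "masks") and a sparse coefficient vector
D = rng.randn(n_measurements, n_atoms)
w_true = np.zeros(n_atoms)
support = rng.choice(n_atoms, n_nonzero, replace=False)
w_true[support] = rng.uniform(1, 3, n_nonzero) * rng.choice([-1, 1], n_nonzero)
y = D @ w_true

# Greedily pick the atoms whose signatures best explain the data
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero).fit(D, y)
print("true support     :", sorted(support))
print("recovered support:", sorted(np.flatnonzero(omp.coef_)))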

  

•  Basis pursuit (also known as L1-norm minimization):

Among all combinations of wavelets consistent with the collected (image) data, find the "sparsest" one, i.e. the one whose coefficients have the smallest possible sum of absolute values. (This kind of minimization tends to force most of the coefficients to vanish.) It can be computed in reasonable time with convex-programming algorithms such as the simplex method.
Basis pursuit (or l^1 minimisation): Out of all the possible combinations of wavelets which would fit the data collected, find the one which is 「sparsest」 in the sense that the total sum of the magnitudes of all the coefficients is as small as possible. (It turns out that this particular minimisation tends to force most of the coefficients to vanish.) This type of minimisation can be computed in reasonable time via convex optimisation methods such as the simplex method.
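A minimal sketch of the L1-minimization route on the same kind of synthetic problem, using Lasso as a penalized stand-in for exact basis pursuit; the alpha value is illustrative:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(1)
n_measurements, n_atoms, n_nonzero = 300, 1000, 10

D = rng.randn(n_measurements, n_atoms)
w_true = np.zeros(n_atoms)
support = rng.choice(n_atoms, n_nonzero, replace=False)
w_true[support] = rng.uniform(1, 3, n_nonzero) * rng.choice([-1, 1], n_nonzero)
y = D @ w_true

# The L1 penalty drives most coefficients to exactly zero
lasso = Lasso(alpha=0.01, max_iter=10000).fit(D, y)
print("true support            :", sorted(support))
print("largest recovered coefs :", sorted(np.argsort(np.abs(lasso.coef_))[-n_nonzero:]))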

 

Note that this class of image-recovery algorithms still requires a fair amount of computing power (though nothing outrageous). In applications such as sensor networks this is not a problem, because the reconstruction is performed at the receiving end (which can be connected to powerful computers), not at the sensor end (which cannot).

There are now rigorous results showing that, for various compression ratios and sparsity levels of the original image, both algorithms reconstruct the image perfectly or near-perfectly with high probability.

  • Matching pursuit is usually faster;
  • basis pursuit tends to be more accurate once noise is taken into account.

 

  • Compressive sensing: tomography reconstruction with L1 prior (Lasso)

This example shows the reconstruction of an image from a set of parallel projections, acquired along different angles. Such a dataset is acquired in computed tomography (CT).

Without any prior information on the sample, the number of projections required to reconstruct the image is of the order of the linear size l of the image (in pixels). For simplicity we consider here a sparse image, where only pixels on the boundary of objects have a non-zero value. Such data could correspond for example to a cellular material. Note however that most images are sparse in a different basis, such as the Haar wavelets. Only l/7 projections are acquired, therefore it is necessary to use prior information available on the sample (its sparsity): this is an example of compressive sensing.

The tomography projection operation is a linear transformation. In addition to the data-fidelity term corresponding to a linear regression, we penalize the L1 norm of the image to account for its sparsity. The resulting optimization problem is called the Lasso. We use the class sklearn.linear_model.Lasso, that uses the coordinate descent algorithm. Importantly, this implementation is more computationally efficient on a sparse matrix, than the projection operator used here.

The reconstruction with L1 penalization gives a result with zero error (all pixels are successfully labeled with 0 or 1), even if noise was added to the projections. In comparison, an L2 penalization (sklearn.linear_model.Ridge) produces a large number of labeling errors for the pixels. Important artifacts are observed on the reconstructed image, contrary to the L1 penalization. Note in particular the circular artifact separating the pixels in the corners, that have contributed to fewer projections than the central disk.

The code is fairly involved and is left for later study; before that, it is worth reviewing wavelets first.

"""
======================================================================
Compressive sensing: tomography reconstruction with L1 prior (Lasso)
======================================================================

This example shows the reconstruction of an image from a set of parallel
projections, acquired along different angles. Such a dataset is acquired in
**computed tomography** (CT).

Without any prior information on the sample, the number of projections
required to reconstruct the image is of the order of the linear size
``l`` of the image (in pixels). For simplicity we consider here a sparse
image, where only pixels on the boundary of objects have a non-zero
value. Such data could correspond for example to a cellular material.
Note however that most images are sparse in a different basis, such as
the Haar wavelets. Only ``l/7`` projections are acquired, therefore it is
necessary to use prior information available on the sample (its
sparsity): this is an example of **compressive sensing**.

The tomography projection operation is a linear transformation. In
addition to the data-fidelity term corresponding to a linear regression,
we penalize the L1 norm of the image to account for its sparsity. The
resulting optimization problem is called the :ref:`lasso`. We use the
class :class:`sklearn.linear_model.Lasso`, that uses the coordinate descent
algorithm. Importantly, this implementation is more computationally efficient
on a sparse matrix, than the projection operator used here.

The reconstruction with L1 penalization gives a result with zero error
(all pixels are successfully labeled with 0 or 1), even if noise was
added to the projections. In comparison, an L2 penalization
(:class:`sklearn.linear_model.Ridge`) produces a large number of labeling
errors for the pixels. Important artifacts are observed on the
reconstructed image, contrary to the L1 penalization. Note in particular
the circular artifact separating the pixels in the corners, that have
contributed to fewer projections than the central disk.
"""

print(__doc__)

# Author: Emmanuelle Gouillart <emmanuelle.gouillart@nsup.org>
# License: BSD 3 clause

import numpy as np
from scipy import sparse
from scipy import ndimage
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
import matplotlib.pyplot as plt


def _weights(x, dx=1, orig=0):
    x = np.ravel(x)
    floor_x = np.floor((x - orig) / dx)
    alpha = (x - orig - floor_x * dx) / dx
    return np.hstack((floor_x, floor_x + 1)), np.hstack((1 - alpha, alpha))


def _generate_center_coordinates(l_x):
    X, Y = np.mgrid[:l_x, :l_x].astype(np.float64)
    center = l_x / 2.
    X += 0.5 - center
    Y += 0.5 - center
    return X, Y


def build_projection_operator(l_x, n_dir):
    """ Compute the tomography design matrix.

    Parameters
    ----------

    l_x : int
        linear size of image array

    n_dir : int
        number of angles at which projections are acquired.

    Returns
    -------
    p : sparse matrix of shape (n_dir l_x, l_x**2)
    """
    X, Y = _generate_center_coordinates(l_x)
    angles = np.linspace(0, np.pi, n_dir, endpoint=False)
    data_inds, weights, camera_inds = [], [], []
    data_unravel_indices = np.arange(l_x ** 2)
    data_unravel_indices = np.hstack((data_unravel_indices,
                                      data_unravel_indices))
    for i, angle in enumerate(angles):
        Xrot = np.cos(angle) * X - np.sin(angle) * Y
        inds, w = _weights(Xrot, dx=1, orig=X.min())
        mask = np.logical_and(inds >= 0, inds < l_x)
        weights += list(w[mask])
        camera_inds += list(inds[mask] + i * l_x)
        data_inds += list(data_unravel_indices[mask])
    proj_operator = sparse.coo_matrix((weights, (camera_inds, data_inds)))
    return proj_operator


def generate_synthetic_data():
    """ Synthetic binary data """
    rs = np.random.RandomState(0)
    n_pts = 36
    x, y = np.ogrid[0:l, 0:l]
    mask_outer = (x - l / 2) ** 2 + (y - l / 2) ** 2 < (l / 2) ** 2
    mask = np.zeros((l, l))
    points = l * rs.rand(2, n_pts)
    mask[(points[0]).astype(int), (points[1]).astype(int)] = 1
    mask = ndimage.gaussian_filter(mask, sigma=l / n_pts)
    res = np.logical_and(mask > mask.mean(), mask_outer)
    return res - ndimage.binary_erosion(res)


# Generate synthetic images, and projections
l = 128
proj_operator = build_projection_operator(l, l // 7)
data = generate_synthetic_data()
proj = proj_operator * data.ravel()[:, np.newaxis]
proj += 0.15 * np.random.randn(*proj.shape)

# Reconstruction with L2 (Ridge) penalization
rgr_ridge = Ridge(alpha=0.2)
rgr_ridge.fit(proj_operator, proj.ravel())
rec_l2 = rgr_ridge.coef_.reshape(l, l)

# Reconstruction with L1 (Lasso) penalization
# the best value of alpha was determined using cross validation
# with LassoCV
rgr_lasso = Lasso(alpha=0.001)
rgr_lasso.fit(proj_operator, proj.ravel())
rec_l1 = rgr_lasso.coef_.reshape(l, l)

plt.figure(figsize=(8, 3.3))
plt.subplot(131)
plt.imshow(data, cmap=plt.cm.gray, interpolation='nearest')
plt.axis('off')
plt.title('original image')
plt.subplot(132)
plt.imshow(rec_l2, cmap=plt.cm.gray, interpolation='nearest')
plt.title('L2 penalization')
plt.axis('off')
plt.subplot(133)
plt.imshow(rec_l1, cmap=plt.cm.gray, interpolation='nearest')
plt.title('L1 penalization')
plt.axis('off')

plt.subplots_adjust(hspace=0.01, wspace=0.01, top=1, bottom=0, left=0,
                    right=1)

plt.show()

 

 

Currently studying: Real-Time Compressive Tracking, http://www4.comp.polyu.edu.hk/~cslzhang/CT/CT.htm

 

End.
