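[Machine Learning Experiments] Main Modules and Basic Usage of scikit-learn

Introduction

For newcomers who are a little afraid to get started with machine learning algorithms, figuring out how to get up to speed quickly can be a real struggle.
Among people who work in data science, the most commonly used tools are R and Python. Each has its strengths and weaknesses, but Python arguably comes out slightly ahead, largely because the scikit-learn library implements a wide range of machine learning algorithms.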

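Data Loading

We assume the input is a feature matrix or a CSV file.
First, the data has to be loaded into memory.
scikit-learn works on NumPy arrays, so we use NumPy to load the CSV file.
The data below is downloaded from the UCI Machine Learning Repository.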

import urllib.request

import numpy as np

# URL of the dataset (as given in the original post; it may have moved since)
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# download the file (in Python 3, urlopen lives in urllib.request)
raw_data = urllib.request.urlopen(url)
# load the CSV file as a NumPy array
dataset = np.loadtxt(raw_data, delimiter=",")
# separate the features (columns 0-7) from the target (column 8)
X = dataset[:, 0:8]
y = dataset[:, 8]

We will use this dataset as the running example, with the feature matrix as X and the target variable as y.
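If the download URL is not reachable, the same load-and-split step can be tried on a small in-memory CSV. This is only a sketch: the three rows below are made-up values for illustration, not records from the dataset, and the names `csv_text`, `X_demo` and `y_demo` are hypothetical.

```python
import io

import numpy as np

# Hypothetical CSV content: three feature columns followed by a target column.
csv_text = """1.0,85.0,66.0,0
8.0,183.0,64.0,1
1.0,89.0,66.0,0
"""

# np.loadtxt accepts any file-like object, so a StringIO stands in for a file.
dataset_demo = np.loadtxt(io.StringIO(csv_text), delimiter=",")

# Every column except the last is a feature; the last column is the target.
X_demo = dataset_demo[:, :-1]
y_demo = dataset_demo[:, -1]

print(X_demo.shape)
print(y_demo)
```

Slicing with `:-1` and `-1` avoids hard-coding the number of feature columns, which helps when the file layout changes.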

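Data Normalization

The gradient-based methods behind most machine learning algorithms are sensitive to the scale of the data. Before running an algorithm we should therefore normalize or standardize the features. Normalization rescales values to a common range or length (min-max scaling, for instance, maps each feature into the range 0 to 1), while standardization shifts each feature to zero mean and unit variance. scikit-learn provides functions for both: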

from sklearn import preprocessing

# normalize the data attributes (rescales each sample to unit norm)
normalized_X = preprocessing.normalize(X)
# standardize the data attributes (zero mean, unit variance per feature)
standardized_X = preprocessing.scale(X)
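The difference between these transforms is easier to see on a tiny array. The sketch below uses made-up values and adds `MinMaxScaler`, which is not used in the original post, as the transform that maps each feature into the range 0 to 1.

```python
import numpy as np
from sklearn import preprocessing

# Toy feature matrix (made-up values): two features on very different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

# normalize: rescales each ROW (sample) to unit L2 norm.
normalized = preprocessing.normalize(X_toy)

# scale: standardizes each COLUMN (feature) to zero mean and unit variance.
standardized = preprocessing.scale(X_toy)

# MinMaxScaler: maps each COLUMN (feature) linearly onto the range [0, 1].
min_max = preprocessing.MinMaxScaler().fit_transform(X_toy)

print(np.linalg.norm(normalized, axis=1))
print(standardized.mean(axis=0), standardized.std(axis=0))
print(min_max.min(axis=0), min_max.max(axis=0))
```

Note that `normalize` works per sample while the other two work per feature, so they are not interchangeable.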

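Feature Selection

When solving a real problem, the ability to choose the right features, or to construct new ones, is especially important. This is called feature selection or feature engineering.
Feature selection calls for a good deal of creativity and relies heavily on intuition and domain knowledge, but there are also many ready-made algorithms for selecting features.
The tree-based algorithm below computes how informative each feature is: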

from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier()
model.fit(X, y)
# display the relative importance of each attribute
print(model.feature_importances_)
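One way to act on the importances is to keep only the highest-ranked columns. The sketch below assumes synthetic data from `make_classification` instead of the diabetes data, and the choice of three features is arbitrary; the names `forest`, `top_k` and `X_reduced` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data: 8 features, of which 3 carry class information.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    random_state=0,
)

# A fixed random_state makes the importances reproducible between runs.
forest = ExtraTreesClassifier(n_estimators=100, random_state=0)
forest.fit(X_syn, y_syn)

# Feature indices sorted from most to least important.
ranking = np.argsort(forest.feature_importances_)[::-1]
top_k = ranking[:3]

# Keep only the three highest-ranked columns.
X_reduced = X_syn[:, top_k]
print(top_k, X_reduced.shape)
```

The importances sum to 1, so each value can be read as a share of the total importance assigned by the forest.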

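Using the Algorithms

scikit-learn implements most of the basic machine learning algorithms. Let's take a quick look at them.

Logistic Regression

Many problems can be reduced to binary classification. One advantage of this algorithm is that it outputs the probability that a sample belongs to each class.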

from sklearn import metrics
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Result:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
    intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)
             precision    recall  f1-score   support

        0.0       0.79      0.89      0.84       500
        1.0       0.74      0.55      0.63       268

avg / total       0.77      0.77      0.77       768

[[447  53]
 [120 148]]
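The class probabilities mentioned above are available through `predict_proba`. The following is a minimal sketch, assuming the breast cancer dataset bundled with scikit-learn as a stand-in (so that it runs without a download) and a scaling step that the original post does not use.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled binary classification data, used here only as a stand-in.
X_bc, y_bc = load_breast_cancer(return_X_y=True)

# Standardize the features first, then fit the logistic regression.
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_bc, y_bc)

# One row per sample, one column per class; each row sums to 1.
proba = clf.predict_proba(X_bc[:5])
print(clf.classes_)
print(proba.round(3))
```

The columns of `proba` follow the order of `clf.classes_`, so column 1 holds the estimated probability of class 1.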

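Naive Bayes

This is another well-known machine learning algorithm. Its task is to recover the distribution density of the training data, and it often performs well on multi-class classification.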

from sklearn import metrics
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Result:

GaussianNB()
             precision    recall  f1-score   support

        0.0       0.80      0.86      0.83       500
        1.0       0.69      0.60      0.64       268

avg / total       0.76      0.77      0.76       768

[[429  71]
 [108 160]]
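To make the idea of recovering the data distribution concrete, the fitted model can be inspected directly. This sketch assumes the three-class iris dataset bundled with scikit-learn, which is not part of the original post, and prints the class priors and per-class feature means that `GaussianNB` estimates.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Bundled three-class dataset, used here only as a stand-in.
X_iris, y_iris = load_iris(return_X_y=True)

nb = GaussianNB()
nb.fit(X_iris, y_iris)

# class_prior_: estimated probability of each class in the training data.
print(nb.class_prior_)
# theta_: per-class mean of every feature, shape (n_classes, n_features).
print(nb.theta_.round(2))
```

Each class gets its own Gaussian per feature, which is why the same estimator handles two classes or many without any change.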

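k-Nearest Neighbors

The k-nearest neighbors algorithm is often used as one component of a larger classification algorithm. For example, it can be used to evaluate features, which makes it useful in feature selection.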

from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

# fit a k-nearest neighbor model to the data
model = KNeighborsClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Result:

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
    n_neighbors=5, p=2, weights='uniform')
             precision    recall  f1-score   support

        0.0       0.82      0.90      0.86       500
        1.0       0.77      0.63      0.69       268

avg / total       0.80      0.80      0.80       768

[[448  52]
 [ 98 170]]

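Decision Trees

Classification and Regression Trees (CART) are commonly used for classification or regression problems whose features carry categorical information, and they are well suited to multi-class classification.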

from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

# fit a CART model to the data
model = DecisionTreeClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Result:

DecisionTreeClassifier(compute_importances=None, criterion='gini',
    max_depth=None, max_features=None, min_density=None,
    min_samples_leaf=1, min_samples_split=2, random_state=None,
    splitter='best')
             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00       500
        1.0       1.00      1.00      1.00       268

avg / total       1.00      1.00      1.00       768

[[500   0]
 [  0 268]]
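A perfect score here mostly reflects that the model is evaluated on the same samples it was trained on; an unrestricted tree can memorize them. A hedged sketch of a fairer check, assuming the bundled breast cancer dataset as a stand-in and a 70/30 split chosen arbitrarily, holds out part of the data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Bundled binary classification data, used here only as a stand-in.
X_bc, y_bc = load_breast_cancer(return_X_y=True)

# Hold out 30% of the samples; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_bc, y_bc, test_size=0.3, stratify=y_bc, random_state=0
)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

# Accuracy on the training part versus accuracy on unseen samples.
train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(train_acc, test_acc)
```

The held-out accuracy is the number to compare between models; the training accuracy of a fully grown tree says little about how it generalizes.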

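Support Vector Machines

SVM is a very popular machine learning algorithm, used mainly for classification. As with logistic regression, it can handle multi-class problems with a one-vs-rest scheme.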

from sklearn import metrics
from sklearn.svm import SVC

# fit an SVM model to the data
model = SVC()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Result:

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)
             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00       500
        1.0       1.00      1.00      1.00       268

avg / total       1.00      1.00      1.00       768

[[500   0]
 [  0 268]]
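The one-vs-rest scheme can be made explicit with `OneVsRestClassifier`. This is a sketch under assumptions: it uses the bundled three-class iris dataset and a linear kernel, neither of which comes from the original post. `SVC` accepts multi-class targets on its own; the wrapper is used here only to show the scheme.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Bundled three-class dataset, used here only as a stand-in.
X_iris, y_iris = load_iris(return_X_y=True)

# One binary SVM is trained per class: "this class" versus "all the others".
ovr = OneVsRestClassifier(SVC(kernel="linear"))
ovr.fit(X_iris, y_iris)

# Three classes, so three underlying binary classifiers.
print(len(ovr.estimators_))
print(ovr.predict(X_iris[:3]))
```

At prediction time the wrapper picks the class whose binary classifier is most confident.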

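Besides classification and regression algorithms, scikit-learn offers more advanced ones such as clustering, and it also implements techniques for combining models, such as Bagging and Boosting.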
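As a rough illustration of the two ensemble techniques, the sketch below compares a bagging ensemble with a gradient boosting model. It assumes the bundled breast cancer dataset and arbitrary ensemble sizes, and `GradientBoostingClassifier` is only one of several boosting implementations in scikit-learn.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Bundled binary classification data, used here only as a stand-in.
X_bc, y_bc = load_breast_cancer(return_X_y=True)

models = {
    # Bagging: many trees fit on bootstrap samples, predictions aggregated.
    "bagging": BaggingClassifier(n_estimators=25, random_state=0),
    # Boosting: trees added one after another, each correcting the earlier ones.
    "boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# 5-fold cross-validated accuracy, so scores are not measured on training data.
scores = {name: cross_val_score(m, X_bc, y_bc, cv=5).mean()
          for name, m in models.items()}
print(scores)
```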

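How to Optimize Algorithm Parameters

A harder task is to build an effective procedure for choosing the right parameters, which requires searching over candidate values. scikit-learn provides functions for this purpose.
The example below selects a regularization parameter: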

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV  # formerly sklearn.grid_search

# prepare a range of alpha values to test
alphas = np.array([1, 0.1, 0.01, 0.001, 0.0001, 0])
# create and fit a ridge regression model, testing each alpha
model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
grid.fit(X, y)
print(grid)
# summarize the results of the grid search
# (best mean cross-validated score; R^2 by default for a regressor)
print(grid.best_score_)
print(grid.best_estimator_.alpha)

Result:

GridSearchCV(cv=None,
    estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
        normalize=False, solver='auto', tol=0.001),
    estimator__alpha=1.0, estimator__copy_X=True,
    estimator__fit_intercept=True, estimator__max_iter=None,
    estimator__normalize=False, estimator__solver='auto',
    estimator__tol=0.001, fit_params={}, iid=True, loss_func=None,
    n_jobs=1,
    param_grid={'alpha': array([ 1.00000e+00, 1.00000e-01, 1.00000e-02, 1.00000e-03,
        1.00000e-04, 0.00000e+00])},
    pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
    verbose=0)
0.282118955686
1.0
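Grid search also works on a whole pipeline, so that preprocessing is refit inside every cross-validation fold. A minimal sketch, assuming the bundled breast cancer dataset, a logistic regression classifier and an arbitrary grid of `C` values (the step names `scale` and `clf` are hypothetical):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Bundled binary classification data, used here only as a stand-in.
X_bc, y_bc = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# "<step name>__<parameter>" addresses a parameter of a pipeline step.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring="accuracy")
search.fit(X_bc, y_bc)

print(search.best_params_)
print(search.best_score_)
```

Setting `scoring` explicitly documents what `best_score_` measures, instead of relying on the estimator's default.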

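Sometimes it is more efficient to sample parameter values at random from a given range, evaluate the algorithm with each sample, and then pick the best one.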

import numpy as np
from scipy.stats import uniform as sp_rand
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV  # formerly sklearn.grid_search

# prepare a uniform distribution on [0, 1] to sample the alpha parameter from
param_grid = {'alpha': sp_rand()}
# create and fit a ridge regression model, testing random alpha values
model = Ridge()
rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)
rsearch.fit(X, y)
print(rsearch)
# summarize the results of the random parameter search
print(rsearch.best_score_)
print(rsearch.best_estimator_.alpha)

Result:

RandomizedSearchCV(cv=None,
    estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
        normalize=False, solver='auto', tol=0.001),
    estimator__alpha=1.0, estimator__copy_X=True,
    estimator__fit_intercept=True, estimator__max_iter=None,
    estimator__normalize=False, estimator__solver='auto',
    estimator__tol=0.001, fit_params={}, iid=True, n_iter=100,
    n_jobs=1,
    param_distributions={'alpha': <scipy.stats.distributions.rv_frozen object at 0x04B86DD0>},
    pre_dispatch='2*n_jobs', random_state=None, refit=True,
    scoring=None, verbose=0)
0.282118643885
0.988443794636
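When a regularization strength may span several orders of magnitude, sampling on a log scale is a common alternative to a uniform range. The sketch below is an assumption-laden variation: it uses `scipy.stats.loguniform`, the bundled `load_diabetes` regression dataset (a different dataset from the one in this post) and a fixed `random_state` so that runs are repeatable.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Bundled regression data, used here only as a stand-in.
X_db, y_db = load_diabetes(return_X_y=True)

# Sample alpha log-uniformly between 1e-4 and 1e2.
param_distributions = {"alpha": loguniform(1e-4, 1e2)}

rsearch_demo = RandomizedSearchCV(
    Ridge(), param_distributions=param_distributions,
    n_iter=30, cv=5, random_state=0,
)
rsearch_demo.fit(X_db, y_db)

print(rsearch_demo.best_params_)
print(rsearch_demo.best_score_)
```

`n_iter` sets the budget directly, which is the main practical difference from a grid whose size grows with every added parameter.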

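Summary

We have walked through the general workflow of using the scikit-learn library. I hope this summary helps beginners settle down and learn, step by step, how to solve concrete machine learning problems.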

About the author, Jason Ding, and the source of this post
GitCafe blog homepage (http://jasonding1354.gitcafe.io/)
GitHub blog homepage (http://jasonding1354.github.io/)
CSDN blog (http://blog.csdn.net/jasonding1354)
Jianshu homepage (http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)
Search for jasonding1354 on Baidu to reach my blog homepage
