ML with sklearn: Code explanation and usage of sklearn's make_pipeline, RobustScaler, KFold, and cross_val_score functions (a detailed guide)

Contents

Code explanation and usage of sklearn's make_pipeline function
Code explanation of sklearn's make_pipeline function
Usage of sklearn's make_pipeline function
1. Using the Pipeline class to represent a workflow that scales the data with MinMaxScaler and then trains an SVM
Code explanation and usage of sklearn's RobustScaler function
Code explanation and usage of sklearn's KFold function
Code explanation and usage of sklearn's cross_val_score function

Code explanation and usage of sklearn's make_pipeline function
To simplify the process of building chains of transformations and models, scikit-learn provides the Pipeline class, which merges multiple processing steps into a single scikit-learn estimator. The Pipeline class itself has fit, predict, and score methods and behaves like any other scikit-learn model.
Code explanation of sklearn's make_pipeline function
def make_pipeline(*steps, **kwargs)

Construct a Pipeline from the given estimators.

This is a shorthand for the Pipeline constructor; it does not require, and does not permit, naming the estimators. Instead, their names will be set automatically to the lowercase of their types.

Parameters
----------
*steps : list of estimators
memory : None, str or object with the joblib.Memory interface, optional
    Used to cache the fitted transformers of the pipeline. By default, no caching is performed. If a string is given, it is the path to the caching directory. Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instances given to the pipeline cannot be inspected directly; use the attributes ``named_steps`` or ``steps`` to inspect the estimators within the pipeline. Caching the transformers is advantageous when fitting is time consuming.
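To illustrate the memory parameter described above, here is a minimal sketch (the temporary directory is only a stand-in; any writable path works) that caches the fitted transformer and inspects the automatically named steps:

from tempfile import mkdtemp
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cache_dir = mkdtemp()  # illustrative cache location
pipe = make_pipeline(StandardScaler(), SVC(), memory=cache_dir)

# With caching enabled the transformers are cloned before fitting,
# so inspect them through named_steps rather than the original instances
print(pipe.named_steps['standardscaler'])
print(pipe.named_steps['svc'])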
Usage of sklearn's make_pipeline function
Examples
--------
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.preprocessing import StandardScaler
>>> make_pipeline(StandardScaler(), GaussianNB(priors=None))
...     # doctest: +NORMALIZE_WHITESPACE
Pipeline(memory=None,
         steps=[('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('gaussiannb', GaussianNB(priors=None))])

Returns
-------
p : Pipeline
1. Using the Pipeline class to represent a workflow that scales the data with MinMaxScaler and then trains an SVM
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
pipe.fit(X_train, y_train)    # fit the scaler, transform the training data, then fit the SVM
pipe.score(X_test, y_test)    # transform the test data, then score the SVM
2. Creating a pipeline with the make_pipeline function
The syntax for building a pipeline with the Pipeline class is somewhat cumbersome, and we usually do not need to give each step a user-specified name. In that case we can create the pipeline with the make_pipeline function, which builds the pipeline for us and names each step automatically after the class it belongs to.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

pipe = make_pipeline(MinMaxScaler(), SVC())
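A minimal end-to-end check (the random toy data is purely illustrative) shows the automatic step names and that the pipeline behaves like a single estimator:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

np.random.seed(0)
X = np.random.rand(20, 3)          # toy features
y = np.random.randint(0, 2, 20)    # toy binary labels
pipe = make_pipeline(MinMaxScaler(), SVC())
pipe.fit(X, y)                     # scales the data, then fits the SVM
print(pipe.steps)                  # [('minmaxscaler', MinMaxScaler(...)), ('svc', SVC(...))]
print(pipe.score(X, y))            # training accuracy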
References

"Introduction to Machine Learning with Python" (《Python機器學習基礎教程》), building pipelines (make_pipeline)
Python sklearn.pipeline.make_pipeline() Examples
Code explanation and usage of sklearn's RobustScaler function

Code explanation of the RobustScaler function
class RobustScaler(BaseEstimator, TransformerMixin)

Scale features using statistics that are robust to outliers.

This scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range).

Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often influence the sample mean/variance in a negative way. In such cases, the median and the interquartile range often give better results.

.. versionadded:: 0.17

Read more in the :ref:`User Guide <preprocessing_scaler>`.

Parameters
----------
with_centering : boolean, True by default
with_scaling : boolean, True by default
quantile_range : tuple (q_min, q_max), 0.0 < q_min < q_max < 100.0
    .. versionadded:: 0.18
copy : boolean, optional, default is True

Attributes
----------
center_ : array of floats
scale_ : array of floats
    .. versionadded:: 0.17

See also
--------
:class:`sklearn.decomposition.PCA`

Notes
-----
https://en.wikipedia.org/wiki/Median_(statistics)
class RobustScaler(BaseEstimator, TransformerMixin):
    def __init__(self, with_centering=True, with_scaling=True, ...):
        ...

    def _check_array(self, X, copy):
        if sparse.issparse(X):
            ...

    def fit(self, X, y=None):
        ...
        if self.with_scaling:
            q = np.percentile(X, self.quantile_range, axis=0)
            ...

    def transform(self, X):
        # Can be called on sparse input, provided that ``RobustScaler`` has been ...
        if sparse.issparse(X):
            ...

    def inverse_transform(self, X):
        if sparse.issparse(X):
            ...
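To make the median/IQR behavior concrete, here is a small sketch (toy single-feature data with one extreme outlier) showing that the fitted center and scale come from the median and the 25-75 quantile range, so the outlier barely affects them:

import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])   # one extreme outlier
scaler = RobustScaler()          # defaults: quantile_range=(25.0, 75.0)
X_scaled = scaler.fit_transform(X)
print(scaler.center_)            # [3.]  the median, unaffected by the outlier
print(scaler.scale_)             # [2.]  IQR = 75th percentile (4.0) - 25th percentile (2.0)
print(X_scaled.ravel())          # [-1.  -0.5  0.   0.5  498.5]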
Usage of the RobustScaler function
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import Lasso, ElasticNet

lasso = make_pipeline(RobustScaler(), Lasso(alpha=0.5, random_state=1))
ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.5, l1_ratio=.9, random_state=3))
Code explanation and usage of sklearn's KFold function

Code explanation of the KFold function
class KFold, found at sklearn.model_selection._split

class KFold(_BaseKFold):
Examples
--------
(The full doctest is shown in the usage section below.)

Notes
-----
The first ``n_samples % n_splits`` folds have size ``n_samples // n_splits + 1``, other folds have size ``n_samples // n_splits``, where ``n_samples`` is the number of samples.

See also
--------
StratifiedKFold
    Takes group information into account to avoid building folds with imbalanced class distributions (for binary or multiclass classification tasks).
GroupKFold : K-fold iterator variant with non-overlapping groups.
RepeatedKFold : Repeats K-Fold n times.
"""
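The difference from StratifiedKFold mentioned in the See also section can be seen in a short sketch (toy imbalanced labels, illustrative only): plain KFold can put all samples of the minority class into one fold, while StratifiedKFold preserves the class ratio:

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.zeros((8, 1))                        # toy features
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])     # imbalanced labels
for name, cv in [("KFold", KFold(n_splits=2)),
                 ("StratifiedKFold", StratifiedKFold(n_splits=2))]:
    test_folds = [test for _, test in cv.split(X, y)]
    print(name, test_folds)
# KFold puts both class-1 samples into the second test fold;
# StratifiedKFold gives each test fold one sample of class 1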
def __init__(self, n_splits=3, shuffle=False, random_state=None):
    super(KFold, self).__init__(n_splits, shuffle, random_state)

def _iter_test_indices(self, X, y=None, groups=None):
    n_samples = _num_samples(X)
    indices = np.arange(n_samples)
    if self.shuffle:
        check_random_state(self.random_state).shuffle(indices)
    n_splits = self.n_splits
    fold_sizes = (n_samples // n_splits) * np.ones(n_splits, dtype=np.int)
    fold_sizes[:n_samples % n_splits] += 1
    current = 0
    for fold_size in fold_sizes:
        start, stop = current, current + fold_size
        yield indices[start:stop]
        current = stop
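The fold-size rule from the Notes above (the first n_samples % n_splits folds get one extra sample) can be verified with a minimal sketch:

import numpy as np
from sklearn.model_selection import KFold

# 5 samples, 3 splits: 5 % 3 = 2 folds of size 5 // 3 + 1 = 2, then one fold of size 1
X = np.arange(10).reshape(5, 2)
for train_index, test_index in KFold(n_splits=3).split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
# TEST folds: [0 1], [2 3], [4]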
Usage of the KFold function
Examples
--------
>>> import numpy as np
>>> from sklearn.model_selection import KFold
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([1, 2, 3, 4])
>>> kf = KFold(n_splits=2)
>>> kf.get_n_splits(X)
2
>>> print(kf)  # doctest: +NORMALIZE_WHITESPACE
KFold(n_splits=2, random_state=None, shuffle=False)
>>> for train_index, test_index in kf.split(X):
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
TRAIN: [2 3] TEST: [0 1]
TRAIN: [0 1] TEST: [2 3]
Code explanation and usage of sklearn's cross_val_score function

Code explanation of the cross_val_score function
def cross_val_score, found at sklearn.model_selection._validation

def cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None,
                    n_jobs=1, verbose=0, fit_params=None, pre_dispatch='2*n_jobs'):
Evaluate a score by cross-validation. Read more in the :ref:`User Guide <cross_validation>`.
Parameters
----------
estimator : estimator object implementing 'fit'
    The object to use to fit the data.
X : array-like
    The data to fit. Can be for example a list, or an array.
y : array-like, optional, default: None
    The target variable to try to predict in the case of supervised learning.
groups : array-like, with shape (n_samples,), optional
    Group labels for the samples used while splitting the dataset into train/test set.
scoring : string, callable or None, optional, default: None
    A string (see model evaluation documentation) or a scorer callable object / function with signature ``scorer(estimator, X, y)``.
cv : int, cross-validation generator or an iterable, optional
    Determines the cross-validation splitting strategy. Possible inputs for cv are:
    - None, to use the default 3-fold cross validation,
    - integer, to specify the number of folds in a `(Stratified)KFold`,
    - An object to be used as a cross-validation generator,
    - An iterable yielding train, test splits.
    For integer/None inputs, if the estimator is a classifier and ``y`` is either binary or multiclass, :class:`StratifiedKFold` is used. In all other cases, :class:`KFold` is used. Refer to the :ref:`User Guide <cross_validation>` for the various cross-validation strategies that can be used here.
n_jobs : integer, optional
    The number of CPUs to use to do the computation. -1 means 'all CPUs'.
verbose : integer, optional
    The verbosity level.
fit_params : dict, optional
    Parameters to pass to the fit method of the estimator.
pre_dispatch : int, or string, optional
    Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:
    - None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs,
    - An int, giving the exact number of total jobs that are spawned,
    - A string, giving an expression as a function of n_jobs, as in '2*n_jobs'.

Returns
-------
scores : array of float, shape=(len(list(cv)),)
    Array of scores of the estimator for each run of the cross validation.
Examples
--------
(The full doctest is shown in the usage section below.)

See Also
--------
:func:`sklearn.model_selection.cross_validate` : To run cross-validation on multiple metrics and also to return train scores, fit times and score times.
:func:`sklearn.metrics.make_scorer` : Make a scorer from a performance metric or loss function.
"""
# To ensure multimetric format is not supported
scorer = check_scoring(estimator, scoring=scoring)
cv_results = cross_validate(estimator=estimator, X=X, y=y, groups=groups,
                            scoring={'score': scorer}, cv=cv,
                            return_train_score=False, n_jobs=n_jobs,
                            verbose=verbose, fit_params=fit_params,
                            pre_dispatch=pre_dispatch)
return cv_results['test_score']
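Since cross_val_score is a thin wrapper over cross_validate (as the body above shows), calling cross_validate directly returns the extra information the See Also entry mentions. A minimal sketch:

from sklearn import datasets, linear_model
from sklearn.model_selection import cross_validate

diabetes = datasets.load_diabetes()
X, y = diabetes.data[:150], diabetes.target[:150]
lasso = linear_model.Lasso()

res = cross_validate(lasso, X, y, cv=3, return_train_score=True)
print(res['test_score'])     # what cross_val_score would return
print(res['train_score'])    # training scores per fold
print(res['fit_time'])       # seconds spent fitting per fold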
Optional values for the scoring parameter
https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
Scoring | Function | Comment
---|---|---
Classification | |
‘accuracy’ | metrics.accuracy_score |
‘balanced_accuracy’ | metrics.balanced_accuracy_score |
‘average_precision’ | metrics.average_precision_score |
‘neg_brier_score’ | metrics.brier_score_loss |
‘f1’ | metrics.f1_score | for binary targets
‘f1_micro’ | metrics.f1_score | micro-averaged
‘f1_macro’ | metrics.f1_score | macro-averaged
‘f1_weighted’ | metrics.f1_score | weighted average
‘f1_samples’ | metrics.f1_score | by multilabel sample
‘neg_log_loss’ | metrics.log_loss | requires predict_proba support
‘precision’ etc. | metrics.precision_score | suffixes apply as with ‘f1’
‘recall’ etc. | metrics.recall_score | suffixes apply as with ‘f1’
‘jaccard’ etc. | metrics.jaccard_score | suffixes apply as with ‘f1’
‘roc_auc’ | metrics.roc_auc_score |
‘roc_auc_ovr’ | metrics.roc_auc_score |
‘roc_auc_ovo’ | metrics.roc_auc_score |
‘roc_auc_ovr_weighted’ | metrics.roc_auc_score |
‘roc_auc_ovo_weighted’ | metrics.roc_auc_score |
Clustering | |
‘adjusted_mutual_info_score’ | metrics.adjusted_mutual_info_score |
‘adjusted_rand_score’ | metrics.adjusted_rand_score |
‘completeness_score’ | metrics.completeness_score |
‘fowlkes_mallows_score’ | metrics.fowlkes_mallows_score |
‘homogeneity_score’ | metrics.homogeneity_score |
‘mutual_info_score’ | metrics.mutual_info_score |
‘normalized_mutual_info_score’ | metrics.normalized_mutual_info_score |
‘v_measure_score’ | metrics.v_measure_score |
Regression | |
‘explained_variance’ | metrics.explained_variance_score |
‘max_error’ | metrics.max_error |
‘neg_mean_absolute_error’ | metrics.mean_absolute_error |
‘neg_mean_squared_error’ | metrics.mean_squared_error |
‘neg_root_mean_squared_error’ | metrics.mean_squared_error |
‘neg_mean_squared_log_error’ | metrics.mean_squared_log_error |
‘neg_median_absolute_error’ | metrics.median_absolute_error |
‘r2’ | metrics.r2_score |
‘neg_mean_poisson_deviance’ | metrics.mean_poisson_deviance |
‘neg_mean_gamma_deviance’ | metrics.mean_gamma_deviance |
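Any string from the table plugs directly into the scoring parameter. A small sketch on iris (LogisticRegression is chosen only for illustration):

from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
clf = LogisticRegression(max_iter=1000)   # max_iter raised just to avoid convergence warnings
for metric in ('accuracy', 'f1_macro'):
    scores = cross_val_score(clf, iris.data, iris.target, cv=5, scoring=metric)
    print(metric, round(scores.mean(), 3))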
Usage of the cross_val_score function

1. Regression prediction: the diabetes dataset
>>> from sklearn import datasets, linear_model
>>> from sklearn.model_selection import cross_val_score
>>> diabetes = datasets.load_diabetes()
>>> X = diabetes.data[:150]
>>> y = diabetes.target[:150]
>>> lasso = linear_model.Lasso()
>>> print(cross_val_score(lasso, X, y))  # doctest: +ELLIPSIS
[ 0.33150734  0.08022311  0.03531764]
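A variant of the example above (same diabetes slice) that passes an explicit KFold object as cv instead of an integer and picks a metric from the scoring table:

from sklearn import datasets, linear_model
from sklearn.model_selection import KFold, cross_val_score

diabetes = datasets.load_diabetes()
X, y = diabetes.data[:150], diabetes.target[:150]
lasso = linear_model.Lasso()

cv = KFold(n_splits=5, shuffle=True, random_state=0)   # explicit CV strategy
scores = cross_val_score(lasso, X, y, cv=cv, scoring='neg_mean_squared_error')
print(scores)          # one (negated) MSE per fold
print(scores.mean())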
2. Classification prediction: the iris dataset
from sklearn import datasets                                           # built-in datasets
from sklearn.model_selection import train_test_split, cross_val_score  # data splitting and cross-validation
from sklearn.neighbors import KNeighborsClassifier                     # a simple model with a single hyperparameter K, similar in spirit to K-means
import matplotlib.pyplot as plt

iris = datasets.load_iris()   # load sklearn's built-in iris dataset
X = iris.data                 # the data
y = iris.target               # the label of each sample

# hold out 1/3 of the data: train on the training set, evaluate on the test set
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=1/3, random_state=3)

k_range = range(1, 31)
cv_scores = []                # holds the result of each model
for n in k_range:
    # KNN model; with a single hyperparameter it can be searched directly like this,
    # with several hyperparameters use GridSearchCV instead (see the sketch below)
    knn = KNeighborsClassifier(n)
    # cv: number of folds; scoring='accuracy': evaluation metric (can be omitted to use the default)
    scores = cross_val_score(knn, train_X, train_y, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())

plt.plot(k_range, cv_scores)
plt.xlabel('K')
plt.ylabel('Accuracy')        # pick the best parameter from the plot
plt.show()

best_knn = KNeighborsClassifier(n_neighbors=3)    # pass the best K=3 into the model
best_knn.fit(train_X, train_y)                    # train the model
print(best_knn.score(test_X, test_y))             # check the score
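As the comment in the loop notes, with more than one hyperparameter GridSearchCV is the usual tool. A hedged sketch of the same search done that way (the parameter grid values are illustrative):

from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
param_grid = {
    'n_neighbors': range(1, 31),          # same K range as above
    'weights': ['uniform', 'distance'],   # a second hyperparameter, for illustration
}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring='accuracy')
grid.fit(iris.data, iris.target)
print(grid.best_params_, grid.best_score_)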