Python Machine Learning (Basics: Supervised Learning (Linear Classifiers))

Classic Supervised Learning Models and Algorithms

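Supervised learning models focus on predicting the target/label of unseen samples from existing labeled experience. Depending on the type of the target variable, supervised learning tasks fall roughly into two groups: classification and regression. The basic workflow is as follows: first, prepare the training data, which may be text, images, audio, and so on; then extract the required features to form feature vectors; next, feed these feature vectors together with their labels/targets (Labels) into a learning algorithm to train a predictive model; then apply the same feature extraction to new test data to obtain the test feature vectors; finally, use the trained model to predict on those test feature vectors and obtain the results.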
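
The workflow above can be sketched in a few lines of plain Python. This is only a toy illustration: the feature extractor (text length and number of exclamation marks), the nearest-centroid "learner", and the sample texts and labels are all hypothetical stand-ins chosen for brevity, not the models used later in this article.

```python
# Toy sketch of the supervised learning workflow.
# extract_features, train, predict and all data are hypothetical.

def extract_features(text):
    """Turn raw text into a fixed-length feature vector."""
    return [float(len(text)), float(text.count("!"))]

def train(feature_vectors, labels):
    """Learn one centroid (mean feature vector) per label."""
    sums, counts = {}, {}
    for vec, label in zip(feature_vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}

def predict(model, vec):
    """Return the label whose centroid is closest to vec."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))

# 1. Training data with labels
train_texts = ["hi", "ok", "WOW!!!", "amazing news!!"]
train_labels = ["calm", "calm", "excited", "excited"]

# 2-3. Extract features and train a model
model = train([extract_features(t) for t in train_texts], train_labels)

# 4-5. Apply the SAME feature extraction to new data, then predict
test_texts = ["yes", "incredible!!!"]
predictions = [predict(model, extract_features(t)) for t in test_texts]
print(predictions)  # ['calm', 'excited']
```

The point to notice is that `extract_features` is shared between training and testing, which mirrors the requirement that the same feature extraction be applied to both.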

1. Classification

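The most basic case is binary classification: a yes/no decision that picks one of two classes as the prediction. Multiclass classification picks one class out of more than two. Multi-label classification decides whether a sample belongs to several different classes at the same time.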
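
One way to see the difference between the three task types is to look at the shape of the labels. The sketch below uses made-up labels for four samples; the class values and tag layout are assumptions for illustration only.

```python
# Hypothetical labels for four samples under each task type.
binary_labels = [0, 1, 1, 0]        # exactly one of two classes per sample
multiclass_labels = [2, 0, 1, 2]    # exactly one of three classes per sample

# Multi-label: each row is an indicator vector over three tags,
# so a sample can carry zero, one, or several tags at once.
multilabel_labels = [
    [1, 0, 1],
    [0, 0, 0],
    [0, 1, 0],
    [1, 1, 1],
]

def count_classes(labels):
    """Number of distinct class values in a single-label target."""
    return len(set(labels))

tags_per_sample = [sum(row) for row in multilabel_labels]

print(count_classes(binary_labels))      # 2
print(count_classes(multiclass_labels))  # 3
print(tags_per_sample)                   # [2, 0, 1, 3]
```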

1.1 Linear Classifiers

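Model introduction: a linear classifier is a model that assumes a linear relationship between the features and the classification result. It makes its class decision by summing the products of each feature dimension and its corresponding weight.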

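Let x = <x1, x2, ..., xn> denote the n-dimensional feature column vector and let the n-dimensional column vector w = <w1, w2, ..., wn> denote the corresponding weights. To avoid forcing the decision boundary through the origin, we also assume an intercept b. The linear relationship can then be written as: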

f(w, x, b) = w^T x + b
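
As a minimal sketch of the linear relationship f(w, x, b) = w^T x + b, the snippet below computes the weighted sum for one sample. The helper name `linear_score` and the numeric values of w, x, and b are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def linear_score(w, x, b):
    """Return f(w, x, b) = w^T x + b for equal-length sequences w and x."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same length")
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Hypothetical weights, features, and intercept.
w = [0.5, -1.0, 2.0]
x = [2.0, 1.0, 0.5]
b = -0.5

score = linear_score(w, x, b)
print(score)  # 0.5*2.0 + (-1.0)*1.0 + 2.0*0.5 + (-0.5) = 0.5
```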

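For the simple binary classification problem we want to solve, we would like f ∈ {0, 1}, so we need a function that maps the original f ∈ R into (0, 1). The logistic function does this: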

g(z) = 1 / (1 + e^{-z})

Substituting f for z gives the logistic regression model:

h_{w,b}(x) = g(f(w, x, b)) = 1 / (1 + e^{-f}) = 1 / (1 + e^{-(w^T x + b)})
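
The logistic function and the resulting model h_{w,b}(x) can be sketched in plain Python as follows. The function names, the sample values, and the 0.5 decision threshold are assumptions made for illustration (0.5 is a common convention, not something fixed by the formulas above). The two-branch form of `sigmoid` is only there to avoid overflow in `math.exp` for large negative z.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z}), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)          # safe: z < 0, so e is in (0, 1)
    return e / (1.0 + e)

def predict_proba(w, x, b):
    """h_{w,b}(x) = g(w^T x + b), a value between 0 and 1."""
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(f)

def predict_label(w, x, b, threshold=0.5):
    """Assumed decision rule: class 1 if h >= threshold, else class 0."""
    return 1 if predict_proba(w, x, b) >= threshold else 0

# Hypothetical weights, features, and intercept.
w = [0.5, -1.0, 2.0]
x = [2.0, 1.0, 0.5]
b = -0.5

print(sigmoid(0.0))             # 0.5
print(predict_proba(w, x, b))   # g(0.5), roughly 0.62
print(predict_label(w, x, b))   # 1
```

Because g(0) = 0.5, thresholding h at 0.5 is the same as checking the sign of the linear score f.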

Example 1: Benign/Malignant Breast Cancer Tumor Prediction with a Logistic Regression Classifier

Data description:

Number of Instances: 699 (as of 15 July 1992)
Number of Attributes: 10 plus the class attribute
Attribute Information: (class attribute has been moved to last column)

   #  Attribute                     Domain
   -- -----------------------------------------
   1. Sample code number            id number
   2. Clump Thickness               1 - 10
   3. Uniformity of Cell Size       1 - 10
   4. Uniformity of Cell Shape      1 - 10
   5. Marginal Adhesion             1 - 10
   6. Single Epithelial Cell Size   1 - 10
   7. Bare Nuclei                   1 - 10
   8. Bland Chromatin               1 - 10
   9. Normal Nucleoli               1 - 10
  10. Mitoses                       1 - 10
  11. Class:                        (2 for benign, 4 for malignant)

Missing attribute values: 16
   There are 16 instances in Groups 1 to 6 that contain a single missing
   (i.e., unavailable) attribute value, now denoted by "?".

Class distribution:
   Benign: 458 (65.5%)
   Malignant: 241 (34.5%)
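
The figures in the description can be checked with a little arithmetic. The sketch below uses only the counts quoted above; the variable names are hypothetical. It also shows how many rows should remain if every instance with a missing value is dropped.

```python
# Counts taken from the data description above.
total_instances = 699
benign, malignant = 458, 241
rows_with_missing = 16

benign_share = benign / total_instances
malignant_share = malignant / total_instances

# Each of the 16 affected instances has a single missing value,
# so dropping them removes exactly 16 rows.
complete_rows = total_instances - rows_with_missing

print(f"{benign_share:.1%}")     # 65.5%
print(f"{malignant_share:.1%}")  # 34.5%
print(complete_rows)             # 683
```

The benign share is also a useful baseline: a classifier that always answers "benign" would already be right about 65.5% of the time on the full data set, so any trained model should clearly beat that figure.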

# Step 1: preprocess the benign/malignant breast cancer tumor data

# Import the pandas and numpy packages
import pandas as pd
import numpy as np

# Build the list of column names
column_names = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size',
                'Uniformity of Cell Shape', 'Marginal Adhesion',
                'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
                'Normal Nucleoli', 'Mitoses', 'Class']

# Use pandas.read_csv to read the data from the internet
data = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data',
    names=column_names)

# print(data)  # [699 rows x 11 columns]
# print(data[:5])
#    Sample code number  Clump Thickness  Uniformity of Cell Size  \
# 0             1000025                5                        1
# 1             1002945                5                        4
# 2             1015425                3                        1
# 3             1016277                6                        8
# 4             1017023                4                        1
#
#    Uniformity of Cell Shape  Marginal Adhesion  Single Epithelial Cell Size  \
# 0                         1                  1                            2
# 1                         4                  5                            7
# 2                         1                  1                            2
# 3                         8                  1                            3
# 4                         1                  3                            2
#
#   Bare Nuclei  Bland Chromatin  Normal Nucleoli  Mitoses  Class
# 0           1                3                1        1      2
# 1          10                3                2        1      2
# 2           2                3                1        1      2
# 3           4                3                7        1      2
# 4           1                3                1        1      2

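# Replace the missing-value marker '?' with the standard NaN
data = data.replace(to_replace='?', value=np.nan)
# Drop every row that contains at least one missing value
data = data.dropna(how='any')
print(data.shape)  # (683, 11)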

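# Step 2: prepare the training and test data

# Use train_test_split from sklearn.model_selection to split the data
# (the older sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split

# Randomly sample 25% of the data for testing; the remaining 75% forms the training set
X_train, X_test, y_train, y_test = train_test_split(
    data[column_names[1:10]], data[column_names[10]],
    test_size=0.25, random_state=33)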

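# Check the size and class distribution of the training samples
print(y_train.value_counts())
# 2    344
# 4    168
# Name: Class, dtype: int64

# Check the size and class distribution of the test samples
print(y_test.value_counts())
# 2    100
# 4     71
# Name: Class, dtype: int64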

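# Step 3: use linear classification models to predict benign/malignant tumors

# Import StandardScaler from sklearn.preprocessing
from sklearn.preprocessing import StandardScaler
# Import LogisticRegression and SGDClassifier from sklearn.linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier

# Standardize the data so that every feature has mean 0 and variance 1;
# this keeps features with large values from dominating the prediction
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
# Reuse the mean and variance learned from the training set on the test set
# (transform, not fit_transform, so both sets are scaled the same way)
X_test = ss.transform(X_test)

# Initialize LogisticRegression and SGDClassifier
lr = LogisticRegression()
sgdc = SGDClassifier()

# Call fit on LogisticRegression to learn the model parameters
lr.fit(X_train, y_train)
# Use the trained model lr to predict on X_test
lr_y_predict = lr.predict(X_test)

# Call fit on SGDClassifier to learn the model parameters
sgdc.fit(X_train, y_train)
# Use the trained model sgdc to predict on X_test
sgdc_y_predict = sgdc.predict(X_test)

# Step 4: evaluate the performance of the linear classifiers

# Import classification_report from sklearn.metrics
from sklearn.metrics import classification_report

# Use the built-in score method of the logistic regression model to get its accuracy on the test set
print('Accuracy of LR Classifier:', lr.score(X_test, y_test))
# Use classification_report to get the other three metrics for LogisticRegression
print(classification_report(y_test, lr_y_predict, target_names=['Benign', 'Malignant']))

# Use the built-in score method of the stochastic gradient descent model to get its accuracy on the test set
print('Accuracy of SGD Classifier:', sgdc.score(X_test, y_test))
# Use classification_report to get the other three metrics for SGDClassifier
print(classification_report(y_test, sgdc_y_predict, target_names=['Benign', 'Malignant']))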
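
Besides accuracy, classification_report lists precision, recall, and F1-score for each class. To make those numbers less opaque, here is a minimal pure-Python sketch that computes them for one class by hand. The helper `binary_metrics` and the eight true/predicted labels are hypothetical; they are not output from the models above. The labels follow the data set's coding (2 for benign, 4 for malignant), with malignant treated as the positive class.

```python
def binary_metrics(y_true, y_pred, positive):
    """Accuracy, plus precision/recall/F1 for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical ground truth and predictions (2 = benign, 4 = malignant).
y_true = [2, 2, 2, 2, 4, 4, 4, 4]
y_pred = [2, 2, 4, 4, 4, 4, 4, 2]

metrics = binary_metrics(y_true, y_pred, positive=4)
print(metrics)
# tp=3, fp=2, fn=1 -> accuracy 0.625, precision 0.6, recall 0.75, f1 about 0.667
```

For a tumor screening task, recall on the malignant class is usually the figure to watch, since a missed malignant case (a false negative) is more costly than a false alarm.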