[Machine Learning in Practice] Working with the Breast-Cancer Dataset

Date: 2019-12-07


In this post, we take a dataset from the UCI repository, the Breast-Cancer dataset, and build a classifier for it using existing machine learning methods.

Code for this post: link

Dataset overview

The dataset is available at: link

On that page, Data Set Description leads to the documentation for the data, and Data Folder leads to the download location for the dataset.

The files we use here are:

breast-cancer-wisconsin.data
breast-cancer-wisconsin.names

Of these two files, the first (link) is our data file and the second (link) is the documentation that describes the data.

With data like this, we should first read the documentation to get a basic understanding of the data.

Preprocessing the data

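From the documentation we learn that the file has 11 columns: column 1 is the id number, columns 2-10 are the features, and column 11 is the label (2 for benign, 4 for malignant). The features are described in the documentation, but we do not need to concern ourselves with their medical meaning. The relevant part of the documentation reads: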

7. Attribute Information: (class attribute has been moved to last column)

   #  Attribute                     Domain
   -- -----------------------------------------
   1. Sample code number            id number
   2. Clump Thickness               1 - 10
   3. Uniformity of Cell Size       1 - 10
   4. Uniformity of Cell Shape      1 - 10
   5. Marginal Adhesion             1 - 10
   6. Single Epithelial Cell Size   1 - 10
   7. Bare Nuclei                   1 - 10
   8. Bland Chromatin               1 - 10
   9. Normal Nucleoli               1 - 10
  10. Mitoses                       1 - 10
  11. Class:                        (2 for benign, 4 for malignant)

The documentation also gives us some further information:

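The dataset contains 699 records in total
The dataset has 16 missing values, each represented by "?"
The dataset has 458 benign records and 241 malignant records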
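
These figures can be checked once the file is loaded. The following is a minimal sketch, assuming a few made-up rows in the same 11-column layout stand in for the real file; `summarize` is a hypothetical helper name, not something from the original post. On the real file one would expect it to report the numbers quoted from the documentation above.

```python
import io
import pandas as pd

# Made-up sample rows in the same 11-column layout as the data file
# (for illustration only; not copied from the real dataset).
SAMPLE = """\
1001,5,1,1,1,2,1,3,1,1,2
1002,5,4,4,5,7,10,3,2,1,2
1003,8,10,10,8,7,?,9,7,1,4
1004,1,1,1,1,2,?,3,1,1,2
1005,8,7,5,10,7,9,5,5,4,4
"""

def summarize(df):
    """Return (row count, number of '?' cells, class counts as a dict)."""
    n_rows = len(df)
    # compare as strings so numeric and text columns are handled alike
    n_missing = int((df.astype(str) == "?").sum().sum())
    # the class label is the last column
    class_counts = df[df.columns[-1]].value_counts().to_dict()
    return n_rows, n_missing, class_counts

df = pd.read_csv(io.StringIO(SAMPLE), header=None)
n_rows, n_missing, class_counts = summarize(df)
print(n_rows, n_missing, class_counts)
```

To run it on the real data, replace `io.StringIO(SAMPLE)` with the path to breast-cancer-wisconsin.data.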

Handling missing values and splitting the dataset

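Because only a few rows have missing values (16 rows), for now we simply discard the rows that contain "?". Together with the earlier steps of reading the data and adding column headers, the code is as follows: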

# import the packages
import numpy as np
import pandas as pd

DATA_PATH = "breast-cancer-wisconsin.data"

# create the column names
columnNames = [
    'Sample code number',
    'Clump Thickness',
    'Uniformity of Cell Size',
    'Uniformity of Cell Shape',
    'Marginal Adhesion',
    'Single Epithelial Cell Size',
    'Bare Nuclei',
    'Bland Chromatin',
    'Normal Nucleoli',
    'Mitoses',
    'Class'
]

data = pd.read_csv(DATA_PATH, names = columnNames)
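# show the shape of data
print(data.shape)

# use standard missing value to replace "?"
data = data.replace(to_replace = "?", value = np.nan)
# then drop every row that contains a missing value
data = data.dropna(how = 'any')
# 'Bare Nuclei' was read as strings because of "?"; convert it to numbers
data['Bare Nuclei'] = pd.to_numeric(data['Bare Nuclei'])

print(data.shape)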

The output is:

(699, 11)
(683, 11)

As we can see, the rows that contain missing values have now been discarded.
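
As a variation on the replace-then-drop approach, pandas can be told to treat "?" as missing while reading the file. This is a minimal sketch, assuming a few made-up rows stand in for the real file; it is an alternative, not what the post's code does.

```python
import io
import pandas as pd

# Made-up sample rows in the data file's layout (illustration only).
SAMPLE = """\
1001,5,1,1,1,2,1,3,1,1,2
1003,8,10,10,8,7,?,9,7,1,4
1005,8,7,5,10,7,9,5,5,4,4
"""

# na_values="?" makes pandas parse "?" as NaN, so the column stays numeric
raw = pd.read_csv(io.StringIO(SAMPLE), header=None, na_values="?")
# drop every row that has at least one missing value
clean = raw.dropna(how="any")

print(raw.shape, clean.shape)
print(clean.dtypes.unique())
```

With this approach the affected column is parsed as numbers from the start, so no separate string-to-number conversion is needed.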

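We can access a specific attribute with an expression such as data['Class'], as shown in the figure below:

[Figure: accessing data['Class']]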

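Next we split the dataset into two parts, a training set and a test set, using train_test_split. This function performs the random split for us; see the function documentation.

We then split the dataset: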

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    data[ columnNames[1:10] ], # features
    data[ columnNames[10]   ], # labels
    test_size = 0.25,
    random_state = 33
)

The resulting variables are:

X_train: features of the training set
X_test: features of the test set
y_train: labels of the training set
y_test: labels of the test set

Because this is supervised learning, every record has a label, and the labels are assumed to be 100% accurate.
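
Since the two classes are not equally frequent, one optional variation is to pass `stratify` so that both parts keep the same class ratio. The sketch below assumes synthetic data (random features, a 2:1 class ratio) in place of the real dataset; the call in this post does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 120 samples, 9 integer features in 1..10,
# labels 2 (benign) and 4 (malignant) in a 2:1 ratio
rng = np.random.RandomState(0)
X = rng.randint(1, 11, size=(120, 9))
y = np.array([2] * 80 + [4] * 40)

# stratify=y keeps the class ratio the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=33, stratify=y
)

print(X_tr.shape, X_te.shape)
print((y_te == 4).mean())  # share of malignant labels in the test part
```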

Applying machine learning models

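Before applying a model, each feature should be transformed to have mean 0 and variance 1, so that the trained model is not dominated by dimensions with large values.

Here we use StandardScaler from scikit-learn (documentation link).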

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
X_train = ss.fit_transform(X_train) # fit on the training data, then transform it
X_test = ss.transform(X_test)       # reuse the training mean and variance
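
To make the transformation concrete, here is a minimal sketch on a made-up array: it subtracts the training mean from each feature and divides by the training standard deviation, then checks that the result matches StandardScaler.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up numbers for illustration: 3 training rows, 1 test row, 2 features
train = np.array([[1.0, 10.0], [3.0, 20.0], [5.0, 60.0]])
test = np.array([[3.0, 30.0]])

scaler = StandardScaler().fit(train)

# Manual version: statistics come from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses
manual = (test - mean) / std

# Both routes give the same scaled test row
assert np.allclose(scaler.transform(test), manual)
print(manual)
```

The test row is scaled with the training statistics, which is why the post calls fit_transform on the training set but only transform on the test set.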

Next we build the machine learning models. Here we use Logistic Regression and SVM:

# use logistic regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr_y = lr.predict(X_test)

# use svm
from sklearn.svm import LinearSVC
lsvc = LinearSVC()
lsvc.fit(X_train, y_train)
svm_y = lsvc.predict(X_test)
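
The scaling and fitting steps can also be chained so that the scaler is always fitted on the training data only. The following is a sketch under stated assumptions: it uses scikit-learn's make_pipeline on synthetic data from make_classification, not the dataset or the exact code of this post.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 9 features, like the feature columns used here
X, y = make_classification(n_samples=200, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=33
)

# fit() scales the training data and trains the model in one call;
# predict() and score() apply the same fitted scaler before the classifier
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_te)

print(pipe.score(X_te, y_te))
```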

Evaluating the classifiers

First, we print the accuracy using the classifiers' built-in .score method:

# now we will check the performance of the classifiers
from sklearn.metrics import classification_report
# `classification_report` is used further below to present detailed results
# the `.score` method reports the accuracy on the given data
print('Accuracy of the LogisticRegression: ', lr.score(X_test, y_test))
# print('Accuracy on the train dataset: ', lr.score(X_train, y_train))
# print('Accuracy on the predict result (should be 1.0): ', lr.score(X_test, lr_y))
print('Accuracy of the SVM: ', lsvc.score(X_test, y_test))

The output is:

Accuracy of the LogisticRegression:  0.953216374269
Accuracy of the SVM:  0.959064327485
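
For a classifier, `.score` reports accuracy: the fraction of predictions that match the true labels. A minimal sketch on made-up labels shows the same quantity computed by hand and with scikit-learn's accuracy_score.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration (2 = benign, 4 = malignant)
y_true = np.array([2, 2, 4, 4, 2, 4])
y_pred = np.array([2, 4, 4, 4, 2, 2])

# accuracy = number of correct predictions / number of predictions
manual_acc = (y_true == y_pred).mean()

assert np.isclose(manual_acc, accuracy_score(y_true, y_pred))
print(manual_acc)
```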

In addition, we can use classification_report to see more detailed performance results for a classifier:

print(classification_report(y_test, svm_y, target_names = ['Benign', 'Malignant']))

The result is as follows:

             precision    recall  f1-score   support

     Benign       0.96      0.98      0.97       111
  Malignant       0.96      0.92      0.94        60

avg / total       0.96      0.96      0.96       171
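
The columns of the report follow from counts of correct and incorrect predictions per class. The sketch below is illustrative: the counts are an assumption inferred from the rounded figures above (they are not given in the source), and `prf` is a hypothetical helper name.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Assumed counts, chosen to be consistent with the rounded report:
# 109 of 111 benign and 55 of 60 malignant samples classified correctly
benign = prf(tp=109, fp=5, fn=2)
malignant = prf(tp=55, fp=2, fn=5)
accuracy = (109 + 55) / 171

print([round(v, 2) for v in benign])
print([round(v, 2) for v in malignant])
print(round(accuracy, 6))
```

Under these assumed counts, the rounded values agree with the report and with the SVM accuracy printed earlier.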
