Author: Susan Li
Translated by: 袁雪瑤, 吳雙, 姜範波
According to the US Centers for Disease Control and Prevention, 1 in 7 American adults currently has diabetes, and by 2050 that ratio is projected to rise to as high as 1 in 3. We will use a diabetes dataset from the UCI Machine Learning Repository to see how machine learning can help us predict diabetes. Let's get started!
https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/diabetes.csv
The diabetes dataset can be obtained and downloaded from the UCI Machine Learning Repository.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

diabetes = pd.read_csv(r'C:\Download\Machine-Learning-with-Python-master\Machine-Learning-with-Python-master\diabetes.csv')
print(diabetes.columns)
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'], dtype='object')
The features are: number of pregnancies, glucose, blood pressure, skin thickness, insulin, BMI (body mass index), diabetes pedigree function, age, and the outcome.
diabetes.head()
print(diabetes.groupby('Outcome').size())
Outcome
0    500
1    268
dtype: int64

"Outcome" is the feature we are going to predict: 0 means no diabetes, 1 means diabetes. Of the 768 data points, 500 are labeled 0 and 268 are labeled 1.
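With 500 of the 768 samples labeled 0, the classes are imbalanced, and it is worth keeping the majority-class baseline in mind when judging any model's accuracy. A quick sketch of that sanity check (counts taken from the groupby output above):

```python
# A classifier that always predicts "no diabetes" (class 0) would already
# be right 500 times out of 768, so any model must clearly beat this.
majority = 500
total = 500 + 268
baseline_accuracy = majority / total
print("majority-class baseline: {:.2f}".format(baseline_accuracy))  # ~0.65
```

Any accuracy below roughly 65% would mean the model is doing worse than always guessing the majority class.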
print("dimension of diabetes data: {}".format(diabetes.shape))

dimension of diabetes data: (768, 9)

The diabetes dataset consists of 768 data points, each with 9 features.
import seaborn as sns
sns.countplot(diabetes['Outcome'], label="Count")
The k-NN algorithm is arguably the simplest algorithm in machine learning. Building the model consists only of storing the training dataset. To make a prediction for a new data point, the algorithm finds the closest points in the training dataset, that is, its "nearest neighbors". First, let's investigate whether we can confirm the relationship between model complexity and accuracy:
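The idea can be sketched in a few lines of plain NumPy (a toy illustration of the principle, not scikit-learn's actual implementation): measure the distance from the query point to every training point, take the k closest, and let them vote.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Toy k-NN classifier: majority vote among the k nearest training points."""
    distances = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                    # indices of the k closest
    votes = y_train[nearest]
    return np.bincount(votes).argmax()                     # majority class wins

# tiny synthetic example: two well-separated clusters
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.2, 0.1])))  # -> 0
print(knn_predict(X, y, np.array([4.8, 5.2])))  # -> 1
```

Note that "training" here costs nothing; all the work happens at prediction time, which is why k-NN is slow on large datasets.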
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    diabetes.loc[:, diabetes.columns != 'Outcome'], diabetes['Outcome'],
    stratify=diabetes['Outcome'], random_state=66)

from sklearn.neighbors import KNeighborsClassifier

training_accuracy = []
test_accuracy = []
# try n_neighbors from 1 to 10
neighbors_settings = range(1, 11)
for n_neighbors in neighbors_settings:
    # build the model
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(x_train, y_train)
    # record training set accuracy
    training_accuracy.append(knn.score(x_train, y_train))
    # record test set accuracy
    test_accuracy.append(knn.score(x_test, y_test))

plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.ylabel("Accuracy")
plt.xlabel("n_neighbors")
plt.legend()
plt.savefig('knn_compare_model')
The plot above shows training- and test-set prediction accuracy (y-axis) against the number of neighbors (x-axis). If we choose a single nearest neighbor, the predictions on the training set are perfect. But as more neighbors are taken into account, training accuracy drops, which indicates that using a single neighbor makes the model too complex. As the plot shows, the best choice here is 9 neighbors.
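Rather than reading the best k off the plot by eye, it can also be picked programmatically as the n_neighbors value with the highest test accuracy. A sketch (the test_accuracy values below are illustrative placeholders, not the actual numbers from the run above):

```python
import numpy as np

neighbors_settings = range(1, 11)
# placeholder accuracies, peaking at n_neighbors = 9 as in the article's plot
test_accuracy = [0.70, 0.74, 0.76, 0.77, 0.77, 0.77, 0.77, 0.77, 0.78, 0.77]

best_idx = int(np.argmax(test_accuracy))       # index of the (first) maximum
best_k = list(neighbors_settings)[best_idx]
print(best_k, test_accuracy[best_idx])
```

With the real test_accuracy list from the loop above, this selects the same value the plot suggests.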
The plot suggests that we should choose n_neighbors=9, so:
knn = KNeighborsClassifier(n_neighbors=9)
knn.fit(x_train, y_train)
print('Accuracy of K-NN classifier on training set: {:.2f}'.format(knn.score(x_train, y_train)))
print('Accuracy of K-NN classifier on test set: {:.2f}'.format(knn.score(x_test, y_test)))

Accuracy of K-NN classifier on training set: 0.79
Accuracy of K-NN classifier on test set: 0.78
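Once fitted, the classifier can also score individual new patients with predict and predict_proba. A standalone sketch (a tiny synthetic two-feature dataset stands in for the real nine-feature diabetes data, and all values are made up for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy stand-in for the diabetes data: e.g. (glucose, BMI) per patient
X = np.array([[85, 25.0], [90, 28.1], [160, 35.2], [170, 38.0]])
y = np.array([0, 0, 1, 1])  # outcome labels

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)

new_patient = np.array([[165, 36.0]])   # hypothetical measurements
print(clf.predict(new_patient))         # predicted class (0 or 1)
print(clf.predict_proba(new_patient))   # class probabilities from neighbor votes
```

With n_neighbors=3, the predicted probability is simply the fraction of the three nearest neighbors belonging to each class.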