Zhou Zhihua's "Machine Learning" Exercises: Ch3.3, Implementing Logistic Regression in Code

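It has been a long time since I last wrote a blog post, and I feel I have slacked off quite a bit. I recently graduated and started working, so my circumstances have changed a lot and there has been plenty to worry about. I am not especially satisfied with my first job, so I decided to teach myself machine learning and leave myself a fallback. I hope that when I reread this post some day, I can still recall how painful things felt at the time.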

The exercise is: implement logistic regression in code and report the results on watermelon dataset 3.0α.
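Before turning to a library, it can help to see what "implementing logistic regression" amounts to. The following is only a minimal sketch, assuming plain batch gradient descent on the mean log-loss. The function names (sigmoid, fit_logistic, predict), the learning rate, and the iteration count are my own illustrative choices, and the data is synthetic toy data, not the watermelon dataset.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, mapping real scores to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, y, lr=0.5, n_iter=5000):
    """Fit weights w and bias b by batch gradient descent on the mean log-loss.

    lr and n_iter are illustrative defaults, not tuned values.
    """
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)      # predicted probability of class 1
        error = p - y               # gradient of the log-loss w.r.t. the score
        w -= lr * (X.T @ error) / n_samples
        b -= lr * error.mean()
    return w, b


def predict(X, w, b):
    """Predict class 1 where the estimated probability is at least 0.5."""
    return (sigmoid(X @ w + b) >= 0.5).astype(int)


# Synthetic two-feature toy data (assumption: NOT the watermelon dataset).
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[0.7, 0.7], scale=0.1, size=(20, 2))
X_neg = rng.normal(loc=[0.3, 0.3], scale=0.1, size=(20, 2))
X_toy = np.vstack([X_pos, X_neg])
y_toy = np.concatenate([np.ones(20), np.zeros(20)])

w, b = fit_logistic(X_toy, y_toy)
accuracy = (predict(X_toy, w, b) == y_toy).mean()
print("weights:", w, "bias:", b, "training accuracy:", accuracy)
```

To try it on the real data, X_toy and y_toy would be replaced by the density and sugar-content columns and the good-melon label.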

The watermelon dataset is as follows:

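Here I mainly use four libraries: sklearn, matplotlib, numpy, and pandas. sklearn already ships with a logistic regression implementation (LogisticRegression in sklearn.linear_model), so it can be called directly. matplotlib is used to visualize the data and the fitted decision boundary.

The code is as follows: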

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd

    from sklearn import model_selection
    from sklearn.linear_model import LogisticRegression
    from sklearn import metrics

    # load the CSV file as a pandas DataFrame (raw string keeps the backslashes literal)
    dataset = pd.read_csv(r'F:\PythonTest\Machine Learning\watermelon3a.csv')

    # separate the features from the target attribute
    X = dataset[['密度', '含糖率']]
    y = dataset['好瓜']
    good_melon = dataset[dataset['好瓜'] == 1]
    bad_melon = dataset[dataset['好瓜'] == 0]

    # draw a scatter diagram to show the raw data
    f1 = plt.figure(1)
    plt.title('watermelon_3a')
    plt.xlabel('density')
    plt.ylabel('ratio_sugar')
    plt.xlim(0, 1)
    plt.ylim(0, 1)
    plt.scatter(bad_melon['密度'], bad_melon['含糖率'], marker='o', color='k', s=100, label='bad')
    plt.scatter(good_melon['密度'], good_melon['含糖率'], marker='o', color='g', s=100, label='good')
    plt.legend(loc='upper right')

    # split the data into training and test sets
    X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.5, random_state=0)

    # model training
    log_model = LogisticRegression()
    log_model.fit(X_train, y_train)

    # model testing
    y_pred = log_model.predict(X_test)

    # summarize the quality of the fit
    print(metrics.confusion_matrix(y_test, y_pred))
    print(metrics.classification_report(y_test, y_pred))
    print(log_model.coef_, log_model.intercept_)

    # decision boundary: w1*density + w2*sugar + b = 0, solved for sugar
    w1, w2 = log_model.coef_[0]
    b = log_model.intercept_[0]
    x_pred = np.linspace(0, 1, 100)
    line_pred = -(w1 * x_pred + b) / w2
    plt.plot(x_pred, line_pred)

    plt.show()

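The code is fairly simple, but I had not used Python for a long time and had grown rusty with some of the libraries, so it still took me quite a while. The main friction was with the library calls themselves: plotting with matplotlib, using np.linspace, loading the dataset with pandas' read_csv, and calling the sklearn estimator. A few notes on sklearn's linear models are excerpted below from https://blog.csdn.net/qq_29083329/article/details/48653391. Note that the excerpt describes LinearRegression rather than LogisticRegression, although the two share coef_, intercept_, fit, predict, score, get_params, and set_params.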

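Parameters:

fit_intercept: boolean, default True

Description: whether to fit an intercept term. If False, no intercept is used, which assumes the input data has already been centered. Otherwise the training data is centered internally and an intercept is computed.

normalize: boolean, default False

Description: whether to normalize the data before fitting. (This parameter was removed from LinearRegression in scikit-learn 1.2.)

copy_X: boolean, default True

Description: whether to copy X. If False, the original data may be overwritten directly, that is, the centered and normalized values are written back onto the input array.

n_jobs: integer, default 1 in the version the excerpt describes (None in recent releases)

Description: the number of jobs to use for the computation. -1 means using all CPUs. This only gives a speedup for sufficiently large problems with more than one target (n_targets > 1).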
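The effect of fit_intercept is easiest to see on a tiny example. The sketch below uses made-up data lying exactly on y = 3x + 2 (X_demo and y_demo are my own illustrative names) and leaves out normalize, since that parameter no longer exists in recent scikit-learn releases.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data lying exactly on y = 3*x + 2 (illustrative only).
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = 3.0 * X_demo[:, 0] + 2.0

# With an intercept the line is recovered exactly.
with_intercept = LinearRegression(fit_intercept=True, copy_X=True, n_jobs=None)
with_intercept.fit(X_demo, y_demo)

# Without an intercept the fit is forced through the origin.
no_intercept = LinearRegression(fit_intercept=False)
no_intercept.fit(X_demo, y_demo)

print("fit_intercept=True :", with_intercept.coef_, with_intercept.intercept_)
print("fit_intercept=False:", no_intercept.coef_, no_intercept.intercept_)
```

With fit_intercept=False the slope has to absorb the missing offset, so it comes out larger than 3 on this data.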

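Attributes:

coef_: array of shape (n_features,) or (n_targets, n_features)

Description: the estimated feature coefficients of the linear regression problem. For a multi-target problem this is a 2D array of shape (n_targets, n_features). For a single-target problem it is a 1D array of shape (n_features,).

intercept_: array

Description: the independent (constant) term of the linear model.

Note: this estimator is just scipy.linalg.lstsq wrapped as an estimator.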
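The shapes of coef_ and the relation to scipy.linalg.lstsq can be checked directly. This is a sketch on synthetic, noise-free data. Appending a column of ones to obtain the intercept from lstsq is my own way of making the comparison, not necessarily how sklearn does it internally.

```python
import numpy as np
from scipy.linalg import lstsq
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 2))
y_single = X_demo @ np.array([1.5, -2.0]) + 0.5
y_multi = np.column_stack([y_single, 2.0 * y_single])

single = LinearRegression().fit(X_demo, y_single)
multi = LinearRegression().fit(X_demo, y_multi)
print("single-target coef_ shape:", single.coef_.shape)  # (n_features,)
print("multi-target coef_ shape: ", multi.coef_.shape)   # (n_targets, n_features)

# The same single-target fit via ordinary least squares, with a column of
# ones appended so that the last coefficient plays the role of the intercept.
A = np.column_stack([X_demo, np.ones(len(X_demo))])
solution, residues, rank, singular_values = lstsq(A, y_single)
print("lstsq coefficients:", solution[:2], "intercept:", solution[2])
```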

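Methods:

decision_function(X): computes the model output for the input X.

fit(X, y[, n_jobs]): fits the model on the training set X, y. It is a wrapper around scipy.linalg.lstsq.

get_params([deep]): returns the parameters of the estimator.

predict(X): uses the fitted estimator to predict for the input X (X can be the test set or any new data to be predicted).

score(X, y[, sample_weight]): returns a score for the predictions with X as samples and y as targets (R^2 for LinearRegression, mean accuracy for classifiers such as LogisticRegression).

set_params(**params): sets the parameters of the estimator.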
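These methods can be exercised on the classifier actually used in this post. The sketch below calls them on LogisticRegression with synthetic two-cluster data. The value C=10.0 is an arbitrary illustrative setting, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-cluster data (illustrative only, not the watermelon dataset).
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=[0.7, 0.7], scale=0.1, size=(20, 2)),
    rng.normal(loc=[0.3, 0.3], scale=0.1, size=(20, 2)),
])
y_demo = np.array([1] * 20 + [0] * 20)

clf = LogisticRegression()
clf.set_params(C=10.0)          # change a hyperparameter after construction
clf.fit(X_demo, y_demo)

scores = clf.decision_function(X_demo)   # signed distance-like scores
labels = clf.predict(X_demo)             # predicted class labels

print("C from get_params:", clf.get_params()["C"])
print("mean accuracy:", clf.score(X_demo, y_demo))
# For a binary problem, class 1 is predicted where the score is positive.
print("predict matches sign of score:", np.array_equal(labels, (scores > 0).astype(int)))
```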