Original post: http://blog.csdn.net/zouxy09/article/details/48903179
1. Overview
Machine learning algorithms have become "well known" in recent years under the heat that big data has ignited: even if you don't understand the theory behind them, you can still name one or two famous algorithms with your head held high. Of course, the forest of algorithms is vast but the truly capable are few; algorithms that adapt to certain settings and deliver good results stand out, while the mediocre are forgotten by history. As the machine learning community has grown and practice has vetted them, these standouts have gradually earned recognition and favor, and with that more community effort to support, improve, and promote them.
Take classification, the most common task, as an example. Algorithms roughly split into two camps, linear and nonlinear. The linear camp includes the famous logistic regression, naive Bayes, maximum entropy, and so on; the nonlinear camp includes random forests, decision trees, neural networks, kernel machines, and more. The banner the linear camp flies is efficiency: training and prediction are both fast. But the final result depends heavily on the features, because these models need the data to be linearly separable in feature space. Using a linear algorithm therefore means putting real effort into feature engineering: selecting, transforming, and combining features until they are discriminative. Nonlinear algorithms are the stronger beasts here; they can model complex decision surfaces and so fit the data better.
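To make the linear/nonlinear contrast concrete, here is a minimal sketch (mine, not from the original post) using scikit-learn's make_circles, where the two classes form concentric rings that no straight line can separate: the linear model struggles on the raw features, the RBF kernel machine does not, and a single engineered feature (the squared radius) rescues the linear model.

# A minimal sketch contrasting a linear and a nonlinear classifier on data
# that is not linearly separable in its raw features.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Two concentric rings: no straight line separates the classes.
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)

print 'logistic regression (raw features): %.2f' % LogisticRegression().fit(X, y).score(X, y)
print 'RBF SVM (raw features): %.2f' % SVC(kernel='rbf').fit(X, y).score(X, y)

# Feature engineering rescues the linear model: adding x1^2 + x2^2 (the
# squared radius) makes the rings linearly separable.
X_aug = np.hstack([X, (X ** 2).sum(axis=1, keepdims=True)])
print 'logistic regression (engineered feature): %.2f' % LogisticRegression().fit(X_aug, y).score(X_aug, y)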
So, once we have chosen our features, which machine learning algorithm will do best? Nobody knows. Practice is the only sound test of which one is better. Does that mean grinding out five or six implementations by hand? No. The machine learning community is a powerful one, and the coder's creed is: don't reinvent the wheel! For reasonably mature algorithms there is always some excellent library ready to use, saving everyone most of the research time.
Since I mostly use Python these days, and the most renowned machine learning library in the Python world is surely scikit-learn, that is what we will use. The library has many strengths: it is simple to use, its interfaces are beautifully abstracted, and its documentation is genuinely touching. In this post we wrap many of its machine learning algorithms and then test them all in one pass, which makes it easy to compare them and pick the best. Of course, for any particular algorithm, hyperparameter tuning also matters a great deal.
2. Scikit-learn in practice with Python
2.1 Preparing Python
One thing people love about Python is its community support: there are a great many excellent libraries and modules. But some libraries depend on one another, so installing them can be a tedious process. There is always someone who cannot stand that tedium and builds automation tools to save everyone's time. By my own count, there are three ways to install a Python library:
1) Anaconda
This is a very complete Python distribution. The latest release ships as many as 195 popular Python packages, including the scientific-computing staples we use all the time, such as numpy and scipy. With it, Mom no longer has to worry about me tearing my hair out installing one dependency after another. With Anaconda in hand, life is easy! Download it here: http://www.continuum.io/downloads
2) Pip
Anyone who has used Ubuntu understands the love for apt-get. Downloading and installing Python libraries can lean on the pip tool in just the same way: name the library you need, and it downloads and installs it in one step. Download and install pip from its page at https://pypi.python.org/pypi/pip; after that, everything you need is one # pip install xx away.
3) Source packages
If neither of the methods above turns up your library, download the library's source code and unpack it; the directory will contain a setup.py file. Run # python setup.py install to install the library into Python's default library directory.
2.2 Testing scikit-learn
scikit-learn is already included in Anaconda; you can also download the source package from the official site and install it yourself. The code in this post wraps the machine learning algorithms listed below; change the data-loading function and you can test them all in one shot:
classifiers = {'NB': naive_bayes_classifier,
               'KNN': knn_classifier,
               'LR': logistic_regression_classifier,
               'RF': random_forest_classifier,
               'DT': decision_tree_classifier,
               'SVM': svm_classifier,
               'SVMCV': svm_cross_validation,
               'GBDT': gradient_boosting_classifier}
train_test.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import os
import time
from sklearn import metrics
import numpy as np
import cPickle as pickle

reload(sys)
sys.setdefaultencoding('utf8')


# Multinomial Naive Bayes Classifier
def naive_bayes_classifier(train_x, train_y):
    from sklearn.naive_bayes import MultinomialNB
    model = MultinomialNB(alpha=0.01)
    model.fit(train_x, train_y)
    return model


# KNN Classifier
def knn_classifier(train_x, train_y):
    from sklearn.neighbors import KNeighborsClassifier
    model = KNeighborsClassifier()
    model.fit(train_x, train_y)
    return model


# Logistic Regression Classifier
def logistic_regression_classifier(train_x, train_y):
    from sklearn.linear_model import LogisticRegression
    model = LogisticRegression(penalty='l2')
    model.fit(train_x, train_y)
    return model


# Random Forest Classifier
def random_forest_classifier(train_x, train_y):
    from sklearn.ensemble import RandomForestClassifier
    model = RandomForestClassifier(n_estimators=8)
    model.fit(train_x, train_y)
    return model


# Decision Tree Classifier
def decision_tree_classifier(train_x, train_y):
    from sklearn import tree
    model = tree.DecisionTreeClassifier()
    model.fit(train_x, train_y)
    return model


# GBDT (Gradient Boosting Decision Tree) Classifier
def gradient_boosting_classifier(train_x, train_y):
    from sklearn.ensemble import GradientBoostingClassifier
    model = GradientBoostingClassifier(n_estimators=200)
    model.fit(train_x, train_y)
    return model


# SVM Classifier
def svm_classifier(train_x, train_y):
    from sklearn.svm import SVC
    model = SVC(kernel='rbf', probability=True)
    model.fit(train_x, train_y)
    return model


# SVM Classifier using cross validation
def svm_cross_validation(train_x, train_y):
    from sklearn.grid_search import GridSearchCV
    from sklearn.svm import SVC
    model = SVC(kernel='rbf', probability=True)
    param_grid = {'C': [1e-3, 1e-2, 1e-1, 1, 10, 100, 1000],
                  'gamma': [0.001, 0.0001]}
    grid_search = GridSearchCV(model, param_grid, n_jobs=1, verbose=1)
    grid_search.fit(train_x, train_y)
    best_parameters = grid_search.best_estimator_.get_params()
    for para, val in best_parameters.items():
        print para, val
    model = SVC(kernel='rbf', C=best_parameters['C'],
                gamma=best_parameters['gamma'], probability=True)
    model.fit(train_x, train_y)
    return model


def read_data(data_file):
    import gzip
    f = gzip.open(data_file, "rb")
    train, val, test = pickle.load(f)
    f.close()
    train_x = train[0]
    train_y = train[1]
    test_x = test[0]
    test_y = test[1]
    return train_x, train_y, test_x, test_y


if __name__ == '__main__':
    data_file = "mnist.pkl.gz"
    thresh = 0.5
    model_save_file = None
    model_save = {}

    test_classifiers = ['NB', 'KNN', 'LR', 'RF', 'DT', 'SVM', 'GBDT']
    classifiers = {'NB': naive_bayes_classifier,
                   'KNN': knn_classifier,
                   'LR': logistic_regression_classifier,
                   'RF': random_forest_classifier,
                   'DT': decision_tree_classifier,
                   'SVM': svm_classifier,
                   'SVMCV': svm_cross_validation,
                   'GBDT': gradient_boosting_classifier}

    print 'reading training and testing data...'
    train_x, train_y, test_x, test_y = read_data(data_file)
    num_train, num_feat = train_x.shape
    num_test, num_feat = test_x.shape
    is_binary_class = (len(np.unique(train_y)) == 2)
    print '******************** Data Info *********************'
    print '#training data: %d, #testing_data: %d, dimension: %d' % (num_train, num_test, num_feat)

    for classifier in test_classifiers:
        print '******************* %s ********************' % classifier
        start_time = time.time()
        model = classifiers[classifier](train_x, train_y)
        print 'training took %fs!' % (time.time() - start_time)
        predict = model.predict(test_x)
        if model_save_file != None:
            model_save[classifier] = model
        if is_binary_class:
            precision = metrics.precision_score(test_y, predict)
            recall = metrics.recall_score(test_y, predict)
            print 'precision: %.2f%%, recall: %.2f%%' % (100 * precision, 100 * recall)
        accuracy = metrics.accuracy_score(test_y, predict)
        print 'accuracy: %.2f%%' % (100 * accuracy)

    if model_save_file != None:
        pickle.dump(model_save, open(model_save_file, 'wb'))
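Because the test harness only touches the data through read_data, pointing it at a different dataset means swapping that one function. As a hypothetical variant (not part of the original script), here is a read_data that ignores its argument and loads scikit-learn's small built-in digits dataset instead; it uses the old sklearn.cross_validation module for consistency with the sklearn.grid_search import above.

# Hypothetical replacement for read_data: any function returning
# (train_x, train_y, test_x, test_y) plugs into the same test loop.
def read_data(data_file):
    from sklearn.datasets import load_digits
    from sklearn.cross_validation import train_test_split
    digits = load_digits()  # 1797 8x8 digit images, 64 features each
    train_x, test_x, train_y, test_y = train_test_split(
        digits.data, digits.target, test_size=0.2, random_state=0)
    return train_x, train_y, test_x, test_y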
3. Test results
This experiment uses the MNIST handwritten digit dataset, http://deeplearning.net/data/mnist/mnist.pkl.gz, with 50,000 training samples and 10,000 test samples.
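For reference, the pickle inside mnist.pkl.gz holds three (data, labels) pairs (train, validation, and test), and the read_data above simply discards the validation split. A quick way to check what you downloaded (assuming the file sits in the working directory):

# Peek at the splits inside mnist.pkl.gz.
import gzip
import cPickle as pickle

with gzip.open('mnist.pkl.gz', 'rb') as f:
    train, val, test = pickle.load(f)
print 'train:', train[0].shape   # (50000, 784)
print 'val:  ', val[0].shape     # (10000, 784)
print 'test: ', test[0].shape    # (10000, 784)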
Running the code produces the following output:
reading training and testing data...
******************** Data Info *********************
#training data: 50000, #testing_data: 10000, dimension: 784
******************* NB ********************
training took 0.287000s!
accuracy: 83.69%
******************* KNN ********************
training took 31.991000s!
accuracy: 96.64%
******************* LR ********************
training took 101.282000s!
accuracy: 91.99%
******************* RF ********************
training took 5.442000s!
accuracy: 93.78%
******************* DT ********************
training took 28.326000s!
accuracy: 87.23%
******************* SVM ********************
training took 3152.369000s!
accuracy: 94.35%
******************* GBDT ********************
training took 7623.761000s!
accuracy: 96.18%
On this dataset the classes form fairly tight clusters (if you know this dataset, one look at its t-SNE embedding shows it; the task is simple enough that the deep learning community regards it as a toy dataset), so KNN performs quite well. GBDT is an excellent algorithm: in Kaggle and other data competitions it routinely appears among the top finishers. "Three cobblers together beat Zhuge Liang" proves true once again, especially when the cobblers have complementary strengths!
Another method that works very well in practice is to fuse these classifiers and then make the final decision jointly, for example by simple voting, which tends to give very good results. I recommend everyone try it in practice; see the sketch below.
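As a rough illustration (my own sketch, not the original author's code), simple majority voting over already-trained models can be hand-rolled in a few lines; newer scikit-learn versions also ship a built-in VotingClassifier that does the same job.

# Fuse classifiers by majority vote over their individual predictions.
import numpy as np
from scipy.stats import mode

def vote_predict(models, test_x):
    # One row of predictions per model; take the most common label per sample.
    all_preds = np.array([m.predict(test_x) for m in models])
    majority, _ = mode(all_preds, axis=0)
    return majority.ravel()

# e.g. fuse the three strongest single models from the results above:
# models = [classifiers[name](train_x, train_y) for name in ['KNN', 'RF', 'GBDT']]
# predict = vote_predict(models, test_x)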