After finishing Andrew Ng's Machine Learning course on Coursera, I couldn't wait to enter a Kaggle competition, only to find that the jump from theory to practice is really hard. This post records the learning process.
Most tutorials recommend Jupyter Notebook for data-science programming. We install Jupyter Notebook and the Python libraries we need through Anaconda; I reinstalled Anaconda following the steps below, on Windows 10.
Anaconda installation
Following the two articles below, I configured Jupyter Notebook and learned the basic operations:
Common Jupyter Notebook keyboard shortcuts
Official documentation
Official tutorial
Chinese translation of the official tutorial
The matplotlib lecture on Jupyter Notebook Viewer
I suggest working through the official tutorial first, getting familiar with the basic operations through line charts (a minimal sketch follows after these recommendations), and then reading chapters 3 to 6 of the introductory tutorial to learn how to draw the various chart types.
The two tutorials above are for a quick start; for a more thorough treatment, see the book below, written by the author of pandas.
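To get a feel for the line-chart basics mentioned above, here is a minimal matplotlib sketch; the data is made up purely for illustration:

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)              # 100 evenly spaced points
plt.plot(x, np.sin(x), label='sin(x)')   # one line per call to plot()
plt.plot(x, np.cos(x), label='cos(x)')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()                               # in Jupyter, %matplotlib inline renders the figure in the notebook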
Feature engineering:
A very important step in machine learning is processing the features. Drawing on the article below, here are some commonly used feature-processing methods and how to use them in sklearn.
# Standardization (zero mean, unit variance); fit the scaler on the training set and reuse it on the test set
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_train = scaler.fit_transform(data_train)
data_test = scaler.transform(data_test)
# Min-max scaling to the [0, 1] interval
from sklearn.preprocessing import MinMaxScaler
data = MinMaxScaler().fit_transform(data)
# Normalization (scale each sample to unit norm)
from sklearn.preprocessing import Normalizer
data = Normalizer().fit_transform(data)
# Binarization: values above the threshold epsilon become 1, the rest 0
from sklearn.preprocessing import Binarizer
data = Binarizer(threshold=epsilon).fit_transform(data)
In practice this means keeping the numeric features as they are and converting each categorical feature into dummy variables (one-hot encoding); see: using DictVectorizer in Python
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(X_train.to_dict(orient='records'))
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# Select the K best features (here ranked by the chi-squared test) and return the reduced data
skb = SelectKBest(chi2, k=10).fit(X_train, y_train)
X_train = skb.transform(X_train)
X_test = skb.transform(X_test)
from sklearn.feature_selection import SelectKBest
from sklearn.datasets import load_iris
from minepy import MINE
import numpy as np

iris = load_iris()

# MINE is not designed in a functional style, so wrap it in mic(), which returns a
# (score, p-value) tuple; the p-value is fixed at 0.5 as a placeholder
def mic(x, y):
    m = MINE()
    m.compute_score(x, y)
    return (m.mic(), 0.5)

# Select the K best features and return the reduced data
SelectKBest(lambda X, Y: tuple(np.array([mic(x, Y) for x in X.T]).T), k=2).fit_transform(iris.data, iris.target)
from sklearn.decomposition import PCA
estimator = PCA(n_components=2)  # number of principal components to keep
X_pca = estimator.fit_transform(X_data)
Learning algorithms:
Splitting into training and test sets:
from sklearn.cross_validation import train_test_split  # sklearn.cross_validation was replaced by sklearn.model_selection in newer versions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)
Training:
# Generic pattern: import the learning algorithm you need (LearnAlgorithm is a placeholder name)
from sklearn import LearnAlgorithm
la = LearnAlgorithm()
la.fit(X_train, y_train)
y_predict = la.predict(X_test)
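To make the pattern above concrete, here is a minimal runnable sketch; LogisticRegression and the iris dataset are my own choice of example, not something used in the original text:

from sklearn.datasets import load_iris
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=33)

lr = LogisticRegression()       # the concrete learning algorithm
lr.fit(X_train, y_train)        # learn from the training set
y_predict = lr.predict(X_test)  # predict on the test set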
Stochastic gradient descent (SGD):
from sklearn.linear_model import SGDClassifier
sgd = SGDClassifier()  # classification
from sklearn.linear_model import SGDRegressor
sgd = SGDRegressor(loss='squared_loss', penalty=None, random_state=42)  # regression
Support vector machines (SVM):
Support vector classification (SVC):
from sklearn.svm import SVC
svc_linear = SVC(kernel='linear')  # linear kernel; other kernels can be used
Support vector regression (SVR):
from sklearn.svm import SVR
svr_linear = SVR(kernel='linear')  # linear kernel; other kernels such as poly or rbf can be used
Naive Bayes:
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
Decision tree (DecisionTreeClassifier):
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=5)  # max depth and min samples per leaf guard against overfitting
Random forest (RandomForestClassifier):
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(max_depth=3, min_samples_leaf=5)
Gradient boosted decision trees (GBDT):
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(max_depth=3, min_samples_leaf=5)
Extremely randomized trees for regression (ExtraTreesRegressor):
from sklearn.ensemble import ExtraTreesRegressor
etr = ExtraTreesRegressor()
Evaluation:
from sklearn import metrics
accuracy_rate = metrics.accuracy_score(y_test, y_predict)
print(metrics.classification_report(y_test, y_predict, target_names=data.target_names))  # reports precision, recall, F1, etc.
K-fold cross-validation:
from sklearn.cross_validation import cross_val_score, KFold
cv = KFold(len(y), K, shuffle=True, random_state=0)  # K is the number of folds
scores = cross_val_score(clf, X, y, cv=cv)
or
from sklearn.cross_validation import cross_val_score
scores = cross_val_score(dt, X_train, y_train, cv=K)
Note that X and y here need to be ndarrays; if they are DataFrames, convert them with df.values and df.values.flatten(), for example:
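A minimal sketch of that conversion; the DataFrame and its column names are made up for illustration:

import pandas as pd

df = pd.DataFrame({'f1': [1, 2, 3], 'f2': [4, 5, 6], 'label': [0, 1, 0]})
X = df[['f1', 'f2']].values           # 2-D ndarray of features
y = df[['label']].values.flatten()    # 1-D ndarray of targets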
The Pipeline mechanism:
The Pipeline mechanism wraps all of the steps into one streamlined, managed object, so that the same set of parameters can be reused on a dataset. A Pipeline object takes a list of 2-tuples: the first element is a name you choose, and the second is an sklearn transformer or estimator, i.e. a feature-processing step or a learning method. Taking naive Bayes as an example, different ways of processing the features give the following code:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

clf_1 = Pipeline([('count_vec', CountVectorizer()), ('mnb', MultinomialNB())])
clf_2 = Pipeline([('hash_vec', HashingVectorizer(non_negative=True)), ('mnb', MultinomialNB())])
clf_3 = Pipeline([('tfidf_vec', TfidfVectorizer()), ('mnb', MultinomialNB())])
Feature selection:
from sklearn import feature_selection
# Keep the top `per` percent of features, ranked by the chi-squared test
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=per)
X_train_fs = fs.fit_transform(X_train, y_train)
Taking feature selection with 5-fold cross-validation as an example, here is a complete parameter-selection procedure:
from sklearn import feature_selection
from sklearn.cross_validation import cross_val_score
import numpy as np

percentiles = range(1, 100)
results = []
for i in percentiles:
    fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=i)
    X_train_fs = fs.fit_transform(X_train, y_train)
    scores = cross_val_score(dt, X_train_fs, y_train, cv=5)
    results = np.append(results, scores.mean())

# Pick the percentile with the best mean cross-validation score
opt = percentiles[int(np.where(results == results.max())[0][0])]
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=opt)
X_train_fs = fs.fit_transform(X_train, y_train)
dt.fit(X_train_fs, y_train)
y_predict = dt.predict(fs.transform(X_test))
Hyperparameters:
Hyperparameters are the framework-level parameters of a machine-learning model, set before training rather than learned from the data; they matter a great deal in both competitions and real-world engineering.
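For example, in the decision tree used earlier, max_depth and min_samples_leaf are hyperparameters that we fix before training, while the split thresholds inside the tree are learned by fit(); a minimal illustration, using the iris data as an assumed example:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
# Hyperparameters: chosen by us before training
dtc = DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=5)
# Ordinary parameters (the tree structure and split thresholds): learned from the data
dtc.fit(iris.data, iris.target)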
Ensemble learning:
Ensemble learning improves overall performance by fusing several models together, e.g. random forests and XGBoost; see the article below, and the sketch after it:
Ensemble Learning - model fusion - a Python implementation
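As one simple form of model fusion, sklearn's VotingClassifier lets several base models vote on each prediction. This sketch is my own illustration (not the approach from the linked article) and assumes a sklearn version that provides VotingClassifier:

from sklearn.datasets import load_iris
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=33)

# Majority voting over three different base classifiers
clf = VotingClassifier(estimators=[('lr', LogisticRegression()),
                                   ('dt', DecisionTreeClassifier(max_depth=3)),
                                   ('rf', RandomForestClassifier(n_estimators=50))],
                       voting='hard')
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))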
Parallel grid search:
Grid search is used to find the optimal hyperparameters; see the following:
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV  # both live in sklearn.model_selection in newer versions
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
import numpy as np

# news is assumed to be a text dataset with .data and .target, loaded earlier
X_train, X_test, y_train, y_test = train_test_split(news.data[:3000], news.target[:3000], test_size=0.25, random_state=33)

clf = Pipeline([('vect', TfidfVectorizer(stop_words='english', analyzer='word')), ('svc', SVC())])

# Grid of hyperparameters to try for the SVC step; n_jobs=-1 runs the search on all available cores
parameters = {'svc__gamma': np.logspace(-2, 1, 4), 'svc__C': np.logspace(-1, 1, 3)}
gs = GridSearchCV(clf, parameters, verbose=2, refit=True, cv=3, n_jobs=-1)

%time _ = gs.fit(X_train, y_train)  # %time is an IPython/Jupyter magic; in plain Python just call gs.fit(...)
print(gs.best_params_, gs.best_score_)
print(gs.score(X_test, y_test))
After working through all of the above, and with the article below as a reference, you should already be able to complete some of the simpler Kaggle contests.