Data Correlation Visualization and Cross-Validation Prediction Analysis: A Big Data ML Sample Set Case Study

Copyright notice: This technical column is the author's (Qin Kaixin) distillation of day-to-day work, with case studies drawn from real business environments, plus tuning advice for commercial applications and cluster capacity planning. Please keep following this blog. QQ email: 1120746959@qq.com; feel free to get in touch for academic exchange.

1 Data Preprocessing

  • Add a header to the DataFrame

    5.1,3.5,1.4,0.2,Iris-setosa
      4.9,3.0,1.4,0.2,Iris-setosa
      4.7,3.2,1.3,0.2,Iris-setosa
      4.6,3.1,1.5,0.2,Iris-setosa
      5.0,3.6,1.4,0.2,Iris-setosa
      5.4,3.9,1.7,0.4,Iris-setosa
      4.6,3.4,1.4,0.3,Iris-setosa
    
      import pandas as pd
      import matplotlib.pyplot as plt
      import numpy as np

      # iris.data ships without a header row, so read it with header=None;
      # otherwise the first record is consumed as column names and lost
      iris_data = pd.read_csv('C:\\ML\\MLData\\iris.data', header=None)
      iris_data.columns = ['sepal_length_cm', 'sepal_width_cm', 'petal_length_cm', 'petal_width_cm', 'class']
      iris_data.head()

  • Reading an image

      from PIL import Image

      # Load a sample image and display it with matplotlib
      img = Image.open('test.jpg')
      plt.imshow(img)
      plt.show()

  • Numeric description (value ranges)

    iris_data.describe()

  • Advanced visualization: pairplot

      %matplotlib inline

      import matplotlib.pyplot as plt
      import seaborn as sb

      # Pairwise scatter plots of all four features, colored by class
      sb.pairplot(iris_data.dropna(), hue='class')

  • Advanced visualization: violinplot distribution ranges (the petal features separate the classes relatively well)

      plt.figure(figsize=(10, 10))

      # One violin plot per feature; skip the label column
      for column_index, column in enumerate(iris_data.columns):
          if column == 'class':
              continue
          plt.subplot(2, 2, column_index + 1)
          sb.violinplot(x='class', y=column, data=iris_data)


2 Building the Classifier (sklearn.cross_validation is deprecated)

  • Training and test sets

      from sklearn.model_selection import train_test_split

      all_inputs = iris_data[['sepal_length_cm', 'sepal_width_cm',
                              'petal_length_cm', 'petal_width_cm']].values

      all_classes = iris_data['class'].values

      # Hold out 25% of the rows as a test set
      (training_inputs,
       testing_inputs,
       training_classes,
       testing_classes) = train_test_split(all_inputs, all_classes, train_size=0.75, random_state=1)
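    A quick sanity check on the split sizes (a minimal sketch, assuming the 150-row iris file loaded above):

      print(training_inputs.shape, testing_inputs.shape)
      # (112, 4) (38, 4) -- 75% / 25% of the 150 rows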
  • Parameter settings in detail

      from sklearn.tree import DecisionTreeClassifier

      #  1. criterion: gini or entropy (measure split quality by the Gini index or by entropy)

      #  2. splitter: best or random. 'best' searches all features for the best split point;
      #               'random' searches a subset of features (useful when the dataset is large).

      #  3. max_features: None (use all features). With fewer than ~50 features, using all of
      #                   them is the usual choice; alternatives are log2, sqrt, or an integer N.

      #  4. max_depth: with little data or few features this can be left unset; with many
      #                samples and many features, try limiting the depth.

      #  5. min_samples_split: if a node holds fewer samples than min_samples_split, no further
      #                        split of that node is attempted. For small datasets, ignore it;
      #                        for very large datasets, increase it.

      #  6. min_samples_leaf: the minimum number of samples at a leaf. If a leaf falls below it,
      #                       the leaf is pruned together with its siblings. Ignore it for small
      #                       datasets; for large ones (say 100k samples) try something like 5.

      #  7. min_weight_fraction_leaf: the minimum total sample weight at a leaf; leaves below it
      #                          are pruned with their siblings. The default 0 ignores weights.
      #                          It matters when many samples have missing values, or the class
      #                          distribution is heavily skewed and sample weights are in use.

      #  8. max_leaf_nodes: capping the number of leaves guards against overfitting; the default
      #                   "None" means no cap. When capped, the algorithm builds the best tree
      #                   it can within that budget. With few features it can be ignored; with
      #                   many, consider a cap and pick the value via cross-validation.

      #  9. class_weight: per-class weights, mainly to keep classes with many training samples
      #                 from dominating the tree. Weights can be given explicitly, or 'balanced'
      #                 computes them automatically (rarer classes receive higher weight).

      #  10. min_impurity_split: limits tree growth; a node whose impurity (Gini, information
      #                        gain, MSE, or MAE) is below this threshold becomes a leaf.
      #                        (Deprecated in recent sklearn in favor of min_impurity_decrease.)

      decision_tree_classifier = DecisionTreeClassifier()

      # Train the classifier on the training set
      decision_tree_classifier.fit(training_inputs, training_classes)

      # Validate the classifier on the testing set using classification accuracy
      decision_tree_classifier.score(testing_inputs, testing_classes)

      0.9736842105263158
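    To make these knobs concrete, here is a minimal sketch that sets a few of them explicitly; the values are illustrative, not tuned:

      from sklearn.tree import DecisionTreeClassifier

      # Illustrative settings only -- pick real values via cross-validation
      clf = DecisionTreeClassifier(
          criterion='entropy',       # split on information gain rather than Gini
          max_depth=4,               # cap the depth to curb overfitting
          min_samples_leaf=5,        # prune leaves smaller than 5 samples
          class_weight='balanced')   # reweight classes inversely to frequency
      clf.fit(training_inputs, training_classes)
      clf.score(testing_inputs, testing_classes)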

3 Cross-Validation

from sklearn.model_selection import KFold

# The old sklearn.cross_validation module has been deprecated in favor of
# sklearn.model_selection:
# deprecated: from sklearn.cross_validation import cross_val_score

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import numpy as np

decision_tree_classifier = DecisionTreeClassifier()
# cross_val_score returns a list of the scores, which we can visualize
# to get a reasonable estimate of our classifier's performance
# 10-fold cross-validation
cv_scores = cross_val_score(decision_tree_classifier, all_inputs, all_classes, cv=10)
print(cv_scores)
# kde=False would draw a plain histogram without the density curve
sb.distplot(cv_scores)
plt.title('Average score: {}'.format(np.mean(cv_scores)))

[1.         0.93333333 1.         0.93333333 0.93333333 0.86666667
 0.93333333 0.93333333 1.         1.        ]
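Under the hood, cv=10 partitions the data into ten folds and fits a fresh model on each. A rough manual equivalent is sketched below (note that for classifiers, cross_val_score actually stratifies the folds by class rather than using plain KFold):

from sklearn.base import clone
from sklearn.model_selection import KFold
import numpy as np

kf = KFold(n_splits=10, shuffle=True, random_state=1)
scores = []
for train_idx, test_idx in kf.split(all_inputs):
    fold_model = clone(decision_tree_classifier)   # untrained copy for this fold
    fold_model.fit(all_inputs[train_idx], all_classes[train_idx])
    scores.append(fold_model.score(all_inputs[test_idx], all_classes[test_idx]))
print(np.mean(scores))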

# A deliberately shallow tree: with max_depth=1 the model is a decision
# stump and underfits, so the cross-validation scores drop noticeably
decision_tree_classifier = DecisionTreeClassifier(max_depth=1)

cv_scores = cross_val_score(decision_tree_classifier, all_inputs, all_classes, cv=10)
print(cv_scores)
sb.distplot(cv_scores, kde=False)
plt.title('Average score: {}'.format(np.mean(cv_scores)))

  • 4 Parameter grid

      from sklearn.model_selection import GridSearchCV
      from sklearn.model_selection import StratifiedKFold

      decision_tree_classifier = DecisionTreeClassifier()

      # Try every combination of max_depth and max_features (5 x 4 = 20 candidates),
      # each evaluated with stratified 10-fold cross-validation
      parameter_grid = {'max_depth': [1, 2, 3, 4, 5],
                        'max_features': [1, 2, 3, 4]}
      cross_validation = StratifiedKFold(10)

      grid_search = GridSearchCV(decision_tree_classifier,
                                 param_grid=parameter_grid,
                                 cv=cross_validation)

      grid_search.fit(all_inputs, all_classes)
      print('Best score: {}'.format(grid_search.best_score_))
      print('Best parameters: {}'.format(grid_search.best_params_))
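    After the search, GridSearchCV refits the best configuration on the full data and exposes it as best_estimator_. A short usage sketch; the sample values are the first record shown in section 1:

      best_tree = grid_search.best_estimator_
      sample = [[5.1, 3.5, 1.4, 0.2]]   # sepal/petal measurements of one flower
      print(best_tree.predict(sample))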
  • 5 Plotting the grid scores as a heatmap

      grid_visualization = []

      for grid_pair in grid_search.cv_results_['mean_test_score']:
          grid_visualization.append(grid_pair)

      # One mean test score per grid point; reshape to
      # 5 max_depth rows x 4 max_features columns
      grid_visualization = np.array(grid_visualization)
      grid_visualization.shape = (5, 4)
      # Let seaborn align the tick labels with the rows and columns directly
      sb.heatmap(grid_visualization, cmap='Blues',
                 xticklabels=grid_search.param_grid['max_features'],
                 yticklabels=grid_search.param_grid['max_depth'])
      plt.xlabel('max_features')
      plt.ylabel('max_depth')

  • 6 Generating the decision tree's iris_dtc.dot file

      import sklearn.tree as tree

      # export_graphviz needs a fitted tree, so reuse the best model from the grid search
      decision_tree_classifier = grid_search.best_estimator_

      with open('C:\\ML\\MLData\\iris_dtc.dot', 'w') as out_file:
          tree.export_graphviz(decision_tree_classifier, out_file=out_file)
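    export_graphviz can also return the dot source as a string (out_file=None), which is convenient for a quick look without touching the filesystem; a minimal sketch:

      dot_source = tree.export_graphviz(decision_tree_classifier, out_file=None,
                                        feature_names=['sepal_length_cm', 'sepal_width_cm',
                                                       'petal_length_cm', 'petal_width_cm'])
      print(dot_source[:300])   # first few lines of the graph description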
  • 7 Download the renderer (Graphviz)

    http://www.graphviz.org/

     Graphviz is open source graph visualization software. Graph visualization is a way of representing
     structural information as diagrams of abstract graphs and networks. It has important applications in
     networking, bioinformatics, software engineering, database and web design, machine learning, and in
     visual interfaces for other technical domains.

dot -Tpdf iris_dtc.dot -o iris.pdf
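The same conversion can be driven from Python via subprocess, assuming the dot binary from Graphviz is on the PATH:

import subprocess
# Render the exported tree to PDF by invoking Graphviz's dot
subprocess.run(['dot', '-Tpdf', 'iris_dtc.dot', '-o', 'iris.pdf'], check=True)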

  • 8 Multi-parameter grid with cross-validation (current API)

      from sklearn.ensemble import RandomForestClassifier
      from sklearn.model_selection import GridSearchCV
      from sklearn.model_selection import StratifiedKFold

      random_forest_classifier = RandomForestClassifier()

      # 4 x 2 x 4 x 2 = 64 candidate configurations, each scored by stratified 10-fold CV
      parameter_grid = {'n_estimators': [5, 10, 25, 50],
                        'criterion': ['gini', 'entropy'],
                        'max_features': [1, 2, 3, 4],
                        'warm_start': [True, False]}

      cross_validation = StratifiedKFold(10)

      grid_search = GridSearchCV(random_forest_classifier,
                                 param_grid=parameter_grid,
                                 cv=cross_validation)

      grid_search.fit(all_inputs, all_classes)
      print('Best score: {}'.format(grid_search.best_score_))
      print('Best parameters: {}'.format(grid_search.best_params_))

      Best score: 0.9664429530201343
      Best parameters: {'criterion': 'gini', 'max_features': 2, 'n_estimators': 5, 'warm_start': False}

      grid_search.best_estimator_

      RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
          max_depth=None, max_features=2, max_leaf_nodes=None,
          min_impurity_decrease=0.0, min_impurity_split=None,
          min_samples_leaf=1, min_samples_split=2,
          min_weight_fraction_leaf=0.0, n_estimators=5, n_jobs=None,
          oob_score=False, random_state=None, verbose=0,
          warm_start=False)
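    The full grid results live in grid_search.cv_results_ and are easiest to browse as a pandas DataFrame; a short sketch:

      import pandas as pd

      # One row per candidate configuration, sorted by mean CV score
      results = pd.DataFrame(grid_search.cv_results_)
      print(results[['param_n_estimators', 'param_criterion', 'param_max_features',
                     'mean_test_score']].sort_values('mean_test_score', ascending=False).head())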

4 Summary

When using sklearn's multi-parameter grid search and cross-validation, the library version matters a great deal: the APIs shown here live in sklearn.model_selection, and code written against the old sklearn.cross_validation module will not run.


Qin Kaixin, Shenzhen, 2018-12-08 22:35
