Random-Forest-Python

1. Near-term goal: implement point cloud classification with a random forest in Python

  1) Learning phase:

[Notes] Kaggle data mining competition experience sharing

Kaggle Machine Learning Competition: Predicting Titanic Survivors

Kaggle Titanic survival prediction -- a detailed walkthrough

Machine Learning in Action: Kaggle Titanic prediction

https://www.codeproject.com/Articles/1197167/Random-Forest-Python

https://blog.csdn.net/hexingwei/article/details/50740404

 

  2) Practice phase:

  (1) The original point cloud fields are (X, Y, Z, density, curvature, Classification). Training and classification used the elevation Z, density, and curvature. The classification results were simply poor.

    Which features have the greatest influence on the classification result? Which point cloud features work better? This is a feature engineering problem.

# -*- coding: utf-8 -*-
"""
Created on Sat Nov 10 10:12:02 2018
@author: yhexie
"""
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Training data: Z (elevation), Volume (density), Ncr (curvature)
df = pd.read_csv('C:/Users/yhexie/.spyder-py3/pointcloudcls/train_pcloud2.csv', header=0)
x_train = df[['Z', 'Volume', 'Ncr']]
y_train = df.Classification

df2 = pd.read_csv('C:/Users/yhexie/.spyder-py3/pointcloudcls/test_pcloud2.csv', header=0)
x_test = df2[['Z', 'Volume', 'Ncr']]

clf = RandomForestClassifier(n_estimators=10)
clf.fit(x_train, y_train)
clf_y_predict = clf.predict(x_test)

# Write X, Y, Z and the predicted class back out as a CSV
data_array = [df2.X, df2.Y, df2.Z, clf_y_predict]
np_data = np.array(data_array).T
save = pd.DataFrame(np_data, columns=['X', 'Y', 'Z', 'Classification'])
save.to_csv('C:/Users/yhexie/.spyder-py3/pointcloudcls/predict_pcloud2.csv',
            index=False, header=True)  # index=False, header=False would drop the row index and column headers
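Since the test CSV above has no labels, there is no way to score the predictions directly. A minimal sketch of one alternative, k-fold cross-validation on the labeled training data with scikit-learn's `cross_val_score`; the synthetic arrays here stand in for the real `['Z','Volume','Ncr']` columns and `Classification` labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(300, 3)                       # stand-in for df[['Z','Volume','Ncr']]
y = (X[:, 0] > 0.5).astype(int)            # stand-in for df.Classification

clf = RandomForestClassifier(n_estimators=10, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)  # accuracy on 5 held-out folds
print(scores.mean())
```

Each fold is held out once while the forest trains on the other four, so the mean score is an estimate of generalization accuracy without touching the unlabeled test file.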

  (2) Split the training set: train on 75% of the data and use the other 25% to check the model's fitting accuracy and generalization ability.

    a. Add qualitative features and apply dummy (one-hot) encoding.

  Currently classification uses the Z value plus 8 eigenvalue-based point cloud features, with a neighborhood search radius of 2.5 m.
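A minimal sketch of the dummy encoding step mentioned in a., using `pandas.get_dummies`; the categorical column `ReturnType` is a hypothetical example, not a field from the original CSVs:

```python
import pandas as pd

# Hypothetical qualitative feature alongside a numeric one
df = pd.DataFrame({'Z': [1.2, 3.4, 0.8],
                   'ReturnType': ['first', 'last', 'first']})

# One-hot encode the categorical column
df = pd.get_dummies(df, columns=['ReturnType'])
print(df.columns.tolist())
```

The string column is replaced by 0/1 indicator columns (`ReturnType_first`, `ReturnType_last`), which a RandomForestClassifier can consume directly.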

# -*- coding: utf-8 -*-
"""
Created on Wed Nov 28 10:54:48 2018

@author: yhexie
"""

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# train_test_split lives in sklearn.model_selection
# (sklearn.cross_validation was removed in modern scikit-learn)
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

features = ['Z', 'Linearity', 'Planarity', 'Scattering', 'Omnivariance',
            'Anisotropy', 'EigenEntropy', 'eig_sum', 'changeOfcurvature']

df = pd.read_csv('C:/Users/yhexie/.spyder-py3/pointcloudcls/train_pc.csv', header=0)
x_train = df[features]
y_train = df.Classification

# Hold out 25% of the labeled data for validation
train_data_X, test_data_X, train_data_Y, test_data_Y = train_test_split(
    x_train, y_train, test_size=0.25, random_state=33)

df2 = pd.read_csv('C:/Users/yhexie/.spyder-py3/pointcloudcls/test_pc.csv', header=0)
x_test = df2[features]

clf = RandomForestClassifier(n_estimators=10)
clf.fit(train_data_X, train_data_Y)

print('Accuracy on training set:{:.3f}:'.format(clf.score(train_data_X, train_data_Y)))
print('Accuracy on test set:{:.3f}:'.format(clf.score(test_data_X, test_data_Y)))
print('Feature importances:{}'.format(clf.feature_importances_))

# Bar chart of feature importances
n_features = len(features)
plt.barh(range(n_features), clf.feature_importances_, align='center')
plt.yticks(np.arange(n_features), features)
plt.xlabel('Feature importance')
plt.ylabel('Feature')

clf_y_predict = clf.predict(x_test)

# Write X, Y, Z and the predicted class back out as a CSV
data_array = [df2.X, df2.Y, df2.Z, clf_y_predict]
np_data = np.array(data_array).T
save = pd.DataFrame(np_data, columns=['X', 'Y', 'Z', 'Classification'])
save.to_csv('C:/Users/yhexie/.spyder-py3/pointcloudcls/predict_pcloud2.csv',
            index=False, header=True)  # index=False, header=False would drop the row index and column headers

  Results: you can see the accuracy on the test set is still poor.

Accuracy on training set:0.984:
Accuracy on test set:0.776:

 Feature importances:

 


A new test:

Accuracy on training set:0.994:
Accuracy on test set:0.891:
Feature importances:[0.02188956 0.02742479 0.10124688 0.01996966 0.1253002  0.02563489
 0.03265565 0.100919   0.15808224 0.01937961 0.02727676 0.05498342
 0.0211147  0.02387439 0.01900164 0.023478   0.02833916 0.0302441
 0.02249598 0.06629199 0.05039737]

感受Z值的重要程度過高了。房屋分類結果應該是不好,綠色的不少被錯誤分類了。

Question: the classes in the current training set do not have equal numbers of samples. Does this affect the training result?
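Class imbalance can bias a random forest toward the majority class. One common remedy in scikit-learn is `class_weight='balanced'`, which reweights samples inversely to class frequency; a minimal sketch on synthetic imbalanced data (the real CSV columns are not used here):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = np.where(rng.rand(200) < 0.9, 1, 2)   # imbalanced: roughly 90% class 1

# 'balanced' weights each class by n_samples / (n_classes * class_count),
# so the minority class contributes as much to the split criterion
clf = RandomForestClassifier(n_estimators=10, class_weight='balanced',
                             random_state=0)
clf.fit(X, y)
```

Whether this actually helps here would have to be checked against per-class recall on the hold-out set; resampling (over/under-sampling) is the other common option.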
