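We use a decision tree to build software that can block banner ads on web pages.
Given the data describing an image, the model decides whether the image is an advertisement or part of the article content.
The data comes from http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements
It contains data for 3279 images, and the classes are imbalanced: 459 images are advertisements and the other 2820 are article content.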
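To make the imbalance concrete, here is a small sketch that works out the class shares from the counts quoted above. The variable names are illustrative and are not part of the original code.

```python
# Class counts quoted in the dataset description above.
n_ads = 459
n_content = 2820
n_total = n_ads + n_content

ad_share = n_ads / n_total
# Accuracy of a trivial classifier that labels every image as article content.
baseline_accuracy = n_content / n_total

print("total images: %d" % n_total)
print("ad share: %.1f%%" % (100 * ad_share))
print("always-content accuracy: %.1f%%" % (100 * baseline_accuracy))
```

Since always predicting "content" already reaches roughly 86% accuracy, accuracy alone says little about how well ads are detected, which is presumably why the grid search in this post is scored with F1.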
First, load the data and preprocess it
# -*- coding: utf-8 -*-
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

df = pd.read_csv('ad-dataset/ad.data', header=None)
variable_col = set(df.columns.values)  # indices of all columns
variable_col.remove(len(df.columns.values) - 1)  # the last column is the label
label_col = df[len(df.columns.values) - 1]  # extract the label column
y = [1 if e == 'ad.' else 0 for e in label_col]  # convert the labels to 0/1
X = df[list(variable_col)].copy()  # use all remaining columns as X
# Missing values appear as "?" preceded by optional spaces; replace them with -1
X.replace(to_replace=r' *\?', value=-1, regex=True, inplace=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
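The missing-value step is easier to follow on a tiny example. The sketch below uses a hypothetical three-row frame in the style of ad.data (the values are made up for illustration) and assumes that missing entries are a "?" padded with spaces, as in the preprocessing code above. When the replacement value is not a string, pandas replaces the whole matching cell.

```python
import pandas as pd

# Hypothetical sample: numeric fields arrive as strings, and missing
# entries appear as "?" padded with spaces.
sample = pd.DataFrame(
    {0: [" 125", "   ?", "  60"], 1: [" 125", " 468", "   ?"]},
    dtype=object,
)

# Any cell whose text matches the pattern is replaced entirely by -1.
cleaned = sample.replace(to_replace=r" *\?", value=-1, regex=True)

# Convert the remaining numeric strings to floats.
cleaned = cleaned.astype(float)
print(cleaned)
```

Note that the question mark must be escaped. An unescaped pattern such as ' *?' can match the empty string, so it would match every string cell rather than only the missing ones.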
Build the decision tree and tune the model with grid search
# In[1] Tune the model with grid search
pipeline = Pipeline([
    ('clf', DecisionTreeClassifier(criterion='entropy'))
])
parameters = {
    'clf__max_depth': (150, 155, 160),
    'clf__min_samples_split': (2, 3),
    'clf__min_samples_leaf': (1, 2, 3)
}
# GridSearchCV systematically tries every parameter combination and uses
# cross-validation to determine which one performs best.
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=0, scoring='f1')
grid_search.fit(X_train, y_train)

# Retrieve the best parameters found by the search
best_parameters = grid_search.best_estimator_.get_params()
print("Best F1 score:", grid_search.best_score_)
print('Best parameters:')
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))

Best F1 score: 0.8753026365252053
Best parameters:
	clf__max_depth: 160
	clf__min_samples_leaf: 1
	clf__min_samples_split: 3
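To get a feel for how much work the search does, the following sketch enumerates the same grid with scikit-learn's ParameterGrid. The fold count is an assumption for illustration only, because the default number of cross-validation folds in GridSearchCV depends on the scikit-learn version.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above.
parameters = {
    'clf__max_depth': (150, 155, 160),
    'clf__min_samples_split': (2, 3),
    'clf__min_samples_leaf': (1, 2, 3),
}

grid = ParameterGrid(parameters)
n_candidates = len(grid)  # 3 * 2 * 3 combinations

# Hypothetical fold count; check the default of your scikit-learn version.
n_folds = 5
print("candidates:", n_candidates)
print("fits with %d folds: %d" % (n_folds, n_candidates * n_folds))
```

Every candidate is fitted once per fold, so the cost grows multiplicatively with each parameter added to the grid.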
Evaluate the model
# In[2] Predict on the test set and evaluate the results
predictions = grid_search.predict(X_test)
print(classification_report(y_test, predictions))
              precision    recall  f1-score   support

           0       0.98      0.99      0.98       695
           1       0.93      0.89      0.91       125

   micro avg       0.97      0.97      0.97       820
   macro avg       0.95      0.94      0.94       820
weighted avg       0.97      0.97      0.97       820
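As a quick sanity check on the report, the F1 score of each class is the harmonic mean of its precision and recall. The sketch below recomputes it from the rounded figures shown above, so the results are approximate. The helper name f1_score_from is made up for this example.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded precision and recall taken from the report above.
f1_content = f1_score_from(0.98, 0.99)  # class 0: article content
f1_ad = f1_score_from(0.93, 0.89)       # class 1: advertisement

print("class 0 F1: %.2f" % f1_content)
print("class 1 F1: %.2f" % f1_ad)
```

The ad class scores lower than the content class on both precision and recall, which is typical for the minority class in an imbalanced dataset.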