Credit Approval Data (UCI's crx.data): let's first take a look at the format of the data.
A1 through A15 are 15 features of different types, and A16 is the label column. There are 690 rows in total; one of them is shown below as an example:
A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | A15 | A16 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
b | 30.83 | 0 | u | g | w | v | 1.25 | t | t | 01 | f | g | 00202 | 0 | + |
The possible values of each attribute (from the dataset documentation):

- A1: b, a
- A2: continuous
- A3: continuous
- A4: u, y, l, t
- A5: g, p, gg
- A6: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff
- A7: v, h, bb, j, n, z, dd, ff, o
- A8: continuous
- A9: t, f
- A10: t, f
- A11: continuous
- A12: t, f
- A13: g, p, s
- A14: continuous
- A15: continuous
- A16: +, - (class attribute)

37 cases (5%) have one or more missing values. Missing values per attribute: A1: 12, A2: 12, A4: 6, A5: 6, A6: 9, A7: 9, A14: 13.

Class distribution: +: 307 (44.5%), -: 383 (55.5%).
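These statistics are easy to verify directly with pandas. A minimal sketch, assuming crx.data sits in the working directory (the file itself has no header row; the `A1`..`A16` names are added here just for this check):

```python
import pandas as pd
import numpy as np

# Load the raw file and reproduce the statistics quoted above
raw = pd.read_csv("./crx.data", header=None,
                  names=["A%d" % i for i in range(1, 17)])
raw = raw.replace("?", np.nan)  # "?" marks a missing value

print(raw.shape)                        # (690, 16)
print(raw["A16"].value_counts())        # +: 307, -: 383
print(raw.isnull().sum())               # missing values per attribute
print(raw.isnull().any(axis=1).sum())   # 37 rows with at least one gap
```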
Below is the data-processing workflow: missing values are handled first, then the features are treated separately as continuous or categorical, and sklearn's LogisticRegression is used for the model. The code below is commented in detail.
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Read the data; crx.data ships without a header row
data = pd.read_csv("./crx.data", header=None)
# Attach column labels
data.columns = ["f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8",
                "f9", "f10", "f11", "f12", "f13", "f14", "f15", "label"]
# Map the label column to 0/1
label_mapping = {"+": 1, "-": 0}
data["label"] = data["label"].map(label_mapping)
# Missing values are encoded as "?"; turn them into NaN
data = data.replace("?", np.nan)
# Convert the object-typed numeric columns to float
data["f2"] = pd.to_numeric(data["f2"])
data["f14"] = pd.to_numeric(data["f14"])
# Continuous features: fill missing values with the column mean
data["f2"] = data["f2"].fillna(data["f2"].mean())
data["f3"] = data["f3"].fillna(data["f3"].mean())
data["f8"] = data["f8"].fillna(data["f8"].mean())
data["f11"] = data["f11"].fillna(data["f11"].mean())
data["f14"] = data["f14"].fillna(data["f14"].mean())
data["f15"] = data["f15"].fillna(data["f15"].mean())
# Categorical features: fill missing values with a new category
# that does not occur in the column
data["f1"] = data["f1"].fillna("c")
data["f4"] = data["f4"].fillna("s")
data["f5"] = data["f5"].fillna("gp")
data["f6"] = data["f6"].fillna("hh")
data["f7"] = data["f7"].fillna("ee")
data["f13"] = data["f13"].fillna("ps")
# Map the binary t/f features to 1/0
tf_mapping = {"t": 1, "f": 0}
data["f9"] = data["f9"].map(tf_mapping)
data["f10"] = data["f10"].map(tf_mapping)
data["f12"] = data["f12"].map(tf_mapping)
```
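A quick check (not part of the original pipeline, just a safeguard) confirms that the imputation really covered every column:

```python
# After imputation no NaN should be left anywhere in the frame
assert data.isnull().sum().sum() == 0
# The remaining object columns are the categorical ones awaiting encoding
print(data.dtypes)
```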
```python
# One-hot encode the remaining categorical features
data = pd.get_dummies(data)
```
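For readers unfamiliar with `pd.get_dummies`: it expands each object-typed column into one indicator column per category and leaves numeric columns alone. A toy example with hypothetical values:

```python
# Toy frame: one categorical and one numeric column
toy = pd.DataFrame({"f4": ["u", "y", "u", "l"], "f2": [1.0, 2.0, 3.0, 4.0]})
# f4 becomes f4_l / f4_u / f4_y indicator columns; f2 passes through untouched
print(pd.get_dummies(toy))
```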
```python
from sklearn.linear_model import LogisticRegression

# Shuffle the rows
shuffled_rows = np.random.permutation(data.index)
data = data.loc[shuffled_rows]
# Split into a local train set (70%) and test set (30%)
highest_train_row = int(data.shape[0] * 0.70)
train = data.iloc[0:highest_train_row]
loc_test = data.iloc[highest_train_row:]
# Every column except the label is a feature
features = train.drop(["label"], axis=1).columns
model = LogisticRegression()
X_train = train[features]
y_train = train["label"] == 1
model.fit(X_train, y_train)
X_test = loc_test[features]
# predict() returns hard class predictions, not probabilities
test_pred = model.predict(X_test)
test_label = loc_test["label"]
# Accuracy on the local test set
accuracy_test = (test_pred == loc_test["label"]).mean()
print(accuracy_test)
```
0.835748792271
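The manual slicing above works, but it shuffles only once and does not stratify by class. With a modern scikit-learn, the same 70/30 split can be done in one call (a sketch of an alternative, not what the code above ran):

```python
from sklearn.model_selection import train_test_split

# Stratified 70/30 split; random_state pins the shuffle for reproducibility
X = data.drop(["label"], axis=1)
y = data["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
```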
```python
from sklearn import metrics

# AUC on the local test set (computed here from the hard predictions)
test_auc = metrics.roc_auc_score(test_label, test_pred)
print(test_auc)
```
0.835748792271
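One caveat worth flagging: `roc_auc_score` above was fed hard class predictions, so it only sees a single operating point. Feeding it the positive-class probability from `predict_proba` measures ranking quality properly, and the value will generally differ from the one printed above. A minimal sketch reusing the fitted model:

```python
# Probability of the positive class for each test row
test_prob = model.predict_proba(X_test)[:, 1]
# AUC computed from probabilities rather than 0/1 predictions
proba_auc = metrics.roc_auc_score(test_label, test_prob)
print(proba_auc)
```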
With this simple application of logistic regression, the accuracy is 0.835748792271 and the AUC is 0.835748792271. That is a decent baseline; next, the model will be optimized to improve the accuracy further.