A Python-Based Credit Scorecard Model (German Credit Dataset)


At the time of Ant Group's IPO, Jack Ma gave a speech in Shanghai. His core argument was a single idea: in the era of the global digital economy, there is one and only one financial advantage, namely pure credit built on consumer big data.

We might call this "data credit." It is more reliable than collateral, safer than a guarantee, and smarter than regulation. It is a future-facing form of property, the core collateral asset behind digital currency, and it determines the direction, speed, and scale of credit creation in the digital currency era. In a word: whoever controls data credit controls the right to issue digital currency!

Judging data credit depends on financial risk-control models. Put more precisely: whoever masters risk-control modeling masters the right to issue digital currency!

You are welcome to study the Python credit scorecard modeling video series (code included, recorded by the author):

https://edu.51cto.com/sd/edde1

About the Author

Toby is a modeling expert at a licensed consumer finance company and holds a patent on a financial model algorithm. He has long-running project collaborations with the Chinese Academy of Sciences, Tsinghua University, Baidu, Tencent, iQIYI, Tongdun, Juxinli, Umeng, and other platforms, as well as modeling projects with several finance universities in China. He knows consumer finance scenarios well, including cash loans, goods loans, medical-cosmetic lending, anti-fraud, and auto finance. He specializes in Python machine learning modeling, with solid solutions for variable selection, derived-variable construction, high missing rates, class imbalance, high collinearity, multi-algorithm comparison, and hyperparameter tuning.

Course Overview

An A-grade course: a 360-degree walkthrough of building a credit scorecard in Python, with ready-to-use code and instructor Q&A. The algorithm is logistic regression. It fills the gap left by scattered and uneven tutorials online. The goal is to build a model that automates applicant screening, minimizing lending risk and maximizing profit for banks, consumer finance companies, and microlenders. The course uses the German Credit dataset.

Target Audience

Risk-control modeling staff in online lending settings such as banks, consumer finance, microloans, and cash loans; pre-loan approval modelers or anyone planning to move into a modeling role; students working on fintech modeling competitions, papers, or patents.

Study Plan and Method

1. Set aside 1-2 hours a day; the full course takes an estimated 14-30 days.
2. Do the hands-on coding for every lesson. Avoid copy-pasting: typing the code yourself matters for memory and consolidates what you learn.
3. At the start of each session, review the previous lesson and take notes where needed to deepen understanding.
4. List the questions you do not understand, research them online first, and ask the instructor about anything you cannot resolve.

Course Catalog

Chapter 1: Preface
Chapter 1: Setting up the Python environment
Lesson 1: For scorecard modeling, which is best: Python, R, or SAS?
Lesson 2: Anaconda quick-start guide
Lesson 3: Downloading and installing Anaconda
Lesson 4: Downloading and installing Canopy
Lesson 5: Anaconda Navigator
Lesson 6: Installing third-party Python packages: pip and conda install
Lesson 7: Download sources for unofficial Python extension packages
Lesson 8: Installing different Python versions with Anaconda
Lesson 9: jupyter1_Why use Jupyter Notebook?
Lesson 10: jupyter2_Basic text editing in Jupyter
Lesson 11: How to open a specific folder in Jupyter Notebook
Lesson 12: jupyter4_Converting notebooks to PPT in practice
Lesson 13: Fixing matplotlib plots not showing in Jupyter Notebook

Chapter 2: Python programming basics
Lesson 14: Basic file operations in Python
Lesson 15: Variables, expressions, operators, values
Lesson 16: Strings
Lesson 17: Lists
Lesson 18: Basic program structure (conditionals and loops)
Lesson 19: Data types, functions, object-oriented programming
Lesson 20: Differences between Python 2 and 3
Lesson 21: Programming tips and study methods

Chapter 3: Python machine learning basics
Lesson 22: Introduction to common UCI machine learning datasets
Lesson 23: Recommended machine learning books
Lesson 24: How to choose an algorithm
Lesson 25: Machine learning syntax cheat sheets
Lesson 26: Common Python data science libraries
Lesson 27: Introduction to Python data science (elective)

Chapter 4: Downloading and introducing the German Credit scoring dataset
Lesson 28: Downloading and introducing the German Credit scoring dataset

Chapter 5: Scorecard development workflow (part 1)
Lesson 29: Overview of the scorecard development workflow
Lesson 30: Step 1: data collection
Lesson 31: Step 2: data preparation
Lesson 32: Visual analysis of variables
Lesson 33: How large a sample is needed?
Lesson 34: Defining bad customers
Lesson 35: Step 3: variable selection
Lesson 36: Variable importance assessment: a hybrid IV / information-gain method
Lesson 37: Derived variables
Lesson 38: Step 4: variable binning

Chapter 6: Scorecard development workflow (part 2)
Lesson 39: Step 5: building the logistic regression model
Lesson 40: Odds
Lesson 41: Calculating WOE
Lesson 42: Variable coefficients
Lesson 43: Calculating A and B
Lesson 44: Computing bad-customer probability by hand in Excel
Lesson 45: Computing bad-customer probability with a Python script
Lesson 46: Customer scores
Lesson 47: The scorecard is born: computing variable scores
Lesson 48: Reject inference
Lesson 49: Step 6: model validation
Lesson 50: Step 7: model deployment
Lesson 51: Common model deployment issues

Chapter 7: Python scorecard: logistic regression scripts
Lesson 52: Demo run of the Python scorecard script
Lesson 53: Descriptive statistics script: missing rates and collinearity analysis
Lesson 54: WOE script (k-means binning)
Lesson 55: Exclusive IV calculation script
Lesson 56: Deriving variable WOE and IV by hand in Excel
Lesson 57: Scorecard script 1 (sklearn)
Lesson 58: Scorecard script 2 (statsmodels)
Lesson 59: Scorecard generation script
Lesson 60: Model validation script

Chapter 8: PSI (population stability index)
Lesson 61: Napoleon's failed European campaign / the real culprit behind Wall Street crashes: demystifying the PSI stability metric
Lesson 62: Deriving the PSI formula in Excel
Lesson 63: The principle behind the PSI formula: exclusive insights
Lesson 64: Walkthrough of the PSI Python script

Chapter 9: Difficulty 1: defining bad customers
Lesson 65: Get the bad-customer definition wrong and everything fails
Lesson 66: Bad-customer definitions vary by scenario and are revised iteratively
Lesson 67: The bad-customer share must not be too low
Lesson 68: Vintage analysis comes from winemaking
Lesson 69: Using vintage analysis to optimize credit policy

Chapter 10: Difficulty 2: WOE binning
Lesson 70: The natural logarithm
Lesson 71: Computing WOE by hand in Excel
Lesson 72: Python WOE script
Lesson 73: Deriving the IV calculation
Lesson 74: What the sign of WOE means
Lesson 75: WOE is that simple? Think again
Lesson 76: How the k-means algorithm works
Lesson 77: Python k-means coarse-binning script
Lesson 78: Automatically comparing IV across different binnings of a variable
Lesson 79: Script for a third-party WOE binning package

Chapter 11: Difficulty 3: is logistic regression the best algorithm?
Lesson 80: Is logistic regression the best algorithm? No
Lesson 81: XGBoost (script download included)
Lesson 82: Random forest (script download included)
Lesson 83: Support vector machine (script download included)
Lesson 84: Neural network (script download included)
Lesson 85: Why comparing algorithms matters: modeling competitions with million-yuan prizes

Chapter 12: Difficulty 4: handling missing variable data
Lesson 86: Imputer: handling missing data
Lesson 87: XGBoost handles missing data simply
Lesson 88: CatBoost handles missing data most easily

Chapter 13: Difficulty 5: model validation
Lesson 89: Does a model need validation?
Lesson 90: Commercial bank capital management measures (trial)
Lesson 91: Model validation: regulatory requirements for internal credit-risk rating systems
Lesson 92: Overview of the main model validation metrics
Lesson 93: Cross validation
Lesson 94: The groupby aggregation function
Lesson 95: KS: a measure of model discrimination
Lesson 96: Confusion matrix (accuracy, precision, recall, F1 score)
New lesson: Model ranking power: the lift chart

Chapter 14: Difficulty 6: tuning logistic regression
Lesson 97: Even beginners can tune hyperparameters easily
Lesson 98: Tuning 1: the penalty regularization parameter
Lesson 99: Tuning 2: the class_weight parameter
Lesson 100: Tuning 3: the solver parameter
Lesson 101: Tuning 4: n_jobs
Lesson 102: The evolution of the L-BFGS algorithm
Lesson 103: The minor parameters at a glance

Chapter 16: Risk management and fraud intermediaries (elective)
Lesson 104: A history of online lending
Lesson 105: Fraud intermediaries
Lesson 106: Risk management
Lesson 107: Goodbye to predatory and usurious loans: choosing the right way to borrow

Chapter 17: The 2018-2019 consumer finance market
Lesson 108: Revealed: the root cause of the recent consumer finance boom
Lesson 109: Profit rankings of licensed consumer finance companies
Lesson 110: In consumer finance, risk-control technology is the bottleneck
Lesson 111: Who laughs last: registered capital of consumer finance companies, 2018-2019
Lesson 112: Carrot and stick: an exclusive forecast of central regulatory trends
Lesson 113: Credit is the cornerstone of finance: the secret behind the P2P collapse wave

Chapter 18: The 2018-2019 global macroeconomy
Lesson 114: What experts won't tell you: the real relationship between the US dollar and gold
Lesson 115: Key macro indicators: debt ratio and unemployment rate
Lesson 116: 2019 analysis of China's macroeconomy, with the PBOC's 2018 China Financial Stability Report
Lesson 117: 2019 macroeconomic data for developed countries (downloadable)
Lesson 118: Global systemic financial risk
Lesson 119: The Gini coefficient: a measure of wealth inequality
Lesson 120: GDP, interest rates, inflation
Lesson 121: Unemployment rate and debt ratio
Lesson 122: Trade balance: the root cause of the US-China trade war
Lesson 123: Credit ratings: an exclusive read on Argentina's financial crisis

Course Objective

Minimization of risk and maximization of profit on behalf of the bank.

To minimize loss from the bank’s perspective, the bank needs a decision rule regarding who to give approval of the loan and who not to. An applicant’s demographic and socio-economic profiles are considered by loan managers before a decision is taken regarding his/her loan application.

The German Credit Data contains data on 20 variables and the classification whether an applicant is considered a Good or a Bad credit risk for 1000 loan applicants. Here is a link to the German Credit data (right-click and "save as" ). A predictive model developed on this data is expected to provide a bank manager guidance for making a decision whether to approve a loan to a prospective applicant based on his/her profiles.


Credit Scorecards in an Age of Delinquency

As consumer psychology in China changes and merchants push inducement marketing, more and more people rely on spending ahead of their income. With 1.4 billion people, China has a huge consumer base and a large market for all kinds of products, so the consumer credit market has become a focus for many banks and other institutions. Data released by the central bank show that the number of credit cards issued by commercial banks keeps expanding, but behind the "indiscriminate issuance" of cards, growing overdue bad debt has become a headache for banks.

Credit card debt overdue for more than six months has topped 90 billion yuan

Recently, the central bank published its third-quarter payment system report. The data show that the number of credit cards issued by China's commercial banks, total credit lines, and total bad debt are all still growing.

By the end of the third quarter this year, commercial banks had issued 766 million credit cards (including combined debit-credit cards), up 1.29% quarter on quarter. Total credit lines reached 18.59 trillion yuan, up 3.80% quarter on quarter.

Card issuance is rising and total credit keeps growing, which shows banks still value the credit card market highly, but this also brings them no small trouble: by the end of the third quarter, credit card debt overdue for more than six months had reached 90.663 billion yuan, up 6.13% quarter on quarter.

The steadily rising number of cards issued suggests banks are not screening applicants very strictly at initial review, so growing bad debt is to be expected. But as professional financial institutions, banks obviously will not sit by and watch bad debt keep climbing; otherwise it would hurt normal operations and draw the attention of regulators.

Under these circumstances, commercial banks manage customers who already hold cards, typically by working on spending scenarios and preventing cash-out schemes. To avoid a second risk-control review that freezes your card or cuts your limit, it is best to stay away from non-compliant card transactions.

How should a bank's head of risk control respond to continually rising credit card bad debt? The author believes the key is identifying bad customers (fraudsters and people without the capacity to repay). Only by identifying bad customers precisely can a bank significantly reduce delinquency and bad-debt rates.

Banks used to operate on a pawnshop mentality, lending to people with the capacity to repay, a relatively high-quality customer base. Worse, with quantitative easing, fiscal and monetary stimulus, and a surging M2, banks, consumer finance companies, and microlenders have expanded into subprime customers: people with weak repayment capacity or no job at all. Their repayment risk is high, so the interest they are charged is high as well.

Domestic black- and gray-market fraud has grown into a huge industry chain. According to earlier statistics from Tongdun, there are at least a thousand fraud gangs; most are small teams of about three people, but there are also dozens to hundreds of teams with over 100 members. These gangs probe the vulnerabilities of every major cash-loan platform daily, professional "product managers" in their own right. Pictured below are SIM cards that generate fake numbers: sourced from Southeast Asia, usable in China, they evade domestic security monitoring to the greatest extent possible and are prepared specifically for defrauding online cash-loan platforms. Without risk-control capability, do not play in the cash-loan business; loans go out like meat buns thrown at a dog, never to return.

For a familiar example, the author once found black- and gray-market services through keyword searches on a certain e-commerce platform.

Keywords:

registration bots, SMS services, SMS receiving, SMS verification, app order placement, smart-terminal call answering

With the black market churning, how should a bank's head of risk control respond to rising credit card bad debt? Again, identifying bad customers precisely is the key. This course teaches you, hands-on, how to build a Python credit scorecard model that catches bad customers accurately: the guardian of risk control.

A credit scorecard is an excellent tool for lenders and borrowers to gauge a borrower's ability to service debt. For lenders, a scorecard helps assess a borrower's risk, flag likely fraudsters or applicants without repayment capacity, and keep the company's portfolio healthy, which ultimately affects the entire economy.

The model works like a black box: when a customer applies for a loan, it automatically computes the customer's bad-customer probability from information such as age, job, position, repayment history, and number of prior loans. If the model says an applicant's bad-customer probability is high, say 0.8, the business line rejects the application.

A risk model is thus the guardian of a lender's credit business, protecting company assets from being devoured by the black market.
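A minimal sketch of that decision logic follows. The toy training data, the two features, and the fitted model are placeholders of my own, not the course's scorecard; only the 0.8 cutoff comes from the text above:

import numpy as np
from sklearn.linear_model import LogisticRegression

# toy training data: two made-up features (e.g. age, outstanding debt)
X = np.array([[25, 1000], [40, 200], [30, 5000], [55, 100]])
y = np.array([0, 0, 1, 0])          # 0 = good customer, 1 = bad customer
clf = LogisticRegression().fit(X, y)

applicant = np.array([[28, 4500]])  # a new application
p_bad = clf.predict_proba(applicant)[0, 1]   # predicted bad-customer probability
print("bad-customer probability:", round(p_bad, 3))
print("reject" if p_bad >= 0.8 else "approve")   # business-defined cutoff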

Credit scoring dataset download link

http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)
Sample variables from the dataset: account balance; duration of credit (month).

Data Set Information:

Two datasets are provided. The original dataset, in the form provided by Prof. Hofmann, contains categorical/symbolic attributes and is in the file "german.data".

For algorithms that need numerical attributes, Strathclyde University produced the file "german.data-numeric". This file has been edited and several indicator variables added to make it suitable for algorithms which cannot cope with categorical variables. Several attributes that are ordered categorical (such as attribute 17) have been coded as integer. This was the form used by StatLog.
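For readers who want to load the raw UCI file directly, here is a minimal sketch; the exact file path inside the UCI repository and the generic attr1..attr20 column labels are assumptions of mine, not part of the official documentation:

import pandas as pd

# assumed location of the raw file in the UCI repository (verify before use)
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data"
cols = ["attr%d" % i for i in range(1, 21)] + ["target"]   # 20 attributes + class label
df = pd.read_csv(url, sep=" ", header=None, names=cols)
print(df.shape)                       # expected: (1000, 21)
print(df["target"].value_counts())    # 1 = good risk, 2 = bad risk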

This dataset requires use of a cost matrix (see below)


              predicted
               1     2
actual   1     0     1
         2     5     0

(1 = Good, 2 = Bad)

The rows represent the actual classification and the columns the predicted classification.

It is worse to class a customer as good when they are bad (5), than it is to class a customer as bad when they are good (1).
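To make that asymmetry concrete, here is a small sketch that weights a confusion matrix by the cost matrix above; the toy label vectors are illustrative only:

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 2, 2, 2, 1, 2, 1]   # toy actual classes (1 = good, 2 = bad)
y_pred = [1, 2, 1, 2, 2, 1, 1, 1]   # toy predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[1, 2])   # rows = actual, columns = predicted
cost = np.array([[0, 1],    # good classed as bad costs 1
                 [5, 0]])   # bad classed as good costs 5
print("total misclassification cost:", (cm * cost).sum())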

Attribute Information:

Attribute 1: (qualitative)
Status of existing checking account
A11 : ... < 0 DM
A12 : 0 <= ... < 200 DM
A13 : ... >= 200 DM / salary assignments for at least 1 year
A14 : no checking account

Attribute 2: (numerical)
Duration in month

Attribute 3: (qualitative)
Credit history
A30 : no credits taken/ all credits paid back duly
A31 : all credits at this bank paid back duly
A32 : existing credits paid back duly till now
A33 : delay in paying off in the past
A34 : critical account/ other credits existing (not at this bank)

Attribute 4: (qualitative)
Purpose
A40 : car (new)
A41 : car (used)
A42 : furniture/equipment
A43 : radio/television
A44 : domestic appliances
A45 : repairs
A46 : education
A47 : (vacation - does not exist?)
A48 : retraining
A49 : business
A410 : others

Attribute 5: (numerical)
Credit amount

Attribute 6: (qualitative)
Savings account/bonds
A61 : ... < 100 DM
A62 : 100 <= ... < 500 DM
A63 : 500 <= ... < 1000 DM
A64 : .. >= 1000 DM
A65 : unknown/ no savings account

Attribute 7: (qualitative)
Present employment since
A71 : unemployed
A72 : ... < 1 year
A73 : 1 <= ... < 4 years
A74 : 4 <= ... < 7 years
A75 : .. >= 7 years

Attribute 8: (numerical)
Installment rate in percentage of disposable income

Attribute 9: (qualitative)
Personal status and sex
A91 : male : divorced/separated
A92 : female : divorced/separated/married
A93 : male : single
A94 : male : married/widowed
A95 : female : single

Attribute 10: (qualitative)
Other debtors / guarantors
A101 : none
A102 : co-applicant
A103 : guarantor

Attribute 11: (numerical)
Present residence since

Attribute 12: (qualitative)
Property
A121 : real estate
A122 : if not A121 : building society savings agreement/ life insurance
A123 : if not A121/A122 : car or other, not in attribute 6
A124 : unknown / no property

Attribute 13: (numerical)
Age in years

Attribute 14: (qualitative)
Other installment plans
A141 : bank
A142 : stores
A143 : none

Attribute 15: (qualitative)
Housing
A151 : rent
A152 : own
A153 : for free

Attribute 16: (numerical)
Number of existing credits at this bank

Attribute 17: (qualitative)
Job
A171 : unemployed/ unskilled - non-resident
A172 : unskilled - resident
A173 : skilled employee / official
A174 : management/ self-employed/
highly qualified employee/ officer

Attribute 18: (numerical)
Number of people being liable to provide maintenance for

Attribute 19: (qualitative)
Telephone
A191 : none
A192 : yes, registered under the customers name

Attribute 20: (qualitative)
foreign worker
A201 : yes
A202 : no


You are welcome to study more finance modeling courses:
Python financial risk-control scorecard modeling and data analysis micro-degree
https://edu.51cto.com/sd/f2e9b

Model variable importance ranking results


Python modeling scripts

Random forest
randomForest.py

random forest with 1000 trees:
accuracy on the training subset:1.000
accuracy on the test subset:0.772

Accuracy is higher than the decision tree's.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

trees=1000
# input file
readFileName="German_credit.xlsx"
# read the Excel file
df=pd.read_excel(readFileName)
list_columns=list(df.columns[:-1])
X=df.iloc[:,:-1]   # .ix is deprecated; use positional .iloc
y=df.iloc[:,-1]
names=X.columns
x_train,x_test,y_train,y_test=train_test_split(X,y,random_state=0)
# n_estimators is the number of trees; in testing, 100 trees were already enough
forest=RandomForestClassifier(n_estimators=trees,random_state=0)
forest.fit(x_train,y_train)
print("random forest with %d trees:"%trees)
print("accuracy on the training subset:{:.3f}".format(forest.score(x_train,y_train)))
print("accuracy on the test subset:{:.3f}".format(forest.score(x_test,y_test)))
print('Feature importances:{}'.format(forest.feature_importances_))
n_features=X.shape[1]
plt.barh(range(n_features),forest.feature_importances_,align='center')
plt.yticks(np.arange(n_features),names)
plt.title("random forest with %d trees:"%trees)
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.show()
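As a small follow-up, the importances are easier to read as a ranked list than as a bar chart; a sketch reusing forest and names from the script above:

import pandas as pd

importance = pd.Series(forest.feature_importances_, index=names)
print(importance.sort_values(ascending=False).head(10))   # top-10 variables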


Decision tree visualization
Accuracy is modest, and the model overfits badly:
accuracy on the training subset:0.991
accuracy on the test subset:0.680

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import train_test_split

# input file
readFileName="German_credit.xlsx"
# read the Excel file
df=pd.read_excel(readFileName)
list_columns=list(df.columns[:-1])
x=df.iloc[:,:-1]   # .ix is deprecated; use positional .iloc
y=df.iloc[:,-1]
names=x.columns
x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=0)
# parameter tuning: search over tree depth
list_average_accuracy=[]
depth=range(1,30)
for i in depth:
    # limiting max_depth reduces model complexity and curbs overfitting
    tree= DecisionTreeClassifier(max_depth=i,random_state=0)
    tree.fit(x_train,y_train)
    accuracy_training=tree.score(x_train,y_train)
    accuracy_test=tree.score(x_test,y_test)
    average_accuracy=(accuracy_training+accuracy_test)/2.0
    #print("average_accuracy:",average_accuracy)
    list_average_accuracy.append(average_accuracy)

max_value=max(list_average_accuracy)
# list indices start at 0, so add 1 to recover the depth
best_depth=list_average_accuracy.index(max_value)+1
print("best_depth:",best_depth)
best_tree= DecisionTreeClassifier(max_depth=best_depth,random_state=0)
best_tree.fit(x_train,y_train)
print("decision tree:")
print("accuracy on the training subset:{:.3f}".format(best_tree.score(x_train,y_train)))
print("accuracy on the test subset:{:.3f}".format(best_tree.score(x_test,y_test)))

n_features=x.shape[1]
plt.barh(range(n_features),best_tree.feature_importances_,align='center')
plt.yticks(np.arange(n_features),names)
plt.title("Decision Tree:")
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.show()

# write a .dot file; render it to an image later from the command line
export_graphviz(best_tree,out_file="creditTree.dot",class_names=['bad','good'],feature_names=names,impurity=False,filled=True)
'''
best_depth: 12
decision tree:
accuracy on the training subset:0.991
accuracy on the test subset:0.680
'''

Support vector machine: the best prediction accuracy

# standardize the data
from sklearn import preprocessing
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import pandas as pd

# input file
readFileName="German_credit.xlsx"
# read the Excel file
df=pd.read_excel(readFileName)
list_columns=list(df.columns[:-1])
x=df.iloc[:,:-1]   # .ix is deprecated; use positional .iloc
y=df.iloc[:,-1]
names=x.columns
# random_state acts as the random seed
X_train,x_test,y_train,y_test=train_test_split(x,y,stratify=y,random_state=42)
svm=SVC()
svm.fit(X_train,y_train)
print("accuracy on the training subset:{:.3f}".format(svm.score(X_train,y_train)))
print("accuracy on the test subset:{:.3f}".format(svm.score(x_test,y_test)))
'''
accuracy on the training subset:1.000
accuracy on the test subset:0.700
'''
# check whether the features are standardized
plt.plot(X_train.min(axis=0),'o',label='Min')
plt.plot(X_train.max(axis=0),'v',label='Max')
plt.xlabel('Feature Index')
plt.ylabel('Feature magnitude in log scale')
plt.yscale('log')
plt.legend(loc='upper right')

# standardize the data
X_train_scaled = preprocessing.scale(X_train)
x_test_scaled = preprocessing.scale(x_test)
svm1=SVC()
svm1.fit(X_train_scaled,y_train)
print("accuracy on the scaled training subset:{:.3f}".format(svm1.score(X_train_scaled,y_train)))
print("accuracy on the scaled test subset:{:.3f}".format(svm1.score(x_test_scaled,y_test)))
'''
accuracy on the scaled training subset:0.867
accuracy on the scaled test subset:0.800
'''
# tune C; kernel is the kernel function used to transform the feature space;
# probability=True enables probability estimates
svm2=SVC(C=10,gamma="auto",kernel='rbf',probability=True)
svm2.fit(X_train_scaled,y_train)
print("after c parameter=10,accuracy on the scaled training subset:{:.3f}".format(svm2.score(X_train_scaled,y_train)))
print("after c parameter=10,accuracy on the scaled test subset:{:.3f}".format(svm2.score(x_test_scaled,y_test)))
'''
after c parameter=10,accuracy on the scaled training subset:0.972
after c parameter=10,accuracy on the scaled test subset:0.716
'''
# functional distance of each sample to the separating hyperplane
#print (svm2.decision_function(X_train_scaled))
#print (svm2.decision_function(X_train_scaled)[:20]>0)
# the classifier's classes
#print(svm2.classes_)
# class probabilities: one column per class (good and bad)
#print(svm2.predict_proba(x_test_scaled))
# predicted class labels, 0 or 1
#print(svm2.predict(x_test_scaled))
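One caveat about the script above: it standardizes the training and test sets independently, so the test features are scaled with their own statistics. The more careful pattern fits the scaler on the training data only and reuses it for the test data; a sketch, reusing the train/test split from the script above:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# the scaler is fit on X_train only, then applied unchanged to x_test
pipe = make_pipeline(StandardScaler(), SVC(C=10, gamma="auto", kernel="rbf"))
pipe.fit(X_train, y_train)
print("pipeline accuracy on the test subset:{:.3f}".format(pipe.score(x_test, y_test)))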

Neural network
Performs worse than the SVM and the random forest. Best result:
accuracy on the training subset:0.916
accuracy on the test subset:0.720

from sklearn.neural_network import MLPClassifier
# standardize the data, otherwise the neural network is inaccurate (same as the SVM)
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# input file
readFileName="German_credit.xlsx"

# read the Excel file
df=pd.read_excel(readFileName)
list_columns=list(df.columns[:-1])
x=df.iloc[:,:-1]   # .ix is deprecated; use positional .iloc
y=df.iloc[:,-1]
names=x.columns

# random_state acts as the random seed
x_train,x_test,y_train,y_test=train_test_split(x,y,stratify=y,random_state=42)
mlp=MLPClassifier(random_state=42)
mlp.fit(x_train,y_train)
print("neural network:")
print("accuracy on the training subset:{:.3f}".format(mlp.score(x_train,y_train)))
print("accuracy on the test subset:{:.3f}".format(mlp.score(x_test,y_test)))

scaler=StandardScaler()
x_train_scaled=scaler.fit(x_train).transform(x_train)
x_test_scaled=scaler.fit(x_test).transform(x_test)

mlp_scaled=MLPClassifier(max_iter=1000,random_state=42)
mlp_scaled.fit(x_train_scaled,y_train)
print("neural network after scaled:")
print("accuracy on the training subset:{:.3f}".format(mlp_scaled.score(x_train_scaled,y_train)))
print("accuracy on the test subset:{:.3f}".format(mlp_scaled.score(x_test_scaled,y_test)))

mlp_scaled2=MLPClassifier(max_iter=1000,alpha=1,random_state=42)
mlp_scaled2.fit(x_train_scaled,y_train)
print("neural network after scaled and alpha change to 1:")
print("accuracy on the training subset:{:.3f}".format(mlp_scaled2.score(x_train_scaled,y_train)))
print("accuracy on the test subset:{:.3f}".format(mlp_scaled2.score(x_test_scaled,y_test)))

# heat map of the first-layer weights
plt.figure(figsize=(20,5))
plt.imshow(mlp_scaled.coefs_[0],interpolation="None",cmap="GnBu")
plt.yticks(range(x.shape[1]),names)   # one tick per input feature (was hardcoded to 30)
plt.xlabel("columns in weight matrix")
plt.ylabel("input feature")
plt.colorbar()
plt.show()

'''
neural network:
accuracy on the training subset:0.700
accuracy on the test subset:0.700
neural network after scaled:
accuracy on the training subset:1.000
accuracy on the test subset:0.704
neural network after scaled and alpha change to 1:
accuracy on the training subset:0.916
accuracy on the test subset:0.720
'''
XGBoost
Decent discriminatory power:
AUC: 0.8134
ACC: 0.7720
Recall: 0.9521
F1-score: 0.8480
Precision: 0.7644

import xgboost as xgb
from sklearn.model_selection import train_test_split   # cross_validation was removed in sklearn 0.20
import pandas as pd
import matplotlib.pylab as plt

# input file
readFileName="German_credit.xlsx"

# read the Excel file
df=pd.read_excel(readFileName)
list_columns=list(df.columns[:-1])
x=df.iloc[:,:-1]   # .ix is deprecated; use positional .iloc
y=df.iloc[:,-1]
names=x.columns

train_x, test_x, train_y, test_y=train_test_split(x,y,random_state=0)

dtrain=xgb.DMatrix(train_x,label=train_y)
dtest=xgb.DMatrix(test_x)

params={'booster':'gbtree',
    #'objective': 'reg:linear',
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'max_depth':4,
    'lambda':10,
    'subsample':0.75,
    'colsample_bytree':0.75,
    'min_child_weight':2,
    'eta': 0.025,
    'seed':0,
    'nthread':8,
     'silent':1}

watchlist = [(dtrain,'train')]

bst=xgb.train(params,dtrain,num_boost_round=100,evals=watchlist)

ypred=bst.predict(dtest)

# apply a threshold and output some evaluation metrics
y_pred = (ypred >= 0.5)*1

# model validation
from sklearn import metrics
print ('AUC: %.4f' % metrics.roc_auc_score(test_y,ypred))
print ('ACC: %.4f' % metrics.accuracy_score(test_y,y_pred))
print ('Recall: %.4f' % metrics.recall_score(test_y,y_pred))
print ('F1-score: %.4f' %metrics.f1_score(test_y,y_pred))
print ('Precision: %.4f' %metrics.precision_score(test_y,y_pred))
metrics.confusion_matrix(test_y,y_pred)

print("xgboost:") 
#print("accuracy on the training subset:{:.3f}".format(bst.get_score(train_x,train_y)))
#print("accuracy on the test subset:{:.3f}".format(bst.get_score(test_x,test_y)))
print('Feature importances:{}'.format(bst.get_fscore()))

'''
AUC: 0.8135
ACC: 0.7640
Recall: 0.9641
F1-score: 0.8451
Precision: 0.7523

# feature importances are similar to the random forest's
Feature importances:{'Account Balance': 80, 'Duration of Credit (month)': 119,
 'Most valuable available asset': 54, 'Payment Status of Previous Credit': 84,
 'Value Savings/Stocks': 66, 'Age (years)': 94, 'Credit Amount': 149,
 'Type of apartment': 20, 'Instalment per cent': 37,
 'Length of current employment': 70, 'Sex & Marital Status': 29,
 'Purpose': 67, 'Occupation': 13, 'Duration in Current address': 25,
 'Telephone': 15, 'Concurrent Credits': 23, 'No of Credits at this Bank': 7,
 'Guarantors': 28, 'No of dependents': 6}
'''
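xgboost also ships a plotting helper for the same F-scores printed above; a short sketch reusing bst from the script:

import matplotlib.pyplot as plt
import xgboost as xgb

xgb.plot_importance(bst, max_num_features=10)   # top-10 features by F-score
plt.show()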

Final conclusions:

XGBoost's feature-importance analysis is sometimes even more accurate than the random forest's, which shows its strength.

Top factors by random forest importance, with the XGBoost weight (F-score) alongside:
- Credit Amount: 149
- Age (years): 94
- Account Balance: 80
- Duration of Credit (month): 119 (how long a card may stay overdue before being frozen differs by bank; at China Merchants Bank, for example, it is two months)

Data update, 2018-09-18

Logistic regression's validation metrics are close to CatBoost's, which speaks to logistic regression's stability:

model accuracy is: 0.755
model precision is: 0.697841726618705
model sensitivity is: 0.3233333333333333
f1_score: 0.44191343963553525
AUC: 0.7626619047619048

After dropping variables by IV, the predictions are not as good as with all variables kept:
model accuracy is: 0.724
model precision is: 0.61320754717
model sensitivity is: 0.216666666667
f1_score: 0.320197044335
AUC: 0.7031
good classifier

Results on the raw German_credit data (the script's output, restored as its docstring):

"""
accuracy on the training subset:0.777
accuracy on the test subset:0.740
A: 6.7807190511263755
B: 14.426950408889635
model accuracy is: 0.74
model precision is: 0.7037037037037037
model sensitivity is: 0.38
f1_score: 0.49350649350649356
AUC: 0.7885
"""
import math
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score   # cross_validation was removed in sklearn 0.20
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import statsmodels.api as sm
# confusion-matrix and metric helpers
from sklearn import metrics
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

#df_german=pd.read_excel("german_woe.xlsx")
df_german=pd.read_excel("german_credit.xlsx")
#df_german=pd.read_excel("df_after_vif.xlsx")
y=df_german["target"]
x=df_german.loc[:,"Account Balance":"Foreign Worker"]   # label-based slice (.ix is deprecated)
#x=df_german.loc[:,"Credit Amount":"Purpose"]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

classifier = LogisticRegression()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)

# validation
print("accuracy on the training subset:{:.3f}".format(classifier.score(X_train,y_train)))
print("accuracy on the test subset:{:.3f}".format(classifier.score(X_test,y_test)))

# scoring formula: score = A - B*ln(odds), where odds = p/(1-p)
'''
P0 = 50
PDO = 10
theta0 = 1.0/20
B = PDO/np.log(2)
A = P0 + B*np.log(theta0)
'''
def Score(probability):
    # natural logarithm (base e)
    score = A - B*np.log(probability/(1-probability))
    return score

# score a whole list of probabilities
def List_score(pos_probablity_list):
    list_score=[]
    for probability in pos_probablity_list:
        score=Score(probability)
        list_score.append(score)
    return list_score

P0 = 50
PDO = 10
theta0 = 1.0/20
B = PDO/np.log(2)
A = P0 + B*np.log(theta0)
print("A:",A)
print("B:",B)
list_coef = list(classifier.coef_[0])   # coef_ (fixed attribute name), matching list_coef used below
intercept= classifier.intercept_

# predicted probabilities for every customer; column 0 = good, column 1 = bad
probablity_list=classifier.predict_proba(x)
# bad-customer probability for every customer
pos_probablity_list=[i[1] for i in probablity_list]
# score for every customer
list_score=List_score(pos_probablity_list)
list_predict=classifier.predict(x)
df_result=pd.DataFrame({"label":y,"predict":list_predict,"pos_probablity":pos_probablity_list,"score":list_score})

df_result.to_excel("score_proba.xlsx")

# variable names
list_vNames=df_german.columns
# drop the first column name, "target"
list_vNames=list_vNames[1:]
df_coef=pd.DataFrame({"variable_names":list_vNames,"coef":list_coef})
df_coef.to_excel("coef.xlsx")

y_true=y_test
y_pred=classifier.predict(X_test)
accuracyScore = accuracy_score(y_true, y_pred)
print('model accuracy is:',accuracyScore)

# precision = TP/(TP+FP), true positives over predicted positives
precision=precision_score(y_true, y_pred)
print('model precision is:',precision)

# recall (sensitivity) = TP/(TP+FN)
sensitivity=recall_score(y_true, y_pred)
print('model sensitivity is:',sensitivity)

# F1 = 2 * (precision * recall) / (precision + recall)
# F1 combines precision and recall into one score: best value 1, worst value 0
f1Score=f1_score(y_true, y_pred)
print("f1_score:",f1Score)

def AUC(y_true, y_scores):
    auc_value=0
    # compute AUC from the ROC curve: auc(fpr, tpr)
    fpr, tpr, thresholds = metrics.roc_curve(y_true, y_scores, pos_label=1)
    auc_value= auc(fpr,tpr)   # area under the ROC curve
    #print("fpr:",fpr)
    #print("tpr:",tpr)
    #print("thresholds:",thresholds)
    if auc_value<0.5:
        auc_value=1-auc_value
    return auc_value

def Draw_roc(auc_value):
    fpr, tpr, thresholds = metrics.roc_curve(y, list_score, pos_label=0)
    # plot the diagonal reference line
    plt.plot([0, 1], [0, 1], '--', color=(0.6, 0.6, 0.6), label='Diagonal line')
    plt.plot(fpr,tpr,label='ROC curve (area = %0.2f)' % auc_value)
    plt.title('ROC curve')
    plt.legend(loc="lower right")

# interpret the AUC value
def AUC_performance(AUC):
    if AUC >=0.7:
        print("good classifier")
    if 0.7>AUC>0.6:
        print("not very good classifier")
    if 0.6>=AUC>0.5:
        print("useless classifier")
    if 0.5>=AUC:
        print("bad classifier,with sorting problems")

# AUC check, here computed on the scores of all customers
auc_value=AUC(y, list_score)
print("AUC:",auc_value)
# interpret the AUC
AUC_performance(auc_value)
# draw the ROC curve
Draw_roc(auc_value)
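As a sanity check on the scaling constants, the arithmetic below reproduces the printed A and B from P0 = 50, PDO = 10, and theta0 = 1/20; at odds of 1 (bad-customer probability 0.5) the score equals A, since ln(1) = 0:

import numpy as np

B = 10 / np.log(2)              # 14.4269..., matches the printed B
A = 50 + B * np.log(1.0 / 20)   # 6.7807..., matches the printed A
p = 0.5                         # odds = p / (1 - p) = 1
print(A - B * np.log(p / (1 - p)))   # equals A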

**CatBoost script**

"""
CatBoost:
accuracy on the training subset:1.000
accuracy on the test subset:0.763
test-set metrics:
accuracy on the test subset:0.757
model accuracy is: 0.7566666666666667
model precision is: 0.813953488372093
model sensitivity is: 0.35
f1_score: 0.48951048951048953
AUC: 0.7595999999999999
"""
import catboost as cb
import math
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score   # cross_validation was removed in sklearn 0.20
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import statsmodels.api as sm
# confusion-matrix and metric helpers
from sklearn import metrics
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

#df_german=pd.read_excel("german_woe.xlsx")
df_german=pd.read_excel("german_credit.xlsx")
#df_german=pd.read_excel("df_after_vif.xlsx")
y=df_german["target"]
x=df_german.loc[:,"Account Balance":"Foreign Worker"]   # label-based slice (.ix is deprecated)
#x=df_german.loc[:,"Credit Amount":"Purpose"]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

classifier = cb.CatBoostClassifier()
classifier.fit(X_train, y_train)

list_score=classifier.predict_proba(X_test)
list_score=[i[1] for i in list_score]

# validation
print("accuracy on the training subset:{:.3f}".format(classifier.score(X_train,y_train)))
print("accuracy on the test subset:{:.3f}".format(classifier.score(X_test,y_test)))

list_predict=classifier.predict(x)   # predictions for all customers (not used below)
y_true=y_test
y_pred=classifier.predict(X_test)
accuracyScore = accuracy_score(y_true, y_pred)
print('model accuracy is:',accuracyScore)

# precision = TP/(TP+FP), true positives over predicted positives
precision=precision_score(y_true, y_pred)
print('model precision is:',precision)

# recall (sensitivity) = TP/(TP+FN)
sensitivity=recall_score(y_true, y_pred)
print('model sensitivity is:',sensitivity)

# F1 = 2 * (precision * recall) / (precision + recall)
# F1 combines precision and recall into one score: best value 1, worst value 0
f1Score=f1_score(y_true, y_pred)
print("f1_score:",f1Score)

def AUC(y_true, y_scores):
    auc_value=0
    # compute AUC from the ROC curve: auc(fpr, tpr)
    fpr, tpr, thresholds = metrics.roc_curve(y_true, y_scores, pos_label=1)
    auc_value= auc(fpr,tpr)   # area under the ROC curve
    #print("fpr:",fpr)
    #print("tpr:",tpr)
    #print("thresholds:",thresholds)
    if auc_value<0.5:
        auc_value=1-auc_value
    return auc_value

# AUC check on the test set
auc_value=AUC(y_test, list_score)
print("AUC:",auc_value)
