A Data Science Framework: To Achieve 99% Accuracy

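  I wrote this article as a simple guide for people joining the Kaggle data science community. Most beginners do not know where to start, because they use libraries and algorithms they do not understand, as if they were stuck inside a black box. This tutorial provides a framework that teaches you how to think and code like a data scientist, which gives you an edge in data analysis.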

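Contents:

1. Introduction: How a Data Scientist Beats the Odds

2. A Data Science Framework Overview

3. Step 1: Define the Problem; Step 2: Gather the Data

4. Step 3: Clean the Data

5. The 4 C's of Data Cleaning: Correcting, Completing, Creating, and Converting

6. Step 4: Perform Exploratory Analysis

7. Step 5: Model Data

8. Evaluate Model Performance

9. Tune Model with Hyper-Parameters

10. Tune Model with Feature Selection

11. Step 6: Validate and Implement

12. Step 7: Optimize and Strategize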


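1. Introduction: How a Data Scientist Beats the Odds

  Predicting the outcome of a binary event is a classic problem. For example, you won or you did not win; you passed the test or you did not. A common business application is customer churn or retention. Another popular use case is mortality or survival analysis in healthcare. Binary events create an interesting dynamic, because we know statistically that a random guess should achieve about 50% accuracy, like a coin flip, without creating a single algorithm or writing a single line of code. Yet, just like autocorrect, sometimes we humans are too clever for our own good and actually underperform a coin flip. In this article, I use Kaggle's getting-started competition, the Titanic dataset, to walk through how to use a data science framework to beat the odds.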

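2. A Data Science Framework Overview

  1. Define the problem: As the saying goes, don't put the cart before the horse. Before solving a problem you must understand what the problem is, and you can often apply existing models or algorithms rather than jumping straight to new methods.
  2. Gather the data: John Naisbitt wrote in his 1982 book "Megatrends" that we are "drowning in data, yet starving for knowledge." So the dataset most likely already exists somewhere, in some format. It may be external or internal, structured or unstructured, static or streamed, objective or subjective, and so on. As the saying goes, you don't have to reinvent the wheel, you just have to know where to find it. In the next step, we worry about transforming "dirty data" into "clean data".
  3. Clean the data: This is the required process of turning "wild" data into "manageable" data. Data wrangling includes implementing data architectures for storage and processing, developing data governance standards for quality and control, data extraction (i.e. ETL and web scraping), and data cleaning to identify aberrant, missing, or outlier data points.
  4. Perform exploratory analysis: Anybody who has ever worked with data knows garbage in, garbage out (GIGO). It is therefore important to deploy descriptive and graphical statistics to look for potential problems, patterns, classifications, correlations, and comparisons in the dataset. In addition, data categorization (i.e. qualitative vs quantitative) is important for understanding and selecting the correct hypothesis test or data model.
  5. Model data: Like descriptive and inferential statistics, data modeling can either summarize the data or predict future outcomes. Algorithms are tools, not magic wands or silver bullets; you must still be the craftsman who knows how to select the right tool for the job. The wrong model leads to poor performance at best and to the wrong conclusion at worst.
  6. Validate and implement the data model: After you have trained your model on a subset of your data, it is time to test it. This helps ensure you have not overfit the model, that is, made it so specific to the selected subset that it does not accurately fit another subset from the same dataset. In this step we determine whether our model overfits, generalizes, or underfits our dataset.
  7. Optimize and strategize: This is where you iterate back through the process to make it better, stronger, and faster than before. As a data scientist, your strategy should be to outsource developer operations and application plumbing, so that you have more time to focus on recommendations and design.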

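3. Step 1: Define the Problem; Step 2: Gather the Data

Step 1: Define the Problem
For this project, the problem statement is given to us: develop an algorithm to predict the survival outcome of passengers on the Titanic.

......

Project summary: The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1,502 of the 2,224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Practice skills

  • Binary classification
  • Python basics

Step 2: Gather the Data
Kaggle provides the training and test data for the Titanic competition: Titanic: Machine Learning from Disaster.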

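4. Step 3: Clean the Data

Step 3: Clean the Data

Once the data has been gathered, it must be cleaned.

3.1 Import the required packages

The code below is written in Python 3.x. We import several pre-written libraries to perform the necessary tasks.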

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python

#load packages
import sys #access to system parameters https://docs.python.org/3/library/sys.html
print("Python version: {}". format(sys.version))

import pandas as pd #collection of functions for data processing and analysis modeled after R dataframes with SQL like features
print("pandas version: {}". format(pd.__version__))

import matplotlib #collection of functions for scientific and publication-ready visualization
print("matplotlib version: {}". format(matplotlib.__version__))

import numpy as np #foundational package for scientific computing
print("NumPy version: {}". format(np.__version__))

import scipy as sp #collection of functions for scientific computing and advance mathematics
print("SciPy version: {}". format(sp.__version__)) 

import IPython
from IPython import display #pretty printing of dataframes in Jupyter notebook
print("IPython version: {}". format(IPython.__version__)) 

import sklearn #collection of machine learning algorithms
print("scikit-learn version: {}". format(sklearn.__version__))

#misc libraries
import random
import time


#ignore warnings
import warnings
warnings.filterwarnings('ignore')
print('-'*25)

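3.11 Load data modeling libraries
  We will use the popular scikit-learn library to develop our machine learning algorithms. In sklearn, algorithms are called Estimators and are implemented in their own classes. For data visualization, we will use the matplotlib and seaborn libraries. Below are the common classes to load.

#import the required packages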
#Common Model Algorithms
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
from xgboost import XGBClassifier

#Common Model Helpers
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics

#Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
from pandas.plotting import scatter_matrix #pandas.tools.plotting was removed from pandas; pandas.plotting is the current location

#Configure Visualization Defaults
#%matplotlib inline = show plots in Jupyter Notebook browser
%matplotlib inline
mpl.style.use('ggplot')
sns.set_style('white')
pylab.rcParams['figure.figsize'] = 12,8

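3.2 Meet and greet the data
  Get to know the data by name and learn a little about it. What does it look like (data types and values), what drives it (the independent/feature variables), and what is its goal (the dependent/target variable)?

To begin this step, we first import the data. Next, we use the info() and sample() functions to get a quick overview of the variable data types (i.e. qualitative vs quantitative).

  • The Survived variable is our outcome or dependent variable. It is a binary nominal data type: 1 for survived and 0 for did not survive.
  • The PassengerId and Ticket variables are assumed to be random unique identifiers that have no impact on the outcome variable, so they are excluded from the analysis.
  • The Pclass variable is an ordinal data type for the ticket class, a proxy for socio-economic status (SES): 1 = upper class, 2 = middle class, 3 = lower class.
  • The Name variable is a nominal data type. It could be used in feature engineering to derive gender from the title, family size from the surname, and SES from titles such as Doctor or Master. Since those variables already exist, we will only use it to see whether the title, such as Master, makes a difference.
  • The Sex and Embarked variables are nominal data types. They will be converted to dummy variables for mathematical calculations.
  • The Age and Fare variables are continuous quantitative data types.
  • SibSp is the number of related siblings/spouses aboard, and Parch is the number of related parents/children aboard. Both are discrete quantitative data types. They can be used in feature engineering to create a family size variable and an is-alone variable.
  • The Cabin variable is a nominal data type that could be used in feature engineering to approximate a passenger's position on the ship when the incident occurred and SES from the deck level. However, because it contains many null values, it is excluded from the analysis.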
#import data from file: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
data_raw = pd.read_csv(r'F:\wd.jupyter\datasets\kaggle_data\titanic\train.csv')


#a dataset should be broken into 3 splits: train, test, and (final) validation
#the test file provided is the validation file for competition submission
#we will split the train set into train and test data in future sections
data_val  = pd.read_csv(r'F:\wd.jupyter\datasets\kaggle_data\titanic\test.csv')


#to play with our data we'll create a copy
#remember python assignment or equal passes by reference vs values, 
#so we use the copy function: https://stackoverflow.com/questions/46327494/python-pandas-dataframe-copydeep-false-vs-copydeep-true-vs
data1 = data_raw.copy(deep = True)

#however passing by reference is convenient, because we can clean both datasets at once
data_cleaner = [data1, data_val]


#preview data
print (data_raw.info()) #https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html
#data_raw.head() #https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html
#data_raw.tail() #https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.tail.html
data_raw.sample(10) #https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html
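
The comments above about copying versus passing by reference can be checked on a toy frame. This is a minimal sketch, assuming only pandas; the names raw, work, alias, and cleaner are hypothetical and do not come from the article.

```python
import pandas as pd

# toy stand-in for the raw training data (hypothetical values)
raw = pd.DataFrame({"Age": [22.0, None, 35.0]})

work = raw.copy(deep=True)   # independent copy: edits do not reach raw
alias = raw                  # plain assignment: a second name for the same object

cleaner = [work]             # the list stores a reference to work, not a copy
for frame in cleaner:
    # filling through the list element changes work itself
    frame["Age"] = frame["Age"].fillna(frame["Age"].median())

missing_in_work = int(work["Age"].isnull().sum())
missing_in_raw = int(raw["Age"].isnull().sum())
print(missing_in_work, missing_in_raw, alias is raw)
```

This is why the article can clean data1 and data_val in one loop over data_cleaner while data_raw keeps its original values.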

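5. The 4 C's of Data Cleaning: Correcting, Completing, Creating, and Converting

5.1 The 4 C's of Data Cleaning: Correcting, Completing, Creating, and Converting

In this stage, we will clean our data by (1) correcting aberrant values and outliers, (2) completing missing information, (3) creating new features for analysis, and (4) converting fields to the correct format for calculations and presentation.

  • Correcting: Reviewing the data, there do not appear to be any aberrant or unacceptable inputs. We do see potential outliers in age and fare. However, since they are reasonable values, we will wait until after the exploratory analysis to decide whether to include or exclude them from the dataset. Note that if they were unreasonable values, for example age = 800 instead of 80, fixing them now would be the right decision. Still, we want to be cautious when we modify data from its original values, because an accurate model depends on it.
  • Completing: There are null values or missing data in the age, cabin, and embarked fields. Missing values can be a problem, because some algorithms do not know how to handle nulls and will fail, while others, such as decision trees, can handle them. It is therefore important to fix them before we start modeling, because we will compare and contrast several models. There are two common methods: delete the record, or populate the missing value with a reasonable input. Deleting records is not recommended, especially a large share of records, unless a record is truly incomplete. Instead, it is best to impute missing values. A basic method for qualitative data is to impute with the mode. A basic method for quantitative data is to impute with the mean, the median, or the mean plus a randomized standard deviation. An intermediate method is to apply the basic method within a specific criterion, such as the average age by class, or the embarkation port by fare and SES. There are more complex methods, but before deploying one it should be compared to the base model to determine whether the complexity truly adds value. For this dataset, age will be imputed with the median, the cabin attribute will be dropped, and embarked will be imputed with the mode. Later model iterations may revisit these decisions to see whether they improve the model's accuracy.
  • Creating: Feature engineering is when we use existing features to create new features and determine whether they provide new signals to predict our outcome. For this dataset, we will create a title feature to determine whether it played a role in survival.
  • Converting: Last, but certainly not least, we deal with formatting. There are no date or currency formats, only data type formats. Our categorical data is imported as objects, which makes mathematical calculations difficult. For this dataset, we will convert object data types to categorical dummy variables.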
print('Train columns with null values:\n', data1.isnull().sum())
print("-"*10)

print('Test/Validation columns with null values:\n', data_val.isnull().sum())
print("-"*10)

data_raw.describe(include = 'all')
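
The basic and intermediate imputation methods described under Completing can be compared side by side. The sketch below uses a made-up six-row frame, not the Titanic data, and assumes only pandas; the variable names are illustrative.

```python
import pandas as pd

# hypothetical toy rows, only to show the mechanics
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3],
    "Age":      [40.0, 50.0, None, 20.0, 30.0, None],
    "Embarked": ["S", "S", "C", None, "S", "Q"],
})

# basic method for quantitative data: one median for the whole column
age_global = toy["Age"].fillna(toy["Age"].median())

# intermediate method: the median within each passenger class
age_by_class = toy["Age"].fillna(toy.groupby("Pclass")["Age"].transform("median"))

# basic method for qualitative data: the most frequent value (mode)
embarked = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(age_global.tolist())
print(age_by_class.tolist())
print(embarked.tolist())
```

The group-wise version fills the first-class gap with an older age than the third-class gap, which is the kind of difference a later model iteration could test against the simpler global median.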

5.2 Clean the data

Now that we know what to clean, let's execute our code.


for dataset in data_cleaner:    
    #complete missing age with median
    #assign the result back: fillna(inplace=True) on a selected column is unreliable in recent pandas
    dataset['Age'] = dataset['Age'].fillna(dataset['Age'].median())

    #complete embarked with mode
    dataset['Embarked'] = dataset['Embarked'].fillna(dataset['Embarked'].mode()[0])

    #complete missing fare with median
    dataset['Fare'] = dataset['Fare'].fillna(dataset['Fare'].median())
    
#delete the cabin feature/column and others previously stated to exclude in train dataset
drop_column = ['PassengerId','Cabin', 'Ticket']
data1.drop(drop_column, axis=1, inplace = True)

print(data1.isnull().sum())
print("-"*10)
print(data_val.isnull().sum())

for dataset in data_cleaner:    
    #Discrete variables
    dataset['FamilySize'] = dataset ['SibSp'] + dataset['Parch'] + 1

    dataset['IsAlone'] = 1 #initialize to yes/1 is alone
    dataset.loc[dataset['FamilySize'] > 1, 'IsAlone'] = 0 # now update to no/0 if family size is greater than 1; a single .loc call avoids chained assignment

    #quick and dirty code split title from name: http://www.pythonforbeginners.com/dictionary/python-split
    dataset['Title'] = dataset['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]


    #Continuous variable bins; qcut vs cut: https://stackoverflow.com/questions/30211923/what-is-the-difference-between-pandas-qcut-and-pandas-cut
    #Fare Bins/Buckets using qcut or frequency bins: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.qcut.html
    dataset['FareBin'] = pd.qcut(dataset['Fare'], 4)

    #Age Bins/Buckets using cut or value bins: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.cut.html
    dataset['AgeBin'] = pd.cut(dataset['Age'].astype(int), 5)


    
#cleanup rare title names
#print(data1['Title'].value_counts())
stat_min = 10 #while small is arbitrary, we'll use the common minimum in statistics: http://nicholasjjackson.com/2012/03/08/sample-size-is-10-a-magic-number/
title_names = (data1['Title'].value_counts() < stat_min) #this will create a true false series with title name as index

#apply and lambda functions are quick and dirty code to find and replace with fewer lines of code: https://community.modeanalytics.com/python/tutorial/pandas-groupby-and-python-lambda-functions/
data1['Title'] = data1['Title'].apply(lambda x: 'Misc' if title_names.loc[x] == True else x)
print(data1['Title'].value_counts())
print("-"*10)


#preview data again
data1.info()
data_val.info()
data1.sample(10)
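
The code above bins Fare with qcut and Age with cut. A small sketch on invented numbers, assuming only pandas, shows how the two differ when the values are skewed; the series and variable names are hypothetical.

```python
import pandas as pd

# hypothetical skewed values, loosely in the spirit of fares
fares = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

equal_width = pd.cut(fares, 4)    # 4 bins of equal width across the value range
equal_count = pd.qcut(fares, 4)   # 4 bins holding roughly equal numbers of rows

width_counts = equal_width.value_counts(sort=False).tolist()
count_counts = equal_count.value_counts(sort=False).tolist()

print("cut :", width_counts)
print("qcut:", count_counts)
```

With one large outlier, cut puts almost every row in the first bin, while qcut keeps the bins balanced. That is one plausible reason to prefer quantile bins for a skewed variable such as Fare.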

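5.3 Convert formats
We will convert categorical data to dummy variables for mathematical analysis. There are multiple ways to encode categorical variables; we will use the sklearn and pandas functions.

In this step, we will also define our x (independent/features/explanatory/predictor/etc.) and y (dependent/target/outcome/response/etc.) variables for data modeling.
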

label = LabelEncoder()
for dataset in data_cleaner:    
    dataset['Sex_Code'] = label.fit_transform(dataset['Sex'])
    dataset['Embarked_Code'] = label.fit_transform(dataset['Embarked'])
    dataset['Title_Code'] = label.fit_transform(dataset['Title'])
    dataset['AgeBin_Code'] = label.fit_transform(dataset['AgeBin'])
    dataset['FareBin_Code'] = label.fit_transform(dataset['FareBin'])


#define y variable aka target/outcome
Target = ['Survived']

#define x variables for original features aka feature selection
data1_x = ['Sex','Pclass', 'Embarked', 'Title','SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone'] #pretty name/values for charts
data1_x_calc = ['Sex_Code','Pclass', 'Embarked_Code', 'Title_Code','SibSp', 'Parch', 'Age', 'Fare'] #coded for algorithm calculation
data1_xy =  Target + data1_x
print('Original X Y: ', data1_xy, '\n')


#define x variables for original w/bin features to remove continuous variables
data1_x_bin = ['Sex_Code','Pclass', 'Embarked_Code', 'Title_Code', 'FamilySize', 'AgeBin_Code', 'FareBin_Code']
data1_xy_bin = Target + data1_x_bin
print('Bin X Y: ', data1_xy_bin, '\n')


#define x and y variables for dummy features original
data1_dummy = pd.get_dummies(data1[data1_x])
data1_x_dummy = data1_dummy.columns.tolist()
data1_xy_dummy = Target + data1_x_dummy
print('Dummy X Y: ', data1_xy_dummy, '\n')



data1_dummy.head()
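
The two encodings used above behave differently, and a toy column makes the contrast visible. This is a sketch under the assumption that pandas and scikit-learn are installed; the frame ports is hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# hypothetical toy column
ports = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# label encoding: a single integer column; classes are sorted, so C=0, Q=1, S=2
ports["Embarked_Code"] = LabelEncoder().fit_transform(ports["Embarked"])

# dummy (one-hot) encoding: one indicator column per category
dummies = pd.get_dummies(ports[["Embarked"]])

print(ports)
print(dummies)
```

Label codes are compact but imply an order (C < Q < S) that the ports do not really have; dummy columns avoid that at the cost of more features. Note also that a LabelEncoder fitted separately on two datasets only yields matching codes when both contain the same categories.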

5.4 Double-check the cleaned data
Now that we have cleaned our data, let's check it once more!

print('Train columns with null values: \n', data1.isnull().sum())
print("-"*10)
print (data1.info())
print("-"*10)

print('Test/Validation columns with null values: \n', data_val.isnull().sum())
print("-"*10)
print (data_val.info())
print("-"*10)

data_raw.describe(include = 'all')

 

 

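5.5 Split training and testing data
  As mentioned previously, the test file provided is really the validation data for the competition submission. So we will use an sklearn function to split the training data into two datasets with a 75/25 split. This is important so that we do not overfit our model, meaning the algorithm becomes so specific to a given subset that it cannot accurately generalize to another subset from the same dataset. It is important that our algorithm has not seen the subset we will use for testing, so it cannot "cheat" by memorizing the answers. We will use sklearn's train_test_split function. In later sections we will also use sklearn's cross-validation functions, which split our dataset into train and test sets for data modeling comparison.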

#split train and test data with function defaults
#random_state -> seed or control random number generator: https://www.quora.com/What-is-seed-in-random-number-generation
train1_x, test1_x, train1_y, test1_y = model_selection.train_test_split(data1[data1_x_calc], data1[Target], random_state = 0)
train1_x_bin, test1_x_bin, train1_y_bin, test1_y_bin = model_selection.train_test_split(data1[data1_x_bin], data1[Target] , random_state = 0)
train1_x_dummy, test1_x_dummy, train1_y_dummy, test1_y_dummy = model_selection.train_test_split(data1_dummy[data1_x_dummy], data1[Target], random_state = 0)


print("Data1 Shape: {}".format(data1.shape))
print("Train1 Shape: {}".format(train1_x.shape))
print("Test1 Shape: {}".format(test1_x.shape))

train1_x_bin.head()
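
The 75/25 proportion comes from train_test_split's default test size. A minimal sketch on synthetic arrays, assuming numpy and scikit-learn, shows the resulting shapes and the effect of random_state; the arrays are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)   # 100 toy rows, 2 features
y = np.arange(100) % 2               # toy binary target

# function defaults: 25% of the rows are held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# the same seed reproduces the same split
X_test_again = train_test_split(X, y, random_state=0)[1]

train_ids = set(X_train[:, 0].tolist())
test_ids = set(X_test[:, 0].tolist())

print(X_train.shape, X_test.shape)
print("same split with same seed:", np.array_equal(X_test, X_test_again))
print("rows shared by train and test:", len(train_ids & test_ids))
```

No row appears in both subsets, which is what keeps the test score an honest estimate.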

 

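6. Step 4: Perform Exploratory Analysis

  Now that our data is cleaned, we will explore it with descriptive and graphical statistics. In this stage, you will find yourself classifying features and determining their correlation with the target variable and with each other.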

#Discrete Variable Correlation by Survival using
#group by aka pivot table: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html
for x in data1_x:
    if data1[x].dtype != 'float64' :
        print('Survival Correlation by:', x)
        print(data1[[x, Target[0]]].groupby(x, as_index=False).mean())
        print('-'*10, '\n')
        

#using crosstabs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.crosstab.html
print(pd.crosstab(data1['Title'],data1[Target[0]]))

 

#IMPORTANT: Intentionally plotted different ways for learning purposes only. 

#optional plotting w/pandas: https://pandas.pydata.org/pandas-docs/stable/visualization.html

#we will use matplotlib.pyplot: https://matplotlib.org/api/pyplot_api.html

#to organize our graphics will use figure: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.figure.html#matplotlib.pyplot.figure
#subplot: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplot.html#matplotlib.pyplot.subplot
#and subplotS: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplots.html?highlight=matplotlib%20pyplot%20subplots#matplotlib.pyplot.subplots

#graph distribution of quantitative data
plt.figure(figsize=[16,12])

plt.subplot(231)
plt.boxplot(x=data1['Fare'], showmeans = True, meanline = True)
plt.title('Fare Boxplot')
plt.ylabel('Fare ($)')

plt.subplot(232)
plt.boxplot(data1['Age'], showmeans = True, meanline = True)
plt.title('Age Boxplot')
plt.ylabel('Age (Years)')

plt.subplot(233)
plt.boxplot(data1['FamilySize'], showmeans = True, meanline = True)
plt.title('Family Size Boxplot')
plt.ylabel('Family Size (#)')

plt.subplot(234)
plt.hist(x = [data1[data1['Survived']==1]['Fare'], data1[data1['Survived']==0]['Fare']], 
         stacked=True, color = ['g','r'],label = ['Survived','Dead'])
plt.title('Fare Histogram by Survival')
plt.xlabel('Fare ($)')
plt.ylabel('# of Passengers')
plt.legend()

plt.subplot(235)
plt.hist(x = [data1[data1['Survived']==1]['Age'], data1[data1['Survived']==0]['Age']], 
         stacked=True, color = ['g','r'],label = ['Survived','Dead'])
plt.title('Age Histogram by Survival')
plt.xlabel('Age (Years)')
plt.ylabel('# of Passengers')
plt.legend()

plt.subplot(236)
plt.hist(x = [data1[data1['Survived']==1]['FamilySize'], data1[data1['Survived']==0]['FamilySize']], 
         stacked=True, color = ['g','r'],label = ['Survived','Dead'])
plt.title('Family Size Histogram by Survival')
plt.xlabel('Family Size (#)')
plt.ylabel('# of Passengers')
plt.legend()

#we will use seaborn graphics for multi-variable comparison: https://seaborn.pydata.org/api.html

#graph individual features by survival
fig, saxis = plt.subplots(2, 3,figsize=(16,12))

sns.barplot(x = 'Embarked', y = 'Survived', data=data1, ax = saxis[0,0])
sns.barplot(x = 'Pclass', y = 'Survived', order=[1,2,3], data=data1, ax = saxis[0,1])
sns.barplot(x = 'IsAlone', y = 'Survived', order=[1,0], data=data1, ax = saxis[0,2])

sns.pointplot(x = 'FareBin', y = 'Survived', data=data1, ax = saxis[1,0])
sns.pointplot(x = 'AgeBin', y = 'Survived', data=data1, ax = saxis[1,1])
sns.pointplot(x = 'FamilySize', y = 'Survived', data=data1, ax = saxis[1,2])

 

#graph distribution of qualitative data: Pclass
#we know class mattered in survival, now let's compare class and a 2nd feature
fig, (axis1,axis2,axis3) = plt.subplots(1,3,figsize=(14,12))

sns.boxplot(x = 'Pclass', y = 'Fare', hue = 'Survived', data = data1, ax = axis1)
axis1.set_title('Pclass vs Fare Survival Comparison')

sns.violinplot(x = 'Pclass', y = 'Age', hue = 'Survived', data = data1, split = True, ax = axis2)
axis2.set_title('Pclass vs Age Survival Comparison')

sns.boxplot(x = 'Pclass', y ='FamilySize', hue = 'Survived', data = data1, ax = axis3)
axis3.set_title('Pclass vs Family Size Survival Comparison')

 

#graph distribution of qualitative data: Sex
#we know sex mattered in survival, now let's compare sex and a 2nd feature
fig, qaxis = plt.subplots(1,3,figsize=(14,12))

#each title is set on the axis that was just drawn (the original reused axis1 from the previous figure)
sns.barplot(x = 'Sex', y = 'Survived', hue = 'Embarked', data=data1, ax = qaxis[0])
qaxis[0].set_title('Sex vs Embarked Survival Comparison')

sns.barplot(x = 'Sex', y = 'Survived', hue = 'Pclass', data=data1, ax = qaxis[1])
qaxis[1].set_title('Sex vs Pclass Survival Comparison')

sns.barplot(x = 'Sex', y = 'Survived', hue = 'IsAlone', data=data1, ax = qaxis[2])
qaxis[2].set_title('Sex vs IsAlone Survival Comparison')

#more side-by-side comparisons
fig, (maxis1, maxis2) = plt.subplots(1, 2,figsize=(14,12))

#how does family size factor with sex & survival compare
sns.pointplot(x="FamilySize", y="Survived", hue="Sex", data=data1,
              palette={"male": "blue", "female": "pink"},
              markers=["*", "o"], linestyles=["-", "--"], ax = maxis1)

#how does class factor with sex & survival compare
sns.pointplot(x="Pclass", y="Survived", hue="Sex", data=data1,
              palette={"male": "blue", "female": "pink"},
              markers=["*", "o"], linestyles=["-", "--"], ax = maxis2)

 
 
#how does embark port factor with class, sex, and survival compare
#facetgrid: https://seaborn.pydata.org/generated/seaborn.FacetGrid.html
e = sns.FacetGrid(data1, col = 'Embarked')
e.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', ci=95.0, palette = 'deep')
e.add_legend()

#plot the distribution of passenger age by survival outcome
#facetgrid: https://seaborn.pydata.org/generated/seaborn.FacetGrid.html
a = sns.FacetGrid( data1, hue = 'Survived', aspect=4 )
a.map(sns.kdeplot, 'Age', fill = True ) #fill replaces the deprecated shade argument (seaborn 0.11 or later)
a.set(xlim=(0 , data1['Age'].max()))
a.add_legend()

#histogram comparison of sex, class, and age by survival
h = sns.FacetGrid(data1, row = 'Sex', col = 'Pclass', hue = 'Survived')
h.map(plt.hist, 'Age', alpha = .75)
h.add_legend()

#pair plots of entire dataset
#height replaces the older size argument and fill replaces shade in recent seaborn releases
pp = sns.pairplot(data1, hue = 'Survived', palette = 'deep', height=1.2, diag_kind = 'kde', diag_kws=dict(fill=True), plot_kws=dict(s=10) )
pp.set(xticklabels=[])

 

 

#correlation heatmap of dataset
def correlation_heatmap(df):
    _ , ax = plt.subplots(figsize =(14, 12))
    colormap = sns.diverging_palette(220, 10, as_cmap = True)
    
    _ = sns.heatmap(
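        df.select_dtypes(include='number').corr(), #correlate numeric columns only; text and interval columns have no Pearson correlation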
        cmap = colormap,
        square=True, 
        cbar_kws={'shrink':.9 }, 
        ax=ax,
        annot=True, 
        linewidths=0.1,vmax=1.0, linecolor='white',
        annot_kws={'fontsize':12 }
    )
    
    plt.title('Pearson Correlation of Features', y=1.05, size=15)

correlation_heatmap(data1)

 

 

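7. Step 5: Model Data

  Data science is a multi-disciplinary field that sits between mathematics (including statistics), computer science, and business management. Most data scientists come from one of these three fields, so they tend to lean toward that discipline. However, data science is like a three-legged stool, with no one leg more important than the others. So this step requires advanced knowledge of mathematics. But don't worry, we only need a high-level overview, which we cover in this article. Also, thanks to computer science, a lot of the heavy lifting is done for you. Problems that once required a graduate degree in mathematics or statistics now take only a few lines of code. Last, we need some business acumen to think through the problem. After all, like training a pet, the model learns from us and needs to be guided one step at a time.

  Machine learning (ML), as the name suggests, is teaching the machine how to think, not what to think. While this topic and big data have been around for decades, they are more popular than ever because the barrier to entry is lower for businesses and professionals alike. This is both good and bad. It is good because these algorithms are now accessible to more people who can solve more problems in the real world. It is bad because a lower barrier to entry means more people will not understand the tools they are using and will come to incorrect conclusions. I have used the analogy before of asking someone for one type of screwdriver and being handed a flathead screwdriver or, at worst, a hammer. At best, it shows a complete lack of understanding. At worst, it derails the project, or even worse, leads to wrong conclusions. So now I will show you what to do and, most importantly, why you do it.

  First, you must understand that the purpose of machine learning is to solve human problems. Machine learning can be divided into supervised learning, unsupervised learning, and reinforcement learning. Supervised learning is where you train the model by presenting it with a training dataset that includes the correct answer. Unsupervised learning is where you train the model with a training dataset that does not include the correct answer. Reinforcement learning is a hybrid of the previous two, where the model is not given the correct answer immediately, but only after a sequence of events, to reinforce learning. We are doing supervised machine learning, because we train our algorithm by presenting it with a set of features and their corresponding target. We then hope to present it with a new subset from the same dataset and get similar prediction accuracy.

  There are many machine learning algorithms, but they can be reduced to four categories, classification, regression, clustering, and dimensionality reduction, depending on your target variable and data modeling goals. This article focuses on classification and regression. As a generalization, a continuous target variable requires a regression algorithm and a discrete target variable requires a classification algorithm. As a side note, logistic regression, despite having regression in its name, is really a classification algorithm. Since our problem is predicting whether a passenger survived or did not survive, this is a discrete target variable. We will start our analysis with classification algorithms from the sklearn library. We will use cross-validation and scoring metrics, discussed in later sections, to rank and compare the performance of our algorithms.

Machine learning model selection:

Common machine learning classification algorithms:

So, how do I choose a machine learning algorithm (MLA)?
  IMPORTANT: When it comes to data modeling, the beginner's question is always, "What is the best machine learning algorithm?" To answer it, the beginner must learn the No Free Lunch Theorem (NFLT) of machine learning. In short, NFLT states that there is no super algorithm that works best in all situations for all datasets. So the best approach is to try multiple MLAs, tune them, and compare them for your specific scenario. That said, some good research has been done to compare algorithms, such as Caruana & Niculescu-Mizil 2006, Ogutu et al. 2011 (done by the NIH for genomic selection), Fernandez-Delgado et al. 2014 (comparing 179 classifiers from 17 families), and Thoma 2016 (sklearn comparison). There is also a school of thought that says more data beats a better algorithm.

  With all this information, where is a beginner to start? I recommend starting with Trees, Bagging, Random Forests, and Boosting. They are basically different implementations of a decision tree, which is the easiest concept to learn and understand. They are also easier to tune, as discussed in the next section, than something like SVC. Below, I give an overview of how to run and compare several MLAs, but the rest of this kernel focuses on learning data modeling via decision trees and their derivatives.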

#Machine Learning Algorithm (MLA) Selection and Initialization
MLA = [
    #Ensemble Methods
    ensemble.AdaBoostClassifier(),
    ensemble.BaggingClassifier(),
    ensemble.ExtraTreesClassifier(),
    ensemble.GradientBoostingClassifier(),
    ensemble.RandomForestClassifier(),

    #Gaussian Processes
    gaussian_process.GaussianProcessClassifier(),
    
    #GLM
    linear_model.LogisticRegressionCV(),
    linear_model.PassiveAggressiveClassifier(),
    linear_model.RidgeClassifierCV(),
    linear_model.SGDClassifier(),
    linear_model.Perceptron(),
    
    #Naive Bayes
    naive_bayes.BernoulliNB(),
    naive_bayes.GaussianNB(),
    
    #Nearest Neighbor
    neighbors.KNeighborsClassifier(),
    
    #SVM
    svm.SVC(probability=True),
    svm.NuSVC(probability=True),
    svm.LinearSVC(),
    
    #Trees    
    tree.DecisionTreeClassifier(),
    tree.ExtraTreeClassifier(),
    
    #Discriminant Analysis
    discriminant_analysis.LinearDiscriminantAnalysis(),
    discriminant_analysis.QuadraticDiscriminantAnalysis(),

    
    #xgboost: http://xgboost.readthedocs.io/en/latest/model.html
    XGBClassifier()    
    ]



#split dataset in cross-validation with this splitter class: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html#sklearn.model_selection.ShuffleSplit
#note: this is an alternative to train_test_split
cv_split = model_selection.ShuffleSplit(n_splits = 10, test_size = .3, train_size = .6, random_state = 0 ) # run model 10x with 60/30 split intentionally leaving out 10%

#create table to compare MLA metrics
MLA_columns = ['MLA Name', 'MLA Parameters','MLA Train Accuracy Mean', 'MLA Test Accuracy Mean', 'MLA Test Accuracy 3*STD' ,'MLA Time']
MLA_compare = pd.DataFrame(columns = MLA_columns)

#create table to compare MLA predictions
MLA_predict = data1[Target].copy() #copy so that adding prediction columns does not write into a slice of data1

#index through MLA and save performance to table
row_index = 0
for alg in MLA:

    #set name and parameters
    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index, 'MLA Name'] = MLA_name
    MLA_compare.loc[row_index, 'MLA Parameters'] = str(alg.get_params())
    
    #score model with cross validation: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate
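    #return_train_score=True is required for the 'train_score' key used below (it is off by default in recent scikit-learn)
    cv_results = model_selection.cross_validate(alg, data1[data1_x_bin], data1[Target], cv  = cv_split, return_train_score = True)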

    MLA_compare.loc[row_index, 'MLA Time'] = cv_results['fit_time'].mean()
    MLA_compare.loc[row_index, 'MLA Train Accuracy Mean'] = cv_results['train_score'].mean()
    MLA_compare.loc[row_index, 'MLA Test Accuracy Mean'] = cv_results['test_score'].mean()   
    #if this is a non-bias random sample, then +/-3 standard deviations (std) from the mean, should statistically capture 99.7% of the subsets
    MLA_compare.loc[row_index, 'MLA Test Accuracy 3*STD'] = cv_results['test_score'].std()*3   #let's know the worst that can happen!
    

    #save MLA predictions - see section 6 for usage
    alg.fit(data1[data1_x_bin], data1[Target])
    MLA_predict[MLA_name] = alg.predict(data1[data1_x_bin])
    
    row_index+=1

    
#print and sort table: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html
MLA_compare.sort_values(by = ['MLA Test Accuracy Mean'], ascending = False, inplace = True)
MLA_compare
#MLA_predict
#barplot using https://seaborn.pydata.org/generated/seaborn.barplot.html
sns.barplot(x='MLA Test Accuracy Mean', y = 'MLA Name', data = MLA_compare, color = 'm')

#prettify using pyplot: https://matplotlib.org/api/pyplot_api.html
plt.title('Machine Learning Algorithm Accuracy Score \n')
plt.xlabel('Accuracy Score (%)')
plt.ylabel('Algorithm')
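
The cv_split object above runs ten shuffled 60/30 splits and the table reports the mean test accuracy plus three standard deviations. The sketch below, assuming numpy and scikit-learn, shows what those splits look like on 100 toy row indices. The three accuracy values are hypothetical and only illustrate the summary arithmetic.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

rows = np.arange(100)   # stand-in for 100 row indices
splitter = ShuffleSplit(n_splits=10, test_size=0.3, train_size=0.6, random_state=0)

# each split draws 60 training rows and 30 test rows; 10 rows are left out
sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in splitter.split(rows)]

# hypothetical per-split test accuracies, not results from the article
scores = np.array([0.80, 0.82, 0.84])
mean_score = scores.mean()
spread = scores.std() * 3   # the "3*STD" column: a rough worst-case band around the mean

print(sizes[0], len(sizes))
print(round(mean_score, 3), round(spread, 3))
```

Unlike k-fold cross-validation, the same row may land in the test set of several splits, because every split is an independent shuffle.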

 

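8. Evaluate Model Performance

  Let's recap: with some basic data cleaning, analysis, and machine learning algorithms (MLA), we are able to predict passenger survival with about 82% accuracy. Not bad for a few lines of code. But the question we always ask is, can we do better and, more importantly, get a return on investment (ROI) for the time invested? For example, if we only increase our accuracy by 0.1%, is it really worth 3 months of development? If you work in research, the answer may be yes, but if you work in business, the answer is almost certainly no. So keep that in mind when improving your model.

NOTE: Determine a baseline accuracy
  Before we decide how to make our model better, let's determine whether our model is even worth keeping. To do that, we have to go back to the basics of data science. We know this is a binary problem, because there are only two possible outcomes: a passenger survived or died. So think of it as a coin flip problem. If you have a fair coin and you guess heads or tails, you have a 50% chance of guessing correctly. So let's set 50% as the worst model performance, because anything lower than that raises the question: why do I need you when I can just flip a coin?

  So, with no information about the dataset, we can always get 50% on a binary problem. But we do have information about the dataset, so we should be able to do better. We know that 1,502 / 2,224, or 67.5%, of people died. Therefore, if we simply predict the most frequent outcome, that 100% of people died, we would be right 67.5% of the time. So let's set 68% as bad model performance, because again, anything lower than that raises the question: why do I need you when I can just predict the most frequent outcome?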
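
The two baselines can be written down directly from the figures quoted above. This is a plain-Python sketch; the variable names are illustrative.

```python
died, aboard = 1502, 2224   # figures quoted in the text above

# worst case: a random guess on a two-outcome problem
coin_flip_baseline = 0.5

# "bad" case: always predict the most frequent outcome (died)
majority_baseline = max(died, aboard - died) / aboard

print(f"coin flip: {coin_flip_baseline:.1%}, majority class: {majority_baseline:.1%}")
```

Any model worth keeping should clear the majority-class figure, not just the coin flip.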

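NOTE: How to create your own model
  Our accuracy is increasing, but can we do better? Are there any signals in our data? To illustrate this, we are going to build our own decision tree model, because it is the easiest to conceptualize and requires only simple addition and multiplication. When creating a decision tree, you want to ask questions that segment the target response, placing the survived/1 and dead/0 passengers into homogeneous subgroups. This is part science and part art, so let's play a game of 21 questions to show you how it works. If you want to follow along on your own, download the train dataset and import it into Excel. Create a pivot table with Survived in the columns, the count and % of row count in the values, and the features described below in the rows.

  Remember, the name of the game is to use a decision tree model to create subgroups, so that survived/1 is in one bucket and dead/0 is in another. Our rule of thumb is the majority rule. Meaning, if the majority, 50% or more, survived, then everyone in the subgroup is assumed to have survived/1; if 50% or fewer survived, then everyone in the subgroup is assumed to have died/0. Also, we will stop if a subgroup has fewer than 10 members or our model accuracy plateaus or decreases.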
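
The majority rule can be sketched with a pandas groupby on a handful of invented rows. The data below is hypothetical and far smaller than the Titanic set. The text leaves an exact 50% tie ambiguous, so this sketch assumes a tie is predicted as died/0.

```python
import pandas as pd

# hypothetical toy passengers
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})

# survival rate inside each subgroup (one "question" of the game)
rate = toy.groupby("Sex")["Survived"].mean()

# majority rule: predict 1 where more than half the subgroup survived (ties go to 0, an assumption)
rule = (rate > 0.5).astype(int)

# every member of a subgroup receives the subgroup's prediction
toy["Predict"] = toy["Sex"].map(rule)
accuracy = (toy["Predict"] == toy["Survived"]).mean()

print(rule.to_dict(), round(accuracy, 3))
```

Each further question splits a subgroup again and is kept only if the overall accuracy improves, which is the procedure followed in the questions that come next.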

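Question 1: Were you on the Titanic? If yes, then the majority (62%) died. Note that our sample survival rate differs from our population figure of 68%. Nonetheless, if we assume everybody died, our sample accuracy is 62%.

Question 2: Are you male or female? Male, the majority (81%) died. Female, the majority (74%) survived. That gives us an accuracy of 79%.

Question 3A (going down the female branch with count = 314): Are you in class 1, 2, or 3? Class 1, the majority (97%) survived; class 2, the majority (92%) survived. Since the dead subgroup is less than 10, we stop going down this branch. Class 3 is an even 50-50 split. No new information to improve our model is gained.

Question 4A (going down the female class 3 branch with count = 144): Did you embark from port C, Q, or S? We gain a little information. For C and Q, the majority still survived, so there is no change. Also, the dead subgroup is less than 10, so we stop. For S, the majority (63%) died. So we change females, class 3, embarked S from assumed survived to assumed dead. Our model accuracy increases to 81%.

Question 5A (going down the female class 3 embarked S branch with count = 88): So far, it looks like we have made good decisions. Adding another level does not seem to gain much more information. In this subgroup 55 died and 33 survived. Since the majority died, we need to find a signal that identifies the 33, or a subgroup of them, to change from dead to survived and improve our model accuracy. We can play with our features. One I found was a fare of 0-8, where the majority survived. It is a small sample of 11-9, but one often used in statistics. We slightly improve our accuracy, but not enough to move us past 82%. So we stop here.

Question 3B (going down the male branch with count = 577): Going back to question 2, we know the majority of males died. So we are looking for a feature that identifies a subgroup in which the majority survived. Surprisingly, class or even embarked did not matter as it did for females, but title does and gets us to 82%. Guessing and checking other features, none seems to push us past 82%. So we stop here for now.

You did it: with very little information, we got to 82% accuracy. On a scale of worst, bad, good, better, and best, we will set 82% as good, since it is a simple model that yields decent results. But the question remains, can we do better than our handmade model?

Before we do, let's code what we just wrote above. Please note, this is a manual process created by "hand". You won't have to do this, but it is important to understand it before you start working with MLAs. Think of an MLA like a TI-89 calculator on a calculus exam. It is very powerful and helps you with a lot of the grunt work. But if you don't know what you are doing on the exam, a calculator, even a TI-89, is not going to help you pass. So study the next section wisely.

Reference: Cross-Validation and Decision Tree Tutorial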

#iterate over the DataFrame rows as (index, row) pairs with DataFrame.iterrows
for index, row in data1.iterrows(): 
    #random number generator: https://docs.python.org/2/library/random.html
    if random.random() > .5:     # Random float x, 0.0 <= x < 1.0    
        data1.loc[index, 'Random_Predict'] = 1 #predict survived/1; .loc replaces set_value, which was removed from pandas
    else: 
        data1.loc[index, 'Random_Predict'] = 0 #predict died/0
    

#score random guess of survival. Use shortcut 1 = Right Guess and 0 = Wrong Guess
#the mean of the column will then equal the accuracy
data1['Random_Score'] = 0 #assume prediction wrong
data1.loc[(data1['Survived'] == data1['Random_Predict']), 'Random_Score'] = 1 #set to 1 for correct prediction
print('Coin Flip Model Accuracy: {:.2f}%'.format(data1['Random_Score'].mean()*100))

#we can also use scikit's accuracy_score function to save us a few lines of code
#http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score
print('Coin Flip Model Accuracy w/SciKit: {:.2f}%'.format(metrics.accuracy_score(data1['Survived'], data1['Random_Predict'])*100))

 

#group by, aka pivot table, with pandas DataFrame.groupby
pivot_female = data1[data1.Sex=='female'].groupby(['Sex','Pclass', 'Embarked','FareBin'])['Survived'].mean()
print('Survival Decision Tree w/Female Node: \n',pivot_female)

pivot_male = data1[data1.Sex=='male'].groupby(['Sex','Title'])['Survived'].mean()
print('\n\nSurvival Decision Tree w/Male Node: \n',pivot_male)
def mytree(df):
    
    #initialize table to store predictions
    Model = pd.DataFrame(data = {'Predict':[]})
    male_title = ['Master'] #survived titles

    for index, row in df.iterrows():

        #Question 1: Were you on the Titanic; majority died
        Model.loc[index, 'Predict'] = 0

        #Question 2: Are you female; majority survived
        if (df.loc[index, 'Sex'] == 'female'):
                  Model.loc[index, 'Predict'] = 1

        #Question 3A Female - Class and Question 4 Embarked gain minimum information

        #Question 5B Female - FareBin; set anything less than .5 in female node decision tree back to 0       
        if ((df.loc[index, 'Sex'] == 'female') & 
            (df.loc[index, 'Pclass'] == 3) & 
            (df.loc[index, 'Embarked'] == 'S')  &
            (df.loc[index, 'Fare'] > 8)

           ):
                  Model.loc[index, 'Predict'] = 0

        #Question 3B Male: Title; set anything greater than .5 to 1 for majority survived
        if ((df.loc[index, 'Sex'] == 'male') &
            (df.loc[index, 'Title'] in male_title)
            ):
            Model.loc[index, 'Predict'] = 1
        
        
    return Model


#model data
Tree_Predict = mytree(data1)
print('Decision Tree Model Accuracy/Precision Score: {:.2f}%\n'.format(metrics.accuracy_score(data1['Survived'], Tree_Predict)*100))

#accuracy summary report with sklearn.metrics.classification_report
#Where recall score = (true positives)/(true positive + false negative) w/1 being best:http://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score
#And F1 score = harmonic mean of precision and recall w/1 being best: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score
print(metrics.classification_report(data1['Survived'], Tree_Predict))

import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Compute confusion matrix
cnf_matrix = metrics.confusion_matrix(data1['Survived'], Tree_Predict)
np.set_printoptions(precision=2)

class_names = ['Dead', 'Survived']
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix, without normalization')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True, 
                      title='Normalized confusion matrix')
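
The scores printed by classification_report can be derived by hand from a confusion matrix. The sketch below uses a hypothetical 2x2 matrix, not the one produced by the handmade tree, and follows scikit-learn's layout of true labels in rows and predicted labels in columns.

```python
import numpy as np

# hypothetical counts: rows = true label, columns = predicted label
#              pred dead  pred survived
cm = np.array([[50, 10],    # true dead
               [ 5, 35]])   # true survived

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()     # share of all predictions that were right
precision = tp / (tp + fp)          # of those predicted survived, how many did
recall = tp / (tp + fn)             # of those who survived, how many were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```

Accuracy alone can hide an imbalance between the two kinds of error, which is why the report lists precision and recall per class.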

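8.1 Model Performance with Cross-Validation (CV)
  In step 5.0, we used the sklearn cross_validate function to train, test, and score our model performance.

  Remember, it is important that we use one subset of the train data to build our model and a different subset to evaluate it. Otherwise, our model will be overfitted: it is great at "predicting" data it has already seen, but terrible at predicting data it has not seen, which is not prediction at all. It is like cheating on a school quiz to get 100%, and then failing the exam because you never truly learned anything. The same is true of machine learning.

  CV is basically a shortcut to split and score our model multiple times, so we can get an idea of how well it will perform on unseen data. It costs a little more in computer processing, but it is important so that we do not gain false confidence. This is helpful in a Kaggle competition or in any use case where consistency matters and surprises should be avoided.

  In addition to CV, we used a customized sklearn train test splitter to allow a little more randomness in our test scoring.

  [Figure: default CV split]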
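The idea of splitting and scoring several times can be sketched without any library. The block below is only an illustration under stated assumptions: the split sizes (10 splits, 60% train, 30% test), the toy labels, and the majority-class "model" are hypothetical, and `shuffle_split` is a hand-written helper, not the sklearn splitter the tutorial uses.

```python
import random
import statistics

def shuffle_split(n_samples, n_splits, test_size, train_size, seed):
    """Yield (train_idx, test_idx) pairs from independent random shuffles."""
    rng = random.Random(seed)
    n_test = int(n_samples * test_size)
    n_train = int(n_samples * train_size)
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Train and test slices never overlap within one split.
        yield indices[:n_train], indices[n_train:n_train + n_test]

# Toy labels (hypothetical): 60% class 0, 40% class 1.
labels = [0] * 60 + [1] * 40

scores = []
for train_idx, test_idx in shuffle_split(len(labels), n_splits=10,
                                         test_size=0.3, train_size=0.6, seed=0):
    # "Model": always predict the majority class seen in the training subset.
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    # Score only on rows the model never saw.
    accuracy = sum(labels[i] == majority for i in test_idx) / len(test_idx)
    scores.append(accuracy)

mean_score = statistics.mean(scores)
spread = 3 * statistics.pstdev(scores)  # population std, like numpy's default .std()
print("Test score mean: {:.2f}".format(mean_score * 100))
print("Test score 3*std: +/- {:.2f}".format(spread * 100))
```

Reporting the mean together with three standard deviations, as the tutorial's print statements do, gives a rough band for how much the score moves from split to split.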

9. Tune Model with Hyper-Parameters

  When we used the sklearn Decision Tree (DT) Classifier, we accepted all the function defaults. This leaves an opportunity to see how various hyper-parameter settings change the model accuracy.

  However, in order to tune a model, we need to actually understand it. That is why I took the time in the previous sections to show you how predictions work. Now let's learn a little more about our DT algorithm.

Reference: sklearn

 For the advantages and disadvantages of decision trees, see my article on decision trees.

  Below are the available hyper-parameters and their definitions:

class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False)

  We will tune our model using ParameterGrid, GridSearchCV, and customized sklearn scoring (the ROC_AUC score). We will then visualize our tree with graphviz.
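A grid search simply tries every combination of the listed values. The sketch below mirrors the decision tree grid used in this section and expands it with `itertools.product`; `expand_grid` is a hypothetical helper written for illustration, not the sklearn implementation.

```python
from itertools import product

# Same keys and values as the decision tree grid tuned in this section.
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [2, 4, 6, 8, 10, None],
    'random_state': [0],
}

def expand_grid(grid):
    """Return one dict per combination of the listed values."""
    keys = sorted(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[key] for key in keys))]

candidates = expand_grid(param_grid)
print(len(candidates))   # 2 criteria * 6 depths * 1 seed
for params in candidates[:3]:
    print(params)
```

Every candidate is fitted once per CV split, so the total number of fits is the number of candidates times the number of splits. That is why adding values to a grid increases the runtime quickly.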

#base model
dtree = tree.DecisionTreeClassifier(random_state = 0)
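#return_train_score=True is required for 'train_score' in scikit-learn >= 0.21, where it defaults to False
base_results = model_selection.cross_validate(dtree, data1[data1_x_bin], data1[Target], cv  = cv_split, return_train_score = True)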
dtree.fit(data1[data1_x_bin], data1[Target])

print('BEFORE DT Parameters: ', dtree.get_params())
print("BEFORE DT Training w/bin score mean: {:.2f}". format(base_results['train_score'].mean()*100)) 
print("BEFORE DT Test w/bin score mean: {:.2f}". format(base_results['test_score'].mean()*100))
print("BEFORE DT Test w/bin score 3*std: +/- {:.2f}". format(base_results['test_score'].std()*100*3))
#print("BEFORE DT Test w/bin set score min: {:.2f}". format(base_results['test_score'].min()*100))
print('-'*10)


#tune hyper-parameters: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
param_grid = {'criterion': ['gini', 'entropy'],  #scoring methodology; two supported formulas for calculating information gain - default is gini
              #'splitter': ['best', 'random'], #splitting methodology; two supported strategies - default is best
              'max_depth': [2,4,6,8,10,None], #max depth tree can grow; default is none
              #'min_samples_split': [2,5,10,.03,.05], #minimum subset size BEFORE new split (fraction is % of total); default is 2
              #'min_samples_leaf': [1,5,10,.03,.05], #minimum subset size AFTER new split (fraction is % of total); default is 1
              #'max_features': [None, 'auto'], #max features to consider when performing split; default none or all
              'random_state': [0] #seed or control random number generator: https://www.quora.com/What-is-seed-in-random-number-generation
             }

#print(list(model_selection.ParameterGrid(param_grid)))

#choose best model with grid_search: #http://scikit-learn.org/stable/modules/grid_search.html#grid-search
#http://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html
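#return_train_score=True is required for 'mean_train_score' in scikit-learn >= 0.21, where it defaults to False
tune_model = model_selection.GridSearchCV(tree.DecisionTreeClassifier(), param_grid=param_grid, scoring = 'roc_auc', cv = cv_split, return_train_score = True)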
tune_model.fit(data1[data1_x_bin], data1[Target])

#print(tune_model.cv_results_.keys())
#print(tune_model.cv_results_['params'])
print('AFTER DT Parameters: ', tune_model.best_params_)
#print(tune_model.cv_results_['mean_train_score'])
print("AFTER DT Training w/bin score mean: {:.2f}". format(tune_model.cv_results_['mean_train_score'][tune_model.best_index_]*100)) 
#print(tune_model.cv_results_['mean_test_score'])
print("AFTER DT Test w/bin score mean: {:.2f}". format(tune_model.cv_results_['mean_test_score'][tune_model.best_index_]*100))
print("AFTER DT Test w/bin score 3*std: +/- {:.2f}". format(tune_model.cv_results_['std_test_score'][tune_model.best_index_]*100*3))
print('-'*10)


#duplicates gridsearchcv
#tune_results = model_selection.cross_validate(tune_model, data1[data1_x_bin], data1[Target], cv  = cv_split)

#print('AFTER DT Parameters: ', tune_model.best_params_)
#print("AFTER DT Training w/bin set score mean: {:.2f}". format(tune_results['train_score'].mean()*100)) 
#print("AFTER DT Test w/bin set score mean: {:.2f}". format(tune_results['test_score'].mean()*100))
#print("AFTER DT Test w/bin set score min: {:.2f}". format(tune_results['test_score'].min()*100))
#print('-'*10)

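10. Tune Model with Feature Selection

  As stated in the beginning, more predictor variables do not make a better model, but the right predictors do. So another step in data modeling is feature selection. Sklearn has several options; we will use recursive feature elimination (RFE) with cross validation (CV).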
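The following is a simplified sketch of the recursive elimination idea, not the sklearn RFECV implementation. It assumes a tiny nearest-centroid "model", uses the gap between class centroids as a stand-in for feature importance, and scores on a single hold-out split instead of full CV. The data and all names (`centroids`, `accuracy`, `eliminate_recursively`) are hypothetical.

```python
def centroids(rows, labels, features):
    """Mean of each selected feature per class (a tiny nearest-centroid 'model')."""
    result = {}
    for cls in sorted(set(labels)):
        members = [row for row, label in zip(rows, labels) if label == cls]
        result[cls] = {f: sum(r[f] for r in members) / len(members) for f in features}
    return result

def accuracy(model, rows, labels, features):
    """Share of rows whose nearest class centroid matches the true label."""
    hits = 0
    for row, label in zip(rows, labels):
        nearest = min(model, key=lambda cls: sum((row[f] - model[cls][f]) ** 2
                                                 for f in features))
        hits += nearest == label
    return hits / len(rows)

def eliminate_recursively(train, y_train, test, y_test, features):
    """Drop the weakest feature one at a time; record the score of every subset."""
    remaining = list(features)
    history = []
    while remaining:
        model = centroids(train, y_train, remaining)
        history.append((list(remaining), accuracy(model, test, y_test, remaining)))
        # Importance here = gap between the two class centroids (an assumption).
        weakest = min(remaining, key=lambda f: abs(model[0][f] - model[1][f]))
        remaining.remove(weakest)
    # Highest score wins; on ties prefer the smaller feature set.
    return max(history, key=lambda item: (item[1], -len(item[0]))), history

# Deterministic toy data (hypothetical): one informative column, two noisy ones.
rows = [{'signal': i % 2, 'noise_a': i % 3, 'noise_b': 3 * ((2 * i) % 5)}
        for i in range(40)]
labels = [i % 2 for i in range(40)]
best, history = eliminate_recursively(rows[:28], labels[:28], rows[28:], labels[28:],
                                      ['signal', 'noise_a', 'noise_b'])
for subset, score in history:
    print(subset, round(score, 2))
print('selected:', best[0])
```

On this toy data the noisy columns pull the distance calculation away from the informative one, so the smallest subset scores best. This is the "right predictors, not more predictors" point in miniature.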

#base model
print('BEFORE DT RFE Training Shape Old: ', data1[data1_x_bin].shape) 
print('BEFORE DT RFE Training Columns Old: ', data1[data1_x_bin].columns.values)

print("BEFORE DT RFE Training w/bin score mean: {:.2f}". format(base_results['train_score'].mean()*100)) 
print("BEFORE DT RFE Test w/bin score mean: {:.2f}". format(base_results['test_score'].mean()*100))
print("BEFORE DT RFE Test w/bin score 3*std: +/- {:.2f}". format(base_results['test_score'].std()*100*3))
print('-'*10)



#feature selection
dtree_rfe = feature_selection.RFECV(dtree, step = 1, scoring = 'accuracy', cv = cv_split)
dtree_rfe.fit(data1[data1_x_bin], data1[Target])

#transform x&y to reduced features and fit new model
#alternative: can use pipeline to reduce fit and transform steps: http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
X_rfe = data1[data1_x_bin].columns.values[dtree_rfe.get_support()]
rfe_results = model_selection.cross_validate(dtree, data1[X_rfe], data1[Target], cv  = cv_split, return_train_score = True) #train scores are only returned when requested in scikit-learn >= 0.21

#print(dtree_rfe.grid_scores_)
print('AFTER DT RFE Training Shape New: ', data1[X_rfe].shape) 
print('AFTER DT RFE Training Columns New: ', X_rfe)

print("AFTER DT RFE Training w/bin score mean: {:.2f}". format(rfe_results['train_score'].mean()*100)) 
print("AFTER DT RFE Test w/bin score mean: {:.2f}". format(rfe_results['test_score'].mean()*100))
print("AFTER DT RFE Test w/bin score 3*std: +/- {:.2f}". format(rfe_results['test_score'].std()*100*3))
print('-'*10)


#tune rfe model
#return_train_score=True is required for 'mean_train_score' in scikit-learn >= 0.21, where it defaults to False
rfe_tune_model = model_selection.GridSearchCV(tree.DecisionTreeClassifier(), param_grid=param_grid, scoring = 'roc_auc', cv = cv_split, return_train_score = True)
rfe_tune_model.fit(data1[X_rfe], data1[Target])

#print(rfe_tune_model.cv_results_.keys())
#print(rfe_tune_model.cv_results_['params'])
print('AFTER DT RFE Tuned Parameters: ', rfe_tune_model.best_params_)
#index with rfe_tune_model.best_index_ (not tune_model.best_index_) so the scores belong to the reported best parameters
#print(rfe_tune_model.cv_results_['mean_train_score'])
print("AFTER DT RFE Tuned Training w/bin score mean: {:.2f}". format(rfe_tune_model.cv_results_['mean_train_score'][rfe_tune_model.best_index_]*100)) 
#print(rfe_tune_model.cv_results_['mean_test_score'])
print("AFTER DT RFE Tuned Test w/bin score mean: {:.2f}". format(rfe_tune_model.cv_results_['mean_test_score'][rfe_tune_model.best_index_]*100))
print("AFTER DT RFE Tuned Test w/bin score 3*std: +/- {:.2f}". format(rfe_tune_model.cv_results_['std_test_score'][rfe_tune_model.best_index_]*100*3))
print('-'*10)

#Graph MLA version of Decision Tree: http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
import graphviz 
dot_data = tree.export_graphviz(dtree, out_file=None, 
                                feature_names = data1_x_bin, class_names = True,
                                filled = True, rounded = True)
graph = graphviz.Source(dot_data) 
graph

11. Step 6: Validate and Implement

The next step is to prepare for submission using the validation data.

#compare algorithm predictions with each other, where 1 = exactly similar and 0 = exactly opposite
#there are some 1's, but enough blues and light reds to create a "super algorithm" by combining them
correlation_heatmap(MLA_predict)

 

#why choose one model, when you can pick them all with voting classifier
#http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html
#removed models w/o attribute 'predict_proba' required for vote classifier and models with a 1.0 correlation to another model
vote_est = [
    #Ensemble Methods: http://scikit-learn.org/stable/modules/ensemble.html
    ('ada', ensemble.AdaBoostClassifier()),
    ('bc', ensemble.BaggingClassifier()),
    ('etc',ensemble.ExtraTreesClassifier()),
    ('gbc', ensemble.GradientBoostingClassifier()),
    ('rfc', ensemble.RandomForestClassifier()),

    #Gaussian Processes: http://scikit-learn.org/stable/modules/gaussian_process.html#gaussian-process-classification-gpc
    ('gpc', gaussian_process.GaussianProcessClassifier()),
    
    #GLM: http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
    ('lr', linear_model.LogisticRegressionCV()),
    
    #Naive Bayes: http://scikit-learn.org/stable/modules/naive_bayes.html
    ('bnb', naive_bayes.BernoulliNB()),
    ('gnb', naive_bayes.GaussianNB()),
    
    #Nearest Neighbor: http://scikit-learn.org/stable/modules/neighbors.html
    ('knn', neighbors.KNeighborsClassifier()),
    
    #SVM: http://scikit-learn.org/stable/modules/svm.html
    ('svc', svm.SVC(probability=True)),
    
    #xgboost: http://xgboost.readthedocs.io/en/latest/model.html
   ('xgb', XGBClassifier())

]


#Hard Vote or majority rules
vote_hard = ensemble.VotingClassifier(estimators = vote_est , voting = 'hard')
vote_hard_cv = model_selection.cross_validate(vote_hard, data1[data1_x_bin], data1[Target], cv  = cv_split, return_train_score = True) #train scores are only returned when requested in scikit-learn >= 0.21
vote_hard.fit(data1[data1_x_bin], data1[Target])

print("Hard Voting Training w/bin score mean: {:.2f}". format(vote_hard_cv['train_score'].mean()*100)) 
print("Hard Voting Test w/bin score mean: {:.2f}". format(vote_hard_cv['test_score'].mean()*100))
print("Hard Voting Test w/bin score 3*std: +/- {:.2f}". format(vote_hard_cv['test_score'].std()*100*3))
print('-'*10)


#Soft Vote or weighted probabilities
vote_soft = ensemble.VotingClassifier(estimators = vote_est , voting = 'soft')
vote_soft_cv = model_selection.cross_validate(vote_soft, data1[data1_x_bin], data1[Target], cv  = cv_split, return_train_score = True) #train scores are only returned when requested in scikit-learn >= 0.21
vote_soft.fit(data1[data1_x_bin], data1[Target])

print("Soft Voting Training w/bin score mean: {:.2f}". format(vote_soft_cv['train_score'].mean()*100)) 
print("Soft Voting Test w/bin score mean: {:.2f}". format(vote_soft_cv['test_score'].mean()*100))
print("Soft Voting Test w/bin score 3*std: +/- {:.2f}". format(vote_soft_cv['test_score'].std()*100*3))
print('-'*10)
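The difference between the two voting modes is easiest to see on a single prediction. The sketch below uses made-up probabilities from three imaginary classifiers; `hard_vote` and `soft_vote` are illustrative helpers for the binary case and assume equal weights, so they are not the sklearn VotingClassifier itself.

```python
from collections import Counter

# Hypothetical predicted probabilities of class 1 ("Survived") for one passenger,
# from three imaginary classifiers. Values are made up for illustration.
probabilities = {'model_a': 0.45, 'model_b': 0.40, 'model_c': 0.90}

def hard_vote(probas, threshold=0.5):
    """Each model casts one vote for its own predicted class; the majority wins."""
    votes = [int(p >= threshold) for p in probas.values()]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(probas, threshold=0.5):
    """Average the class-1 probabilities, then threshold the average."""
    mean_proba = sum(probas.values()) / len(probas)
    return int(mean_proba >= threshold)

print('hard vote:', hard_vote(probabilities))
print('soft vote:', soft_vote(probabilities))
```

Here two lukewarm "no" votes outnumber one confident "yes" under hard voting, while soft voting lets the confident model tip the average. This is also why soft voting needs estimators that expose `predict_proba`.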

 

#WARNING: Running this is very computationally intensive and time-consuming.
#Code is written for experimental/developmental purposes and not production ready!


#Hyperparameter Tune with GridSearchCV: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
grid_n_estimator = [10, 50, 100, 300]
grid_ratio = [.1, .25, .5, .75, 1.0]
grid_learn = [.01, .03, .05, .1, .25]
grid_max_depth = [2, 4, 6, 8, 10, None]
grid_min_samples = [5, 10, .03, .05, .10]
grid_criterion = ['gini', 'entropy']
grid_bool = [True, False]
grid_seed = [0]


grid_param = [
            [{
            #AdaBoostClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
            'n_estimators': grid_n_estimator, #default=50
            'learning_rate': grid_learn, #default=1
            #'algorithm': ['SAMME', 'SAMME.R'], #default=’SAMME.R’
            'random_state': grid_seed
            }],
       
    
            [{
            #BaggingClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier
            'n_estimators': grid_n_estimator, #default=10
            'max_samples': grid_ratio, #default=1.0
            'random_state': grid_seed
             }],

    
            [{
            #ExtraTreesClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html#sklearn.ensemble.ExtraTreesClassifier
            'n_estimators': grid_n_estimator, #default=10 (100 since scikit-learn 0.22)
            'criterion': grid_criterion, #default='gini'
            'max_depth': grid_max_depth, #default=None
            'random_state': grid_seed
             }],


            [{
            #GradientBoostingClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier
            #'loss': ['deviance', 'exponential'], #default=’deviance’
            'learning_rate': [.05], #default=0.1 -- 12/31/17 set to reduce runtime -- The best parameter for GradientBoostingClassifier is {'learning_rate': 0.05, 'max_depth': 2, 'n_estimators': 300, 'random_state': 0} with a runtime of 264.45 seconds.
            'n_estimators': [300], #default=100 -- 12/31/17 set to reduce runtime -- The best parameter for GradientBoostingClassifier is {'learning_rate': 0.05, 'max_depth': 2, 'n_estimators': 300, 'random_state': 0} with a runtime of 264.45 seconds.
            #'criterion': ['friedman_mse', 'mse', 'mae'], #default='friedman_mse'
            'max_depth': grid_max_depth, #default=3   
            'random_state': grid_seed
             }],

    
            [{
            #RandomForestClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
            'n_estimators': grid_n_estimator, #default=10 (100 since scikit-learn 0.22)
            'criterion': grid_criterion, #default='gini'
            'max_depth': grid_max_depth, #default=None
            'oob_score': [True], #default=False -- 12/31/17 set to reduce runtime -- The best parameter for RandomForestClassifier is {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'oob_score': True, 'random_state': 0} with a runtime of 146.35 seconds.
            'random_state': grid_seed
             }],
    
            [{    
            #GaussianProcessClassifier
            'max_iter_predict': grid_n_estimator, #default: 100
            'random_state': grid_seed
            }],
        
    
            [{
            #LogisticRegressionCV - http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html#sklearn.linear_model.LogisticRegressionCV
            'fit_intercept': grid_bool, #default: True
            #'penalty': ['l1','l2'],
            'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'], #default: lbfgs
            'random_state': grid_seed
             }],
            
    
            [{
            #BernoulliNB - http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB
            'alpha': grid_ratio, #default: 1.0
             }],
    
    
            #GaussianNB - 
            [{}],
    
            [{
            #KNeighborsClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier
            'n_neighbors': [1,2,3,4,5,6,7], #default: 5
            'weights': ['uniform', 'distance'], #default = ‘uniform’
            'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']
            }],
            
    
            [{
            #SVC - http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
            #http://blog.hackerearth.com/simple-tutorial-svm-parameter-tuning-python-r
            #'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
            'C': [1,2,3,4,5], #default=1.0
            'gamma': grid_ratio, #default: auto
            'decision_function_shape': ['ovo', 'ovr'], #default:ovr
            'probability': [True],
            'random_state': grid_seed
             }],

    
            [{
            #XGBClassifier - http://xgboost.readthedocs.io/en/latest/parameter.html
            'learning_rate': grid_learn, #default: .3
            'max_depth': [1,2,4,6,8,10], #default 2
            'n_estimators': grid_n_estimator, 
            'seed': grid_seed  
             }]   
        ]



start_total = time.perf_counter() #https://docs.python.org/3/library/time.html#time.perf_counter
for clf, param in zip (vote_est, grid_param): #https://docs.python.org/3/library/functions.html#zip

    #print(clf[1]) #vote_est is a list of tuples, index 0 is the name and index 1 is the algorithm
    #print(param)
    
    
    start = time.perf_counter()        
    best_search = model_selection.GridSearchCV(estimator = clf[1], param_grid = param, cv = cv_split, scoring = 'roc_auc')
    best_search.fit(data1[data1_x_bin], data1[Target])
    run = time.perf_counter() - start

    best_param = best_search.best_params_
    print('The best parameter for {} is {} with a runtime of {:.2f} seconds.'.format(clf[1].__class__.__name__, best_param, run))
    clf[1].set_params(**best_param) 


run_total = time.perf_counter() - start_total
print('Total optimization time was {:.2f} minutes.'.format(run_total/60))

print('-'*10)

 

#Hard Vote or majority rules w/Tuned Hyperparameters
grid_hard = ensemble.VotingClassifier(estimators = vote_est , voting = 'hard')
grid_hard_cv = model_selection.cross_validate(grid_hard, data1[data1_x_bin], data1[Target], cv  = cv_split, return_train_score = True) #train scores are only returned when requested in scikit-learn >= 0.21
grid_hard.fit(data1[data1_x_bin], data1[Target])

print("Hard Voting w/Tuned Hyperparameters Training w/bin score mean: {:.2f}". format(grid_hard_cv['train_score'].mean()*100)) 
print("Hard Voting w/Tuned Hyperparameters Test w/bin score mean: {:.2f}". format(grid_hard_cv['test_score'].mean()*100))
print("Hard Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- {:.2f}". format(grid_hard_cv['test_score'].std()*100*3))
print('-'*10)

#Soft Vote or weighted probabilities w/Tuned Hyperparameters
grid_soft = ensemble.VotingClassifier(estimators = vote_est , voting = 'soft')
grid_soft_cv = model_selection.cross_validate(grid_soft, data1[data1_x_bin], data1[Target], cv  = cv_split, return_train_score = True) #train scores are only returned when requested in scikit-learn >= 0.21
grid_soft.fit(data1[data1_x_bin], data1[Target])

print("Soft Voting w/Tuned Hyperparameters Training w/bin score mean: {:.2f}". format(grid_soft_cv['train_score'].mean()*100)) 
print("Soft Voting w/Tuned Hyperparameters Test w/bin score mean: {:.2f}". format(grid_soft_cv['test_score'].mean()*100))
print("Soft Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- {:.2f}". format(grid_soft_cv['test_score'].std()*100*3))
print('-'*10)

 

#prepare data for modeling
print(data_val.info())
print("-"*10)
#data_val.sample(10)

 

#handmade decision tree - submission score = 0.77990
data_val['Survived'] = mytree(data_val).astype(int)


#decision tree w/full dataset modeling submission score: defaults= 0.76555, tuned= 0.77990
#submit_dt = tree.DecisionTreeClassifier()
#submit_dt = model_selection.GridSearchCV(tree.DecisionTreeClassifier(), param_grid=param_grid, scoring = 'roc_auc', cv = cv_split)
#submit_dt.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_dt.best_params_) #Best Parameters: {'criterion': 'gini', 'max_depth': 4, 'random_state': 0}
#data_val['Survived'] = submit_dt.predict(data_val[data1_x_bin])


#bagging w/full dataset modeling submission score: defaults= 0.75119, tuned= 0.77990
#submit_bc = ensemble.BaggingClassifier()
#submit_bc = model_selection.GridSearchCV(ensemble.BaggingClassifier(), param_grid= {'n_estimators':grid_n_estimator, 'max_samples': grid_ratio, 'oob_score': grid_bool, 'random_state': grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_bc.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_bc.best_params_) #Best Parameters: {'max_samples': 0.25, 'n_estimators': 500, 'oob_score': True, 'random_state': 0}
#data_val['Survived'] = submit_bc.predict(data_val[data1_x_bin])


#extra tree w/full dataset modeling submission score: defaults= 0.76555, tuned= 0.77990
#submit_etc = ensemble.ExtraTreesClassifier()
#submit_etc = model_selection.GridSearchCV(ensemble.ExtraTreesClassifier(), param_grid={'n_estimators': grid_n_estimator, 'criterion': grid_criterion, 'max_depth': grid_max_depth, 'random_state': grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_etc.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_etc.best_params_) #Best Parameters: {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'random_state': 0}
#data_val['Survived'] = submit_etc.predict(data_val[data1_x_bin])


#random forest w/full dataset modeling submission score: defaults= 0.71291, tuned= 0.73205
#submit_rfc = ensemble.RandomForestClassifier()
#submit_rfc = model_selection.GridSearchCV(ensemble.RandomForestClassifier(), param_grid={'n_estimators': grid_n_estimator, 'criterion': grid_criterion, 'max_depth': grid_max_depth, 'random_state': grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_rfc.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_rfc.best_params_) #Best Parameters: {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'random_state': 0}
#data_val['Survived'] = submit_rfc.predict(data_val[data1_x_bin])

 

#ada boosting w/full dataset modeling submission score: defaults= 0.74162, tuned= 0.75119
#submit_abc = ensemble.AdaBoostClassifier()
#submit_abc = model_selection.GridSearchCV(ensemble.AdaBoostClassifier(), param_grid={'n_estimators': grid_n_estimator, 'learning_rate': grid_ratio, 'algorithm': ['SAMME', 'SAMME.R'], 'random_state': grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_abc.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_abc.best_params_) #Best Parameters: {'algorithm': 'SAMME.R', 'learning_rate': 0.1, 'n_estimators': 300, 'random_state': 0}
#data_val['Survived'] = submit_abc.predict(data_val[data1_x_bin])


#gradient boosting w/full dataset modeling submission score: defaults= 0.75119, tuned= 0.77033
#submit_gbc = ensemble.GradientBoostingClassifier()
#submit_gbc = model_selection.GridSearchCV(ensemble.GradientBoostingClassifier(), param_grid={'learning_rate': grid_ratio, 'n_estimators': grid_n_estimator, 'max_depth': grid_max_depth, 'random_state':grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_gbc.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_gbc.best_params_) #Best Parameters: {'learning_rate': 0.25, 'max_depth': 2, 'n_estimators': 50, 'random_state': 0}
#data_val['Survived'] = submit_gbc.predict(data_val[data1_x_bin])

#extreme boosting w/full dataset modeling submission score: defaults= 0.73684, tuned= 0.77990
#submit_xgb = XGBClassifier()
#submit_xgb = model_selection.GridSearchCV(XGBClassifier(), param_grid= {'learning_rate': grid_learn, 'max_depth': [0,2,4,6,8,10], 'n_estimators': grid_n_estimator, 'seed': grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_xgb.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_xgb.best_params_) #Best Parameters: {'learning_rate': 0.01, 'max_depth': 4, 'n_estimators': 300, 'seed': 0}
#data_val['Survived'] = submit_xgb.predict(data_val[data1_x_bin])


#hard voting classifier w/full dataset modeling submission score: defaults= 0.75598, tuned = 0.77990
#data_val['Survived'] = vote_hard.predict(data_val[data1_x_bin])
data_val['Survived'] = grid_hard.predict(data_val[data1_x_bin])


#soft voting classifier w/full dataset modeling submission score: defaults= 0.73684, tuned = 0.74162
#data_val['Survived'] = vote_soft.predict(data_val[data1_x_bin])
#data_val['Survived'] = grid_soft.predict(data_val[data1_x_bin])


#submit file
submit = data_val[['PassengerId','Survived']]
submit.to_csv("F:/wd.jupyter/datasets/kaggle_data/titanic/submit.csv", index=False)

print('Validation Data Distribution: \n', data_val['Survived'].value_counts(normalize = True))
submit.sample(10)

 

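12. Step 7: Optimize and Strategize
Conclusion
  Our model converged at a 0.77990 submission accuracy. Using the same dataset, different implementations of a decision tree (adaboost, random forest, gradient boosting, xgboost, etc.), with tuning, did not exceed the 0.77990 submission accuracy. Interestingly, for this dataset, the simple decision tree algorithm had the best default submission score, and with tuning it achieved the same best accuracy score.

While no general conclusions can be drawn from testing a handful of algorithms on a single dataset, there are several observations about this dataset.

  • The train dataset has a different distribution than the test/validation dataset and the population. This created wide margins between the cross validation (CV) accuracy score and the Kaggle submission accuracy score.
  • Given the same dataset, decision tree based algorithms seemed to converge on the same accuracy score after proper tuning. Despite tuning,

For iteration two, I would spend more time on preprocessing and feature engineering, in order to better align the CV score with the Kaggle score and improve the overall accuracy.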
