Kaggle Summary


1. Feature Analysis (EDA, Exploratory Data Analysis)

1.1 Feature analysis with seaborn

ROC curve (line plot):
sns.lineplot(x="X", y="y", data=df)

Effect of a feature's distinct values on survival, when the feature takes a limited number of values:
sns.barplot(x="X", y="y", data=df)

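Continuous features with many distinct values:
# Note: distplot is deprecated since seaborn 0.11; sns.histplot(..., kde=True) is the suggested replacement.
sns.distplot(train['SibSp'][train['Survived'] == 1], bins=50)
sns.distplot(train['SibSp'][train['Survived'] == 0], bins=50)
which is equivalent to
sns.distplot(train.loc[train['Survived'] == 1, 'SibSp'], bins=50)
sns.distplot(train.loc[train['Survived'] == 0, 'SibSp'], bins=50)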

Effect of each value on survival versus death, shown side by side:
sns.countplot(x="Embarked", hue="Survived", data=df)

1.2 Feature overview

data.head(10)
data.describe()
data.describe().T
data.info()
train['Survived'].value_counts()  # count of survivors vs. non-survivors
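As a small illustration of the overview calls above, the sketch below uses a toy frame (made-up values, not the real Titanic data) to show how `value_counts(normalize=True)` gives proportions instead of counts, and how the mean of a 0/1 target per group gives that group's survival rate.

```python
import pandas as pd

# Toy stand-in for the Titanic training set (made-up values).
train = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1, 0, 0],
    "Sex": ["male", "female", "female", "male",
            "male", "female", "male", "female"],
})

# Absolute number of rows per class.
counts = train["Survived"].value_counts()

# Share of rows per class: normalize=True divides by the row count.
ratios = train["Survived"].value_counts(normalize=True)

# The mean of a 0/1 target within a group is that group's survival rate.
rate_by_sex = train.groupby("Sex")["Survived"].mean()

print(counts, ratios, rate_by_sex, sep="\n\n")
```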

2. Feature Selection and Processing

2.1 Binning continuous values

  1. Automatic binning with pd.cut
    train['Age'] = pd.cut(train['Age'], 5, labels=[0, 1, 2, 3, 4])

  2. Manual binning
    def ProcessLabel(val):
        if val < 3:
            return 0
        elif val < 7:
            return 1
        else:
            return 2
    train['FamilySize'] = train['SibSp'] + train['Parch'] + 1
    train['FamLabel'] = train['FamilySize'].apply(ProcessLabel)
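The two binning approaches can be compared on toy data. This is a minimal sketch with made-up values: the first call uses equal-width bins, and the second passes explicit edges to `pd.cut` to reproduce the manual rule (below 3, 3 to 6, 7 and above) without a Python-level function.

```python
import pandas as pd

ages = pd.Series([2, 15, 30, 45, 80])

# Equal-width binning: the range [min, max] is split into 5 intervals
# of the same width, and each value gets the label of its interval.
auto_bins = pd.cut(ages, 5, labels=[0, 1, 2, 3, 4])

# Explicit edges reproduce the manual rule: <3 -> 0, 3..6 -> 1, >=7 -> 2.
# right=False makes each interval closed on the left, e.g. [3, 7).
family_size = pd.Series([1, 2, 3, 6, 7, 11])
manual_bins = pd.cut(
    family_size,
    bins=[-float("inf"), 3, 7, float("inf")],
    labels=[0, 1, 2],
    right=False,
)

print(auto_bins.tolist())
print(manual_bins.tolist())
```

Equal-width bins depend on the minimum and maximum of the data, so outliers can leave some bins nearly empty; explicit edges avoid that at the cost of choosing the thresholds by hand.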

2.2 String processing

train['Embarked'] = train['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
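One thing to watch with `map`: any value that is not a key in the dictionary, including a missing value, becomes NaN. The sketch below assumes the three Titanic embarkation codes S, C and Q and uses a made-up series to show the difference between mapping first and filling first.

```python
import pandas as pd

embarked = pd.Series(["S", "C", "Q", None, "S"])
mapping = {"S": 0, "C": 1, "Q": 2}

# Values without a matching key (here the missing entry) become NaN,
# so the result is a float column that still needs imputation.
encoded = embarked.map(mapping)

# Filling the gap before mapping keeps the column integer-valued.
encoded_filled = embarked.fillna("S").map(mapping)

print(encoded.tolist())
print(encoded_filled.tolist())
```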

2.3 Handling missing values

Filling a string column:

train['Embarked'] = train['Embarked'].fillna('S')

Filling with random integers within one standard deviation of the mean:

avg = train['Age'].mean()
std = train['Age'].std()
age_null_count  = train['Age'].isnull().sum()
# randint expects integer bounds, so truncate the float bounds first
age_list = np.random.randint(int(avg - std), int(avg + std), size=age_null_count)
train.loc[train['Age'].isnull(), 'Age'] = age_list
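A reproducible variant of this fill can be sketched with a seeded generator. The frame below is made up, and `np.random.default_rng` is swapped in for `np.random.randint` purely so the result can be repeated; the idea (one random integer per missing row, drawn from within one standard deviation of the mean) is unchanged.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the fill is reproducible
train = pd.DataFrame({"Age": [22.0, np.nan, 35.0, 54.0, np.nan, 28.0]})

# mean() and std() skip NaN by default.
avg, std = train["Age"].mean(), train["Age"].std()
missing = train["Age"].isnull()

# Draw one integer per missing row from [avg - std, avg + std).
low, high = int(avg - std), int(avg + std)
train.loc[missing, "Age"] = rng.integers(low, high, size=int(missing.sum()))

print(train)
```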

When many values are missing, predict them with a regression model:
from sklearn.ensemble import RandomForestRegressor
import lightgbm as lgbm

data = train[['Age', 'Pclass', 'Sex', 'Title']]
data = pd.get_dummies(data)
model = RandomForestRegressor(n_estimators=128, n_jobs=-1)
# model = lgbm.LGBMRegressor(n_estimators=128, n_jobs=-1)
tr = data[data['Age'].notnull()].values
te = data[data['Age'].isnull()].values
tr_X = tr[:, 1:]
tr_y = tr[:, 0]
te_X = te[:, 1:]
model.fit(tr_X, tr_y)
pred_age = model.predict(te_X)
train.loc[data['Age'].isnull(), 'Age'] = pred_age
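The same idea can be written against DataFrames instead of raw `.values` arrays, which keeps the column names attached. The following is a minimal sketch on made-up data; the reduced feature set (`Pclass`, `Sex`), the small `n_estimators` and the `random_state` are assumptions chosen to keep the example short and repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy frame with two missing ages (made-up values).
train = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, 27.0],
    "Pclass": [3, 1, 3, 1, 3, 1, 3, 2],
    "Sex": ["male", "female", "female", "female",
            "male", "male", "female", "male"],
})

# One-hot encode the predictors; dtype=float keeps the matrix numeric.
features = pd.get_dummies(train[["Pclass", "Sex"]], dtype=float)
known = train["Age"].notnull()

# Fit on rows where Age is known, then predict the rows where it is not.
model = RandomForestRegressor(n_estimators=16, random_state=0)
model.fit(features[known], train.loc[known, "Age"])
train.loc[~known, "Age"] = model.predict(features[~known])

print(train)
```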

2.4 One-hot encoding

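Always apply this to all_data (train and test combined); otherwise the training and test sets can easily end up with mismatched columns:

all_data = pd.get_dummies(all_data)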


Alternatively, encode a single column and append the result:

Emb = pd.get_dummies(all_data['Embarked'], prefix='Embarked')
all_data = pd.concat([all_data, Emb], axis=1)
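The mismatch is easy to reproduce. In this made-up example the test set never contains the value "Q", so encoding the two frames separately gives them different columns, while encoding the combined frame does not.

```python
import pandas as pd

train = pd.DataFrame({"Embarked": ["S", "C", "Q"]})
test = pd.DataFrame({"Embarked": ["S", "S", "C"]})  # "Q" never appears

# Encoded separately, the test frame has no Embarked_Q column.
sep_train = pd.get_dummies(train)
sep_test = pd.get_dummies(test)

# Encoded together, both parts share the same columns.
all_data = pd.concat([train, test], ignore_index=True)
encoded = pd.get_dummies(all_data)
enc_train = encoded.iloc[:len(train)]
enc_test = encoded.iloc[len(train):]

print(list(sep_train.columns), list(sep_test.columns))
print(list(enc_train.columns), list(enc_test.columns))
```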

2.5 Combining and splitting data

all_data = pd.concat([train, test], ignore_index = True)

Split:

train = all_data.loc[all_data['Survived'].notnull()]
test = all_data.loc[all_data['Survived'].isnull()]
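A round trip on made-up data shows two side effects of this approach: the concatenation turns `Survived` into a float column (because test rows get NaN), and the test part keeps an all-NaN `Survived` column that is usually dropped before predicting. The names `train_part` and `test_part` are only for illustration.

```python
import pandas as pd

train = pd.DataFrame({"Age": [22, 38], "Survived": [0, 1]})
test = pd.DataFrame({"Age": [26, 35, 54]})  # no label column

all_data = pd.concat([train, test], ignore_index=True)

# Test rows have NaN in Survived, which is what the split relies on.
train_part = all_data.loc[all_data["Survived"].notnull()].copy()
test_part = all_data.loc[all_data["Survived"].isnull()].drop(columns="Survived")

# The concat made Survived float; cast it back for the training rows.
train_part["Survived"] = train_part["Survived"].astype(int)

print(train_part)
print(test_part)
```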

2.6 Feature scaling and standardization

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
data_new[['Amount', 'Hour']] = sc.fit_transform(data_new[['Amount', 'Hour']])
data_new.head()
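When there is a separate test set, a common convention is to fit the scaler on the training rows only and reuse the fitted statistics on the test rows. This is a sketch with made-up numbers; the `train`/`test` split is an assumption, since the snippet above scales a single frame.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({"Amount": [10.0, 20.0, 30.0], "Hour": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"Amount": [40.0], "Hour": [2.0]})
cols = ["Amount", "Hour"]

sc = StandardScaler()
# fit_transform learns the mean and standard deviation from train.
train[cols] = sc.fit_transform(train[cols])
# transform reuses those statistics, so test is scaled on the same basis.
test[cols] = sc.transform(test[cols])

print(train)
print(test)
```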

3. Model Tuning

LightGBM (lgbm):

objective: regression, binary, or multiclass

3.1 Parameter search with GridSearchCV

import lightgbm as lgb
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

params = {
    'num_leaves': [32, 64, 128, 256, 1024],
    'max_depth': [10, 20, 30, 60],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 200, 300],
}
model = lgb.LGBMClassifier()
gridS = GridSearchCV(model, params, cv=5, n_jobs=-1)
gridS.fit(X, y)
gridS.best_estimator_
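The grid above has 5 × 4 × 3 × 3 = 180 combinations, which with cv=5 means 900 model fits, so it can help to try the workflow on a small grid first. The sketch below is an illustration only: it swaps in scikit-learn's `RandomForestClassifier` and synthetic data from `make_classification` so that it runs without LightGBM, and it reads `best_params_` and `best_score_` in addition to `best_estimator_`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (stand-in for the real features).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# A deliberately small grid: 2 x 2 = 4 combinations.
params = {"max_depth": [2, 4], "n_estimators": [10, 20]}
gridS = GridSearchCV(
    RandomForestClassifier(random_state=0),
    params,
    cv=3,
    scoring="accuracy",
)
gridS.fit(X, y)

best_params = gridS.best_params_   # winning parameter combination
best_score = gridS.best_score_     # its mean cross-validated accuracy
n_fits = len(gridS.cv_results_["params"]) * gridS.n_splits_

print(best_params, best_score, n_fits)
```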

4. Results

4.1 Plotting the ROC curve

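The predictions should be probabilities. With hard 0/1 predictions the curve collapses to a single operating point, so use the model returned by lgb.train(), whose predict() outputs probabilities for a binary objective, rather than LGBMClassifier(), whose predict() outputs class labels (its predict_proba() also returns probabilities).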

from sklearn.metrics import roc_curve
from matplotlib import pyplot as plt
import seaborn as sns

sns.set()
fpr, tpr, thresh = roc_curve(y, pred)
plt.plot(fpr, tpr)
plt.show()
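The difference between labels and probabilities shows up directly in the number of points `roc_curve` returns. This sketch uses synthetic data and scikit-learn's `LogisticRegression` as a stand-in for LightGBM (an assumption made only so the example runs without it); it scores the training data, which is enough to count the points but not to evaluate the model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

labels = clf.predict(X)              # hard 0/1 predictions
probs = clf.predict_proba(X)[:, 1]   # probability of the positive class

# Hard labels give only two distinct thresholds, so the curve has at
# most one point between (0, 0) and (1, 1).
fpr_hard, tpr_hard, _ = roc_curve(y, labels)

# Probabilities give many thresholds and therefore a real curve.
fpr_prob, tpr_prob, _ = roc_curve(y, probs)

print(len(fpr_hard), len(fpr_prob))
```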

4.2 Cross-validated accuracy

import numpy as np
from sklearn.model_selection import cross_val_score

score = cross_val_score(model, X, y, scoring='accuracy', cv=5)
print(np.mean(score))

4.3 Saving to CSV

res = pd.DataFrame({'PassengerId': passage_id, 'Survived': pred.astype(np.int32)})
res.to_csv('pred.csv', index=False)
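To check the file layout before uploading, the frame can be written to an in-memory buffer. The sketch assumes the Kaggle Titanic submission format (columns `PassengerId` and `Survived`); the ids and predictions are made up, and `io.StringIO` stands in for the real file.

```python
import io

import numpy as np
import pandas as pd

passenger_id = [892, 893, 894]        # hypothetical ids
pred = np.array([0.0, 1.0, 0.0])      # hypothetical predictions as floats

# astype converts the float predictions to the integer labels Kaggle expects.
res = pd.DataFrame({
    "PassengerId": passenger_id,
    "Survived": pred.astype(np.int32),
})

# index=False keeps the row index out of the file.
buffer = io.StringIO()
res.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```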

Others

Model training raises "Input contains NaN, infinity or a value too large for dtype('float64')":
This happens because the features contain NaN.

Related functions:

np.isnan
train.info()
train['Age'].isnull()
train['Age'].notnull()
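These checks can be combined to list the offending columns before training. Note that `isnull()` does not flag infinite values, which the error message also mentions, so the sketch below checks for them separately with `np.isinf`. The frame and the name `bad_columns` are made up for illustration.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0],
    "Fare": [7.25, np.inf, 8.05],
    "Pclass": [3, 1, 3],
})

# Number of NaN values per column.
nan_counts = train.isnull().sum()

# Number of infinite values per numeric column.
numeric = train.select_dtypes(include="number")
inf_counts = np.isinf(numeric).sum()

# Columns that would trigger the error.
bad_columns = [
    col for col in train.columns
    if nan_counts[col] > 0 or inf_counts.get(col, 0) > 0
]

print(bad_columns)
```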