Machine Learning Project Walkthrough: Data Cleaning

Date: 2019-11-17

1: Introduction

In this course, we will walk through the full data science life cycle, from data cleaning and feature selection to machine learning. We will focus on credit modeling, a well-known data science problem that centers on modeling a borrower's credit risk. Credit has played a key role in the economy for centuries and some form of credit has existed since the beginning of commerce. We'll be working with financial lending data from Lending Club. Lending Club is a marketplace for personal loans that matches borrowers who are seeking a loan with investors looking to lend money and make a return. You can read more about their marketplace here.

Each borrower fills out a comprehensive application, providing their past financial history, the reason for the loan, and more. Lending Club evaluates each borrower's credit score using past historical data (and their own data science process!) and assigns an interest rate to the borrower. The interest rate is the percentage, in addition to the requested loan amount, that the borrower has to pay back. You can read more about the interest rate that Lending Club assigns here. Lending Club also tries to verify each piece of information the borrower provides, but it can't always verify all of the information (usually for regulatory reasons).

A higher interest rate means that the borrower is riskier and less likely to pay back the loan, while a lower interest rate means that the borrower has a good credit history and is more likely to pay back the loan. The interest rates range from 5.32% all the way to 30.99%, and each borrower is given a grade according to the interest rate they were assigned. If the borrower accepts the interest rate, then the loan is listed on the Lending Club marketplace.

Investors are primarily interested in receiving a return on their investments. Approved loans are listed on the Lending Club website, where qualified investors can browse recently approved loans, the borrower's credit score, the purpose for the loan, and other information from the application. Once they're ready to back a loan, they select the amount of money they want to fund. Once a loan's requested amount is fully funded, the borrower receives the money they requested minus the origination fee that Lending Club charges.

The borrower then makes monthly payments back to Lending Club over either 36 months or 60 months. Lending Club redistributes these payments to the investors. This means that investors don't have to wait until the full amount is paid off to start to see money back. If a loan is fully paid off on time, the investors make a return which corresponds to the interest rate the borrower had to pay in addition to the requested amount. Many loans aren't completely paid off on time, however, and some borrowers default on the loan.
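
To make the monthly payments concrete, here is a minimal sketch that assumes standard fixed-rate amortization with monthly compounding. The walkthrough does not state the exact formula Lending Club uses, so treat the function (a hypothetical helper named monthly_installment) as an illustration only. The inputs are the loan amount, interest rate, and term of the first loan shown in the column tables later in this walkthrough.

```python
def monthly_installment(principal, annual_rate_pct, n_months):
    """Fixed monthly payment under standard amortization (an assumption)."""
    monthly_rate = annual_rate_pct / 100 / 12
    return principal * monthly_rate / (1 - (1 + monthly_rate) ** -n_months)


# Values from the first loan in the dataset: 5000 at 10.65% over 36 months.
payment = monthly_installment(5000, 10.65, 36)
total_repaid = payment * 36

print(round(payment, 2))              # monthly payment
print(round(total_repaid - 5000, 2))  # interest paid on top of the principal
```

Under this assumption the monthly payment should land within about a cent of the installment value listed for that loan (162.87), which suggests the formula is at least a reasonable approximation.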

Here's a diagram from Bible Money Matters that sums up the process:

While Lending Club has to be extremely savvy and rigorous with their credit modeling, investors on Lending Club need to be equally savvy about determining which loans are more likely to be paid off. At first, you may wonder why investors would put money into anything but low interest loans. The incentive investors have to back higher interest loans is, well, the higher interest! If investors believe the borrower can pay back the loan, even if he or she has a weak financial history, then investors can make more money through the larger additional amount the borrower has to pay.

Most investors use a portfolio strategy to invest small amounts in many loans, with healthy mixes of low, medium, and high interest loans. In this course, we'll focus on the mindset of a conservative investor who only wants to invest in the loans that have a good chance of being paid off on time. To do that, we'll need to first understand the features in the dataset and then experiment with building machine learning models that reliably predict if a loan will be paid off or not.

2: Introduction To The Data

Lending Club periodically releases data for all of the approved and declined loan applications on their website. You can select a few different year ranges to download the datasets (in CSV format) for both approved and declined loans.

Towards the bottom of the page, you'll also find a data dictionary (in XLS format) which contains information on the different column names. We recommend downloading the data dictionary so you can refer to it whenever you want to learn more about what a column represents in the datasets. Here's a link to the data dictionary file hosted on Google Drive.

Before diving into the datasets themselves, let's get familiar with the data dictionary. The LoanStats sheet describes the approved loans datasets and the RejectStats sheet describes the rejected loans datasets. Since rejected applications don't appear on the Lending Club marketplace and aren't available for investment, we'll be focusing on data on approved loans only.

The approved loans datasets contain information on current loans, completed loans, and defaulted loans. Let's now define the problem statement for this machine learning project:

Can we build a machine learning model that can accurately predict if a borrower will pay off their loan on time or not?

Before we can start doing machine learning, we need to define which features we want to use and which column represents the target we want to predict. Let's start by reading in the dataset and exploring it.

3: Reading In To Pandas

In this mission, we'll focus on approved loans data from 2007 to 2011, since a good number of the loans have already finished. In the datasets for later years, many of the loans are current and still being paid off.

To ensure that code runs fast on our platform, we reduced the size of LoanStats3a.csv by:

- removing the first line, because it contains the extraneous text "Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)" instead of the column titles, which prevents the pandas library from parsing the dataset properly
- removing the desc column, which contains a long text explanation for each loan
- removing the url column, which contains a link to each loan on Lending Club that can only be accessed with an investor account
- removing all columns containing more than 50% missing values, which allows us to move faster since we can spend less time trying to fill these values

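The following code replicates this process, in case you want to rebuild the dataset and work with it on your own:

import pandas as pd

# Skip the first line, which holds a prospectus note instead of the column titles.
loans_2007 = pd.read_csv('LoanStats3a.csv', skiprows=1)

# Keep only the columns that have non-null values in at least half of the rows.
half_count = len(loans_2007) / 2
loans_2007 = loans_2007.dropna(thresh=half_count, axis=1)

# Drop the long free-text description and the investor-only link.
loans_2007 = loans_2007.drop(['desc', 'url'], axis=1)
loans_2007.to_csv('loans_2007.csv', index=False)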

We named the filtered dataset loans_2007.csv so that we can still explore the raw dataset (LoanStats3a.csv) without mixing up the two. First things first, let's read the dataset into a DataFrame so we can start to explore the data and the remaining features.
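
The thresh argument is easy to misread, so here is a small sketch of how it behaves on a made-up DataFrame (the column names and values are hypothetical, not from the Lending Club data). It uses integer division for the threshold, since the pandas documentation describes thresh as an integer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 rows, with 0, 1, 2, and 3 missing values per column.
toy = pd.DataFrame({
    "full": [1, 2, 3, 4],
    "one_missing": [1.0, np.nan, 3.0, 4.0],
    "half_missing": [np.nan, np.nan, 3.0, 4.0],
    "mostly_missing": [np.nan, np.nan, np.nan, 4.0],
})

# thresh is the minimum number of non-null values a column needs to survive.
half_count = len(toy) // 2
kept = toy.dropna(thresh=half_count, axis=1)

print(list(kept.columns))
```

Note that thresh counts non-null values, not missing ones, so a column that is exactly half empty is kept; only the column with more than 50% missing values is dropped.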

Instructions

Read loans_2007.csv into a DataFrame named loans_2007.
Use the print function to:
- display the first row of loans_2007 and
- the number of columns in loans_2007.

import pandas as pd
loans_2007 = pd.read_csv("loans_2007.csv")
loans_2007 = loans_2007.drop_duplicates()
print(loans_2007.iloc[0])
print(loans_2007.shape[1])

4: First Group Of Columns

The DataFrame contains many columns, and it can be cumbersome to try to explore them all at once. Let's break up the columns into 3 groups and use the data dictionary to become familiar with what each column represents. As you learn about each feature, pay attention to any features that:

leak information from the future (after the loan has already been funded)
don't affect a borrower's ability to pay back a loan (e.g. a randomly generated ID value by Lending Club)
are formatted poorly and need to be cleaned up
require more data or a lot of processing to turn into a useful feature
contain redundant information

We need to especially pay attention to data leakage, since it can cause our model to overfit. This is because the model would be using data about the target column that wouldn't be available when we're using the model on future loans. We encourage you to spend as much time as you need to understand each column, because a poor understanding could cause you to make mistakes in the data analysis and modeling process. As you go through the dictionary, keep in mind that we need to select one of the columns as the target column we want to use when we move on to the machine learning phase.

In this screen and the next few screens, let's focus on just columns that we need to remove from consideration. Then, we can circle back and further dissect the columns we decided to keep.

To make this process easier, we created a table that contains the name, data type, first row's value, and description from the data dictionary for the first group of columns.

name	dtype	first value	description
id	object	1077501	A unique LC assigned ID for the loan listing.
member_id	float64	1.2966e+06	A unique LC assigned Id for the borrower member.
loan_amnt	float64	5000	The listed amount of the loan applied for by the borrower.
funded_amnt	float64	5000	The total amount committed to that loan at that point in time.
funded_amnt_inv	float64	4975	The total amount committed by investors for that loan at that point in time.
term	object	36 months	The number of payments on the loan. Values are in months and can be either 36 or 60.
int_rate	object	10.65%	Interest Rate on the loan
installment	float64	162.87	The monthly payment owed by the borrower if the loan originates.
grade	object	B	LC assigned loan grade
sub_grade	object	B2	LC assigned loan subgrade
emp_title	object	NaN	The job title supplied by the Borrower when applying for the loan.
emp_length	object	10+ years	Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.
home_ownership	object	RENT	The home ownership status provided by the borrower during registration. Our values are: RENT, OWN, MORTGAGE, OTHER.
annual_inc	float64	24000	The self-reported annual income provided by the borrower during registration.
verification_status	object	Verified	Indicates if income was verified by LC, not verified, or if the income source was verified
issue_d	object	Dec-2011	The month which the loan was funded
loan_status	object	Charged Off	Current status of the loan
pymnt_plan	object	n	Indicates if a payment plan has been put in place for the loan
purpose	object	car	A category provided by the borrower for the loan request.
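
A table like the one above can be assembled directly from a DataFrame. The sketch below uses a small hypothetical stand-in for loans_2007 (the second row's values are made up), and it leaves out the description column, which would have to come from the data dictionary file.

```python
import pandas as pd

# Hypothetical stand-in for loans_2007 with a few of the real column names.
toy = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0],
    "term": ["36 months", "60 months"],
    "int_rate": ["10.65%", "15.27%"],
})

# Both pieces are indexed by column name, so they line up row by row.
preview = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "first value": toy.iloc[0],
})
preview.index.name = "name"

print(preview)
```

To restrict the preview to one group, slice the columns first, for example toy.iloc[:, :2] for the first two columns.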

After analyzing each column, we can conclude that the following features need to be removed:

id: randomly generated field by Lending Club for unique identification purposes only
member_id: also a randomly generated field by Lending Club for unique identification purposes only
funded_amnt: leaks data from the future (after the loan has already started to be funded)
funded_amnt_inv: also leaks data from the future (after the loan has already started to be funded)
grade: redundant with the interest rate column (int_rate)
sub_grade: also redundant with the interest rate column (int_rate)
emp_title: requires other data and a lot of processing to potentially be useful
issue_d: leaks data from the future (after the loan has already been completely funded)

Recall that Lending Club assigns a grade and a sub-grade based on the borrower's interest rate. While the grade and sub_grade values are categorical, the int_rate column contains continuous values, which are better suited for machine learning.
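
One caveat: in the table above, int_rate has the object dtype because its values are strings such as 10.65%. The sketch below shows one possible way to turn such strings into floats. It is an assumption about how this could be done, not a step from this mission, and the sample values are illustrative.

```python
import pandas as pd

# Illustrative percentage strings in the same format as the int_rate column.
int_rate = pd.Series(["10.65%", "15.27%", "5.32%"])

# Strip the trailing percent sign, then convert the remaining text to floats.
int_rate_num = int_rate.str.rstrip("%").astype(float)

print(int_rate_num.tolist())
```

Only after a conversion like this does the column actually behave as a continuous feature.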

Let's now drop these columns from the Dataframe before moving onto the next group of columns.

Instructions

Use the DataFrame method drop to remove the following columns from the loans_2007 DataFrame:

id
member_id
funded_amnt
funded_amnt_inv
grade
sub_grade
emp_title
issue_d

loans_2007 = loans_2007.drop(["id", "member_id", "funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d"], axis=1)

5: Second Group Of Features

Let's now look at the next 18 columns:

name	dtype	first value	description
title	object	Computer	The loan title provided by the borrower
zip_code	object	860xx	The first 3 numbers of the zip code provided by the borrower in the loan application.
addr_state	object	AZ	The state provided by the borrower in the loan application
dti	float64	27.65	A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
delinq_2yrs	float64	0	The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years
earliest_cr_line	object	Jan-1985	The month the borrower's earliest reported credit line was opened
inq_last_6mths	float64	1	The number of inquiries in past 6 months (excluding auto and mortgage inquiries)
open_acc	float64	3	The number of open credit lines in the borrower's credit file.
pub_rec	float64	0	Number of derogatory public records
revol_bal	float64	13648	Total credit revolving balance
revol_util	object	83.7%	Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
total_acc	float64	9	The total number of credit lines currently in the borrower's credit file
initial_list_status	object	f	The initial listing status of the loan. Possible values are – W, F
out_prncp	float64	0	Remaining outstanding principal for total amount funded
out_prncp_inv	float64	0	Remaining outstanding principal for portion of total amount funded by investors
total_pymnt	float64	5863.16	Payments received to date for total amount funded
total_pymnt_inv	float64	5833.84	Payments received to date for portion of total amount funded by investors
total_rec_prncp	float64	5000	Principal received to date

Within this group of columns, we need to drop the following columns:

zip_code: redundant with the addr_state column since only the first 3 digits of the 5 digit zip code are visible (which can only be used to identify the state the borrower lives in)
out_prncp: leaks data from the future (after the loan has already started to be paid off)
out_prncp_inv: also leaks data from the future (after the loan has already started to be paid off)
total_pymnt: also leaks data from the future (after the loan has already started to be paid off)
total_pymnt_inv: also leaks data from the future (after the loan has already started to be paid off)
total_rec_prncp: also leaks data from the future (after the loan has already started to be paid off)

The out_prncp and out_prncp_inv columns both describe the outstanding principal amount for a loan, which is the remaining amount the borrower still owes. These 2 columns, as well as the total_pymnt column, describe properties of the loan after it's fully funded and has started to be paid off. This information isn't available to an investor before the loan is fully funded, so we don't want to include it in our model.
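
To see why such columns are dangerous, consider the following sketch on a handful of made-up loans (the values are hypothetical). A trivial rule that only looks at total_rec_prncp, a column recorded after the loan was issued, reproduces the outcome perfectly, which is exactly the kind of result that would not carry over to new loans.

```python
import pandas as pd

# Hypothetical toy loans; the numbers are made up for illustration.
toy = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0, 2400.0, 10000.0],
    "total_rec_prncp": [5000.0, 456.46, 2400.0, 10000.0],
    "loan_status": ["Fully Paid", "Charged Off", "Fully Paid", "Fully Paid"],
})

# A one-line "model" that peeks at the future: principal fully received => paid off.
leaky_prediction = toy["total_rec_prncp"] >= toy["loan_amnt"]
actual = toy["loan_status"] == "Fully Paid"

accuracy = (leaky_prediction == actual).mean()
print(accuracy)
```

For a loan that is still being listed, the received principal is not known yet, so a model that depends on it has nothing to work with at prediction time.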

Let's go ahead and remove these columns from the Dataframe.

Instructions
Use the Dataframe method drop to remove the following columns from the loans_2007 Dataframe:
zip_code
out_prncp
out_prncp_inv
total_pymnt
total_pymnt_inv
total_rec_prncp

loans_2007 = loans_2007.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1)

6: Third Group Of Features

Let's now move on to the last group of features:

name	dtype	first value	description
total_rec_int	float64	863.16	Interest received to date
total_rec_late_fee	float64	0	Late fees received to date
recoveries	float64	0	post charge off gross recovery
collection_recovery_fee	float64	0	post charge off collection fee
last_pymnt_d	object	Jan-2015	Last month payment was received
last_pymnt_amnt	float64	171.62	Last total payment amount received
last_credit_pull_d	object	Jun-2016	The most recent month LC pulled credit for this loan
collections_12_mths_ex_med	float64	0	Number of collections in 12 months excluding medical collections
policy_code	float64	1	publicly available policy_code=1 new products not publicly available policy_code=2
application_type	object	INDIVIDUAL	Indicates whether the loan is an individual application or a joint application with two co-borrowers
acc_now_delinq	float64	0	The number of accounts on which the borrower is now delinquent.
chargeoff_within_12_mths	float64	0	Number of charge-offs within 12 months
delinq_amnt	float64	0	The past-due amount owed for the accounts on which the borrower is now delinquent.
pub_rec_bankruptcies	float64	0	Number of public record bankruptcies
tax_liens	float64	0	Number of tax liens

In the last group of columns, we need to drop the following columns:

total_rec_int: leaks data from the future (after the loan has already started to be paid off)
total_rec_late_fee: also leaks data from the future (after the loan has already started to be paid off)
recoveries: also leaks data from the future (after the loan has already started to be paid off)
collection_recovery_fee: also leaks data from the future (after the loan has already started to be paid off)
last_pymnt_d: also leaks data from the future (after the loan has already started to be paid off)
last_pymnt_amnt: also leaks data from the future (after the loan has already started to be paid off)

All of these columns leak data from the future, meaning that they're describing aspects of the loan after it's already been fully funded and started to be paid off by the borrower.

Instructions

Use the DataFrame method drop to remove the following columns from the loans_2007 DataFrame:

total_rec_int
total_rec_late_fee
recoveries
collection_recovery_fee
last_pymnt_d
last_pymnt_amnt

Use the print function to:

display the first row of loans_2007 and
the number of columns in loans_2007.

loans_2007 = loans_2007.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1)
print(loans_2007.iloc[0])
print(loans_2007.shape[1])
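
As a cross-check on the three drop steps, the sketch below groups every removed column by the reason given in the text and counts them. The dictionary name drop_by_reason and the one-row stand-in DataFrame are hypothetical; only the column names and the starting count of 52 come from this walkthrough.

```python
import pandas as pd

# Columns removed in the three groups, keyed by the reason given in the text.
drop_by_reason = {
    "identifier": ["id", "member_id"],
    "leaks future data": [
        "funded_amnt", "funded_amnt_inv", "issue_d",
        "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv",
        "total_rec_prncp", "total_rec_int", "total_rec_late_fee",
        "recoveries", "collection_recovery_fee", "last_pymnt_d",
        "last_pymnt_amnt",
    ],
    "redundant": ["grade", "sub_grade", "zip_code"],
    "needs heavy processing": ["emp_title"],
}
all_drops = [col for cols in drop_by_reason.values() for col in cols]
remaining = 52 - len(all_drops)
print(len(all_drops), remaining)

# One-row stand-in frame: the dropped columns plus two that are kept.
columns = all_drops + ["loan_amnt", "loan_status"]
toy = pd.DataFrame([[0] * len(columns)], columns=columns)
toy = toy.drop(columns=all_drops)
print(list(toy.columns))
```

Keeping the reasons next to the column names makes it easier to revisit a decision later, for example if a dropped column turns out to be usable after more processing.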

7: Target Column

Just by becoming familiar with the columns in the dataset, we were able to reduce the number of columns from 52 to 32 (we dropped 8, 6, and 6 columns across the three groups). We now need to decide on a target column that we want to use for modeling.

We should use the loan_status column, since it's the only column that directly describes if a loan was paid off on time, had delayed payments, or was defaulted on by the borrower. Currently, this column contains text values, which we need to convert to numerical ones before training a model. Let's explore the different values in this column and come up with a strategy for converting them.

Instructions

Use the Series method value_counts to return the frequency of the unique values in the loan_status column.
Display the frequency of each unique value using the print function.

print(loans_2007['loan_status'].value_counts())
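
Besides raw counts, value_counts can also report proportions, which is handy for judging how dominant each status is. The sketch below runs on a short made-up Series rather than the real column.

```python
import pandas as pd

# Hypothetical status values; the real column has many more rows.
status = pd.Series(
    ["Fully Paid"] * 6 + ["Charged Off"] * 2 + ["Current"] * 2,
    name="loan_status",
)

counts = status.value_counts()                # absolute frequencies
shares = status.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```

Passing dropna=False to value_counts additionally counts missing values, which are skipped by default.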

8: Binary Classification

There are 9 different possible values for the loan_status column. You can read about most of the different loan statuses on the Lending Club website. Unfortunately, the 2 values that start with "Does not meet the credit policy" aren't explained there. A quick Google search takes us to explanations from the lending community here and here.

We've compiled the explanation for each status, as well as its count in the DataFrame, in the following table:

Loan Status	Count	Meaning
Fully Paid	33136	Loan has been fully paid off.
Charged Off	5634	Loan for which there is no longer a reasonable expectation of further payments.
Does not meet the credit policy. Status:Fully Paid	1988	While the loan was paid off, the loan application today would no longer meet the credit policy and wouldn't be approved on to the marketplace.
Does not meet the credit policy. Status:Charged Off	761	While the loan was charged off, the loan application today would no longer meet the credit policy and wouldn't be approved on to the marketplace.
In Grace Period	20	The loan is past due but still in the grace period of 15 days.
Late (16-30 days)	8	Loan hasn't been paid in 16 to 30 days (late on the current payment).
Late (31-120 days)	24	Loan hasn't been paid in 31 to 120 days (late on the current payment).
Current	961	Loan is up to date on current payments.
Default	3	Loan is defaulted on and no payment has been made for more than 121 days.

From the investor's perspective, we're interested in trying to predict which loans will be paid off on time and which ones won't be. Only the Fully Paid and Charged Off values describe the final outcome of the loan. The other values describe loans that are still ongoing, where the jury is still out on whether the borrower will pay back the loan on time or not. While the Default status resembles the Charged Off status, in Lending Club's eyes, loans that are charged off have essentially no chance of being repaid while defaulted ones have a small chance. You can read about the difference here.

Since we're interested in being able to predict which of these 2 values a loan will fall under, we can treat the problem as a binary classification one. Let's remove all the loans that don't contain either Fully Paid or Charged Off as the loan's status and then transform the Fully Paid values to 1 for the positive case and the Charged Off values to 0 for the negative case. While there are a few different ways to transform all of the values in a column, we'll use the DataFrame method replace. According to the documentation, we can pass the replace method a nested mapping dictionary in the following format:

mapping_dict = {
    "date": {
        "january": 1,
        "february": 2,
        "march": 3
    }
}

df = df.replace(mapping_dict)
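
Here is a runnable version of that format on a tiny made-up DataFrame. The note column is a hypothetical addition that shows the outer dictionary key limiting the replacement to the date column only.

```python
import pandas as pd

# Hypothetical data: "march" appears in both columns on purpose.
df = pd.DataFrame({
    "date": ["january", "february", "march"],
    "note": ["march", "x", "y"],
})

mapping_dict = {
    "date": {          # outer key: the column to look in
        "january": 1,  # inner pairs: old value -> new value
        "february": 2,
        "march": 3,
    }
}

df = df.replace(mapping_dict)

print(df["date"].tolist())
print(df["note"].tolist())
```

The value march in the note column is left untouched, because the mapping only names the date column.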

Lastly, one thing we need to keep in mind is the class imbalance between the positive and negative cases. While there are 33,136 loans that have been fully paid off, there are only 5,634 that were charged off. This class imbalance is a common problem in binary classification: during training, the model ends up with a strong bias towards predicting the class with more observations in the training set and will rarely predict the class with fewer observations. The stronger the imbalance, the more biased the model becomes. There are a few different ways to tackle this class imbalance, which we'll explore later.
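
A quick calculation with the counts quoted above shows how misleading plain accuracy can be here. The sketch computes the accuracy of a hypothetical baseline that always predicts Fully Paid, without training anything.

```python
fully_paid = 33136
charged_off = 5634

total = fully_paid + charged_off

# A baseline that labels every loan as Fully Paid is right for every paid loan
# and wrong for every charged-off loan.
baseline_accuracy = fully_paid / total
imbalance_ratio = fully_paid / charged_off

print(total)
print(round(baseline_accuracy, 4))
print(round(imbalance_ratio, 2))
```

Such a baseline is correct for roughly 85% of the loans while never flagging a single charged-off loan, which is the opposite of what a conservative investor needs.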

Instructions

Remove all rows from loans_2007 that contain values other than Fully Paid or Charged Off for the loan_status column.
Use the DataFrame method replace to replace:
- Fully Paid with 1
- Charged Off with 0

loans_2007 = loans_2007[(loans_2007['loan_status'] == "Fully Paid") | (loans_2007['loan_status'] == "Charged Off")]

status_replace = {
    "loan_status": {
        "Fully Paid": 1,
        "Charged Off": 0,
    }
}

loans_2007 = loans_2007.replace(status_replace)
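
The same filter-and-encode step can also be written with isin and map, which is one of the other ways mentioned earlier. This is an alternative sketch on a small made-up DataFrame, not the approach the instructions ask for; the name final_statuses is hypothetical.

```python
import pandas as pd

# Hypothetical rows covering both final and non-final statuses.
toy = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Current", "Fully Paid", "Default"],
    "loan_amnt": [5000, 2500, 2400, 10000, 3000],
})

final_statuses = {"Fully Paid": 1, "Charged Off": 0}

# Keep only rows whose status is one of the two final outcomes.
toy = toy[toy["loan_status"].isin(list(final_statuses))].copy()

# Encode the remaining text labels as integers.
toy["loan_status"] = toy["loan_status"].map(final_statuses)

print(toy)
```

Because the allowed statuses and their codes live in one dictionary, the filter and the encoding cannot drift apart.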

9: Removing Single Value Columns

To wrap up this mission, let's look for any columns that contain only one unique value and remove them. These columns won't be useful for the model since they don't add any information to each loan application. In addition, removing these columns will reduce the number of columns we'll need to explore further in the next mission.

We'll need to compute the number of unique values in each column and drop the columns that contain only one unique value. While the Series method unique returns the unique values in a column, it also counts the Pandas missing value object nan as a value:

# Returns 0 and nan.
unique_values = loans_2007['tax_liens'].unique()

Since we're trying to find columns that contain one true unique value, we should first drop the null values and then compute the number of unique values:

non_null = loans_2007['tax_liens'].dropna()
unique_non_null = non_null.unique()
num_true_unique = len(unique_non_null)
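
The difference is easy to verify on a short made-up Series. The sketch also shows the Series method nunique, which skips missing values by default and so gives the same count in one call.

```python
import numpy as np
import pandas as pd

# Hypothetical column with a single real value plus missing entries.
tax_liens = pd.Series([0.0, np.nan, 0.0, np.nan])

with_nan = len(tax_liens.unique())              # nan counts as a value here
without_nan = len(tax_liens.dropna().unique())  # nan removed first
shortcut = tax_liens.nunique()                  # ignores nan by default

print(with_nan, without_nan, shortcut)
```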

Instructions

Remove any columns from loans_2007 that contain only one unique value:
- Create an empty list, drop_columns, to keep track of which columns you want to drop
- For each column:
  - Use the Series method dropna to remove any null values and then use the Series method unique to return the set of non-null unique values
  - Use the len() function to return the number of values in that set
  - Append the column to drop_columns if it contains only 1 unique value
- Use the DataFrame method drop to remove the columns in drop_columns from loans_2007
Use the print function to display drop_columns so we know which ones were removed

orig_columns = loans_2007.columns
drop_columns = []
for col in orig_columns:
    # Ignore missing values when counting the distinct values in a column.
    col_series = loans_2007[col].dropna().unique()
    if len(col_series) == 1:
        drop_columns.append(col)
loans_2007 = loans_2007.drop(drop_columns, axis=1)
print(drop_columns)
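
The loop can also be expressed without explicit iteration by using the DataFrame method nunique, which counts non-null distinct values per column. This is an alternative sketch on a made-up DataFrame that borrows three column names from the dataset; it is not what the instructions ask for.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: two single-valued columns and one informative column.
toy = pd.DataFrame({
    "policy_code": [1.0, 1.0, 1.0],
    "tax_liens": [0.0, np.nan, 0.0],
    "loan_amnt": [5000.0, 2500.0, 2400.0],
})

# Number of distinct non-null values in each column.
unique_counts = toy.nunique()

drop_columns = unique_counts[unique_counts == 1].index.tolist()
toy = toy.drop(columns=drop_columns)

print(drop_columns)
print(list(toy.columns))
```

A column made up entirely of missing values has a count of 0, so, like the loop version, this sketch leaves it in place.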

10: Next Steps

It looks like we were able to remove 9 more columns since they only contained 1 unique value.

In this mission, we started to become familiar with the columns in the dataset and removed many columns that aren't useful for modeling. We also selected our target column and decided to focus our modeling efforts on binary classification. In the next mission, we'll explore the individual features in greater depth and work towards training our first machine learning model.
