In the last mission, we removed the columns that contained redundant information, weren't useful for modeling, required too much processing to make useful, or leaked information from the future. We exported the Dataframe from the end of that mission to a CSV file named filtered_loans_2007.csv to differentiate it from the loans_2007.csv file we used previously. In this mission, we'll prepare the data for machine learning by focusing on handling missing values, converting categorical columns to numeric columns, and removing any other extraneous columns we encounter throughout this process.
This is because the mathematics underlying most machine learning models assumes that the data is numerical and contains no missing values. To reinforce this requirement, scikit-learn will return an error if you try to train a model like linear regression or logistic regression using data that contains missing values or non-numeric values.
Let's start by computing the number of missing values and coming up with a strategy for handling them. Then, we'll focus on the categorical columns.
We can return the number of missing values across the Dataframe by:
- first, using the isnull Dataframe method to return a Dataframe of Boolean values:
  - True if the original value is null,
  - False if the original value isn't null,
- then, using the sum Dataframe method to calculate the number of null values in each column.
null_counts = df.isnull().sum()
Instructions:
- Read in filtered_loans_2007.csv as a Dataframe and assign it to loans.
- Use the isnull and sum methods to return the number of null values in each column. Assign the resulting Series object to null_counts.
- Use the print function to display null_counts.
import pandas as pd
loans = pd.read_csv('filtered_loans_2007.csv')
null_counts = loans.isnull().sum()
print(null_counts)
While most of the columns have 0 missing values, 2 columns have 50 or fewer rows with missing values, and 1 column, pub_rec_bankruptcies, contains 697 rows with missing values. Let's remove any column entirely where more than 1% of its rows contain a null value. In addition, we'll remove the remaining rows containing null values.
This means that we'll keep the following columns and just remove rows containing missing values for them:
- title
- revol_util
- last_credit_pull_d
and drop the pub_rec_bankruptcies column entirely, since more than 1% of its rows have a missing value.
Let's use the strategy of removing the pub_rec_bankruptcies column first, then removing all rows containing any missing values at all, to cover both of these cases. This way, we only remove the rows containing missing values for the title and revol_util columns but not the pub_rec_bankruptcies column.
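Rather than eyeballing the counts, we can apply the 1% threshold programmatically. Here's a minimal sketch, assuming loans is the Dataframe we read in above; it should flag pub_rec_bankruptcies and nothing else:
# Fraction of rows that are null in each column.
null_fractions = loans.isnull().sum() / len(loans)
# Columns where more than 1% of the rows contain a null value.
print(null_fractions[null_fractions > 0.01])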
Instructions:
- Remove the pub_rec_bankruptcies column from loans.
- Remove all rows from loans containing any missing values.
- Use the dtypes attribute followed by the value_counts() method to return the counts for each column data type. Use the print function to display these counts.
loans = loans.drop("pub_rec_bankruptcies", axis=1)
loans = loans.dropna(axis=0)
print(loans.dtypes.value_counts())
While the numerical columns can be used natively with scikit-learn, the object columns that contain text need to be converted to numerical data types. Let's return a new Dataframe containing just the object columns so we can explore them in more depth. You can use the Dataframe method select_dtypes to select only the columns of a certain data type:
float_df = df.select_dtypes(include=['float'])
Let's select just the object columns then display a sample row to get a better sense of how the values in each column are formatted.
Instructions:
- Use the Dataframe method select_dtypes to select only the columns of object type from loans and assign the resulting Dataframe to object_columns_df.
- Display the first row in object_columns_df using the print function.
object_columns_df = loans.select_dtypes(include=["object"])
print(object_columns_df.iloc[0])
Some of the columns seem like they represent categorical values, but we should confirm by checking the number of unique values in those columns:
- home_ownership: home ownership status; can only be 1 of 4 categorical values according to the data dictionary,
- verification_status: indicates if income was verified by Lending Club,
- emp_length: number of years the borrower was employed upon time of application,
- term: number of payments on the loan, either 36 or 60,
- addr_state: borrower's state of residence,
- purpose: a category provided by the borrower for the loan request,
- title: loan title provided by the borrower.
There are also some columns that represent numeric values and need to be converted:
- int_rate: interest rate of the loan in %,
- revol_util: revolving line utilization rate, or the amount of credit the borrower is using relative to all available credit.
Based on the first row's values for purpose and title, it seems like these columns could reflect the same information. Let's explore the unique value counts separately to confirm if this is true.
Lastly, some of the columns contain date values that would require a good amount of feature engineering to be potentially useful:
- earliest_cr_line: the month the borrower's earliest reported credit line was opened,
- last_credit_pull_d: the most recent month Lending Club pulled credit for this loan.
Since these date features require some feature engineering for modeling purposes, let's remove these date columns from the Dataframe.
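Before dropping them, it's worth a concrete picture of what that feature engineering might look like. Here's a minimal sketch, assuming the values are formatted like "Jan-1985" (the derived feature and the reference date are our own illustrative choices, not part of this mission):
# Parse the month strings into datetimes (format like "Jan-1985").
earliest = pd.to_datetime(loans["earliest_cr_line"], format="%b-%Y")
# One possible engineered feature: age of the credit line, in years,
# relative to an arbitrary reference date.
reference = pd.Timestamp("2011-12-01")
print(((reference - earliest).dt.days / 365.25).head())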
Let's explore the unique value counts of the columns that seem like they contain categorical values.
Instructions:
- Display the unique value counts of the home_ownership, verification_status, emp_length, term, and addr_state columns:
  - Store these column names in a list named cols.
  - Use a for loop to iterate over cols:
    - Use the print function combined with the Series method value_counts to display each column's unique value counts.
cols = ['home_ownership', 'verification_status', 'emp_length', 'term', 'addr_state']
for c in cols:
    print(loans[c].value_counts())
The home_ownership, verification_status, emp_length, term, and addr_state columns all contain multiple discrete values. We should clean the emp_length column and treat it as a numerical one, since the values have ordering (2 years of employment is less than 8 years).
First, let's look at the unique value counts for the purpose and title columns to understand which column we want to keep.
Instructions:
- Use the value_counts method and the print function to display the unique values in the following columns:
  - purpose
  - title
print(loans["purpose"].value_counts())
print(loans["title"].value_counts())
The home_ownership, verification_status, emp_length, and term columns each contain a few discrete categorical values. We should encode these columns as dummy variables and keep them.
It seems like the purpose and title columns do contain overlapping information, but we'll keep the purpose column since it contains a few discrete values. In addition, the title column has data quality issues, since many of its values are repeated with slight modifications (e.g. Debt Consolidation, Debt Consolidation Loan, and debt consolidation).
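To see how much of the title column's variety is just casing and minor wording differences, we can normalize the strings before counting. A quick sketch (the normalization choices here are our own):
# Lowercase and trim whitespace so values like "Debt Consolidation" and
# "debt consolidation" collapse into a single count.
normalized_titles = loans["title"].str.lower().str.strip()
print(normalized_titles.value_counts().head(10))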
We can use the following mapping to clean the emp_length column (it appears as mapping_dict in the code further below): "10+ years" maps to 10, "9 years" down through "1 year" map to the corresponding integer, and both "< 1 year" and "n/a" map to 0.
We erred on the side of being conservative with the 10+ years, < 1 year, and n/a mappings. We assume that people who may have been working more than 10 years have only really worked for 10 years. We also assume that people who've worked less than a year, or whose information is not available, have worked for 0 years. This is a general heuristic, but it's not perfect.
Lastly, the addr_state column contains many discrete values, and we'd need to add 49 dummy variable columns to use it for classification. This would make our Dataframe much larger and could slow down how quickly the code runs. Let's remove this column from consideration.
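A quick check of the column's cardinality confirms why dummy coding it would be expensive; this sketch assumes addr_state is still in loans at this point:
# Number of distinct states; each would become its own dummy column.
print(loans["addr_state"].nunique())
print(loans["addr_state"].value_counts().head())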
Instructions:
- Remove the last_credit_pull_d, addr_state, title, and earliest_cr_line columns from loans.
- Convert the int_rate and revol_util columns to float columns by:
  - using the str accessor followed by the rstrip string method to strip the right trailing percent sign (%): loans['int_rate'].str.rstrip('%') returns a new Series with % stripped from the right side of each value,
  - then using the astype method on the resulting Series to convert it to the float type.
- Use the replace method to clean the emp_length column.
mapping_dict = {
"emp_length": {
"10+ years": 10,
"9 years": 9,
"8 years": 8,
"7 years": 7,
"6 years": 6,
"5 years": 5,
"4 years": 4,
"3 years": 3,
"2 years": 2,
"1 year": 1,
"< 1 year": 0,
"n/a": 0
}
}
loans = loans.drop(["last_credit_pull_d", "earliest_cr_line", "addr_state", "title"], axis=1)
loans["int_rate"] = loans["int_rate"].str.rstrip("%").astype("float")
loans["revol_util"] = loans["revol_util"].str.rstrip("%").astype("float")
loans = loans.replace(mapping_dict)
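As a quick sanity check (not part of the mission's instructions), we can confirm that the percent columns are now floats and that emp_length now contains integers:
# The percent columns should now be float64, and emp_length integer-valued.
print(loans[["int_rate", "revol_util", "emp_length"]].dtypes)
print(loans["emp_length"].value_counts().sort_index())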
Let's now encode the home_ownership, verification_status, purpose, and term columns as dummy variables so we can use them in our model. We first need to use the Pandas get_dummies method to return a new Dataframe containing a new column for each dummy variable:
# Returns a new Dataframe containing 1 column for each dummy variable.
dummy_df = pd.get_dummies(loans[["term", "verification_status"]])
We can then use the concat method to add these dummy columns back to the original Dataframe:
loans = pd.concat([loans, dummy_df], axis=1)
and then drop the original columns entirely using the drop method:
loans = loans.drop(["verification_status", "term"], axis=1)
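If the get_dummies behavior is unfamiliar, here's a self-contained toy example with made-up values; the resulting columns are named with the original column as a prefix:
import pandas as pd
# Toy column with two distinct categorical values.
toy = pd.Series(["36 months", "60 months", "36 months"], name="term")
# get_dummies returns one indicator column per unique value:
# term_36 months and term_60 months.
print(pd.get_dummies(toy, prefix="term"))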
Instructions:
- Encode the home_ownership, verification_status, purpose, and term columns as integer values:
  - Use the Series method astype to convert each column to the category data type.
  - Use the get_dummies function to return a Dataframe containing a new column for each dummy variable.
  - Use the concat method to add these dummy columns back to loans.
  - Remove the original columns (home_ownership, verification_status, purpose, and term) from loans.
cat_columns = ["home_ownership", "verification_status", "purpose", "term"]
dummy_df = pd.get_dummies(loans[cat_columns])
loans = pd.concat([loans, dummy_df], axis=1)
loans = loans.drop(cat_columns, axis=1)
In this mission, we performed the final data preparation steps necessary to start training machine learning models. We converted all of the columns to numerical values, because those are the only type of value scikit-learn can work with. In the next mission, we'll experiment with training models and evaluating accuracy using cross-validation.