In the past three missions, we learned about decision trees, and looked at ways to reduce overfitting. The most powerful method for reducing decision tree overfitting is the random forest algorithm. In this mission, we'll learn how to construct and apply random forests.
We've been using a dataset on US income, which we'll keep using here. The data is from the 1994 Census, and contains information on an individual's marital status, age, type of work, and more. The target column, high_income, indicates whether an individual makes less than or equal to 50k a year (0), or more than 50k a year (1).
You can download the data from here.
A random forest is a kind of ensemble model. Ensembles combine the predictions of multiple models to create a more accurate final prediction. We'll make a simple ensemble to see how it works.
We'll create two decision trees with slightly different parameters:
min_samples_leaf set to 2
max_depth set to 5
and check their accuracy separately. In the next screen, we'll combine their predictions and compare the combined accuracy with either tree's accuracy.
Fit both clf and clf2 to the data, using train[columns] as the predictors and train["high_income"] as the target.
Make predictions on the test set predictors (test[columns]) using both clf and clf2.
For both sets of predictions, compute the AUC between the predictions and the actual values (test["high_income"]) using the roc_auc_score function.
Use the print function to display the AUC values for both.

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
columns = ["age", "workclass", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "hours_per_week", "native_country"]
clf = DecisionTreeClassifier(random_state=1, min_samples_leaf=2)
clf.fit(train[columns], train["high_income"])
clf2 = DecisionTreeClassifier(random_state=1, max_depth=5)
clf2.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
print(roc_auc_score(test["high_income"], predictions))
predictions = clf2.predict(test[columns])
print(roc_auc_score(test["high_income"], predictions))
When we have multiple classifiers making predictions, we can treat each set of predictions as a column in a matrix. Here's an example where we have Decision Tree 1 (DT1), Decision Tree 2 (DT2), and Decision Tree 3 (DT3):
DT1 DT2 DT3
0 1 0
1 1 1
0 0 1
1 0 0
When we add more models to our ensemble, we just add more columns to the combined predictions. Ultimately, we don't want this matrix, though -- we want one prediction per row in the training data. To do this, we'll need to create rules to turn each row of our matrix of predictions into a single number.
We want to create a Final Prediction vector:
DT1 DT2 DT3 Final Prediction
0 1 0 0
1 1 1 1
0 0 1 0
1 0 0 0
There are many ways to get from the output of multiple models to a final vector of predictions. One method is majority voting, where each classifier gets a "vote", and the most commonly voted value for each row wins. This only works if there are more than 2 classifiers (and ideally an odd number so we don't have to write a rule to break ties). Majority voting is what we applied in the example above.
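To make the voting rule concrete, here's a minimal sketch (not part of the mission code) that applies majority voting to the example matrix above with numpy:

import numpy

# Each column is one tree's predictions (DT1, DT2, DT3 from the example above).
votes = numpy.array([
    [0, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [1, 0, 0]
])

# With three voters, a row's majority is 1 when at least two of the three predictions are 1.
# Summing each row and comparing to half the number of voters implements that rule.
final_prediction = (votes.sum(axis=1) > votes.shape[1] / 2).astype(int)
print(final_prediction)   # [0 1 0 0]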
Since in the last screen we only had two classifiers, we'll have to use a different method to combine predictions. We'll take the mean of all the items in a row. Right now, we're using the predict method, which returns either 0 or 1. predict returns something like this:
0
1
0
1
We can instead use the predict_proba method, which will predict a probability from 0 to 1 that a given class is the right one for a row. Since 0 and 1 are our two classes, we'll get a matrix with as many rows as the income dataframe and 2 columns. predict_proba will return something like this:
0 1
.7 .3
.2 .8
.1 .9
Each row will correspond to a prediction. The first column is the probability that the prediction is a 0, and the second column is the probability that the prediction is a 1. Each row adds up to 1.
If we just take the second column, we get the probability the classifier assigns to class 1 for each row. If there's a .9 probability that the correct classification is 1, we can use the .9 as the value the classifier is predicting. This gives us a continuous output in a single vector instead of just 0 or 1.
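As a quick illustration of that slicing (a sketch that assumes clf has already been fit, as in the earlier screen), we can pull out just the second column with numpy-style indexing:

# A sketch: clf is the first tree from the earlier screen, already fit on train[columns].
probabilities = clf.predict_proba(test[columns])
print(probabilities[:5])            # One row per test row, one column per class; each row sums to 1.

# Keep only the probability of class 1 as a single continuous vector.
class_one_probabilities = probabilities[:, 1]
print(class_one_probabilities[:5])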
We can then add all of the vectors we get through this method together and divide by the number of vectors to get the mean prediction across all the members of the ensemble. We can then round off to get 0 or 1 predictions.
If we use the predict_proba method on both classifiers from the last screen to generate probabilities, take the mean for each row, and then round the results, we'll get ensemble predictions.
Add predictions and predictions2, then divide by 2 to get the mean.
Use numpy.round to round all of the resulting predictions.
Print the AUC between the rounded predictions and test["high_income"].

import numpy
predictions = clf.predict_proba(test[columns])[:,1]
predictions2 = clf2.predict_proba(test[columns])[:,1]
combined = (predictions + predictions2) / 2
rounded = numpy.round(combined)
print(roc_auc_score(test["high_income"], rounded))
As we can see from the previous screen, the combined predictions of the two trees had a higher AUC than either tree:
settings | test AUC
--- | ---
min_samples_leaf: 2 | 0.688
max_depth: 5 | 0.676
combined predictions | 0.715
To intuitively understand why this makes sense, think about two people at the same talent level. One learned programming in college. The other learned on their own (let's say using Dataquest!).
If you give both of them a project, since they both have different knowledge and experience, they'll both approach it in slightly different ways. They may both produce code that achieves the same result, but one may run faster in certain areas. The other may have a better interface. Even though both of them have about the same talent level, because they approach the problem differently, their solutions are stronger in different areas.
If we combine the best parts of both of their projects, we'll end up with a stronger combined project.
Ensembling works the same way. Both models approach the problem slightly differently, and build a different tree because we used different parameters for each. Each tree makes different predictions in different areas. Even though both trees have about the same accuracy, when we combine them, the result is stronger because it leverages the strengths of both approaches.
The more "diverse", or dissimilar, the models used to construct an ensemble, the stronger the combined predictions will be (assuming that all models have about the same accuracy). Ensembling a decision tree and a logistic regression model, which use very different approaches to arrive at their answers, will result in stronger predictions than ensembling two decision trees with similar parameters.
On the other side, if the models you ensemble are very similar in how they make predictions, you'll get a negligible boost from ensembling.
Ensembling models with very different accuracies will not generally improve your accuracy. Ensembling a model with a .75 AUC and a model with a .85 AUC on a test set will usually result in an AUC somewhere in between the two original values. There's a way around this, called weighting, which we'll discuss later on.
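Weighting is covered later, but as a rough sketch of the idea (this is an illustration only, with made-up probabilities and weights, not the mission's implementation), a weighted average just gives the stronger model more say in the combined prediction:

import numpy

# Hypothetical probability predictions from a weaker (.75 AUC) and a stronger (.85 AUC) model.
weak_probabilities = numpy.array([0.3, 0.6, 0.8, 0.2])
strong_probabilities = numpy.array([0.1, 0.9, 0.7, 0.4])

# Give the stronger model more weight than the weaker one; the weights here are arbitrary.
combined = 0.25 * weak_probabilities + 0.75 * strong_probabilities
rounded = numpy.round(combined)
print(rounded)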
A random forest is an ensemble of decision trees. If we don't make any modifications to the trees, each tree will be the exact same, so we'll get no boost when we ensemble them. In order to make ensembling effective, we have to introduce variation into each individual decision tree model.
If we introduce variation, each tree will be constructed slightly differently, and therefore will make different predictions. This variation is why the word "random" is in "random forest".
There are two main ways to introduce variation in a random forest -- bagging and random feature subsets. We'll dive into bagging first.
In a random forest, each tree isn't trained using the whole dataset. Instead, it's trained on a random sample of the data, or a "bag". This sampling is performed with replacement. When we sample with replacement, after we select a row from the data we're sampling, we put the row back in the data so it can be picked again. Some rows from the original data may appear in the "bag" multiple times.
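To see what a "bag" looks like in practice, here's a small sketch using the train dataframe from the earlier screens; because replace=True, the same row index can show up more than once:

# Draw one bag containing 60% as many rows as train, sampling with replacement.
bag = train.sample(frac=0.6, replace=True, random_state=1)
print(bag.shape)

# Because rows go back into the pool after being picked, some indexes repeat.
print(bag.index.value_counts().head())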
Let's use bagging with the first tree we trained.
predictions is a list of vectors, each corresponding to predictions on the test set.
Sum the vectors in predictions and divide by 10 to get the mean prediction for each row.
Use numpy.round to round the resulting predictions.
Print the AUC between the rounded predictions and test["high_income"].

# We'll build 10 trees
tree_count = 10

# Each "bag" will have 60% of the number of original rows.
bag_proportion = .6

predictions = []
for i in range(tree_count):
    # We select 60% of the rows from train, sampling with replacement.
    # We set a random state to ensure we'll be able to replicate our results.
    # We set it to i instead of a fixed value so we don't get the same sample every loop.
    # That would make all of our trees the same.
    bag = train.sample(frac=bag_proportion, replace=True, random_state=i)

    # Fit a decision tree model to the "bag".
    clf = DecisionTreeClassifier(random_state=1, min_samples_leaf=2)
    clf.fit(bag[columns], bag["high_income"])

    # Using the model, make predictions on the test data.
    predictions.append(clf.predict_proba(test[columns])[:,1])

combined = numpy.sum(predictions, axis=0) / 10
rounded = numpy.round(combined)
print(roc_auc_score(test["high_income"], rounded))
With the bagging example from the previous screen, we gained some accuracy over a single decision tree. We achieved an AUC score of around .733 with bagging, an improvement over the AUC score of .688 we achieved without bagging:
settings | test AUC
--- | ---
min_samples_leaf: 2 | 0.688
max_depth: 5 | 0.676
combined predictions | 0.715
min_samples_leaf: 2, with bagging | 0.732
Let's go back to the decision tree algorithm we explored 2 missions ago to explain random feature subsets:
We're repeating the same process to select the optimal split for a node, but we'll only evaluate a constrained set of features, selected randomly. This introduces variation into the trees, and makes for more powerful ensembles.
Below is the ID3 algorithm that we developed earlier. We'll modify it to only consider a certain subset of the features.
Modify find_best_column to select a random sample from columns before computing information gain.
Look for the comment that says Insert code here.
The random sample should have 2 items in it.
Use numpy.random.choice to select the random sample. Be careful not to overwrite columns when you do the selection.
Use the print function to display tree.
import pandas
import numpy

# Create the dataset that we used 2 missions ago.
data = pandas.DataFrame([
    [0,4,20,0],
    [0,4,60,2],
    [0,5,40,1],
    [1,4,25,1],
    [1,5,35,2],
    [1,5,55,1]
    ])
data.columns = ["high_income", "employment", "age", "marital_status"]

# Set a random seed to make results reproducible.
numpy.random.seed(1)

# The dictionary to store our tree.
tree = {}
nodes = []

# The function to find the column to split on.
# calc_information_gain is the helper function we wrote 2 missions ago, and is assumed to be available here.
def find_best_column(data, target_name, columns):
    information_gains = []
    # Insert your code here.
    for col in columns:
        information_gain = calc_information_gain(data, col, "high_income")
        information_gains.append(information_gain)
    # Find the name of the column with the highest gain.
    highest_gain_index = information_gains.index(max(information_gains))
    highest_gain = columns[highest_gain_index]
    return highest_gain

# The function to construct an id3 decision tree.
def id3(data, target, columns, tree):
    unique_targets = pandas.unique(data[target])
    nodes.append(len(nodes) + 1)
    tree["number"] = nodes[-1]

    if len(unique_targets) == 1:
        if 0 in unique_targets:
            tree["label"] = 0
        elif 1 in unique_targets:
            tree["label"] = 1
        return

    best_column = find_best_column(data, target, columns)
    column_median = data[best_column].median()

    tree["column"] = best_column
    tree["median"] = column_median

    left_split = data[data[best_column] <= column_median]
    right_split = data[data[best_column] > column_median]
    split_dict = [["left", left_split], ["right", right_split]]

    for name, split in split_dict:
        tree[name] = {}
        id3(split, target, columns, tree[name])

# Run the id3 algorithm on our dataset and print the resulting tree.
id3(data, "high_income", ["employment", "age", "marital_status"], tree)
print(tree)
def find_best_column(data, target_name, columns):
    information_gains = []
    # Select two columns randomly.
    cols = numpy.random.choice(columns, 2)
    for col in cols:
        information_gain = calc_information_gain(data, col, "high_income")
        information_gains.append(information_gain)
    highest_gain_index = information_gains.index(max(information_gains))
    # Get the highest gain by indexing cols.
    highest_gain = cols[highest_gain_index]
    return highest_gain

id3(data, "high_income", ["employment", "age", "marital_status"], tree)
print(tree)
7: Random Subsets In Scikit-Learn
We can also repeat our random subset selection process in scikit-learn. We just set the splitter parameter on DecisionTreeClassifier to "random", and the max_features parameter to "auto". If we have N columns, this will pick a subset of features of size √N, compute the Gini impurity (a measure similar to information gain) for each, and split the node on the best column in the subset.
This is essentially the same thing we did in the previous screen, but with far less typing.
Instructions
Modify the instantiation of the DecisionTreeClassifier object.
Set splitter to "random", and max_features to "auto".
Print the resulting AUC score.
# We'll build 10 trees
tree_count = 10

# Each "bag" will have 60% of the number of original rows.
bag_proportion = .6

predictions = []
for i in range(tree_count):
    # We select 60% of the rows from train, sampling with replacement.
    # We set a random state to ensure we'll be able to replicate our results.
    # We set it to i instead of a fixed value so we don't get the same sample every time.
    bag = train.sample(frac=bag_proportion, replace=True, random_state=i)

    # Fit a decision tree model to the "bag".
    clf = DecisionTreeClassifier(random_state=1, min_samples_leaf=2)
    clf.fit(bag[columns], bag["high_income"])

    # Using the model, make predictions on the test data.
    predictions.append(clf.predict_proba(test[columns])[:,1])

combined = numpy.sum(predictions, axis=0) / 10
rounded = numpy.round(combined)
print(roc_auc_score(test["high_income"], rounded))
predictions = []
for i in range(tree_count):
    # We select 60% of the rows from train, sampling with replacement.
    # We set a random state to ensure we'll be able to replicate our results.
    # We set it to i instead of a fixed value so we don't get the same sample every time.
    bag = train.sample(frac=bag_proportion, replace=True, random_state=i)

    # Fit a decision tree model to the "bag".
    clf = DecisionTreeClassifier(random_state=1, min_samples_leaf=2, splitter="random", max_features="auto")
    clf.fit(bag[columns], bag["high_income"])

    # Using the model, make predictions on the test data.
    predictions.append(clf.predict_proba(test[columns])[:,1])

combined = numpy.sum(predictions, axis=0) / 10
rounded = numpy.round(combined)
print(roc_auc_score(test["high_income"], rounded))
Using random subsets from the previous screen improved the accuracy versus just using bagging:
settings | test AUC
--- | ---
min_samples_leaf: 2 | 0.688
max_depth: 5 | 0.676
combined predictions | 0.715
min_samples_leaf: 2, with bagging | 0.732
min_samples_leaf: 2, with bagging and random subsets | 0.735
So far we've demonstrated the two building blocks of random forests, bagging and random feature subsets. Luckily, we don't have to write code from scratch each time. Scikit-learn has a RandomForestClassifier class and a RandomForestRegressor class that enable us to quickly train and test random forest models.
When we instantiate a RandomForestClassifier, we pass in an n_estimators parameter that indicates how many trees to build. While adding more trees usually improves accuracy, it also increases the overall time the model takes to train.
RandomForestClassifier has a similar interface to DecisionTreeClassifier, and we can use the fit and predict methods to train and make predictions.
Fit clf to the training data and make predictions on the test data. Compute and print the AUC between the predictions and test["high_income"].

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=5, random_state=1, min_samples_leaf=2)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
print(roc_auc_score(test["high_income"], predictions))
Similarly to decision trees, we can tweak a few parameters with random forests:
min_samples_leaf
min_samples_split
max_depth
max_leaf_nodes
These parameters apply to the individual trees in the model, and change how they are constructed. There are also parameters specific to the random forest that alter how it's constructed as a whole:
n_estimators
bootstrap -- defaults to True. Bootstrap aggregation is another name for bagging, and this parameter indicates whether to turn it on.
Check the documentation for a full list of parameters.
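As an example of how the two groups of parameters fit together, here's a sketch with arbitrary values (train, test, and columns are the same objects used throughout this mission):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Tree-level parameters (min_samples_leaf, max_depth) shape each individual tree;
# forest-level parameters (n_estimators, bootstrap) shape the ensemble as a whole.
clf = RandomForestClassifier(n_estimators=10, bootstrap=True, min_samples_leaf=2, max_depth=10, random_state=1)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
print(roc_auc_score(test["high_income"], predictions))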
By tweaking parameters, we can increase the accuracy of the forest. The easiest tweak is to increase the number of estimators we use. This has diminishing returns -- going from 10 trees to 100 will make a bigger difference than going from 100 to 500, which will make a bigger difference than going from 500 to 1000. The accuracy gain grows roughly logarithmically with the number of trees, so increasing the tree count beyond a certain point (usually around 200) won't help much at all.
Increase n_estimators to 150, refit the model, and print the new AUC.
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=5, random_state=1, min_samples_leaf=2)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
print(roc_auc_score(test["high_income"], predictions))
clf = RandomForestClassifier(n_estimators=150, random_state=1, min_samples_leaf=2)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
print(roc_auc_score(test["high_income"], predictions))
While we were able to improve the AUC from 0.735 to 0.738, the model using 150 trees took much longer to train. While the extra training time is trivial on the dataset we're working with right now, understanding this tradeoff will help you when working with much larger datasets where the extra training time could be hours or days!
One of the major advantages of random forests over single decision trees is that they tend to overfit less. Although each individual decision tree in a random forest varies widely, the average of their predictions is less sensitive to the input data than a single tree is. This is because while one tree can construct an incorrect and overfit model, the average of 100 or more trees will be more likely to home in on the signal and ignore the noise. The signal will be the same across all the trees, whereas each tree will home in on the noise differently. This means that the average will discard the noise and keep the signal.
In the code cell, you'll see that we've fit a single decision tree to the training set, and made predictions for both the training set and testing set. The AUC for the training set predictions is .819, while the AUC for the testing set is .714. Since the test AUC is much lower than the train AUC, the model is overfitting.
Let's now train a similar random forest model and contrast.
Fit clf to the training set and use it to make predictions on the training set, then on the test set. Print the AUC for both sets of predictions.

clf = DecisionTreeClassifier(random_state=1, min_samples_leaf=5)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(train[columns])
print(roc_auc_score(train["high_income"], predictions))
predictions = clf.predict(test[columns])
print(roc_auc_score(test["high_income"], predictions))
clf = RandomForestClassifier(n_estimators=150, random_state=1, min_samples_leaf=5)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(train[columns])
print(roc_auc_score(train["high_income"], predictions))
predictions = clf.predict(test[columns])
print(roc_auc_score(test["high_income"], predictions))
As we can see in the code cell from the previous screen, overfitting decreased with a random forest and accuracy went up overall.
The random forest algorithm is incredibly powerful, but isn't applicable to all tasks. The main strengths of a random forest are:
Very accurate predictions on many kinds of problems.
Resistance to overfitting -- parameters like max_depth still have to be set and tweaked, though.
The main weaknesses are:
They're hard to interpret -- because the predictions are an average over many trees, it's difficult to explain why the model makes a particular prediction.
They take longer to train -- although tree construction can be parallelized through the n_jobs parameter on RandomForestClassifier. We'll get more into parallelization later.
Given these tradeoffs, it makes sense to use random forests in situations where accuracy is of the utmost importance, and being able to interpret or explain the decisions the model is making isn't key. In cases where time is of the essence, or interpretability is important, a single decision tree may be a better choice.
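As a small preview of that parallelization (a sketch only -- the actual speedup depends on how many cores your machine has), the n_jobs parameter tells scikit-learn how many cores to use when building the trees:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# n_jobs=2 builds trees on two cores; n_jobs=-1 would use every available core.
clf = RandomForestClassifier(n_estimators=150, random_state=1, min_samples_leaf=5, n_jobs=2)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
print(roc_auc_score(test["high_income"], predictions))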
In the next mission, we'll explore parallelizing random forest creation more, and look more into applications of random forests.