Solution notebook: https://github.com/dataquestio/solutions/blob/master/Mission213Solution.ipynb
In many American cities, there are communal bicycle sharing stations where you can rent bicycles by the hour or by the day. Washington, D.C. is one of these cities, and has detailed data available about how many bicycles were rented by hour and by day.
Hadi Fanaee-T at the University of Porto compiled this data into a CSV file, which you'll be working with in this project. The file contains 17,380 rows, and each row represents the bike rentals in a single hour of a single day. The data can be downloaded here. If you need help at any point, you can consult our solution notebook, linked at the top of this page.
Here are explanations of the relevant columns:
- instant -- a unique sequential id number for each row.
- dteday -- the date the rentals occurred on.
- season -- the season the rentals occurred in.
- yr -- the year the rentals occurred in.
- mnth -- the month the rentals occurred in.
- hr -- the hour the rentals occurred in.
- holiday -- whether or not the day was a holiday.
- weekday -- whether or not the day was a weekday.
- workingday -- whether or not the day was a working day.
- weathersit -- the weather situation (categorical variable).
- temp -- the temperature, on a 0-1 scale.
- atemp -- the adjusted temperature.
- hum -- the humidity, on a 0-1 scale.
- windspeed -- the wind speed, on a 0-1 scale.
- casual -- the number of casual riders (people who hadn't previously signed up with the bike sharing program) that rented bikes.
- registered -- the number of registered riders (people who had signed up previously) that rented bikes.
- cnt -- the total number of bikes rented (casual + registered).

In this project, you'll try to predict the total number of bikes rented in a given hour. You'll predict the cnt column using all the other columns, except casual and registered. To do this, you'll create a few different machine learning models and evaluate their performance.
Instructions
- Read bike_rental_hour.csv into the Dataframe bike_rentals.
- Print out the first few rows of bike_rentals and take a look at the data.
- Make a histogram of the cnt column of bike_rentals, and take a look at the distribution of total rentals.
- Use the bike_rentals Dataframe to explore how each column is correlated with cnt.
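A minimal sketch of the steps above, assuming the CSV file sits in the working directory and that pandas and matplotlib are installed (the file name and variable names follow the instructions):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Read the hourly rental data into a DataFrame.
bike_rentals = pd.read_csv("bike_rental_hour.csv")

# Take a first look at the data.
print(bike_rentals.head())

# Distribution of total rentals per hour.
bike_rentals["cnt"].hist()
plt.show()

# Correlation of every numeric column with cnt (numeric_only needs pandas >= 1.5).
print(bike_rentals.corr(numeric_only=True)["cnt"].sort_values(ascending=False))
```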
It can often be helpful to calculate features before applying machine learning models. Features can enhance the accuracy of models by introducing new information, or distilling existing information.
For example, the hr column in bike_rentals contains the hours during which bikes are rented, from 0 to 23. A machine will treat each hour differently, and not understand that certain hours are related. We can introduce some order into this by creating a new column with labels for morning, afternoon, evening, and night. This will bundle up similar times together, and enable the model to make better decisions.
Instructions
- Write a function called assign_label that takes in a numeric hour value, and returns:
  - 1 if the hour is from 6 to 12.
  - 2 if the hour is from 12 to 18.
  - 3 if the hour is from 18 to 24.
  - 4 if the hour is from 0 to 6.
- Apply the function to each item in the hr column.
- Assign the result to the time_label column of bike_rentals.
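One possible implementation of the steps above, assuming half-open intervals so the boundary hours (6, 12, 18) each fall into a single bucket:

```python
def assign_label(hour):
    # Bucket a numeric hour (0-23) into one of four time-of-day labels.
    if 6 <= hour < 12:
        return 1  # morning
    elif 12 <= hour < 18:
        return 2  # afternoon
    elif 18 <= hour <= 24:
        return 3  # evening
    else:
        return 4  # night (0 to 6)

# Apply the function to each item in the hr column and store the result.
bike_rentals["time_label"] = bike_rentals["hr"].apply(assign_label)
```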
Before you can start applying machine learning algorithms, you'll need to split the data into training and testing sets. This will enable you to train an algorithm using the training set and evaluate its accuracy on the testing set. If you train an algorithm on the training data and evaluate its performance on the same data, you can get an unrealistically low error value, due to overfitting.
Instructions
- Based on your exploration of the cnt column, pick an error metric you want to use to evaluate the performance of the machine learning algorithms. Write up a markdown cell explaining why you picked this metric.
- Select 80% of the rows in bike_rentals to be part of the training set using the sample method on bike_rentals. Assign the result to train.
- Select the rows that are in bike_rentals but not in train to be in the testing set. Assign the result to test.
  - This expression produces a Boolean mask that's False when a row in bike_rentals is not found in train: bike_rentals.index.isin(train.index)
  - This expression selects the rows in bike_rentals not found in train to be in the testing set: bike_rentals.loc[~bike_rentals.index.isin(train.index)]
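A minimal sketch of the split, assuming a fixed random_state for reproducibility; mean squared error is one reasonable metric choice here, since cnt is a continuous count and larger mistakes should be penalized more heavily:

```python
# Select 80% of the rows at random for the training set.
train = bike_rentals.sample(frac=0.8, random_state=1)

# Everything whose index is not in train becomes the testing set.
test = bike_rentals.loc[~bike_rentals.index.isin(train.index)]

print(train.shape, test.shape)
```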
Now that you've done some data exploration and manipulation, you're ready to apply linear regression to the data. Linear regression will likely work fairly well on this data, given that many of the columns are highly correlated with cnt.

As you learned in earlier missions, linear regression works best when predictors are linearly correlated to the target, and when predictors are independent and don't change meaning when combined with each other. The good thing about linear regression is that it's fairly resistant to overfitting because it's simple, but it can also be prone to underfitting the data and not building a powerful enough model. This means that linear regression usually isn't the most accurate option.
You'll need to ignore the casual and registered columns because cnt is derived from them. If you're trying to predict the number of people who rent bikes in a given hour (cnt), it doesn't make sense that you'd already know casual or registered, because those numbers are added together to get cnt.
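A hedged sketch of fitting a linear regression, assuming mean squared error as the metric and assuming dteday is also dropped because it's a date string rather than a numeric feature:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Every column except the target, its two leaky components, and the date string.
predictors = bike_rentals.columns.drop(["cnt", "casual", "registered", "dteday"])

lr = LinearRegression()
lr.fit(train[predictors], train["cnt"])

predictions = lr.predict(test[predictors])
print("Linear regression MSE:", mean_squared_error(test["cnt"], predictions))
```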
5: Applying Decision Trees
You're now ready to apply the decision tree algorithm. You'll be able to compare the error with the error from linear regression, which will enable you to pick the right algorithm for this dataset.
Decision trees tend to predict outcomes much more reliably than linear regression. Because decision trees are a fairly complex model, they also tend to overfit, particularly when parameters such as maximum depth and minimum number of samples per leaf aren't tweaked. Decision trees are also prone to instability -- small changes in the input data can result in a very different output model.
Instructions
- Use the DecisionTreeRegressor class to fit a decision tree algorithm to the train data.
- Make predictions using the DecisionTreeRegressor class on test.
- Calculate the error between the predictions and the actual values.
- Experiment with various parameters of the DecisionTreeRegressor class, including min_samples_leaf, to see if it changes error.
- Write a markdown cell with your thoughts on the predictions and the error.
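A sketch of the same workflow with a decision tree, reusing the predictors list and the train/test split from above and looping over a few min_samples_leaf values to see how the test error moves:

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Try a few leaf sizes; larger leaves usually mean less overfitting.
for leaf in [1, 5, 10, 20]:
    tree = DecisionTreeRegressor(min_samples_leaf=leaf, random_state=1)
    tree.fit(train[predictors], train["cnt"])
    predictions = tree.predict(test[predictors])
    print(leaf, mean_squared_error(test["cnt"], predictions))
```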
6: Applying Random Forests
You can now apply the random forest algorithm, which improves on the decision tree algorithm. Random forests tend to be much more accurate than simple models like linear regression. Because of how random forests are constructed, they tend to overfit much less than decision trees. Random forests can still be prone to overfitting, though, so tuning parameters such as maximum depth and minimum samples per leaf is important.
Instructions
- Use the RandomForestRegressor class to fit a random forest algorithm to the train data.
- Make predictions using the RandomForestRegressor class on test.
- Calculate the error between the predictions and the actual values.
- Experiment with various parameters of the RandomForestRegressor class, including min_samples_leaf, to see if it changes error.
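A sketch with a random forest, again reusing predictors and the train/test split; the n_estimators and min_samples_leaf values shown here are illustrative starting points, not tuned settings:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# An ensemble of trees averages out much of the single-tree variance.
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=2, random_state=1)
rf.fit(train[predictors], train["cnt"])

predictions = rf.predict(test[predictors])
print("Random forest MSE:", mean_squared_error(test["cnt"], predictions))
```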
That's it for the guided steps! We recommend exploring the data more on your own.
Here are some potential next steps:
- Try predicting casual and registered instead of cnt.

We recommend creating a GitHub repository and placing this project there. It will help other people, including employers, see your work. As you start to put multiple projects on GitHub, you'll have the beginnings of a strong portfolio.
You're welcome to keep working on the project here, but we recommend downloading it to your computer using the download icon above and working on it there.
We hope this guided project has been a good experience, and please email us at hello@dataquest.io if you want to share your work!