Guided Project: Predicting bike rentals

https://github.com/dataquestio/solutions/blob/master/Mission213Solution.ipynb

 

1: The Dataset

In many American cities, there are communal bicycle sharing stations where you can rent bicycles by the hour or by the day. Washington, D.C. is one of these cities, and has detailed data available about how many bicycles were rented by hour and by day.

Hadi Fanaee-T at the University of Porto compiled this data into a CSV file, which you'll be working with in this project. The file contains 17380 rows, and each row represents the bike rentals in a single hour of a single day. The data can be downloaded here. If you need help at any point, you can consult our solution notebook here.

Here's what the first 5 rows look like:

[Image: the first 5 rows of the dataset]

Here are explanations of the relevant columns:

  • instant -- a unique sequential id number for each row.
  • dteday -- the date the rentals occurred on.
  • season -- the season the rentals occurred in.
  • yr -- the year the rentals occurred in.
  • mnth -- the month the rentals occurred in.
  • hr -- the hour the rentals occurred in.
  • holiday -- whether or not the day was a holiday.
  • weekday -- the day of the week, encoded as a number from 0 to 6.
  • workingday -- whether or not the day was a working day.
  • weathersit -- the weather situation (categorical variable).
  • temp -- the temperature on a 0-1 scale.
  • atemp -- the adjusted temperature.
  • hum -- the humidity on a 0-1 scale.
  • windspeed -- the wind speed on a 0-1 scale.
  • casual -- the number of casual riders (people who hadn't previously signed up with the bike sharing program) that rented bikes.
  • registered -- the number of registered riders (people who signed up previously) that rented bikes.
  • cnt -- the total number of bikes rented (casual + registered).

In this project, you'll try to predict the total number of bikes rented in a given hour. You'll predict the cnt column using all the other columns, except casual and registered. To do this, you'll create a few different machine learning models and evaluate their performance.

Instructions

  • Use the pandas library to read bike_rental_hour.csv into the DataFrame bike_rentals.
  • Print out the first few rows of bike_rentals and take a look at the data.
  • Make a histogram of the cnt column of bike_rentals, and take a look at the distribution of total rentals.
  • Use the corr method on the bike_rentals DataFrame to explore how each column is correlated with cnt.
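
The steps above can be sketched roughly as follows. So that it runs without the file, this version reads a tiny in-memory stand-in for bike_rental_hour.csv; the rows are made up for illustration and hold only a subset of the real columns. The binned value_counts call is a text substitute for a plotted histogram, and select_dtypes is one possible way to keep the text column dteday out of the correlation.

```python
import io

import pandas as pd

# Stand-in for bike_rental_hour.csv: a few MADE-UP rows with a subset of the
# real columns, so the sketch runs without the file.
sample_csv = io.StringIO(
    "instant,dteday,hr,temp,hum,casual,registered,cnt\n"
    "1,2011-01-01,0,0.24,0.81,3,13,16\n"
    "2,2011-01-01,8,0.30,0.75,10,90,100\n"
    "3,2011-01-01,12,0.50,0.60,40,110,150\n"
    "4,2011-01-01,17,0.55,0.50,60,240,300\n"
    "5,2011-01-01,22,0.35,0.70,8,42,50\n"
)

# In the project itself: bike_rentals = pd.read_csv("bike_rental_hour.csv")
bike_rentals = pd.read_csv(sample_csv)
print(bike_rentals.head())

# Text version of a histogram: how many rows fall into each cnt range.
# In a notebook with matplotlib installed, bike_rentals["cnt"].hist() draws it.
cnt_bins = bike_rentals["cnt"].value_counts(bins=3, sort=False)
print(cnt_bins)

# Correlation of every numeric column with cnt (dteday is text, so it is skipped).
correlations = bike_rentals.select_dtypes("number").corr()["cnt"]
print(correlations.sort_values())
```

On the real data, casual and registered will show very high correlations with cnt, since cnt is their sum.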

2: Calculating Features

It can often be helpful to calculate features before applying machine learning models. Features can enhance the accuracy of models by introducing new information, or distilling existing information.

For example, the hr column in bike_rentals contains the hours during which bikes are rented, from 0 to 23. A machine will treat each hour differently, and not understand that certain hours are related. We can introduce some order into this by creating a new column with labels for morning, afternoon, evening, and night. This will bundle up similar times together, and enable the model to make better decisions.

Instructions

  • Write a function called assign_label that takes in a numeric hour value, and returns:
    • 1 if the hour is from 6 up to, but not including, 12.
    • 2 if the hour is from 12 up to, but not including, 18.
    • 3 if the hour is from 18 up to, but not including, 24.
    • 4 if the hour is from 0 up to, but not including, 6.
  • Use the apply method on Series to apply the function to each item in the hr column.
  • Assign the result to the time_label column of bike_rentals.
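
One way to write this function is sketched below. It assumes each range includes its lower bound and excludes its upper bound, so hour 12 counts as afternoon and hour 18 as evening. The small DataFrame is a stand-in for the real bike_rentals.

```python
import pandas as pd


def assign_label(hour):
    """Map an hour of the day (0-23) to a time-of-day label."""
    if 6 <= hour < 12:
        return 1  # morning
    elif 12 <= hour < 18:
        return 2  # afternoon
    elif 18 <= hour < 24:
        return 3  # evening
    else:
        return 4  # night (0 up to 6); also catches any out-of-range value


# Small stand-in for the hr column of bike_rentals.
bike_rentals = pd.DataFrame({"hr": [0, 5, 6, 11, 12, 17, 18, 23]})
bike_rentals["time_label"] = bike_rentals["hr"].apply(assign_label)
print(bike_rentals)
```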

3: Train/Test Split

Before you can start applying machine learning algorithms, you'll need to split the data into training and testing sets. This will enable you to train an algorithm using the training set and evaluate its accuracy on the testing set. If you train an algorithm on the training data, and evaluate its performance on the same data, you can get an unrealistically low error value, due to overfitting.

Instructions

  • Based on your explorations of the cnt column, pick an error metric you want to use to evaluate the performance of the machine learning algorithms. Write up a markdown cell explaining why you picked this metric.
  • Select 80% of the rows in bike_rentals to be part of the training set using the sample method on bike_rentals. Assign the result to train.
  • Select the rows that are in bike_rentals but not in train to be in the testing set. Assign the result to test.
    • This line will generate a Boolean array that is False when a row in bike_rentals is not found in train: bike_rentals.index.isin(train.index)
    • This line will select any rows in bike_rentals not found in train to be in the testing set: bike_rentals.loc[~bike_rentals.index.isin(train.index)]
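
Put together, the split might look like the sketch below. The 20-row DataFrame is a made-up stand-in for bike_rentals, and random_state is an optional addition here that makes the sample reproducible.

```python
import pandas as pd

# Made-up stand-in for bike_rentals with 20 rows.
bike_rentals = pd.DataFrame({"hr": range(20), "cnt": range(100, 120)})

# Randomly pick 80% of the rows for training.
train = bike_rentals.sample(frac=0.8, random_state=1)

# True for rows whose index label also appears in train.
in_train = bike_rentals.index.isin(train.index)

# Everything that is not in train becomes the testing set.
test = bike_rentals.loc[~in_train]

print(len(train), len(test))
```

Because test is built from the rows left out of train, the two sets never share a row and together cover the whole DataFrame.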

4: Applying Linear Regression

Now that you've done some data exploration and manipulation, you're ready to apply linear regression to the data. Linear regression will likely work fairly well on this data, given that many of the columns are highly correlated with cnt.

As you learned in earlier missions, linear regression works best when predictors are linearly correlated to the target, and when predictors are independent, and don't change meaning when combined with each other. The good thing about linear regression is that it is fairly resistant to overfitting because it is simple, but it also can be prone to underfitting the data, and not building a powerful enough model. This means that linear regression usually isn't the most accurate option.

You'll need to ignore the casual and registered columns because cnt is derived from these columns. If you're trying to predict the number of people who rent bikes in a given hour (cnt), it doesn't make sense that you'd already know casual or registered, because those numbers are added together to get cnt.
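
A minimal sketch of this step, using scikit-learn's LinearRegression, is shown below. The data is randomly generated as a stand-in for bike_rentals, and mean squared error is used only as an example metric; the choice of metric is left to you in the previous section. Dropping dteday is also an assumption made here, since a text date column can't be passed to the model directly.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Randomly generated stand-in for bike_rentals (values are NOT real data).
rng = np.random.default_rng(0)
n = 200
bike_rentals = pd.DataFrame({
    "hr": rng.integers(0, 24, n),
    "temp": rng.random(n),
    "hum": rng.random(n),
    "casual": rng.integers(0, 50, n),
    "registered": rng.integers(0, 200, n),
})
bike_rentals["cnt"] = bike_rentals["casual"] + bike_rentals["registered"]

train = bike_rentals.sample(frac=0.8, random_state=1)
test = bike_rentals.loc[~bike_rentals.index.isin(train.index)]

# Leave out the target, the two columns it is derived from, and the text date.
excluded = ["cnt", "casual", "registered", "dteday"]
predictors = [col for col in bike_rentals.columns if col not in excluded]

reg = LinearRegression()
reg.fit(train[predictors], train["cnt"])
predictions = reg.predict(test[predictors])

# Example metric: mean squared error on the testing set.
mse = mean_squared_error(test["cnt"], predictions)
print(mse)
```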

 

5: Applying Decision Trees

You're now ready to apply the decision tree algorithm. You'll be able to compare the error with the error from linear regression, which will enable you to pick the right algorithm for this dataset.

Because decision trees can model nonlinear relationships, they tend to predict outcomes on data like this more accurately than linear regression. Because decision trees are a fairly complex model, they also tend to overfit, particularly when parameters such as maximum depth and minimum number of samples per leaf aren't tweaked. Decision trees are also prone to instability -- small changes in the input data can result in a very different output model.

Instructions

  • Use the DecisionTreeRegressor class to fit a decision tree algorithm to the train data.
  • Make predictions using the DecisionTreeRegressor class on test.
  • Calculate the error between the predictions and the actual values.
  • Experiment with various parameters of the DecisionTreeRegressor class, including min_samples_leaf, to see if it changes error.
  • Write a markdown cell with your thoughts on the predictions and the error.
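
The sketch below shows one way to run this experiment. The data is randomly generated with a made-up relationship between hr, temp, and cnt, so the error values say nothing about the real dataset. Mean squared error and the particular min_samples_leaf values tried are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Randomly generated stand-in for bike_rentals (NOT real data).
rng = np.random.default_rng(0)
n = 300
hr = rng.integers(0, 24, n)
temp = rng.random(n)
# Made-up relationship: more rentals at commute hours and in warm weather.
cnt = 50 + 100 * np.isin(hr, [8, 17, 18]) + 80 * temp + rng.normal(0, 10, n)
bike_rentals = pd.DataFrame({"hr": hr, "temp": temp, "cnt": cnt})

train = bike_rentals.sample(frac=0.8, random_state=1)
test = bike_rentals.loc[~bike_rentals.index.isin(train.index)]
predictors = ["hr", "temp"]

# Fit one tree per min_samples_leaf value and record the test error.
errors = {}
for leaf in [1, 5, 20]:
    tree = DecisionTreeRegressor(min_samples_leaf=leaf, random_state=0)
    tree.fit(train[predictors], train["cnt"])
    predictions = tree.predict(test[predictors])
    errors[leaf] = mean_squared_error(test["cnt"], predictions)

for leaf, mse in errors.items():
    print(f"min_samples_leaf={leaf}: MSE={mse:.1f}")
```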

 

6: Applying Random Forests

You can now apply the random forest algorithm, which improves on the decision tree algorithm. Random forests tend to be much more accurate than simple models like linear regression. Because of how random forests are constructed, they tend to overfit much less than decision trees. Random forests can still be prone to overfitting, though, and tuning parameters such as maximum depth and minimum samples per leaf is important.

Instructions

  • Use the RandomForestRegressor class to fit a random forest algorithm to the train data.
  • Make predictions using the RandomForestRegressor class on test.
  • Calculate the error between the predictions and the actual values.
  • Experiment with various parameters of the RandomForestRegressor class, including min_samples_leaf, to see if it changes error.
  • Write a markdown cell with your thoughts on the predictions and the error.
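
A rough sketch of the same experiment with a random forest follows. Again the data is randomly generated, and the metric (mean squared error), the number of trees, and the min_samples_leaf values are illustrative assumptions. Recording the error on both train and test is one simple way to look for overfitting: a training error far below the testing error suggests the model is memorizing the training rows.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Randomly generated stand-in for bike_rentals (NOT real data).
rng = np.random.default_rng(0)
n = 300
hr = rng.integers(0, 24, n)
temp = rng.random(n)
# Made-up relationship: more rentals at commute hours and in warm weather.
cnt = 50 + 100 * np.isin(hr, [8, 17, 18]) + 80 * temp + rng.normal(0, 10, n)
bike_rentals = pd.DataFrame({"hr": hr, "temp": temp, "cnt": cnt})

train = bike_rentals.sample(frac=0.8, random_state=1)
test = bike_rentals.loc[~bike_rentals.index.isin(train.index)]
predictors = ["hr", "temp"]

# For each setting, store (training error, testing error).
results = {}
for leaf in [1, 5]:
    forest = RandomForestRegressor(
        n_estimators=50, min_samples_leaf=leaf, random_state=0
    )
    forest.fit(train[predictors], train["cnt"])
    train_mse = mean_squared_error(train["cnt"], forest.predict(train[predictors]))
    test_mse = mean_squared_error(test["cnt"], forest.predict(test[predictors]))
    results[leaf] = (train_mse, test_mse)
    print(f"min_samples_leaf={leaf}: train MSE={train_mse:.1f}, test MSE={test_mse:.1f}")
```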

7: Next Steps

That's it for the guided steps! We recommend exploring the data more on your own.

Here are some potential next steps:

  • Calculate more features, such as:
    • An index combining temperature, humidity, and wind speed.
  • Try predicting casual and registered instead of cnt.
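
As a starting point for the last idea, the sketch below fits one model per rider type and adds the two predictions together to estimate cnt. Everything about the data is made up, including the assumed behavior of casual and registered riders. The decision tree and the mean squared error metric are just one possible choice.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Randomly generated stand-in for bike_rentals (NOT real data).
rng = np.random.default_rng(0)
n = 300
temp = rng.random(n)
workingday = rng.integers(0, 2, n)
# Made-up behavior: casual riders favor warm non-working days,
# registered riders favor working days.
casual = 40 * temp + 20 * (1 - workingday) + rng.normal(0, 3, n)
registered = 60 * temp + 80 * workingday + rng.normal(0, 5, n)
bike_rentals = pd.DataFrame({
    "temp": temp,
    "workingday": workingday,
    "casual": casual,
    "registered": registered,
})
bike_rentals["cnt"] = bike_rentals["casual"] + bike_rentals["registered"]

train = bike_rentals.sample(frac=0.8, random_state=1)
test = bike_rentals.loc[~bike_rentals.index.isin(train.index)]
predictors = ["temp", "workingday"]

# Fit a separate model for each rider type.
predictions = {}
for target in ["casual", "registered"]:
    model = DecisionTreeRegressor(min_samples_leaf=5, random_state=0)
    model.fit(train[predictors], train[target])
    predictions[target] = model.predict(test[predictors])

# The sum of the two predictions is an estimate of cnt.
combined = predictions["casual"] + predictions["registered"]
mse = mean_squared_error(test["cnt"], combined)
print(mse)
```

Comparing this error with the error from predicting cnt directly shows whether modeling the two groups separately helps.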

We recommend creating a GitHub repository and placing this project there. It will help other people, including employers, see your work. As you start to put multiple projects on GitHub, you'll have the beginnings of a strong portfolio.

You're welcome to keep working on the project here, but we recommend downloading it to your computer using the download icon above and working on it there.

We hope this guided project has been a good experience, and please email us at hello@dataquest.io if you want to share your work!
