Knowledge points:
lubridate package for taking datetimes apart | POSIXlt
Decision tree classification and random forest prediction
Fitting on logarithms, then restoring with the exp function
The training set comes from the Washington, D.C. bike-sharing data in the Kaggle Bike Sharing competition; the goal is to analyze how rentals relate to weather, time, and other factors. The data set has 11 variables and more than 10,000 rows.
First, a look at the official data. There are two tables, both covering 2011-2012. The difference: the Test file contains every day of each month but is missing the counts of registered and casual users, while the Train file covers only days 1-20 of each month but includes both user types.
Task: predict the user counts for the remaining days of each month (day 21 onward) in the Test file. The evaluation criterion is how closely the predictions match the actual counts.
First, load the files and packages:
library(lubridate)
library(randomForest)
library(readr)
setwd("E:")
train <- read_csv("train.csv")
head(train)
I hit a pitfall here: base R's default read.csv did not parse the datetime column correctly, and going through xlsx was even worse, with the dates constantly turning into strange serial numbers like 43045. I had used as.Date successfully before, but this time the values include minutes and seconds, so I could only try treating them as timestamps, and the result was still poor.
In the end I installed the readr package, and its read_csv function parsed the file cleanly.
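The difference between the two readers can be seen directly from the class of the datetime column. A minimal sketch, assuming "train.csv" is in the working directory and its datetime column uses the Kaggle format "2011-01-01 00:00:00":

```r
library(readr)

train <- read_csv("train.csv")   # readr guesses column types
class(train$datetime)            # parsed as POSIXct (a real date-time)

base <- read.csv("train.csv")    # base R leaves the column as plain text
class(base$datetime)             # "character" (a "factor" before R 4.0)
```

Once the column is text, lubridate or POSIXlt can still take it apart, which is the first knowledge point above; for example `lubridate::ymd_hms("2011-01-01 05:00:00")` parses the string, and `as.POSIXlt(...)$hour` extracts the hour.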
Because the Test dates are more complete than the Train dates but the user counts are missing, Train and Test must be merged.
test <- read_csv("test.csv")
test$registered <- 0
test$casual <- 0
test$count <- 0
data <- rbind(train, test)
Extracting the time: you could use a timestamp, but the time here is simple, just the hour of day, so you can also slice it straight out of the string.
data$hour1 <- substr(data$datetime, 12, 13)
table(data$hour1)
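The same extraction can be done with lubridate or base POSIXlt instead of string slicing, per the knowledge points; a minimal sketch on a sample value in the Kaggle datetime format:

```r
library(lubridate)

x <- "2011-01-01 05:00:00"       # sample value from the datetime column

dt <- ymd_hms(x)                 # parse the string into a POSIXct datetime
hour(dt)                         # extract the hour with lubridate
wday(dt, label = TRUE)           # day of week, useful for the plots below

# base-R equivalent via POSIXlt, which stores the parts in named fields
lt <- as.POSIXlt(x, format = "%Y-%m-%d %H:%M:%S")
lt$hour
```

Unlike `substr`, these return numbers directly, so no later `as.integer` conversion is needed.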
Counting the total usage per hour gives a table like this (why is it so even?):
The next step is to use box plots to look at the relationship between users, hour of day, and day of week. Why box plots instead of histograms? Because box plots show outliers as discrete points, and it is because of these extreme values that the fit is done on logarithms.
As can be seen from the figure, the hour-of-day usage patterns of registered and casual users differ greatly.
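A minimal sketch of such a box plot, assuming the merged data frame built above (column names follow the Kaggle files; `hour1` was created earlier, and only the rows that came from train.csv carry real counts):

```r
train_rows <- data[data$count > 0, ]   # test rows were filled with 0 above

# registered rentals per hour; log1p tames the extreme values
boxplot(log1p(registered) ~ hour1, data = train_rows,
        xlab = "hour of day", ylab = "log(1 + registered)")

# the same plot for casual users shows a very different daily profile
boxplot(log1p(casual) ~ hour1, data = train_rows,
        xlab = "hour of day", ylab = "log(1 + casual)")
```

`log1p` is used rather than `log` so that hours with zero rentals do not produce -Inf.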
Correlation coefficient: a measure of linear association between variables, used to test how strongly different columns are related. Its range is [-1, 1]; the closer to 0, the weaker the relationship. The calculation shows that the user counts are negatively correlated with wind speed, whose impact is larger than that of temperature.
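A minimal sketch of that calculation, assuming the numeric column names from the Kaggle files (`temp`, `atemp`, `humidity`, `windspeed`, `count`):

```r
num_cols <- c("temp", "atemp", "humidity", "windspeed", "count")

# pairwise correlation matrix of the training data, rounded for reading;
# look down the "count" column for each variable's linear association
round(cor(train[, num_cols]), 2)
```

Note that `cor` only measures linear association; the strongly hour-shaped patterns seen in the box plots are not captured by it, which is one motivation for the tree-based models below.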
The decision tree model is a simple, easy-to-use non-parametric classifier. It requires no prior assumptions about the data, computes quickly, yields easily interpreted results, and is robust to noisy and missing data. Here we build a decision tree of registered users against the hour. The basic steps of the decision tree model are: select one of the n independent variables and find its best split point, dividing the data into two groups; then repeat these steps on each group until some stopping condition is met.
There are three important issues to solve in decision tree modeling:
How to choose the independent variables
How to choose the split point
Determine the conditions for stopping the division
library(rpart)
library(rpart.plot)
train$hour1 <- as.integer(train$hour1)
d <- rpart(registered ~ hour1, data = train)
rpart.plot(d)
Then the hours are bucketed by hand according to the decision tree's split points, so the code is rather repetitive...
data$hour1 <- as.integer(data$hour1)
# bucket the hours at the rpart split points (7.5, 8.5, 9.5, 18, 20, 22);
# each range is half-open so the buckets do not overlap
data$dp_reg <- 0
data$dp_reg[data$hour1 < 7.5] <- 1
data$dp_reg[data$hour1 >= 22] <- 2
data$dp_reg[data$hour1 >= 9.5 & data$hour1 < 18] <- 3
data$dp_reg[data$hour1 >= 7.5 & data$hour1 < 8.5] <- 4
data$dp_reg[data$hour1 >= 8.5 & data$hour1 < 9.5] <- 5
data$dp_reg[data$hour1 >= 20 & data$hour1 < 22] <- 6
data$dp_reg[data$hour1 >= 18 & data$hour1 < 20] <- 7
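The remaining knowledge points, random forest prediction and fitting on logarithms with an exp restore, can then be sketched as follows. This is a sketch, not the author's exact model: the feature list and `ntree` value are illustrative, and the rows to predict are identified by the zero counts filled in during the merge:

```r
library(randomForest)

data$hour1 <- as.integer(data$hour1)         # ensure the hour is numeric

train_part <- data[data$count > 0, ]          # rows that came from train.csv
test_part  <- data[data$count == 0, ]         # rows whose counts we must fill

set.seed(415)
# fit on log1p(registered) so extreme hours carry less weight
fit <- randomForest(log1p(registered) ~ hour1 + dp_reg + temp +
                      humidity + windspeed + workingday,
                    data = train_part, ntree = 250)

# predictions come back on the log scale; expm1 (inverse of log1p)
# restores them to actual rental counts
test_part$registered <- expm1(predict(fit, test_part))
```

The same pattern would be repeated for `casual`, with `count` recovered as their sum.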