You should see a much better $R^2$ for the training data, but a much worse one for the validation data. What happened?
This is a phenomenon called overfitting: our model has too many degrees of freedom (one parameter for each of the 100+ features of this dataset). As a result, the model fits the training data reasonably well, but at the expense of being too specific to that data, so it generalizes poorly to data it has not seen.
John von Neumann famously said, "With four parameters I can fit an elephant, and with five I can make him wiggle his trunk!"
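To see overfitting in isolation, here is a minimal sketch on synthetic data (not the dataset used in this notebook). It assumes one informative feature plus 100 pure-noise features; the sample sizes, coefficients, and variable names are all made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_noise = 100  # number of uninformative features (assumption for this demo)

def make_data(n):
    """Synthetic data: y depends only on the first column of X."""
    signal = rng.normal(size=(n, 1))
    noise_features = rng.normal(size=(n, n_noise))
    y = 2.0 * signal[:, 0] + rng.normal(size=n)
    return np.hstack([signal, noise_features]), y

X_train, y_train = make_data(150)
X_val, y_val = make_data(150)

# "Small" model sees only the informative feature; "large" sees all 101.
small = LinearRegression().fit(X_train[:, :1], y_train)
large = LinearRegression().fit(X_train, y_train)

scores = {
    "small": (small.score(X_train[:, :1], y_train), small.score(X_val[:, :1], y_val)),
    "large": (large.score(X_train, y_train), large.score(X_val, y_val)),
}
for name, (r2_train, r2_val) in scores.items():
    print(f"{name}: train R^2 = {r2_train:.2f}, validation R^2 = {r2_val:.2f}")
```

The large model should report the higher training $R^2$ and the lower validation $R^2$: the extra parameters are spent fitting noise that does not recur in new data.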
2.4 Non-linear modeling and Regression Trees
So, we're at an impasse. When we didn't have enough features, our model performed poorly; when we added too many, our model looked good on training data, but not so good on test data.
What's a modeler to do?
There are a couple of ways of dealing with this situation. One of them is called regularization, which you might try on your own (see Ridge and Lasso in scikit-learn's linear_model module). Another is to use a model that captures non-linear relationships between the features and the response variable.
One such type of model was pioneered here at Berkeley, by the late, great Leo Breiman. These models are called regression trees.
The basic idea behind regression trees is to recursively partition the dataset into subsets that are similar with respect to the response variable.
If we take our temperature example, we might observe a non-linear relationship - electricity gets expensive when it's cold outside because we use the heater, but it also gets expensive when it's too hot outside because we run the air conditioning.
A decision tree model might dynamically elect to split the data on the temperature feature, and estimate high prices both for hot and cold, with lower prices for more Berkeley-like temperatures. Go read the scikit-learn decision trees documentation for more background.
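The temperature example can be sketched on made-up data. The block below assumes a hypothetical U-shaped relationship (price lowest near 65 °F, rising in both directions); the numbers are invented for illustration and do not come from this notebook's dataset. It compares a straight-line fit with a shallow tree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def make_data(n):
    """Hypothetical data: price is lowest near 65 F and rises when hotter or colder."""
    temp = rng.uniform(30, 100, size=n)
    price = 20 + 0.05 * (temp - 65) ** 2 + rng.normal(scale=3, size=n)
    return temp.reshape(-1, 1), price

X_train, y_train = make_data(500)
X_val, y_val = make_data(500)

# A line cannot be high at both ends and low in the middle; a tree can.
linear = LinearRegression().fit(X_train, y_train)
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)

print(f"linear validation R^2: {linear.score(X_val, y_val):.2f}")
print(f"tree validation R^2:   {tree.score(X_val, y_val):.2f}")

# Tree predictions for a cold, a mild, and a hot day.
cold, mild, hot = tree.predict(np.array([[35.0], [65.0], [95.0]]))
print(f"predicted price at 35 F: {cold:.1f}, 65 F: {mild:.1f}, 95 F: {hot:.1f}")
```

In this sketch the tree splits the temperature axis into a handful of intervals and predicts the mean price within each, so it can assign high prices to both extremes, while the linear model has no way to express that shape with a single slope.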
Exercise 12
a. Using the scikit-learn DecisionTreeRegressor API, write a function that fits trees with the parameter 'max_depth' exposed to the user and set to 10 by default.