Intro
I recently completed a project for my data science boot camp and learned how transforming numerical data helps in building a regression model. In this post, I'd like to focus on how transforming numerical data improves the validation of the model, rather than on how the model looks or how it can be used to predict a value.
I will briefly go over the data source and structure and give a quick explanation of the methods used in this post. I will also include some Python code and mathematical formulas to aid understanding.
Data
I am using the data from the project I completed. It is real-life data on house sales in King County, Washington, and includes house prices and multiple house features. Below is the list of variables used in the analysis.
Dependent Variable
House Price
Independent Variables
Numerical
Living space in square feet
Lot size in square feet
Year built
The number of floors*
* A separate explanation of why I defined it as numerical is at the end of the post.
Categorical
Binaries
Waterfront
View presence
Renovation condition
Basement presence
Multi-categorical
Maintenance condition
House grade
Methods
This section just gives you an idea of how I produced the results. If these methods look familiar to you or don't interest you, you can skip ahead to the results section. In fact, glancing at the results first may make it easier to see why I am writing this post.
Assumptions
There are several ways to validate a linear regression model; here I go over the four major assumptions: linearity, normality, homoscedasticity, and (no) multicollinearity. I chose this approach because it can be explained visually, and visualization conveys the concepts better than a wall of words and numbers.
1. Linearity
It is important to check the linearity assumption in a linear regression analysis. Since no polynomial transformation is applied in this analysis, the predicted house price (dependent variable) is compared directly to the actual house price.
Below is the Python code I used. I share it to give some idea of how the graph is created.
# Split the whole data set into training and test data
from sklearn.model_selection import train_test_split
X = independent_variables
y = house_price_column
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit the model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)

# Calculate predicted prices using the test data
y_pred = model.predict(X_test)

# Graphing part
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
perfect_line = np.arange(y_test.min(), y_test.max())
ax.plot(perfect_line, perfect_line, linestyle="--", color="orange", label="Perfect Fit")
ax.scatter(y_test, y_pred, alpha=0.5)
ax.set_xlabel("Actual Price")
ax.set_ylabel("Predicted Price")
ax.legend();
2. Normality
The normality assumption is related to the normality of model residuals. This is checked using a QQ plot.
import scipy.stats as stats
import statsmodels.api as sm

# Residuals of the test data
residuals = y_test - y_pred
sm.graphics.qqplot(residuals, dist=stats.norm, line='45', fit=True);
3. Homoscedasticity
The homoscedasticity assumption is checked by plotting the residuals against the predicted values of the dependent variable and seeing whether the points are dispersed without any pattern. Like normality, this assumption concerns the residuals.
fig, ax = plt.subplots()
residuals = y_test - y_pred
ax.scatter(y_pred, residuals, alpha=0.5)
ax.plot(y_pred, [0 for i in range(len(X_test))])
ax.set_xlabel("Predicted Value")
ax.set_ylabel("Actual - Predicted Value");
4. Multicollinearity
The multicollinearity check looks at dependence among the independent variables. Ideally, the independent variables should be as independent of one another as possible. Here it is measured with the variance inflation factor (VIF).
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
pd.Series(vif, index=X_train.columns, name="Variance Inflation Factor")
Transformations of numerical variables
Log transformation
This part is simple. All values in the numerical columns are natural-logged.
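As a minimal sketch, assuming the features live in pandas DataFrames named X_train and X_test (as in the code above) and using hypothetical column names, the transformation could look like this:

import numpy as np

# Hypothetical names for the numerical columns; adjust to your own data
numerical_cols = ["sqft_living", "sqft_lot", "yr_built", "floors"]

# Take the natural log of every value in the numerical columns
X_train[numerical_cols] = np.log(X_train[numerical_cols])
X_test[numerical_cols] = np.log(X_test[numerical_cols])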
Normalization
The formula below shows how each value of a numerical variable is standardized: the mean of the variable is subtracted from the value, and the result is divided by the variable's standard deviation.

z = (x - mean) / standard deviation
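A minimal sketch of that standardization, assuming the same hypothetical numerical_cols list as above and computing the statistics from the training split:

# Mean and standard deviation taken from the training data
means = X_train[numerical_cols].mean()
stds = X_train[numerical_cols].std()

# Subtract the mean and divide by the standard deviation
X_train[numerical_cols] = (X_train[numerical_cols] - means) / stds
X_test[numerical_cols] = (X_test[numerical_cols] - means) / stds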
Results
Here is the fun part. You can just relax and see how graphs and scores change.
The Raw Data - no transformation
1. Linearity
I see several outliers. Some linearity is observed only on the left side.
2. Normality
Only 1/3 of the dots are on the red line.
3. Homoscedasticity
4. Multicollinearity
Only VIF scores below 5 are considered acceptable. About half of the scores here are not acceptable.
sqft_living               8483.406359
sqft_lot                     1.200729
floors                      14.106084
waterfront                   1.085728
view                         1.344477
yr_built                    72.300842
is_renovated                 1.157421
has_basement                 2.175980
condition_Fair               1.038721
condition_Good               1.668386
condition_Very Good          1.295097
grade_11 Excellent           1.530655
grade_6 Low Average          5.129509
grade_7 Average             14.142031
grade_8 Good                 8.261598
grade_9 Better               3.446987
interaction               8460.117213
Log Transformation
1. Linearity
It shows much better linearity. The trend of the dots has a slightly lower slope than the perfect-fit line.
2. Normality
There is some slight kurtosis, but the majority of the dots are on the red line.
3. Homoscedasticity
This looks better, too. A slight pattern is observed on the right side.
4. Multicollinearity
Several scores are still too high.
sqft_living             471370.972327
sqft_lot                   155.772190
floors                       4.052275
yr_built                   922.928871
waterfront                   1.086052
view                         1.337069
is_renovated                 1.146855
has_basement                 2.438983
condition_Fair               1.042784
condition_Good               1.668688
condition_Very Good          1.283740
grade_11 Excellent           1.468962
grade_6 Low Average          5.221791
grade_7 Average             12.895007
grade_8 Good                 7.519577
grade_9 Better               3.355969
interaction             469074.416388
Log Transformation and Normalization
1. Linearity
The slope is slightly better and closer to the perfect line.
2. Normality
I don't see much difference from the previous graph.
3. Homoscedasticity
I don't see much difference from the previous graph.
4. Multicollinearity
All of the scores are now acceptable. This is a huge difference!
sqft_living                  3.001670
sqft_lot                     1.552016
floors                       2.046914
yr_built                     1.758294
waterfront                   1.086293
view                         1.313341
is_renovated                 1.148279
has_basement                 2.441147
condition_Fair               1.042169
condition_Good               1.647135
condition_Very Good          1.281906
grade_11 Excellent           1.278034
grade_6 Low Average          1.939542
grade_7 Average              2.077564
grade_8 Good                 1.609822
grade_9 Better               1.440610
interaction                  1.374175
Conclusion
The transformations helped satisfy (i.e., not violate) the four assumptions. The visualizations should be clear enough to show what improved at each step.
Extra
The number of floors
This is somewhat outside the main topic of this post, but the decision can be crucial to the overall regression analysis. Let me begin with the value counts of the floor variable.
1.0    10673
2.0     8235
1.5     1910
3.0      611
2.5      161
3.5        7
The left column shows that floor counts in the data range from 1 to 3.5. It might make more sense for the model to treat this variable as categorical. However, what if one wants to predict the price of a house with 4 floors? That is only possible if the variable is treated as numerical, as sketched below. I think it comes down to the goal of the analysis.
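Here is a toy sketch of that idea with made-up numbers (not the project's data): when floors is numeric, the fitted line simply extrapolates to a floor count that never appears in the training data.

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up floor counts and prices, for illustration only
floors = np.array([[1.0], [1.5], [2.0], [2.5], [3.0], [3.5]])
price = np.array([300000, 340000, 380000, 420000, 460000, 500000])

toy_model = LinearRegression().fit(floors, price)

# A 4-floor house still gets a prediction, even though 4.0 never appears above
print(toy_model.predict([[4.0]]))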