Regression
Data Mining for Business Intelligence
Shmueli, Patel & Bruce
© Galit Shmueli and Peter Bruce 2010
Topics
Explanatory vs. predictive modeling with regression
Example: prices of Toyota Corollas
Fitting a predictive model
Assessing predictive accuracy
Selecting a subset of predictors (variable selection)
Explanatory Modeling
Goal: Explain the relationship between predictors (explanatory variables) and target
Familiar use of regression in data analysis
Multiple linear regression – linear relationship between a dependent variable Y (response) and a set of predictors X1,…,Xp
Model Goal: Fit the data well and understand the contribution of explanatory variables to the model
Model performance assessed by residual analysis
Model fitted to the entire dataset
Predictive Modeling
Goal: Predict target values in new data where we have predictor values, but not target values
Classic data mining context
Model Goal: Optimize predictive accuracy – how accurately can the fitted model predict new cases
Model trained on training data; performance assessed on validation or test data
Explaining the role of predictors is not the primary purpose (although useful)
Regression Method
Predict the value of the dependent variable Y based on predictors X1,…,Xp
Regression coefficients β0 (intercept), β1, β2,…, βp in the equation:
Y = β0 + β1X1 + β2X2 + … + βpXp + ε (ε = noise term)
Coefficients estimated via ordinary least squares (OLS) method
Estimated using training sample
Predictive capacity assessed by prediction results on validation set – average squared error
Assumptions – normality, independence, linearity
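The fit-then-validate procedure on this slide can be sketched as follows. This is a minimal illustration, not the XLMiner workflow used in the deck: the data are synthetic (not the Toyota Corolla file), the coefficient values are made up, the design matrix includes an intercept column, and the split simply takes the first 60% of rows for training. The helper name add_intercept is hypothetical.

```python
import numpy as np

# Synthetic data (NOT the Toyota Corolla data): 200 records, 2 predictors.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
true_beta = np.array([5.0, 2.0, -1.0])  # made-up intercept, beta1, beta2
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.5, size=n)

# Assumption: first 60% of rows for training, remaining 40% for validation.
n_train = int(0.6 * n)
X_train, X_valid = X[:n_train], X[n_train:]
y_train, y_valid = y[:n_train], y[n_train:]


def add_intercept(X):
    """Prepend a column of ones so the first coefficient is the intercept."""
    return np.column_stack([np.ones(len(X)), X])


# OLS estimate from the training sample only.
beta_hat, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)

# Predictive capacity: average squared error on the validation set.
valid_pred = add_intercept(X_valid) @ beta_hat
valid_mse = float(np.mean((y_valid - valid_pred) ** 2))
```

Using np.linalg.lstsq rather than inverting X'X directly is a numerical-stability choice; both give the OLS solution when the predictors are not perfectly collinear.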
Example: Prices of Toyota Corolla
ToyotaCorolla.xls
Goal: Predict sale prices of used Toyota Corollas based on their specification
Data: Prices of 1442 used Toyota Corollas, with their specification information – age, mileage, fuel type, engine size
Data Sample
(showing only the variables to be used in analysis)
Variables Used
Price in Euros
Age in months as of 8/04
KM (kilometers)
Fuel Type (diesel, petrol, CNG)
HP (horsepower)
Metallic color (1=yes, 0=no)
Automatic transmission (1=yes, 0=no)
CC (cylinder volume)
Doors
Quarterly_Tax (road tax)
Weight (in kg)
Preprocessing
Fuel type is categorical and must be transformed into binary variables
Diesel (1=yes, 0=no)
CNG (1=yes, 0=no)
None needed for “Petrol” (reference category)
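A minimal sketch of this dummy coding, assuming the category labels are spelled exactly "Petrol", "Diesel" and "CNG" (the spelling in the actual file may differ). The function name fuel_dummies and the sample values are hypothetical.

```python
def fuel_dummies(fuel_type):
    """Return (Diesel, CNG) indicators; Petrol is the reference category."""
    return (int(fuel_type == "Diesel"), int(fuel_type == "CNG"))


# Illustrative values only, not rows from the dataset.
fuel_types = ["Petrol", "Diesel", "CNG", "Petrol"]
encoded = [fuel_dummies(f) for f in fuel_types]
# Petrol rows get (0, 0), so no third indicator is required.
```

A third indicator for Petrol would be redundant: the three indicators would always sum to 1, duplicating the intercept column and making the coefficients non-unique.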
Subset of the records selected for training partition (limited # of variables shown)
60% training data / 40% validation data
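A 60/40 partition of this kind can be sketched as a random shuffle of record indices. The function name partition_indices, the seed, and the record count of 1000 are assumptions for illustration; the deck does not say how XLMiner draws its partition.

```python
import numpy as np


def partition_indices(n_records, train_fraction=0.6, seed=1):
    """Randomly split record indices into training and validation sets."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_records)
    n_train = int(round(train_fraction * n_records))
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical record count, chosen only to make the 60/40 sizes obvious.
train_idx, valid_idx = partition_indices(1000)
```

Fixing the seed makes the partition reproducible, so the same records land in the validation set each time the analysis is rerun.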
Multiple linear regression model fitted using ONLY training data
The Fitted Regression Model
(XLMiner output)
Predicted Values
Predicted price computed using regression coefficients
Residuals = difference between actual and predicted prices
Error reports
Error for the validation set is usually larger than that of the training set (as expected)
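The residual definition above leads directly to summary error measures. The sketch below computes three common ones (total sum of squared errors, root-mean-squared error, and average error); whether these match the exact columns of the XLMiner report is an assumption. The function name error_report and the prices are made up for illustration.

```python
import numpy as np


def error_report(actual, predicted):
    """Summarize residuals (actual minus predicted) with common error measures."""
    residuals = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return {
        "sse": float(np.sum(residuals ** 2)),             # total squared error
        "rmse": float(np.sqrt(np.mean(residuals ** 2))),  # root-mean-squared error
        "average_error": float(np.mean(residuals)),       # signed; near 0 if unbiased
    }


# Made-up prices for illustration only.
actual = [10000.0, 12000.0, 9000.0, 11000.0]
predicted = [9500.0, 12500.0, 9000.0, 10000.0]
report = error_report(actual, predicted)
```

Calling error_report once on the training partition and once on the validation partition gives the side-by-side comparison the slide describes.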
Distribution of Residuals
Symmetric distribution
Some outliers
Average error = 116
50% of errors between ±860
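Statements such as "50% of errors fall within a given band" come from the quartiles of the residuals. A minimal sketch, using synthetic residuals (not the residuals from the Toyota model) and the 1.5 × IQR boxplot convention for flagging outliers, which is an assumption here rather than something the deck specifies:

```python
import numpy as np

# Synthetic residuals for illustration; scale and size are arbitrary.
rng = np.random.default_rng(42)
residuals = rng.normal(loc=0.0, scale=1000.0, size=500)

# The middle 50% of errors lies between the first and third quartiles.
q1, median, q3 = np.percentile(residuals, [25, 50, 75])
iqr = q3 - q1

# Assumed boxplot rule: points beyond 1.5 * IQR from the quartiles are outliers.
is_outlier = (residuals < q1 - 1.5 * iqr) | (residuals > q3 + 1.5 * iqr)
outliers = residuals[is_outlier]

# Fraction of residuals inside the interquartile band.
frac_in_band = float(np.mean((residuals >= q1) & (residuals <= q3)))
```

A symmetric distribution shows up as a median close to zero and quartiles of roughly equal magnitude on either side.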
Selecting Subsets of Predictors
Goal: Find parsimonious model (the simplest model that performs sufficiently well)
Expensive or impossible to measure all predictors for future predictions
More robust
Multicollinearity can lead to unstable regression coefficients and hence increase variation in predictions and lower predictive accuracy
Dropping predictors that are correlated with the target can increase bias (average error)
Trade-off between too few and too many predictors: the bias-variance trade-off
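The claim that multicollinearity destabilizes coefficients can be checked with a small simulation. The sketch below repeatedly draws synthetic data with two predictors at a chosen correlation, refits OLS each time, and measures how much the first slope estimate varies. All settings (sample size, number of repetitions, the true coefficients, the name coef_spread) are illustrative assumptions.

```python
import numpy as np


def coef_spread(rho, n=100, n_reps=200, seed=0):
    """Std. dev. of the first slope estimate across repeated synthetic samples."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])  # predictor correlation = rho
    estimates = []
    for _ in range(n_reps):
        X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
        y = 1.0 + X[:, 0] + X[:, 1] + rng.normal(size=n)
        design = np.column_stack([np.ones(n), X])
        beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
        estimates.append(beta_hat[1])
    return float(np.std(estimates))


spread_uncorrelated = coef_spread(rho=0.0)
spread_collinear = coef_spread(rho=0.99)
```

With nearly collinear predictors the data cannot separate the two effects, so the individual slope estimates swing widely from sample to sample even though their sum stays well determined.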
Variable selection methods
Use domain knowledge – some practical considerations:
Expense of collecting future data on predictors
Missing values and inaccurate measurements
Irrelevance to the problem at hand
High correlations
Two primary methods:
Exhaustive Search
Partial Search Algorithms
Forward selection
Backward elimination
Stepwise regression
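Forward selection, the first of the partial search algorithms listed above, can be sketched as a greedy loop: start with no predictors and repeatedly add the one that most improves the model, stopping when nothing helps. The deck does not state the selection criterion, so validation-set mean squared error is used here as an assumption; the function names and the synthetic data are likewise hypothetical.

```python
import numpy as np


def fit_mse(X_train, y_train, X_valid, y_valid, cols):
    """Fit OLS (with intercept) on the given columns; return validation MSE."""
    def design(X):
        return np.column_stack([np.ones(len(X))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(design(X_train), y_train, rcond=None)
    return float(np.mean((y_valid - design(X_valid) @ beta) ** 2))


def forward_selection(X_train, y_train, X_valid, y_valid):
    """Greedily add the predictor that lowers validation MSE the most."""
    remaining = list(range(X_train.shape[1]))
    selected = []
    best_mse = fit_mse(X_train, y_train, X_valid, y_valid, selected)
    while remaining:
        scores = [
            (fit_mse(X_train, y_train, X_valid, y_valid, selected + [c]), c)
            for c in remaining
        ]
        mse, col = min(scores)
        if mse >= best_mse:  # no candidate improves the model: stop
            break
        selected.append(col)
        remaining.remove(col)
        best_mse = mse
    return selected, best_mse


# Synthetic data: only predictors 0 and 2 actually affect y.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=300)
selected, mse = forward_selection(X[:180], y[:180], X[180:], y[180:])
```

Backward elimination mirrors this loop by starting from all predictors and removing one at a time; stepwise regression additionally reconsiders dropping earlier choices at each step. Because the validation data guide the search here, a separate test partition would be needed for an honest final error estimate.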
Exhaustive Search
All possible subsets of predictors assessed (single, pairs, triplets, etc.)
Computationally intensive
Judge by “adjusted