Trendlines and Regression Analysis
Modeling Relationships and Trends in Data
Create charts to better
understand data sets.
For cross-sectional data, use a
scatter chart.
For time series data, use a line
chart.
Common Mathematical Functions Used in Predictive Analytical Models
Linear: y = a + bx
Logarithmic: y = a + b ln(x)
Polynomial (2nd order): y = ax^2 + bx + c
Polynomial (3rd order): y = ax^3 + bx^2 + cx + d
Power: y = ax^b
Exponential: y = ab^x
(the base of natural logarithms, e = 2.71828…, is often used for the constant b)
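As an illustration beyond the slide itself, the logarithmic, power, and exponential families above can all be fitted with ordinary least squares after a transformation, which is essentially how trendline tools handle them. A minimal pure-Python sketch with hypothetical data:

```python
import math

def ols(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    return my - b1 * mx, b1  # (b0, b1)

# Hypothetical data generated from the exponential form y = a*e^(b*x)
# with a = 2 and b = 0.5.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0 * math.exp(0.5 * xi) for xi in x]

# Exponential fit: regress ln(y) on x, then a = e^(intercept), b = slope.
b0, b1 = ols(x, [math.log(yi) for yi in y])
a, b = math.exp(b0), b1  # recovers a ≈ 2, b ≈ 0.5
```

A logarithmic fit regresses y on ln(x) instead, and a power fit regresses ln(y) on ln(x).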
Excel Trendline Tool
Right-click on the data series and choose Add Trendline from the pop-up menu.
Check the boxes Display Equation on chart and Display R-squared value on chart.
R²
R² (R-squared) is a measure of the “fit” of the line to the data.
◦ The value of R² will be between 0 and 1.
◦ A value of 1.0 indicates a perfect fit, with all data points lying on the line; the larger the value of R², the better the fit.
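To make the definition concrete, here is a small sketch (not from the text) of how R² is computed as 1 − SSE/SST:

```python
def r_squared(y, y_hat):
    """R^2 = 1 - SSE/SST: the share of variation in y explained by the fit."""
    mean_y = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual sum of squares
    sst = sum((yi - mean_y) ** 2 for yi in y)              # total sum of squares
    return 1 - sse / sst

# A perfect fit gives 1.0; predicting the mean everywhere gives 0.0.
y = [2.0, 4.0, 6.0, 8.0]
perfect = r_squared(y, y)           # 1.0
baseline = r_squared(y, [5.0] * 4)  # 0.0
```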
Example 8.1: Modeling a Price-Demand Function
Linear demand function:
Sales = 20,512 - 9.5116(price)
Example 8.2: Predicting Crude Oil Prices
Line chart of historical crude oil prices
Example 8.2 Continued
Excel’s Trendline tool is used to fit various functions to the
data.
Exponential: y = 50.49e^(0.021x), R² = 0.664
Logarithmic: y = 13.02 ln(x) + 39.60, R² = 0.382
Polynomial 2°: y = 0.13x^2 − 2.399x + 68.01, R² = 0.905
Polynomial 3°: y = 0.005x^3 − 0.111x^2 + 0.648x + 59.497, R² = 0.928 (best fit)
Power: y = 45.96x^0.0169, R² = 0.397
Example 8.2 Continued
Third order polynomial trendline fit to the data
Figure 8.11
Think-Pair-Share on Regression Analysis
Think: Consider the role of regression analysis in your field or industry. Reflect on how regression analysis is currently being used to analyze relationships between variables, make predictions, or inform decision-making processes.
Copyright © 2013 Pearson Education, Inc.
publishing as Prentice Hall 9-10
Continued
Pair: Discuss with a colleague or team member
how regression analysis has impacted your work
or industry. Share examples of specific projects or
analyses where regression analysis was used and
the outcomes it yielded. Consider any challenges
or limitations you've encountered when working
with regression analysis.
Continued
Share: Share your insights and discussion points
with the wider team or group. Discuss the
potential of regression analysis to provide
valuable insights into complex relationships and
make accurate predictions. Exchange ideas on
how regression analysis can be further leveraged
to address specific challenges or capitalize on
opportunities within your organization or industry.
Caution About Polynomials
The R2 value will continue to increase as the order
of the polynomial increases; that is, a 4th order
polynomial will provide a better fit than a 3rd order,
and so on.
Higher order polynomials will generally not be very
smooth and will be difficult to interpret visually.
◦ Thus, we don't recommend going beyond a third-
order polynomial when fitting data.
Use your eye to make a good judgment!
Regression Analysis
Regression analysis is a tool for building mathematical and statistical models that characterize relationships between a dependent (ratio) variable and one or more independent, or explanatory, variables (ratio or categorical). All variables must be numerical; categorical variables are first coded numerically.
Simple linear regression involves a single
independent variable.
Multiple regression involves two or more
independent variables.
Simple Linear Regression
Finds a linear relationship between:
one independent variable X and
one dependent variable Y
First prepare a scatter plot to verify the data has a
linear trend.
Use alternative approaches if the data is not linear.
Example 8.3: Home Market Value Data
Size of a house is
typically related to its
market value.
X = square footage
Y = market value ($)
The scatter plot of the full
data set (42 homes)
indicates a linear trend.
Finding the Best-Fitting Regression Line
Market value = a + b × square feet
Two possible lines are shown below.
Line A is clearly a better fit to the data.
We want to determine the best regression line.
Example 8.4: Using Excel to Find the
Best Regression Line
Market value = $32,673 + $35.036 × square feet
◦ The estimated market value of a home with 2,200 square feet would be: market value = $32,673 + $35.036 × 2,200 = $109,752
The regression model
explains variation in
market value due to
size of the home.
It provides better
estimates of market
value than simply
using the average.
Least-Squares Regression
Simple linear regression model: Y = β0 + β1X + ε
We estimate the parameters β0 and β1 from the sample data as b0 and b1.
Let Xi be the value of the independent variable for the ith observation. When the value of the independent variable is Xi, then Ŷi = b0 + b1Xi is the estimated value of Y for Xi.
Residuals
Residuals are the observed errors associated with estimating the value of the dependent variable using the regression line: ei = Yi − Ŷi
Least Squares Regression
The best-fitting line minimizes the sum of squares of the
residuals.
Excel functions:
◦ =INTERCEPT(known_y’s, known_x’s)
◦ =SLOPE(known_y’s, known_x’s)
Example 8.5: Using Excel Functions to
Find Least-Squares Coefficients
Slope = b1 = 35.036
=SLOPE(C4:C45, B4:B45)
Intercept = b0 = 32,673
=INTERCEPT(C4:C45, B4:B45)
Estimate Ŷ when X = 1750 square feet:
Ŷ = 32,673 + 35.036(1750) = $93,986
=TREND(C4:C45, B4:B45, 1750)
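The Excel functions above have straightforward least-squares counterparts. A pure-Python sketch, using a small hypothetical dataset constructed to lie exactly on the chapter's fitted line so the results match Example 8.5:

```python
def slope_intercept(x, y):
    """Equivalents of Excel's =SLOPE and =INTERCEPT."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    return b1, my - b1 * mx

def trend(x, y, new_x):
    """Equivalent of =TREND: predict y at new_x from the fitted line."""
    b1, b0 = slope_intercept(x, y)
    return b0 + b1 * new_x

# Hypothetical homes placed exactly on market value = 32,673 + 35.036 * sqft.
sqft = [1000.0, 1500.0, 2000.0, 2500.0]
value = [32673 + 35.036 * s for s in sqft]

b1, b0 = slope_intercept(sqft, value)  # ≈ 35.036 and ≈ 32,673
pred = trend(sqft, value, 1750)        # ≈ $93,986
```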
Simple Linear Regression With Excel
Data > Data Analysis >
Regression
Input Y Range (with
header)
Input X Range (with
header)
Check Labels
Excel outputs a table with
many useful regression
statistics.
Home Market Value Regression
Results
Regression Statistics
Multiple R - | r |, where r is the sample correlation
coefficient. The value of r varies from -1 to +1 (r is
negative if slope is negative)
R Square - coefficient of determination, R2, which
varies from 0 (no fit) to 1 (perfect fit)
Adjusted R Square - adjusts R2 for sample size
and number of X variables
Standard Error - variability between observed
and predicted Y values. This is formally called the
standard error of the estimate, SYX.
Example 8.6: Interpreting Regression
Statistics for Simple Linear Regression
53% of the variation in home market values
can be explained by home size.
The standard error of $7287 is less than
standard deviation (not shown) of $10,553.
Regression as Analysis of Variance
ANOVA conducts an F-test to determine whether
variation in Y is due to varying levels of X.
ANOVA is used to test for significance of regression:
H0: population slope coefficient = 0
H1: population slope coefficient ≠ 0
Excel reports the p-value (Significance F).
Rejecting H0 indicates that X explains variation in Y.
Example 8.7: Interpreting Significance of
Regression
p-value (Significance F) = 3.798 × 10^-8, so home size is a significant variable.
◦ Reject H0: the slope is not equal to zero. Using a linear relationship, home size is a significant variable in explaining variation in market value.
Testing Hypotheses for Regression
Coefficients
An alternate method for testing whether a slope or intercept is zero is to use a t-test: t = b1 / sb1, where sb1 is the standard error of the coefficient.
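As a sketch of what Excel computes behind the scenes (an assumed construction, on hypothetical data): the t-statistic for the slope divides b1 by its standard error, which is the standard error of the estimate divided by the square root of Sxx.

```python
import math

def slope_t_stat(x, y):
    """t = b1 / se(b1), for testing H0: slope = 0."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    b0 = my - b1 * mx
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_yx = math.sqrt(sse / (n - 2))  # standard error of the estimate
    se_b1 = s_yx / math.sqrt(sxx)    # standard error of the slope
    return b1 / se_b1

# Hypothetical data with a strong linear trend: |t| comes out large,
# so the slope would be judged significant.
t = slope_t_stat([1, 2, 3, 4, 5, 6], [2.1, 3.9, 6.2, 7.8, 10.1, 11.9])
```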
Excel provides the p-values for tests on the slope and
intercept.
Example 8.8: Interpreting Hypothesis
Tests for Regression Coefficients
Use p-values to draw conclusion
Neither coefficient is statistically equal to zero.
Confidence Intervals for Regression
Coefficients
Confidence intervals (Lower 95% and Upper 95%
values in the output) provide information about the
unknown values of the true regression
coefficients, accounting for sampling error.
We may also use confidence intervals to test
hypotheses about the regression coefficients.
◦ To test the hypothesis that a coefficient equals a specified value, check whether that value falls within the confidence interval for the coefficient. If it does not, reject the null hypothesis.
Example 8.9: Interpreting Confidence
Intervals for Regression Coefficients
For the Home Market Value data, a 95% confidence
interval for the intercept is [14,823, 50,523], and for the
slope, [24.59, 45.48].
Although we estimated that a house with 1,750 square
feet has a market value of 32,673 + 35.036(1,750)
=$93,986, if the true population parameters are at the
extremes of the confidence intervals, the estimate might
be as low as 14,823 + 24.59(1,750) = $57,855 or as high
as 50,523 + 45.48(1,750) = $130,113.
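The arithmetic in Example 8.9 can be checked directly; note the low endpoint rounds to $57,855:

```python
# Point estimate and the extremes implied by the 95% confidence intervals
# for the Home Market Value regression, at X = 1,750 square feet.
point = 32673 + 35.036 * 1750  # ≈ $93,986
low   = 14823 + 24.59 * 1750   # 57,855.5, i.e. ≈ $57,855 after rounding
high  = 50523 + 45.48 * 1750   # ≈ $130,113
```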
Residual Analysis and Regression Assumptions
Residual = Actual Y value − Predicted Y value
Standard residual = residual / standard deviation of the residuals
Rule of thumb: standard residuals outside of ±2 or ±3 are potential outliers.
Excel provides a table and a plot of residuals.
(In the residual plot for this example, one point has a standard residual of 4.53, flagging it as a potential outlier.)
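A minimal sketch of the outlier rule of thumb (assumption: standardizing by the sample standard deviation of the residuals, as the slide describes; the data are hypothetical):

```python
import statistics

def standard_residuals(y, y_hat):
    """Each residual divided by the standard deviation of the residuals."""
    resid = [yi - fi for yi, fi in zip(y, y_hat)]
    s = statistics.stdev(resid)
    return [r / s for r in resid]

# Hypothetical data: the last observation sits far from its prediction.
y     = [10.0, 12.0, 11.0, 13.0, 30.0]
y_hat = [10.5, 11.5, 11.5, 12.5, 12.0]
flags = [abs(sr) > 2 for sr in standard_residuals(y, y_hat)]  # last one flagged
```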
Checking Assumptions
Linearity
examine scatter diagram (should appear linear)
examine residual plot (should appear random)
Normality of Errors
view a histogram of standard residuals
regression is robust to departures from normality
Homoscedasticity: variation about the regression line is
constant
examine the residual plot
Independence of Errors: successive observations should
not be related.
This is important when the independent variable is time.
Example 8.11: Checking Regression
Assumptions for the Home Market Value Data
Linearity - linear trend in scatterplot
- no pattern in residual plot
Example 8.11 Continued
Normality of Errors – residual histogram appears
slightly skewed but is not a serious departure
Example 8.11 Continued
Homoscedasticity – residual plot shows no serious
difference in the spread of the data for different X
values.
Example 8.11 Continued
Independence of Errors – Because the data is
cross-sectional, we can assume this assumption
holds.
Multiple Linear Regression
A linear regression model with more than one
independent variable is called a multiple linear
regression model.
Role Play Exploring Linear Regression
Analysis
Pooja: Hey Ajay, come on in! I've been diving into
the marketing data you requested. Have a seat.
Ajay: Thanks, Pooja. I'm really curious about our
marketing campaign's effectiveness, especially
regarding our online ads. Can you shed some light
on that?
Pooja: Absolutely! I've been running some
analyses, and it seems like linear regression could
provide some insights into how our online ad
spending correlates with website traffic or
conversions.
Continued
Ajay: Linear regression? Can you explain that in
simpler terms?
Pooja: Of course. Linear regression is a statistical
method used to model the relationship between
two variables, typically by fitting a straight line to
the data points. In our case, we're looking at how
changes in our online ad spending impact website
traffic or conversions.
Ajay: Ah, got it. So, how do we go about doing
this?
Continued
Pooja: With linear regression, we'll be able to
quantify the impact of our ad spending on website
traffic or conversions. For example, we might find
that for every dollar increase in ad spending,
website traffic increases by a certain number of
visits. This information can help us optimize our
marketing budget allocation for maximum impact.
Estimated Multiple Regression
Equation
We estimate the regression coefficients (called partial regression coefficients) b0, b1, b2, …, bk, then use the model:
Ŷ = b0 + b1X1 + b2X2 + … + bkXk
The partial regression coefficients represent the
expected change in the dependent variable when
the associated independent variable is increased
by one unit while the values of all other
independent variables are held constant.
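As a hedged sketch of what the Excel Regression tool computes for a multiple regression (an assumed construction via the normal equations, on hypothetical data):

```python
def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting for a small system A x = b."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def multiple_regression(X, y):
    """Least-squares coefficients b for y ≈ b0 + b1*x1 + ... + bk*xk,
    via the normal equations (X'X) b = X'y."""
    Z = [[1.0] + list(row) for row in X]  # prepend the intercept column
    k = len(Z[0])
    XtX = [[sum(z[i] * z[j] for z in Z) for j in range(k)] for i in range(k)]
    Xty = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(k)]
    return solve(XtX, Xty)

# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2.
X = [[1, 1], [2, 1], [1, 2], [3, 2], [2, 3]]
y = [1 + 2 * a + 3 * c for a, c in X]
b = multiple_regression(X, y)  # recovers ≈ [1, 2, 3]
```

Because the data are generated exactly from the model, the recovered partial regression coefficients are 1, 2, and 3.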
Excel Regression Tool
The independent variables in the spreadsheet must be in
contiguous columns.
◦ So, you may have to manually move the columns of data around
before applying the tool.
Key differences:
Multiple R and R Square are called the multiple
correlation coefficient and the coefficient of multiple
determination, respectively, in the context of multiple
regression.
ANOVA tests for significance of the entire model, not just a single coefficient.
ANOVA for Multiple Regression
ANOVA tests for significance of the entire model. That is, it computes an F-statistic for testing the hypotheses:
H0: β1 = β2 = … = βk = 0
H1: at least one βj ≠ 0
The multiple linear regression output also provides
information to test hypotheses about each of the
individual regression coefficients.
◦ If we reject the null hypothesis that the slope associated with
independent variable i is 0, then the independent variable i is
significant and improves the ability of the model to better predict
the dependent variable. If we cannot reject H0, then that
independent variable is not significant and probably should not be
included in the model.
Example 8.12: Interpreting Regression Results
for the Colleges and Universities Data
Predict student graduation rates using several indicators:
Example 8.12 Continued
Regression model
The value of R2 indicates that 53% of the variation in the dependent
variable is explained by these independent variables.
All coefficients are statistically significant.
Model Building Issues
A good regression model should include only
significant independent variables.
Adding an independent variable to a regression
model will always result in R2 equal to or greater
than the R2 of the original model.
Adjusted R² reflects both the sample size and the number of independent variables. An increase in adjusted R² indicates that the model has improved.
Systematic Model Building Approach
1. Construct a model with all available independent
variables. Check for significance of the independent
variables by examining the p-values.
2. Identify the independent variable having the largest p-
value that exceeds the chosen level of significance.
3. Remove the variable identified in step 2 from the
model and evaluate adjusted R2.
(Don’t remove all variables with p-values that exceed α at the same time; remove only one at a time.)
4. Continue until all variables are significant.
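The four steps above amount to backward elimination. A schematic sketch of the loop; the `fit_p_values` function and its p-values are purely hypothetical stand-ins for re-running the regression and reading the output:

```python
ALPHA = 0.05

def fit_p_values(variables):
    """Hypothetical stand-in for refitting the regression and reading the
    p-values from the output; each variable carries a fixed illustrative value."""
    table = {"Age": 0.001, "Income": 0.020, "Education": 0.300, "HomeValue": 0.450}
    return {v: table[v] for v in variables}

def backward_eliminate(variables):
    vars_left = list(variables)
    while True:
        pvals = fit_p_values(vars_left)
        worst = max(pvals, key=pvals.get)   # largest p-value
        if pvals[worst] <= ALPHA:
            return vars_left                # all remaining variables significant
        vars_left.remove(worst)             # drop ONE variable, then refit

result = backward_eliminate(["Age", "Income", "Education", "HomeValue"])
# Drops HomeValue, then Education; Age and Income remain.
```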
Example 8.13: Identifying the Best Regression Model
Banking Data
Home value has the
largest p-value; drop
and re-run the
regression.
Example 8.13 Continued
Bank regression after removing Home Value
Adjusted R2 improves slightly.
All X variables are significant.
Alternate Criterion
Use the t-statistic.
If | t | < 1, then the standard error will decrease
and adjusted R2 will increase if the variable is
removed. If | t | > 1, then the opposite will occur.
You can follow the same systematic approach,
except using t-values instead of p-values.
Multicollinearity
Multicollinearity occurs when there are strong correlations among the independent variables, so that they can predict each other better than the dependent variable.
Correlations exceeding ±0.7 may indicate multicollinearity.
The variance inflation factor is a better indicator, but it is not computed in Excel.
Example 8.14: Identifying Potential Multicollinearity
Colleges and Universities correlation matrix; none exceed the recommended threshold of ±0.7
Banking Data correlation matrix; large correlations exist
Example 8.14 Continued
If we remove Wealth from the model, the adjusted R2 drops to
0.9201, but we discover that Education is no longer significant.
Dropping Education and leaving only Age and Income in the model
results in an adjusted R2 of 0.9202.
However, if we remove Income from the model instead of Wealth,
the Adjusted R2 drops to only 0.9345, and all remaining variables
(Age, Education, and Wealth) are significant.
Practical Issues in Trendline and
Regression Modeling
Identifying the best regression model often requires
experimentation and trial and error.
The independent variables selected should make sense in attempting to explain the dependent variable.
Additional variables increase R2 and, therefore,
help to explain a larger proportion of the variation.
Good models are as simple as possible (the
principle of parsimony).
Overfitting
Overfitting means fitting a model too closely to the sample data at the risk of not fitting it well to the population in which we are interested.
In multiple regression, if we add too many terms to the
model, then the model may not adequately predict other
values from the population.
Overfitting can be mitigated by using good logic, intuition, theory, and parsimony.
Regression with Categorical Variables
Regression analysis requires numerical data.
Categorical data can be included as independent variables, but must be coded numerically using dummy variables.
For variables with 2 categories, code as 0 and 1.
CaseLets: Regression Analysis in Marketing
What is the dependent variable in this
case study?
Identify the independent variables.
Why is regression analysis an appropriate
method for this case study?
What type of regression analysis would be
most suitable for this case study, and
why?
Example 8.15: A Model with Categorical
Variables
Employee Salaries provides data for 35 employees
Predict Salary using Age and MBA (code as
yes=1, no=0)
Example 8.15 Continued
Salary = 893.59 + 1044.15 × Age + 14,767.23 × MBA
◦ If MBA = 0: salary = 893.59 + 1044.15 × Age
◦ If MBA = 1: salary = 15,660.82 + 1044.15 × Age
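The fitted model can be written as a small function to see how the dummy variable shifts the intercept (a sketch using the coefficients from Example 8.15):

```python
def salary(age, mba):
    """Fitted model from Example 8.15; mba is a 0/1 dummy variable."""
    return 893.59 + 1044.15 * age + 14767.23 * mba

# The dummy produces two parallel lines with the same slope:
# the MBA intercept is 893.59 + 14,767.23 = 15,660.82.
gap = salary(30, 1) - salary(30, 0)  # 14,767.23 at any age
```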
Interactions
An interaction occurs when the effect of one
variable is dependent on another variable.
We can test for interactions by defining a new
variable as the product of the two variables,
X3 = X1 × X2 , and testing whether this
variable is significant, leading to an
alternative model.
Example 8.16: Incorporating Interaction
Terms in a Regression Model
Define an interaction between
Age and MBA and re-run the
regression.
The MBA indicator is not significant; drop and re-run.
Example 8.16 Continued
Adjusted R2 increased slightly, and both age and the
interaction term are significant. The final model is
salary = 3,323.11 + 984.25 × age + 425.58 × MBA × age
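With the interaction term, the dummy changes the slope rather than the intercept. A sketch using the final coefficients from Example 8.16:

```python
def salary(age, mba):
    """Final model of Example 8.16: 3,323.11 + 984.25*age + 425.58*mba*age."""
    return 3323.11 + 984.25 * age + 425.58 * mba * age

# Per additional year of age, salary rises by 984.25 without an MBA,
# and by 984.25 + 425.58 = 1,409.83 with one.
slope_no_mba = salary(31, 0) - salary(30, 0)
slope_mba    = salary(31, 1) - salary(30, 1)
```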
Categorical Variables with More Than Two
Levels
When a categorical variable
has k > 2 levels, we need to
add k - 1 additional variables to
the model.
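A sketch of the k − 1 coding (level names are hypothetical; the first level serves as the baseline):

```python
def dummy_code(values, levels):
    """Code a categorical variable with k levels as k-1 binary columns.
    The first level is the baseline and maps to all zeros."""
    return [[1 if v == lev else 0 for lev in levels[1:]] for v in values]

tools = ["A", "C", "D", "A", "B"]
cols = dummy_code(tools, ["A", "B", "C", "D"])
# "A" -> [0, 0, 0] (baseline); "B" -> [1, 0, 0]; "C" -> [0, 1, 0]; "D" -> [0, 0, 1]
```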
Example 8.17: A Regression Model with
Multiple Levels of Categorical Variables
The Excel file Surface
Finish provides
measurements of the
surface finish of 35 parts
produced on a lathe,
along with the
revolutions per minute
(RPM) of the spindle
and one of four types of
cutting tools used.
Example 8.17 Continued
Because we have k = 4 levels of tool type, we will define a regression model of the form:
Surface finish = β0 + β1 × RPM + β2 × Type B + β3 × Type C + β4 × Type D
where Type B, Type C, and Type D are 0/1 dummy variables and tool type A is the baseline.
Example 8.17 Continued
Add 3 columns to
the data, one for
each of the tool
type variables
Example 8.17 Continued
Regression results
Surface finish = 24.49 + 0.098 RPM - 13.31 type B - 20.49 type C -
26.04 type D
Regression Models with Nonlinear Terms
Curvilinear models may be appropriate when
scatter charts or residual plots show nonlinear
relationships.
A second-order polynomial might be used: Y = β0 + β1X + β2X² + ε
Here β1 represents the linear effect of X on Y and β2 represents the curvilinear effect.
This model is linear in the β parameters so we can
use linear regression methods.
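Because the model is linear in the β's, the curvilinear effect shows up as a constant change in successive differences. A quick check using the second-order fit from Example 8.2:

```python
def quad(x):
    """Second-order polynomial trendline from Example 8.2:
    y = 0.13x^2 - 2.399x + 68.01."""
    return 0.13 * x * x - 2.399 * x + 68.01

# beta1 gives the linear effect; beta2 bends the line: for unit steps in x,
# successive differences grow by a constant 2*beta2 = 0.26.
d1 = quad(11) - quad(10)
d2 = quad(12) - quad(11)
curvature = d2 - d1  # 2 * 0.13 = 0.26
```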
Debate on Regression Analysis
Proponent Argument:
Regression analysis is a powerful statistical tool
that allows us to explore relationships between
variables and make predictions based on data. It
is widely used across various fields, including
economics, social sciences, and marketing, to
understand complex phenomena and make
informed decisions.
Continued
While it's true that regression analysis has its limitations,
these challenges can often be addressed with proper
data preprocessing, model selection, and validation
techniques. For example, techniques such as robust
regression and regularization can help mitigate the
impact of outliers and overfitting, respectively. Moreover,
advancements in statistical methods, such as machine
learning algorithms, offer alternatives to traditional
regression models, allowing for more flexibility and
robustness in modeling complex relationships.
Conclusion
While regression analysis has its limitations and challenges, its benefits outweigh its drawbacks when used judiciously and in the right context. It remains a valuable tool for exploring relationships between variables, making predictions, and informing decision-making across various domains. As data analytics continues to evolve, regression analysis will continue to play a vital role in extracting actionable insights from data and driving innovation and growth.