Unit 3 - Predictive Analysis

The document discusses trendlines and regression analysis, emphasizing the use of various mathematical functions and Excel tools to model relationships in data. It covers simple and multiple linear regression, the importance of R-squared values, and the significance of regression coefficients through hypothesis testing. Additionally, it highlights the need for careful model building and checking assumptions to ensure accurate predictions.

Trendlines and Regression Analysis

Modeling Relationships and Trends in Data

 Create charts to better understand data sets.
 For cross-sectional data, use a scatter chart.
 For time series data, use a line chart.
Common Mathematical Functions Used in Predictive Analytical Models

Linear: y = a + bx
Logarithmic: y = a + b ln(x)
Polynomial (2nd order): y = ax^2 + bx + c
Polynomial (3rd order): y = ax^3 + bx^2 + cx + d
Power: y = ax^b
Exponential: y = ab^x
(the base of natural logarithms, e = 2.71828…, is often used for the constant b)
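The power and exponential forms can be fitted by taking logarithms, which turns them into straight-line fits. The sketch below is a minimal illustration in Python with NumPy, using made-up x and y arrays rather than any data from this unit; np.polyfit returns the slope first and the intercept second.

import numpy as np

x = np.array([1., 2., 3., 4., 5., 6.])        # hypothetical predictor values
y = np.array([52., 55., 60., 66., 74., 83.])  # hypothetical response values

b, a = np.polyfit(x, y, 1)                    # linear: y = a + bx
print("linear:", a, b)

lnb, lna = np.polyfit(x, np.log(y), 1)        # exponential: ln(y) = ln(a) + x*ln(b)
print("exponential:", np.exp(lna), np.exp(lnb))   # a and b in y = a*b**x

pb, plna = np.polyfit(np.log(x), np.log(y), 1)    # power: ln(y) = ln(a) + b*ln(x)
print("power:", np.exp(plna), pb)                 # a and b in y = a*x**b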
Excel Trendline Tool
 Right click on data series
and choose Add trendline
from pop-up menu
 Check the boxes Display
Equation on chart and
Display R-squared value
on chart
R2
 R2 (R-squared) is a measure of the “fit” of the line
to the data.
◦ The value of R2 will be between 0 and 1.
◦ A value of 1.0 indicates a perfect fit and all data points
would lie on the line; the larger the value of R2 the better
the fit.
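For any fitted line, R2 can be computed directly from the residuals as one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch, assuming NumPy arrays y (observed values) and y_hat (values predicted by the trendline):

import numpy as np

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)       # variation left unexplained by the line
    ss_tot = np.sum((y - y.mean()) ** 2)    # total variation around the mean of y
    return 1.0 - ss_res / ss_tot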
Example 8.1: Modeling a Price-Demand Function
 Linear demand function: Sales = 20,512 − 9.5116 × price
Example 8.2: Predicting Crude Oil Prices
 Line chart of historical crude oil prices
Example 8.2 Continued
 Excel’s Trendline tool is used to fit various functions to the data.

Exponential: y = 50.49e^(0.021x), R2 = 0.664
Logarithmic: y = 13.02 ln(x) + 39.60, R2 = 0.382
Polynomial 2°: y = 0.13x^2 − 2.399x + 68.01, R2 = 0.905
Polynomial 3°: y = 0.005x^3 − 0.111x^2 + 0.648x + 59.497, R2 = 0.928 *
Power: y = 45.96x^0.0169, R2 = 0.397
Example 8.2 Continued
 Third order polynomial trendline fit to the data

Figure 8.11
Think-Pair-Share on Regression Analysis

 Think: Consider the role of regression analysis in your field or industry. Reflect on how regression analysis is currently being used to analyze relationships between variables, make predictions, or inform decision-making processes.

Continue

 Pair: Discuss with a colleague or team member


how regression analysis has impacted your work
or industry. Share examples of specific projects or
analyses where regression analysis was used and
the outcomes it yielded. Consider any challenges
or limitations you've encountered when working
with regression analysis.

Continue..
 Share: Share your insights and discussion points
with the wider team or group. Discuss the
potential of regression analysis to provide
valuable insights into complex relationships and
make accurate predictions. Exchange ideas on
how regression analysis can be further leveraged
to address specific challenges or capitalize on
opportunities within your organization or industry.

Caution About Polynomials
 The R2 value will continue to increase as the order
of the polynomial increases; that is, a 4th order
polynomial will provide a better fit than a 3rd order,
and so on.
 Higher order polynomials will generally not be very
smooth and will be difficult to interpret visually.
◦ Thus, we don't recommend going beyond a third-
order polynomial when fitting data.
 Use your eye to make a good judgment!
Regression Analysis
 Regression analysis is a tool for building
mathematical and statistical models that
characterize relationships between a dependent
(ratio) variable and one or more independent, or
explanatory variables (ratio or categorical), all of
which are numerical.
 Simple linear regression involves a single
independent variable.
 Multiple regression involves two or more
independent variables.
Simple Linear Regression
 Finds a linear relationship between:
 one independent variable X and
 one dependent variable Y
 First prepare a scatter plot to verify the data has a

linear trend.
 Use alternative approaches if the data is not linear.
Example 8.3: Home Market Value Data

Size of a house is
typically related to its
market value.
X = square footage
Y = market value ($)
The scatter plot of the full
data set (42 homes)
indicates a linear trend.
Finding the Best-Fitting Regression Line
 Market value = a + b × square feet
 Two possible lines are shown below.

 Line A is clearly a better fit to the data.


 We want to determine the best regression line.
Example 8.4: Using Excel to Find the
Best Regression Line
 Market value = $32,673 + $35.036 × square feet
◦ The estimated market value of a home with 2,200 square feet
would be: market value = $32,673 + $35.036 × 2,200 = $109,752

The regression model explains variation in market value due to size of the home. It provides better estimates of market value than simply using the average.
Least-Squares Regression
 Simple linear regression model: Y = β0 + β1X + ε
 We estimate the parameters b0 and b1 from the sample data, giving the fitted line Ŷ = b0 + b1X.
 Let Xi be the value of the independent variable of the ith observation. When the value of the independent variable is Xi, then Ŷi = b0 + b1Xi is the estimated value of Y for Xi.
Residuals
 Residuals are the observed errors associated with estimating the value of the dependent variable using the regression line:
ei = Yi − Ŷi
Least Squares Regression
 The best-fitting line minimizes the sum of squares of the
residuals.

 Excel functions:
◦ =INTERCEPT(known_y’s, known_x’s)
◦ =SLOPE(known_y’s, known_x’s)
Example 8.5: Using Excel Functions to
Find Least-Squares Coefficients
 Slope = b1 = 35.036
=SLOPE(C4:C45, B4:B45)
 Intercept = b0 = 32,673
=INTERCEPT(C4:C45, B4:B45)
 Estimate Ŷ when X = 1750 square feet:
Ŷ = 32,673 + 35.036(1750) = $93,986
=TREND(C4:C45, B4:B45, 1750)
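A rough Python counterpart of these worksheet functions is sketched below, using hypothetical stand-ins for the ranges B4:B45 and C4:C45 (the values are made up, not the Home Market Value file); scipy.stats.linregress returns the least-squares slope and intercept in one call.

import numpy as np
from scipy import stats

square_feet  = np.array([1500., 1720., 1850., 2030., 2200., 2400.])            # hypothetical sizes
market_value = np.array([84000., 90500., 97000., 104000., 110000., 117500.])   # hypothetical values

fit = stats.linregress(square_feet, market_value)
print(fit.slope, fit.intercept)            # counterparts of =SLOPE(...) and =INTERCEPT(...)
print(fit.intercept + fit.slope * 1750)    # counterpart of =TREND(..., 1750)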
Simple Linear Regression With Excel
Data > Data Analysis > Regression
Input Y Range (with header)
Input X Range (with header)
Check Labels

Excel outputs a table with many useful regression statistics.
Home Market Value Regression
Results
Regression Statistics
 Multiple R - | r |, where r is the sample correlation coefficient. The value of r varies from -1 to +1 (r is negative if slope is negative)
 R Square - coefficient of determination, R2, which varies from 0 (no fit) to 1 (perfect fit)
 Adjusted R Square - adjusts R2 for sample size and number of X variables
 Standard Error - variability between observed and predicted Y values. This is formally called the standard error of the estimate, SYX.
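These statistics can also be reproduced outside Excel. A minimal sketch with statsmodels on the same hypothetical home-value arrays used earlier: rsquared, rsquared_adj, and the square root of mse_resid correspond to R Square, Adjusted R Square, and the standard error of the estimate.

import numpy as np
import statsmodels.api as sm

square_feet  = np.array([1500., 1720., 1850., 2030., 2200., 2400.])            # hypothetical data
market_value = np.array([84000., 90500., 97000., 104000., 110000., 117500.])

X = sm.add_constant(square_feet)            # adds the intercept column
model = sm.OLS(market_value, X).fit()

print(model.rsquared)                       # R Square
print(model.rsquared_adj)                   # Adjusted R Square
print(np.sqrt(model.mse_resid))             # standard error of the estimate
print(model.summary())                      # full table, similar to Excel's regression output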
Example 8.6: Interpreting Regression Statistics for Simple Linear Regression
 53% of the variation in home market values can be explained by home size.
 The standard error of $7,287 is less than the standard deviation (not shown) of $10,553.
Regression as Analysis of Variance
 ANOVA conducts an F-test to determine whether variation in Y is due to varying levels of X.
 ANOVA is used to test for significance of regression:
H0: population slope coefficient = 0
H1: population slope coefficient ≠ 0
 Excel reports the p-value (Significance F).
 Rejecting H0 indicates that X explains variation in Y.
Example 8.7: Interpreting Significance of Regression
(Figure: two regression outputs contrast a case where home size is not a significant variable with one where it is significant.)
 p-value = 3.798 × 10^-8
◦ Reject H0: The slope is not equal to zero. Using a linear relationship, home size is a significant variable in explaining variation in market value.
Testing Hypotheses for Regression Coefficients
 An alternate method for testing whether a slope or intercept is zero is to use a t-test:
t = b1 / (standard error of b1)
 Excel provides the p-values for tests on the slope and intercept.
Example 8.8: Interpreting Hypothesis
Tests for Regression Coefficients

 Use p-values to draw conclusions.

 Neither coefficient is statistically equal to zero.


Confidence Intervals for Regression
Coefficients
 Confidence intervals (Lower 95% and Upper 95%
values in the output) provide information about the
unknown values of the true regression
coefficients, accounting for sampling error.
 We may also use confidence intervals to test hypotheses about the regression coefficients.
◦ To test the hypotheses H0: β1 = B1 versus H1: β1 ≠ B1, check whether B1 falls within the confidence interval for the slope. If it does not, reject the null hypothesis.
Example 8.9: Interpreting Confidence
Intervals for Regression Coefficients
 For the Home Market Value data, a 95% confidence
interval for the intercept is [14,823, 50,523], and for the
slope, [24.59, 45.48].
 Although we estimated that a house with 1,750 square
feet has a market value of 32,673 + 35.036(1,750)
=$93,986, if the true population parameters are at the
extremes of the confidence intervals, the estimate might
be as low as 14,823 + 24.59(1,750) = $57,855 or as high
as 50,523 + 45.48(1,750) = $130,113.
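In statsmodels the same 95% limits come from conf_int(). A minimal sketch of reproducing this what-if calculation, again on hypothetical data rather than the actual Home Market Value file:

import numpy as np
import statsmodels.api as sm

square_feet  = np.array([1500., 1720., 1850., 2030., 2200., 2400.])            # hypothetical data
market_value = np.array([84000., 90500., 97000., 104000., 110000., 117500.])

model = sm.OLS(market_value, sm.add_constant(square_feet)).fit()
ci = model.conf_int(alpha=0.05)         # row 0: intercept bounds, row 1: slope bounds
low  = ci[0, 0] + ci[1, 0] * 1750       # estimate if both parameters sit at their lower limits
high = ci[0, 1] + ci[1, 1] * 1750       # estimate if both sit at their upper limits
print(low, high)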
Residual Analysis and Regression Assumptions
 Residual = Actual Y value − Predicted Y value
 Standard residual = residual / standard deviation
 Rule of thumb: Standard residuals outside of ±2 or ±3 are potential outliers.
 Excel provides a table and a plot of residuals.
(Figure: one plotted point has a standard residual of 4.53.)
Checking Assumptions
 Linearity
 examine scatter diagram (should appear linear)
 examine residual plot (should appear random)
 Normality of Errors
 view a histogram of standard residuals
 regression is robust to departures from normality
 Homoscedasticity: variation about the regression line is
constant
 examine the residual plot
 Independence of Errors: successive observations should
not be related.
 This is important when the independent variable is time.
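A minimal sketch of these residual checks in Python, again on hypothetical data: plot residuals against fitted values to judge linearity and homoscedasticity, and a histogram of standardized residuals to judge normality.

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

x = np.array([1500., 1720., 1850., 2030., 2200., 2400.])            # hypothetical predictor
y = np.array([84000., 90500., 97000., 104000., 110000., 117500.])   # hypothetical response
model = sm.OLS(y, sm.add_constant(x)).fit()

resid = model.resid
std_resid = resid / resid.std()              # simple standardized residuals; |value| > 2 or 3 flags potential outliers

plt.subplot(1, 2, 1)
plt.scatter(model.fittedvalues, resid)       # want a random, even band around zero
plt.axhline(0, color="gray")
plt.subplot(1, 2, 2)
plt.hist(std_resid, bins=10)                 # want a roughly bell-shaped histogram
plt.show()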
Example 8.11: Checking Regression
Assumptions for the Home Market Value Data
 Linearity - linear trend in scatterplot
- no pattern in residual plot
Example 8.11 Continued

Normality of Errors – residual histogram appears


slightly skewed but is not a serious departure
Example 8.11 Continued
 Homoscedasticity – residual plot shows no serious
difference in the spread of the data for different X
values.
Example 8.11 Continued

 Independence of Errors – Because the data is


cross-sectional, we can assume this assumption
holds.
Multiple Linear Regression
 A linear regression model with more than one
independent variable is called a multiple linear
regression model.
Role Play Exploring Linear Regression
Analysis
 Pooja: Hey Ajay, come on in! I've been diving into
the marketing data you requested. Have a seat.
 Ajay: Thanks, Pooja. I'm really curious about our

marketing campaign's effectiveness, especially


regarding our online ads. Can you shed some light
on that?
 Pooja: Absolutely! I've been running some

analyses, and it seems like linear regression could


provide some insights into how our online ad
spending correlates with website traffic or
conversions.
Continue..
 Ajay: Linear regression? Can you explain that in
simpler terms?
 Pooja: Of course. Linear regression is a statistical

method used to model the relationship between


two variables, typically by fitting a straight line to
the data points. In our case, we're looking at how
changes in our online ad spending impact website
traffic or conversions.
 Ajay: Ah, got it. So, how do we go about doing

this?

Continue..
 Pooja: With linear regression, we'll be able to
quantify the impact of our ad spending on website
traffic or conversions. For example, we might find
that for every dollar increase in ad spending,
website traffic increases by a certain number of
visits. This information can help us optimize our
marketing budget allocation for maximum impact.

Estimated Multiple Regression
Equation
 We estimate the regression coefficients, called partial regression coefficients, b0, b1, b2, …, bk, then use the model:
Ŷ = b0 + b1X1 + b2X2 + … + bkXk
 The partial regression coefficients represent the


expected change in the dependent variable when
the associated independent variable is increased
by one unit while the values of all other
independent variables are held constant.
Excel Regression Tool
 The independent variables in the spreadsheet must be in
contiguous columns.
◦ So, you may have to manually move the columns of data around
before applying the tool.
 Key differences:
 Multiple R and R Square are called the multiple
correlation coefficient and the coefficient of multiple
determination, respectively, in the context of multiple
regression.
ANOVA for Multiple Regression
 ANOVA tests for significance of the entire model. That is, it computes an F-statistic for testing the hypotheses:
H0: β1 = β2 = … = βk = 0
H1: at least one βj ≠ 0
 The multiple linear regression output also provides


information to test hypotheses about each of the
individual regression coefficients.
◦ If we reject the null hypothesis that the slope associated with
independent variable i is 0, then the independent variable i is
significant and improves the ability of the model to better predict
the dependent variable. If we cannot reject H0, then that
independent variable is not significant and probably should not be
included in the model.
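A minimal multiple-regression sketch with statsmodels and made-up data (the column names below are placeholders, not the textbook's Colleges and Universities file): f_pvalue corresponds to Significance F for the whole model, and pvalues holds the test for each individual coefficient.

import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({                       # hypothetical data
    "GradRate": [55., 60., 72., 68., 80., 75., 63., 70.],
    "SAT":      [1050., 1100., 1250., 1200., 1350., 1300., 1150., 1230.],
    "Accept":   [0.70, 0.65, 0.40, 0.50, 0.30, 0.35, 0.60, 0.45],
})
X = sm.add_constant(df[["SAT", "Accept"]])
model = sm.OLS(df["GradRate"], X).fit()

print(model.fvalue, model.f_pvalue)       # overall F-test (Significance F)
print(model.pvalues)                      # p-value for each coefficient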
Example 8.12: Interpreting Regression Results
for the Colleges and Universities Data

 Predict student graduation rates using several indicators:


Example 8.12 Continued
 Regression model

 The value of R2 indicates that 53% of the variation in the dependent variable is explained by these independent variables.
 All coefficients are statistically significant.
Model Building Issues
 A good regression model should include only
significant independent variables.
 Adding an independent variable to a regression
model will always result in R2 equal to or greater
than the R2 of the original model.
 Adjusted R2 reflects both the number of independent variables and the sample size. An increase in adjusted R2 indicates that the model has improved.
Systematic Model Building Approach

1. Construct a model with all available independent


variables. Check for significance of the independent
variables by examining the p-values.
2. Identify the independent variable having the largest p-
value that exceeds the chosen level of significance.
3. Remove the variable identified in step 2 from the
model and evaluate adjusted R2.
(Don’t remove all variables with p-values that exceed α at the same time, but remove only one at a time.)
4. Continue until all variables are significant.
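A sketch of this backward-elimination loop, assuming a hypothetical pandas DataFrame df whose dependent variable is in the column named y_col and whose other columns are candidate predictors; alpha is the chosen significance level.

import statsmodels.api as sm

def backward_eliminate(df, y_col, alpha=0.05):
    predictors = [c for c in df.columns if c != y_col]
    while predictors:
        X = sm.add_constant(df[predictors])
        model = sm.OLS(df[y_col], X).fit()
        pvals = model.pvalues.drop("const")   # step 1: p-values of the independent variables
        worst = pvals.idxmax()                # step 2: variable with the largest p-value
        if pvals[worst] <= alpha:             # step 4: everything significant, stop
            return model, predictors
        predictors.remove(worst)              # step 3: remove that one variable and refit
    return None, []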
Example 8.13: Identifying the Best Regression Model

 Banking Data

Home value has the largest p-value; drop and re-run the regression.
Example 8.13 Continued

 Bank regression after removing Home Value

Adjusted R2 improves slightly.

All X variables are significant.


Alternate Criterion
 Use the t-statistic.
 If | t | < 1, then the standard error will decrease

and adjusted R2 will increase if the variable is


removed. If | t | > 1, then the opposite will occur.
 You can follow the same systematic approach,

except using t-values instead of p-values.


Multicollinearity
 Multicollinearity occurs when there are strong correlations among the independent variables, and they can predict each other better than the dependent variable.
 Correlations exceeding ±0.7 may indicate multicollinearity.
 The variance inflation factor is a better indicator, but it is not computed in Excel.
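Outside Excel, the variance inflation factor is available in statsmodels; a minimal sketch with hypothetical predictor columns (large values, commonly above 5 or 10, suggest problematic multicollinearity):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = pd.DataFrame({                         # hypothetical independent variables
    "Age":    [34., 45., 52., 41., 60., 38., 49., 55.],
    "Income": [52., 88., 110., 70., 140., 60., 95., 120.],
    "Wealth": [120., 300., 420., 250., 600., 180., 340., 480.],
})
Xc = sm.add_constant(X)                    # include the intercept so the VIFs are computed correctly
vif = {col: variance_inflation_factor(Xc.values, i)
       for i, col in enumerate(Xc.columns)}
print(vif)                                 # ignore the 'const' entry; large values flag multicollinearity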


Example 8.14: Identifying Potential Multicollinearity

 Colleges and Universities correlation matrix; none exceed the recommended threshold of ±0.7

 Banking Data correlation matrix; large correlations exist


Example 8.14 Continued
 If we remove Wealth from the model, the adjusted R2 drops to
0.9201, but we discover that Education is no longer significant.
 Dropping Education and leaving only Age and Income in the model
results in an adjusted R2 of 0.9202.
 However, if we remove Income from the model instead of Wealth,
the Adjusted R2 drops to only 0.9345, and all remaining variables
(Age, Education, and Wealth) are significant.
Practical Issues in Trendline and
Regression Modeling
 Identifying the best regression model often requires
experimentation and trial and error.
 The independent variables selected should make sense in attempting to explain the dependent variable.
 Additional variables increase R2 and, therefore,
help to explain a larger proportion of the variation.
 Good models are as simple as possible (the
principle of parsimony).
Overfitting
 Overfitting means fitting a model too closely to the sample data at the risk of not fitting it well to the population in which we are interested.
 In multiple regression, if we add too many terms to the model, then the model may not adequately predict other values from the population.
 Overfitting can be mitigated by using good logic, intuition, theory, and parsimony.
Regression with Categorical Variables
 Regression analysis requires numerical data.
 Categorical data can be included as independent variables, but must be coded numerically using dummy variables.
 For variables with 2 categories, code as 0 and 1.
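A minimal sketch of this 0/1 coding with pandas, using a hypothetical MBA column of Yes/No values:

import pandas as pd

df = pd.DataFrame({"MBA": ["Yes", "No", "Yes", "No"]})   # hypothetical categorical column
df["MBA"] = (df["MBA"] == "Yes").astype(int)             # Yes -> 1, No -> 0
print(df)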
CaseLets: Regression Analysis in Marketing

 What is the dependent variable in this case study?
 Identify the independent variables.
 Why is regression analysis an appropriate method for this case study?
 What type of regression analysis would be most suitable for this case study, and why?

Example 8.15: A Model with Categorical
Variables
 Employee Salaries provides data for 35 employees

 Predict Salary using Age and MBA (code as


yes=1, no=0)
Example 8.15 Continued
 Salary = 893.59 + 1,044.15 × Age + 14,767.23 × MBA
◦ If MBA = 0: Salary = 893.59 + 1,044.15 × Age
◦ If MBA = 1: Salary = 15,660.82 + 1,044.15 × Age
Interactions
 An interaction occurs when the effect of one
variable is dependent on another variable.
 We can test for interactions by defining a new variable as the product of the two variables, X3 = X1 × X2, and testing whether this variable is significant, leading to an alternative model.
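A minimal sketch of adding such an interaction column before fitting, using hypothetical Salary, Age, and MBA values (MBA already coded 0/1):

import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({                               # hypothetical data
    "Salary": [45000., 60000., 52000., 80000., 48000., 95000.],
    "Age":    [25., 35., 30., 45., 28., 50.],
    "MBA":    [0, 1, 0, 1, 0, 1],
})
df["Age_x_MBA"] = df["Age"] * df["MBA"]           # X3 = X1 * X2
X = sm.add_constant(df[["Age", "MBA", "Age_x_MBA"]])
model = sm.OLS(df["Salary"], X).fit()
print(model.pvalues["Age_x_MBA"])                 # is the interaction term significant?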
Example 8.16: Incorporating Interaction
Terms in a Regression Model
 Define an interaction between
Age and MBA and re-run the
regression.

The MBA indicator is not significant; drop and re-run.


Example 8.16 Continued
 Adjusted R2 increased slightly, and both age and the
interaction term are significant. The final model is
salary = 3,323.11 + 984.25 × age + 425.58 × MBA × age
Categorical Variables with More Than Two
Levels
When a categorical variable
has k > 2 levels, we need to
add k - 1 additional variables to
the model.
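A minimal sketch with pandas for a hypothetical Tool variable with k = 4 levels; drop_first=True keeps k − 1 = 3 dummy columns, with the dropped level acting as the baseline.

import pandas as pd

df = pd.DataFrame({"Tool": ["A", "B", "C", "D", "B", "A"]})        # hypothetical 4-level variable
dummies = pd.get_dummies(df["Tool"], prefix="Tool", drop_first=True)
print(dummies.columns.tolist())    # ['Tool_B', 'Tool_C', 'Tool_D']; level A is the baseline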
Example 8.17: A Regression Model with
Multiple Levels of Categorical Variables
 The Excel file Surface
Finish provides
measurements of the
surface finish of 35 parts
produced on a lathe,
along with the
revolutions per minute
(RPM) of the spindle
and one of four types of
cutting tools used.
Example 8.17 Continued
 Because we have k = 4 levels of tool type, we will define a regression model of the form:
Surface finish = β0 + β1 × RPM + β2 × type B + β3 × type C + β4 × type D + ε
Example 8.17 Continued

 Add 3 columns to
the data, one for
each of the tool
type variables
Example 8.17 Continued
 Regression results

Surface finish = 24.49 + 0.098 RPM − 13.31 type B − 20.49 type C − 26.04 type D
Regression Models with Nonlinear Terms
 Curvilinear models may be appropriate when
scatter charts or residual plots show nonlinear
relationships.
 A second-order polynomial might be used:
Y = β0 + β1X + β2X^2 + ε
Here β1 represents the linear effect of X on Y and β2 represents the curvilinear effect.

 This model is linear in the β parameters so we can


use linear regression methods.
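A minimal sketch of fitting such a second-order model with statsmodels on hypothetical x and y arrays; squaring x supplies the curvilinear term, and the model remains linear in the β parameters.

import numpy as np
import statsmodels.api as sm

x = np.array([1., 2., 3., 4., 5., 6.])               # hypothetical predictor
y = np.array([3., 7., 14., 24., 37., 54.])           # hypothetical response showing curvature
X = sm.add_constant(np.column_stack([x, x ** 2]))    # columns: 1, x, x^2
model = sm.OLS(y, X).fit()
print(model.params)                                  # b0, b1 (linear effect), b2 (curvilinear effect)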
Debate on Regression Analysis
 Proponent Argument:
 Regression analysis is a powerful statistical tool

that allows us to explore relationships between


variables and make predictions based on data. It
is widely used across various fields, including
economics, social sciences, and marketing, to
understand complex phenomena and make
informed decisions.

Continue
 While it's true that regression analysis has its limitations,
these challenges can often be addressed with proper
data preprocessing, model selection, and validation
techniques. For example, techniques such as robust
regression and regularization can help mitigate the
impact of outliers and overfitting, respectively. Moreover,
advancements in statistical methods, such as machine
learning algorithms, offer alternatives to traditional
regression models, allowing for more flexibility and
robustness in modeling complex relationships.

CONCLUSION
While regression analysis has its limitations and challenges, its benefits outweigh its drawbacks when used judiciously and in the right context. It remains a valuable tool for exploring relationships between variables, making predictions, and informing decision-making across various domains. As data analytics continues to evolve, regression analysis will continue to play a vital role in extracting actionable insights from data and driving innovation and growth.
