Chapter Topics
• Hypothesis Testing Methodology
• Z Test for the Mean (σ Known)
• p-Value Approach to Hypothesis Testing
• Connection to Confidence Interval Estimation
• One-Tail Tests
• t Test of Hypothesis for the Mean
• Z Test of Hypothesis for the Proportion
Review of ANOVA and linear regression
Review of simple ANOVA
ANOVA: for comparing means between more than 2 groups
Hypotheses of One-Way ANOVA
H0: μ1 = μ2 = μ3 = ⋯ = μc
All population means are equal
i.e., no treatment effect (no variation in means among
groups)
H1 : Not all of the population means are the same
At least one population mean is different
i.e., there is a treatment effect
Does not mean that all population means are different
(some pairs may be the same)
The F-distribution
A ratio of variances follows an F-distribution:
σ²between / σ²within ~ F(n, m)
The F-test tests the hypothesis that two variances are equal.
F will be close to 1 if the sample variances are equal.
H0: σ²between = σ²within
Ha: σ²between ≠ σ²within
How to calculate ANOVAs by hand…
Treatment 1 Treatment 2 Treatment 3 Treatment 4
y11 y21 y31 y41
y12 y22 y32 y42 n=10 obs./group
y13 y23 y33 y43
y14 y24 y34 y44 k=4 groups
y15 y25 y35 y45
y16 y26 y36 y46
y17 y27 y37 y47
y18 y28 y38 y48
y19 y29 y39 y49
y110 y210 y310 y410
The group means:
ȳ1• = Σj=1..10 y1j / 10,  ȳ2• = Σj=1..10 y2j / 10,  ȳ3• = Σj=1..10 y3j / 10,  ȳ4• = Σj=1..10 y4j / 10
The (within) group variances:
s1² = Σj=1..10 (y1j − ȳ1•)² / (10 − 1), and likewise for groups 2, 3, and 4.
Sum of Squares Within (SSW),
or Sum of Squares Error (SSE)
The (within) group variances:
si² = Σj=1..10 (yij − ȳi•)² / (10 − 1), for i = 1, …, 4
Adding the numerators across the four groups:
Σj (y1j − ȳ1•)² + Σj (y2j − ȳ2•)² + Σj (y3j − ȳ3•)² + Σj (y4j − ȳ4•)²
= Σi=1..4 Σj=1..10 (yij − ȳi•)²  =  Sum of Squares Within (SSW), or SSE (sum of squares due to chance error)
Sum of Squares Between (SSB), or
Sum of Squares Regression (SSR)
Overall mean of all 40 observations (the "grand mean"):
ȳ•• = Σi=1..4 Σj=1..10 yij / 40
SSB = 10 × Σi=1..4 (ȳi• − ȳ••)²
SSB measures the variability of the group means compared to the grand mean (the variability due to the treatment).
Total Sum of Squares (SST)
TSS = Σi=1..4 Σj=1..10 (yij − ȳ••)²
The total sum of squares (TSS) is the squared difference of every observation from the overall mean (the numerator of the variance of Y!).
Partitioning of Variance
Σi=1..4 Σj=1..10 (yij − ȳi•)²  +  10 × Σi=1..4 (ȳi• − ȳ••)²  =  Σi=1..4 Σj=1..10 (yij − ȳ••)²
SSW + SSB = TSS
ANOVA Table

Source of variation | d.f. | Sum of squares | Mean Sum of Squares | F-statistic | p-value
Between (k groups) | k−1 | SSB (sum of squared deviations of group means from grand mean) | SSB/(k−1) | [SSB/(k−1)] / [SSW/(nk−k)] | from the F(k−1, nk−k) chart
Within (n individuals per group) | nk−k | SSW (sum of squared deviations of observations from their group mean) | s² = SSW/(nk−k) | |
Total variation | nk−1 | TSS (sum of squared deviations of observations from grand mean); TSS = SSB + SSW | | |
Example
Treatment 1 Treatment 2 Treatment 3 Treatment 4
60 inches 50 48 47
67 52 49 67
42 43 50 54
67 67 55 67
56 67 56 68
62 59 61 65
64 67 61 65
59 64 60 56
72 63 59 60
71 65 64 65
Example
Step 1) calculate the sum
Treatment 1 Treatment 2 Treatment 3 Treatment 4
of squares between groups:
60 inches 50 48 47
67 52 49 67
42 43 50 54
Mean for group 1 = 62.0 67 67 55 67
56 67 56 68
Mean for group 2 = 59.7
62 59 61 65
Mean for group 3 = 56.3 64 67 61 65
59 64 60 56
Mean for group 4 = 61.4 72 63 59 60
71 65 64 65
Grand mean= 59.85
SSB = [(62 − 59.85)² + (59.7 − 59.85)² + (56.3 − 59.85)² + (61.4 − 59.85)²] × n per group = 19.65 × 10 = 196.5
Example
Step 2) Calculate the sum of squares within groups (same data table as above):
(60 − 62)² + (67 − 62)² + (42 − 62)² + (67 − 62)² + (56 − 62)² + (62 − 62)² + (64 − 62)² + (59 − 62)² + (72 − 62)² + (71 − 62)² + (50 − 59.7)² + (52 − 59.7)² + (43 − 59.7)² + (67 − 59.7)² + (67 − 59.7)² + (59 − 59.7)² + … + (sum of all 40 squared deviations) = 2060.6
Step 3) Fill in the ANOVA table
Source of variation | d.f. | Sum of squares | Mean Sum of Squares | F-statistic | p-value
Between | 3 | 196.5 | 65.5 | 1.14 | .344
Within | 36 | 2060.6 | 57.2 | |
Total | 39 | 2257.1 | | |
INTERPRETATION of ANOVA:
How much of the variance in height is explained by treatment group?
R² = "Coefficient of Determination" = SSB/TSS = 196.5/2257.1 ≈ 9%
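To make the worked example easy to verify, here is a minimal Python sketch (not from the original slides; the group data are copied from the table above) that recomputes SSB, SSW, the F-statistic and the p-value.

```python
# Minimal sketch: one-way ANOVA "by hand" for the four-treatment example above.
import numpy as np
from scipy import stats

groups = np.array([
    [60, 67, 42, 67, 56, 62, 64, 59, 72, 71],   # Treatment 1
    [50, 52, 43, 67, 67, 59, 67, 64, 63, 65],   # Treatment 2
    [48, 49, 50, 55, 56, 61, 61, 60, 59, 64],   # Treatment 3
    [47, 67, 54, 67, 68, 65, 65, 56, 60, 65],   # Treatment 4
], dtype=float)
k, n = groups.shape                                    # k = 4 groups, n = 10 per group

grand_mean = groups.mean()                             # 59.85
group_means = groups.mean(axis=1)                      # 62.0, 59.7, 56.3, 61.4

ssb = n * np.sum((group_means - grand_mean) ** 2)      # ~196.5
ssw = np.sum((groups - group_means[:, None]) ** 2)     # ~2060.6
F = (ssb / (k - 1)) / (ssw / (k * n - k))              # ~1.14
p = stats.f.sf(F, k - 1, k * n - k)                    # ~0.344
print(ssb, ssw, F, p)
```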
Coefficient of Determination
R² = SSB / (SSB + SSE) = SSB / SST
The amount of variation in the outcome variable (dependent
variable) that is explained by the predictor (independent variable).
ANOVA example
Table 6. Mean micronutrient intake from the school lunch by school

                     S1a, n=25   S2b, n=25   S3c, n=25   P-valued
Calcium (mg)  Mean   117.8       158.7       206.5       0.000
              SDe    62.4        70.5        86.2
Iron (mg)     Mean   2.0         2.0         2.0         0.854
              SD     0.6         0.6         0.6
Folate (μg)   Mean   26.6        38.7        42.6        0.000
              SD     13.1        14.5        15.1
Zinc (mg)     Mean   1.9         1.5         1.3         0.055
              SD     1.0         1.2         0.4

a School 1 (most deprived; 40% subsidized lunches).
b School 2 (medium deprived; <10% subsidized).
c School 3 (least deprived; no subsidization, private school).
d ANOVA; significant differences are highlighted in bold (P<0.05).

FROM: Gould R, Russell J, Barker ME. School lunch menus and 11 to 12 year old children's food choice in three secondary schools in England - are the nutritional standards being met? Appetite. 2006 Jan;46(1):86-92.
Answer
Step 1) calculate the sum of squares between groups:
Mean for School 1 = 117.8
Mean for School 2 = 158.7
Mean for School 3 = 206.5
Grand mean: 161
SSB = [(117.8 − 161)² + (158.7 − 161)² + (206.5 − 161)²] × 25 per group = 98,113
Answer
Step 2) calculate the sum of squares within groups:
S.D. for S1 = 62.4
S.D. for S2 = 70.5
S.D. for S3 = 86.2
Therefore, the sum of squares within is:
(24) × [62.4² + 70.5² + 86.2²] = 391,066
Answer
Step 3) Fill in your ANOVA table
Source of variation | d.f. | Sum of squares | Mean Sum of Squares | F-statistic | p-value
Between | 2 | 98,113 | 49,056 | 9 | <.05
Within | 72 | 391,066 | 5,431 | |
Total | 74 | 489,179 | | |

R² = 98,113/489,179 = 20%
School explains 20% of the variance in lunchtime calcium
intake in these kids.
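Since the published table only reports group means, SDs and n, here is a minimal Python sketch (an illustration, not part of the paper) of a one-way ANOVA computed from summary statistics; SSB differs slightly from the slide's 98,113 because the reported means are rounded.

```python
# Minimal sketch: one-way ANOVA from summary statistics (calcium, three schools).
import numpy as np
from scipy import stats

means = np.array([117.8, 158.7, 206.5])    # reported group means
sds   = np.array([62.4, 70.5, 86.2])       # reported group SDs
n, k  = 25, 3                              # 25 children per school, 3 schools

grand_mean = means.mean()                            # 161 (equal group sizes)
ssb = n * np.sum((means - grand_mean) ** 2)          # ~98,545 (slide: 98,113)
ssw = np.sum((n - 1) * sds ** 2)                     # ~391,067
F = (ssb / (k - 1)) / (ssw / (k * n - k))            # ~9
p = stats.f.sf(F, k - 1, k * n - k)                  # well below .05
print(ssb, ssw, F, p)
```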
Beyond one-way ANOVA
Often, you may want to test more than 1
treatment. ANOVA can accommodate
more than 1 treatment or factor, so long
as they are independent. Again, the
variation partitions beautifully!
TSS = SSB1 + SSB2 + SSW
Linear regression review
What is “Linear”?
Remember this: Y = mX + B?
What’s Slope?
A slope of 2 means that every 1-unit change in X
yields a 2-unit change in Y.
Regression equation…
Expected value of y at a given level of x:
E(yi | xi) = α + βxi
Predicted value for an individual…
yi = α + β·xi + random errori
(α + β·xi is fixed, exactly on the line; the random error follows a normal distribution)
Assumptions (or the fine print)
Linear regression assumes that…
1. The relationship between X and Y is linear
2. Y is distributed normally at each value of X
3. The variance of Y at every value of X is the
same (homogeneity of variances)
4. The observations are independent**
**When we talk about repeated measures
starting next week, we will violate this
assumption and hence need more
sophisticated regression models!
The standard error of Y given X is the average variability around the
regression line at any given value of X. It is assumed to be equal at
all values of X.
(This standard error is denoted Sy/x.)
Regression Picture
ŷi = βxi + α
(Regression picture: for each observation, A = distance from yi to the naïve mean of y, B = distance from the fitted value ŷi to the naïve mean, C = distance from yi to the regression line.)
*Least squares estimation gave us the line (β) that minimized C².
Σi=1..n (yi − ȳ)² = Σi=1..n (ŷi − ȳ)² + Σi=1..n (ŷi − yi)²
R² = SSreg / SStotal
A² corresponds to SStotal: the total squared distance of observations from the naïve mean of y (total variation).
B² corresponds to SSreg: the distance from the regression line to the naïve mean of y, i.e. the variability due to x (regression).
C² corresponds to SSresidual: the variance around the regression line, i.e. the additional variability not explained by x, which the least squares method aims to minimize.
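A minimal Python sketch (simulated data, not the slides' dataset) of the decomposition just described: SStotal = SSreg + SSresidual, and R² = SSreg/SStotal.

```python
# Minimal sketch: least-squares fit and the A^2 = B^2 + C^2 style decomposition.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 100)        # hypothetical linear relationship

slope, intercept = np.polyfit(x, y, 1)           # least-squares line
y_hat = intercept + slope * x

ss_total = np.sum((y - y.mean()) ** 2)           # SStotal: spread around the naive mean
ss_reg   = np.sum((y_hat - y.mean()) ** 2)       # SSreg: explained by the line
ss_res   = np.sum((y - y_hat) ** 2)              # SSresidual: left over around the line

print(np.isclose(ss_total, ss_reg + ss_res))     # True (up to floating point)
print("R^2 =", ss_reg / ss_total)
```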
Recall example: cognitive
function and vitamin D
Hypothetical data loosely based on [1];
cross-sectional study of 100 middle-
aged and older European men.
Cognitive function is measured by the Digit
Symbol Substitution Test (DSST).
1. Lee DM, Tajar A, Ulubaev A, et al. Association between 25-hydroxyvitamin D levels and cognitive performance in middle-aged
and older European men. J Neurol Neurosurg Psychiatry. 2009 Jul;80(7):722-9.
Distribution of vitamin D
Mean= 63 nmol/L
Standard deviation = 33 nmol/L
Distribution of DSST
Normally distributed
Mean = 28 points
Standard deviation = 10 points
Four hypothetical datasets
I generated four hypothetical datasets,
with increasing TRUE slopes (between
vit D and DSST):
0
0.5 points per 10 nmol/L
1.0 points per 10 nmol/L
1.5 points per 10 nmol/L
Dataset 1: no relationship
Dataset 2: weak relationship
Dataset 3: weak to moderate
relationship
Dataset 4: moderate
relationship
The “Best fit” line
Regression
equation:
E(Yi) = 28 + 0*vit
Di (in 10 nmol/L)
The “Best fit” line
Note how the line is
a little deceptive; it
draws your eye,
making the
relationship appear
stronger than it
really is!
Regression
equation:
E(Yi) = 26 + 0.5*vit
Di (in 10 nmol/L)
The “Best fit” line
Regression equation:
E(Yi) = 22 + 1.0*vit
Di (in 10 nmol/L)
The “Best fit” line
Regression equation:
E(Yi) = 20 + 1.5*vit Di
(in 10 nmol/L)
Note: all the lines go
through the point
(63, 28)!
Significance testing…
Slope
Distribution of slope ~ Tn-2(β,s.e.( βˆ ))
H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (linear relationship does exist)
Tn−2 = (β̂ − 0) / s.e.(β̂)
Example: dataset 4
Estimated slope (β̂) = 0.15; standard error (beta) = 0.03
T98 = 0.15/0.03 = 5, p < .0001
95% confidence interval: 0.09 to 0.21
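A minimal Python sketch of this slope test, using only the quantities quoted on the slide (β̂ = 0.15 per nmol/L, s.e. = 0.03, n = 100 so 98 degrees of freedom).

```python
# Minimal sketch: t test and 95% CI for a regression slope from beta-hat and its SE.
from scipy import stats

beta_hat, se, df = 0.15, 0.03, 98
t = (beta_hat - 0) / se                          # = 5
p = 2 * stats.t.sf(abs(t), df)                   # < .0001
half_width = stats.t.ppf(0.975, df) * se
print(t, p, (beta_hat - half_width, beta_hat + half_width))   # CI ~ (0.09, 0.21)
```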
Multiple linear regression…
What if age is a confounder here?
Older men have lower vitamin D
Older men have poorer cognition
“Adjust” for age by putting age in the
model:
DSST score = intercept + slope1 × vitamin D + slope2 × age
2 predictors: age and vit D…
Different 3D view…
Fit a plane rather than a line…
On the plane, the
slope for vitamin
D is the same at
every age; thus,
the slope for
vitamin D
represents the
effect of vitamin
D when age is
held constant.
Equation of the “Best fit”
plane…
DSST score = 53 + 0.0039 × vitamin D (in 10 nmol/L) − 0.46 × age (in years)
P-value for vitamin D >>.05
P-value for age <.0001
Thus, relationship with vitamin D was
due to confounding by age!
Multiple Linear Regression
More than one predictor…
E(y)= α + β1*X + β2 *W + β3 *Z…
Each regression coefficient is the amount of
change in the outcome variable that would be
expected per one-unit change of the
predictor, if all other variables in the model
were held constant.
Functions of multivariate
analysis:
Control for confounders
Test for interactions between predictors
(effect modification)
Improve predictions
ANOVA is linear regression!
Divide vitamin D into three groups:
Deficient (<25 nmol/L)
Insufficient (>=25 and <50 nmol/L)
Sufficient (>=50 nmol/L), reference group
DSST = α (= value for sufficient) + βinsufficient × (1 if insufficient) + βdeficient × (1 if deficient)
This is called “dummy coding”—where multiple
binary variables are created to represent
being in each category (or not) of a
categorical variable
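A minimal Python sketch (hypothetical data; the variable names are illustrative, not from the study) of dummy coding: two 0/1 indicators for "deficient" and "insufficient" are created, with "sufficient" as the reference category, and regressed on the outcome.

```python
# Minimal sketch: dummy coding a 3-level category and fitting it by least squares.
import numpy as np

rng = np.random.default_rng(1)
vitd = rng.uniform(10, 120, 100)                       # hypothetical vitamin D, nmol/L
dsst = 30 + 0.08 * vitd + rng.normal(0, 8, 100)        # hypothetical DSST scores

deficient    = (vitd < 25).astype(float)               # 1 if deficient, else 0
insufficient = ((vitd >= 25) & (vitd < 50)).astype(float)

X = np.column_stack([np.ones_like(vitd), insufficient, deficient])
coef, *_ = np.linalg.lstsq(X, dsst, rcond=None)
intercept, b_insufficient, b_deficient = coef
# intercept = mean outcome in the reference (sufficient) group;
# each beta  = that group's mean minus the reference group's mean.
print(intercept, b_insufficient, b_deficient)
```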
The picture…
Sufficient vs.
Insufficient
Sufficient vs.
Deficient
Results…
Parameter Estimates

Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > |t|
Intercept | 1 | 40.07407 | 1.47817 | 27.11 | <.0001
deficient | 1 | -9.87407 | 3.73950 | -2.64 | 0.0096
insufficient | 1 | -6.87963 | 2.33719 | -2.94 | 0.0041
Interpretation:
The deficient group has a mean DSST 9.87 points
lower than the reference (sufficient) group.
The insufficient group has a mean DSST 6.87
points lower than the reference (sufficient) group.
What is a Hypothesis?
A hypothesis is an assumption about the population parameter.
("I assume the mean GPA of this class is 3.5!")
A parameter is a population mean or proportion.
The parameter must be identified before analysis.
The Null Hypothesis, H0
• States the Assumption (numerical) to be tested
e.g. The average # TV sets in US homes is at
least 3 (H0: µ ≥ 3)
• Begin with the assumption that the null
hypothesis is TRUE.
(Similar to the notion of innocent until proven guilty)
•Refers to the Status Quo
•Always contains the ‘ = ‘ sign
•The Null Hypothesis may or may not be rejected.
The Alternative Hypothesis, H1
• Is the opposite of the null hypothesis
e.g. The average # TV sets in US homes is
less than 3 (H1: µ < 3)
• Challenges the Status Quo
• Never contains the ‘=‘ sign
• The Alternative Hypothesis may or may
not be accepted
Identify the Problem
Steps:
State the Null Hypothesis (H0: µ ≥ 3)
State its opposite, the Alternative
Hypothesis (H1: µ < 3)
Hypotheses are mutually exclusive &
exhaustive
Sometimes it is easier to form the
alternative hypothesis first.
Hypothesis Testing Process
Assume the population mean age is 50 (null hypothesis).
The sample mean is 20.
Is X̄ = 20 ≅ µ = 50? No, not likely!
REJECT the null hypothesis.
Reason for Rejecting H0
Sampling Distribution
It is unlikely that we would get a sample mean of this value (X̄ = 20) if in fact the population mean were µ = 50.
Therefore, we reject the null hypothesis that µ = 50.
Level of Significance, α
• Defines Unlikely Values of Sample
Statistic if Null Hypothesis Is True
Called Rejection Region of Sampling
Distribution
• Designated α (alpha)
Typical values are 0.01, 0.05, 0.10
• Selected by the Researcher at the Start
• Provides the Critical Value(s) of the Test
Level of Significance, α and
the Rejection Region
H0: µ ≥ 3, H1: µ < 3: rejection region of size α in the lower tail (one critical value)
H0: µ ≤ 3, H1: µ > 3: rejection region of size α in the upper tail (one critical value)
H0: µ = 3, H1: µ ≠ 3: rejection regions of size α/2 in both tails (two critical values)
Errors in Making Decisions
• Type I Error
Reject True Null Hypothesis
Has Serious Consequences
Probability of Type I Error Is α
Called Level of Significance
• Type II Error
Do Not Reject False Null Hypothesis
Probability of Type II Error Is β (Beta)
Result Possibilities (H0: Innocent)

Jury Trial:
Verdict | Actually Innocent | Actually Guilty
Innocent | Correct | Error
Guilty | Error | Correct

Hypothesis Test:
Decision | H0 True | H0 False
Do Not Reject H0 | 1−α | Type II Error (β)
Reject H0 | Type I Error (α) | Power (1−β)
α & β Have an
Inverse Relationship
Reduce probability of one error
and the other one goes up.
Z-Test Statistics (σ Known)
• Convert Sample Statistic (e.g., X̄) to a Standardized Z Variable
Test statistic: Z = (X̄ − µX̄)/σX̄ = (X̄ − µ)/(σ/√n)
• Compare to Critical Z Value(s)
If Z test Statistic falls in Critical Region,
Reject H0; Otherwise Do Not Reject H0
p Value Test
• Probability of Obtaining a Test Statistic
More Extreme (≤ or ≥) than Actual
Sample Value Given H0 Is True
• Called Observed Level of Significance
Smallest Value of α at Which H0 Can Be Rejected
• Used to Make Rejection Decision
If p value ≥ α, Do Not Reject H0
If p value < α, Reject H0
Hypothesis Testing: Steps
Test the Assumption that the true mean #
of TV sets in US homes is at least 3.
1. State H0 H0 : µ ≥ 3
2. State H1 H1 : µ < 3
3. Choose α α = .05
4. Choose n n = 100
5. Choose Test: Z Test (or p Value)
Hypothesis Testing: Steps
(continued)
Test the Assumption that the average # of
TV sets in US homes is at least 3.
6. Set Up Critical Value(s) Z = -1.645
7. Collect Data 100 households surveyed
8. Compute Test Statistic Computed Test Stat.= -2
9. Make Statistical Decision Reject Null Hypothesis
10. Express Decision: The true mean # of TV sets is less than 3 in US households.
One-Tail Z Test for Mean
(σ Known)
• Assumptions
Population Is Normally Distributed
If Not Normal, use large samples
Null Hypothesis Has ≤ or ≥ Sign Only
• Z Test Statistic:
z = (x̄ − µx̄)/σx̄ = (x̄ − µ)/(σ/√n)
Rejection Region
H0: µ ≥ 0, H1: µ < 0: reject H0 in the lower tail (area α); the sample mean must be significantly below µ = 0.
H0: µ ≤ 0, H1: µ > 0: reject H0 in the upper tail (area α); small values don't contradict H0, so don't reject H0.
Example: One Tail Test
Does an average box of cereal contain more than 368 grams of cereal? A random sample of 25 boxes showed X̄ = 372.5. The company has specified σ to be 15 grams. Test at the α = 0.05 level.
H0: µ ≤ 368
H1: µ > 368
Finding Critical Values: One Tail
What is Z given α = 0.05? (σZ = 1)

Standardized Normal Probability Table (portion):
Z | .04 | .05 | .06
1.6 | .5495 | .5505 | .5515
1.7 | .5591 | .5599 | .5608
1.8 | .5671 | .5678 | .5686
1.9 | .5738 | .5744 | .5750

Critical Value = 1.645
Example Solution: One Tail
H0: µ ≤ 368   H1: µ > 368
α = 0.05, n = 25, Critical Value: 1.645
Test Statistic: Z = (X̄ − µ)/(σ/√n) = (372.5 − 368)/(15/√25) = 1.50
Decision: Do Not Reject at α = .05 (1.50 < 1.645)
Conclusion: No evidence that the true mean is more than 368
p Value Solution
p Value is P(Z ≥ 1.50) = 0.0668
Use the alternative hypothesis to find the direction of the test.
From the Z table, the cumulative area up to the sample Z value of 1.50 is .9332, so the upper-tail p value = 1.0000 − .9332 = .0668.
p Value Solution
(p Value = 0.0668) ≥ (α = 0.05).
Do Not Reject.
p Value = 0.0668
Reject
α = 0.05
0 1.50 Z
Test Statistic Is In the Do Not Reject Region
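A minimal Python sketch of the calculation on these slides: the one-tail Z statistic for the cereal example and its p-value.

```python
# Minimal sketch: one-tail Z test (X-bar = 372.5, mu0 = 368, sigma = 15, n = 25).
from math import sqrt
from scipy import stats

xbar, mu0, sigma, n = 372.5, 368, 15, 25
z = (xbar - mu0) / (sigma / sqrt(n))      # 1.50
p = stats.norm.sf(z)                      # P(Z >= 1.50) = 0.0668
print(z, p)                               # p > alpha = .05, so do not reject H0
```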
Example: Two Tail Test
Does an average box of cereal contain 368 grams of cereal? A random sample of 25 boxes showed X̄ = 372.5. The company has specified σ to be 15 grams. Test at the α = 0.05 level.
H0: µ = 368
H1: µ ≠ 368
Example Solution: Two Tail
H0: µ = 368   H1: µ ≠ 368
α = 0.05, n = 25, Critical Values: ±1.96
Test Statistic: Z = (X̄ − µ)/(σ/√n) = (372.5 − 368)/(15/√25) = 1.50
Decision: Do Not Reject at α = .05 (|1.50| < 1.96)
Conclusion: No evidence that the true mean is not 368
Connection to
Confidence Intervals
For X̄ = 372.5, σ = 15 and n = 25, the 95% confidence interval is:
372.5 − (1.96)(15/√25)  to  372.5 + (1.96)(15/√25)
or 366.62 ≤ µ ≤ 378.38
If this interval contains the Hypothesized mean
(368), we do not reject the null hypothesis.
It does. Do not reject.
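A minimal Python sketch of the interval shown above.

```python
# Minimal sketch: 95% confidence interval for the mean with sigma known.
from math import sqrt
from scipy import stats

xbar, sigma, n = 372.5, 15, 25
half_width = stats.norm.ppf(0.975) * sigma / sqrt(n)     # 1.96 * 3
print(xbar - half_width, xbar + half_width)              # ~366.62 to ~378.38
# 368 lies inside the interval, so H0: mu = 368 is not rejected at alpha = .05.
```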
t-Test: σ Unknown
Assumptions
Population is normally distributed
If not normal, only slightly skewed & a large
sample taken
Parametric test procedure
t test statistic: t = (X̄ − µ)/(S/√n)
Example: One Tail t-Test
Does an average box of cereal contain more than 368 grams of cereal? A random sample of 36 boxes showed X̄ = 372.5 and s = 15. Test at the α = 0.01 level.
σ is not given (use the sample standard deviation s).
H0: µ ≤ 368
H1: µ > 368
Example:Z Test for Proportion
• Problem: A marketing company claims that it receives 4% responses from its mailing.
• Approach: To test this claim, a random sample of 500 mailings was surveyed, with 25 responses.
•Solution: Test at the α = .05 significance
level.
Z Test for Proportion:
Solution
H0: p = .04
H1: p ≠ .04
α = .05, n = 500, Critical Values: ±1.96
Test Statistic: Z ≅ (ps − p)/√(p(1 − p)/n) = (.05 − .04)/√(.04(1 − .04)/500) = 1.14
Decision: Do not reject at α = .05 (|1.14| < 1.96)
Conclusion: We do not have sufficient evidence to reject the company's claim of a 4% response rate.
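A minimal Python sketch of the proportion test on this slide (claimed p = .04, observed 25 responses out of 500).

```python
# Minimal sketch: two-tailed Z test for a proportion.
from math import sqrt
from scipy import stats

p0, x, n = 0.04, 25, 500
ps = x / n                                       # sample proportion = 0.05
z = (ps - p0) / sqrt(p0 * (1 - p0) / n)          # ~1.14
p_value = 2 * stats.norm.sf(abs(z))              # ~0.25
print(z, p_value)                                # |z| < 1.96, so do not reject H0
```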
Machine Learning
(UCSE603)
Module –I:
Introduction
DR. RANJAN MAITY
CENTRAL INSTITUTE OF TECHNOLOGY
KOKRAJHAR
Syllabus
Module 1: Introduction to Machine learning [3L] Brief Introduction to Machine Learning, its advantage and disadvantages, Supervised Learning
Unsupervised Learning Reinforcement Learning.
Module 2: Bayesian Learning [5L] Probability Basics, Bayes Theorem, Naive Bayes Classifier, Gaussian Naive Bayes Classifier, Bayesian Networks.
Module 3: Artificial Neural Network [5L] Perceptron Learning Algorithm: Delta Rule and Gradient Descent. Multi-layer Perceptron Learning:
Backpropagation and Stochastic Gradient Descent. Hypotheses Space, Inductive Bias and Convergence. Variants Structures of Neural Network.
Module 4: Support Vector Machines [4L] Decision Boundary and Support Vector: Optimization and Primal-Dual Problem. Extension to SVM: Soft Margin
and Non-linear Decision Boundary. Kernel Functions and Radial Basis Functions (detailed later).
Module 5: Linear Models and Regression [3L] Linear Regression. Linear Classification. Logistic Regression. Non-linear Transformation.
Module 6: Decision Tree Learning [4L] Decision Tree Representation and Learning Algorithm (ID3). Attribute Selection using Entropy Measures and
Gains. Hypotheses Space and Inductive bias.
Module 7: Clustering [5L] Partitional Clustering and Hierarchical Clustering. Cluster Types, Attributes and Salient Features. k-Means, k-Nearest
Neighbour (kNN) Classifier. Hierarchical and Density-based Clustering Algorithms. Inter and Intra Clustering Similarity, Cohesion and Separation.
Module 8: Some Other Learning Concept [1L] Active Learning, Deep Learning, Transfer Learning.
Books
TEXT BOOKS:
1. Christopher Bishop. Pattern Recognition and Machine Learning. 2e.
2. The Elements of Statistical Learning Data Mining, Inference, and Prediction by T. Hastie, R. Tibshirani, J. Friedman
REFERENCES:
1. Machine Learning, Tom Mitchell, McGraw Hill, 1997
2. Introduction to Machine Learning, Third edition, Ethem Alpaydin, The MIT Press, September 2014.
3. Understanding Machine Learning: From Theory to Algorithms, First Edition by Shwartz, Shai Ben-David, Cambridge University
Press, 2014.
4. Learning From Data, First Edition by Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin, AML Book, 2012.
Evaluation
Continuous (40):
1. Attendance : Min 75%
2. Class Test I (Open book): 5
3. Mid term : 15
4. Class Test II (Open Book): 5
5. Term Project : 15
End Term (60):
Prerequisites
1. Number System
2. Linear Algebra
3. Probability Theory
Quick Recap : Number System
Source: https://en.wikipedia.org/wiki/Natural_number
Cont.…
1. Complex Numbers (ℂ): C = A + iB [A real, B imaginary and i² = −1]; C' = A − iB (complex conjugate)
2. Real Numbers (ℝ)
3. Rational Numbers (ℚ): any number x that can be represented as a/b, where a and b are both integers and b ≠ 0. Example: 1.5 = 3/2. Try √2, √3: are they rational?
4. Integers (ℤ): whole numbers without fractions; can be +ve as well as −ve. Example: 3, 59, −113
5. Natural Numbers (ℕ): +ve integers; includes 0 according to ISO 80000-2.
Recap on Linear Algebra
▪Linear Algebra : Vector + Linear Functions
▪Vector: means carrier: List of numbers
▪Things we can do addition and multiplication
▪ Example: Numbers, polynomials, power series,
Vector in Space – n vectors
▪ n no of vectors with an large number of components
▪ an => the nth component of the vector not a multiplies n times
▪ Set of all n vectors =>
▪ Zero vector : all components are zero
Transpose of Vector
• Converts column to row or row to column
Vector Conversion
• Consider two vectors x and y with sizes m and n respectively. Can x be converted to y?
Linear Function
▪Let X, Y are two vectors, c is a number, F is a linear function
▪F(X) and F(Y) are also vector
▪Additivity
▪ F(X+Y)= F(X)+F(Y)
▪Homogeneity
▪ F(cX) = c F(X)
▪Linear: If the function F obeys Additivity and Homogeneity then we call F as Linear
▪ Consider F(x) = x². Is F(x + x) = F(x) + F(x)? Test with a value… Is it linear?
Matrix
▪ Matrix: a 2D array of numbers; carries the information of a linear function
▪ A gray image of Kalu is a matrix, where each element is the gray value of one pixel
Tensor.…
• n dimensional vector space
•Generally more than 3…
•Consider a colour image – each pixel -> 24 bits
• Red Component (8 bits)
• Green Component (8 bits)
• Blue Component (8 bits)
• Colour Image of KALU :
Dot Product of Vectors
▪ Two vectors u and v
▪ Dot product: u·v = u1v1 + u2v2 + … + unvn
Probability
• Def: How likely an event will occur
• Toss a Coin : Space ={H, T},
•Roll a Dice : Space={1,2,3,4,5,6}
Cont…
▪ Problem: Flip three fair coins one by one. What is the probability of at least two heads?
▪ Sample space S = {TTT, TTH, THT, THH, HTT, HTH, HHT, HHH} (can map with digital logic?)
▪ Total outcomes = 8
▪ At least two heads: M = {THH, HTH, HHT, HHH}, so |M| = 4
▪ Probability of at least two heads = 4/8 = 1/2
▪ Problem: 2 fair coins are flipped and 2 dice are rolled independently. What is the probability of {HH66}?
▪ |S| = 2 × 2 × 6 × 6 = 144, so the probability = 1/144
▪ Independently, P(H) = 1/2 and P(6) = 1/6
▪ P{HH66} = P(H) × P(H) × P(6) × P(6)   (independent events)
Conditional Probability
▪ two events A and B, where occurrence of A depends on the occurrence of B, P(B)>0 then conditional probability
▪ Both A and B will occur
▪ Example: Consider the case of flipping three coins: S = {TTT, TTH, THT, THH, HTT, HTH, HHT, HHH}
▪ What is the probability that the first coin is a head, given that exactly two of the three coins are heads?
▪ A = first coin head, B = exactly two coins are heads
▪ P(A) = P({HTT, HTH, HHT, HHH}) = 4/8 = 1/2
▪ P(B) = P({THH, HTH, HHT}) = 3/8
▪ P(A∩B) = P({HTH, HHT}) = 2/8
▪ P(A|B) = P(A∩B)/P(B) = (2/8)/(3/8) = 2/3
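A minimal Python sketch that enumerates the eight equally likely outcomes and verifies the conditional probability above.

```python
# Minimal sketch: P(first coin is H | exactly two of three coins are H) = 2/3.
from itertools import product

space = list(product("HT", repeat=3))             # 8 equally likely outcomes
B = [s for s in space if s.count("H") == 2]       # exactly two heads
A_and_B = [s for s in B if s[0] == "H"]           # ...and the first coin is a head
print(len(A_and_B) / len(B))                      # 0.666... = 2/3
```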
Law of Total Probability, conditioned
▪ Consider a sample space S partitioned into A1, A2, …, An.
▪ If B is an event, then P(B) = Σi P(B|Ai) P(Ai)
▪ Problem: Consider a train that has 40% AC coaches, while the rest are non-AC. Suppose 50% of AC coaches have lower berths, while only 30% of non-AC coaches do. A berth is picked at random. What is the probability that it is a lower berth?
▪ Let A1 = the coach is AC, A2 = the coach is non-AC, B = the berth is lower
▪ P(A1) = 0.4, P(A2) = 0.6, P(B|A1) = 0.5, P(B|A2) = 0.3
▪ P(B) = P(B|A1)P(A1) + P(B|A2)P(A2) = 0.5 × 0.4 + 0.3 × 0.6 = 0.38
Bayes Theorem
▪ For two events A and B with P(B) > 0:  P(A|B) = P(B|A) P(A) / P(B)
▪ Example: Consider two gift boxes, G1 = {5 red pens, 10 green pens} and G2 = {4 red pens, 6 green pens}. A box is picked at random and a pen is drawn; it turns out to be green. What is the probability that it was selected from G1?
▪ P(A) = probability of picking G1 = 1/2
▪ P(B) = probability of a green pen = (1/2)(10/15) + (1/2)(6/10) = 19/30
▪ P(A|B) = P(B|A) P(A) / P(B) = (10/15)(1/2) / (19/30) = 10/19
Dr. Ranjan Maity
Assistant Professor
Central Institute of Technology Kokrajhar
r.maity@cit.ac.in
Genesis
Alan Turing: I believe that in about fifty
years’ time it will be possible to programme
computers, with a storage capacity of about 10⁹,
to make them play the imitation game so well
that an average interrogator will not have more
than 70 percent chance of making the right
identification after five minutes of questioning.
… I believe that at the end of the century the use
of words and general educated opinion will have
altered so much that one will be able to speak of
machines thinking without expecting to be
contradicted.
Machine Learning, UCSE603 https://plato.stanford.edu/entries/turing-test/
Turing Test
Who is the machine? (Machine, Interrogator, Human)
Cont.…
1952: Arthur Samuel - Game of Checkers
1957: Frank Rosenblatt – neural network – simulates human brain
1967: Cover and Hart – nearest neighbor – pattern recognition
1997: Deep Blue – beats Garry Kasparov
2006: Hinton – Deep learning
Used to identify text and objects in images/videos
2011: IBM Watson wins Jeopardy, an American Quiz Show
2014: DeepFace – used to identify humans
Cont…
Auto Pilot – Tesla
ADAS (Advance Driver
Assistance System)
More Applications of M/L
Types
Supervised
Unsupervised
Semi-supervised
Reinforcement
Supervised Learning
Consider the learnings
of alphabets – From
where you have
learned ??
Training Data with
Outputs
Learning Japanese
alphabets
https://www.busuu.com/en/japanese/alphabet
Cont…
Consider the images –
Apple (R, G) – class 1
Lemon & Orange –class 2
Watermelon – class 3
Banana – class 4
Model will be developed
based on the data
Test it with test set
https://www.diegocalvo.es/en/supervised-learning/
Mathematical Formulation
Two types –
Classification
Given a set of labelled (x1,y1), (x2,y2), (x3, y3)……(xn,yn)
Learning a function f(x) , which will predict y (in terms of class) for a given x
Regression
Given a set of labelled (x1,y1), (x2,y2), (x3, y3)……(xn,yn)
Learning a function f(x) , which will predict y (in terms of values) for a given x
Classification
(Scatterplot of weight vs. height: the learned function f(x) separates the Cat class from the Tiger class.)
Regression
Stock market
Unsupervised Learning
No labelled data
Given x1, x2,
x3,…….,xn
O/P – Structures of xi
Clustering
Reinforcement Learning
Def: learning the optimal
behavior in an environment
to obtain maximum reward
(https://www.synopsys.com/ai/what-is-reinforcement-learning.html)
Agent
Environment
Policy
Reward
Norms of Vector
Length/magnitude of a vector.
A norm is a function f : u → ℝ such that:
1. f is a function from the vector space u to the real numbers ℝ
2. f(x) ≥ 0, ∀x ∈ u
3. Triangle inequality: f(x + y) ≤ f(x) + f(y), ∀x, y ∈ u
4. Positive homogeneity: f(αx) = |α| f(x), ∀x ∈ u, ∀α ∈ ℂ
5. f(x) = 0 iff x = 0
Cont.…
1st norm: ‖v‖₁ = |v₁| + |v₂| + … + |vₙ| = Σᵢ |vᵢ|
2nd norm / Euclidean norm: ‖v‖₂ = √(v₁² + v₂² + … + vₙ²) = √(Σᵢ vᵢ²)
Max norm (∞ norm): ‖v‖∞ = max(|v₁|, |v₂|, …, |vₙ|)
Frobenius norm (for a matrix): ‖M‖F = √(Σᵢ Σⱼ Mᵢⱼ²)
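A minimal NumPy sketch of the norms listed above.

```python
# Minimal sketch: vector and matrix norms with NumPy.
import numpy as np

v = np.array([3.0, -4.0, 1.0])
print(np.linalg.norm(v, 1))        # 1-norm: |3| + |-4| + |1| = 8
print(np.linalg.norm(v, 2))        # 2-norm (Euclidean): sqrt(9 + 16 + 1)
print(np.linalg.norm(v, np.inf))   # max (infinity) norm: 4

M = np.array([[1.0, 2.0], [3.0, 4.0]])
print(np.linalg.norm(M, 'fro'))    # Frobenius norm: sqrt(1 + 4 + 9 + 16)
```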
Problems
Linear Combination
Vector v is a linear combination of vectors v1, v2, …, vn if
v = α1·v1 + α2·v2 + … + αn·vn
Example: with v1 = [5, −2]ᵀ and v2 = [−4, 4]ᵀ,
v = 2v2 + v1 = [−8, 8]ᵀ + [5, −2]ᵀ = [−3, 6]ᵀ
Span and Linear Independence
Span: the set of all vectors obtainable from a given set of vectors by linear combinations.
Linear independence: a set of vectors v1, v2, …, vn is linearly independent if no vi can be represented as a linear combination of the others.
Check: v1 = [0, 5]ᵀ, v2 = [1, 3]ᵀ
Trace of a Matrix
Sum of all diagonal elements.
v1 = [1 3 4; 2 7 9; 5 6 8],  T(v1) = 1 + 7 + 8 = 16
Properties:
1. T(u + v) = T(u) + T(v)
2. T(uv) = T(vu)
3. T(u) = T(uᵀ)
Compute the trace of v1 = [1 2 6 7 9; 3 4 5 8 10] ?
Determinant with Volume
For M = [a b c; d e f; g h i],  |M| = a(ei − hf) − b(di − fg) + c(dh − eg)
General cofactor formula: |M| = Σⱼ₌₁..ₙ (−1)ⁱ⁺ʲ Mᵢⱼ |Mᵢⱼ|, where |Mᵢⱼ| is the minor of entry Mᵢⱼ
Consider the two vectors u = (3, 4) and v = (3, −4) and compute the area they span (the determinant of [u v]).
Naïve Bayes Classifier
Eamonn Keogh
UCR
This is a high-level overview only. For details, see:
Pattern Recognition and Machine Learning, Christopher Bishop, Springer-Verlag, 2006, or
Pattern Classification by R. O. Duda, P. E. Hart, D. Stork, Wiley and Sons.
(Portrait: Thomas Bayes, 1702–1761)
We will start off with a visual intuition, before looking at the math…
(Scatterplot: Antenna Length vs. Abdomen Length, both 1–10, for Grasshoppers and Katydids.)
Remember this example?
Let’s get lots more data…
With a lot of data, we can build a histogram. Let us
just build one for “Antenna Length” for now…
(Histogram of Antenna Length, 1–10, for the two classes.)
Katydids
Grasshoppers
We can leave the
histograms as they are,
or we can summarize
them with two normal
distributions.
Let us use two normal distributions for ease of visualization in the following slides…
• We want to classify an insect we have found. Its antennae are 3 units long.
How can we classify it?
• We can just ask ourselves, give the distributions of antennae lengths we have
seen, is it more probable that our insect is a Grasshopper or a Katydid.
• There is a formal way to discuss the most probable classification…
p(cj | d) = probability of class cj, given that we have observed d

Antennae length is 3:
P(Grasshopper | 3) = 10 / (10 + 2) = 0.833
P(Katydid | 3) = 2 / (10 + 2) = 0.166

Antennae length is 7:
P(Grasshopper | 7) = 3 / (3 + 9) = 0.250
P(Katydid | 7) = 9 / (3 + 9) = 0.750

Antennae length is 5:
P(Grasshopper | 5) = 6 / (6 + 6) = 0.500
P(Katydid | 5) = 6 / (6 + 6) = 0.500
Bayes Classifiers
That was a visual intuition for a simple case of the Bayes classifier,
also called:
• Idiot Bayes
• Naïve Bayes
• Simple Bayes
We are about to see some of the mathematical formalisms, and
more examples, but keep in mind the basic idea.
Find out the probability of the previously unseen instance
belonging to each class, then simply pick the most probable class.
Bayes Classifiers
• Bayesian classifiers use Bayes theorem, which says
p(cj | d ) = p(d | cj ) p(cj)
p(d)
• p(cj | d) = probability of instance d being in class cj,
This is what we are trying to compute
• p(d | cj) = probability of generating instance d given class cj,
We can imagine that being in class cj, causes you to have feature d
with some probability
• p(cj) = probability of occurrence of class cj,
This is just how frequent the class cj, is in our database
• p(d) = probability of instance d occurring
This can actually be ignored, since it is the same for all classes
Assume that we have two classes: c1 = male and c2 = female.
(Note: "Drew" can be a male or a female name, e.g. Drew Barrymore, Drew Carey.)
We have a person whose sex we do not know, say "drew" or d.
Classifying drew as male or female is equivalent to asking which is more probable, i.e. which is greater: p(male | drew) or p(female | drew).
p(male | drew) = p(drew | male) p(male) / p(drew)
where p(drew | male) is the probability of being called "drew" given that you are a male,
p(male) is the probability of being a male, and
p(drew) is the probability of being named "drew" (actually irrelevant, since it is the same for all classes).
This is Officer Drew (who arrested me in
1997). Is Officer Drew a Male or Female?
Luckily, we have a small database with names and sex. We can use it to apply Bayes rule…

Name | Sex
Drew | Male
Claudia | Female
Drew | Female
Drew | Female
Alberto | Male
Karin | Female
Nina | Female
Sergio | Male

p(cj | d) = p(d | cj) p(cj) / p(d)
p(male | drew) = (1/3 × 3/8) / (3/8) = 0.125 / (3/8)
p(female | drew) = (2/5 × 5/8) / (3/8) = 0.250 / (3/8)
Officer Drew is more likely to be a Female.
Officer Drew IS a female!
So far we have only considered Bayes classification when we have one attribute (the "antennae length", or the "name"). But we may have many features. How do we use all the features?
Name Over 170CM Eye Hair length Sex
Drew No Blue Short Male
Claudia Yes Brown Long Female
Drew No Blue Long Female
Drew No Blue Long Female
Alberto Yes Brown Short Male
Karin No Blue Long Female
Nina Yes Brown Short Female
Sergio Yes Blue Long Male
• To simplify the task, naïve Bayesian classifiers assume
attributes have independent distributions, and thereby estimate
p(d|cj) = p(d1|cj) * p(d2|cj) * ….* p(dn|cj)
The probability of
class cj generating
instance d, equals….
The probability of class cj
generating the observed
value for feature 1,
multiplied by..
The probability of class cj
generating the observed
value for feature 2,
multiplied by..
• To simplify the task, naïve Bayesian classifiers
assume attributes have independent distributions, and
thereby estimate
p(d|cj) = p(d1|cj) * p(d2|cj) * ….* p(dn|cj)
p(officer drew|cj) = p(over_170cm = yes|cj) * p(eye =blue|cj) * ….
Officer Drew
is blue-eyed, p(officer drew| Female) = 2/5 * 3/5 * ….
over 170cm
tall, and has p(officer drew| Male) = 2/3 * 2/3 * ….
long hair
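A minimal Python sketch (not from the original slides) that recomputes the Officer Drew example from the small table above; p(d) is ignored because it is the same for both classes.

```python
# Minimal sketch: naive Bayes scores for Officer Drew (>170cm, blue eyes, long hair).
rows = [  # (name, over_170cm, eye, hair_length, sex) - the 8-row table above
    ("Drew",    "No",  "Blue",  "Short", "Male"),
    ("Claudia", "Yes", "Brown", "Long",  "Female"),
    ("Drew",    "No",  "Blue",  "Long",  "Female"),
    ("Drew",    "No",  "Blue",  "Long",  "Female"),
    ("Alberto", "Yes", "Brown", "Short", "Male"),
    ("Karin",   "No",  "Blue",  "Long",  "Female"),
    ("Nina",    "Yes", "Brown", "Short", "Female"),
    ("Sergio",  "Yes", "Blue",  "Long",  "Male"),
]

def cond_prob(col, value, sex):
    """p(feature = value | sex), estimated by counting rows of that sex."""
    members = [r for r in rows if r[4] == sex]
    return sum(r[col] == value for r in members) / len(members)

observed = {1: "Yes", 2: "Blue", 3: "Long"}            # Officer Drew's feature values
for sex in ("Male", "Female"):
    prior = sum(r[4] == sex for r in rows) / len(rows)  # p(cj)
    score = prior
    for col, value in observed.items():
        score *= cond_prob(col, value, sex)             # p(di | cj)
    print(sex, score)
# Female wins: (5/8)(2/5)(3/5)(4/5) > (3/8)(2/3)(2/3)(1/3)
```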
The naive Bayes classifier is often represented as this type of graph…
cj
Note the direction of the
arrows, which state that
each class causes certain
features, with a certain
probability
p(d1|cj) p(d2|cj) … p(dn|cj)
Naïve Bayes is fast and space efficient.
We can look up all the probabilities
with a single scan of the database and
store them in a (small) table…
p(d1|cj) p(d2|cj) … p(dn|cj)
p(Over 190cm | Sex):  Male: Yes 0.15, No 0.85;  Female: Yes 0.01, No 0.99
p(Long Hair | Sex):   Male: Yes 0.05, No 0.95;  Female: Yes 0.70, No 0.30
Naïve Bayes is NOT sensitive to irrelevant features...
Suppose we are trying to classify a person's sex based on several features, including eye color. (Of course, eye color is completely irrelevant to a person's gender.)
p(Jessica |cj) = p(eye = brown|cj) * p( wears_dress = yes|cj) * ….
p(Jessica | Female) = 9,000/10,000 * 9,975/10,000 * ….
p(Jessica | Male) = 9,001/10,000 * 2/10,000 * ….
Almost the same!
However, this assumes that we have good enough estimates of
the probabilities, so the more data the better.
An obvious point. I have used a
simple two class problem, and
cj
two possible values for each
example, for my previous
examples. However we can have
an arbitrary number of classes, or
feature values
p(d1|cj) p(d2|cj) … p(dn|cj)
p(Mass > 10kg | Animal):  Cat: Yes 0.15, No 0.85;  Dog: Yes 0.91, No 0.09;  Pig: Yes 0.99, No 0.01
p(Color | Animal):  Cat: Black 0.33, White 0.23, Brown 0.44;  Dog: Black 0.97, White 0.03;  Pig: Brown 0.90, Black 0.04, White 0.01
Problem! Naïve Bayesian
p(d|cj)
Classifier
Naïve Bayes assumes
independence of
features…
p(d1|cj) p(d2|cj) p(dn|cj)
p(Over 6 foot | Sex):     Male: Yes 0.15, No 0.85;  Female: Yes 0.01, No 0.99
p(Over 200 pounds | Sex): Male: Yes 0.11, No 0.80;  Female: Yes 0.05, No 0.95
Solution Naïve Bayesian
p(d|cj)
Classifier
Consider the
relationships between
attributes…
p(d1|cj) p(d2|cj) p(dn|cj)
p(Over 6 foot | Sex):  Male: Yes 0.15, No 0.85;  Female: Yes 0.01, No 0.99
p(Over 200 pounds, conditioned jointly on height), for Male:  Yes and Over 6 foot 0.11;  No and Over 6 foot 0.59;  Yes and NOT Over 6 foot 0.05;  No and NOT Over 6 foot 0.35
Solution Naïve Bayesian
p(d|cj)
Classifier
Consider the
relationships between
attributes…
p(d1|cj) p(d2|cj) p(dn|cj)
But how do we find the set of connecting arcs??
The Naïve Bayesian Classifier has a piecewise quadratic decision boundary
Grasshoppers
Katydids
Ants
Adapted from slide by Ricardo Gutierrez-Osuna
One second of audio from the laser sensor. Only Bombus impatiens (Common Eastern Bumble Bee) is in the insectary.
(Waveform: background noise, then the bee begins to cross the laser.)
(Single-Sided Amplitude Spectrum of Y(t), 0–1000 Hz: 60 Hz interference, a peak at 197 Hz, and its harmonics.)
(Histograms of Wing Beat Frequency, 0–700 Hz, for the two species below.)
Anopheles stephensi: Female, mean = 475, Std = 30.  Aedes aegyptii: Female, mean = 567, Std = 43.  (The value 517 is marked between the two distributions.)
If I see an insect with a wingbeat frequency of 500, what is it?
P(Anopheles | wingbeat = 500) = (1 / (√(2π) · 30)) · e^(−(500 − 475)² / (2 · 30²))
What is the error rate? 12.2% of the area under the pink curve and 8.02% of the area under the red curve.
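A minimal scipy sketch of this comparison: evaluate the two class-conditional normal densities at 500 Hz.

```python
# Minimal sketch: which species is more likely at a wingbeat frequency of 500 Hz?
from scipy.stats import norm

p_anopheles = norm.pdf(500, loc=475, scale=30)   # Anopheles stephensi: mean 475, std 30
p_aedes     = norm.pdf(500, loc=567, scale=43)   # Aedes aegyptii: mean 567, std 43
print(p_anopheles, p_aedes)
# The Anopheles density is larger, so with equal priors the insect is classified
# as Anopheles.
```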
Can we get more features?
Circadian Features
Aedes aegypti (yellow fever mosquito)
(Circadian activity histogram for Aedes aegypti: activity peaks around dawn and dusk; time axis runs Midnight, Noon, Midnight.)
Suppose I observe an insect with a wingbeat frequency of 420 Hz. What is it?
Suppose I observe an insect with a wingbeat frequency of 420 Hz at 11:00am. What is it?
P(Culex | [420Hz, 11:00am]) = (6/(6 + 6 + 0)) × (2/(2 + 4 + 3)) = 0.111
P(Anopheles | [420Hz, 11:00am]) = (6/(6 + 6 + 0)) × (4/(2 + 4 + 3)) = 0.222
P(Aedes | [420Hz, 11:00am]) = (0/(6 + 6 + 0)) × (3/(2 + 4 + 3)) = 0.000
Which of the "Pigeon Problems" can be solved by a decision tree?
(Scatterplot panels omitted; axes run 1–10 and 10–100.)
Advantages/Disadvantages of Naïve Bayes
• Advantages:
– Fast to train (single scan). Fast to classify
– Not sensitive to irrelevant features
– Handles real and discrete data
– Handles streaming data well
• Disadvantages:
– Assumes independence of features
A Tutorial on
Inference and Learning
in Bayesian Networks
Irina Rish
IBM T.J.Watson Research Center
rish@us.ibm.com
http://www.research.ibm.com/people/r/rish/
Outline
Motivation: learning probabilistic models from data
Representation: Bayesian network models
Probabilistic inference in Bayesian Networks
Exact inference
Approximate inference
Learning Bayesian Networks
Learning parameters
Learning graph structure (model selection)
Summary
Bayesian Networks
Structured, graphical representation of probabilistic
relationships between several random variables
Explicit representation of conditional independencies
Missing arcs encode conditional independence
Efficient representation of joint PDF P(X)
Generative model (not just discriminative): allows
arbitrary queries to be answered, e.g.
P (lung cancer=yes | smoking=no, positive X-ray=yes ) = ?
Bayesian Network: BN = (G, Θ)
G – directed acyclic graph (DAG): nodes are random variables, edges are direct dependencies
Θ – set of parameters in all conditional probability distributions (CPDs)
Example graph: Smoking (S) → lung Cancer (C) and Bronchitis (B); C and S → X-ray (X); C and B → Dyspnoea (D)
CPDs: P(S), P(C|S), P(B|S), P(X|C,S), P(D|C,B)
CPD of node X: P(X | parents(X)). For example, the CPD P(D|C,B):

C B | D=0 | D=1
0 0 | 0.1 | 0.9
0 1 | 0.7 | 0.3
1 0 | 0.8 | 0.2
1 1 | 0.9 | 0.1

Compact representation of the joint distribution in product form (chain rule):
P(S, C, B, X, D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)
1 + 2 + 2 + 4 + 4 = 13 parameters instead of 2⁵
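A minimal Python sketch of this product-form representation; apart from P(D|C,B), which is the CPD table shown above, the CPD numbers are made up for illustration.

```python
# Minimal sketch: a BN over S, C, B, X, D stored as small conditional tables.
p_S = 0.3                                        # P(S=1) (hypothetical)
p_C = {1: 0.4, 0: 0.1}                           # P(C=1 | S=s) (hypothetical)
p_B = {1: 0.5, 0: 0.2}                           # P(B=1 | S=s) (hypothetical)
p_X = {(1, 1): 0.9, (1, 0): 0.8, (0, 1): 0.3, (0, 0): 0.1}   # P(X=1 | C,S) (hypothetical)
p_D = {(0, 0): 0.9, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.1}   # P(D=1 | C,B), CPD above

def bern(p_one, v):                              # P(value v) for a binary variable
    return p_one if v == 1 else 1.0 - p_one

def joint(s, c, b, x, d):
    """Chain rule: P(S,C,B,X,D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)."""
    return (bern(p_S, s) * bern(p_C[s], c) * bern(p_B[s], b)
            * bern(p_X[(c, s)], x) * bern(p_D[(c, b)], d))

# The tables hold 1 + 2 + 2 + 4 + 4 = 13 numbers, yet define all 2**5 joint entries.
total = sum(joint(s, c, b, x, d)
            for s in (0, 1) for c in (0, 1) for b in (0, 1)
            for x in (0, 1) for d in (0, 1))
print(total)                                     # 1.0, as a proper joint must sum to
```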
Example: Printer Troubleshooting
Application
Output OK
Spool Print
Process OK Spooling On
Spooled GDI Data
Data OK Input OK
Local Disk Uncorrupted
Space Adequate Driver
Correct GDI Data
Driver Output OK Correct
Driver
Settings
Correct
Printer Print
Selected Data OK
Network Net/Local
Up Printing
Correct
Correct Net PC to Printer Local Local Port
Printer Path Path OK Transport OK Path OK
Local Cable
Connected
Net Cable Paper
Connected Loaded
Printer On Printer
and Online Data OK
Printer Memory
Print Output Adequate
OK
26 variables. Instead of 2²⁶ parameters we get 99 = 17×1 + 1×2¹ + 2×2² + 3×2³ + 3×2⁴
[Heckerman, 95]
“Moral” graph of a BN
Moralization algorithm:
1. Connect ("marry") the parents of each node.
2. Drop the directionality of the edges.
The resulting undirected graph is called the "moral" graph of the BN.
(Example network: Smoking; lung Cancer and Bronchitis; X-ray and Dyspnoea, with CPDs P(S), P(C|S), P(B|S), P(X|C,S), P(D|C,B).)
Interpretation:
every pair of nodes that occur together in a CPD is connected by an edge in the moral graph.
The CPD for X and its k parents (called the "family") is represented by a clique of size (k + 1) in the moral graph, and contains dᵏ(d − 1) probability parameters, where d is the number of values each variable can have (domain size).
Conditional Independence in BNs:
Three types of connections
Serial: Visit to Asia (A) → Tuberculosis (T) → Chest X-ray (X). Knowing T makes A and X independent (intermediate cause).
Diverging: Smoking (S) → Lung Cancer (L) and Bronchitis (B). Knowing S makes L and B independent (common cause).
Converging: Lung Cancer (L) and Bronchitis (B) → Dyspnoea (D), with Running Marathon (M) a descendant of D. NOT knowing D or M makes L and B independent (common effect).
d-separation
Nodes X and Y are d-separated if on any (undirected) path between X and
Y there is some variable Z such that is either
Z is in a serial or diverging connection and Z is known, or
Z is in a converging connection and neither Z nor any of Z’s descendants are
known
Nodes X and Y are called d-connected if they are not d-separated
(there exists an undirected path between X and Y not d-
separated by any node or a set of nodes)
If nodes X and Y are d-separated by Z, then X and Y are
conditionally independent given Z (see Pearl, 1988)
Independence Relations in BN
A variable (node) is conditionally independent of its
non-descendants given its parents
Smoking
Lung Cancer
Bronchitis
Given Bronchitis and
Lung Cancer,
Chest X-ray Dyspnoea is independent
Dyspnoea
of X-ray (but may depend
Running on Running Marathon)
Marathon
Markov Blanket
A node is conditionally independent of ALL other nodes
given its Markov blanket, i.e. its parents, children, and
“spouses’’ (parents of common children)
(Proof left as a homework problem ☺)
Age Gender
Exposure to Toxins Smoking
Diet Cancer
Serum Calcium Lung Tumor
[Breese & Koller, 97]
What are BNs useful for?
Diagnosis: P(cause|symptom)=? cause
Prediction: P(symptom|cause)=?
C1 C2
Classification: max P(class|data)
class
Decision-making (given a cost function) symptom
Medicine
Speech Bio-
informatics
recognition
Text
Classification Computer
Stock market
troubleshooting
Application Examples
APRI system developed at AT&T Bell Labs
learns & uses Bayesian networks from data to identify customers
liable to default on bill payments
NASA Vista system
predict failures in propulsion systems
considers time criticality & suggests highest utility action
dynamically decide what information to show
Application Examples
Office Assistant in MS Office 97/ MS Office 95
Extension of Answer wizard
uses naïve Bayesian networks
help based on past experience (keyboard/mouse use) and task user is doing currently
This is the “smiley face” you get in your MS Office applications
Microsoft Pregnancy and Child-Care
Available on MSN in Health section
Frequently occurring children’s symptoms are linked to expert modules that repeatedly
ask parents relevant questions
Asks next best question based on provided information
Presents articles that are deemed relevant based on information provided
IBM’s systems management applications
Machine Learning for Systems @ Watson
(Hellerstein, Jayram, Rish (2000)) (Rish, Brodie, Ma (2001))
End-user transaction
Fault diagnosis using probes
recognition Software or hardware
components X
2
X 3
Issues:
Remote Procedure Calls (RPCs) Efficiency (scalability)
X 1
Missing data/noise:
R2 R5 R3 R2 R1 R2 R5 sensitivity analysis
“Adaptive” probing:
selecting “most-
informative” probes
on-line
Transaction1 Transaction2 T4
learning/model
T1 updates
T3
BUY? OPEN_DB? Probe outcomes T 2 on-line diagnosis
SELL? SEARCH?  Goal: finding the most likely diagnosis
Pattern discovery, classification, diagnosis and prediction
Probabilistic Inference Tasks
Belief updating:
BEL(Xi) = P(Xi = xi | evidence)
Finding the most probable explanation (MPE):
x* = argmax_x P(x, e)
Finding the maximum a-posteriori hypothesis, for hypothesis variables A ⊆ X:
(a1*, …, ak*) = argmax_a Σ_{X\A} P(x, e)
Finding the maximum-expected-utility (MEU) decision, for decision variables D ⊆ X and utility function U(x):
(d1*, …, dk*) = argmax_d Σ_{X\D} P(x, e) U(x)
Belief Updating Task: Example
Network: Smoking → lung Cancer and Bronchitis; (Cancer, Smoking) → X-ray; (Cancer, Bronchitis) → Dyspnoea
P(smoking | dyspnoea = yes) = ?
Belief updating: find P(X | evidence)
P(s | d=1) = P(s, d=1) / P(d=1) ∝ P(s, d=1) = Σ over {d=1, b, x, c} of P(s) P(c|s) P(b|s) P(x|c,s) P(d|c,b)
= P(s) Σ_b P(b|s) Σ_x Σ_c P(c|s) P(x|c,s) P(d=1|c,b)
(working on the "moral" graph)
Variable Elimination
Eliminating variables one by one creates intermediate functions such as f(s, d, b, x); for this graph W* = 4.
Complexity: O(n exp(w*)), where w* is the "induced width" (max induced clique size).
Efficient inference: variable orderings, conditioning, approximations
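A minimal Python sketch of this belief-updating query by brute-force enumeration (the CPD numbers are the same hypothetical toy values used in the earlier sketch); it answers P(S | D=1) by summing the factored joint over C, B and X, which is exactly what variable elimination organizes more cleverly.

```python
# Minimal sketch: P(S | D=1) by summing out C, B, X from the factored joint.
p_S = 0.3
p_C = {1: 0.4, 0: 0.1}                                        # P(C=1 | S)
p_B = {1: 0.5, 0: 0.2}                                        # P(B=1 | S)
p_X = {(1, 1): 0.9, (1, 0): 0.8, (0, 1): 0.3, (0, 0): 0.1}    # P(X=1 | C, S)
p_D = {(0, 0): 0.9, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.1}    # P(D=1 | C, B)

def bern(p_one, v):
    return p_one if v == 1 else 1.0 - p_one

def joint(s, c, b, x, d):
    return (bern(p_S, s) * bern(p_C[s], c) * bern(p_B[s], b)
            * bern(p_X[(c, s)], x) * bern(p_D[(c, b)], d))

unnorm = {s: sum(joint(s, c, b, x, 1)                 # evidence: d = 1
                 for c in (0, 1) for b in (0, 1) for x in (0, 1))
          for s in (0, 1)}
z = sum(unnorm.values())
print({s: v / z for s, v in unnorm.items()})          # P(S | D=1)
# Enumeration is exponential in the number of variables; variable elimination
# exploits the factorisation to keep the work at O(n exp(w*)).
```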
Variable elimination algorithms
(also called “bucket-elimination”)
Belief updating: VE-algorithm elim-bel (Dechter 1996)
Elimination operator: Σ_b Π (sum variable b out of the product of all factors in its bucket)
bucket B: P(b|a), P(d|b,a), P(e|b,c)  →  h^B(a, d, c, e)
bucket C: P(c|a), h^B(a, d, c, e)     →  h^C(a, d, e)
bucket D: h^C(a, d, e)                →  h^D(a, e)
bucket E: e = 0, h^D(a, e)            →  h^E(a)
bucket A: P(a), h^E(a)                →  P(a | e=0)
W* = 4, the "induced width" (max clique size of the induced graph)
Finding MPE = max_x P(x)
VE-algorithm elim-mpe (Dechter 1996): Σ is replaced by max:
MPE = max_{a,e,d,c,b} P(a) P(c|a) P(b|a) P(d|a,b) P(e|b,c)
Elimination operator: max_b Π (maximize over variable b in the product of all factors in its bucket)
bucket B: P(b|a), P(d|b,a), P(e|b,c)  →  h^B(a, d, c, e)
bucket C: P(c|a), h^B(a, d, c, e)     →  h^C(a, d, e)
bucket D: h^C(a, d, e)                →  h^D(a, e)
bucket E: e = 0, h^D(a, e)            →  h^E(a)
bucket A: P(a), h^E(a)                →  MPE probability
W* = 4, the "induced width" (max clique size)
Generating the MPE-solution (back-substitution from bucket A up):
1. a' = argmax_a P(a) · h^E(a)
2. e' = 0
3. d' = argmax_d h^C(a', d, e')
4. c' = argmax_c P(c|a') · h^B(a', d', c, e')
5. b' = argmax_b P(b|a') · P(d'|b, a') · P(e'|b, c')
Return (a', b', c', d', e').
Complexity of VE-inference: O(n exp(w*_o)).
The width w_o(X) of a variable X in graph G along the ordering o is the number of nodes preceding X in the ordering and connected to X (its earlier neighbors).
The width w_o of a graph is the maximum width w_o(X) among all nodes.
The induced graph G' along the ordering o is obtained by recursively connecting the earlier neighbors of each node, from the last to the first in the ordering.
The width of the induced graph G' is called the induced width of the graph G (denoted w*_o).
(Example: for the "moral" graph on nodes A, B, C, D, E, one elimination ordering gives induced width w*_o1 = 4 while another gives w*_o2 = 2.)
Ordering is important! But finding min-w* ordering is NP- hard…
Inference is also NP-hard in general case [Cooper].
Learning Bayesian Networks
Combining domain expert <9.7 0.6 8 14 18>
<0.2 1.3 5 ?? ??>
knowledge with data <1.3 2.8 ?? 0 1 >
<?? 5.6 0 10 ??>
……………….
Efficient representation and
inference
Incremental learning: P(H) or
Handling missing data: <1.3 2.8 ?? 0 1 >
Learning causal relationships: S C
Learning tasks: four main cases
Known graph – learn parameters
P(S)
S
Complete data: P(C|S) P(B|S)
parameter estimation (ML, MAP) C B
Incomplete data: P(X|C,S) P(D|C,B)
non-linear parametric X D
optimization (gradient descent, EM)
Unknown graph – learn graph and parameters
Complete data: S S
optimization (search C B C B
in space of graphs)
Incomplete data: X D X D
structural EM,
mixture models
Ĝ = argmax_G Score(G)
Learning Parameters: complete data
(overview)
ML-estimate: max_Θ log P(D | Θ) – decomposable!
For multinomial counts N_{x,paX}:
ML(θ_{x,paX}) = N_{x,paX} / Σ_x N_{x,paX}
MAP-estimate (Bayesian statistics): max_Θ log P(D | Θ) P(Θ)
Conjugate priors – Dirichlet: Dir(θ_{paX} | α_{1,paX}, …, α_{m,paX})
MAP(θ_{x,paX}) = (N_{x,paX} + α_{x,paX}) / (Σ_x N_{x,paX} + Σ_x α_{x,paX})
The α's act as an equivalent sample size (prior knowledge).
Learning Parameters
(details)
Learning Parameters: incomplete data
Non-decomposable marginal likelihood (hidden nodes).
Data may contain missing entries, e.g. <? 0 1 0 1>, <1 1 ? 0 1>, <0 0 0 ? ?>, <? ? 0 ? 1>.
EM-algorithm (iterate until convergence):
Start from initial parameters for the current model (G, Θ).
Expectation: compute EXPECTED counts via inference in the BN,
E_{P(X)}[N_{x,paX}] = Σ_{k=1..N} P(x, paX | y_k, Θ, G)
Maximization: update the parameters (ML, MAP) using the expected counts.
Learning graph structure
Find Ĝ = argmax_G Score(G) – an NP-hard optimization.
Heuristic search:
Complete data – local computations; local greedy search (add, delete, or reverse an edge such as S→B); the K2 algorithm.
Incomplete data (score non-decomposable) – stochastic methods.
Constraint-based methods (PC/IC algorithms): the data impose independence relations (constraints) on the graph structure.
Scoring function: Minimum Description Length (MDL)
Learning ⇔ data compression
MDL(BN | D) = −log P(D | Θ, G) + (log N / 2) |Θ|
            = DL(Data | model) + DL(Model)
Other scores: MDL = −BIC (Bayesian Information Criterion); the Bayesian score (BDe) is asymptotically equivalent to MDL.
Model selection trade-offs
Naïve Bayes – too simple Unrestricted BN – too complex
(less parameters, but bad model) (possible overfitting + complexity)
Class Class
P(f1 | class) P(f 2 | class) P(fn | class) P(f1 | class) P(f 2 | class) P(fn | class)
feature f1 feature f 2 feature f n feature f1 feature f 2 feature f n
Various approximations between the two extremes
TAN:
tree-augmented Naïve Bayes Class
[Friedman et al. 1997]
P(f1 | class) P(f 2 | class) P(fn | class)
Based on Chow-Liu Tree Method
feature f1 feature f 2 feature f n
(CL) for learning trees
[Chow-Liu, 1968]
Tree-structured distributions
A joint probability distribution is tree-structured if it can be written as
P(x) = Π_{i=1..n} P(x_i | x_{j(i)})
where x_{j(i)} is the parent of x_i in the Bayesian network for P(x) (a directed tree).
Example: P(A,B,C,D,E) = P(A) P(B|A) P(C|A) P(D|C) P(E|B) is a tree (with root A); a graph with an (undirected) cycle is not a tree.
A tree requires only [(d − 1) + d(d − 1)(n − 1)] parameters, where d is the domain size.
Moreover, inference in trees is O(n) (linear), since their w* = 1.
Approximations by trees
True distribution P(X) vs. tree-approximation P'(X): same nodes (A, B, C, D, E), but P'(X) keeps only tree edges.
How good is the approximation? Use cross-entropy (KL-divergence):
D(P, P') = Σ_x P(x) log [P(x) / P'(x)]
D(P, P') is non-negative, and D(P, P') = 0 if and only if P coincides with P' (on a set of measure 1).
How to find the best tree-approximation?
Optimal trees: Chow-Liu result
Lemma
Given a joint PDF P(x) and a fixed tree structure T, the best approximation P'(x) (i.e., the P'(x) that minimizes D(P, P')) satisfies
P'(x_i | x_{j(i)}) = P(x_i | x_{j(i)}) for all i = 1, …, n.
Such a P'(x) is called the projection of P(x) on T.
Theorem [Chow and Liu, 1968]
Given a joint PDF P(x), the KL-divergence D(P, P') is minimized by projecting P(x) on a maximum-weight spanning tree (MWST) over the nodes in X, where the weight on the edge (X_i, X_j) is defined by the mutual information measure
I(X_i; X_j) = Σ_{x_i, x_j} P(x_i, x_j) log [ P(x_i, x_j) / (P(x_i) P(x_j)) ]
Note that I(X; Y) = 0 when X and Y are independent, and that I(X; Y) = D(P(x, y), P(x) P(y)).
Proofs

Proof of Lemma:
D(P, P') = − Σ_x P(x) Σ_{i=1..n} log P'(x_i | x_{j(i)}) + Σ_x P(x) log P(x)
         = − Σ_{i=1..n} Σ_x P(x) log P'(x_i | x_{j(i)}) − H(X)
         = − Σ_{i=1..n} Σ_{x_i, x_{j(i)}} P(x_i, x_{j(i)}) log P'(x_i | x_{j(i)}) − H(X)          (1)
         = − Σ_{i=1..n} Σ_{x_{j(i)}} P(x_{j(i)}) Σ_{x_i} P(x_i | x_{j(i)}) log P'(x_i | x_{j(i)}) − H(X)   (2)
A known fact: given P(x), the maximum of Σ_x P(x) log P'(x) is achieved by the choice P'(x) = P(x).
Therefore, for any value of i and x_{j(i)}, the term Σ_{x_i} P(x_i | x_{j(i)}) log P'(x_i | x_{j(i)}) is maximized by choosing P'(x_i | x_{j(i)}) = P(x_i | x_{j(i)}) (and thus the total D(P, P') is minimized), which proves the Lemma.

Proof of Theorem:
Replacing P'(x_i | x_{j(i)}) = P(x_i | x_{j(i)}) in expression (1) yields
D(P, P') = − Σ_{i=1..n} Σ_{x_i, x_{j(i)}} P(x_i, x_{j(i)}) log [P(x_i, x_{j(i)}) / P(x_{j(i)})] − H(X)
         = − Σ_{i=1..n} Σ_{x_i, x_{j(i)}} P(x_i, x_{j(i)}) { log [P(x_i, x_{j(i)}) / (P(x_{j(i)}) P(x_i))] + log P(x_i) } − H(X)
         = − Σ_{i=1..n} I(X_i, X_{j(i)}) − Σ_{i=1..n} Σ_{x_i} P(x_i) log P(x_i) − H(X).
The last two terms are independent of the choice of the tree, and thus D(P, P') is minimized by maximizing the sum of edge weights I(X_i, X_{j(i)}).
Chow-Liu algorithm
[As presented in Pearl, 1988]
1. From the given distribution P(x) (or from data generated by P(x)), compute the pairwise joint distributions P(x_i, x_j) for all i ≠ j.
2. Using the pairwise distributions from step 1, compute the mutual information I(X_i; X_j) for each pair of nodes and assign it as the weight of the corresponding edge (X_i, X_j).
3. Compute the maximum-weight spanning tree (MWST):
   a. Start from the empty tree over the n variables.
   b. Insert the two largest-weight edges.
   c. Find the next largest-weight edge and add it to the tree if no cycle is formed; otherwise, discard the edge and repeat this step.
   d. Repeat step (c) until n − 1 edges have been selected (a tree is constructed).
4. Select an arbitrary root node and direct the edges outwards from the root.
5. The tree approximation P'(x) can be computed as the projection of P(x) on the resulting directed tree (using the product form of P'(x)).
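A minimal Python sketch (toy binary data, hypothetical names) of the procedure above: estimate pairwise mutual information from samples, then grow a maximum-weight spanning tree edge by edge, skipping edges that would close a cycle.

```python
# Minimal sketch: Chow-Liu tree from data via MI weights + Kruskal-style MWST.
import numpy as np
from itertools import combinations

def mutual_information(x, y):
    """Empirical I(X;Y) in nats for two discrete sample vectors."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            p_ab = np.mean((x == a) & (y == b))
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (np.mean(x == a) * np.mean(y == b)))
    return mi

rng = np.random.default_rng(0)                    # toy data with chain structure A-B-C
A = rng.integers(0, 2, 2000)
B = (A ^ (rng.random(2000) < 0.1)).astype(int)    # B is a noisy copy of A
C = (B ^ (rng.random(2000) < 0.1)).astype(int)    # C is a noisy copy of B
data = {"A": A, "B": B, "C": C}

edges = sorted(((mutual_information(data[u], data[v]), u, v)
                for u, v in combinations(data, 2)), reverse=True)

parent = {v: v for v in data}                     # union-find for cycle detection
def find(v):
    while parent[v] != v:
        v = parent[v]
    return v

tree = []
for w, u, v in edges:                             # heaviest edges first
    ru, rv = find(u), find(v)
    if ru != rv:                                  # keep only edges that avoid a cycle
        parent[ru] = rv
        tree.append((u, v, round(w, 3)))
print(tree)                                       # expect the chain edges A-B and B-C
```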
Summary:
Learning and inference in BNs
Bayesian Networks – graphical probabilistic models
Efficient representation and inference
Expert knowledge + learning from data
Learning:
parameters (parameter estimation, EM)
structure (optimization w/ score functions – e.g., MDL)
Complexity trade-off:
NB, BNs and trees
There is much more: causality, modeling time (DBNs, HMMs),
approximate inference, on-line learning, active learning, etc.
Online/print resources on BNs
Conferences & Journals
UAI, ICML, AAAI, AISTAT, KDD
MLJ, DM&KD, JAIR, IEEE KDD, IJAR, IEEE PAMI
Books and Papers
Bayesian Networks without Tears by Eugene Charniak. AI
Magazine: Winter 1991.
Probabilistic Reasoning in Intelligent Systems by Judea Pearl.
Morgan Kaufmann: 1998.
Probabilistic Reasoning in Expert Systems by Richard
Neapolitan. Wiley: 1990.
CACM special issue on Real-world applications of BNs, March
1995
Online/Print Resources on BNs
AUAI online: www.auai.org. Links to:
Electronic proceedings for UAI conferences
Other sites with information on BNs and reasoning under
uncertainty
Several tutorials and important articles
Research groups & companies working in this area
Other societies, mailing lists and conferences
Publicly available s/w for BNs
List of BN software maintained by Russell Almond at
bayes.stat.washington.edu/almond/belief.html
several free packages: generally research only
commercial packages: most powerful (& expensive) is
HUGIN; others include Netica and Dxpress
we are working on developing a Java based BN toolkit here at
Watson
Lecture 13: Bayesian networks I
CS221 / Spring 2019 / Charikar & Sadigh
Pac-Man competition
1. (1783) Adam Klein
2. (1769) Jonathan Hollenbeck
3. (1763) Michael Du
CS221 / Spring 2019 / Charikar & Sadigh 1
cs221.stanford.edu/q Question
Earthquakes and burglaries are independent events that will cause an
alarm to go off. Suppose you hear an alarm. How does hearing on the
radio that there’s an earthquake change your beliefs?
it increases the probability of burglary
it decreases the probability of burglary
it does not change the probability of burglary
CS221 / Spring 2019 / Charikar & Sadigh 2
• Situations like these arise all the time in practice: we have a lot of unknowns which are all dependent on
one another. If we obtain evidence on some of these unknowns, how does that affect our belief about the
other unknowns?
• In this lecture, we’ll see how we can perform this type of reasoning under uncertainty in a principled way
using Bayesian networks.
Review: definition
X1 X2 X3
f1 f2 f3 f4
Definition: factor graph
Variables:
X = (X1 , . . . , Xn ), where Xi ∈ Domaini
Factors:
f1 , . . . , fm , with each fj (X) ≥ 0
Weight(x) = ∏_{j=1}^{m} f_j(x)
CS221 / Spring 2019 / Charikar & Sadigh 4
• Last week, we talked about factor graphs, which use local factors to specify a weight Weight(x) for each
assignment x in a compact way. The stated objective was to find the maximum weight assignment.
• Given any factor graph, we saw a number of algorithms (backtracking search, beam search, Gibbs sampling,
variable elimination) for (approximately) optimizing this objective.
Review: person tracking
Problem: person tracking
Sensors report positions: 0, 2, 2. Objects don’t move very fast and
sensors are a bit noisy. What path did the person take?
t1 t2
X1 X2 X3
o1 o2 o3
• Variables Xi : location of object at time i
• Transition factors ti (xi , xi+1 ): incorporate physics
• Observation factors oi (xi ): incorporate sensors
[demo: maxVariableElimination()]
What do the factors mean?
CS221 / Spring 2019 / Charikar & Sadigh 6
• As an example, recall the object tracking example. We defined observation factors to capture the fact that
the true object position is close to the sensor reading, and the transition factors to capture the fact that
the true object positions across time are close to each other.
• We just set them rather arbitrarily. Is there a more principled way to think about these factors beyond
being non-negative functions?
Course plan
Search problems
Constraint satisfaction problems
Markov decision processes
Adversarial games Bayesian networks
Reflex States Variables Logic
”Low-level intelligence” ”High-level intelligence”
Machine learning
CS221 / Spring 2019 / Charikar & Sadigh 8
• Much of this class has been on developing modeling frameworks. We started with state-based models,
where we cast real-world problems as finding paths or policies through a state graph.
• Then, we saw that for a large class of problems (such as scheduling), it was much more convenient to use
the language of factor graphs.
• While factor graphs could be reduced to state-based models by fixing the variable ordering, we saw that
they also led to notions of treewidth and variable elimination, which allowed us to understand our models
much better.
• In this lecture, we will introduce another modeling framework, Bayesian networks, which are factor graphs
imbued with the language of probability. This will give probabilistic life to the factors of factor graphs.
Roadmap
Basics
Probabilistic programs
Inference
CS221 / Spring 2019 / Charikar & Sadigh 10
• Bayesian networks were popularized in AI by Judea Pearl in the 1980s, who showed that having a coherent
probabilistic framework is important for reasoning under uncertainty.
• There is a lot to say about the Bayesian networks (CS228 is an entire course about them and their cousins,
Markov networks). So we will devote most of this lecture focusing on modeling.
Review: probability (example)
Random variables: sunshine S ∈ {0, 1}, rain R ∈ {0, 1}
Joint distribution:
s r P(S = s, R = r)
0 0 0.20
P(S, R) = 0 1 0.08
1 0 0.70
1 1 0.02
Marginal distribution: Conditional distribution:
s P(S = s) s P(S = s | R = 1)
P(S) = 0 0.28 P(S | R = 1) = 0 0.8
1 0.72 1 0.2
(aggregate rows) (select rows, normalize)
CS221 / Spring 2019 / Charikar & Sadigh 12
• Before introducing Bayesian networks, let’s review probability (at least the relevant parts). We start with
an example about the weather. Suppose we have two boolean random variables, S and R representing
sunshine and rain. Think of an assignment to (S, R) as representing a possible state of the world.
• The joint distribution specifies a probability for each assignment to (S, R) (state of the the world). We
use lowercase letters (e.g., s and r) to denote values and uppercase letters (e.g., S and R) to denote
random variables. Note that P(S = s, R = r) is a probability (a number) while P(S, R) is a distribution (a
table of probabilities). We don’t know what state of the world we’re in, but we know what the probabilities
are (there are no unknown unknowns). The joint distribution contains all the information and acts as the
central source of truth.
• From it, we can derive a marginal distribution over a subset of the variables. We get this by aggregating
the rows that share the same value of S. The interpretation is that we are interested in S. We don’t
explicitly care about R, but we want to take into account R’s effect on S. We say that R is marginalized
out. This is a special form of elimination. In the last lecture, we leveraged max-elimination, where we
took the max over the eliminated variables; here, we are taking a sum.
• The conditional distribution selects rows of the table matching the condition (right of the bar), and then
normalizes the probabilities so that they sum to 1. The interpretation is that we observe the condition
(R = 1) and are interested in S. This is the conditioning that we saw for factor graphs, but where we
normalize the selected rows to get probabilities.
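As a small sketch of these two operations on the weather table above (the dictionary-based representation of the joint is just one convenient choice, not anything from the lecture):

# Joint distribution P(S, R) from the slide, keyed by (s, r).
joint = {(0, 0): 0.20, (0, 1): 0.08, (1, 0): 0.70, (1, 1): 0.02}

# Marginal P(S): aggregate (sum) the rows over the eliminated variable R.
p_s = {s: sum(p for (s2, r), p in joint.items() if s2 == s) for s in (0, 1)}
print(p_s)              # {0: 0.28, 1: 0.72}

# Conditional P(S | R = 1): select the rows with r == 1, then normalize.
selected = {s: joint[(s, 1)] for s in (0, 1)}
z = sum(selected.values())
p_s_given_r1 = {s: p / z for s, p in selected.items()}
print(p_s_given_r1)     # {0: 0.8, 1: 0.2}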
Review: probability (general)
Random variables:
X = (X1 , . . . , Xn ) partitioned into (A, B)
Joint distribution:
P(X) = P(X1 , . . . , Xn )
Marginal distribution:
P(A) = Σ_b P(A, B = b)
Conditional distribution:
P(A | B = b) ∝ P(A, B = b)
CS221 / Spring 2019 / Charikar & Sadigh 14
• In general, we have n random variables X1 , . . . , Xn and let X denote all of them. Suppose X is partitioned
into A and B (e.g., A = (X1 , X3 ) and B = (X2 , X4 , X5 ) if n = 5).
• The marginal and conditional distributions can be defined over the subsets A and B rather than just single
variables.
• Of course, we can also have a hybrid too: for n = 3, P(X1 | X3 = 1) marginalizes out X2 and conditions
on X3 = 1.
• It is important to remember the types of objects here: P(A) is a table where rows are possible assignments
to A, whereas P(A = a) is a number representing the probability of the row corresponding to assignment
a.
Probabilistic inference task
Random variables: unknown quantities in the world
X = (S, R, T, A)
In words:
• Observe evidence (traffic in autumn): T = 1, A = 1
• Interested in query (rain?): R
In symbols:
P( R | T = 1, A = 1 )
   query: R;   condition: T = 1, A = 1
(S is marginalized out)
CS221 / Spring 2019 / Charikar & Sadigh 16
• At this point, you should have all the definitions to compute any marginal or conditional distribution given
access to a joint probability distribution. But what is this really doing and how is this useful?
• We should think about each assignment x as a possible state of the world (it’s raining, it’s not sunny,
there is traffic, it is autumn, etc.). Think of the joint distribution as one giant database that contains full
information about how the world works.
• In practice, we’d like to ask questions by querying this probabilistic database. First, we observe some
evidence, which effectively fixes some of the variables. Second, we are interested in the distribution of
some set of variables which we didn’t observe. This forms a query, and the process of answering this query
(computing the desired distribution) is called probabilistic inference.
Challenges
Modeling: How to specify a joint distribution P(X1 , . . . , Xn ) com-
pactly?
Bayesian networks (factor graphs for probability distributions)
Inference: How to compute queries P(R | T = 1, A = 1) efficiently?
Variable elimination, Gibbs sampling, particle filtering (analogue of
algorithms for finding maximum weight assignment)
CS221 / Spring 2019 / Charikar & Sadigh 18
• In general, a joint distribution over n variables has size exponential in n. From a modeling perspective,
how do we even specify an object that large? Here, we will see that Bayesian networks, based on factor
graphs, offer an elegant solution.
• From an algorithms perspective, there is still the question of how we perform probabilistic inference ef-
ficiently. In the next lecture, we will see how we can adapt all of the algorithms that we saw before for
computing maximum weight assignments in factor graphs, essentially by replacing a max with a sum.
• The two desiderata are rather synergistic, and it is the same property — conditional independence — that
makes both possible.
Bayesian network (alarm)
P(B = b, E = e, A = a) = p(b) p(e) p(a | b, e)   (by definition)

Graph: B → A ← E

 b  p(b)       e  p(e)       b  e  a  p(a | b, e)
 1  ε          1  ε          0  0  0  1
 0  1 − ε      0  1 − ε      0  0  1  0
                             0  1  0  0
                             0  1  1  1
                             1  0  0  0
                             1  0  1  1
                             1  1  0  0
                             1  1  1  1

p(b) = ε · [b = 1] + (1 − ε) · [b = 0]
p(e) = ε · [e = 1] + (1 − ε) · [e = 0]
p(a | b, e) = [a = (b ∨ e)]
CS221 / Spring 2019 / Charikar & Sadigh 20
• Let us try to model the situation. First, we establish that there are three variables, B (burglary), E
(earthquake), and A (alarm). Next, we connect up the variables to model the dependencies.
• Unlike in factor graphs, these dependencies are represented as directed edges. You can intuitively think
about the directionality as suggesting causality, though what this actually means is a deeper question and
beyond the scope of this class.
• For each variable, we specify a local conditional distribution (a factor) of that variable given its parent
variables. In this example, B and E have no parents while A has two parents, B and E. This local
conditional distribution is what governs how a variable is generated.
• We are writing the local conditional distributions using p, while P is reserved for the joint distribution over
all random variables, which is defined as the product.
Bayesian network (alarm)
[Figure: the alarm network B → A ← E, and the corresponding factor graph with factors p(b), p(e), and p(a | b, e)]
P(B = b, E = e, A = a) = p(b)p(e)p(a | b, e)
Bayesian networks are a special case of factor graphs!
CS221 / Spring 2019 / Charikar & Sadigh 22
• Note that the local conditional distributions (e.g., p(a | b, e)) are non-negative so they can be thought
of simply as factors of a factor graph. The joint probability of an assignment is then the weight of that
assignment.
• In this light, Bayesian networks are just a type of factor graphs, but with additional structure and inter-
pretation.
Probabilistic inference (alarm)
Joint distribution:
 b  e  a  P(B = b, E = e, A = a)
 0  0  0  (1 − ε)²
 0  0  1  0
 0  1  0  0
 0  1  1  ε(1 − ε)
 1  0  0  0
 1  0  1  ε(1 − ε)
 1  1  0  0
 1  1  1  ε²
Queries: P(B)? P(B | A = 1)? P(B | A = 1, E = 1)?
[demo: ε = 0.05]
CS221 / Spring 2019 / Charikar & Sadigh 24
• Bayesian networks can be used to capture common reasoning patterns under uncertainty (which was one
of their first applications).
• Consider the following model: Suppose the probability of an earthquake is ε and the probability of a
burglary is ε, and both are independent. Suppose that the alarm always goes off if either an earthquake or
a burglary occurs.
• In the prior, we can eliminate A and E and get that the probability of the burglary is ε.
• Now suppose we hear the alarm A = 1. The probability of burglary is now P(B = 1 | A = 1) = 1/(2 − ε).
• Now suppose that you hear on the radio that there was an earthquake (E = 1). Then the probability of
burglary goes down to P(B = 1 | A = 1, E = 1) = ε again.
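A quick way to check these numbers is to enumerate the joint distribution of the alarm network and condition directly. This is a minimal sketch with ε = 0.05 as in the demo; the function names are illustrative and this is not the course's demo code.

eps = 0.05

def p_joint(b, e, a):
    # P(B=b, E=e, A=a) = p(b) p(e) p(a | b, e), with a deterministic OR alarm.
    pb = eps if b == 1 else 1 - eps
    pe = eps if e == 1 else 1 - eps
    pa = 1.0 if a == (b or e) else 0.0
    return pb * pe * pa

def posterior_b(evidence):
    # P(B = 1 | evidence), where evidence fixes some of {'e': ..., 'a': ...}.
    weights = {0: 0.0, 1: 0.0}
    for b in (0, 1):
        for e in (0, 1):
            for a in (0, 1):
                assignment = {'b': b, 'e': e, 'a': a}
                if all(assignment[k] == v for k, v in evidence.items()):
                    weights[b] += p_joint(b, e, a)
    return weights[1] / (weights[0] + weights[1])

print(posterior_b({}))                  # prior: eps = 0.05
print(posterior_b({'a': 1}))            # 1 / (2 - eps) ≈ 0.513
print(posterior_b({'a': 1, 'e': 1}))    # back down to eps = 0.05 (explaining away)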
Explaining away
B E
Key idea: explaining away
Suppose two causes positively influence an effect. Conditioned on
the effect, conditioning on one cause reduces the probability of the
other cause.
CS221 / Spring 2019 / Charikar & Sadigh 26
• This last phenomenon has a special name: explaining away. Suppose we have two cause variables B
and E, which are parents of an effect variable A. Assume the causes influence the effect positively (e.g.,
through the OR function).
• Conditioned on the effect A = 1, there is some posterior probability of B. Conditioned on the effect A = 1
and the other cause E = 1, the new posterior probability is reduced. We then say that the other cause E
has explained away B.
Definition
Definition: Bayesian network
Let X = (X1 , . . . , Xn ) be random variables.
A Bayesian network is a directed acyclic graph (DAG) that spec-
ifies a joint distribution over X as a product of local conditional
distributions, one for each node:
P(X1 = x1 , . . . , Xn = xn ) = ∏_{i=1}^{n} p(x_i | x_Parents(i))   (by definition)
CS221 / Spring 2019 / Charikar & Sadigh 28
• Without further ado, let’s define a Bayesian network formally. A Bayesian network defines a large joint
distribution in a modular way, one variable at a time.
• First, the graph structure captures what other variables a given variable depends on.
• Second, we specify a local conditional distribution for variable Xi , which is a function that specifies a
distribution over Xi given an assignment xParents(i) to its parents in the graph (possibly none). The joint
distribution is simply defined to be the product of all of the local conditional distributions together.
• Notationally, we use lowercase p (in p(xi | xParents(i) )) to denote a local conditional distribution, and
uppercase P to denote the induced joint distribution over all variables. While the two can coincide, it is
important to keep these things separate in your head!
Special properties
Key difference from general factor graphs:
Key idea: locally normalized
All factors (local conditional distributions) satisfy:
Σ_{x_i} p(x_i | x_Parents(i)) = 1   for each x_Parents(i)
Implications:
• Consistency of sub-Bayesian networks
• Consistency of conditional distributions
CS221 / Spring 2019 / Charikar & Sadigh 30
• But Bayesian networks are more than that. The key property is that all the local conditional distributions,
being distributions, sum to 1 over the first argument.
• This simple property results in two important properties of Bayesian networks that are not present in
general factor graphs.
Consistency of sub-Bayesian networks
B E
A short calculation:
P(B = b, E = e) = Σ_a P(B = b, E = e, A = a)
               = Σ_a p(b) p(e) p(a | b, e)
               = p(b) p(e) Σ_a p(a | b, e)
               = p(b) p(e)
CS221 / Spring 2019 / Charikar & Sadigh 32
• First, let’s see what happens when we marginalize A (by performing algebra on the joint probability). We
see that we end up with p(b)p(e), which actually defines a sub-Bayesian network with one fewer variable,
and the same local conditional probabilities.
• If one marginalizes out all the variables, then one gets 1, which verifies that a Bayesian network actually
defines a probability distribution.
• The philosophical ramification of this property is that there could be many other variables that depend
on the variables you’ve modeled (earthquakes also impacts traffic) but as long as you don’t observe them,
they can be ignored mathematically (ignorance is bliss). Note that this doesn’t mean that knowing about
the other things isn’t useful.
Consistency of sub-Bayesian networks
Key idea: marginalization
Marginalization of a leaf node yields a Bayesian network without
the node.
[Figure: marginalizing out the leaf A turns the network B → A ← E with factors p(b), p(e), p(a | b, e) into the network over B and E with factors p(b), p(e)]
CS221 / Spring 2019 / Charikar & Sadigh 34
• This property is very attractive, because it means that whenever we have a large Bayesian network, where
we don’t care about some of the variables, we can just remove them (graph operations), and this encodes
the same distribution as we would have gotten from marginalizing out variables (algebraic operations).
The former, being visual, can be more intuitive.
Consistency of local conditionals
Key idea: local conditional distributions
Local conditional distributions (factors) are the true conditional
distributions.
A B C
D E
F G H
P(D = d | A = a, B = b) = p(d | a, b)
| {z } | {z }
from probabilistic inference by definition
CS221 / Spring 2019 / Charikar & Sadigh 36
• Note that the local conditional distributions p(d | a, b) are simply defined by the user. On the other hand,
the quantity P(D = d | A = a, B = b) is not defined, but follows from probabilistic inference on the joint
distribution defined by the Bayesian network.
• It’s not clear a priori that the two have anything to do with each other. The second special property that
we get from using Bayesian networks is that the two are actually the same.
• To show this, we can remove all the descendants of D by the consistency of sub-Bayesian networks,
leaving us with the Bayesian network P(A = a, B = b, D = d) = p(a)p(b)p(d | a, b). By the chain rule,
P(A = a, B = b, D = d) = P(A = a, B = b)P(D = d | A = a, B = b). If we marginalize out D, then
we are left with the Bayesian network P(A = a, B = b) = p(a)p(b). From this, we can conclude that
P(D = d | A = a, B = b) = p(d | a, b).
• This argument generalizes to any Bayesian network and local conditional distribution.
Medical diagnosis
Problem: cold or allergies?
You are coughing and have itchy eyes. Do you have a cold or
allergies?
[demo]
Variables: Cold, Allergies, Cough, Itchy eyes
Bayesian network: Factor graph:
C A C A
H I H I
CS221 / Spring 2019 / Charikar & Sadigh 38
• Here is another example (a cartoon version of Bayesian networks for medical diagnosis). Allergies and cold
are the two hidden variables that we’d like to infer (we have some prior over these two). Cough and itchy
eyes are symptoms that we observe as evidence, and we have some likelihood model of these symptoms
given the hidden causes.
• We can use the demo to infer the hidden state given the evidence.
Summary so far
B E
• Set of random variables capture state of world
• Local conditional distributions ⇒ joint distribution
• Probabilistic inference task: ask questions
• Captures reasoning patterns (e.g., explaining away)
• Factor graph interpretation (for inference later)
CS221 / Spring 2019 / Charikar & Sadigh 40
Roadmap
Basics
Probabilistic programs
Inference
CS221 / Spring 2019 / Charikar & Sadigh 41
Probabilistic programs
Goal: make it easier to write down complex Bayesian networks
Key idea: probabilistic program
Write a program to generate an assignment (rather than specifying
the probability of an assignment).
CS221 / Spring 2019 / Charikar & Sadigh 42
Probabilistic programs
B E
Probabilistic program: alarm
B ∼ Bernoulli(ε)
E ∼ Bernoulli(ε)
A=B∨E
Key idea: probabilistic program
A randomized program that sets the random variables.
import random
def Bernoulli(epsilon):
    return random.random() < epsilon
CS221 / Spring 2019 / Charikar & Sadigh 43
• There is another way of writing down Bayesian networks other than graphically or mathematically, and that
is as a probabilistic program. A probabilistic program is a randomized program that invokes a random
number generator to make random choices. Executing this program will assign values to a collection of
random variables X1 , . . . , Xn ; that is, generating an assignment.
• The probability (e.g., fraction of times) that the program generates that assignment is exactly the proba-
bility under the joint distribution specified by that program.
• We should think of this program as outputting the state of the world (or at least the part of the world
that we care about for our task).
• Note that the probabilistic program is only used to define joint distributions. We usually wouldn’t actually
run this program directly.
• For example, we show the probabilistic program for alarm. B ∼ Bernoulli(ε) simply means that P(B =
1) = ε. Here, we can think about Bernoulli(ε) as a randomized function (random() < epsilon) that
returns 1 with probability ε and 0 with probability 1 − ε.
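Putting the pieces together, a runnable version of the alarm probabilistic program might look like the following sketch; sample_alarm and the empirical check at the end are illustrative additions, not the lecture's code.

import random

def Bernoulli(epsilon):
    return 1 if random.random() < epsilon else 0

def sample_alarm(epsilon=0.05):
    # One execution of the program: sample B and E, then set A = B or E.
    b = Bernoulli(epsilon)
    e = Bernoulli(epsilon)
    a = 1 if (b or e) else 0
    return b, e, a

# The fraction of runs producing an assignment approximates its joint probability.
samples = [sample_alarm() for _ in range(100000)]
print(sum(1 for (b, e, a) in samples if a == 1) / len(samples))  # ≈ 2*eps - eps**2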
Probabilistic program: example
Probabilistic program: object tracking
X0 = (0, 0)
For each time step i = 1, . . . , n:
With probability α:
Xi = Xi−1 + (1, 0) [go right]
With probability 1 − α:
Xi = Xi−1 + (0, 1) [go down]
Bayesian network structure:
X1 X2 X3 X4 X5
CS221 / Spring 2019 / Charikar & Sadigh 45
• This is a more interesting generative model since it has a for loop, which allows us to determine the
distribution over a templatized set of n variables rather than just 3 or 4.
• In these cases, variables are generally indexed by something like time or location.
• We can also draw the Bayesian network. Each Xi only depends on Xi−1 . This is a chain-structured
Bayesian network, called a Markov model.
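A direct Python transcription of this generative process might look like the sketch below; n and alpha are free parameters, and representing positions as (right, down) tuples is just one convenient choice.

import random

def sample_trajectory(n=10, alpha=0.5):
    # X_0 = (0, 0); at each step go right with probability alpha, else go down.
    x = (0, 0)
    trajectory = [x]
    for i in range(1, n + 1):
        if random.random() < alpha:
            x = (x[0] + 1, x[1])    # go right
        else:
            x = (x[0], x[1] + 1)    # go down
        trajectory.append(x)
    return trajectory

print(sample_trajectory())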
Probabilistic program: example
(press ctrl-enter to save)
Run
CS221 / Spring 2019 / Charikar & Sadigh 47
• Try clicking [Run]. Each time a new assignment of (X1 , . . . , Xn ) is chosen.
Probabilistic inference: example
Query: what are possible trajectories given evidence X10 = (8, 2)?
(press ctrl-enter to save)
Run
CS221 / Spring 2019 / Charikar & Sadigh 49
• This program only serves for defining the distribution. Now we can query that distribution and ask the
question: suppose the program set X10 = (8, 2); what is the distribution over the other variables?
• In the demo, note that all trajectories are constrained to go through (8, 2) at time step 10.
Application: language modeling
Probabilistic program: Markov model
For each position i = 1, 2, . . . , n:
Generate word Xi ∼ p(Xi | Xi−1 )
Wreck a nice beach
X1 X2 X3 X4
CS221 / Spring 2019 / Charikar & Sadigh 51
• In the context of natural language, a Markov model is known as a bigram model. A higher-order general-
ization of bigram models are n-gram models (more generally known as higher-order Markov models).
• Language models are often used to measure the ”goodness” of a sentence, mostly within the context of a
larger system such as speech recognition or machine translation.
Application: object tracking
Probabilistic program: hidden Markov model (HMM)
For each time step t = 1, . . . , T :
Generate object location Ht ∼ p(Ht | Ht−1 )
Generate sensor reading Et ∼ p(Et | Ht )
[Figure: hidden Markov model with hidden locations H1 … H5 and sensor readings E1 … E5]
Applications: speech recognition, information extraction, gene finding
CS221 / Spring 2019 / Charikar & Sadigh 53
• Markov models are limiting because they do not have a way of talking about noisy evidence (sensor
readings). They can be extended quite easily to hidden Markov models, which introduce a parallel sequence
of observation variables.
• For example, in object tracking, Ht denotes the true object location, and Et denotes the noisy sensor
reading, which might be (i) the location Ht plus noise, or (ii) the distance from Ht plus noise, depending
on the type of sensor.
• In speech recognition, Ht would be the phonemes or words and Et would be the raw acoustic signal.
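To make the two parallel sequences concrete, here is a minimal sketch of sampling from an HMM over integer locations, with a random-walk transition and a noisy sensor; the specific noise model (a ±1 perturbation chosen uniformly) is an assumption for illustration only.

import random

def sample_hmm(T=5, start=0):
    # H_t: true location (random walk); E_t: noisy sensor reading of H_t.
    hs, es = [], []
    h = start
    for t in range(T):
        h = h + random.choice([-1, 0, 1])    # transition p(H_t | H_{t-1})
        e = h + random.choice([-1, 0, 1])    # emission p(E_t | H_t)
        hs.append(h)
        es.append(e)
    return hs, es

print(sample_hmm())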
Application: multiple object tracking
Probabilistic program: factorial HMM
For each time step t = 1, . . . , T :
For each object o ∈ {a, b}:
Generate location H_t^o ∼ p(H_t^o | H_{t−1}^o)
Generate sensor reading Et ∼ p(Et | Hta , Htb )
H1a H2a H3a H4a
H1b H2b H3b H4b
E1 E2 E3 E4
CS221 / Spring 2019 / Charikar & Sadigh 55
• An extension of an HMM, called a factorial HMM, can be used to track multiple objects. We assume
that each object moves independently according to a Markov model, but that we get one sensor reading
which is some noisy aggregated function of the true positions.
• For example, Et could be the set {Hta , Htb }, which reveals where the objects are, but doesn’t say which
object is responsible for which element in the set.
Application: document classification
Question: given a text document, what is it about?
Probabilistic program: naive Bayes
Generate label Y ∼ p(Y )
For each position i = 1, . . . , L:
Generate word Wi ∼ p(Wi | Y )
travel
W1 W2 ... WL
beach Paris
CS221 / Spring 2019 / Charikar & Sadigh 57
• Naive Bayes is a very simple model which can be used for classification. For document classification, we
generate a label and all the words in the document given that label.
• Note that the words are all generated independently, which is not a very realistic model of language, but
naive Bayes models are surprisingly effective for tasks such as document classification.
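As a sketch, the naive Bayes joint probability of a label and a document is just the label prior times one factor per word, and classification picks the label with the largest joint; the toy probability tables below are made-up numbers for illustration, not from the lecture.

import math

# Toy local conditional distributions (illustrative numbers only).
p_label = {'travel': 0.4, 'sports': 0.6}
p_word_given_label = {
    'travel': {'beach': 0.5, 'paris': 0.3, 'goal': 0.2},
    'sports': {'beach': 0.1, 'paris': 0.1, 'goal': 0.8},
}

def log_joint(label, words):
    # log [ p(Y = label) * prod_i p(W_i = w_i | Y = label) ]
    return math.log(p_label[label]) + sum(
        math.log(p_word_given_label[label][w]) for w in words)

def classify(words):
    # The posterior over labels is proportional to the joint; take the argmax.
    return max(p_label, key=lambda y: log_joint(y, words))

print(classify(['beach', 'paris']))   # 'travel'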
Application: topic modeling
Question: given a text document, what topics is it about?
Probabilistic program: latent Dirichlet allocation
Generate a distribution over topics α ∈ RK
For each position i = 1, . . . , L:
Generate a topic Zi ∼ p(Zi | α)
Generate a word Wi ∼ p(Wi | Zi )
α {travel:0.8,Europe:0.2}
travel Z1 Z2 ... ZL Europe
beach W1 W2 ... WL Euro
CS221 / Spring 2019 / Charikar & Sadigh 59
• A more sophisticated model of text is latent Dirichlet Allocation (LDA), which allows a document to not
just be about one topic (which was true in naive Bayes), but about multiple topics.
• Here, the distribution over topics α is chosen per document from a Dirichlet distribution. Note that α is a
continuous-valued random variable. For each position, we choose a topic according to that per-document
distribution and generate a word given that topic.
• Latent Dirichlet Allocation (LDA) has been very influential for modeling not only text but images, videos,
music, etc.; any sort of data with hidden structure. It is very related to matrix factorization.
Application: medical diagnostics
Question: If a patient has a cough and fever, what disease(s) does
he/she have?
1 Pneumonia 0 Cold 0 Malaria
1 Fever 1 Cough 0 Vomit
Probabilistic program: diseases and symptoms
For each disease i = 1, . . . , m:
Generate activity of disease Di ∼ p(Di )
For each symptom j = 1, . . . , n:
Generate activity of symptom Sj ∼ p(Sj | D1:m )
CS221 / Spring 2019 / Charikar & Sadigh 61
• We already saw a special case of this model. In general, we would like to diagnose many diseases and
might have measured many symptoms and vitals.
Application: social network analysis
Question: Given a social network (graph over n people), what types of
people are there?
politician H1
0 E12 E13 0
scientist H2 E23 H3 scientist
Probabilistic program: stochastic block model
For each person i = 1, . . . , n:
Generate person type Hi ∼ p(Hi )
For each pair of people i ≠ j:
Generate connectedness Eij ∼ p(Eij | Hi , Hj )
CS221 / Spring 2019 / Charikar & Sadigh 63
• One can also model graphs such as social networks. A very naive-Bayes-like model is that each node
(person) has a ”type”. Whether two people interact with each other is determined solely by their types
and random chance.
• Note: there are extensions called mixed membership models which, like LDA, allow each person to have
multiple types.
• In summary, it is quite easy to come up with probabilistic programs that tell a story of how the world works
for the domain of interest. These probabilistic programs define joint distributions over assignments to a
collection of variables. Usually, these programs describe how some collection of hidden variables H that
you’re interested in behave, and then describe the generation of the evidence E that you see conditioned
on H. After defining the model, one can do probabilistic inference to compute P(H | E = e).
Roadmap
Basics
Probabilistic programs
Inference
CS221 / Spring 2019 / Charikar & Sadigh 65
Review: probabilistic inference
Input
Bayesian network: P(X1 = x1 , . . . , Xn = xn )
Evidence: E = e where E ⊆ X is subset of variables
Query: Q ⊆ X is subset of variables
Output
P(Q = q | E = e) for all values q
Example: if coughing and itchy eyes, have a cold?
P(C | H = 1, I = 1)
CS221 / Spring 2019 / Charikar & Sadigh 66
Example: Markov model
X1 X2 X3 X4
Query: P(X3 = x3 | X2 = 5) for all x3
Tedious way:
P(X3 = x3 | X2 = 5) ∝ Σ_{x1, x4} p(x1) p(x2 = 5 | x1) p(x3 | x2 = 5) p(x4 | x3)
                    ∝ ( Σ_{x1} p(x1) p(x2 = 5 | x1) ) p(x3 | x2 = 5)
                    ∝ p(x3 | x2 = 5)
Fast way:
[whiteboard]
CS221 / Spring 2019 / Charikar & Sadigh 67
• Let’s first compute the query the old-fashioned way by grinding through the algebra. Then we’ll see a
faster, more graphical way, of doing this.
• We start by transforming the query into an expression that references the joint distribution, which allows
us to rewrite as the product of the local conditional probabilities. To do this, we invoke the definition of
marginal and conditional probability.
• One convenient shortcut we will take is make use of the proportional-to (∝) relation. Note that in the end,
we need to construct a distribution over X3 . This means that any quantity (such as P(X2 = 5)) which
doesn’t depend on X3 can be folded into the proportionality constant. If you don’t believe this, keep it
around to convince yourself that it doesn’t matter. Using ∝ can save you a lot of work.
• Next, we do some algebra to push the summations inside. We notice that Σ_{x4} p(x4 | x3) = 1 because
it’s a local conditional distribution. The factor Σ_{x1} p(x1) p(x2 = 5 | x1) can also be folded into the
proportionality constant.
• The final result is p(x3 | x2 = 5), which matches the query as we expected by the consistency of local
conditional distributions.
General strategy
Query:
P(Q | E = e)
Algorithm: general probabilistic inference strategy
• Remove (marginalize) variables that are not ancestors of Q
or E.
• Convert Bayesian network to factor graph.
• Condition on E = e (shade nodes + disconnect).
• Remove (marginalize) nodes disconnected from Q.
• Run probabilistic inference algorithm (manual, variable elim-
ination, Gibbs sampling, particle filtering).
CS221 / Spring 2019 / Charikar & Sadigh 69
• Our goal is to compute the conditional distribution over the query variables Q ⊆ H given evidence E = e.
We can do this with our bare hands by chugging through all the algebra starting with the definition of
marginal and conditional probability, but there is an easier way to do this that exploits the structure of the
Bayesian network.
• Step 1: remove variables which are not ancestors of Q or E. Intuitively, these don’t have an influence on Q
and E, so they can be removed. Mathematically, this is due to the consistency of sub-Bayesian networks.
• Step 2: turn this Bayesian network into a factor graph by simply introducing one factor per node which
is connected to that node and its parents. It’s important to include all the parents and the child into one
factor, not separate factors. From here out, all we need to think about is factor graphs.
• Step 3: condition on the evidence variables. Recall that conditioning on nodes in a factor graph shades
them in, and as a graph operation, rips out those variables from the graph.
• Step 4: remove nodes which are not connected to Q. These are independent of Q, so they have no impact
on the results.
• Step 5: Finally, run a standard probabilistic inference algorithm on the reduced factor graph. We’ll do this
manually for now using variable elimination. Later we’ll see automatic methods for doing this.
Example: alarm
(Graph: B → A ← E)

 b  p(b)       e  p(e)       b  e  a  p(a | b, e)
 1  ε          1  ε          0  0  0  1
 0  1 − ε      0  1 − ε      0  0  1  0
                             0  1  0  0
                             0  1  1  1
                             1  0  0  0
                             1  0  1  1
                             1  1  0  0
                             1  1  1  1
[whiteboard]
Query: P(B)
• Marginalize out A, E
Query: P(B | A = 1)
• Condition on A = 1
CS221 / Spring 2019 / Charikar & Sadigh 71
• Here is another example: the simple v-structured alarm network from last time.
• P(B) = p(b) trivially after marginalizing out A and E (step 1).
• For P(B | A = 1), step 1 doesn’t do anything. Conditioning (step 3) creates a factor graph with factors
p(b), p(e), and p(a = 1 | b, e). In step 5, we eliminate E by replacing it and its incident factors with a
new factor f(b) = Σ_e p(e) p(a = 1 | b, e). Then, we multiply all the factors (which should only be unary
factors on the query variable B) and normalize: P(B = b | A = 1) ∝ p(b) f(b).
• To flesh this out, for b = 1, we have ε(ε + (1 − ε)) = ε. For b = 0, we have (1 − ε)(ε + 0) = ε(1 − ε). The
normalized result is thus P(B = 1 | A = 1) = ε / (ε + ε(1 − ε)) = 1 / (2 − ε).
• For a probabilistic interpretation, note that all we’ve done is calculate P(B = b | A = 1) =
P(B = b) P(A = 1 | B = b) / P(A = 1) = p(b) f(b) / Σ_{b_i ∈ Domain(B)} p(b_i) f(b_i), where the first equality follows from Bayes’ rule and the second
follows from the fact that the local conditional distributions are the true conditional distributions. The
Bayesian network has simply given us a methodical, algorithmic way to calculate this probability.
Example: A-H (section)
A B C
D E
F G H
[whiteboard]
Query: P(C | B = b)
• Marginalize out everything else; note C ⊥⊥ B
Query: P(C, H | E = e)
• Marginalize out A, D, F, G; note C ⊥⊥ H | E
CS221 / Spring 2019 / Charikar & Sadigh 73
• In the first example, once we marginalize out all variables we can, we are left with C and B, which are
disconnected. We condition on B, which just removes that node, and so we’re just left with P(C) = p(c),
as expected.
• In the second example, note that the two query variables are independent, so we can compute them
separately. The result is P(C = c, H = h | E = e) ∝ p(c) p(h | e) Σ_b p(b) p(e | b, c).
• If we had the actual values of these probabilities, we can compute these quantities.
Summary
Bayesian networks: modular definition of large joint distribution over
variables
Probabilistic inference: condition on evidence, query variables of interest
Next time: algorithms for probabilistic inference
CS221 / Spring 2019 / Charikar & Sadigh 75
Multiple Linear Regression
Multiple Regression for k = 2,
Graphical Demonstration
The simple linear regression model allows for one independent variable, X:
Y = β0 + β1X + ε   (a line in the X–Y plane)
The multiple linear regression model allows for more than one independent variable:
Y = β0 + β1X1 + β2X2 + ε   (now we have a plane in the X1–X2–Y space)
Model development
Cont…
• We may write
• In other way
Cont…
• We can write Y=Xβ+ε, where Y, X, β and ε are matrices…Compute
their size??
• For any Yi, we can say
yi = β0 + β1 xi1 + β2 xi2 + ..... + βp xip + εi
εi = yi − Σ_{j=0}^{p} βj xij
• Optimize
SSE = Σ_{i=1}^{n} εi² = Σ_{i=1}^{n} ( yi − Σ_{j=0}^{p} βj xij )²
dSSE/dβ = 0
Cont…
• SSE is a scalar quantity and can be written as εᵀε:
SSE = εᵀε = (y − Xβ)ᵀ(y − Xβ)
• Now
d(SSE)/dβ = d[(y − Xβ)ᵀ(y − Xβ)]/dβ = 0
−2 Xᵀ(y − Xβ) = 0
XᵀXβ = Xᵀy
(XᵀX)⁻¹XᵀXβ = (XᵀX)⁻¹Xᵀy
β = (XᵀX)⁻¹Xᵀy
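The closed-form solution above translates directly into a few lines of NumPy. This is a minimal sketch with made-up data (and it solves the normal equations rather than forming an explicit inverse, which is a numerically safer but equivalent choice).

import numpy as np

# Toy data: n observations, p predictors, plus a column of ones for the intercept.
rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=n)

# beta = (X^T X)^{-1} X^T y, computed by solving the normal equations X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)      # ≈ [1.0, 2.0, -0.5]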
Do it yourself…..
• Problem ….
MULTIPLE LINEAR REGRESSION
Part II
Dr. Ranjan Maity
CIT Kokrajhar
REQUIREMENTS
More than one independent variables
One dependent variable
Rule of Thumb: minimum number of observations = 10 × number of independent variables
ASSUMPTIONS
Independence
Normality
Homoscedasticity
Linearity
THE R AND BETA
R = the magnitude of the relationship between the dependent variable and the best linear
combination of the predictor variables
R2 = the proportion of variation in Y accounted for by the set of independent variables (X’s).
Testing R²
Test R² through an F test
Test of each partial regression coefficient (b) by t-tests
Test of competing models (difference between R²s) through an F test of the difference of R²s
Comparison of partial regression coefficients with each other - t-test of the difference between
standardized partial regression coefficients (β)
R² is a measure that provides information about the goodness of fit of a model
Lecture 9: Linear
Regression
Goals
• Develop basic concepts of linear regression from
a probabilistic framework
• Estimating parameters and hypothesis testing
with linear models
• Linear regression in R
Regression
• Technique used for the modeling and analysis of
numerical data
• Exploits the relationship between two or more
variables so that we can gain information about one of
them through knowing values of the other
• Regression can be used for prediction, estimation,
hypothesis testing, and modeling causal relationships
Regression Lingo
Y = X1 + X2 + X3
Dependent Variable Independent Variable
Outcome Variable Predictor Variable
Response Variable Explanatory Variable
Why Linear Regression?
• Suppose we want to model the dependent variable Y in terms
of three predictors, X1, X2, X3
Y = f(X1, X2, X3)
• Typically will not have enough data to try and directly
estimate f
• Therefore, we usually have to assume that it has some
restricted form, such as linear
Y = X1 + X2 + X3
Linear Regression is a Probabilistic Model
• Much of mathematics is devoted to studying variables
that are deterministically related to one another
y = " 0 + "1 x
y "y
! "0
"x
"1 =
#y
#x
!
! x
!
!
!
• But we’re interested in understanding the relationship
between variables related in a nondeterministic fashion
!
A Linear Probabilistic Model
• Definition: There exist parameters β0, β1, and σ², such that for
any fixed value of the independent variable x, the dependent
variable is related to x through the model equation
y = β0 + β1x + ε
• ε is a rv assumed to be N(0, σ²)
True Regression Line
[Figure: the true regression line y = β0 + β1x with intercept β0; observed points deviate from the line by random errors]
Implications
• The expected value of Y is a linear function of X, but for fixed
x, the variable Y differs from its expected value by a random
amount
• Formally, let x* denote a particular value of the independent
variable x, then our linear probabilistic model says:
E(Y | x*) = µ_{Y|x*} = mean value of Y when x is x*
V(Y | x*) = σ²_{Y|x*} = variance of Y when x is x*
Graphical Interpretation
[Figure: the line y = β0 + β1x, with µ_{Y|x1} = β0 + β1x1 and µ_{Y|x2} = β0 + β1x2 marked at x1 and x2]
• For example, if x = height and y = weight then µ_{Y|x=60} is the average
weight for all individuals 60 inches tall in the population
One More Example
Suppose the relationship between the independent variable height
(x) and dependent variable weight (y) is described by a simple
linear regression model with true regression line
y = 7.5 + 0.5x and σ = 3
• Q1: What is the interpretation of β1 = 0.5?
The expected change in weight associated with a 1-unit (1-inch) increase
in height
• Q2: If x = 20 what is the expected value of Y?
µ_{Y|x=20} = 7.5 + 0.5(20) = 17.5
• Q3: If x = 20 what is P(Y > 22)?
P(Y > 22 | x = 20) = P( Z > (22 − 17.5)/3 ) = 1 − Φ(1.5) = 0.067
Estimating Model Parameters
• Point estimates β̂0 and β̂1 are obtained by the principle of least
squares: minimize
f(β0, β1) = Σ_{i=1}^{n} [ yi − (β0 + β1xi) ]²
[Figure: scatter of data points with the fitted line and intercept β0]
• β̂0 = ȳ − β̂1x̄
Predicted and Residual Values
• Predicted, or fitted, values are values of y predicted by the least-
squares regression line obtained by plugging x1, x2, …, xn into the
estimated regression line:
ŷ1 = β̂0 + β̂1x1
ŷ2 = β̂0 + β̂1x2
• Residuals are the deviations of observed from predicted values:
e1 = y1 − ŷ1
e2 = y2 − ŷ2
[Figure: observed points, the fitted line, and residuals e1, e2, e3]
Residuals Are Useful!
• They allow us to calculate the error sum of squares (SSE):
SSE = Σ_{i=1}^{n} (ei)² = Σ_{i=1}^{n} (yi − ŷi)²
• Which in turn allows us to estimate σ²:
σ̂² = SSE / (n − 2)
• As well as an important statistic referred to as the coefficient of
determination:
r² = 1 − SSE / SST,   where SST = Σ_{i=1}^{n} (yi − ȳ)²
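These quantities are easy to compute directly; the following sketch (with made-up data) mirrors the formulas above.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares estimates: b1 = cov(x, y) / var(x), b0 = ybar - b1 * xbar.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x
sse = np.sum((y - y_hat) ** 2)        # error sum of squares
sst = np.sum((y - y.mean()) ** 2)     # total sum of squares
sigma2_hat = sse / (len(x) - 2)       # estimate of sigma^2
r2 = 1 - sse / sst                    # coefficient of determination

print(b0, b1, sigma2_hat, r2)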
Multiple Linear Regression
• Extension of the simple linear regression model to two or
more independent variables
y = " 0 + "1 x1 + " 2 x 2 + ...+ " n x n + #
Expression = Baseline + Age + Tissue + Sex + Error
!
• Partial Regression Coefficients: βi ≡ effect on the
dependent variable when increasing the ith independent
variable by 1 unit, holding all other predictors
constant
Categorical Independent Variables
• Qualitative variables are easily incorporated in regression
framework through dummy variables
• Simple example: sex can be coded as 0/1
• What if my categorical variable contains three levels:
xi = 0 if AA, 1 if AG, 2 if GG
Categorical Independent Variables
• Previous coding would result in colinearity
• Solution is to set up a series of dummy variables. In general,
for k levels you need k−1 dummy variables:
x1 = 1 if AA, 0 otherwise
x2 = 1 if AG, 0 otherwise
        x1   x2
AA       1    0
AG       0    1
GG       0    0
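In code, this k−1 dummy coding is usually done automatically; the sketch below shows both a manual version and pandas' get_dummies with drop_first=True. The column name 'genotype' and the toy values are illustrative only.

import pandas as pd

df = pd.DataFrame({'genotype': ['AA', 'AG', 'GG', 'AG', 'AA']})

# Manual k-1 coding: two indicators for three levels (GG acts as the baseline).
df['x1'] = (df['genotype'] == 'AA').astype(int)
df['x2'] = (df['genotype'] == 'AG').astype(int)

# Equivalent automatic coding (drop_first drops the first level as the baseline).
dummies = pd.get_dummies(df['genotype'], drop_first=True)
print(df)
print(dummies)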
Hypothesis Testing: Model Utility Test (or
Omnibus Test)
• The first thing we want to know after fitting a model is whether
any of the independent variables (X’s) are significantly related to
the dependent variable (Y):
H 0 : "1 = " 2 = ... = " k = 0
H A : At least one "1 # 0
R2 k
f = 2
•
(1$ R ) n $ (k + 1)
Rejection Region : F" ,k,n#(k +1)
!
!
Equivalent ANOVA Formulation of Omnibus Test
• We can also frame this in our now familiar ANOVA framework
- partition total variation into two components: SSE (unexplained
variation) and SSR (variation explained by linear model)
Source of Variation   df         Sum of Squares       MS                      F
Regression            k          SSR = Σ(ŷi − ȳ)²     MSR = SSR / k           MSR / MSE
Error                 n−(k+1)    SSE = Σ(yi − ŷi)²    MSE = SSE / (n−(k+1))
Total                 n−1        SST = Σ(yi − ȳ)²
Rejection Region: F_{α, k, n−(k+1)}
F Test For Subsets of Independent Variables
• A powerful tool in multiple regression analyses is the ability to
compare two models
• For instance say we want to compare:
Full Model: y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + ε
Reduced Model: y = β0 + β1x1 + β2x2 + ε
• Again, another example of ANOVA:
SSE_R = error sum of squares for the reduced model with l predictors
SSE_F = error sum of squares for the full model with k predictors
f = [ (SSE_R − SSE_F) / (k − l) ] / [ SSE_F / (n − (k + 1)) ]
Example of Model Comparison
• We have a quantitative trait and want to test the effects at two
markers, M1 and M2.
Full Model: Trait = Mean + M1 + M2 + (M1*M2) + error
Reduced Model: Trait = Mean + M1 + M2 + error
(SSE R " SSE F ) /(3 " 2) (SSE R " SSE F )
f = =
SSE F /([100 " (3 + 1)] SSE F /96
Rejection Region : Fa, 1, 96
!
Hypothesis Tests of Individual Regression
Coefficients
• Hypothesis tests for each β̂i can be done by simple t-tests:
H0: βi = 0
HA: βi ≠ 0
T = (β̂i − βi) / se(β̂i)
Critical value: t_{α/2, n−(k+1)}
• Confidence Intervals are equally easy to obtain:
β̂i ± t_{α/2, n−(k+1)} · se(β̂i)
Checking Assumptions
• Critically important to examine data and check assumptions
underlying the regression model
Outliers
Normality
Constant variance
Independence among residuals
• Standard diagnostic plots include:
scatter plots of y versus xi (outliers)
qq plot of residuals (normality)
residuals versus fitted values (independence, constant variance)
residuals versus xi (outliers, constant variance)
• We’ll explore diagnostic plots in more detail in R
Fixed -vs- Random Effects Models
• In ANOVA and Regression analyses our independent variables can
be treated as Fixed or Random
• Fixed Effects: variables whose levels are either sampled
exhaustively or are the only ones considered relevant to the
experimenter
• Random Effects: variables whose levels are randomly sampled
from a large population of levels
• Example from our recent AJHG paper:
Expression = Baseline + Population + Individual + Error
Simple Linear Regression
The Model
The model has a deterministic and a probabilistic component.
[Figure: house cost vs. house size; most lots sell for $25,000, so the deterministic part is a line]
However, house costs vary even among same-size houses! Since cost behaves unpredictably,
we add a random component.
[Figure: house cost vs. house size with scatter around the deterministic line]
The first order linear model
Y = β0 + β1X + ε
Y = dependent variable
X = independent variable
β0 = Y-intercept
β1 = slope of the line (Rise/Run)
ε = error variable
β0 and β1 are unknown population parameters, therefore are estimated from the data.
[Figure: a line with intercept β0 and slope β1 = Rise/Run]
Estimating the Coefficients
The estimates are determined by
drawing a sample from the population of interest,
calculating sample statistics.
producing a straight line that cuts into the data.
[Figure: scatter of sample points in the X–Y plane]
Question: What should be considered a good line?
The Least Squares
(Regression) Line
A good line is one that minimizes
the sum of squared differences between the
points and the line.
Let us compare two lines through the points (1,2), (2,4), (3,1.5), (4,3.2);
the second line is the horizontal line Y = 2.5.
Sum of squared differences (first line) = (2 − 1)² + (4 − 2)² + (1.5 − 3)² + (3.2 − 4)² = 6.89
Sum of squared differences (horizontal line) = (2 − 2.5)² + (4 − 2.5)² + (1.5 − 2.5)² + (3.2 − 2.5)² = 3.99
[Figure: the four points with the two candidate lines]
The smaller the sum of squared differences, the better the fit of the line to the data.
The Estimated Coefficients
To calculate the estimates of the line coefficients that minimize the differences
between the data points and the line, use the formulas:
b1 = cov(X,Y) / sX²  =  sXY / sX²
b0 = Ȳ − b1X̄
The regression equation that estimates the equation of the first order linear model is:
Ŷ = b0 + b1X
The Simple Linear Regression Line
• Example 17.2 (Xm17-02)
A car dealer wants to find the relationship between the odometer reading and
the selling price of used cars. A random sample of 100 cars is selected, and the
data recorded. Find the regression line.
Car   Odometer (X, independent variable)   Price (Y, dependent variable)
1     37388                                14636
2     44758                                14122
3     45833                                14016
4     30862                                15590
5     31705                                15568
6     34010                                14718
…     …                                    …
• Solution
– Solving by hand: calculate a number of statistics (where n = 100):
X̄ = 36,009.45;   sX² = Σ(Xi − X̄)² / (n − 1) = 43,528,690
Ȳ = 14,822.82;   cov(X,Y) = Σ(Xi − X̄)(Yi − Ȳ) / (n − 1) = −2,712,511
b1 = cov(X,Y) / sX² = −2,712,511 / 43,528,690 = −.06232
b0 = Ȳ − b1X̄ = 14,822.82 − (−.06232)(36,009.45) = 17,067
Ŷ = b0 + b1X = 17,067 − .0623X
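The same by-hand calculation is easy to reproduce in NumPy given arrays of odometer readings and prices; the arrays below contain only the six listed cars as placeholders, since the full 100-car sample is not reproduced here.

import numpy as np

# Placeholders: only the first six rows of the Xm17-02 sample are shown above.
odometer = np.array([37388, 44758, 45833, 30862, 31705, 34010], dtype=float)
price = np.array([14636, 14122, 14016, 15590, 15568, 14718], dtype=float)

n = len(odometer)
sx2 = np.sum((odometer - odometer.mean()) ** 2) / (n - 1)                          # s_X^2
cov_xy = np.sum((odometer - odometer.mean()) * (price - price.mean())) / (n - 1)   # cov(X, Y)

b1 = cov_xy / sx2
b0 = price.mean() - b1 * odometer.mean()
print(b0, b1)   # with the full 100-car sample this gives roughly 17,067 and -0.0623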
• Solution – continued
– Using the computer (Xm17-02)
Tools > Data Analysis > Regression >
[Shade the Y range and the X range] > OK
Xm17-02
SUMMARY OUTPUT
Regression Statistics
Multiple R          0.8063
R Square            0.6501
Adjusted R Square   0.6466
Standard Error      303.1
Observations        100
Ŷ = 17,067 − .0623X
ANOVA
             df    SS         MS         F        Significance F
Regression   1     16734111   16734111   182.11   0.0000
Residual     98    9005450    91892
Total        99    25739561
             Coefficients   Standard Error   t Stat    P-value
Intercept    17067          169              100.97    0.0000
Odometer     -0.0623        0.0046           -13.49    0.0000
Interpreting the Linear Regression Equation
Ŷ = 17,067 − .0623X
[Figure: Odometer line fit plot, Price vs. Odometer]
The intercept is b0 = $17,067. Do not interpret the intercept as the
“price of cars that have not been driven” (there are no data near Odometer = 0).
The slope is b1 = −.0623: for each additional mile on the odometer,
the price decreases by an average of $0.0623.
Error Variable: Required Conditions
The error ε is a critical part of the regression model.
Four requirements involving the distribution of ε must be satisfied.
The probability distribution of ε is normal.
The mean of ε is zero: E(ε) = 0.
The standard deviation of ε is σε for all values of X.
The set of errors associated with different values of Y are all independent.
The Normality of ε
[Figure: normal curves centered at E(Y|X1) = β0 + β1X1, E(Y|X2) = β0 + β1X2, E(Y|X3) = β0 + β1X3]
The standard deviation remains constant, but the mean value changes with X.
From the first three assumptions we have: Y is normally distributed with
mean E(Y) = β0 + β1X, and a constant standard deviation σε
Assessing the Model
The least squares method will produce a regression line
whether or not there is a linear relationship between X
and Y.
Consequently, it is important to assess how well the
linear model fits the data.
Several methods are used to assess the model. All are
based on the sum of squares for errors, SSE.
Sum of Squares for Errors
This is the sum of differences between the points and the regression line.
It can serve as a measure of how well the line fits the data. SSE is
defined by
SSE = Σ_{i=1}^{n} (Yi − Ŷi)²
– A shortcut formula:
SSE = (n − 1) [ sY² − (cov(X,Y))² / sX² ]
Standard Error of Estimate
The mean error is equal to zero.
If σε is small the errors tend to be close to zero (close to the mean error).
Then, the model fits the data well.
Therefore, we can, use σε as a measure of the suitability of using a linear
model.
An estimator of σε is given by sε, the standard error of estimate:
sε = √( SSE / (n − 2) )
Example 17.3
Calculate the standard error of estimate for Example 17.2, and describe what it tells you about the model fit.
Solution

Calculated before:

sY² = Σ(Yi − Ȳ)² / (n − 1) = 259,996

SSE = (n − 1) [ sY² − (cov(X,Y))² / sX² ] = 99 [ 259,996 − (−2,712,511)² / 43,528,690 ] = 9,005,450

sε = sqrt( SSE / (n − 2) ) = sqrt( 9,005,450 / 98 ) = 303.13

It is hard to assess the model based on sε, even when it is compared with the mean value of Y: sε = 303.1, Ȳ = 14,823.
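A numerical check of this solution, using only the summary figures quoted above:

import math

n, s2_y = 100, 259_996.0
cov_xy, var_x = -2_712_511.0, 43_528_690.0

sse = (n - 1) * (s2_y - cov_xy ** 2 / var_x)   # ~9,005,500 with these rounded inputs
s_e = math.sqrt(sse / (n - 2))                 # ~303.1

print(f"SSE = {sse:,.0f}, standard error of estimate = {s_e:.2f}")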
Testing the Slope
When no linear relationship exists between two variables, the regression
line should be horizontal.
Linear relationship: different inputs (X) yield different outputs (Y); the slope is not equal to zero.
No linear relationship: different inputs (X) yield the same output (Y); the slope is equal to zero.
We can draw inference about β1 from b1 by testing
H0: β1 = 0
H1: β1 ≠ 0 (or < 0,or > 0)
The test statistic is

t = (b1 − β1) / sb1,   where   sb1 = sε / sqrt( (n − 1) sX² )

is the standard error of b1. If the error variable is normally distributed, the statistic has a Student t distribution with d.f. = n − 2.
Example 17.4
Test to determine whether there is enough evidence to infer that there is a linear relationship between the car auction price and the odometer reading for all three-year-old Tauruses in Example 17.2.
Use α = 5%.
Solving by hand
To compute "t" we need the values of b1 and sb1.

b1 = −.0623

sb1 = sε / sqrt( (n − 1) sX² ) = 303.1 / sqrt( (99)(43,528,690) ) = .00462

t = (b1 − β1) / sb1 = (−.0623 − 0) / .00462 = −13.49

The rejection region is t > t.025 or t < −t.025 with ν = n − 2 = 98. Approximately, t.025 = 1.984.
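The same test in code, using the summary figures above and scipy's Student t distribution (the p-value line is an addition for comparison with α = 5%; it is not part of the hand calculation):

import math
from scipy import stats

n = 100
b1, s_e, var_x = -0.0623, 303.1, 43_528_690.0

s_b1 = s_e / math.sqrt((n - 1) * var_x)        # standard error of b1, about .00462
t_stat = (b1 - 0) / s_b1                       # test of H0: beta1 = 0, about -13.5
t_crit = stats.t.ppf(0.975, df=n - 2)          # about 1.984
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(f"t = {t_stat:.2f}, rejection region |t| > {t_crit:.3f}, p-value = {p_value:.1e}")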
Xm17-02
• Using the computer

Price   Odometer     (first rows of the 100-car data file)
14636    37388
14122    44758
14016    45833
15590    30862
15568    31705
14718    34010
14470    45854
15690    19057
15072    40149
14802    40237
15190    32359
14660    43533
15612    32744
15610    34470
14634    37720
14632    41350
15740    24469

SUMMARY OUTPUT (Regression Statistics and ANOVA as shown earlier):

            Coefficients   Standard Error   t Stat    P-value
Intercept      17067            169         100.97     0.0000
Odometer      -0.0623          0.0046       -13.49     0.0000

There is overwhelming evidence to infer that the odometer reading affects the auction selling price.
Coefficient of Determination
To measure the strength of the linear relationship we use the coefficient of determination:

R² = [cov(X,Y)]² / (sX² sY²)   (or, equivalently, R² = rXY²);

or, R² = 1 − SSE / Σ(Yi − Ȳ)²   (see the definition of SSE above).
• To understand the significance of this coefficient, note:

[Figure: two data points (X1,Y1) and (X2,Y2) of a certain sample, showing the overall variability in Y, the part captured by the regression model, and the error.]

Variation in Y = SSR + SSE

Total variation in Y = Variation explained by the regression line + Unexplained variation (error):

(Y1 − Ȳ)² + (Y2 − Ȳ)² = (Ŷ1 − Ȳ)² + (Ŷ2 − Ȳ)² + (Y1 − Ŷ1)² + (Y2 − Ŷ2)²
• R2 measures the proportion of the variation in Y
that is explained by the variation in X.
R² = 1 − SSE / Σ(Yi − Ȳ)² = [ Σ(Yi − Ȳ)² − SSE ] / Σ(Yi − Ȳ)² = SSR / Σ(Yi − Ȳ)²
• R2 takes on any value between zero and one.
R2 = 1: Perfect match between the line and the data points.
R2 = 0: There is no linear relationship between X and Y.
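A brief sketch showing that the two expressions for R² agree (illustrative data, as in the earlier sketches):

import numpy as np

x = np.array([37388, 44758, 45833, 30862, 31705, 34010], dtype=float)
y = np.array([14636, 14122, 14016, 15590, 15568, 14718], dtype=float)

cov_xy, var_x, var_y = np.cov(x, y, ddof=1)[0, 1], np.var(x, ddof=1), np.var(y, ddof=1)
b1 = cov_xy / var_x
b0 = y.mean() - b1 * x.mean()

sse = np.sum((y - (b0 + b1 * x)) ** 2)
ss_total = np.sum((y - y.mean()) ** 2)

r2_from_sse = 1 - sse / ss_total              # R^2 = 1 - SSE / total variation in Y
r2_from_cov = cov_xy ** 2 / (var_x * var_y)   # R^2 = cov(X,Y)^2 / (sX^2 sY^2)

print(r2_from_sse, r2_from_cov)   # identical for simple linear regression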
Example 17.5
Find the coefficient of determination for Example 17.2; what does this statistic
tell you about the model?
Solution
Solving by hand:

R² = [cov(X,Y)]² / (sX² sY²) = (−2,712,511)² / [ (43,528,688)(259,996) ] = .6501
– Using the computer

From the regression output we have:

Regression Statistics
Multiple R          0.8063
R Square            0.6501
Adjusted R Square   0.6466
Standard Error      303.1
Observations        100

(ANOVA and coefficient tables as shown earlier.)

65% of the variation in the auction selling price is explained by the variation in odometer reading. The rest (35%) remains unexplained by this model.
22/03/2022 — Notes: vectors, matrices, tensors

[Scanned handwritten notes; only partially legible. Recoverable content follows.]

A linear function: let x and y be vectors, c a number, and F a function. F is linear if it satisfies both
Additivity:   F(x + y) = F(x) + F(y)
Homogeneity:  F(cx) = cF(x)
A function must satisfy both conditions; for example, F(x) = 4x does, while F(x) = 2x + 2 is NOT LINEAR.

A tensor is an n-dimensional array with n greater than or equal to 3; for example, a colour image pixel (24-bit) has R, G, B components.