National Economic University
Chapter 3
 Panel Data Regression
 Dr. Phung Minh Duc
 Contents
1. Introduction
2. Panel data
3. Regression models with Panel data
4. Estimation model selection Tests
5. Some defects of the panel model
6. Commands on Stata
7. Practice
 Introduction
Endogeneity Problem
 𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝑢
❖ An endogeneity problem arises when an independent variable is correlated with the
 error term:
 𝑐𝑜𝑣(𝑋, 𝑢) ≠ 0
❖ If the model contains endogenous variables => The coefficients estimated by
 the OLS method are biased and inconsistent: $E(\hat{\beta}_1) \neq \beta_1$
❖ Endogeneity is a frequent problem in economics and econometrics.
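A short derivation may help show why the correlation matters. It is a sketch for the simple model above, assuming a random sample and var(X) > 0; the plim notation is not used elsewhere in these slides.

```latex
\hat{\beta}_1 = \frac{\widehat{cov}(X, Y)}{\widehat{var}(X)}
\quad\Longrightarrow\quad
\operatorname{plim}\, \hat{\beta}_1 = \beta_1 + \frac{cov(X, u)}{var(X)}
```

The second term vanishes only when cov(X, u) = 0, so with an endogenous X the OLS slope does not converge to the true coefficient, however large the sample.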
Sources of endogeneity
❖ Omitted variable: A relevant variable is not observed and ends up in
 the error term; if it is correlated with the independent variables used in
 the model, the error term is correlated with them too
❖ Measurement error: A mismeasured independent variable is correlated with
 the error term, because the measurement error becomes part of it
❖ Simultaneity: The independent variable and the dependent variable
 determine each other at the same time (X affects Y and Y affects X)
Solutions for endogeneity
❖ Find and include a proxy variable in the model
❖ Use the instrumental variable (IV) method
❖ Use econometric models with panel data
 Panel Data
❖ Panel data is a data set collected on the same set of individuals
 (households, enterprises, provinces, etc.) over time, at equally spaced time
 points.
❖ Panel data contains two dimensions:
 ▪ The horizontal information across individuals at the same point in time
 (characteristic of cross-sectional data)
 ▪ The vertical information on each individual over time (characteristic of
 time series data).
❖ Panel data structure
 Individual   Time   Depvar (Y)   Indepvar (X)
 1            1      𝑦11          𝑥11
 1            2      𝑦12          𝑥12
 1            3      𝑦13          𝑥13
 …            …      …            …
 N            1      𝑦𝑁1          𝑥𝑁1
 N            2      𝑦𝑁2          𝑥𝑁2
 N            3      𝑦𝑁3          𝑥𝑁3
❖ Note:
 The variables in a panel dataset fall into the following groups:
 ▪ Group 1: Variables that change in both directions (across individuals and
 over time), such as the output of an enterprise, personal consumption, etc.
 ▪ Group 2: Variables that change horizontally but not vertically, such as
 the gender of the household head, religion, etc.
 ▪ Group 3: Variables that change vertically but not horizontally, such as
 the exchange rate, the base interest rate, the general macroeconomic environment
 So, panel data provides information along more dimensions than other data
 types and is very useful in applied research.
❖ Balanced panel data is a data set with complete information on every
 individual at all observation times
❖ Unbalanced panel data is a data set in which information on some
 individuals is missing at some observation times
❖ Sources of unbalance:
 ▪ Self-selection (enterprise bankruptcy, provincial mergers, individual
 deaths, etc.)
 ▪ Random factors (data entry errors, data that could not be collected at a
 certain time)
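The difference between the two cases can be checked directly in Stata. The sketch below uses a small hypothetical data set (the variable names id, year, y and all values are made up for illustration) and relies on the fact that xtset reports whether the panel is balanced.

```stata
* Hypothetical toy panel: 3 individuals observed in 3 years
clear
input id year y
1 2019 10
1 2020 12
1 2021 13
2 2019 20
2 2020 19
2 2021 23
3 2019 15
3 2020 16
3 2021 18
end

* Every individual is observed in every year: xtset reports a balanced panel
xtset id year

* Drop one observation to mimic an enterprise missing in one year
drop if id == 2 & year == 2021

* xtset now reports an unbalanced panel
xtset id year

* Show the participation pattern of each individual
xtdescribe
```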
❖ The size of the dataset
 Suppose the data set contains information about N individuals over T
observation periods. Then there are the following cases:
 ▪ N large and T small: The traditional panel data format
 ▪ N small and T large: The autocorrelation problem needs particular attention
 ▪ N small and T small: This data format is rarely used
 ▪ N large and T large: Of growing research interest (big data)
❖ Advantages of panel data
 ▪ Rich information: Horizontal (across individuals) and vertical (over time)
 ▪ Solves the endogeneity problem caused by omitted unobserved variables
 (individual characteristic variables):
 If only within-individual variation is used, the impact of unobserved factors
 that do not change over time (individual characteristics) is removed.
 ▪ Enables richer and more detailed analysis:
 For example, in poverty reduction research, panel data not only shows
 the number of poor households, but also provides information on which
 households are chronically poor, temporarily poor, or falling back into poverty.
 ▪ Reduces multicollinearity in distributed-lag models
 ▪ Increases the degrees of freedom and so the precision of statistical
 inference
 ▪ Suitable for datasets collected in developing countries
❖ Some typical panel datasets in Vietnam
 ▪ Vietnam Household Living Standard Survey (VHLSS)
 ▪ General Enterprise Survey (GES)
 ▪ Small and Medium Enterprise Survey (SMES)
 ▪ Provincial Competitiveness Index (PCI)
 ▪ …
❖ Practice on Stata
 ▪ Create a panel data file from annual data files
 ▪ Using the commands:
 ➢ merge
 ➢ reshape long <stubnames>, i(id) j(time)
 ➢ xtset id time
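The three commands can be combined as in the following sketch. It assumes two hypothetical annual files, each with one row per individual and year-suffixed variables (y2019, x2019 and so on); the file contents and variable names are made up for illustration.

```stata
* Hypothetical annual file for 2019, kept in a temporary file
clear
input id y2019 x2019
1 10 1.0
2 20 2.0
3 15 0.9
end
tempfile f2019
save `f2019'

* Hypothetical annual file for 2020
clear
input id y2020 x2020
1 12 1.5
2 19 2.1
3 16 1.1
end

* Combine the annual files side by side, matching on the individual id
merge 1:1 id using `f2019'
drop _merge

* Wide -> long: the stubs y and x become variables, the suffix becomes year
reshape long y x, i(id) j(year)

* Declare the panel structure: panel variable first, then time variable
xtset id year
```

After reshape, each individual has one row per year, which is the layout that xtset and the xt commands expect.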
 Regression models with Panel data
❖ General Panel Regression Model
 𝑌𝑖𝑡 = 𝛽0 + 𝛽1 𝑋1𝑖𝑡 + ⋯ + 𝛽𝑘 𝑋𝑘𝑖𝑡 + 𝑐𝑖 + 𝑢𝑖𝑡 (1)
In which:
• 𝑖 is the individual index, 𝑡 is the time index;
• 𝑐𝑖 represents an unobserved factor (individual characteristic) that does
 not change over time and has an impact on 𝑌.
Note: Since 𝑐𝑖 represents the difference between individuals in the set of
observations, and this difference does not depend on time, model (1) is
also called an individual effects model.
Depending on the nature of 𝑐𝑖 , there are three models with different estimation
methods:
▪ Pooled Estimation Model: There is no 𝑐𝑖 in the model (or it is omitted)
▪ Random Effects Estimation: 𝑐𝑖 exists but is not correlated with
 any independent variable 𝑋𝑘 in the model
▪ Fixed Effects Estimation: 𝑐𝑖 exists and is correlated with at
 least one independent variable 𝑋𝑘 in the model
❖ Pooled Estimation Model (POLS)
 𝑌𝑖𝑡 = 𝛽0 + 𝛽1 𝑋1𝑖𝑡 + ⋯ + 𝛽𝑘 𝑋𝑘𝑖𝑡 + 𝑐𝑖 + 𝑢𝑖𝑡 (1)
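 ▪ If 𝑐𝑖 really does not exist, then OLS is the best estimator for (1) under the
 following assumptions:
 ➢ POLS1: $E(u_{it} \mid X) = 0, \ \forall i, t$
 ➢ POLS2: The random error $u_{it}$ is not autocorrelated
 ➢ POLS3: $var(u_{it} \mid X) = \sigma^2, \ \forall i, t$
 ▪ If 𝑐𝑖 exists (which is quite common), then pooled OLS is biased when 𝑐𝑖 is
 correlated with the independent variables; when it is not, the coefficients
 stay unbiased but the usual standard errors are invalid
 ▪ Command on Stata: reg Y X1 … Xk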
❖ Random Effects Estimation (RE)
 𝑌𝑖𝑡 = 𝛽0 + 𝛽1 𝑋1𝑖𝑡 + ⋯ + 𝛽𝑘 𝑋𝑘𝑖𝑡 + 𝑐𝑖 + 𝑢𝑖𝑡 (1)
 ▪ If 𝑐𝑖 exists but is not correlated with 𝑋𝑖𝑡 , then there is no
 endogeneity problem. However, since 𝑐𝑖 is absorbed into the
 random error, the new random error 𝑣𝑖𝑡 = 𝑐𝑖 + 𝑢𝑖𝑡 is
 autocorrelated within each individual.
 ▪ The random effects estimation method focuses on solving the
 autocorrelation problem of 𝑣𝑖𝑡
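The autocorrelation of the composite error can be written out explicitly. The sketch below assumes that 𝑐𝑖 and 𝑢𝑖𝑡 are uncorrelated, that 𝑢𝑖𝑡 is not autocorrelated, and that the variances are constant; the symbols σ_c² and σ_u² denote var(c_i) and var(u_it).

```latex
% For the same individual i and two different periods t \neq s:
cov(v_{it}, v_{is}) = cov(c_i + u_{it},\; c_i + u_{is}) = var(c_i) = \sigma_c^2
% Variance of the composite error:
var(v_{it}) = \sigma_c^2 + \sigma_u^2
% Hence the within-individual correlation is constant across all lags:
corr(v_{it}, v_{is}) = \frac{\sigma_c^2}{\sigma_c^2 + \sigma_u^2}, \qquad t \neq s
```

The correlation is zero only if σ_c² = 0, which is the case in which the individual effect plays no role.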
The assumptions of the RE estimation method are as follows:
 ❖ RE1: $E(u_{it} \mid X, c) = 0$; $E(c_i \mid X) = 0$, $\forall i, t$
 ❖ RE2: The random error $u_{it}$ is not autocorrelated
 ❖ RE3: $var(c_i \mid X) = \sigma_c^2$; $var(u_{it} \mid X, c) = \sigma_u^2$, $\forall i, t$
Estimation methods for RE model:
 ❖ Generalized Least Squares (GLS) Estimator
 ❖ Maximum Likelihood Estimation (MLE)
Command on Stata (for GLS estimator method):
 xtreg Y X1 … Xk, re
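Both estimation methods can be tried as in the following sketch. It assumes the Grunfeld investment data loaded with webuse (which needs an internet connection); the dataset and its variables invest, mvalue, kstock, company and year are an illustration chosen here, not part of the course material.

```stata
* Example data: 10 companies observed over 20 years
webuse grunfeld, clear
xtset company year

* Random effects by GLS; the theta option also reports the
* quasi-demeaning weight used by the estimator
xtreg invest mvalue kstock, re theta

* Random effects by maximum likelihood
xtreg invest mvalue kstock, mle
```

With a reasonably large sample the two sets of coefficients are usually close; large differences can be a sign that the RE assumptions deserve a second look.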
❖ Fixed Effects Estimation (FE)
 𝑌𝑖𝑡 = 𝛽0 + 𝛽1 𝑋1𝑖𝑡 + ⋯ + 𝛽𝑘 𝑋𝑘𝑖𝑡 + 𝑐𝑖 + 𝑢𝑖𝑡 (1)
 ▪ If 𝑐𝑖 exists and is correlated with 𝑋𝑖𝑡 , then there is an endogeneity
 problem.
 ▪ The assumptions of the FE estimation method are as follows:
 ❖ FE1: $E(u_{it} \mid X_i, c_i) = 0, \ \forall t$, which implies:
 $cov(u_{it}, X_i) = 0$ and $cov(u_{it}, c_i) = 0$
 ❖ FE2: $rank\, E(X'X) = k$
 ❖ FE3: $var(u_{it} \mid X_{it}) = \sigma_u^2$; $cov(u_i, u_j) = 0, \ \forall i \neq j$
❖ The within estimator with FE model
 𝑌𝑖𝑡 = 𝛽0 + 𝛽1 𝑋𝑖𝑡 + 𝑐𝑖 + 𝑢𝑖𝑡 (1)
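▪ For each 𝑖, averaging equation (1) over time gives:
 $\frac{1}{T}\sum_t Y_{it} = \beta_0 + \beta_1 \cdot \frac{1}{T}\sum_t X_{it} + c_i + \frac{1}{T}\sum_t u_{it}$
 or $\bar{Y}_i = \beta_0 + \beta_1 \bar{X}_i + c_i + \bar{u}_i$ (2)
▪ Subtracting (2) from (1), because 𝑐𝑖 is fixed over time, we have:
 $Y_{it} - \bar{Y}_i = \beta_1 (X_{it} - \bar{X}_i) + (u_{it} - \bar{u}_i)$ or $\ddot{Y}_{it} = \beta_1 \ddot{X}_{it} + \ddot{u}_{it}$ (3)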
A pooled OLS estimator that is based on the time-demeaned variables is called the fixed
effects estimator or the within estimator.
Command on Stata: xtreg Y X, fe
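The time-demeaning can be reproduced by hand to see that it gives the same slope as xtreg with the fe option. This is a sketch on the Grunfeld data loaded with webuse (an illustrative dataset chosen here, needing an internet connection); the names ending in _bar and _dm are hypothetical helper variables.

```stata
webuse grunfeld, clear
xtset company year

* Fixed effects (within) estimator
xtreg invest mvalue, fe
scalar b_fe = _b[mvalue]

* Manual within transformation: subtract each company's time mean
bysort company: egen double invest_bar = mean(invest)
bysort company: egen double mvalue_bar = mean(mvalue)
generate double invest_dm = invest - invest_bar
generate double mvalue_dm = mvalue - mvalue_bar

* Pooled OLS on the demeaned variables, without a constant
regress invest_dm mvalue_dm, noconstant

* The two slope estimates coincide
display "xtreg, fe: " b_fe "    manual: " _b[mvalue_dm]
```

The standard errors of the manual regression differ from those of xtreg, because regress does not account for the degrees of freedom used up by the company means.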
 Estimation model selection Tests
Breusch – Pagan Test
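❖ Hypothesis testing:
 $H_0: var(c_i) = 0$
 $H_1: var(c_i) \neq 0$
❖ Test statistic (computed from the pooled OLS residuals $v_{it}$)
 $\lambda_{LM} = \frac{(nT)^2}{2(nT^2 - nT)} \left[ 1 - \frac{\sum_{i=1}^{n} \left( \sum_{t=1}^{T} v_{it} \right)^2}{\sum_{i=1}^{n} \sum_{t=1}^{T} v_{it}^2} \right]^2$
If 𝐻0 is true, then 𝜆𝐿𝑀 follows a chi-squared distribution with one degree of freedom
Command on Stata: xttest0 (run after xtreg …, re)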
Hausman Test
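❖ Hypothesis testing:
 $H_0: cov(c_i, X_{it}) = 0$
 $H_1: cov(c_i, X_{it}) \neq 0$
❖ Test statistic
 $\chi^2_{obs} = (\hat{\beta}_{FE} - \hat{\beta}_{RE})' (V_{FE} - V_{RE})^{-1} (\hat{\beta}_{FE} - \hat{\beta}_{RE})$
If 𝐻0 is true, then $\chi^2_{obs}$ follows a chi-squared distribution with degrees of
freedom equal to the number of coefficients compared
Command on Stata: hausman fe re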
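 Step 1: RE or POLS? (xttest0)
 ▪ P-value large (𝑐𝑖 = 0): POLS
 ▪ P-value small (𝑐𝑖 ≠ 0): FE or RE, go to Step 2
 Step 2: FE or RE? (Hausman)
 ▪ P-value large: RE
 ▪ P-value small: FE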
Practice on Stata
❖ Step 1: Model selection between POLS and RE
 xtreg Y X1…Xk, re
 xttest0 => If the P-value is large (above the chosen significance level), then POLS is the best model; otherwise go to Step 2
❖ Step 2: Model selection between FE and RE
 xtreg Y X1…Xk, fe
 est store fe
 xtreg Y X1…Xk, re
 est store re
 hausman fe re => If the P-value is small, then FE is the best model
 If the P-value is large, then RE is the best model
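The two steps can be run end to end as in this sketch. It assumes the Grunfeld data loaded with webuse (an illustrative dataset chosen here, needing an internet connection) in place of Y and X1…Xk.

```stata
webuse grunfeld, clear
xtset company year

* Step 1: POLS or RE? Breusch-Pagan LM test, run after the RE fit
xtreg invest mvalue kstock, re
estimates store re
xttest0

* Step 2: FE or RE? Hausman test on the two stored sets of estimates
xtreg invest mvalue kstock, fe
estimates store fe
hausman fe re
```

The order in hausman matters: the estimator that stays consistent under both hypotheses (FE) comes first, the efficient one (RE) second.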
 Some defects of the panel model
❖ In FE model
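▪ Autocorrelation
 xtserial Y X
 If the P-value is small, then the model has autocorrelation
 => xtregar Y X, fe
▪ Contemporaneous (cross-sectional) correlation
 xttest2
 If the P-value is small, then the model has contemporaneous correlation
 => xtscc Y X, fe
▪ Heteroskedasticity
 xttest3
 If the P-value is small, then the model has heteroskedasticity
 => xtreg Y X, fe robust
 Note: xtserial, xttest2, xttest3 and xtscc are community-contributed commands
 and must be installed before use; xttest2 and xttest3 are run after xtreg …, fe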
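A possible sequence for the FE checks is sketched below. It assumes that the community-contributed commands xtserial, xttest2, xttest3 and xtscc are already installed, and it uses the Grunfeld data loaded with webuse as an illustrative dataset chosen here.

```stata
* Assumption: xtserial, xttest2, xttest3 and xtscc are already installed
webuse grunfeld, clear
xtset company year

* Autocorrelation (null hypothesis: no first-order autocorrelation)
xtserial invest mvalue kstock

* The next two tests are run after the FE fit
xtreg invest mvalue kstock, fe
* Null hypothesis: no correlation of the errors across individuals
xttest2
* Null hypothesis: same error variance for all individuals
xttest3

* Possible remedies, depending on which test rejects
xtregar invest mvalue kstock, fe
xtscc invest mvalue kstock, fe
xtreg invest mvalue kstock, fe vce(robust)
```

Only the remedy matching the detected problem is needed; the three are listed together here to show the syntax.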
❖ In RE model
▪ Autocorrelation
 xttest1
 If the P-value is small, then the model has autocorrelation
 => xtregar Y X, re
▪ Heteroskedasticity
 xtreg Y X, re
 predict res1, ue
 robvar res1, by(id)
 If the P-value is small, then the model has heteroskedasticity
 => xtreg Y X, re robust
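The heteroskedasticity check for the RE model uses only built-in commands and can be sketched as follows, again on the Grunfeld data loaded with webuse (an illustrative dataset chosen here, needing an internet connection).

```stata
webuse grunfeld, clear
xtset company year

* RE fit and the combined residual c_i + u_it
xtreg invest mvalue kstock, re
predict double res1, ue

* Tests of equal residual variance across companies
robvar res1, by(company)

* If equal variances are rejected, refit with robust standard errors
xtreg invest mvalue kstock, re vce(robust)
```

The coefficients of the last fit match the first one; only the standard errors change.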
Practice