CH1&2
SLR
$Y_i = \beta_0 + \beta_1X_i + \epsilon_i$
Least Squares
Residual
$e_i = Y_i - \hat{Y}_i$
MSE in SLR: $MSE = \frac{SSE}{n-2} = \frac{\sum e_i^2}{n-2}$
Normal Equations
$$
\begin{aligned}
\sum Y_i &= nb_0 + b_1\sum X_i \\
\sum X_iY_i &= b_0\sum X_i + b_1\sum X_i^2
\end{aligned}
$$
Inferences on b1
Confidence Interval
Inferences on b0
b0 in the form of a linear combination of Y: $b_0 = \sum_{i=1}^n\left(\tfrac{1}{n}-\bar{X} k_i\right)Y_i$, where $k_i = \frac{X_i-\bar{X}}{\sum_j (X_j-\bar{X})^2}$
Distribution
Confidence Interval
Inferences on E{Yh}
Distribution
Confidence Interval
```r
# 90% prediction interval at X_h = 6 (copier example)
predict(copier.lm, newdata = data.frame(copier_nums = 6),
        interval = "pred", level = 0.9)
```
Prediction of New Observation
Prediction Interval
when parameters are known
when parameters are unknown
Power Analysis
Notice that the power is the sum of two tail probabilities; the tail on the side away from the true alternative is very small.
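As a hedged illustration, both tail probabilities can be computed from the noncentral t distribution (alpha, n, beta1, and sigma.b1 below are illustrative assumed values, not from the notes):

```r
# minimal sketch: power of the two-sided t test for beta_1
alpha <- 0.05; n <- 25        # illustrative values
beta1 <- 1; sigma.b1 <- 0.4   # illustrative values
delta <- beta1 / sigma.b1     # noncentrality parameter
tc <- qt(1 - alpha/2, df = n - 2)   # critical point
# power = left tail + right tail; the tail on the side away from
# the true alternative (here the left one) is very small
pt(-tc, df = n - 2, ncp = delta) + (1 - pt(tc, df = n - 2, ncp = delta))
```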
WHS Confidence Band (SLR with p=2)
Measures to describe the degree of linear association
Coefficient of Determination
```r
summary(copier.lm)$r.squared
```
Coefficient of Correlation
Inferences on Correlation Coefficients
In a bivariate normal model, the parameter $\rho_{12}$ provides information about the degree of the linear relationship between the two variables $Y_1$ and $Y_2$.
Point Estimator
The maximum likelihood estimator of $\rho_{12}$ is the sample correlation coefficient $r_{12}$.
Spearman’s Rank Correlation
```r
cor(Y1, Y2, method = "spearman")
```
CH3 Remediation
Diagnostics and Remedial Measures
Residual Plot
$e_i = Y_i - \hat{Y_i}$
Standard Version:
$e_i$ vs. $\hat{Y}_i$
$e_i$ vs. $X_i$
$|e_i|$ or $e_i^2$ vs. $X_i$
Semistudentized Residual: $e_i^* = \frac{e_i}{\sqrt{MSE}}$
Outliers
Test of Normality of error terms
Small departures from normality do not create any serious problems. Major departures, on the other hand, should be of concern.
Normality Plot
```r
qqnorm(ei)
```
short-tailed: ignore if not extreme
long-tailed: 1. use another distribution; 2. use a resampling method
skewed: do a transformation
Shapiro-Wilk Test
Null hypothesis: normal
```r
shapiro.test(resid(CH01TA01.lm))
```
Test of Homogeneity
Brown-Forsythe Test
```r
require(lawstat)
```
Note that the test statistic from BF.htest is an F statistic, but what we want is a t statistic, so t = sqrt(BF.htest$statistic).
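A minimal sketch of the Brown-Forsythe test on the residuals, assuming the cases are split into two groups at the median of X (data.lm, X, and grp are illustrative names):

```r
# split residuals into two groups at the median of X
grp <- factor(X <= median(X))
BF.htest <- levene.test(resid(data.lm), grp, location = "median")
t.stat <- sqrt(BF.htest$statistic)   # with two groups, t = sqrt(F)
```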
Lack of Fit (LOF) Test
A formal test for determining whether a specific type of regression function adequately fits the data.
c: number of distinct levels of X
Assumptions
For observations Y given X:
Independent
Normally distributed
the distributions of Y have the same variance $\sigma^2$
Requires replicates.
Test Statistic
```r
reduced.model.lm <- lm(Y ~ X)
```
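Continuing from that reduced model, a minimal sketch of the full-versus-reduced comparison, assuming replicate observations at some X levels (object names are illustrative):

```r
full.model.lm <- lm(Y ~ factor(X))       # full model: one mean per distinct X level
anova(reduced.model.lm, full.model.lm)   # lack-of-fit F test
```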
Uncover curvilinear relationship in SLR model
Box-Cox power transform
The model: $Y_i^{\lambda} = \beta_0 + \beta_1 X_i + \epsilon_i$
First, standardize Y:
$$
W_i = \begin{cases} K_1\left(Y_i^{\lambda}-1\right), & \lambda \neq 0 \\ K_2\ln Y_i, & \lambda = 0 \end{cases}
\qquad K_2 = \left(\prod_i Y_i\right)^{1/n},\quad K_1 = \frac{1}{\lambda K_2^{\lambda-1}}
$$
Use maximum likelihood to estimate $\lambda$ and rest of the parameters.
Code:
```r
require(MASS)
```
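A minimal sketch of estimating $\lambda$ with MASS::boxcox (variable names illustrative):

```r
bc <- boxcox(Y ~ X, lambda = seq(-2, 2, 0.1))   # profile log-likelihood over lambda
lambda.hat <- bc$x[which.max(bc$y)]             # lambda maximizing the likelihood
```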
CH4 Simultaneous Inference
Simultaneous Inference
Bonferroni’s Inequality
Bonferroni's inequality: for events $A_i$, $P\left(\bigcup_i A_i\right) \leq \sum_i P(A_i)$. With each test run at type I error rate $\alpha = P(A_i)$, the familywise error rate is at most $g\alpha$.
Bonferroni-adjusted critical point
Bonferroni Conf. Intervals on E{Yh}
Working-Hotelling-Scheffé (WHS) Intervals on E{Yh}
General Form:
$$
W^2 = pF(1-\alpha;\, p,\, n-p)
$$
So, use the WHS value if W ≤ B, and use Bonferroni if B < W.
Regression through the Origin
Inferences of b1
Estimate of b1: $b_1 = \frac{\sum X_iY_i}{\sum X_i^2}$
Confidence Interval of b1
Inferences of E{Y}
Code:
```r
lm(Y ~ X - 1)   # the -1 removes the intercept
```
Inverse Prediction
CH5 Matrix Approach to SLR
Basic Rules
Special Matrices
X’X / X’Y / Y’Y
all ones: $J$ (the matrix with every entry equal to 1)
Covariance Matrix
Matrix Formulation of SLR
LS Estimation
$b = (X'X)^{-1}X'Y$
MSE = SSE/df
$\hat{Y_h}$
Prediction
Hat Matrix: $H = X(X'X)^{-1}X'$
Residual Vector: $e = Y - \hat{Y} = (I - H)Y$
ANOVA
Quadratic Forms
Codes
```r
# invert a matrix: X -> X^(-1)
solve(X)
```
Be careful when using X in matrix form: it is the design matrix, which includes an extra column of ones for the intercept.
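A minimal sketch of these formulas in R, assuming X and Y are plain data vectors (names illustrative):

```r
Xmat <- cbind(1, X)                                 # design matrix with a ones column
b <- solve(t(Xmat) %*% Xmat) %*% t(Xmat) %*% Y      # b = (X'X)^{-1} X'Y
H <- Xmat %*% solve(t(Xmat) %*% Xmat) %*% t(Xmat)   # hat matrix
e <- Y - H %*% Y                                    # residual vector e = (I - H)Y
SSE <- t(e) %*% e                                   # quadratic form
```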
CH6 Multiple Regression I
Multiple Linear Regression (MLR)
Matrix form:
There are p-1 predictor variables (p parameters, including the intercept).
Matrix Form
MLR ANOVA Table
F test of MLR
Adjusted R^2
Notice that by adding a new $X_k$ to the model, SSE cannot increase. Thus we can drive $R^2 \to 1$ simply by pushing $p \to n$.
The adjusted $R^2$ compensates by replacing the SS terms with MS terms:
$$
R_a^2 = 1 - \frac{SSE/(n-p)}{SSTO/(n-1)} = 1 - \left(\frac{n-1}{n-p}\right)\frac{SSE}{SSTO}
$$
MLR Inferences
Inferences on b
Pointwise
Familywise (Bonferroni Adjustment)
Inferences on E{Y}
Pointwise CI and familywise CI
pointwise: $\hat{Y}_h \pm t(1-\alpha/2;\, n-p)\, s\{\hat{Y}_h\}$; familywise (Bonferroni, g intervals): $\hat{Y}_h \pm t(1-\alpha/(2g);\, n-p)\, s\{\hat{Y}_h\}$
WHS (Confidence Band)
For a simultaneous $1-\alpha$ confidence (hyper-)band on E{Y} over all possible vectors $X_h$, use the WHS method: $\hat{Y}_h \pm W s\{\hat{Y}_h\}$ with $W^2 = pF(1-\alpha;\, p,\, n-p)$.
Prediction
estimate: $\hat{Y}_h = X_h'b$; standard error: $s\{pred\} = \sqrt{MSE + s^2\{\hat{Y}_h\}}$
CI (pointwise): $\hat{Y}_h \pm t(1-\alpha/2;\, n-p)\, s\{pred\}$
CI (familywise, Scheffé S-method): use $S^2 = gF(1-\alpha;\, g,\, n-p)$, where g is the number of intervals to be predicted.
(Bonferroni method): use $t(1-\alpha/(2g);\, n-p)$.
Test for constant variance
Brown-Forsythe Test
```r
require(lawstat)
```
Notice that the grouping is based on $\hat{Y}$: order the cases by fitted value and split them into low/high groups before applying the Brown-Forsythe test.
Lack of Fit test
CH7 Multiple Regression II
ANOVA
Partial R^2. (Coefficient of Partial Determination.)
Codes for getting partial R square
```r
CH07TA01.aov <- anova(lm(Y ~ X1 + X2 + X3))   # sequential (Type I) SS
# partial R^2 of X2 given X1: SSR(X2|X1) / SSE(X1)
CH07TA01.aov[2, 2] / anova(lm(Y ~ X1))[2, 2]
```
Multicollinearity
Large correlation between X1 and X2 indicates possible multicollinearity!
Remedies for multicollinearity
CH8 Quantitative and Qualitative Predictors for MLR
Polynomial Regression
centering: $x_i = X_i - \bar{X}$
Why use centering in regression?
In regression, it is often recommended to center the features so that the predictors have a mean of 0. This makes it easier to interpret the intercept term as the expected value of Y when the predictor values are set to their means. A few other aspects argue for centering in regression:
- When one variable has a very large scale: e.g. if you are using the population size of a country as a predictor. In that case, the regression coefficients may be on a very small order of magnitude (e.g. 1e-9), which can be a little annoying when you're reading computer output, so you may convert the variable to, for example, population size in millions, or simply normalize it.
- While creating power terms: Let’s say you have a variable, X, that ranges from 1 to 2, but you suspect a curvilinear relationship with the response variable, and so you want to create an X² term. If you don’t center X first, your squared term will be highly correlated with X, which could muddy the estimation of the beta. Centering first addresses this issue.
- Creating interaction terms: if an interaction/product term is created from two variables that are not centered on 0, some amount of collinearity will be induced (with the exact amount depending on various factors); see the sketch after this list.
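A minimal sketch of centering before building power and interaction terms (x1, x2, y are illustrative vectors):

```r
x1c <- x1 - mean(x1)   # center each predictor
x2c <- x2 - mean(x2)
fit <- lm(y ~ x1c + I(x1c^2) + x2c + x1c:x2c)   # squared and interaction terms
```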
Testing Polynomial Models
Qualitative Predictors
Binary scatter plot
Equal-Slope ANCOVA (without interaction)
Plot two types of data
Unequal-Slope ANCOVA (with interaction)
Example:
To test whether the two levels of the categorical variable differ from each other, simply test whether the coefficient of that variable in the lm model is significant.
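A minimal sketch, assuming Z is the two-level factor (all names illustrative):

```r
full.lm <- lm(Y ~ X * Z)   # unequal slopes (includes the X:Z interaction)
red.lm  <- lm(Y ~ X + Z)   # equal slopes
anova(red.lm, full.lm)     # test whether the interaction is needed
summary(full.lm)           # t test on the Z coefficient: level difference
```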
Assess the quality of a fit
overall residual plot
individual residual plot of each level of the factor
normality qqplot
outlier analysis via Studentized deleted residuals
influence diagnostics
make sure there are no observations with unusually high influence
CH9 Model Building – I: Variable Selection
Metrics for Variable Selection
1. Maximum $R_p^2$
Among a group of different possible models, each with p parameters (p–1 predictors), choose the model with the highest $R_p^2$ .
But $R^2$ cannot decrease as predictors are added.
In practice, we look for a flat point.
2. Maximum $R_{ap}^2$
Adjusted R^2
3. Mallows' $C_p$
$C_p = \frac{SSE_p}{MSE(\text{full model})} - (n - 2p)$; look for models with $C_p$ small and near p.
4. AIC and BIC
Information criteria (IC): $AIC_p = n\ln(SSE_p/n) + 2p$ and $BIC_p = n\ln(SSE_p/n) + p\ln n$; smaller is better.
5. PRESS
When prediction of a future $Y_i$ is a central goal, we can study the prediction error for each observation.
Let $\hat{Y}_{i(i)}$ be the value predicted at observation i after leaving $Y_i$ out of the MLR calculations (a "leave-one-out," or LOO, predictor: a kind of cross-validation). Then $PRESS_p = \sum_{i=1}^n \left(Y_i - \hat{Y}_{i(i)}\right)^2$.
If the model predicts $Y_i$ well even without $Y_i$ being used in the fit, it could be a good model.
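PRESS can be computed without refitting n times, using the identity $e_{i(i)} = e_i/(1-h_{ii})$; a minimal sketch (data.lm is an illustrative fitted lm object):

```r
# PRESS via the leave-one-out residual identity
press <- sum((resid(data.lm) / (1 - hatvalues(data.lm)))^2)
```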
Stepwise Variable Selection
Forward Stepwise Selection & Backward Elimination
CH10 Model Building – II: Diagnostics
Added-Variable Plots
Partial Residual Plot
Identifying Outlying Y Observations—Studentized Deleted Residuals
Studentized Residuals
Deleted Residuals
Studentized Deleted Residuals
They can be used to identify outlying Y observations:
- use a t test based on the Studentized deleted residuals;
- use the hat matrix leverage values, see the following.
Identifying Outlying X Observations—Hat Matrix Leverage Values
Hat Matrix
where p is the number of regression parameters in the regression function including the intercept term.
In addition, it can be shown that $h_{ii}$ is a measure of the distance between the X values for the ith case and the means of the X values for all n cases. Thus, a large value $h_{ii}$ indicates that the ith case is distant from the center of all X observations.
Leverage
The diagonal element $h_{ii}$ in this context is called the leverage (in terms of the X values) of the ith case.
Hence, leverage values greater than $2p/n$ are considered by this rule to indicate outlying cases with regard to their X values. Another suggested guideline is that $h_{ii}$ values exceeding 0.5 indicate very high leverage, whereas those between 0.2 and 0.5 indicate moderate leverage. Additional evidence of an outlying case is the existence of a gap between the leverage values for most of the cases and the unusually large leverage value(s).
Draw a plot:
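A minimal sketch of an index plot of the leverage values with the 2p/n cutoff (data.lm is an illustrative fitted lm object):

```r
hii <- hatvalues(data.lm)
p <- length(coef(data.lm)); n <- length(hii)
plot(hii, type = "h", ylab = "leverage")   # index plot of leverages
abline(h = 2 * p / n, lty = 2)             # rule-of-thumb cutoff 2p/n
```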
Use leverage to identify hidden extrapolation
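For a new point $X_h$, compute its leverage $h_{new} = X_h'(X'X)^{-1}X_h$ and compare it with the leverages of the fitted cases; a value beyond their range suggests hidden extrapolation. A minimal sketch (data.lm and the new point are illustrative):

```r
Xmat <- model.matrix(data.lm)                       # design matrix of the fit
xh <- c(1, 6)                                       # new point, with leading 1 (illustrative)
h.new <- t(xh) %*% solve(t(Xmat) %*% Xmat) %*% xh   # leverage of the new point
h.new > max(hatvalues(data.lm))                     # TRUE suggests hidden extrapolation
```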
Influence Measures
After identifying cases that are outlying with respect to their Y values and/or their X values, the next step is to ascertain whether or not these outlying cases are influential. We shall consider a case to be influential if its exclusion causes major changes in the fitted regression function.
DFFITS
The letters DF stand for the difference between the fitted value $\hat{Y}_i$ for the ith case when all n cases are used in fitting the regression function and the predicted value $\hat{Y}_{i(i)}$ for the ith case obtained when the ith case is omitted from the fit.
$t_i$ is the studentized deleted residual: $DFFITS_i = t_i\sqrt{\frac{h_{ii}}{1-h_{ii}}}$.
Cook’s Distance
In contrast to the DFFITS measure, which considers the influence of the ith case on the fitted value $\hat{Y_i}$ for this case, Cook’s distance measure considers the influence of the ith case on all n fitted values.
plotting codes:
```r
par(mfrow = c(1, 2))
plot(data.lm, which = 4)   # Cook's distance index plot (data.lm illustrative)
```
DFBETAS
A measure of the influence of the ith case on each regression coefficient $b_k$ (k = 0, 1, . . . , p −1) is the difference between the estimated regression coefficient $b_k$ based on all n cases and the regression coefficient obtained when the ith case is omitted, to be denoted by $b_{k(i)}$ .
Usage
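A minimal sketch of extracting DFBETAS and the other influence measures from a fit (data.lm is an illustrative fitted lm object):

```r
dfbetas(data.lm)              # DFBETAS for every case and coefficient
influence.measures(data.lm)   # DFFITS, Cook's D, DFBETAS, hat values together
```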
Multicollinearity Diagnostics
Variance Inflation
These factors measure how much the variances of the estimated regression coefficients are inflated as compared to when the predictor variables are not linearly related.
VIFs (variance inflation factors)
Note that the Y in the code does not enter the calculation: the VIFs can be computed without knowing Y.
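That is because the VIFs are the diagonal elements of the inverse of the predictors' correlation matrix; a minimal sketch (X1, X2, X3 are illustrative predictor vectors):

```r
# VIFs from the predictors alone: diagonal of the inverse correlation matrix
diag(solve(cor(cbind(X1, X2, X3))))   # VIF_k = 1 / (1 - R_k^2)
```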
Differences between correlation, collinearity and multicollinearity
Summary
CH11 Model Building – III: Remedial Measures
Weighted LS (Addressing Heterogeneous Variances)
By using WLS, we change the MLR model by replacing the constant variance assumption with heterogeneous variances, i.e., allowing every point to have a different variance.
Use $s_j^2$ to estimate $\sigma_j^2$ if there are replications
Estimation of Variance Function or Standard Deviation Function
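A minimal sketch of the usual two-step procedure, where the squared residuals are regressed on X to estimate the variance function (names illustrative):

```r
ols.lm <- lm(Y ~ X)                 # step 1: ordinary LS
var.lm <- lm(resid(ols.lm)^2 ~ X)   # step 2: estimate the variance function
wts <- 1 / fitted(var.lm)           # weights w_i = 1 / vhat_i
wls.lm <- lm(Y ~ X, weights = wts)  # step 3: weighted LS
```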
Ridge Regression (Addressing multicollinearity)
center Y
standardize the predictors
Correlation Transformation
Use of the correlation transformation helps with controlling roundoff errors and, by expressing the regression coefficients in the same units, may be of help when these coefficients are compared.
The correlation transformation is a simple modification of the usual standardization of a variable.
The predicted value: $\hat{Y} = X b^R$
Choose constant parameter c.
Code for choosing c:
- plot the ridge trace and find a point where the coefficients flatten out, or
- use the VIF values as the criterion.
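A minimal sketch using MASS::lm.ridge, assuming that is the tool in use (names illustrative):

```r
require(MASS)
rr <- lm.ridge(Y ~ X1 + X2 + X3, lambda = seq(0, 1, 0.01))
plot(rr)     # ridge trace: pick c where the coefficients flatten out
select(rr)   # HKB, LW, and GCV suggestions for the constant
```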
Smoothing
Draw contour plot and loess smoothing plot
Bootstrapping
Bootstrapping in regression and confidence interval
$2b_1 - b_1^*(1-\alpha/2) \leq \beta_1 \leq 2b_1 - b_1^*(\alpha/2)$
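A minimal sketch of this reflection interval by resampling cases (data.df with columns Y and X, B, and alpha are illustrative assumptions):

```r
alpha <- 0.1; B <- 1000                     # illustrative values
b1 <- coef(lm(Y ~ X, data = data.df))[2]
b1.star <- replicate(B, {
  idx <- sample(nrow(data.df), replace = TRUE)   # resample the cases
  coef(lm(Y ~ X, data = data.df[idx, ]))[2]
})
c(2*b1 - quantile(b1.star, 1 - alpha/2),    # lower limit
  2*b1 - quantile(b1.star, alpha/2))        # upper limit
```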
Codes in R
Draw Scatterplot Matrix
```r
CH06FI05.df <- data.frame(Y_SALES, X1_TARGPOP, X2_DISPINC)
pairs(CH06FI05.df)   # scatterplot matrix
```
Correlation Matrix
```r
cor(CH06FI05.df)
```
Normality plot
```r
qqnorm(resid(data.lm))
```
Confidence Interval Estimation
$\beta_1, \beta_2, \ldots$
Pointwise conf. Intervals
```r
confint(data.lm)
```
Simultaneous Bonferroni corrected CI
```r
g <- length(coef(CH06FI05.lm)) - 1          # number of betas, excluding the intercept
confint(CH06FI05.lm, level = 1 - alpha/g)   # Bonferroni-adjusted level
```
E{Yh} (g intervals)
Pointwise CI
```r
predict.lm(CH06FI05.lm, newdata = newdata.df, interval = "confidence", level = 1 - alpha)
```
Simultaneous CI
Yh (predicted values)
First compare Scheffé and Bonferroni critical points:
```r
Spoint <- sqrt(g * qf(1 - alpha, g, CH06FI05.lm$df.residual))   # Scheffé
Bpoint <- qt(1 - alpha/(2*g), CH06FI05.lm$df.residual)          # Bonferroni
```
Pointwise Prediction CI
```r
predict.lm(CH06FI05.lm, newdata = newdata.df, interval = "prediction")
```
Simultaneous Prediction CI
```r
predict.lm(CH06FI05.lm, newdata = newdata.df, interval = "prediction", level = 1 - alpha/g)
```
Overlay density curve
```r
hist(X, prob = TRUE)
lines(density(X))   # overlay the kernel density curve
```
Diagonal Hat Matrix values
```r
hii <- hatvalues(data.lm)
```
Draw loess smoothing plot
```r
scatter.smooth(Y ~ X, span = q, degree = 1, family = "symmetric")
```
Overlay a curve on a scatter plot
```r
plot(conc ~ temp, xlim = c(-4, 5), ylim = c(0, 150), xlab = "x", ylab = "y", main = "")
curve(predict(fit.lm, data.frame(temp = x)), add = TRUE)   # overlay fitted curve (fit.lm illustrative)
```
Draw two different data on the same scatter plot
```r
# pch 1 for Iowa, 19 for Nebraska
plot(Y ~ X, pch = 1 + (match(State, c("Iowa", "Nebraska")) - 1) * 18)
```
Determine which of the predictor variables are significant (including interaction terms); a code sketch follows this list.
plot the data (pairs plot)
center the predictors to accommodate the higher-order interaction terms
build full regression model
compare (anova) the full model to the reduced model (without interaction terms) and choose the reduced model if not significant
use anova to filter main effect terms and determine the final model
fit the final model and assess the quality of fit
i) VIFs
ii) normality and residual
iii) outlier (studentized residual plot)
iv) influence measures (identify observations that are out of the following standards)
a) hat values
b) cook’s distance
c) DFFITS
d) DFBETAS
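A minimal sketch of the main steps above (all names illustrative):

```r
pairs(data.df)                               # 1. pairs plot of the data
x1c <- x1 - mean(x1); x2c <- x2 - mean(x2)   # 2. center the predictors
full.lm <- lm(y ~ x1c * x2c)                 # 3. full model with interaction
red.lm  <- lm(y ~ x1c + x2c)                 # 4. reduced model
anova(red.lm, full.lm)                       # drop the interaction if not significant
influence.measures(red.lm)                   # 6. assess fit: influence diagnostics
```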
Variable Selection
Given a bunch of variables, choose a reduced model using backward elimination and use min BIC as selection criterion.
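A minimal sketch using step(), where k = log(n) makes the criterion BIC (full.lm and data.df are illustrative):

```r
n <- nrow(data.df)
step(full.lm, direction = "backward", k = log(n))   # backward elimination, min BIC
```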
ANCOVA model and analysis
draw scatter plot in two different types of pattern
determine a model (equal slope or unequal slope) based on the plot
assess the model by anova(full model) to check whether the interaction term is significant
refit the model if needed
choose a corresponding ANCOVA model. Write the model and decide which parameters to test. Use anova(m1, m2) to analyze and draw conclusions.
Predicting values out of the range of known values
When we predict values for points outside the range of data taken it is called extrapolation. Extrapolation over too far a range can be dangerous unless it is certain that the relationship between the variables continues over the entire range.
Confusions and Answers
When to use Type I and Type III sums of squares?
After all the variables are selected (the model is decided), what kind of analysis should be chosen: anova or lm?
anova() on an lm fit uses Type I (sequential) SS, so the order of the terms matters. So, which order should we choose?
The anova() and summary() functions perform two different tests. The first is clearly an ANOVA test, while summary() simply performs a t-test. T-tests should not be used to compare the explanatory capabilities of two models, because the t-test is set up only to check whether a predictor's coefficient differs from 0.
The ANOVA test, on the other hand, aims to detect whether the difference in explanatory power between two models is significant. If it is not, we prefer the model with fewer variables, because it is more parsimonious (Occam's razor).
- If we do a regression fit with predictors x1, x2, and x1x2, and they need to be scaled, are the two operations equivalent? Operation 1: scale x1 and x2, set x3 = (scaled x1)(scaled x2), and fit the regression model. Operation 2: scale x1, x2, and x3 = x1x2 separately.