CH1&2
SLR
$Y_i = \beta_0 + \beta_1X_i + \epsilon_i$
Least Squares
Residual
$e_i = Y_i - \hat{Y}_i$
MSE in SLR: $MSE = \frac{SSE}{n-2} = \frac{\sum e_i^2}{n-2}$
Normal Equations
$$
\begin{aligned}
\sum Y_i &= nb_0 + b_1\sum X_i \\
\sum X_iY_i &= b_0\sum X_i + b_1\sum X_i^2
\end{aligned}
$$
Inferences on b1
Confidence Interval
Inferences on b0
b0 in the form of a linear combination of Y: $b_0 = \sum_{i=1}^n\left(\tfrac{1}{n}-\bar{X} k_i\right)Y_i$, where $k_i = \frac{X_i-\bar{X}}{\sum_j (X_j-\bar{X})^2}$
Distribution
Confidence Interval
Inferences on E{Yh}
Distribution
Confidence Interval
```r
# 90% prediction interval at X_h = 6 (copier example)
predict(copier.lm, newdata = data.frame(copier_nums = 6),
        interval = "pred", level = 0.9)
```
Prediction of New Observation
Prediction Interval
when parameters are known
when parameters are unknown
Power Analysis
Notice that the power is the sum of two tail probabilities; the tail on the side away from the true alternative is very small.
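As a hedged illustration, both tail probabilities can be computed from the noncentral t distribution (alpha, n, beta1, and sigma.b1 below are illustrative assumed values, not from the notes):

```r
# minimal sketch: power of the two-sided t test for beta_1
alpha <- 0.05; n <- 25        # illustrative values
beta1 <- 1; sigma.b1 <- 0.4   # illustrative values
delta <- beta1 / sigma.b1     # noncentrality parameter
tc <- qt(1 - alpha/2, df = n - 2)   # critical point
# power = left tail + right tail; the tail on the side away from
# the true alternative (here the left one) is very small
pt(-tc, df = n - 2, ncp = delta) + (1 - pt(tc, df = n - 2, ncp = delta))
```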
WHS Confidence Band (SLR with p=2)
Measures to describe the degree of linear association
Coefficient of Determination
```r
summary(copier.lm)$r.squared
```
Coefficient of Correlation
Inferences on Correlation Coefficients
In a bivariate normal model, the parameter $\rho_{12}$ provides information about the degree of the linear relationship between the two variables $Y_1$ and $Y_2$.
Point Estimator
The maximum likelihood estimator of $\rho_{12}$ is the sample correlation coefficient $r_{12}$.
Spearman’s Rank Correlation
```r
cor(Y1, Y2, method = "spearman")
```
CH3 Remediation
Diagnostics and Remedial Measures
Residual Plot
$e_i = Y_i - \hat{Y_i}$
Standard Version:
$e_i$ vs. $\hat{Y}_i$
$e_i$ vs. $X_i$
$|e_i|$ or $e_i^2$ vs. $X_i$
Semistudentized Residual: $e_i^* = \frac{e_i}{\sqrt{MSE}}$
Outliers
Test of Normality of error terms
Small departures from normality do not create any serious problems. Major departures, on the other hand, should be of concern.
Normality Plot
```r
qqnorm(ei)
```
short-tailed: ignore if not extreme
long-tailed: 1. use another distribution; 2. use a resampling method
skewed: do a transformation
Shapiro-Wilk Test
Null hypothesis: normal
```r
shapiro.test(resid(CH01TA01.lm))
```
Test of Homogeneity
Brown-Forsythe Test
```r
require(lawstat)
```
Note that the test statistic from BF.htest is an F statistic, but what we want is a t statistic, so t = sqrt(BF.htest$statistic).
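A minimal sketch of the Brown-Forsythe test on the residuals, assuming the cases are split into two groups at the median of X (data.lm, X, and grp are illustrative names):

```r
# split residuals into two groups at the median of X
grp <- factor(X <= median(X))
BF.htest <- levene.test(resid(data.lm), grp, location = "median")
t.stat <- sqrt(BF.htest$statistic)   # with two groups, t = sqrt(F)
```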
Lack of Fit (LOF) Test
A formal test for determining whether a specific type of regression function adequately fits the data.
c: number of distinct levels of X
Assumptions
For observations Y given X:
Independent
Normally distributed
the distributions of Y have the same variance $\sigma^2$
Requires replicates.
Test Statistic
```r
reduced.model.lm <- lm(Y ~ X)
```
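Continuing from that reduced model, a minimal sketch of the full-versus-reduced comparison, assuming replicate observations at some X levels (object names are illustrative):

```r
full.model.lm <- lm(Y ~ factor(X))       # full model: one mean per distinct X level
anova(reduced.model.lm, full.model.lm)   # lack-of-fit F test
```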
Uncover curvilinear relationship in SLR model
Box-Cox power transform
The model: $Y_i^{\lambda} = \beta_0 + \beta_1 X_i + \epsilon_i$
First, standardize Y:
$$
W_i = \begin{cases} K_1\left(Y_i^{\lambda}-1\right), & \lambda \neq 0 \\ K_2\ln Y_i, & \lambda = 0 \end{cases}
\qquad K_2 = \left(\prod_i Y_i\right)^{1/n},\quad K_1 = \frac{1}{\lambda K_2^{\lambda-1}}
$$
Use maximum likelihood to estimate $\lambda$ and rest of the parameters.
Code:
```r
require(MASS)
```
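A minimal sketch of estimating $\lambda$ with MASS::boxcox (variable names illustrative):

```r
bc <- boxcox(Y ~ X, lambda = seq(-2, 2, 0.1))   # profile log-likelihood over lambda
lambda.hat <- bc$x[which.max(bc$y)]             # lambda maximizing the likelihood
```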
CH4 Simultaneous Inference
Simultaneous Inference
Bonferroni’s Inequality
Bonferroni's inequality: for events $A_i$, $P\left(\bigcup_i A_i\right) \leq \sum_i P(A_i)$. With each test run at type I error rate $\alpha = P(A_i)$, the familywise error rate is at most $g\alpha$.
Bonferroni-adjusted critical point
Bonferroni Conf. Intervals on E{Yh}
Working-Hotelling-Scheffé (WHS) Intervals on E{Yh}
General Form:
$$
W^2 = pF(1-\alpha;\, p,\, n-p)
$$
So, use the WHS value if W ≤ B, and use Bonferroni if B < W.
Regression through the Origin
Inferences of b1
Estimate of b1: $b_1 = \frac{\sum X_iY_i}{\sum X_i^2}$
Confidence Interval of b1
Inferences of E{Y}
Code:
```r
lm(Y ~ X - 1)   # the -1 removes the intercept
```
Inverse Prediction
CH5 Matrix Approach to SLR
Basic Rules
Special Matrices
X’X / X’Y / Y’Y
all ones: $J$ (the matrix with every entry equal to 1)
Covariance Matrix
Matrix Formulation of SLR
LS Estimation
$b = (X'X)^{-1}X'Y$
MSE = SSE/df
$\hat{Y_h}$
Prediction
Hat Matrix: $H = X(X'X)^{-1}X'$
Residual Vector: $e = Y - \hat{Y} = (I - H)Y$
ANOVA
Quadratic Forms
Codes
```r
# invert a matrix: X -> X^(-1)
solve(X)
```
Be careful when using X in matrix form: it is the design matrix, which includes an extra column of ones for the intercept.
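A minimal sketch of these formulas in R, assuming X and Y are plain data vectors (names illustrative):

```r
Xmat <- cbind(1, X)                                 # design matrix with a ones column
b <- solve(t(Xmat) %*% Xmat) %*% t(Xmat) %*% Y      # b = (X'X)^{-1} X'Y
H <- Xmat %*% solve(t(Xmat) %*% Xmat) %*% t(Xmat)   # hat matrix
e <- Y - H %*% Y                                    # residual vector e = (I - H)Y
SSE <- t(e) %*% e                                   # quadratic form
```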
CH6 Multiple Regression I
Multiple Linear Regression (MLR)
Matrix form:
There are p-1 predictor variables (p parameters, including the intercept).
Matrix Form
MLR ANOVA Table
F test of MLR
Adjusted R^2
Notice that by adding a new $X_k$ to the model, SSE cannot increase. Thus we can drive $R^2 \to 1$ simply by pushing $p \to n$.
The adjusted $R^2$ compensates by replacing the SS terms with MS terms:
$$
R_a^2 = 1 - \frac{SSE/(n-p)}{SSTO/(n-1)} = 1 - \left(\frac{n-1}{n-p}\right)\frac{SSE}{SSTO}
$$
MLR Inferences
Inferences on b
Pointwise
Familywise (Bonferroni Adjustment)
Inferences on E{Y}
Pointwise CI and familywise CI
pointwise: $\hat{Y}_h \pm t(1-\alpha/2;\, n-p)\, s\{\hat{Y}_h\}$; familywise (Bonferroni, g intervals): $\hat{Y}_h \pm t(1-\alpha/(2g);\, n-p)\, s\{\hat{Y}_h\}$
WHS (Confidence Band)
For a simultaneous $1-\alpha$ confidence (hyper-)band on E{Y} over all possible vectors $X_h$, use the WHS method: $\hat{Y}_h \pm W s\{\hat{Y}_h\}$ with $W^2 = pF(1-\alpha;\, p,\, n-p)$.
Prediction
estimate: $\hat{Y}_h = X_h'b$; standard error: $s\{pred\} = \sqrt{MSE + s^2\{\hat{Y}_h\}}$
CI (pointwise): $\hat{Y}_h \pm t(1-\alpha/2;\, n-p)\, s\{pred\}$
CI (familywise, Scheffé S-method): use $S^2 = gF(1-\alpha;\, g,\, n-p)$, where g is the number of intervals to be predicted.
(Bonferroni method): use $t(1-\alpha/(2g);\, n-p)$.
Test for constant variance
Brown-Forsythe Test
```r
require(lawstat)
```
Notice that the grouping is based on $\hat{Y}$: order the cases by fitted value and split them into low/high groups before applying the Brown-Forsythe test.
Lack of Fit test
CH7 Multiple Regression II
ANOVA
Partial R^2. (Coefficient of Partial Determination.)
Codes for getting partial R square
```r
CH07TA01.aov <- anova(lm(Y ~ X1 + X2 + X3))   # sequential (Type I) SS
# partial R^2 of X2 given X1: SSR(X2|X1) / SSE(X1)
CH07TA01.aov[2, 2] / anova(lm(Y ~ X1))[2, 2]
```
Multicollinearity
Large correlation between X1 and X2 indicates possible multicollinearity!
Remedies for multicollinearity
CH8 Quantitative and Qualitative Predictors for MLR
Polynomial Regression
centering: $x_i = X_i - \bar{X}$
Why use centering in regression?
In regression, it is often recommended to center the features so that the predictors have a mean of 0. This makes it easier to interpret the intercept term as the expected value of Y when the predictor values are set to their means. A few other aspects argue for centering in regression:
- When one variable has a very large scale: e.g. if you are using the population size of a country as a predictor. In that case, the regression coefficients may be on a very small order of magnitude (e.g. 1e-9), which can be a little annoying when you're reading computer output, so you may convert the variable to, for example, population size in millions, or simply normalize it.
- While creating power terms: Let’s say you have a variable, X, that ranges from 1 to 2, but you suspect a curvilinear relationship with the response variable, and so you want to create an X² term. If you don’t center X first, your squared term will be highly correlated with X, which could muddy the estimation of the beta. Centering first addresses this issue.
- Creating interaction terms: if an interaction/product term is created from two variables that are not centered on 0, some amount of collinearity will be induced (with the exact amount depending on various factors); see the sketch after this list.
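A minimal sketch of centering before building power and interaction terms (x1, x2, y are illustrative vectors):

```r
x1c <- x1 - mean(x1)   # center each predictor
x2c <- x2 - mean(x2)
fit <- lm(y ~ x1c + I(x1c^2) + x2c + x1c:x2c)   # squared and interaction terms
```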
Testing Polynomial Models
Qualitative Predictors
Binary scatter plot
Equal-Slope ANCOVA (without interaction)
Plot two types of data
Unequal-Slope ANCOVA (with interaction)
Example:
To test whether the two levels of the categorical variable differ from each other, simply test whether the coefficient of that variable in the lm model is significant.
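A minimal sketch, assuming Z is the two-level factor (all names illustrative):

```r
full.lm <- lm(Y ~ X * Z)   # unequal slopes (includes the X:Z interaction)
red.lm  <- lm(Y ~ X + Z)   # equal slopes
anova(red.lm, full.lm)     # test whether the interaction is needed
summary(full.lm)           # t test on the Z coefficient: level difference
```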
Assess the quality of a fit
overall residual plot
individual residual plot of each level of the factor
normality qqplot
outlier analysis via Studentized deleted residuals
influence diagnostics
make sure there are no observations with unusually high influence
CH9 Model Building – I: Variable Selection
Metrics for Variable Selection
1. Maximum $R_p^2$
Among a group of different possible models, each with p parameters (p–1 predictors), choose the model with the highest $R_p^2$ .
But $R^2$ cannot decrease as predictors are added.
In practice, we look for a flat point.
2. Maximum $R_{ap}^2$
Adjusted R^2
3. Mallows' $C_p$
$C_p = \frac{SSE_p}{MSE(\text{full model})} - (n - 2p)$; look for models with $C_p$ small and near p.
4. AIC and BIC
Information criteria (IC): $AIC_p = n\ln(SSE_p/n) + 2p$ and $BIC_p = n\ln(SSE_p/n) + p\ln n$; smaller is better.
5. PRESS
When prediction of a future $Y_i$ is a central goal, we can study the prediction error for each observation.
Let $\hat{Y}_{i(i)}$ be the value predicted at observation i after leaving $Y_i$ out of the MLR calculations (a "leave-one-out," or LOO, predictor: a kind of cross-validation). Then $PRESS_p = \sum_{i=1}^n \left(Y_i - \hat{Y}_{i(i)}\right)^2$.
If the model predicts $Y_i$ well even without $Y_i$ being used in the fit, it could be a good model.
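PRESS can be computed without refitting n times, using the identity $e_{i(i)} = e_i/(1-h_{ii})$; a minimal sketch (data.lm is an illustrative fitted lm object):

```r
# PRESS via the leave-one-out residual identity
press <- sum((resid(data.lm) / (1 - hatvalues(data.lm)))^2)
```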
Stepwise Variable Selection
Forward Stepwise Selection & Backward Elimination
CH10 Model Building – II: Diagnostics
Added-Variable Plots
Partial Residual Plot
Identifying Outlying Y Observations—Studentized Deleted Residuals
Studentized Residuals
Deleted Residuals
Studentized Deleted Residuals
They can be used to identify outlying Y observations:
- use a t test based on the Studentized deleted residuals;
- use the hat matrix leverage values, see the following.
Identifying Outlying X Observations—Hat Matrix Leverage Values
Hat Matrix
where p is the number of regression parameters in the regression function including the intercept term.
In addition, it can be shown that $h_{ii}$ is a measure of the distance between the X values for the ith case and the means of the X values for all n cases. Thus, a large value $h_{ii}$ indicates that the ith case is distant from the center of all X observations.
Leverage
The diagonal element $h_{ii}$ in this context is called the leverage (in terms of the X values) of the ith case.
Hence, leverage values greater than $2p/n$ are considered by this rule to indicate outlying cases with regard to their X values. Another suggested guideline is that $h_{ii}$ values exceeding 0.5 indicate very high leverage, whereas those between 0.2 and 0.5 indicate moderate leverage. Additional evidence of an outlying case is the existence of a gap between the leverage values for most of the cases and the unusually large leverage value(s).
Draw a plot:
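A minimal sketch of an index plot of the leverage values with the 2p/n cutoff (data.lm is an illustrative fitted lm object):

```r
hii <- hatvalues(data.lm)
p <- length(coef(data.lm)); n <- length(hii)
plot(hii, type = "h", ylab = "leverage")   # index plot of leverages
abline(h = 2 * p / n, lty = 2)             # rule-of-thumb cutoff 2p/n
```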
Use leverage to identify hidden extrapolation
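For a new point $X_h$, compute its leverage $h_{new} = X_h'(X'X)^{-1}X_h$ and compare it with the leverages of the fitted cases; a value beyond their range suggests hidden extrapolation. A minimal sketch (data.lm and the new point are illustrative):

```r
Xmat <- model.matrix(data.lm)                       # design matrix of the fit
xh <- c(1, 6)                                       # new point, with leading 1 (illustrative)
h.new <- t(xh) %*% solve(t(Xmat) %*% Xmat) %*% xh   # leverage of the new point
h.new > max(hatvalues(data.lm))                     # TRUE suggests hidden extrapolation
```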
Influence Measures
After identifying cases that are outlying with respect to their Y values and/or their X values, the next step is to ascertain whether or not these outlying cases are influential. We shall consider a case to be influential if its exclusion causes major changes in the fitted regression function.
DFFITS
The letters DF stand for the difference between the fitted value $\hat{Y}_i$ for the ith case when all n cases are used in fitting the regression function and the predicted value $\hat{Y}_{i(i)}$ for the ith case obtained when the ith case is omitted from the fit.
$t_i$ is the studentized deleted residual: $DFFITS_i = t_i\sqrt{\frac{h_{ii}}{1-h_{ii}}}$.
Cook’s Distance
In contrast to the DFFITS measure, which considers the influence of the ith case on the fitted value $\hat{Y_i}$ for this case, Cook’s distance measure considers the influence of the ith case on all n fitted values.
plotting codes:
```r
par(mfrow = c(1, 2))
plot(data.lm, which = 4)   # Cook's distance index plot (data.lm illustrative)
```
DFBETAS
A measure of the influence of the ith case on each regression coefficient $b_k$ (k = 0, 1, . . . , p −1) is the difference between the estimated regression coefficient $b_k$ based on all n cases and the regression coefficient obtained when the ith case is omitted, to be denoted by $b_{k(i)}$ .
Usage
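A minimal sketch of extracting DFBETAS and the other influence measures from a fit (data.lm is an illustrative fitted lm object):

```r
dfbetas(data.lm)              # DFBETAS for every case and coefficient
influence.measures(data.lm)   # DFFITS, Cook's D, DFBETAS, hat values together
```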
Multicollinearity Diagnostics
Variance Inflation
These factors measure how much the variances of the estimated regression coefficients are inflated as compared to when the predictor variables are not linearly related.
VIFs (variance inflation factors)
Note that the Y in the code does not enter the calculation: the VIFs can be computed without knowing Y.
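That is because the VIFs are the diagonal elements of the inverse of the predictors' correlation matrix; a minimal sketch (X1, X2, X3 are illustrative predictor vectors):

```r
# VIFs from the predictors alone: diagonal of the inverse correlation matrix
diag(solve(cor(cbind(X1, X2, X3))))   # VIF_k = 1 / (1 - R_k^2)
```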
Differences between correlation, collinearity and multicollinearity
Summary
CH11 Model Building – III: Remedial Measures
Weighted LS (Addressing Heterogeneous Variances)
By using WLS, we change the MLR model by replacing the constant variance assumption with heterogeneous variances, i.e., allowing every point to have a different variance.
Use $s_j^2$ to estimate $\sigma_j^2$ if there are replications
Estimation of Variance Function or Standard Deviation Function
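A minimal sketch of the usual two-step procedure, where the squared residuals are regressed on X to estimate the variance function (names illustrative):

```r
ols.lm <- lm(Y ~ X)                 # step 1: ordinary LS
var.lm <- lm(resid(ols.lm)^2 ~ X)   # step 2: estimate the variance function
wts <- 1 / fitted(var.lm)           # weights w_i = 1 / vhat_i
wls.lm <- lm(Y ~ X, weights = wts)  # step 3: weighted LS
```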
Ridge Regression (Addressing multicollinearity)
center Y
standardize the predictors
Correlation Transformation
Use of the correlation transformation helps with controlling roundoff errors and, by expressing the regression coefficients in the same units, may be of help when these coefficients are compared.
The correlation transformation is a simple modification of the usual standardization of a variable.
The predicted value: $\hat{Y} = X b^R$
Choose constant parameter c.
Code for choosing c:
- plot the ridge trace and find a point where the coefficients flatten out, or
- use the VIF values as the criterion.
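A minimal sketch using MASS::lm.ridge, assuming that is the tool in use (names illustrative):

```r
require(MASS)
rr <- lm.ridge(Y ~ X1 + X2 + X3, lambda = seq(0, 1, 0.01))
plot(rr)     # ridge trace: pick c where the coefficients flatten out
select(rr)   # HKB, LW, and GCV suggestions for the constant
```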
Smoothing
Draw contour plot and loess smoothing plot
Bootstrapping
Bootstrapping in regression and confidence interval
$2b_1 - b_1^*(1-\alpha/2) \leq \beta_1 \leq 2b_1 - b_1^*(\alpha/2)$
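A minimal sketch of this reflection interval by resampling cases (data.df with columns Y and X, B, and alpha are illustrative assumptions):

```r
alpha <- 0.1; B <- 1000                     # illustrative values
b1 <- coef(lm(Y ~ X, data = data.df))[2]
b1.star <- replicate(B, {
  idx <- sample(nrow(data.df), replace = TRUE)   # resample the cases
  coef(lm(Y ~ X, data = data.df[idx, ]))[2]
})
c(2*b1 - quantile(b1.star, 1 - alpha/2),    # lower limit
  2*b1 - quantile(b1.star, alpha/2))        # upper limit
```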
Codes in R
Draw Scatterplot Matrix
```r
CH06FI05.df <- data.frame(Y_SALES, X1_TARGPOP, X2_DISPINC)
pairs(CH06FI05.df)   # scatterplot matrix
```
Correlation Matrix
```r
cor(CH06FI05.df)
```
Normality plot
```r
qqnorm(resid(data.lm))
```
Confidence Interval Estimation
$\beta_1, \beta_2, \ldots$
Pointwise conf. Intervals
```r
confint(data.lm)
```
Simultaneous Bonferroni corrected CI
```r
g <- length(coef(CH06FI05.lm)) - 1          # number of betas, excluding the intercept
confint(CH06FI05.lm, level = 1 - alpha/g)   # Bonferroni-adjusted level
```
E{Yh} (g intervals)
Pointwise CI
```r
predict.lm(CH06FI05.lm, newdata = newdata.df, interval = "confidence", level = 1 - alpha)
```
Simultaneous CI
Yh (predicted values)
First compare Scheffé and Bonferroni critical points:
```r
Spoint <- sqrt(g * qf(1 - alpha, g, CH06FI05.lm$df.residual))   # Scheffé
Bpoint <- qt(1 - alpha/(2*g), CH06FI05.lm$df.residual)          # Bonferroni
```
Pointwise Prediction CI
```r
predict.lm(CH06FI05.lm, newdata = newdata.df, interval = "prediction")
```
Simultaneous Prediction CI
```r
predict.lm(CH06FI05.lm, newdata = newdata.df, interval = "prediction", level = 1 - alpha/g)
```
Overlay density curve
```r
hist(X, prob = TRUE)
lines(density(X))   # overlay the kernel density curve
```
Diagonal Hat Matrix values
```r
hii <- hatvalues(data.lm)
```
Draw loess smoothing plot
```r
scatter.smooth(Y ~ X, span = q, degree = 1, family = "symmetric")
```
Overlay a curve on a scatter plot
```r
plot(conc ~ temp, xlim = c(-4, 5), ylim = c(0, 150), xlab = "x", ylab = "y", main = "")
curve(predict(fit.lm, data.frame(temp = x)), add = TRUE)   # overlay fitted curve (fit.lm illustrative)
```
Draw two different data on the same scatter plot
```r
# pch 1 for Iowa, 19 for Nebraska
plot(Y ~ X, pch = 1 + (match(State, c("Iowa", "Nebraska")) - 1) * 18)
```
Determine which of the predictor variables are significant (including interaction terms); a code sketch follows this list.
plot the data (pairs plot)
center the predictors to accommodate the higher-order interaction terms
build full regression model
compare (anova) the full model to the reduced model (without interaction terms) and choose the reduced model if not significant
use anova to filter main effect terms and determine the final model
fit the final model and assess the quality of fit
i) VIFs
ii) normality and residual
iii) outlier (studentized residual plot)
iv) influence measures (identify observations that are out of the following standards)
a) hat values
b) cook’s distance
c) DFFITS
d) DFBETAS
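A minimal sketch of the main steps above (all names illustrative):

```r
pairs(data.df)                               # 1. pairs plot of the data
x1c <- x1 - mean(x1); x2c <- x2 - mean(x2)   # 2. center the predictors
full.lm <- lm(y ~ x1c * x2c)                 # 3. full model with interaction
red.lm  <- lm(y ~ x1c + x2c)                 # 4. reduced model
anova(red.lm, full.lm)                       # drop the interaction if not significant
influence.measures(red.lm)                   # 6. assess fit: influence diagnostics
```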
Variable Selection
Given a bunch of variables, choose a reduced model using backward elimination and use min BIC as selection criterion.
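A minimal sketch using step(), where k = log(n) makes the criterion BIC (full.lm and data.df are illustrative):

```r
n <- nrow(data.df)
step(full.lm, direction = "backward", k = log(n))   # backward elimination, min BIC
```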
ANCOVA model and analysis
draw scatter plot in two different types of pattern
determine a model (equal slope or unequal slope) based on the plot
assess the model by anova(full model) to check whether the interaction term is significant
refit the model if needed
choose a corresponding ANCOVA model. Write the model and decide which parameters to test. Use anova(m1, m2) to analyze and draw conclusions.
Predicting values out of the range of known values
When we predict values for points outside the range of data taken it is called extrapolation. Extrapolation over too far a range can be dangerous unless it is certain that the relationship between the variables continues over the entire range.
Confusions and Answers
When to use Type I and Type III sums of squares?
After all the variables are selected (the model is decided), what kind of analysis should be chosen: anova or lm?
anova() on an lm fit uses Type I (sequential) SS, so the order of the terms matters. So, which order should we choose?
The anova() and summary() functions perform two different tests. The first is clearly an ANOVA test, while summary() simply performs a t-test. T-tests should not be used to compare the explanatory capabilities of two models, because the t-test is set up only to check whether a predictor's coefficient differs from 0.
The ANOVA test, on the other hand, aims to detect whether the difference in explanatory power between two models is significant. If it is not, we prefer the model with fewer variables, because it is more parsimonious (Occam's razor).
- If we do a regression fit with predictors x1, x2, and x1x2, and they need to be scaled, are the two operations equivalent? Operation 1: scale x1 and x2, set x3 = (scaled x1)(scaled x2), and fit the regression model. Operation 2: scale x1, x2, and x3 = x1x2 separately.