logistic regression
xgboost
Welcome to Hexo! This is your very first post. Check documentation for more info. If you get any problems when using Hexo, you can find the answer in troubleshooting or you can ask me on GitHub.
```bash
$ hexo new "My New Post"
```
More info: Writing
```bash
$ hexo server
```
More info: Server
```bash
$ hexo generate
```
More info: Generating
```bash
$ hexo deploy
```
More info: Deployment
$Y_i = \beta_0 + \beta_1X_i + \epsilon_i$
$e_i = Y_i - \hat{Y}_i$. MSE in SLR: $MSE = \dfrac{SSE}{n-2} = \dfrac{\sum_{i=1}^n e_i^2}{n-2}$
$$
\begin{aligned}
\sum{Y_i} &= nb_0 + b_1\sum{X_i}
\\
\sum{X_iY_i} &= b_0\sum{X_i} + b_1\sum{X_i^2}
\end{aligned}
$$
b0 in the form of a linear combination of Y: $b_0 = \sum_{i=1}^n\left(\frac{1}{n}-\bar{X} k_i\right)Y_i$, where $k_i = (X_i-\bar{X})/\sum_j(X_j-\bar{X})^2$
```r
predict(copier.lm, newdata = data.frame(copier_nums = 6), interval = "pred", level = 0.9)
```
when parameters are known
when parameters are unknown
Notice that the power consists of two parts, one for each tail of the rejection region. The left-hand part is typically negligible.
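Written out for the two-sided $t$ test of $\beta_1 = 0$ in SLR (standard notation; noncentrality $\delta = \beta_1/\sigma\{b_1\}$ assumed from the usual setup), the power is

$$
\text{Power} = P\{t(n-2;\delta) \leq -t(1-\alpha/2;\, n-2)\} + P\{t(n-2;\delta) \geq t(1-\alpha/2;\, n-2)\},
$$

where $t(n-2;\delta)$ is a noncentral $t$ variable. The first (left-tail) term is the very small part.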
```r
summary(copier.lm)$r.squared
```
In a bivariate normal model, the parameter $\rho_{12}$ provides information about the degree of the linear relationship between the two variables $Y_1$ and $Y_2$.
The maximum likelihood estimator of $\rho_{12}$ is the sample (Pearson) correlation coefficient $r_{12}$.
```r
cor(Y1, Y2, method = "spearman")
```
Diagnostics and Remedial Measures
$e_i = Y_i - \hat{Y_i}$
Standard Version:
$e_i$ vs. $\hat{Y}_i$
$e_i$ vs. $X_i$
$|e_i|$ or $e_i^2$ vs. $X_i$
Small departures from normality do not create any serious problems. Major departures, on the other hand, should be of concern.
```r
qqnorm(ei)
```
short-tailed: ignore if not extreme
long-tailed: 1. use another dist. 2. use resampling method
skewed: do a transformation
Null hypothesis: normal
```r
shapiro.test(resid(CH01TA01.lm))
```
```r
require(lawstat)
```
Note that the test statistic from `BF.htest` is an F statistic, but what we want is a t statistic, so take `t = sqrt(BF.htest$statistic)`.
A formal test for determining whether a specific type of regression function adequately fits the data.
c: number of distinct levels of X (replicate groups)
For observations Y given X:
Independent
Normally distributed
the distributions of Y have the same variance $\sigma^2$
Requires replicates.
```r
reduced.model.lm <- lm(Y ~ X)
```
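A minimal sketch of the companion full model and the lack-of-fit F test, assuming vectors `X` and `Y` with replicated X values:

```r
# Full model: one mean per distinct level of X (requires replicates)
full.model.lm <- lm(Y ~ factor(X))
# Lack-of-fit F test: reduced (linear) vs. full (cell-means) model
anova(reduced.model.lm, full.model.lm)
```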
The model:
First, standardize Y:
Use maximum likelihood to estimate $\lambda$ and rest of the parameters.
Code:
```r
require(MASS)
```
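A minimal Box-Cox sketch using `MASS::boxcox`, assuming a fitted SLR model on `Y` and `X`:

```r
library(MASS)
bc <- boxcox(lm(Y ~ X), lambda = seq(-2, 2, by = 0.1))  # profile log-likelihood over lambda
lambda.hat <- bc$x[which.max(bc$y)]                      # lambda maximizing the likelihood on the grid
```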
type I error $\alpha = P(A)$
Bonferroni-adjusted critical point
$$
W^2 = p\,F(1-\alpha;\ p,\ n-p)
$$
So, use the WHS value if W ≤ B, and use Bonferroni if B < W.
Estimate of b1:
Confidence Interval of b1
Code:
```r
lm(Y ~ X - 1)
```
all ones:
MSE = SSE/df
```r
solve(X)  # invert a matrix X -> X^(-1)
```
Be careful when using X in matrix form: it is the design matrix, which includes an extra column of ones.
Matrix form:
There are p-1 variables.
Notice that by adding a new $X_k$ to the model, SSE cannot increase. Thus we can drive $R^2 \to 1$ simply by pushing $p \to n$.
An adjusted R2 compensates by replacing the SS terms with MS terms:
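Written out (standard notation: $n$ observations, $p$ regression parameters):

$$
R_a^2 = 1 - \frac{SSE/(n-p)}{SSTO/(n-1)} = 1 - \left(\frac{n-1}{n-p}\right)\frac{SSE}{SSTO}
$$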
pointwise vs. familywise
simultaneous $1-\alpha$ confidence (hyper-)band on $E\{Y\}$ over all possible vectors $X_h$, use the WHS method:
estimate: $\hat{Y}_h = X_h'b$; standard error: $s\{\hat{Y}_h\} = \sqrt{MSE \cdot X_h'(X'X)^{-1}X_h}$
CI (pointwise) :
CI (familywise S-method): ($g$: number of intervals to be predicted)
(Bonferroni method)
```r
require(lawstat)
```
Notice that the ordering is based on $\hat{Y}$.
```r
CH07TA01.aov = anova(lm(Y ~ X1 + X2 + X3))
CH07TA01.aov[2, 2] / anova(lm(Y ~ X1))[2, 2]
```
Large correlation between X1 and X2 indicates possible multicollinearity!
centering:
In regression, it is often recommended to center the features so that the predictors have a mean of 0. This makes the intercept easier to interpret: it becomes the expected value of Y when the predictors are at their means. A few other aspects also argue for centering features in regression, such as reducing the collinearity between first-order and higher-order (interaction) terms.
Plot two types of data
Example:
To test whether the two levels of the categorical variable differ from each other, simply test whether the coefficient of that variable in the lm model is significant.
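A sketch of that test, with hypothetical names: `Z` is a 0/1 indicator for the two levels of the categorical variable.

```r
# Z: 0/1 indicator for the two levels (hypothetical names)
fit <- lm(Y ~ X + Z)
summary(fit)$coefficients["Z", ]  # t test on the level-difference coefficient
```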
overall residual plot
individual residual plot of each level of the factor
normality qqplot
outlier analysis via Studentized deleted residuals
influence diagnostics
make sure no observations with unusually high influence
Among a group of different possible models, each with p parameters (p–1 predictors), choose the model with the highest $R_p^2$ .
But $R^2$ cannot decrease.
In practice, we look for a flat point.
Adjusted R^2
Information Criterion (IC).
When prediction of a future $Y_i$ is a central goal, we can study the prediction error for each observation.
Let $\hat{Y}_{i(i)}$ be the value predicted at observation i after leaving $Y_i$ out of the MLR calculations. (A "leave-one-out," or LOO, predictor: a kind of cross-validation.)
If the model predicts $Y_i$ well via $\hat{Y}_{i(i)}$ – even without $Y_i$ being fit – it could be a good model.
It could be used to identify outlying Y observations.
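These LOO prediction errors are summarized by the PRESS statistic (standard notation; $h_{ii}$ is the leverage discussed below):

$$
PRESS_p = \sum_{i=1}^{n}\bigl(Y_i - \hat{Y}_{i(i)}\bigr)^2 = \sum_{i=1}^{n}\left(\frac{e_i}{1-h_{ii}}\right)^2
$$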
where p is the number of regression parameters in the regression function including the intercept term.
In addition, it can be shown that $h_{ii}$ is a measure of the distance between the X values for the ith case and the means of the X values for all n cases. The diagonal element $h_{ii}$ in this context is called the leverage (in terms of the X values) of the ith case. Thus, a large value of $h_{ii}$ indicates that the ith case is distant from the center of all X observations.
Hence, leverage values greater than $2p/n$ are considered by this rule to indicate outlying cases with regard to their X values. Another suggested guideline is that $h_{ii}$ values exceeding 0.5 indicate very high leverage, whereas those between 0.2 and 0.5 indicate moderate leverage. Additional evidence of an outlying case is the existence of a gap between the leverage values for most of the cases and the unusually large leverage value(s).
Draw a plot:
After identifying cases that are outlying with respect to their Y values and/or their X values, the next step is to ascertain whether or not these outlying cases are influential. We shall consider a case to be influential if its exclusion causes major changes in the fitted regression function.
The letters DF stand for the difference between the fitted value $\hat{Y}_i$ for the ith case when all n cases are used in fitting the regression function and the predicted value $\hat{Y}_{i(i)}$ for the ith case obtained when the ith case is omitted in fitting the regression function.
$t_i$ is the studentized deleted residual.
In contrast to the DFFITS measure, which considers the influence of the ith case on the fitted value $\hat{Y_i}$ for this case, Cook’s distance measure considers the influence of the ith case on all n fitted values.
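In formulas (standard definitions; $t_i$ and $h_{ii}$ as above):

$$
(DFFITS)_i = t_i\left(\frac{h_{ii}}{1-h_{ii}}\right)^{1/2},
\qquad
D_i = \frac{e_i^2}{p\,MSE}\cdot\frac{h_{ii}}{(1-h_{ii})^2}
$$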
plotting codes:
```r
par(mfrow = c(1, 2))
```
A measure of the influence of the ith case on each regression coefficient $b_k$ (k = 0, 1, . . . , p −1) is the difference between the estimated regression coefficient $b_k$ based on all n cases and the regression coefficient obtained when the ith case is omitted, to be denoted by $b_{k(i)}$ .
usage
These factors measure how much the variances of the estimated regression coefficients are inflated as compared to when the predictor variables are not linearly related.
Note that the Y here in the code is not involved in the calculation; VIFs can be computed without knowing Y.
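Explicitly, with $R_k^2$ the coefficient of determination from regressing $X_k$ on the other $p-2$ predictors:

$$
VIF_k = \frac{1}{1 - R_k^2}, \qquad k = 1, \dots, p-1
$$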
By using WLS, we change the MLR model by replacing the constant variance with heterogeneous variances, i.e., assuming every point has a different variance.
Use $s_j^2$ to estimate $\sigma_j^2$ if there are replications
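A WLS sketch under that replicate-based approach (hypothetical names: `grp` indexes the replicate groups):

```r
# Weights inversely proportional to estimated group variances (hypothetical names)
s2 <- tapply(Y, grp, var)         # replicate-group sample variances s_j^2
w  <- 1 / s2[grp]                 # w_i = 1 / s_{j(i)}^2
wls.lm <- lm(Y ~ X, weights = w)
```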
center Y
standardize the predictors
Use of the correlation transformation helps with controlling roundoff errors and, by expressing the regression coefficients in the same units, may be of help when these coefficients are compared.
The correlation transformation is a simple modification of the usual standardization of a variable.
The predicted value: $\hat{Y} = X b^R$
Choose constant parameter c.
Code for choosing c:
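A sketch with `MASS::lm.ridge` (predictor names assumed; `lm.ridge` calls the constant `lambda`):

```r
library(MASS)
# Ridge trace over a grid of biasing constants c
ridge.fits <- lm.ridge(Y ~ X1 + X2, lambda = seq(0, 1, by = 0.01))
plot(ridge.fits)     # ridge trace: choose c where the coefficients stabilize
select(ridge.fits)   # HKB, L-W, and GCV suggestions for c
```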
$2b_1 - b_1^*(1-\alpha/2) \leq \beta_1 \leq 2b_1 - b_1^*(\alpha/2)$
```r
CH06FI05.df = data.frame(Y_SALES, X1_TARGPOP, X2_DISPINC)
cor(CH06FI05.df)
```
```r
qqnorm(resid(data.lm))
```
```r
confint(data.lm)
```
```r
g = length(coef(CH06FI05.lm)) - 1
```
```r
predict.lm(CH06FI05.lm, newdata = newdata.df, interval = 'confidence', level = 1 - alpha)
```
First compare Scheffé and Bonferroni critical points:
```r
Spoint = sqrt(g * qf(1 - alpha, g, CH06FI05.lm$df.residual))
```
```r
predict.lm(CH06FI05.lm, newdata = newdata.df, interval = 'prediction')
```
```r
predict.lm(CH06FI05.lm, newdata = newdata.df, interval = 'prediction', level = 1 - alpha/g)
```
```r
hist(X, prob = TRUE)
```
```r
hii <- hatvalues(data.lm)
```
```r
scatter.smooth(Y ~ X, span = q, degree = 1, family = "symmetric")
```
```r
plot(conc ~ temp, xlim = c(-4, 5), ylim = c(0, 150), xlab = "x", ylab = "y", main = "")
```
```r
plot(Y ~ X, pch = 1 + (match(State, c("Iowa", "Nebraska")) - 1) * 18)
```
plot the data (pairs plot)
center the predictors to accommodate the higher-order interaction terms
build full regression model
compare (anova) the full model to the reduced model (without interaction terms) and choose the reduced model if not significant
use anova to filter main effect terms and determine the final model
fit the final model and assess the quality of fit
i) VIFs
ii) normality and residual
iii) outlier (studentized residual plot)
iv) influence measures (identify observations that are out of the following standards)
a) hat values
b) Cook's distance
c) DFFITS
d) DFBETAS
Given a bunch of variables, choose a reduced model using backward elimination and use min BIC as selection criterion.
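A sketch of that selection, assuming the candidate variables live in a data frame `dat` with response `Y`:

```r
full.lm <- lm(Y ~ ., data = dat)  # dat: hypothetical data frame of Y and all candidate predictors
n <- nrow(dat)
# Backward elimination; k = log(n) makes step() minimize BIC instead of AIC
step(full.lm, direction = "backward", k = log(n))
```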
draw scatter plot in two different types of pattern
determine a model (equal slope or unequal slope) based on the plot
assess the model with anova(full model) to check whether the interaction term is significant
refit the model if needed
choose a corresponding ANCOVA model. Write the model and decide which parameters to test. Use anova(m1, m2) to analyze and draw conclusions.
When we predict values for points outside the range of data taken it is called extrapolation. Extrapolation over too far a range can be dangerous unless it is certain that the relationship between the variables continues over the entire range.
When to use Type 1 and type 3?
After all the variables are selected (the model is decided), which kind of analysis should be chosen: anova or lm?
anova() on an lm fit uses Type I (sequential) SS, so the order of the terms matters. Which order should we choose?
The anova() and summary() functions perform two different tests. The first is clearly an ANOVA (F) test, while summary() simply performs t-tests. T-tests should not be used to compare the explanatory capabilities of two models, because the t-test only checks whether a single predictor's coefficient differs from 0.
The ANOVA test, on the other hand, aims to detect whether the difference in explanatory power between two models is significant. If it is not, we prefer the model with fewer variables, because it is more parsimonious (Occam's razor).
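A sketch of that model-comparison F test (hypothetical model names):

```r
m.reduced <- lm(Y ~ X1)            # hypothetical reduced model
m.full    <- lm(Y ~ X1 + X2 + X3)  # hypothetical full model
anova(m.reduced, m.full)           # F test: is the extra explanatory power significant?
```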