This is a study note for using \(stat\) package for linear regression, such as: Statistical concepts and methods used in the analysis of data. Statistical models. Analysis of single sample, two sample and paired sample data. Simple and multiple linear regression including polynomial regression.. Model building. Regression with indicator variables.. The topic consist of:
Reference:
library(dplyr) # data manipulation
library(stats) # statistical calculations and random number generation.
library(caret)
This data frame contains house and sale price data for 932 homes in Sacramento CA. The original data were obtained from the website for the SpatialKey software. From their website: “The Sacramento real estate transactions file is a list of 985 real estate transactions in the Sacramento area reported over a five-day period, as reported by the Sacramento Bee.” Google was used to fill in missing/incorrect data.
Output: price
data(Sacramento)
(Sacramento <- Sacramento %>%
dplyr::select(price,everything()) %>%
as_tibble())
## # A tibble: 932 x 9
## price city zip beds baths sqft type latitude longitude
## <int> <fct> <fct> <int> <dbl> <int> <fct> <dbl> <dbl>
## 1 59222 SACRAMENTO z95838 2 1 836 Resident… 38.6 -121.
## 2 68212 SACRAMENTO z95823 3 1 1167 Resident… 38.5 -121.
## 3 68880 SACRAMENTO z95815 2 1 796 Resident… 38.6 -121.
## 4 69307 SACRAMENTO z95815 2 1 852 Resident… 38.6 -121.
## 5 81900 SACRAMENTO z95824 2 1 797 Resident… 38.5 -121.
## 6 89921 SACRAMENTO z95841 3 1 1122 Condo 38.7 -121.
## 7 90895 SACRAMENTO z95842 3 2 1104 Resident… 38.7 -121.
## 8 91002 SACRAMENTO z95820 3 1 1177 Resident… 38.5 -121.
## 9 94905 RANCHO_CORD… z95670 2 2 941 Condo 38.6 -121.
## 10 98937 RIO_LINDA z95673 3 2 1146 Resident… 38.7 -121.
## # … with 922 more rows
Data from Dr. Hans Hofmann of the University of Hamburg and stored at the UC Irvine Machine Learning Repository. These data have two classes for the credit worthiness: good or bad. There are predictors related to attributes, such as: checking account status, duration, credit history, purpose of the loan, amount of the loan, savings accounts or bonds, employment duration, Installment rate in percentage of disposable income, personal information, other debtors/guarantors, residence duration, property, age, other installment plans, housing, number of existing credits, job information, Number of people being liable to provide maintenance for, telephone, and foreign worker status. Many of these predictors are discrete and have been expanded into several 0/1 indicator variables
Output: Class
data(GermanCredit)
(GermanCredit <- GermanCredit %>%
dplyr::select(Class, Duration, Amount,
ResidenceDuration, Age, NumberExistingCredits) %>%
dplyr::select(Class, everything()) %>%
as_tibble())
## # A tibble: 1,000 x 6
## Class Duration Amount ResidenceDuration Age NumberExistingCredits
## <fct> <int> <int> <int> <int> <int>
## 1 Good 6 1169 4 67 2
## 2 Bad 48 5951 2 22 1
## 3 Good 12 2096 3 49 1
## 4 Good 42 7882 4 45 1
## 5 Bad 24 4870 4 53 2
## 6 Good 36 9055 4 35 1
## 7 Good 24 2835 4 53 1
## 8 Good 36 6948 2 35 1
## 9 Good 12 3059 4 61 1
## 10 Bad 30 5234 2 28 2
## # … with 990 more rows
Statistical inference is the process of using data analysis to deduce properties of an underlying probability distribution
# Descriptive statistics
library(psych)
describe(Sacramento)
## vars n mean sd median trimmed mad
## price 1 932 246661.58 131126.86 220000.00 230782.79 109712.40
## city* 2 932 24.57 11.87 33.00 26.05 1.48
## zip* 3 932 40.19 19.94 43.50 41.35 24.46
## beds 4 932 3.28 0.89 3.00 3.25 1.48
## baths 5 932 2.05 0.72 2.00 2.01 0.00
## sqft 6 932 1680.32 726.27 1470.00 1575.40 530.77
## type* 7 932 2.87 0.47 3.00 3.00 0.00
## latitude 8 932 38.59 0.13 38.62 38.59 0.14
## longitude 9 932 -121.36 0.14 -121.38 -121.38 0.10
## min max range skew kurtosis se
## price 30000.00 884790.00 854790.00 1.35 2.53 4295.20
## city* 1.00 37.00 36.00 -0.73 -1.17 0.39
## zip* 1.00 68.00 67.00 -0.42 -1.10 0.65
## beds 1.00 8.00 7.00 0.28 0.73 0.03
## baths 1.00 5.00 4.00 0.65 1.02 0.02
## sqft 484.00 4878.00 4394.00 1.34 1.62 23.79
## type* 1.00 3.00 2.00 -3.56 10.92 0.02
## latitude 38.24 39.02 0.78 -0.04 -0.28 0.00
## longitude -121.55 -120.60 0.95 1.91 5.31 0.00
describeBy(Sacramento$price,group=Sacramento$type)
##
## Descriptive statistics by group
## group: Condo
## vars n mean sd median trimmed mad min max range skew
## X1 1 53 148668.6 74408 127000 141056.7 54856.2 40000 360000 320000 0.98
## kurtosis se
## X1 0.37 10220.72
## --------------------------------------------------------
## group: Multi_Family
## vars n mean sd median trimmed mad min max range
## X1 1 13 224534.7 85061.71 221250 218380.4 90660.99 1e+05 416767 316767
## skew kurtosis se
## X1 0.54 -0.34 23591.87
## --------------------------------------------------------
## group: Residential
## vars n mean sd median trimmed mad min max range
## X1 1 866 252991 132049.8 224126 237092.8 109154.9 30000 884790 854790
## skew kurtosis se
## X1 1.34 2.44 4487.23
Using Historgram and boxplot, four type of quantitative variables for the distribution of a particular set of data can be understand:
par(mfrow=c(1,2))
hist(Sacramento$price)
boxplot(Sacramento$price ~ Sacramento$type)
# subset with criterion===============================================
Sacramento.sub <- Sacramento
Sacramento.sub$pval <- 2*pnorm(sort(-abs(Sacramento.sub$price)))
Sacramento.sub$fdr <- p.adjust(Sacramento.sub$pval, method='fdr')
Sacramento.sub <- subset(Sacramento.sub, fdr < 0.05)
https://www.youtube.com/channel/UCXCscRMmz4vqKzEunyVrEnQ/videos
# t test
t.test(Sacramento$price, mu = 2.38*10^5,
paired=FALSE, var.equal = FALSE, conf.level=0.95)
##
## One Sample t-test
##
## data: Sacramento$price
## t = 2.0166, df = 931, p-value = 0.04403
## alternative hypothesis: true mean is not equal to 238000
## 95 percent confidence interval:
## 238232.2 255091.0
## sample estimates:
## mean of x
## 246661.6
# wilcox_test
wilcox.test(Sacramento$price,
correct=TRUE, conf.int=TRUE)
##
## Wilcoxon signed rank test with continuity correction
##
## data: Sacramento$price
## V = 434778, p-value < 2.2e-16
## alternative hypothesis: true location is not equal to 0
## 95 percent confidence interval:
## 223250 239159
## sample estimates:
## (pseudo)median
## 231000
Sacramento.lm1 <- lm(price ~ sqft , data=Sacramento)
Sacramento.lm2 <- lm(price ~ sqft + baths, data=Sacramento)
Sacramento.lm3 <- lm(price ~ sqft + baths + beds, data=Sacramento)
Sacramento.lm4 <- lm(price ~ sqft + baths + beds + type, data=Sacramento)
# Subplot: Diagnosis fro linear model==================================
par(mfrow=c(2,3))
plot(Sacramento.lm1, which=1:6)
Sacramento$res = resid(Sacramento.lm1)
Sacramento$res.standard <- rstandard(Sacramento.lm1)
Sacramento$res.student <- rstudent(Sacramento.lm1)
Sacramento$dffit <- dffits(Sacramento.lm1)
Sacramento$cooks.d <- cooks.distance(Sacramento.lm1)
# 1. linearity (no lack of fit): Residuals versus fitted values. Fix: add quadratic term
# 2. Normality: Normal QQ. Fix: log transform, non-parametric test
# 3. homoscedasticity: Scale-Location. Fix: Log transform
# 4. Outlier: Cook's ddistance. Fix: log transform, non-parametric test, remove outlier
This section gives an example of simple linear regression
, that is, regression with only a single explanatory variable—with seven observations. The seven data points are \({y_i, x_i}\), for \(i= 1,2,3,...,7\). The simple linear regression model is:
\[ y_i = \beta_0 + \beta_1 x_i +\varepsilon_i, \]
where $ _0$ is the y-intercept and \(\beta_1\) is the slope of the regression line. This model can be represented in matrix form as:
\[ \left[\begin{array} \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \end{array}\right] = \left[\begin{array} \\1 & x_1 \\1 & x_2 \\1 & x_3 \\1 & x_4 \\1 & x_5 \\1 & x_6 \\ 1 & x_7 \end{array}\right] \left[\begin{array} \beta_0 \\ \beta_1 \end{array}\right] + \left[\begin{array} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_7 \end{array}\right] \] where the first column of ones in the design matrix allows estimation of the y-intercept while the second column contains the x-values associated with the corresponding y-values.
Sacramento.lm1 <- lm(price ~ sqft , data=Sacramento)
summary(Sacramento.lm1)
##
## Call:
## lm(formula = price ~ sqft, data = Sacramento)
##
## Residuals:
## Min 1Q Median 3Q Max
## -231889 -54717 -11822 38993 600141
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13859.393 6948.714 1.995 0.0464 *
## sqft 138.546 3.796 36.495 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 84130 on 930 degrees of freedom
## Multiple R-squared: 0.5888, Adjusted R-squared: 0.5884
## F-statistic: 1332 on 1 and 930 DF, p-value: < 2.2e-16
confint(Sacramento.lm1, level=0.95)
## 2.5 % 97.5 %
## (Intercept) 222.4162 27496.3702
## sqft 131.0962 145.9967
This section contains an example of multiple Linear regression
with two covariates (explanatory variables): w
and x
. Again suppose that the data consist of seven observations, and that for each observed value to be predicted (\(y_i\)), values \(w_i\) and \(x_i\) of the two covariates are also observed. The model to be considered is:
\[ y_i = \beta_0 + \beta_1 w_i + \beta_2 x_i + \varepsilon_i \]
This model can be written in matrix terms as:
\[ \left[\begin{array} \\y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \end{array}\right] = \left[\begin{array} \\1 & w_1 & x_1 \\1 & w_2 & x_2 \\1 & w_3 & x_3 \\1 & w_4 & x_4 \\1 & w_5 & x_5 \\1 & w_6 & x_6 \\ 1& w_7 & x_7 \end{array}\right] \left[\begin{array} \\\beta_0 \\ \beta_1 \\ \beta_2 \end{array}\right] + \left[\begin{array}\\ \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_7 \end{array}\right] \]
Here the 7×3 matrix on the right side is the design matrix.
SacramentoP.lm <- lm(price ~ sqft + beds + I(sqft*beds), data=Sacramento) # with interaction
#prediction in 3D
data.mash<-{}
data.mash$sqft <- c(0.23)
data.mash$beds <- c(55.0)
predz<-predict(SacramentoP.lm, newdata=data.mash)
# 2nd order regression model, Coefficients (numeric)
SacramentoM.lm <- lm(price ~ sqft + beds + I(sqft*beds) + I(sqft^2) + I(beds^2) , data=Sacramento)
summary(SacramentoM.lm)
##
## Call:
## lm(formula = price ~ sqft + beds + I(sqft * beds) + I(sqft^2) +
## I(beds^2), data = Sacramento)
##
## Residuals:
## Min 1Q Median 3Q Max
## -207686 -51064 -10761 35514 615519
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.637e+04 2.529e+04 -1.833 0.0671 .
## sqft 2.137e+02 2.398e+01 8.914 <2e-16 ***
## beds 1.401e+04 1.763e+04 0.794 0.4272
## I(sqft * beds) -7.936e+00 9.451e+00 -0.840 0.4013
## I(sqft^2) -3.995e-03 6.543e-03 -0.611 0.5417
## I(beds^2) -4.447e+03 4.245e+03 -1.048 0.2951
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 81730 on 926 degrees of freedom
## Multiple R-squared: 0.6136, Adjusted R-squared: 0.6115
## F-statistic: 294.1 on 5 and 926 DF, p-value: < 2.2e-16
# logistic regression==========================================================
#General linear model, logistic: family=binomial
#assume, z is either yes (1) or no (0)
GermanCredit.glm0 <- glm(Class ~ 1, data = GermanCredit, family = binomial)
GermanCredit.glm1 <- glm(Class ~ Age, data = GermanCredit, family = binomial)
GermanCredit.glm2 <- glm(Class ~ Duration + Age, data = GermanCredit, family = binomial)
GermanCredit.glm5 <- glm(Class ~ Duration + Amount + ResidenceDuration + Age + NumberExistingCredits, family = "binomial", data = GermanCredit)
#summmay of glm
#deviance residuals (a type of residual that makes sense for Yes/No or count data)
#a beds of regression coeficients with se and tests of coeficient = 0,
#two deviance values and the AIC statistic for the fitted model.
summary(GermanCredit.glm1)
##
## Call:
## glm(formula = Class ~ Age, family = binomial, data = GermanCredit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8736 -1.4600 0.8063 0.8781 0.9540
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.200919 0.232853 0.863 0.38822
## Age 0.018440 0.006434 2.866 0.00416 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1221.7 on 999 degrees of freedom
## Residual deviance: 1213.1 on 998 degrees of freedom
## AIC: 1217.1
##
## Number of Fisher Scoring iterations: 4
# Grid of potential expanatory variable
(GermanCredit.new <- expand.grid(Age = c(21:90)) %>% as_tibble())
## # A tibble: 70 x 1
## Age
## <int>
## 1 21
## 2 22
## 3 23
## 4 24
## 5 25
## 6 26
## 7 27
## 8 28
## 9 29
## 10 30
## # … with 60 more rows
# prediction
GermanCredit.new$pred <- predict(GermanCredit.glm1, newdata=GermanCredit.new, type = "terms")
GermanCredit.new$predP <- predict(GermanCredit.glm1, newdata=GermanCredit.new, type='response')
# Predicted odds & probability
par(mfrow = c(1,2),oma = c(0, 0, 2, 0))
plot(GermanCredit.new$Age, GermanCredit.new$pred, type='l', xlab='Age', ylab='Predicted odds')
plot(GermanCredit.new$Age, GermanCredit.new$predP, type='l', xlab='Age', ylab='Predicted probability')
mtext("Predicted Odds & probability", outer = TRUE, cex = 1.5)
#Testing comparisons anova(reduced model,full model)
anova(GermanCredit.glm1, GermanCredit.glm2, test='Chi')
## Analysis of Deviance Table
##
## Model 1: Class ~ Age
## Model 2: Class ~ Duration + Age
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 998 1213.1
## 2 997 1169.2 1 43.946 3.375e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
emmeans::emmeans(GermanCredit.glm2, c("Duration","Age"))
## Duration Age emmean SE df asymp.LCL asymp.UCL
## 20.9 35.5 0.892 0.0719 Inf 0.751 1.03
##
## Results are given on the logit (not the response) scale.
## Confidence level used: 0.95
# Regression diagnostics ============================================
#if the Sacramento.lm1 is the best fit
Sacramento.n <- nrow(Sacramento)
Sacramento.sse0 <- sum(resid(Sacramento.lm1) ^2)
#AIC
AIC(Sacramento.lm1, k = 2)
## [1] 23786.79
Sacramento.lm1.AIC <- Sacramento.n + Sacramento.n * log(2*pi) + Sacramento.n * log(Sacramento.sse0/Sacramento.n) + 2 * (1+1)
#BIC
BIC(Sacramento.lm1)
## [1] 23801.3
AIC(Sacramento.lm1, k=log(Sacramento.n))
## [1] 23801.3
Sacramento.lm1.BIC <- Sacramento.n + Sacramento.n * log(2*pi) + Sacramento.n * log(Sacramento.sse0/Sacramento.n) + log(Sacramento.n)*(1+1)
library(car)
#calculate vif
vif(Sacramento.lm3)
## sqft baths beds
## 2.949793 2.528712 2.185211
#plot residual and cook's D against observation
plot(Sacramento.lm3,4)
#outlier, sr>1.96
sresid <- rstudent(Sacramento.lm3)
#influent point, cook.d >1
cooks.d <- cooks.distance(Sacramento.lm3)
nobs <- length(sresid)
plot(1:nobs, sresid)
plot(1:nobs, cooks.d)
#calculate the PRESS statistic
#out-of-sample residual
pres <- resid(Sacramento.lm3)/sqrt(1-lm.influence(Sacramento.lm3)$hat)
#PRESS statistic
press <- sum(pres^2)
#rMSEP
sqrt(press/length(pres))
## [1] 82715.87
# prediction for price in 2D
Sacramento$pred <- predict(Sacramento.lm1, newdata=Sacramento)
# calibration for Time in 2D
library(chemCal)
newy <- 100 # at given newy
predx<-t(sapply(newy,function(price) inverse.predict(Sacramento.lm1, price)))
Sacramento.pred<-{}
Sacramento.pred$sqft <- seq(min(Sacramento$sqft), max(Sacramento$sqft), length.out=100)
Sacramento.pred$preds <- predict(Sacramento.lm1,
newdata = data.frame(sqft=Sacramento.pred$sqft),
se.fit=T, interval = 'confidence')
str(Sacramento.pred)
## List of 2
## $ sqft : num [1:100] 484 528 573 617 662 ...
## $ preds:List of 4
## ..$ fit : num [1:100, 1:3] 80916 87065 93214 99364 105513 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:100] "1" "2" "3" "4" ...
## .. .. ..$ : chr [1:3] "fit" "lwr" "upr"
## ..$ se.fit : Named num [1:100] 5312 5169 5027 4887 4749 ...
## .. ..- attr(*, "names")= chr [1:100] "1" "2" "3" "4" ...
## ..$ df : int 930
## ..$ residual.scale: num 84126
# plot original data point
with(Sacramento, plot(sqft, price, pch=19, col=4),type='p')
# plot bandwidth
polygon(c(rev(Sacramento.pred$sqft), Sacramento.pred$sqft),
c(rev(Sacramento.pred[["preds"]][["fit"]][ ,3]), Sacramento.pred[["preds"]][["fit"]][ ,2]),
col=rgb(0.8,0.8,0.8,0.5),border = NA)
# plot mean prediction, upper and lower bound line
lines(Sacramento.pred$sqft, Sacramento.pred[["preds"]][["fit"]][,1])
lines(Sacramento.pred$sqft, Sacramento.pred[["preds"]][["fit"]][,2], lty=2)
lines(Sacramento.pred$sqft, Sacramento.pred[["preds"]][["fit"]][,3], lty=2)
legend('topright', bty='n', lty=c(1,2), legend=c('fitted line','95% confidence'))
#histogram
hist(Sacramento$price)
#boxplot
boxplot(price~type, data=Sacramento, main='', ylab='')
# transformation=====================================================
Sacramento$logy<-log(Sacramento$price)
# odd ratio===========================================================
library(epitools)
oddsratio(counts, method='wald')
# Covariance matrix====================================================
#pre view of collinearity for model selection
cor(Sacramento[,1:2]) > 0.8
cor(cbind(Sacramento$price, Sacramento$Time, Sacramento$predy))
#pair all if only numeric
pairs(Sacramento)
#Omit a column(or columns) by number
pairs(Sacramento[,c(-4,-5)])
#pair from column 1 to 2
pairs(Sacramento[,1:2],
main='Scatterplot matrix')
#Select desired columns by name.
pairs(Sacramento[,c('price','Time','predy')],
main='Scatterplot matrix')
#Assemble a matrix with the desired columns.
pairs(cbind(Sacramento$price, Sacramento$Time, Sacramento$predy),
labels=c('yname','xname','pred price'),
main='Scatterplot matrix')
# Contengency beds================================================
#A.re-range the count to matrix (fake code)
counts <- matrix(Sacramento$count, nrow=2, byrow=T)
#B. convert long format to contengency beds
sub.Sacramento <- Sacramento[,c(6,7)]
#split 1 column to matrix by trt
counts <- with(sub.Sacramento,beds(type,color))
#give row and col names
dimnames(counts) <- list(c('Yes','No'), c('a', 'b','c') )
#chi-squred test
chisq.test(counts, correct=F)
#Odd ratio
library(epitools)
oddsratio(counts, method='wald')
counts2 <- counts[c(2,1),]
oddsratio(counts2, method='wald')
#test of H0: row TRT have same odd, All the expected counts are larger than 1
chisq.test(counts, simulate.p.value=T)