Use of this document

This is a study note for using \(stat\) package for linear regression, such as: Statistical concepts and methods used in the analysis of data. Statistical models. Analysis of single sample, two sample and paired sample data. Simple and multiple linear regression including polynomial regression.. Model building. Regression with indicator variables.. The topic consist of:

Reference:

Prerequisites

library(dplyr) # data manipulation
library(stats) # statistical calculations and random number generation.

Dataset

library(caret)

Sacramento House price Data

This data frame contains house and sale price data for 932 homes in Sacramento CA. The original data were obtained from the website for the SpatialKey software. From their website: “The Sacramento real estate transactions file is a list of 985 real estate transactions in the Sacramento area reported over a five-day period, as reported by the Sacramento Bee.” Google was used to fill in missing/incorrect data.

Output: price

data(Sacramento)
(Sacramento <- Sacramento %>% 
    dplyr::select(price,everything()) %>% 
    as_tibble())
## # A tibble: 932 x 9
##    price city         zip     beds baths  sqft type      latitude longitude
##    <int> <fct>        <fct>  <int> <dbl> <int> <fct>        <dbl>     <dbl>
##  1 59222 SACRAMENTO   z95838     2     1   836 Resident…     38.6     -121.
##  2 68212 SACRAMENTO   z95823     3     1  1167 Resident…     38.5     -121.
##  3 68880 SACRAMENTO   z95815     2     1   796 Resident…     38.6     -121.
##  4 69307 SACRAMENTO   z95815     2     1   852 Resident…     38.6     -121.
##  5 81900 SACRAMENTO   z95824     2     1   797 Resident…     38.5     -121.
##  6 89921 SACRAMENTO   z95841     3     1  1122 Condo         38.7     -121.
##  7 90895 SACRAMENTO   z95842     3     2  1104 Resident…     38.7     -121.
##  8 91002 SACRAMENTO   z95820     3     1  1177 Resident…     38.5     -121.
##  9 94905 RANCHO_CORD… z95670     2     2   941 Condo         38.6     -121.
## 10 98937 RIO_LINDA    z95673     3     2  1146 Resident…     38.7     -121.
## # … with 922 more rows

German Credit Data

Data from Dr. Hans Hofmann of the University of Hamburg and stored at the UC Irvine Machine Learning Repository. These data have two classes for the credit worthiness: good or bad. There are predictors related to attributes, such as: checking account status, duration, credit history, purpose of the loan, amount of the loan, savings accounts or bonds, employment duration, Installment rate in percentage of disposable income, personal information, other debtors/guarantors, residence duration, property, age, other installment plans, housing, number of existing credits, job information, Number of people being liable to provide maintenance for, telephone, and foreign worker status. Many of these predictors are discrete and have been expanded into several 0/1 indicator variables

Output: Class

data(GermanCredit)
(GermanCredit <- GermanCredit %>% 
    dplyr::select(Class, Duration, Amount, 
                  ResidenceDuration, Age, NumberExistingCredits) %>% 
    dplyr::select(Class, everything()) %>% 
    as_tibble())
## # A tibble: 1,000 x 6
##    Class Duration Amount ResidenceDuration   Age NumberExistingCredits
##    <fct>    <int>  <int>             <int> <int>                 <int>
##  1 Good         6   1169                 4    67                     2
##  2 Bad         48   5951                 2    22                     1
##  3 Good        12   2096                 3    49                     1
##  4 Good        42   7882                 4    45                     1
##  5 Bad         24   4870                 4    53                     2
##  6 Good        36   9055                 4    35                     1
##  7 Good        24   2835                 4    53                     1
##  8 Good        36   6948                 2    35                     1
##  9 Good        12   3059                 4    61                     1
## 10 Bad         30   5234                 2    28                     2
## # … with 990 more rows

1. Inferential statistical analysis

Statistical inference is the process of using data analysis to deduce properties of an underlying probability distribution

# Descriptive statistics
library(psych)
describe(Sacramento)
##           vars   n      mean        sd    median   trimmed       mad
## price        1 932 246661.58 131126.86 220000.00 230782.79 109712.40
## city*        2 932     24.57     11.87     33.00     26.05      1.48
## zip*         3 932     40.19     19.94     43.50     41.35     24.46
## beds         4 932      3.28      0.89      3.00      3.25      1.48
## baths        5 932      2.05      0.72      2.00      2.01      0.00
## sqft         6 932   1680.32    726.27   1470.00   1575.40    530.77
## type*        7 932      2.87      0.47      3.00      3.00      0.00
## latitude     8 932     38.59      0.13     38.62     38.59      0.14
## longitude    9 932   -121.36      0.14   -121.38   -121.38      0.10
##                min       max     range  skew kurtosis      se
## price     30000.00 884790.00 854790.00  1.35     2.53 4295.20
## city*         1.00     37.00     36.00 -0.73    -1.17    0.39
## zip*          1.00     68.00     67.00 -0.42    -1.10    0.65
## beds          1.00      8.00      7.00  0.28     0.73    0.03
## baths         1.00      5.00      4.00  0.65     1.02    0.02
## sqft        484.00   4878.00   4394.00  1.34     1.62   23.79
## type*         1.00      3.00      2.00 -3.56    10.92    0.02
## latitude     38.24     39.02      0.78 -0.04    -0.28    0.00
## longitude  -121.55   -120.60      0.95  1.91     5.31    0.00
describeBy(Sacramento$price,group=Sacramento$type)
## 
##  Descriptive statistics by group 
## group: Condo
##    vars  n     mean    sd median  trimmed     mad   min    max  range skew
## X1    1 53 148668.6 74408 127000 141056.7 54856.2 40000 360000 320000 0.98
##    kurtosis       se
## X1     0.37 10220.72
## -------------------------------------------------------- 
## group: Multi_Family
##    vars  n     mean       sd median  trimmed      mad   min    max  range
## X1    1 13 224534.7 85061.71 221250 218380.4 90660.99 1e+05 416767 316767
##    skew kurtosis       se
## X1 0.54    -0.34 23591.87
## -------------------------------------------------------- 
## group: Residential
##    vars   n   mean       sd median  trimmed      mad   min    max  range
## X1    1 866 252991 132049.8 224126 237092.8 109154.9 30000 884790 854790
##    skew kurtosis      se
## X1 1.34     2.44 4487.23

1.1 Distribution

Using Historgram and boxplot, four type of quantitative variables for the distribution of a particular set of data can be understand:

  • Shape: Distribution
  • Center: mean or median or mode
  • Spread: variance, quantile
  • Outliers
par(mfrow=c(1,2))
hist(Sacramento$price)
boxplot(Sacramento$price ~ Sacramento$type)

# subset with criterion===============================================
Sacramento.sub <- Sacramento
Sacramento.sub$pval <- 2*pnorm(sort(-abs(Sacramento.sub$price)))
Sacramento.sub$fdr <- p.adjust(Sacramento.sub$pval, method='fdr')
Sacramento.sub <- subset(Sacramento.sub, fdr < 0.05)

1.2 Statistical Test

https://www.youtube.com/channel/UCXCscRMmz4vqKzEunyVrEnQ/videos

Common Statisical Test

Common Statisical Test

# t test
t.test(Sacramento$price, mu = 2.38*10^5,
       paired=FALSE, var.equal = FALSE, conf.level=0.95)
## 
##  One Sample t-test
## 
## data:  Sacramento$price
## t = 2.0166, df = 931, p-value = 0.04403
## alternative hypothesis: true mean is not equal to 238000
## 95 percent confidence interval:
##  238232.2 255091.0
## sample estimates:
## mean of x 
##  246661.6
# wilcox_test
wilcox.test(Sacramento$price,
            correct=TRUE, conf.int=TRUE)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  Sacramento$price
## V = 434778, p-value < 2.2e-16
## alternative hypothesis: true location is not equal to 0
## 95 percent confidence interval:
##  223250 239159
## sample estimates:
## (pseudo)median 
##         231000

2. Regression Analysis

2.1 Least Squares Regression Line

Linear regression explained by linear algebra

Linear regression explained by linear algebra

2.2 Linear Regression

2.2.1 Analysis of Residuals: Assumption of Linear Regression

Sacramento.lm1 <- lm(price ~ sqft , data=Sacramento)
Sacramento.lm2 <- lm(price ~ sqft + baths, data=Sacramento)
Sacramento.lm3 <- lm(price ~ sqft + baths + beds, data=Sacramento)
Sacramento.lm4 <- lm(price ~ sqft + baths + beds + type, data=Sacramento)
# Subplot: Diagnosis fro linear model==================================
par(mfrow=c(2,3))
plot(Sacramento.lm1, which=1:6)

Sacramento$res = resid(Sacramento.lm1)
Sacramento$res.standard <- rstandard(Sacramento.lm1)
Sacramento$res.student <- rstudent(Sacramento.lm1)
Sacramento$dffit <- dffits(Sacramento.lm1)
Sacramento$cooks.d <- cooks.distance(Sacramento.lm1)
# 1. linearity (no lack of fit): Residuals versus fitted values. Fix: add quadratic term 
# 2. Normality: Normal QQ. Fix: log transform, non-parametric test
# 3. homoscedasticity: Scale-Location. Fix: Log transform
# 4. Outlier: Cook's ddistance. Fix: log transform, non-parametric test, remove outlier

2.2.2 Simple Linear Model

This section gives an example of simple linear regression, that is, regression with only a single explanatory variable—with seven observations. The seven data points are \({y_i, x_i}\), for \(i= 1,2,3,...,7\). The simple linear regression model is:

\[ y_i = \beta_0 + \beta_1 x_i +\varepsilon_i, \]

where $ _0$ is the y-intercept and \(\beta_1\) is the slope of the regression line. This model can be represented in matrix form as:

\[ \left[\begin{array} \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \end{array}\right] = \left[\begin{array} \\1 & x_1 \\1 & x_2 \\1 & x_3 \\1 & x_4 \\1 & x_5 \\1 & x_6 \\ 1 & x_7 \end{array}\right] \left[\begin{array} \beta_0 \\ \beta_1 \end{array}\right] + \left[\begin{array} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_7 \end{array}\right] \] where the first column of ones in the design matrix allows estimation of the y-intercept while the second column contains the x-values associated with the corresponding y-values.

Sacramento.lm1 <- lm(price ~ sqft , data=Sacramento)
summary(Sacramento.lm1)
## 
## Call:
## lm(formula = price ~ sqft, data = Sacramento)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -231889  -54717  -11822   38993  600141 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13859.393   6948.714   1.995   0.0464 *  
## sqft          138.546      3.796  36.495   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 84130 on 930 degrees of freedom
## Multiple R-squared:  0.5888, Adjusted R-squared:  0.5884 
## F-statistic:  1332 on 1 and 930 DF,  p-value: < 2.2e-16
confint(Sacramento.lm1, level=0.95)
##                2.5 %     97.5 %
## (Intercept) 222.4162 27496.3702
## sqft        131.0962   145.9967

2.2.3 Multiple Linear Model

This section contains an example of multiple Linear regression with two covariates (explanatory variables): w and x. Again suppose that the data consist of seven observations, and that for each observed value to be predicted (\(y_i\)), values \(w_i\) and \(x_i\) of the two covariates are also observed. The model to be considered is:

\[ y_i = \beta_0 + \beta_1 w_i + \beta_2 x_i + \varepsilon_i \]

This model can be written in matrix terms as:

\[ \left[\begin{array} \\y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \end{array}\right] = \left[\begin{array} \\1 & w_1 & x_1 \\1 & w_2 & x_2 \\1 & w_3 & x_3 \\1 & w_4 & x_4 \\1 & w_5 & x_5 \\1 & w_6 & x_6 \\ 1& w_7 & x_7 \end{array}\right] \left[\begin{array} \\\beta_0 \\ \beta_1 \\ \beta_2 \end{array}\right] + \left[\begin{array}\\ \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_7 \end{array}\right] \]

Here the 7×3 matrix on the right side is the design matrix.

SacramentoP.lm <- lm(price ~ sqft + beds + I(sqft*beds), data=Sacramento) # with interaction
#prediction  in 3D
data.mash<-{}
data.mash$sqft <- c(0.23)
data.mash$beds <- c(55.0)
predz<-predict(SacramentoP.lm, newdata=data.mash)
# 2nd order regression model, Coefficients (numeric)
SacramentoM.lm <- lm(price ~ sqft + beds + I(sqft*beds) + I(sqft^2) + I(beds^2) , data=Sacramento)
summary(SacramentoM.lm)
## 
## Call:
## lm(formula = price ~ sqft + beds + I(sqft * beds) + I(sqft^2) + 
##     I(beds^2), data = Sacramento)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -207686  -51064  -10761   35514  615519 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -4.637e+04  2.529e+04  -1.833   0.0671 .  
## sqft            2.137e+02  2.398e+01   8.914   <2e-16 ***
## beds            1.401e+04  1.763e+04   0.794   0.4272    
## I(sqft * beds) -7.936e+00  9.451e+00  -0.840   0.4013    
## I(sqft^2)      -3.995e-03  6.543e-03  -0.611   0.5417    
## I(beds^2)      -4.447e+03  4.245e+03  -1.048   0.2951    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 81730 on 926 degrees of freedom
## Multiple R-squared:  0.6136, Adjusted R-squared:  0.6115 
## F-statistic: 294.1 on 5 and 926 DF,  p-value: < 2.2e-16

2.3 Logistic Regression

# logistic regression==========================================================
#General linear model, logistic: family=binomial
#assume, z is either yes (1) or no (0)
GermanCredit.glm0 <- glm(Class ~ 1, data = GermanCredit, family = binomial)
GermanCredit.glm1 <- glm(Class ~ Age, data = GermanCredit, family = binomial)
GermanCredit.glm2 <- glm(Class ~ Duration + Age, data = GermanCredit, family = binomial)
GermanCredit.glm5 <- glm(Class ~ Duration + Amount + ResidenceDuration + Age + NumberExistingCredits, family = "binomial", data = GermanCredit)
#summmay of glm
#deviance residuals (a type of residual that makes sense for Yes/No or count data)
#a beds of regression coeficients with se and tests of coeficient = 0, 
#two deviance values and the AIC statistic for the fitted model.
summary(GermanCredit.glm1)
## 
## Call:
## glm(formula = Class ~ Age, family = binomial, data = GermanCredit)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8736  -1.4600   0.8063   0.8781   0.9540  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) 0.200919   0.232853   0.863  0.38822   
## Age         0.018440   0.006434   2.866  0.00416 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1221.7  on 999  degrees of freedom
## Residual deviance: 1213.1  on 998  degrees of freedom
## AIC: 1217.1
## 
## Number of Fisher Scoring iterations: 4
# Grid of potential expanatory variable
(GermanCredit.new <- expand.grid(Age = c(21:90)) %>% as_tibble())
## # A tibble: 70 x 1
##      Age
##    <int>
##  1    21
##  2    22
##  3    23
##  4    24
##  5    25
##  6    26
##  7    27
##  8    28
##  9    29
## 10    30
## # … with 60 more rows
# prediction
GermanCredit.new$pred <- predict(GermanCredit.glm1, newdata=GermanCredit.new, type = "terms")
GermanCredit.new$predP <- predict(GermanCredit.glm1, newdata=GermanCredit.new, type='response')
# Predicted odds & probability
par(mfrow = c(1,2),oma = c(0, 0, 2, 0))
plot(GermanCredit.new$Age, GermanCredit.new$pred, type='l', xlab='Age', ylab='Predicted odds')
plot(GermanCredit.new$Age, GermanCredit.new$predP, type='l', xlab='Age', ylab='Predicted probability')
mtext("Predicted Odds & probability", outer = TRUE, cex = 1.5)

#Testing comparisons anova(reduced model,full model)
anova(GermanCredit.glm1, GermanCredit.glm2, test='Chi')
## Analysis of Deviance Table
## 
## Model 1: Class ~ Age
## Model 2: Class ~ Duration + Age
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       998     1213.1                          
## 2       997     1169.2  1   43.946 3.375e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
emmeans::emmeans(GermanCredit.glm2, c("Duration","Age"))
##  Duration  Age emmean     SE  df asymp.LCL asymp.UCL
##      20.9 35.5  0.892 0.0719 Inf     0.751      1.03
## 
## Results are given on the logit (not the response) scale. 
## Confidence level used: 0.95

2.4 Diagnosis

Correlation Coefficient to Coefficient of Determinant

# Regression diagnostics ============================================
#if the Sacramento.lm1 is the best fit
Sacramento.n <- nrow(Sacramento)
Sacramento.sse0 <- sum(resid(Sacramento.lm1) ^2)
#AIC
AIC(Sacramento.lm1, k = 2)
## [1] 23786.79
Sacramento.lm1.AIC <- Sacramento.n + Sacramento.n * log(2*pi) + Sacramento.n * log(Sacramento.sse0/Sacramento.n) + 2 * (1+1)
#BIC
BIC(Sacramento.lm1)
## [1] 23801.3
AIC(Sacramento.lm1, k=log(Sacramento.n))
## [1] 23801.3
Sacramento.lm1.BIC <- Sacramento.n + Sacramento.n * log(2*pi) + Sacramento.n * log(Sacramento.sse0/Sacramento.n) + log(Sacramento.n)*(1+1)
library(car)
#calculate vif
vif(Sacramento.lm3)
##     sqft    baths     beds 
## 2.949793 2.528712 2.185211
#plot residual and cook's D against observation
plot(Sacramento.lm3,4)

#outlier, sr>1.96
sresid <- rstudent(Sacramento.lm3)
#influent point, cook.d >1
cooks.d <- cooks.distance(Sacramento.lm3)
nobs <- length(sresid)
plot(1:nobs, sresid)

plot(1:nobs, cooks.d)

#calculate the PRESS statistic
#out-of-sample residual
pres <- resid(Sacramento.lm3)/sqrt(1-lm.influence(Sacramento.lm3)$hat)
#PRESS statistic
press <- sum(pres^2)
#rMSEP
sqrt(press/length(pres))
## [1] 82715.87

2.5 Prediction

# prediction for price in 2D
Sacramento$pred <- predict(Sacramento.lm1, newdata=Sacramento)
# calibration for Time in 2D
library(chemCal)
newy <- 100 # at given newy
predx<-t(sapply(newy,function(price) inverse.predict(Sacramento.lm1, price)))
Sacramento.pred<-{}
Sacramento.pred$sqft <- seq(min(Sacramento$sqft), max(Sacramento$sqft), length.out=100)
Sacramento.pred$preds <- predict(Sacramento.lm1, 
                               newdata = data.frame(sqft=Sacramento.pred$sqft),
                               se.fit=T, interval = 'confidence')
str(Sacramento.pred)
## List of 2
##  $ sqft : num [1:100] 484 528 573 617 662 ...
##  $ preds:List of 4
##   ..$ fit           : num [1:100, 1:3] 80916 87065 93214 99364 105513 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:100] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:3] "fit" "lwr" "upr"
##   ..$ se.fit        : Named num [1:100] 5312 5169 5027 4887 4749 ...
##   .. ..- attr(*, "names")= chr [1:100] "1" "2" "3" "4" ...
##   ..$ df            : int 930
##   ..$ residual.scale: num 84126
# plot original data point
with(Sacramento, plot(sqft, price, pch=19, col=4),type='p')
# plot bandwidth
polygon(c(rev(Sacramento.pred$sqft), Sacramento.pred$sqft), 
        c(rev(Sacramento.pred[["preds"]][["fit"]][ ,3]), Sacramento.pred[["preds"]][["fit"]][ ,2]),
        col=rgb(0.8,0.8,0.8,0.5),border = NA)
# plot mean prediction, upper and lower bound line
lines(Sacramento.pred$sqft, Sacramento.pred[["preds"]][["fit"]][,1])
lines(Sacramento.pred$sqft, Sacramento.pred[["preds"]][["fit"]][,2], lty=2)
lines(Sacramento.pred$sqft, Sacramento.pred[["preds"]][["fit"]][,3], lty=2)
legend('topright', bty='n', lty=c(1,2), legend=c('fitted line','95% confidence'))

Unsort

#histogram
hist(Sacramento$price)
#boxplot
boxplot(price~type, data=Sacramento, main='', ylab='')

# transformation=====================================================
Sacramento$logy<-log(Sacramento$price)

# odd ratio===========================================================
library(epitools)
oddsratio(counts, method='wald')

# Covariance matrix====================================================
#pre  view of collinearity for model selection
cor(Sacramento[,1:2]) > 0.8
cor(cbind(Sacramento$price, Sacramento$Time, Sacramento$predy))
#pair all if only numeric
pairs(Sacramento)
#Omit a column(or columns) by number
pairs(Sacramento[,c(-4,-5)])
#pair from column 1 to 2 
pairs(Sacramento[,1:2], 
      main='Scatterplot matrix')
#Select desired columns by name.
pairs(Sacramento[,c('price','Time','predy')], 
      main='Scatterplot matrix')
#Assemble a matrix with the desired columns.
pairs(cbind(Sacramento$price, Sacramento$Time, Sacramento$predy), 
      labels=c('yname','xname','pred price'),
      main='Scatterplot matrix')


# Contengency beds================================================
#A.re-range the count to matrix (fake code)
counts <- matrix(Sacramento$count, nrow=2, byrow=T)
#B. convert long format to contengency beds
sub.Sacramento <- Sacramento[,c(6,7)]
#split 1 column to matrix by trt
counts <- with(sub.Sacramento,beds(type,color))
#give row and col names
dimnames(counts) <- list(c('Yes','No'), c('a', 'b','c') )
#chi-squred test
chisq.test(counts, correct=F)
#Odd ratio
library(epitools)
oddsratio(counts, method='wald')
counts2 <- counts[c(2,1),]
oddsratio(counts2, method='wald')
#test of H0: row TRT have same odd, All the expected counts are larger than 1
chisq.test(counts, simulate.p.value=T)