attach(mtcars)

Use of this document

Once an ANOVA has given a statistically significant result (though this is not strictly necessary, see 4.2.1), we want to check which specific pairs of means differ from each other, from the grand mean, or from a control. There are two main objectives of doing multiple comparisons:

1 Three Assumptions

Multiple comparison procedures rely on the same three assumptions as the ANOVA itself:

1.1 Normality of the dependent variable's distribution

# Shapiro-Wilk test of normality; H0: the sample came from a normally distributed population
# Test univariate normality
shapiro.test(mtcars$mpg)
# Test multivariate normality (mshapiro.test expects variables in rows, hence the transpose)
mvnormtest::mshapiro.test(t(mtcars[, 3:5]))
# Significant departures from the line suggest violations of normality.
qqnorm(mtcars$mpg)
qqline(mtcars$mpg)

1.2 Homogeneity of variance

The variance in each cell should be similar. Check this with Levene's test or another homogeneity-of-variance test; these are generally produced as part of the ANOVA output. Sample size: more than 20 observations per cell is preferred, which improves robustness to violations of the first two assumptions, and a larger sample size also increases power.

# Levene's test for homogeneity of variance; H0: equal variances
# less sensitive than the Bartlett test to departures from normality
data(Sacramento, package = "caret")  # Sacramento housing data used in these examples
car::leveneTest(price ~ type, data = Sacramento)
# Bartlett test of homogeneity of variances; H0: equal variances
bartlett.test(price ~ type, data = Sacramento)
# Fligner-Killeen test of homogeneity of variances; H0: equal variances
fligner.test(price ~ type, data = Sacramento)
# Homogeneity of Variance Plot
library(HH)
HH::hovPlot(price ~ type, data=Sacramento)

1.3 Independent observations

Scores on one variable or for one group should not be dependent on another variable or group (usually guaranteed by the design of the study)

2 Threshold for a positive test

2.1 Confusion matrix

The following generalizes the confusion matrix from a single hypothesis test to multiple hypothesis tests.

Classification of a single hypothesis test

| | \(H_A\) is true | \(H_0\) is true |
|---|---|---|
| Reject \(H_0\) | Correct inference (TP), probability \(1-\beta\) | Type I error (FP), probability \(\alpha\) |
| Fail to reject \(H_0\) | Type II error (FN), probability \(\beta\) | Correct inference (TN), probability \(1-\alpha\) |

Confusion matrix of multiple hypothesis tests

| | \(H_A\) is true | \(H_0\) is true | Total | \(\frac{H_A \text{ is true}}{\text{Total}}\) | \(\frac{H_0 \text{ is true}}{\text{Total}}\) |
|---|---|---|---|---|---|
| Reject \(H_0\) | \(TP\) | \(FP\) | \(TP+FP\) | Positive predictive value (Precision), \(PPV = \frac{TP}{TP+FP}\) | False Discovery Rate, \(FDR = \frac{FP}{TP+FP}\) |
| Fail to reject \(H_0\) | \(FN\) | \(TN\) | \(FN+TN\) | False omission rate, \(FOR = \frac{FN}{FN+TN}\) | Negative predictive value, \(NPV = \frac{TN}{FN+TN}\) |
| Total | \(TP+FN\) | \(FP+TN\) | \(TP+FP+FN+TN\) | Prevalence, \(\frac{TP+FN}{TP+FP+FN+TN}\) | |
| \(\frac{\text{Reject } H_0}{\text{Total}}\) | True Positive Rate, \(TPR = \frac{TP}{TP+FN}\) (Recall, Sensitivity, probability of detection, Power) | False Positive Rate, \(FPR = \frac{FP}{FP+TN}\) (Fall-out) | Positive likelihood ratio, \(LR+ = \frac{TPR}{FPR}\) | F1 score \(= \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}\) | |
| \(\frac{\text{Fail to reject } H_0}{\text{Total}}\) | False Negative Rate, \(FNR = \frac{FN}{TP+FN}\) (Miss rate) | True Negative Rate, \(TNR = \frac{TN}{FP+TN}\) (Specificity, Selectivity) | Negative likelihood ratio, \(LR- = \frac{FNR}{TNR}\) | Diagnostic odds ratio, \(DOR = \frac{LR+}{LR-}\) | |
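
As a quick check on these definitions, the derived rates can be computed directly from the four counts in R; the counts below are purely illustrative:

# Derived rates from the four confusion-matrix counts (illustrative numbers only)
TP <- 90; FP <- 10; FN <- 30; TN <- 870
FDR <- FP / (TP + FP)               # False Discovery Rate
PPV <- TP / (TP + FP)               # Positive predictive value (Precision)
TPR <- TP / (TP + FN)               # True Positive Rate (Recall, Power)
FPR <- FP / (FP + TN)               # False Positive Rate
F1  <- 2 * PPV * TPR / (PPV + TPR)  # F1 score
c(FDR = FDR, PPV = PPV, TPR = TPR, FPR = FPR, F1 = F1)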

2.3 The multiple testing problem

When doing multiple hypothesis testing, we are no longer concerned only with the accuracy of each single hypothesis test.

  • e.g. for 10,000 tests with \(\alpha = 0.01\), we expect about 100 false positives even when every null hypothesis is true, which is too high (see the simulation sketch below).

Instead, we focus on the proportion of false positives among all positive inferences (the False Discovery Rate).

  • e.g. for 10,000 tests, we either decrease the per-test \(\alpha\) from 0.01 to a much smaller value, or control the False Discovery Rate so that it stays below \(\alpha\) overall.
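
A small simulation makes the problem concrete; a minimal base-R sketch in which all 10,000 null hypotheses are true:

# Simulate 10000 tests for which H0 is true in every case
set.seed(1)
m <- 10000
p <- replicate(m, t.test(rnorm(20), rnorm(20))$p.value)
sum(p < 0.01)  # roughly m * 0.01 = 100 false positives at alpha = 0.01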

2.4 Controlling procedures

| Controlling procedure | Threshold | Discussion |
|---|---|---|
| Per-comparison error rate (PCER) controlling procedures | Guarantees \(PCER = \Pr(FP_i > 0) \leq \alpha\) marginally for each \(1 \leq i \leq m\) | |
| Family-wise error rate (FWER) controlling procedures | Guarantees \(FWER = \Pr(FP > 0) \leq \alpha\) | Makes it hard to reject \(H_0\): Type I error is reduced, but at the cost of low power (too conservative) to detect true positives. |
| False discovery rate (FDR) controlling procedures | Guarantees \(FDR = E\left[\frac{FP}{TP+FP}\right] \leq \alpha\) | |
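
In R, a vector of per-test p-values can be adjusted under these procedures with the base function p.adjust; a minimal sketch on a handful of illustrative p-values:

# Adjust p-values under FWER- and FDR-controlling procedures (illustrative values)
p <- c(0.0001, 0.004, 0.019, 0.03, 0.31)
p.adjust(p, method = "bonferroni")  # FWER control: Bonferroni
p.adjust(p, method = "holm")        # FWER control: Holm step-down
p.adjust(p, method = "BH")          # FDR control: Benjamini-Hochberg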

3 Methods

| Method | When to use? | Description |
|---|---|---|
| Fisher's Least Significant Difference (LSD) | Use LSD only for contrasts with a specific group (e.g. a control group or the best group); otherwise it is less reliable. | The p-value gives the expected false positive rate obtained by rejecting the null hypothesis for any result with an equal or smaller p-value. |
| Least significant ranges: q-test | Use the q-test when the experiment requires high precision. | The q-value gives the expected positive false discovery rate (pFDR) obtained by rejecting the null hypothesis for any result with an equal or smaller q-value. |
| Least significant ranges: Shortest Significant Ranges (SSR) | Use SSR for general experimental testing. | |

Quick selection guide · Conditions for applying multiple comparisons

3.1 Fisher’s Least Significant Difference (LSD)

[protected vs unprotected](https://www.graphpad.com/support/faq/fishers-least-significant-difference-lsd-test/)

  • Protection means that you only perform the calculations described above when the overall ANOVA resulted in a P value less than 0.05. If the P value for the ANOVA is greater than 0.05 (or whatever significance level you set), you conclude that the data are consistent with the null hypothesis that all population means are identical, and you don’t look further. The unprotected Fisher’s LSD test is essentially a set of t tests, without any correction for multiple comparisons.
Pairwise comparisons with the t test

| Test | Purpose | Following distribution | Statistic calculation |
|---|---|---|---|
| Fisher's Least Significant Difference, LSD | The first post-hoc test. Identifies which pairs of means are statistically different by carrying out all possible pairwise t tests on the means. The results are not quite the same as truly doing individual t tests, because Fisher's LSD uses the pooled SD from all the groups, not just the two being compared. | Student's \(t\)-distribution | \(t_{\alpha/2} \sqrt{MSE\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}\) |
| Dunn–Šidák correction (FWER-controlled on LSD) | Adjusts the significance level \(\alpha\) using the multiplicative (Šidák) inequality: the probability that no Type I error occurs over all \(m\) tests is at least \((1-\alpha_{adjusted})^m\). | Student's \(t\)-distribution with \(\alpha_{adjusted} = 1-\sqrt[m]{1-\alpha}\) | \(t_{\alpha_{adjusted}/2} \sqrt{MSE\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}\) |
| Bonferroni correction (FWER-controlled on LSD) | Adjusts the significance level \(\alpha\); more conservative than Šidák and suffers from a lack of statistical power. | Student's \(t\)-distribution with \(\alpha_{adjusted} = \frac{\alpha}{m}\) | \(t_{\alpha_{adjusted}/2} \sqrt{MSE\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}\) |
| Holm-Bonferroni method (FWER-controlled on LSD) | A step-down modification of the Bonferroni correction that recovers some of the lost statistical power. | Student's \(t\)-distribution with \(\alpha_{adjusted} = \frac{\alpha}{m+1-rank}\) | \(t_{\alpha_{adjusted}/2} \sqrt{MSE\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}\) |
| Scheffé method (FWER-controlled on LSD) | Can be used to examine all possible linear combinations of group means, not just pairwise comparisons. Use as an exploratory post-hoc method. | \(F\)-distribution | \(\sqrt{(k-1)F_{critical}} \sqrt{MSE\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}\) |
| Dunnett's correction | Similar to LSD, but addresses a special case of the multiple comparisons problem: pairwise comparisons of multiple treatment groups with a single control group, yielding narrower confidence intervals. Compares every treatment mean to the control mean. | Dunnett's \(t\)-distribution, where \(t_{Dunnett}\) is drawn from a multivariate \(t\)-distribution | \(t_{Dunnett} \sqrt{\frac{2MS_{S/A}}{n}}\) |
| Tamhane's T2 | A modification of Šidák for when the equal-variances assumption is violated. | Student's \(t\)-distribution with \(\alpha_{adjusted} = 1-\sqrt[m]{1-\alpha}\) and Welch's degrees of freedom | \(t_{\alpha_{adjusted}/2} \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}\) |
| Dunnett's T3 | A modification of T2, also for when the equal-variances assumption is violated; provides a narrower CI than T2. | Studentized maximum modulus distribution with Welch's degrees of freedom | Studentized maximum modulus critical value \(\times \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}\) |
| Benjamini–Hochberg procedure, BH (FDR-controlled) | Instead of the FWER, uses an FDR-controlling procedure to obtain higher power. | - | Order the p-values \(p_{(1)} \leq \dots \leq p_{(m)}\) and reject \(H_{(i)}\) for all \(i \leq k\), where \(k\) is the largest index with \(p_{(k)} \leq \frac{k}{m}\alpha\) |
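
Most of the t-based procedures above can be run in base R with pairwise.t.test, which pools the SD across all groups just as Fisher's LSD does; Dunnett-type comparisons against a single control need an extra package (multcomp is assumed in the last two lines). A minimal sketch on mtcars, treating the number of cylinders as the grouping factor:

# Pairwise t tests with a pooled SD; "none" corresponds to the unprotected Fisher LSD
pairwise.t.test(mtcars$mpg, factor(mtcars$cyl), p.adjust.method = "none")
pairwise.t.test(mtcars$mpg, factor(mtcars$cyl), p.adjust.method = "bonferroni")
pairwise.t.test(mtcars$mpg, factor(mtcars$cyl), p.adjust.method = "holm")
pairwise.t.test(mtcars$mpg, factor(mtcars$cyl), p.adjust.method = "BH")
# Dunnett-type comparisons of each treatment group against a control (assumes the multcomp package)
mtcars$cyl_f <- factor(mtcars$cyl)
fit <- aov(mpg ~ cyl_f, data = mtcars)
summary(multcomp::glht(fit, linfct = multcomp::mcp(cyl_f = "Dunnett")))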

References: blog 1 · blog 2 · Studentized Range Distribution · Post-Hoc Tests · How to intuitively understand the family-wise error rate and the false discovery rate

3.2 Least significant ranges: q-test

Homogeneous subgroup analysis

| Test | Purpose | Following distribution | Statistic calculation |
|---|---|---|---|
| Tukey's Honestly Significant Difference test, Tukey's HSD | Compares every mean with every other mean to figure out which groups differ, for equal sample sizes. | Studentized Range Distribution | \(q_{\alpha,k,v} \sqrt{\frac{MS_{within}}{n}}\), where \(k\) is the number of groups and \(v\) is the degrees of freedom for the pooled variance |
| Student-Newman-Keuls, SNK | A modification of HSD. With equal sample sizes, it compares pairs of means within homogeneous subsets using a stepwise procedure: means are ordered from highest to lowest, and extreme differences are tested first. Reduces Type I error, but increases Type II error. | Studentized Range Distribution | \(q_{\alpha,k,v} \sqrt{\frac{MS_{within}}{n}}\), where \(k\) is the number of groups and \(v\) is the degrees of freedom for the pooled variance |
| Tukey's-b | Makes pairwise comparisons between groups. The critical value is the average of the corresponding values for the HSD and the SNK. | Studentized Range Distribution | - |
| Hochberg's GT2 | Similar to HSD. A multiple comparison and range test that uses the studentized maximum modulus. | - | - |
| Gabriel | A modification of GT2. A pairwise comparison test that uses the studentized maximum modulus and is generally more powerful than Hochberg's GT2 when the cell sizes are unequal. When cell sizes are identical, it is equivalent to GT2. | - | - |
| Tukey-Kramer method | Similar to HSD. Compares every mean with every other mean for unequal sample sizes, using the harmonic mean of the cell sizes of the two groups being compared. | Studentized Range Distribution | \(q_{\alpha,k,v} \sqrt{\frac{MS_{within}}{n_{harmonic}}}\), where \(k\) is the number of groups and \(v\) is the degrees of freedom for the pooled variance |
| Games-Howell | A modification of Tukey-Kramer for when the equal-variances assumption is violated. | Studentized Range Distribution with Welch's degrees of freedom | \(q_{\alpha,k,v'} \sqrt{\frac{1}{2}\left(\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}\right)}\) |
| Ryan-Einot-Gabriel-Welsch Q (REGWQ) | A modification of the q-test using the REGW multiple step-down procedure. | Studentized Range Distribution | - |
| Dunnett's C | For when the equal-variances assumption is violated. | Studentized Range Distribution | - |
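
Tukey's HSD (and its Tukey-Kramer extension to unequal group sizes) is available in base R via TukeyHSD on an aov fit; the SNK and Games-Howell tests need extra packages, which are assumed in the commented lines below. A minimal sketch on the same mtcars grouping:

# Tukey HSD / Tukey-Kramer on an aov fit (base R)
mtcars$cyl_f <- factor(mtcars$cyl)
fit <- aov(mpg ~ cyl_f, data = mtcars)
TukeyHSD(fit, conf.level = 0.95)
plot(TukeyHSD(fit))  # confidence intervals for every pairwise difference
# SNK and Games-Howell require extra packages (agricolae and rstatix are assumed here)
# agricolae::SNK.test(fit, "cyl_f", console = TRUE)
# rstatix::games_howell_test(mtcars, mpg ~ cyl_f)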

3.3 Least significant ranges: Shortest Significant Ranges (SSR)

Homogeneous subgroup analysis

| Test | Purpose | Following distribution | Statistic calculation |
|---|---|---|---|
| Duncan's new multiple range test (MRT) | A modification of SNK that uses the SSR value instead of the q-value. Makes pairwise comparisons using a stepwise order of comparisons identical to the order used by SNK, but sets a protection level for the error rate of the collection of tests rather than an error rate for individual tests. Reduces Type II error, but increases Type I error. | Duncan's Multiple Range Distribution | \(SSR\) statistic |
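
Duncan's MRT is not in base R; the sketch below assumes the agricolae package and its duncan.test interface:

# Duncan's new multiple range test (assumes the agricolae package)
mtcars$cyl_f <- factor(mtcars$cyl)
fit <- aov(mpg ~ cyl_f, data = mtcars)
agricolae::duncan.test(fit, "cyl_f", console = TRUE)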

3.4 Other

Equal variances assumed

| Test | Purpose | Following distribution | Statistic calculation |
|---|---|---|---|
| Ryan-Einot-Gabriel-Welsch F (REGWF) | A modification of the F test using the REGW multiple step-down procedure, testing subsets from smallest to largest. Recommended for balanced designs (equal numbers of observations per cell). | \(F\)-distribution with significance levels \(\alpha_k = \alpha\) when \(k = g\) or \(k = g-1\), and \(\alpha_k = 1-(1-\alpha)^{k/g}\) when \(k < g-1\), where \(g\) is the number of means in the group being tested and \(k\) is the number of means in a subset | - |
| Waller-Duncan | A multiple comparison test based on a t statistic that uses a Bayesian approach. An ANOVA must be fitted first. | - | - |
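
The Waller-Duncan test is also provided by agricolae (the waller.test interface below is an assumption about that package); consistent with the note above, it is run on a fitted ANOVA:

# Waller-Duncan Bayesian k-ratio t test (assumes the agricolae package)
mtcars$cyl_f <- factor(mtcars$cyl)
fit <- aov(mpg ~ cyl_f, data = mtcars)
agricolae::waller.test(fit, "cyl_f", console = TRUE)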