attach(mtcars)

Use of this document

Once an ANOVA has given a statistically significant result (though this is not strictly necessary, see 4.2.1), we want to check which specific pairs of means differ from each other, from the grand mean, or from a control. There are two main objectives of doing multiple comparisons:

1 Three Assumptions

Multiple comparison procedures rely on the same three assumptions as the ANOVA itself:

1.1 Normality of the dependent variable's distribution

# Shapiro-Wilk test of normality; H0: the sample came from a normally distributed population
# Test univariate normality
shapiro.test(mtcars$mpg)
# Test multivariate normality (mshapiro.test expects variables in rows, hence the transpose)
mvnormtest::mshapiro.test(t(mtcars[, 3:5]))
# Significant departures from the line suggest violations of normality.
qqnorm(mtcars$mpg)
qqline(mtcars$mpg)

1.2 Homogeneity of variance

The variance in each cell should be similar. Check this with Levene's test or another homogeneity-of-variance test; these are generally produced as part of the ANOVA output. Sample size: more than 20 observations per cell is preferred, which improves robustness to violations of the first two assumptions, and a larger sample size also increases power.

# Levene's test for homogeneity of variance; H0: equal variances
# less sensitive than the Bartlett test to departures from normality
data(Sacramento, package = "caret")  # Sacramento housing data used in these examples
car::leveneTest(price ~ type, data = Sacramento)
# Bartlett test of homogeneity of variances; H0: equal variances
bartlett.test(price ~ type, data = Sacramento)
# Fligner-Killeen test of homogeneity of variances; H0: equal variances
fligner.test(price ~ type, data = Sacramento)
# Homogeneity of Variance Plot
library(HH)
HH::hovPlot(price ~ type, data=Sacramento)

1.3 Independent observations

Scores on one variable or for one group should not be dependent on another variable or group (usually guaranteed by the design of the study)

2 Threshold for a positive test

2.1 Confusion matrix

The following generalizes the confusion matrix from a single hypothesis test to multiple hypothesis tests.

Classification of a single hypothesis test

| | \(H_A\) is true | \(H_0\) is true |
|---|---|---|
| Reject \(H_0\) | Correct inference (TP), probability \(1-\beta\) | Type I error (FP), probability \(\alpha\) |
| Fail to reject \(H_0\) | Type II error (FN), probability \(\beta\) | Correct inference (TN), probability \(1-\alpha\) |

Confusion matrix of multiple hypothesis tests

| | \(H_A\) is true | \(H_0\) is true | Total | \(\frac{H_A \text{ is true}}{\text{Total}}\) | \(\frac{H_0 \text{ is true}}{\text{Total}}\) |
|---|---|---|---|---|---|
| Reject \(H_0\) | \(TP\) | \(FP\) | \(TP+FP\) | Positive predictive value (Precision), \(PPV = \frac{TP}{TP+FP}\) | False Discovery Rate, \(FDR = \frac{FP}{TP+FP}\) |
| Fail to reject \(H_0\) | \(FN\) | \(TN\) | \(FN+TN\) | False omission rate, \(FOR = \frac{FN}{FN+TN}\) | Negative predictive value, \(NPV = \frac{TN}{FN+TN}\) |
| Total | \(TP+FN\) | \(FP+TN\) | \(TP+FP+FN+TN\) | Prevalence, \(\frac{TP+FN}{TP+FP+FN+TN}\) | |
| \(\frac{\text{Reject } H_0}{\text{Total}}\) | True Positive Rate, \(TPR = \frac{TP}{TP+FN}\) (Recall, Sensitivity, probability of detection, Power) | False Positive Rate, \(FPR = \frac{FP}{FP+TN}\) (Fall-out) | Positive likelihood ratio, \(LR+ = \frac{TPR}{FPR}\) | F1 score \(= \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}\) | |
| \(\frac{\text{Fail to reject } H_0}{\text{Total}}\) | False Negative Rate, \(FNR = \frac{FN}{TP+FN}\) (Miss rate) | True Negative Rate, \(TNR = \frac{TN}{FP+TN}\) (Specificity, Selectivity) | Negative likelihood ratio, \(LR- = \frac{FNR}{TNR}\) | Diagnostic odds ratio, \(DOR = \frac{LR+}{LR-}\) | |
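
As a quick check on these definitions, the derived rates can be computed directly from the four counts in R; the counts below are purely illustrative:

# Derived rates from the four confusion-matrix counts (illustrative numbers only)
TP <- 90; FP <- 10; FN <- 30; TN <- 870
FDR <- FP / (TP + FP)               # False Discovery Rate
PPV <- TP / (TP + FP)               # Positive predictive value (Precision)
TPR <- TP / (TP + FN)               # True Positive Rate (Recall, Power)
FPR <- FP / (FP + TN)               # False Positive Rate
F1  <- 2 * PPV * TPR / (PPV + TPR)  # F1 score
c(FDR = FDR, PPV = PPV, TPR = TPR, FPR = FPR, F1 = F1)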

2.3 The multiple testing problem

When doing multiple hypothesis testing, we are no longer concerned only with the accuracy of each single hypothesis test.

  • e.g. for 10,000 tests with \(\alpha = 0.01\), we expect about 100 false positives even when every null hypothesis is true, which is too high (see the simulation sketch below).

Instead, we focus on the proportion of false positives among all positive inferences (the False Discovery Rate).

  • e.g. for 10,000 tests, we either decrease the per-test \(\alpha\) from 0.01 to a much smaller value, or control the False Discovery Rate so that it stays below \(\alpha\) overall.
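
A small simulation makes the problem concrete; a minimal base-R sketch in which all 10,000 null hypotheses are true:

# Simulate 10000 tests for which H0 is true in every case
set.seed(1)
m <- 10000
p <- replicate(m, t.test(rnorm(20), rnorm(20))$p.value)
sum(p < 0.01)  # roughly m * 0.01 = 100 false positives at alpha = 0.01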

2.4 Controlling procedures

| Controlling procedure | Threshold | Discussion |
|---|---|---|
| Per-comparison error rate (PCER) controlling procedures | Guarantees \(PCER = \Pr(FP_i > 0) \leq \alpha\) marginally for each \(1 \leq i \leq m\) | |
| Family-wise error rate (FWER) controlling procedures | Guarantees \(FWER = \Pr(FP > 0) \leq \alpha\) | Makes it hard to reject \(H_0\): Type I error is reduced, but at the cost of low power (too conservative) to detect true positives. |
| False discovery rate (FDR) controlling procedures | Guarantees \(FDR = E\left[\frac{FP}{TP+FP}\right] \leq \alpha\) | |
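
In R, a vector of per-test p-values can be adjusted under these procedures with the base function p.adjust; a minimal sketch on a handful of illustrative p-values:

# Adjust p-values under FWER- and FDR-controlling procedures (illustrative values)
p <- c(0.0001, 0.004, 0.019, 0.03, 0.31)
p.adjust(p, method = "bonferroni")  # FWER control: Bonferroni
p.adjust(p, method = "holm")        # FWER control: Holm step-down
p.adjust(p, method = "BH")          # FDR control: Benjamini-Hochberg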

3 Methods

| Method | When to use? | Description |
|---|---|---|
| Fisher's Least Significant Difference (LSD) | Use LSD only for contrasts with a specific group (e.g. a control group or the best group); otherwise it is less reliable. | The p-value gives the expected false positive rate obtained by rejecting the null hypothesis for any result with an equal or smaller p-value. |
| Least significant ranges: q-test | Use the q-test when the experiment requires high precision. | The q-value gives the expected positive false discovery rate (pFDR) obtained by rejecting the null hypothesis for any result with an equal or smaller q-value. |
| Least significant ranges: Shortest Significant Ranges (SSR) | Use SSR for general experimental testing. | |

Quick selection guide · Conditions for applying multiple comparisons

3.1 Fisher’s Least Significant Difference (LSD)

[protected vs unprotected](https://www.graphpad.com/support/faq/fishers-least-significant-difference-lsd-test/)

  • Protection means that you only perform the calculations described above when the overall ANOVA resulted in a P value less than 0.05. If the P value for the ANOVA is greater than 0.05 (or whatever significance level you set), you conclude that the data are consistent with the null hypothesis that all population means are identical, and you don’t look further. The unprotected Fisher’s LSD test is essentially a set of t tests, without any correction for multiple comparisons.
Pairwise comparisons with the t test

| Test | Purpose | Following distribution | Statistic calculation |
|---|---|---|---|
| Fisher's Least Significant Difference, LSD | The first post-hoc test. Identifies which pairs of means are statistically different by carrying out all possible pairwise t tests on the means. The results are not quite the same as truly doing individual t tests, because Fisher's LSD uses the pooled SD from all the groups, not just the two being compared. | Student's \(t\)-distribution | \(t_{\alpha/2} \sqrt{MSE\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}\) |
| Dunn–Šidák correction (FWER-controlled on LSD) | Adjusts the significance level \(\alpha\) using the multiplicative (Šidák) inequality: the probability that no Type I error occurs over all \(m\) tests is at least \((1-\alpha_{adjusted})^m\). | Student's \(t\)-distribution with \(\alpha_{adjusted} = 1-\sqrt[m]{1-\alpha}\) | \(t_{\alpha_{adjusted}/2} \sqrt{MSE\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}\) |
| Bonferroni correction (FWER-controlled on LSD) | Adjusts the significance level \(\alpha\); more conservative than Šidák and suffers from a lack of statistical power. | Student's \(t\)-distribution with \(\alpha_{adjusted} = \frac{\alpha}{m}\) | \(t_{\alpha_{adjusted}/2} \sqrt{MSE\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}\) |
| Holm-Bonferroni method (FWER-controlled on LSD) | A step-down modification of the Bonferroni correction that recovers some of the lost statistical power. | Student's \(t\)-distribution with \(\alpha_{adjusted} = \frac{\alpha}{m+1-rank}\) | \(t_{\alpha_{adjusted}/2} \sqrt{MSE\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}\) |
| Scheffé method (FWER-controlled on LSD) | Can be used to examine all possible linear combinations of group means, not just pairwise comparisons. Use as an exploratory post-hoc method. | \(F\)-distribution | \(\sqrt{(k-1)F_{critical}} \sqrt{MSE\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}\) |
| Dunnett's correction | Similar to LSD, but addresses a special case of the multiple comparisons problem: pairwise comparisons of multiple treatment groups with a single control group, yielding narrower confidence intervals. Compares every treatment mean to the control mean. | Dunnett's \(t\)-distribution, where \(t_{Dunnett}\) is drawn from a multivariate \(t\)-distribution | \(t_{Dunnett} \sqrt{\frac{2MS_{S/A}}{n}}\) |
| Tamhane's T2 | A modification of Šidák for when the equal-variances assumption is violated. | Student's \(t\)-distribution with \(\alpha_{adjusted} = 1-\sqrt[m]{1-\alpha}\) and Welch's degrees of freedom | \(t_{\alpha_{adjusted}/2} \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}\) |
| Dunnett's T3 | A modification of T2, also for when the equal-variances assumption is violated; provides a narrower CI than T2. | Studentized maximum modulus distribution with Welch's degrees of freedom | Studentized maximum modulus critical value \(\times \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}\) |
| Benjamini–Hochberg procedure, BH (FDR-controlled) | Instead of the FWER, uses an FDR-controlling procedure to obtain higher power. | - | Order the p-values \(p_{(1)} \leq \dots \leq p_{(m)}\) and reject \(H_{(i)}\) for all \(i \leq k\), where \(k\) is the largest index with \(p_{(k)} \leq \frac{k}{m}\alpha\) |
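
Most of the t-based procedures above can be run in base R with pairwise.t.test, which pools the SD across all groups just as Fisher's LSD does; Dunnett-type comparisons against a single control need an extra package (multcomp is assumed in the last two lines). A minimal sketch on mtcars, treating the number of cylinders as the grouping factor:

# Pairwise t tests with a pooled SD; "none" corresponds to the unprotected Fisher LSD
pairwise.t.test(mtcars$mpg, factor(mtcars$cyl), p.adjust.method = "none")
pairwise.t.test(mtcars$mpg, factor(mtcars$cyl), p.adjust.method = "bonferroni")
pairwise.t.test(mtcars$mpg, factor(mtcars$cyl), p.adjust.method = "holm")
pairwise.t.test(mtcars$mpg, factor(mtcars$cyl), p.adjust.method = "BH")
# Dunnett-type comparisons of each treatment group against a control (assumes the multcomp package)
mtcars$cyl_f <- factor(mtcars$cyl)
fit <- aov(mpg ~ cyl_f, data = mtcars)
summary(multcomp::glht(fit, linfct = multcomp::mcp(cyl_f = "Dunnett")))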

References: blog 1 · blog 2 · Studentized Range Distribution · Post-Hoc Tests · How to intuitively understand the family-wise error rate and the false discovery rate

3.2 Least significant ranges: q-test

Homogeneous subgroup analysis

| Test | Purpose | Following distribution | Statistic calculation |
|---|---|---|---|
| Tukey's Honestly Significant Difference test, Tukey's HSD | Compares every mean with every other mean to figure out which groups differ, for equal sample sizes. | Studentized Range Distribution | \(q_{\alpha,k,v} \sqrt{\frac{MS_{within}}{n}}\), where \(k\) is the number of groups and \(v\) is the degrees of freedom for the pooled variance |
| Student-Newman-Keuls, SNK | A modification of HSD. With equal sample sizes, it compares pairs of means within homogeneous subsets using a stepwise procedure: means are ordered from highest to lowest, and extreme differences are tested first. Reduces Type I error, but increases Type II error. | Studentized Range Distribution | \(q_{\alpha,k,v} \sqrt{\frac{MS_{within}}{n}}\), where \(k\) is the number of groups and \(v\) is the degrees of freedom for the pooled variance |
| Tukey's-b | Makes pairwise comparisons between groups. The critical value is the average of the corresponding values for the HSD and the SNK. | Studentized Range Distribution | - |
| Hochberg's GT2 | Similar to HSD. A multiple comparison and range test that uses the studentized maximum modulus. | - | - |
| Gabriel | A modification of GT2. A pairwise comparison test that uses the studentized maximum modulus and is generally more powerful than Hochberg's GT2 when the cell sizes are unequal. When cell sizes are identical, it is equivalent to GT2. | - | - |
| Tukey-Kramer method | Similar to HSD. Compares every mean with every other mean for unequal sample sizes, using the harmonic mean of the cell sizes of the two groups being compared. | Studentized Range Distribution | \(q_{\alpha,k,v} \sqrt{\frac{MS_{within}}{n_{harmonic}}}\), where \(k\) is the number of groups and \(v\) is the degrees of freedom for the pooled variance |
| Games-Howell | A modification of Tukey-Kramer for when the equal-variances assumption is violated. | Studentized Range Distribution with Welch's degrees of freedom | \(q_{\alpha,k,v'} \sqrt{\frac{1}{2}\left(\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}\right)}\) |
| Ryan-Einot-Gabriel-Welsch Q (REGWQ) | A modification of the q-test using the REGW multiple step-down procedure. | Studentized Range Distribution | - |
| Dunnett's C | For when the equal-variances assumption is violated. | Studentized Range Distribution | - |
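
Tukey's HSD (and its Tukey-Kramer extension to unequal group sizes) is available in base R via TukeyHSD on an aov fit; the SNK and Games-Howell tests need extra packages, which are assumed in the commented lines below. A minimal sketch on the same mtcars grouping:

# Tukey HSD / Tukey-Kramer on an aov fit (base R)
mtcars$cyl_f <- factor(mtcars$cyl)
fit <- aov(mpg ~ cyl_f, data = mtcars)
TukeyHSD(fit, conf.level = 0.95)
plot(TukeyHSD(fit))  # confidence intervals for every pairwise difference
# SNK and Games-Howell require extra packages (agricolae and rstatix are assumed here)
# agricolae::SNK.test(fit, "cyl_f", console = TRUE)
# rstatix::games_howell_test(mtcars, mpg ~ cyl_f)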

3.3 Least significant ranges: Shortest Significant Ranges (SSR)

Homogeneous subgroup analysis

| Test | Purpose | Following distribution | Statistic calculation |
|---|---|---|---|
| Duncan's new multiple range test (MRT) | A modification of SNK that uses the SSR value instead of the q-value. Makes pairwise comparisons using a stepwise order of comparisons identical to the order used by SNK, but sets a protection level for the error rate of the collection of tests rather than an error rate for individual tests. Reduces Type II error, but increases Type I error. | Duncan's Multiple Range Distribution | \(SSR\) statistic |
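
Duncan's MRT is not in base R; the sketch below assumes the agricolae package and its duncan.test interface:

# Duncan's new multiple range test (assumes the agricolae package)
mtcars$cyl_f <- factor(mtcars$cyl)
fit <- aov(mpg ~ cyl_f, data = mtcars)
agricolae::duncan.test(fit, "cyl_f", console = TRUE)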

3.4 Other

Equal variances assumed

| Test | Purpose | Following distribution | Statistic calculation |
|---|---|---|---|
| Ryan-Einot-Gabriel-Welsch F (REGWF) | A modification of the F test using the REGW multiple step-down procedure, testing subsets from smallest to largest. Recommended for balanced designs (equal numbers of observations per cell). | \(F\)-distribution with significance levels \(\alpha_k = \alpha\) when \(k = g\) or \(k = g-1\), and \(\alpha_k = 1-(1-\alpha)^{k/g}\) when \(k < g-1\), where \(g\) is the number of means in the group being tested and \(k\) is the number of means in a subset | - |
| Waller-Duncan | A multiple comparison test based on a t statistic that uses a Bayesian approach. An ANOVA must be fitted first. | - | - |
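
The Waller-Duncan test is also provided by agricolae (the waller.test interface below is an assumption about that package); consistent with the note above, it is run on a fitted ANOVA:

# Waller-Duncan Bayesian k-ratio t test (assumes the agricolae package)
mtcars$cyl_f <- factor(mtcars$cyl)
fit <- aov(mpg ~ cyl_f, data = mtcars)
agricolae::waller.test(fit, "cyl_f", console = TRUE)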