attach(mtcars)
Once an ANOVA has given a statistically significant result (this is not strictly necessary; see 4.2.1), we want to check which specific means differ: from each other, from the overall mean, or from a control. There are two main objectives of doing multiple comparisons:
Multiple comparison procedures rely on the same three assumptions as the ANOVA:
# Shapiro-Wilk test of normality; H0: the sample x_i came from a normally distributed population
# Test univariate normality
shapiro.test(mtcars$mpg)
# Test multivariate normality (mshapiro.test expects variables in rows, hence t())
mvnormtest::mshapiro.test(t(mtcars[,3:5]))
# Q-Q plot: marked departures of the points from the line suggest violations of normality.
qqnorm(mtcars$mpg)
qqline(mtcars$mpg)
The variance in each cell should be similar. Check this with Levene’s test or another homogeneity-of-variance test; these are generally produced as part of the ANOVA output. Sample size: more than 20 observations per cell is preferred. This aids robustness to violations of the first two assumptions, and a larger sample size also increases power.
# Levene's test for homogeneity of variance; H0: equal variances
# less sensitive than the Bartlett test to departures from normality
# (Sacramento is assumed here to be the data set shipped with caret: data(Sacramento, package = "caret"))
car::leveneTest(price ~ type, data = Sacramento)
# Bartlett test of homogeneity of variances; H0: equal variances
# (y, G and mydata are placeholders for the response, the grouping factor and the data frame)
bartlett.test(y~G, data=mydata)
# Fligner-Killeen test of homogeneity of variances; H0: equal variances
fligner.test(y~G, data=mydata)
# Homogeneity of variance plot
library(HH)
HH::hovPlot(price ~ type, data=Sacramento)
Scores on one variable or for one group should not depend on another variable or group (this is usually guaranteed by the design of the study).
The following generalizes the confusion matrix from a single hypothesis test to multiple hypothesis tests.
Decision | \(H_A\) is true | \(H_0\) is true |
---|---|---|
Reject \(H_0\) | Correct inference (TP) (probability = \(1-\beta\)) | Type I error (FP) (probability = \(\alpha\)) |
Fail to reject \(H_0\) | Type II error (FN) (probability = \(\beta\)) | Correct inference (TN) (probability = \(1-\alpha\)) |
Decision | \(H_A\) is true | \(H_0\) is true | Total | \(\frac{H_A \ is \ true}{Total}\) | \(\frac{H_0 \ is \ true}{Total}\) |
---|---|---|---|---|---|
Reject \(H_0\) | \(TP\) | \(FP\) | \(TP+FP\) | Positive predictive value, \(PPV = \frac{TP}{TP+FP}\) (Precision) | False Discovery Rate, \(FDR = \frac{FP}{TP+FP}\) |
Fail to reject \(H_0\) | \(FN\) | \(TN\) | \(FN+TN\) | False omission rate, \(FOR = \frac{FN}{FN+TN}\) | Negative predictive value, \(NPV = \frac{TN}{FN+TN}\) |
Total | \(TP+FN\) | \(FP+TN\) | \(TP+FP+FN+TN\) | Prevalence, \(\frac{TP+FN}{TP+FP+FN+TN}\) | |
\(\frac{Reject \ H_0}{Total}\) | True Positive Rate, \(TPR = \frac{TP}{TP+FN}\) (Recall, Sensitivity, probability of detection, Power) | False Positive Rate, \(FPR = \frac{FP}{FP+TN}\) (Fall-out) | Positive likelihood ratio, \(LR+ = \frac{TPR}{FPR}\) | F1 score \(= \frac{2 \times Precision \times Recall}{Precision + Recall}\) | |
\(\frac{Fail \ to \ reject \ H_0}{Total}\) | False Negative Rate, \(FNR = \frac{FN}{TP+FN}\) (Miss rate) | True Negative Rate, \(TNR = \frac{TN}{FP+TN}\) (Specificity, Selectivity) | Negative likelihood ratio, \(LR- = \frac{FNR}{TNR}\) | Diagnostic odds ratio, \(DOR = \frac{LR+}{LR-}\) | |
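As a rough illustration of how the rates in the table above follow from the four cell counts, here is a minimal R sketch. The counts are hypothetical numbers chosen for the example, not results from any data set.

```r
# Hypothetical outcome counts for m = 1000 tests (illustrative only)
TP <- 80; FP <- 20; FN <- 20; TN <- 880

metrics <- c(
  PPV = TP / (TP + FP),  # precision
  FDR = FP / (TP + FP),  # false discovery rate
  FOR = FN / (FN + TN),  # false omission rate
  NPV = TN / (FN + TN),  # negative predictive value
  TPR = TP / (TP + FN),  # recall, sensitivity, power
  FPR = FP / (FP + TN),  # fall-out
  FNR = FN / (TP + FN),  # miss rate
  TNR = TN / (FP + TN)   # specificity
)

# Derived quantities from the last rows of the table
metrics["LR+"] <- metrics[["TPR"]] / metrics[["FPR"]]
metrics["LR-"] <- metrics[["FNR"]] / metrics[["TNR"]]
metrics["DOR"] <- metrics[["LR+"]] / metrics[["LR-"]]
metrics["F1"]  <- 2 * metrics[["PPV"]] * metrics[["TPR"]] /
  (metrics[["PPV"]] + metrics[["TPR"]])

round(metrics, 3)
```

Note that PPV and FDR are complements, as are TPR and FNR, so only one of each pair needs to be reported.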
When doing multiple hypothesis testing, we are no longer mainly concerned with the accuracy of a single hypothesis test.
Instead, we focus on the proportion of false positives among all positive inferences (the False Discovery Rate).
Controlling procedures | Threshold | Discussion |
---|---|---|
Per-comparison error rate (PCER) controlling procedures | Guarantees \(PCER = Pr(FP_i > 0) \leq \alpha\) marginally for all \(1 \leq i \leq m\) | |
Family-wise error rate (FWER) controlling procedures | Guarantees \(FWER = Pr(FP > 0) \leq \alpha\) | Makes it hard to reject \(H_0\). Reduces the type I error (very high specificity) at the cost of low power to detect true positives, so it is often too conservative. |
False discovery rate (FDR) controlling procedures | Guarantees \(FDR = E[\frac{FP}{TP+FP}] \leq \alpha\) | |
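To see why per-comparison control is not enough, the sketch below computes the family-wise error rate for \(m\) tests each run at level \(\alpha\). It assumes, purely for illustration, that the tests are independent and that every null hypothesis is true; the Bonferroni-style level \(\alpha/m\) is included for comparison.

```r
alpha <- 0.05
m <- c(1, 3, 10, 45)  # e.g. 45 = all pairwise comparisons among 10 groups

# Probability of at least one false positive among m independent true nulls
fwer_unadjusted <- 1 - (1 - alpha)^m

# Same probability when each test is run at level alpha / m
fwer_bonferroni <- 1 - (1 - alpha / m)^m

data.frame(m, fwer_unadjusted, fwer_bonferroni)
```

Under these assumptions the unadjusted FWER grows quickly with \(m\), while the adjusted version stays at or below \(\alpha\).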
Method | When to use? | Description |
---|---|---|
Fisher’s Least Significant Difference (LSD) | Use LSD only for contrasts with a specific group (e.g. a control group or the best group); otherwise it is less reliable. | The p-value gives the expected false positive rate obtained by rejecting the null hypothesis for any result with an equal or smaller p-value. |
Least significant ranges: q-test | Use the q-test when the experiment requires high precision. | The q-value gives the expected positive false discovery rate (pFDR) obtained by rejecting the null hypothesis for any result with an equal or smaller q-value. |
Least significant ranges: Shortest Significant Ranges (SSR) | Use SSR in general experimental tests. | |
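Fisher's LSD can be sketched in base R as unadjusted pairwise t tests that share a pooled SD, run only when the omnibus ANOVA is significant (the "protected" variant). The choice of `mtcars` with `cyl` as the grouping factor and the 0.05 cut-off are assumptions made for this example.

```r
# One-way ANOVA of mpg across the three cylinder groups
fit <- aov(mpg ~ factor(cyl), data = mtcars)
p_overall <- summary(fit)[[1]][["Pr(>F)"]][1]

# Protected LSD: look at pairs only if the overall F test rejects
if (p_overall < 0.05) {
  lsd <- pairwise.t.test(mtcars$mpg, factor(mtcars$cyl),
                         p.adjust.method = "none",  # no multiplicity correction
                         pool.sd = TRUE)            # pooled SD from all groups
  print(lsd$p.value)
}
```

Dropping the `if` guard gives the unprotected version, which is just a set of pooled-SD t tests.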
[protected vs unprotected](https://www.graphpad.com/support/faq/fishers-least-significant-difference-lsd-test/)
Protection means that you only perform the calculations described above when the overall ANOVA resulted in a P value less than 0.05. If the P value for the ANOVA is greater than 0.05 (or whatever significance level you set), you conclude that the data are consistent with the null hypothesis that all population means are identical, and you don’t look further. The unprotected Fisher’s LSD test is essentially a set of t tests, without any correction for multiple comparisons.
Test | Purpose | Following distribution | Statistic calculation |
---|---|---|---|
Fisher’s Least Significant Difference, LSD | The first post hoc test. Identifies which pairs of means are statistically different by running all possible pairwise t-tests on the means. The results are not quite the same as truly doing individual t tests, because Fisher’s LSD test uses the pooled SD from all the groups and not just the two being compared. | Student’s \(t\)-distribution | \(t_{\alpha/2} \sqrt{MSE(\frac{1}{n_A} + \frac{1}{n_B})}\) |
Dunn–Šidák correction (FWER-controlled on LSD) | Adjusts the significance level \(\alpha\) for \(m\) comparisons using the multiplicative inequality: the probability that all events occur together is at least the product of their individual probabilities. | Student’s \(t\)-distribution with \(\alpha_{adjusted} = 1 - \sqrt[m]{1-\alpha}\) | \(t_{\alpha_{adjusted}/2} \sqrt{MSE(\frac{1}{n_A} + \frac{1}{n_B})}\) |
Bonferroni correction (FWER-controlled on LSD) | Adjusts the significance level \(\alpha\) for \(m\) comparisons. Slightly more conservative than Šidák; suffers from a lack of statistical power. | Student’s \(t\)-distribution with \(\alpha_{adjusted} = \frac{\alpha}{m}\) | \(t_{\alpha_{adjusted}/2} \sqrt{MSE(\frac{1}{n_A} + \frac{1}{n_B})}\) |
Holm-Bonferroni method (FWER-controlled on LSD) | A modification of the Bonferroni correction that addresses its lack of statistical power. P-values are ranked from smallest to largest. | Student’s \(t\)-distribution with \(\alpha_{adjusted} = \frac{\alpha}{m+1-rank}\) | \(t_{\alpha_{adjusted}/2} \sqrt{MSE(\frac{1}{n_A} + \frac{1}{n_B})}\) |
Scheffé method (FWER-controlled on LSD) | Can be used to examine all possible linear combinations of group means, not just pairwise comparisons. Use as an exploratory post hoc method. | \(F\)-distribution | \(\sqrt{(k-1)F_{critical}} \sqrt{MSE(\frac{1}{n_A}+\frac{1}{n_B})}\) where \(k\) is the number of groups |
Dunnett’s correction | Similar to LSD, but addresses a special case of the multiple comparisons problem: pairwise comparisons of several treatment groups with a single control group. Every treatment mean is compared with the control mean, which yields narrower confidence intervals than comparing all pairs. | Dunnett’s \(t\)-distribution, where \(t_{Dunnett}\) is drawn from a multivariate \(t\)-distribution | \(t_{Dunnett} \sqrt{\frac{2MS_{S/A}}{n}}\) |
Tamhane’s T2 | A modification of Šidák for use when the equal-variance assumption is violated. | Student’s \(t\)-distribution with Welch’s degrees of freedom and \(\alpha_{adjusted} = 1 - \sqrt[m]{1-\alpha}\) | \(t_{\alpha_{adjusted}/2} \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}\) |
Dunnett’s T3 | A modification of T2, also for use when the equal-variance assumption is violated. Provides a narrower CI than T2. | Studentized maximum modulus distribution with Welch’s degrees of freedom | \(M_{\alpha, m, v} \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}\) where \(M\) is the studentized maximum modulus critical value for \(m\) comparisons and \(v\) is Welch’s degrees of freedom |
Benjamini–Hochberg procedure, BH (FDR-controlled) | Instead of FWER, uses an FDR-controlling procedure to obtain higher power. | - | Reject \(H_{(1)}, \dots, H_{(k)}\) for the largest \(k\) such that \(p_{(k)} \leq \frac{k}{m}\alpha\) |
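The p-value adjustments in the table above can be compared side by side with base R's `p.adjust`. The sketch below uses made-up p-values, with \(m\) denoting the number of comparisons; the Šidák adjustment is computed by hand because `p.adjust` does not provide it.

```r
p_raw <- c(0.001, 0.008, 0.020, 0.041, 0.300)  # hypothetical raw p-values
m     <- length(p_raw)
alpha <- 0.05

adjusted <- data.frame(
  raw        = p_raw,
  sidak      = 1 - (1 - p_raw)^m,                      # manual Sidak adjustment
  bonferroni = p.adjust(p_raw, method = "bonferroni"),
  holm       = p.adjust(p_raw, method = "holm"),
  BH         = p.adjust(p_raw, method = "BH")
)
adjusted

# Number of rejections at level alpha under each method
colSums(adjusted < alpha)
```

Comparing an adjusted p-value with \(\alpha\) is equivalent to comparing the raw p-value with the adjusted \(\alpha\) in the table. With these particular numbers the FWER methods reject two hypotheses and BH rejects three, which illustrates the extra power of FDR control.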
Test | Purpose | Following distribution | Statistic calculation |
---|---|---|---|
Tukey’s Honestly Significant Difference Test, Tukey’s HSD | Compares every mean with every other mean to find out which groups differ, for equal sample sizes. | Studentized Range Distribution | \(q_{\alpha,k,v} \sqrt{\frac{MS_{within}}{n}}\) where \(k\) is the number of groups and \(v\) is the degrees of freedom for the pooled variance |
Student-Newman-Keuls, SNK | A modification of HSD. With equal sample sizes, it also compares pairs of means within homogeneous subsets, using a stepwise procedure. Means are ordered from highest to lowest, and extreme differences are tested first. Reduces type I error but increases type II error. | Studentized Range Distribution | \(q_{\alpha,r,v} \sqrt{\frac{MS_{within}}{n}}\) where \(r\) is the number of ordered means spanned by the pair being compared and \(v\) is the degrees of freedom for the pooled variance |
Tukey’s-b | Makes pairwise comparisons between groups. The critical value is the average of the corresponding values for HSD and SNK. | Studentized Range Distribution | - |
Hochberg’s GT2 | Similar to HSD. Multiple comparison and range test that uses the Studentized maximum modulus. | Studentized maximum modulus distribution | - |
Gabriel | A modification of GT2. Pairwise comparison test that uses the Studentized maximum modulus and is generally more powerful than Hochberg’s GT2 when the cell sizes are unequal. When the cell sizes are identical, it is equal to GT2. | Studentized maximum modulus distribution | - |
Tukey-Kramer Method | Similar to HSD. Compares every mean with every other mean for unequal sample sizes, using the harmonic mean of the cell sizes of the two groups being compared. | Studentized Range Distribution | \(q_{\alpha,k,v} \sqrt{\frac{MS_{within}}{n_{harmonic}}}\) where \(n_{harmonic} = \frac{2}{1/n_A + 1/n_B}\), \(k\) is the number of groups and \(v\) is the degrees of freedom for the pooled variance |
Games-Howell | A modification of Tukey-Kramer for use when the equal-variance assumption is violated. | Studentized Range Distribution with Welch’s degrees of freedom | \(\frac{q_{\alpha,k,v}}{\sqrt{2}} \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}\) where \(v\) is Welch’s degrees of freedom |
Ryan-Einot-Gabriel-Welsch-Q (REGWQ) | A modification of the q-test using the REGW multiple stepdown procedure. | Studentized Range Distribution | - |
Dunnett’s C | For use when the equal-variance assumption is violated. | Studentized Range Distribution | - |
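For the studentized range tests above, base R's `TukeyHSD` gives Tukey-adjusted intervals and handles unequal group sizes in the Tukey-Kramer way. The sketch below uses `mtcars` with `cyl` as the grouping factor (an illustrative choice) and reproduces the critical difference for one pair by hand from the Tukey-Kramer formula in the table.

```r
dat <- transform(mtcars, cyl = factor(cyl))
fit <- aov(mpg ~ cyl, data = dat)

# All pairwise differences with Tukey-adjusted 95% intervals and p-values
tuk <- TukeyHSD(fit, "cyl", conf.level = 0.95)
tuk$cyl

# Manual critical difference for the 6- vs 4-cylinder comparison
k   <- nlevels(dat$cyl)                          # number of groups
v   <- df.residual(fit)                          # df of the pooled variance
mse <- deviance(fit) / v                         # MS_within
n   <- sapply(split(dat$mpg, dat$cyl), length)   # group sizes
n_h <- 2 / (1 / n[["4"]] + 1 / n[["6"]])         # harmonic mean of the two sizes
crit <- qtukey(0.95, nmeans = k, df = v) * sqrt(mse / n_h)
crit
```

The value of `crit` should match the half-width of the corresponding interval in `tuk$cyl`, that is `upr - diff` in the `"6-4"` row.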
Test | Purpose | Following distribution | Statistic calculation |
---|---|---|---|
Duncan’s new multiple range test (MRT) | A modification of SNK, using the SSR value instead of the q-value. Makes pairwise comparisons using a stepwise order of comparisons identical to the order used by SNK, but sets a protection level for the error rate for the collection of tests, rather than an error rate for individual tests. Reduces type II error but increases type I error. | Duncan’s Multiple Range Distribution, \(SSR\) | - |
Test | Purpose | Following distribution | Statistic calculation |
---|---|---|---|
Ryan-Einot-Gabriel-Welsch-Fisher (REGWF) | A modification of the F test using the REGW multiple stepdown procedure, from smallest to largest. Recommended for balanced designs, i.e. equal numbers of observations per level. | \(F\)-distribution | \(\alpha_k = \alpha\) when \(k=g\) or \(k=g-1\), or \(\alpha_k = 1-(1-\alpha)^{k/g}\) when \(k<g-1\), where \(g\) is the number of means in the group being tested and \(k\) is the number of means in a subset |
Waller-Duncan | Multiple comparison test based on a t statistic; uses a Bayesian approach. An ANOVA must be fitted first. | - | - |
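The REGW significance levels in the table above are simple to tabulate. The helper `regw_alpha` below is a hypothetical name introduced for this sketch; it only evaluates the stated formula for a subset of \(k\) means out of \(g\) and does not run the stepdown test itself.

```r
# Adjusted significance level for a subset of k means out of g (REGW rule)
regw_alpha <- function(k, g, alpha = 0.05) {
  ifelse(k >= g - 1,
         alpha,                       # k = g or k = g - 1: unadjusted level
         1 - (1 - alpha)^(k / g))     # k < g - 1: stricter level
}

g <- 5
data.frame(k = 2:g, alpha_k = regw_alpha(2:g, g))
```

Smaller subsets are tested at stricter levels, which is how the procedure keeps the family-wise error rate under control while stepping down.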