This is a study note for:
with the \(pwr\) package. I mainly use this as a cheat sheet for minimum sample size calculations.
The advantage here is the randomization: any differences that appear in the post-test should be the result of the experimental variable rather than of differences that existed between the groups to start with. This is the classical type of experimental design and has good internal validity. Its external validity, or generalizability, is limited by the possible effect of pre-testing. The Solomon Four-Group Design accounts for this.
Blocking | Randomly selected Group | A/A testing (pre-test) | features | A/B/n testing (post-test) |
---|---|---|---|---|
A | \(Group_{A,1}\) | O | \(features_1\) | O |
A | \(...\) | O | … | O |
A | \(Group_{A,n}\) | O | \(features_n\) | O |
A | \(Group_{A,control}\) | O | \(control\) | O |
========== | ========== | ========== | ========== | ========== |
B | \(Group_{B,1}\) | O | \(features_1\) | O |
B | \(...\) | O | … | O |
B | \(Group_{B,n}\) | O | \(features_n\) | O |
B | \(Group_{B,control}\) | O | \(control\) | O |
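
The layout in the table above (randomly selected groups inside each block) can be sketched in a few lines. This is only an illustrative sketch: the function `assign_groups`, the user IDs, and the block labels are hypothetical names that do not come from the original note.

```python
import random
from collections import Counter, defaultdict

def assign_groups(users, block_of, group_names, seed=0):
    """Shuffle users within each block, then deal them round-robin into groups."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    by_block = defaultdict(list)
    for user in users:
        by_block[block_of[user]].append(user)

    assignment = {}
    for block, members in by_block.items():
        rng.shuffle(members)  # randomization happens inside the block
        for i, user in enumerate(members):
            assignment[user] = (block, group_names[i % len(group_names)])
    return assignment

# Hypothetical data: 12 users, two blocks (A and B), two features plus a control.
users = [f"user_{i}" for i in range(12)]
block_of = {u: ("A" if i < 6 else "B") for i, u in enumerate(users)}
groups = ["feature_1", "feature_2", "control"]

assignment = assign_groups(users, block_of, groups)
counts = Counter(assignment.values())  # users per (block, group) cell
```

Dealing the shuffled users round-robin keeps the group sizes balanced within each block, which is what the equal-sample-size formulas later in this note assume.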
Intend to improve the funnel model
Choosing and characterizing metrics:
Factors that should be considered in A/B/n testing:
Additional notation for intermediate parameters
A/A testing
A/A testing should be conducted before the A/B stage of A/B/n testing for the following purposes. The reasons are described in the following subsections.
Sanity check
The sanity check has to be passed before running A/B testing. It passes when there is no significant difference in all or most metrics among the groups.
Passing the sanity check means there is no significant difference in mean and variance among the groups, which is consistent with every sample sharing the same mean and variance. Estimating the mean and variance of the control group helps to determine the minimum sample size requirement, \(n = (\frac{Z\sigma}{E})^2\). For example, the mean value observed in A/A testing can be used as the baseline, \(x_0\) or \(x_1\), for the lift of an experimental group over the control group, so the margin of error is \(E = x_1 - x_0\) or \(E = x_2 - x_1\).
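
As a rough illustration of \(n = (\frac{Z\sigma}{E})^2\), here is a minimal sketch using only the Python standard library. It assumes \(Z\) is the two-sided critical value \(Z_{1-\alpha/2}\); the function name and the numbers (\(\sigma = 10\), \(E = 2\)) are made up for the example.

```python
import math
from statistics import NormalDist

def sample_size_for_margin(sigma, margin, confidence=0.95):
    """Smallest n with (Z * sigma / E)^2 <= n, using a two-sided Z (assumption)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # Z_{1 - alpha/2}
    return math.ceil((z * sigma / margin) ** 2)

# Hypothetical inputs: sigma estimated from the A/A control group, E = desired lift.
n = sample_size_for_margin(sigma=10.0, margin=2.0)
```

The result is rounded up because a fractional observation cannot be collected.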
Type of Outcome | One Sample | Two Independent Samples | Matched Samples |
---|---|---|---|
Dichotomous (Bernoulli) | \(\begin{matrix} n = (\frac{Z_{1-\alpha/2}+Z_{1-\beta}}{ES})^2 \\ ES = \frac{p_1-p_0}{\sqrt{p_1(1-p_1)}} \end{matrix}\) | \(\begin{matrix} n = 2(\frac{Z_{1-\alpha/2}+Z_{1-\beta}}{ES})^2 \\ ES = \frac{|p_2-p_1|}{\sqrt{p(1-p)}} \\ p = \frac{p_1 + p_2}{2} \end{matrix}\) | Not applicable |
Continuous (Gaussian) | \(\begin{matrix} n = (\frac{Z_{1-\alpha/2}+Z_{1-\beta}}{ES})^2 \\ ES = \frac{|\mu_1-\mu_0|}{\sigma} \end{matrix}\) | \(\begin{matrix} n = 2(\frac{Z_{1-\alpha/2}+Z_{1-\beta}}{ES})^2 \\ ES = \frac{|\mu_1-\mu_2|}{\sigma} \\ \sigma = S_p = \sqrt{\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}} \end{matrix}\) | \(\begin{matrix} n = (\frac{Z_{1-\alpha/2}+Z_{1-\beta}}{ES})^2 \\ ES = \frac{\mu_d}{\sigma_d} \end{matrix}\) |
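
The two-independent-samples formulas in the table can be checked by hand with a short sketch. It follows the table's normal-approximation formulas directly and uses only the Python standard library. The function names and input values are illustrative assumptions. As far as I know, `pwr::pwr.2p.test()` works with an arcsine-transformed effect size instead, so its answers will differ slightly from these.

```python
import math
from statistics import NormalDist

def z_sum(alpha=0.05, power=0.80):
    """Z_{1-alpha/2} + Z_{1-beta}, where power = 1 - beta."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)

def n_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group n for two independent samples with a dichotomous outcome."""
    p = (p1 + p2) / 2                              # average proportion
    es = abs(p2 - p1) / math.sqrt(p * (1 - p))     # effect size from the table
    return math.ceil(2 * (z_sum(alpha, power) / es) ** 2)

def n_two_means(mu1, mu2, sigma, alpha=0.05, power=0.80):
    """Per-group n for two independent samples with a continuous outcome."""
    es = abs(mu1 - mu2) / sigma                    # sigma stands in for pooled S_p
    return math.ceil(2 * (z_sum(alpha, power) / es) ** 2)

# Hypothetical inputs: conversion 10% -> 12%; mean 100 -> 105 with sigma = 20.
n_prop = n_two_proportions(0.10, 0.12)
n_mean = n_two_means(100.0, 105.0, sigma=20.0)
```

Both functions return the sample size per group, so the total traffic needed is twice that for a single treatment-versus-control comparison.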
Type of Outcome | One Sample | Two Independent Samples | Matched Samples |
---|---|---|---|
Dichotomous (Bernoulli) | pwr::pwr.p.test() | pwr::pwr.2p.test() | - |
Continuous (Gaussian) | pwr::pwr.t.test(), stats::t.test() | pwr::pwr.t.test(), stats::t.test() | pwr::pwr.t.test(), stats::t.test() |
Type of Outcome | One Sample | Two Independent Samples | Matched Samples |
---|---|---|---|
Dichotomous (Bernoulli) | pwr::pwr.p.test() | pwr::pwr.2p.test(), pwr::pwr.2p2n.test() (different sizes) | - |
Continuous (Gaussian) | pwr::pwr.t.test() | pwr::pwr.t.test(), pwr::pwr.t2n.test() (different sizes) | pwr::pwr.t.test() |
Other functions:

- pwr::pwr.2p2n.test(): Power calculations for two proportions (different sample sizes)
- pwr::pwr.t2n.test(): Power calculations for two-sample t-tests of means (different sample sizes)
- pwr::pwr.norm.test(): Power calculations for the mean of a normal distribution (known variance)
- pwr::pwr.r.test(): Power calculations for the correlation test
- pwr::pwr.anova.test(): Power calculations for balanced one-way analysis of variance tests
- pwr::pwr.chisq.test(): Power calculations for chi-squared tests
- pwr::pwr.f2.test(): Power calculations for the general linear model

A/B/n testing is the testing of more than two offers (or experiences) against each other, where n is the number of offers tested simultaneously. When conducting multiple analyses on the same dependent variable, the chance of committing a Type I error increases, which raises the likelihood of obtaining a significant result by pure chance.
To correct for this, or to protect against Type I error, a Bonferroni correction is applied by lowering the significance level for each individual test to a more stringent value, which makes a Type I error less likely.
To get the Bonferroni-corrected significance level, divide the original \(\alpha\) by the number of analyses on the dependent variable. The researcher assigns a new alpha to each analysis so that the overall Type I error rate for the set of analyses does not exceed some critical value:
\[\alpha_{critical} = 1 - (1 - \alpha_{alter})^k\]
where \(k\) = the number of comparisons on the same dependent variable.
Let’s see how the Bonferroni correction avoids the inflated Type I error. For example, if you compare five offers in an A(control)/B/C/D/E test, you effectively form four comparisons against the control:
Term | Without correction | With Bonferroni correction |
---|---|---|
The significance level for an individual test | \(\alpha_{original} = 5\%\) | \(\alpha_{alter} = \frac{\alpha_{original}}{k} = \frac{5\%}{4} = 1.25\%\) |
The confidence level for an individual test, \(1-\alpha\) | \(100\% - 5\% = 95\%\) | \(100\% - 1.25\% = 98.75\%\) |
The significance level for the overall test, \(\alpha_{critical} = 1 - (1 - \alpha_{alter})^k\) | \(1 - (1 - 5\%)^4 = 18.55\%\) | \(1 - (1 - 1.25\%)^4 = 4.91\% \cong 5\%\) |
The confidence level for the overall test, \(1-\alpha_{critical}\) | \(100\% - 18.55\% = 81.45\%\) | \(100\% - 4.91\% = 95.09\% \cong 95\%\) |
(source: Bonferroni Correction)
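
The numbers in the table can be reproduced with a short sketch. The function names are my own, and the formula \(1 - (1 - \alpha)^k\) treats the \(k\) comparisons as independent, which is an assumption rather than something the table states.

```python
def bonferroni_alpha(alpha, k):
    """Per-test significance level after Bonferroni correction."""
    return alpha / k

def familywise_error(alpha_per_test, k):
    """Chance of at least one Type I error across k independent comparisons."""
    return 1 - (1 - alpha_per_test) ** k

k = 4  # A(control) vs. B, C, D, E
uncorrected = familywise_error(0.05, k)        # no correction
alpha_alter = bonferroni_alpha(0.05, k)        # 5% / 4
corrected = familywise_error(alpha_alter, k)   # with Bonferroni correction

print(f"{uncorrected:.2%} {alpha_alter:.2%} {corrected:.2%}")
```

The corrected overall level comes out slightly below the original 5%, which is why the Bonferroni correction is usually described as conservative.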
It is extremely hard to design an A/B test for marketplaces or social networks, since the users are all connected.
(source: Ten common A/B testing pitfalls and how to avoid them)
Pitfalls | Solution |
---|---|
Ignoring the effects of the significance level | use the Bonferroni correction to avoid the inflated Type I error, \(\alpha_{alter} = \frac{\alpha_{original}}{k}\), \(\alpha_{critical} = 1 - (1 - \alpha_{alter})^k\). |
Ignoring the effects of statistical power | decide on the statistical power, \(1 - \beta\), before the experiment. |
Stopping tests prematurely | determine an adequate sample size, \(n\), in advance |
Using one-tailed tests | use a two-tailed test to give each alternative an equal chance to prove itself as the winner, i.e. use \(Z_{1-\alpha/2}\) |
Not considering novelty effects | compare the results for new visitors with those for returning visitors. |
Not considering differences in the consideration period | after new entries to the test have been stopped, allow some time for visitors who were exposed to the test offers to convert |
Using metrics that do not reflect business objectives | use a metric that has more impact on the business goal, if possible. |
Pitfalls | Solution |
---|---|
Monitoring tests | do not draw conclusions or stop the test before the required sample size is reached |
Changing the traffic allocation during the testing period | do not change the traffic allocation percentages during the testing period |
Pitfalls | Solution |
---|---|
Declaring winners of multiple offer tests with no statistically significant difference | when the top results do not differ significantly, treat those offers as tied rather than declaring a single winner. |