Use of this document

This is a study note on A/B testing using the \(pwr\) package. I mainly use it as a cheat sheet for minimal sample size calculation.

1. Introduction

1.1 Randomized Block Design for Controlled Experiment

The advantage here is the randomization: any differences that appear in the post-test should be the result of the experimental variable rather than pre-existing differences between the groups. This is the classical type of experimental design and has good internal validity. The external validity, or generalizability, of the study is limited by the possible effect of pre-testing; the Solomon Four-Group Design accounts for this.

  • Blocking: slice the data by distinguishing characteristics (e.g. location, time) to isolate the groups.
  • Random selection: randomization helps reduce the differences between groups within each block.
  • A/A testing: the pre-test verifies whether the randomly selected groups are identical.
  • A/B or A/B/n testing: the post-test tests the difference among groups.
Controlled groups, random selection, pre-test, and post-test in a randomized block design:

| Blocking | Randomly selected group | A/A testing (pre-test) | Features | A/B/n testing (post-test) |
|---|---|---|---|---|
| A | \(Group_{A,1}\) | O | \(features_1\) | O |
| A | \(...\) | O | | O |
| A | \(Group_{A,n}\) | O | \(features_n\) | O |
| A | \(Group_{A,control}\) | O | \(control\) | O |
| B | \(Group_{B,1}\) | O | \(features_1\) | O |
| B | \(...\) | O | | O |
| B | \(Group_{B,n}\) | O | \(features_n\) | O |
| B | \(Group_{B,control}\) | O | \(control\) | O |

1.2 Practice in a business setting

The intent is to improve the funnel model.

  • Improvement: new features, additions to the UI, a different look of the website, ranking changes, backend loading-time changes, testing the layout of the initial page.
  • Policy and ethics for experiments: risk, benefits, alternatives, data sensitivity.
  • Objectives of testing: invariant checking, evaluation.

Choosing and characterizing metrics:

  • Define a metric: a single metric (probability of an event, click-through rate) or a composite metric (an objective function or an objective evaluation criterion).
  • Build intuition of metrics: de-bias the data by filtering both external factors (malicious or fraudulent visits, etc.) and internal factors (a change may only affect the traffic of a subset of users, for example, it may only apply to English versions of the website).


1.3 Type I error and Type II error

A Type I error (false positive) occurs when a true null hypothesis is rejected; its probability is the significance level \(\alpha\). A Type II error (false negative) occurs when a false null hypothesis is not rejected; its probability is \(\beta\), and \(1-\beta\) is the power of the test.

1.4 Notation of variables

Factors to consider in A/B/n testing:

  • Margin of error: general form \(E = \frac{Z \sigma}{\sqrt{n}}\); \(E = \mu_2-\mu_1\) for a continuous outcome; \(E = p_2 - p_1\) for a dichotomous outcome;
  • Alpha level, or significance level: usually \(\alpha = 0.05\), meaning a 95% probability of avoiding a false positive;
  • Power: usually \(1-\beta = 0.8\) (i.e. \(\beta = 0.2\)), meaning an 80% probability of avoiding a false negative;
  • Sample size: \(n = (\frac{Z\sigma}{E})^2\); see the sketch below.
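
As a quick illustration of the sample size formula, here is a minimal R sketch; the values of \(\sigma\) and \(E\) are assumptions chosen for the example.

```r
# Minimal sketch of n = (Z * sigma / E)^2 at a 95% confidence level.
alpha <- 0.05
Z     <- qnorm(1 - alpha / 2)  # Z-score for 1 - alpha/2, about 1.96
sigma <- 10                    # assumed standard deviation of the outcome
E     <- 2                     # assumed margin of error (mu_2 - mu_1)
n     <- ceiling((Z * sigma / E)^2)
n                              # minimal sample size, 97 here
```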

Additional notation for intermediate parameters:

  • \(Z\): standard score or Z-score;
  • \(E\): margin of error, which is \(\mu_2 - \mu_1\) for a continuous outcome, or \(p_2-p_1\) for a dichotomous outcome;
  • \(ES\): effect size, \(\frac{E}{\sigma}\), i.e. the margin of error on the Z-scale;
  • \(1-\alpha\): confidence level;
  • \(1-\beta\): power of the test;
  • \(\mu\): mean of the sample for a continuous outcome;
  • \(p\): proportion of successes for a dichotomous outcome.

2. A/A testing

A/A testing should be conducted before A/B or A/B/n testing, for the reasons described in the following subsections.

2.1 Sanity check

The sanity check has to pass before running A/B testing. It passes when there is no significant difference in all (or most) metrics among the groups.
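
A minimal sketch of such a check on a dichotomous metric (the conversion counts for the two randomly split groups are assumed for illustration):

```r
# Compare the conversion rates of two A/A groups with a two-proportion test.
conversions <- c(102, 98)        # successes in groups A1 and A2 (assumed)
visitors    <- c(1000, 1000)     # group sizes (assumed)
prop.test(conversions, visitors) # a large p-value suggests the split is sane
```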

2.2 Control group statistic

Passing the sanity check means there is no significant difference in mean and variance among the groups, i.e. the mean and variance are effectively identical in each sample. Estimating the mean and variance of the control group helps determine the minimal sample size requirement, \(n = (\frac{Z\sigma}{E})^2\). For example, the mean value from A/A testing can be used as the baseline, \(x_0\) or \(x_1\), for the lift of the experimental groups over the control group, so the margin of error is \(E = x_1 - x_0\) or \(E = x_2 - x_1\).
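
For instance, a minimal sketch of turning an A/A baseline into a margin of error, assuming a control mean of 0.10 and a hypothetical target lift of 5%:

```r
x0 <- 0.10      # baseline metric estimated from A/A testing (assumed)
x1 <- x0 * 1.05 # target value for the experimental group (assumed 5% lift)
E  <- x1 - x0   # margin of error to plug into n = (Z * sigma / E)^2
E               # 0.005
```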

3. A/B testing

3.1 Minimal sample size requirement

Consider both the significance level \(\alpha\) and the power \(1- \beta\) when conducting hypothesis testing; neglect the power \(1-\beta\) when estimating a confidence interval. (source: Power and Sample Size Determination)

| Type of outcome | One sample | Two independent samples | Matched sample |
|---|---|---|---|
| Dichotomous (Bernoulli) | \(\begin{matrix} n = (\frac{Z_{1-\alpha/2}+Z_{1-\beta}}{ES})^2 \\ ES = \frac{p_1-p_0}{\sqrt{p_1(1-p_1)}} \end{matrix}\) | \(\begin{matrix} n = 2(\frac{Z_{1-\alpha/2}+Z_{1-\beta}}{ES})^2 \\ ES = \frac{|p_2-p_1|}{\sqrt{p(1-p)}} \\ p = \frac{p_1 + p_2}{2} \end{matrix}\) | Not applicable |
| Continuous (Gaussian) | \(\begin{matrix} n = (\frac{Z_{1-\alpha/2}+Z_{1-\beta}}{ES})^2 \\ ES = \frac{|\mu_1-\mu_0|}{\sigma} \end{matrix}\) | \(\begin{matrix} n = 2(\frac{Z_{1-\alpha/2}+Z_{1-\beta}}{ES})^2 \\ ES = \frac{|\mu_1-\mu_2|}{\sigma} \\ \sigma = S_p = \sqrt{\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}} \end{matrix}\) | \(\begin{matrix} n = (\frac{Z_{1-\alpha/2}+Z_{1-\beta}}{ES})^2 \\ ES = \frac{\mu_d}{\sigma_d} \end{matrix}\) |
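
For example, a minimal sketch of the two-independent-samples formula for a dichotomous outcome; the baseline rate \(p_1 = 0.10\) and target rate \(p_2 = 0.12\) are assumptions for illustration:

```r
# n = 2 * ((Z_{1-alpha/2} + Z_{1-beta}) / ES)^2 with a pooled proportion.
alpha <- 0.05; beta <- 0.20
p1 <- 0.10; p2 <- 0.12                 # assumed control and target rates
p  <- (p1 + p2) / 2                    # pooled proportion
ES <- abs(p2 - p1) / sqrt(p * (1 - p)) # effect size on the Z-scale
n  <- ceiling(2 * ((qnorm(1 - alpha/2) + qnorm(1 - beta)) / ES)^2)
n                                      # minimal sample size per group
```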

3.2 Calculate sample size

R functions to calculate sample size for A/B testing.

| Type of outcome | One sample | Two independent samples | Matched sample |
|---|---|---|---|
| Dichotomous (Bernoulli) | pwr::pwr.p.test() | pwr::pwr.2p.test() | - |
| Continuous (Gaussian) | pwr::pwr.t.test(), stats::t.test() | pwr::pwr.t.test(), stats::t.test() | pwr::pwr.t.test(), stats::t.test() |
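
A minimal sketch of the same calculation with pwr; the rates and Cohen's d below are assumed for illustration. Note that pwr.2p.test() expects the arcsine-transformed effect size h from ES.h(), not the Z-scale \(ES\) above.

```r
library(pwr)

# Dichotomous outcome: effect size h via the arcsine transformation.
h <- ES.h(p1 = 0.12, p2 = 0.10)
pwr.2p.test(h = h, sig.level = 0.05, power = 0.80) # solves for n per group

# Continuous outcome: Cohen's d = |mu_1 - mu_2| / sigma (assumed d = 0.2).
pwr.t.test(d = 0.2, sig.level = 0.05, power = 0.80, type = "two.sample")
```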

3.3 Calculate significance of A/B testing

R functions to calculate the significance of A/B testing.

| Type of outcome | One sample | Two independent samples | Matched sample |
|---|---|---|---|
| Dichotomous (Bernoulli) | pwr::pwr.p.test() | pwr::pwr.2p.test(), pwr::pwr.2p2n.test() (different sizes) | - |
| Continuous (Gaussian) | pwr::pwr.t.test() | pwr::pwr.t.test(), pwr::pwr.t2n.test() (different sizes) | pwr::pwr.t.test() |
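
A minimal sketch of the reverse direction: given (possibly unequal) group sizes, leave the power unset and solve for the achieved power; all numbers are assumptions for illustration.

```r
library(pwr)

# Dichotomous outcome with unequal group sizes.
h <- ES.h(p1 = 0.12, p2 = 0.10)
pwr.2p2n.test(h = h, n1 = 1200, n2 = 1000, sig.level = 0.05) # solves for power

# Continuous outcome with unequal group sizes (assumed d = 0.2).
pwr.t2n.test(d = 0.2, n1 = 1200, n2 = 1000, sig.level = 0.05)
```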

Other functions:

  • pwr::pwr.2p2n.test(): power calculation for two proportions (different sample sizes);
  • pwr::pwr.t2n.test(): power calculations for two-sample t-tests of means (different sample sizes);
  • pwr::pwr.norm.test(): power calculations for the mean of a normal distribution (known variance);
  • pwr::pwr.r.test(): power calculations for correlation tests;
  • pwr::pwr.anova.test(): power calculations for balanced one-way analysis of variance tests (see the sketch after this list);
  • pwr::pwr.chisq.test(): power calculations for chi-squared tests;
  • pwr::pwr.f2.test(): power calculations for the general linear model.
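
For an A/B/n layout with a single continuous metric, pwr::pwr.anova.test() can size the balanced design; the number of groups and the effect size f below are assumptions for illustration.

```r
library(pwr)

# Balanced one-way ANOVA: k groups (control plus the offers), small effect f.
pwr.anova.test(k = 5, f = 0.1, sig.level = 0.05, power = 0.80) # n per group
```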

4. A/B/n testing

A/B/n testing means testing more than two offers (or experiences) against each other, where n is the number of offers tested simultaneously. When conducting multiple analyses on the same dependent variable, the chance of committing a Type I error increases, and with it the likelihood of obtaining a significant result by pure chance.
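
A minimal simulation sketch of this inflation: under the null hypothesis p-values are uniform on \([0, 1]\), so with \(k = 4\) independent comparisons at \(\alpha = 0.05\) the chance of at least one false positive approaches \(1 - (1 - 0.05)^4 \approx 18.55\%\).

```r
# Simulate the family-wise Type I error rate for k independent comparisons.
set.seed(1)
k <- 4; alpha <- 0.05; reps <- 10000
any_fp <- replicate(reps, any(runif(k) < alpha)) # null p-values are uniform
mean(any_fp)                                     # about 0.185
```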

4.1 Bonferroni correction for multiple comparisons

To correct for this, i.e. to protect against Type I error, a Bonferroni correction is applied by lowering the significance level to a more stringent value, thus making it less likely to commit a Type I error.

To get the Bonferroni corrected/adjusted significance level, divide the original \(\alpha\) by the number of analyses on the dependent variable, \(\alpha_{alter} = \frac{\alpha_{original}}{k}\). The researcher thereby assigns a new alpha for the set of dependent variables (or analyses) such that the overall level does not exceed the critical value

\[\alpha_{critical} = 1 - (1 - \alpha_{alter})^k\]

where \(k\) is the number of comparisons on the same dependent variable.
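
A minimal sketch of the correction in R; stats::p.adjust() applies the same idea directly to observed p-values (the p-values below are assumed for illustration).

```r
alpha <- 0.05
k     <- 4
alpha_alter    <- alpha / k               # per-comparison level, 1.25%
alpha_critical <- 1 - (1 - alpha_alter)^k # family-wise level, about 4.91%

# Equivalently, adjust observed p-values instead of the threshold:
p.adjust(c(0.01, 0.04, 0.03, 0.20), method = "bonferroni")
```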

4.2 Comparison of Type I error

Let’s see how the Bonferroni correction avoids the inflated Type I error. For example, if you compare five offers in an A (control)/B/C/D/E test, you effectively form four comparisons:

  1. control to B,
  2. control to C,
  3. control to D,
  4. control to E.
Comparison of the significance level and confidence level with and without correction, for individual and overall testing, where \(k = 4\):

| Term | Without correction | With Bonferroni correction |
|---|---|---|
| Significance level for an individual test | \(\alpha_{original} = 5\%\) | \(\alpha_{alter} = \frac{\alpha_{original}}{k} = \frac{5\%}{4} = 1.25\%\) |
| Confidence level for an individual test, \(1-\alpha\) | \(100\% - 5\% = 95\%\) | \(100\% - 1.25\% = 98.75\%\) |
| Significance level for the overall test, \(\alpha_{critical} = 1 - (1 - \alpha_{alter})^k\) | \(1 - (1 - 5\%)^4 = 18.55\%\) | \(1 - (1 - 1.25\%)^4 = 4.91\% \approx 5\%\) |
| Confidence level for the overall test, \(1-\alpha_{critical}\) | \(1 - 18.55\% = 81.45\%\) | \(100\% - 4.91\% = 95.09\% \approx 95\%\) |

(source: Bonferroni Correction)

5. Ten common testing pitfalls and how to avoid them

It is extremely hard to design an A/B test in marketplaces or social networks, since users are all connected.

(source: Ten common A/B testing pitfalls and how to avoid them)

5.1 Experimental design stage

| Pitfall | Solution |
|---|---|
| Ignoring the effects of the significance level | Use the Bonferroni correction to avoid the inflated Type I error: \(\alpha_{alter} = \frac{\alpha_{original}}{k}\), \(\alpha_{critical} = 1 - (1 - \alpha_{alter})^k\). |
| Ignoring the effects of statistical power | Set the statistical power, \(1 - \beta\), before the experiment. |
| Stopping tests prematurely | Determine an adequate sample size, \(n\), in advance. |
| Using one-tailed tests | Use a two-tailed test to give each alternative an equal chance to prove itself as the winner, i.e. use \(Z_{1-\alpha/2}\). |
| Not considering novelty effects | Compare the results of the subject groups (usually visitors) between new and returning visitors. |
| Not considering differences in the consideration period | Allow some time for visitors who were exposed to the test offers to convert after new entries to the test have stopped. |
| Using metrics that do not reflect business objectives | Use a metric with more impact on the business goal, if possible. |

5.2 During experiment stage

| Pitfall | Solution |
|---|---|
| Monitoring tests | Do not draw conclusions or stop the test before the required sample size is reached. |
| Changing the traffic allocation during the testing period | Do not change the traffic allocation percentages during the testing period. |

5.3 Drawing conclusion stage

| Pitfall | Solution |
|---|---|
| Declaring winners of multiple-offer tests with no statistically significant difference | Treat the highest-performing offers as tied when their differences are not statistically significant. |