6/16/2019
The email A/B test we will analyze was conducted by an online wine store.
Source: Total Wine & More
Test setting: email to retailer email list
Unit: email address
Treatments: email version A, email version B, holdout
Response: open, click, and 30-day purchase ($)
Selection: all active customers
Assignment: randomly assigned (1/3 each)
d <- read.csv("test_data.csv")
head(d)
## user_id cpgn_id group email open click purch chard sav_blanc syrah
## 1 1000001 1901Email ctrl FALSE 0 0 0.00 0.00 0.00 33.94
## 2 1000002 1901Email email_B TRUE 1 0 0.00 0.00 0.00 16.23
## 3 1000003 1901Email email_A TRUE 1 1 200.51 516.39 0.00 16.63
## 4 1000004 1901Email email_A TRUE 1 0 0.00 0.00 0.00 0.00
## 5 1000005 1901Email email_A TRUE 1 1 158.30 426.53 1222.48 0.00
## 6 1000006 1901Email email_B TRUE 1 0 0.00 0.00 0.00 0.00
## cab past_purch days_since visits
## 1 0.00 33.94 119 11
## 2 76.31 92.54 60 3
## 3 0.00 533.02 9 9
## 4 41.21 41.21 195 6
## 5 0.00 1649.01 48 9
## 6 0.00 0.00 149 6
Everything measured after the randomization that could possibly be affected by the treatment is an outcome.
summary(d$group)
## ctrl email_A email_B
## 41330 41329 41329
This is a completely randomized experiment.
summary(d[,c("open", "click", "purch")])
## open click purch
## Min. :0.0000 Min. :0.00000 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.: 0.00
## Median :0.0000 Median :0.00000 Median : 0.00
## Mean :0.4567 Mean :0.07503 Mean : 21.30
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.: 21.86
## Max. :1.0000 Max. :1.00000 Max. :1607.40
summary(d[,c("days_since", "visits", "past_purch")])
## days_since visits past_purch
## Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.: 26.00 1st Qu.: 4.000 1st Qu.: 0.00
## Median : 63.00 Median : 6.000 Median : 91.22
## Mean : 89.98 Mean : 5.946 Mean : 188.79
## 3rd Qu.:125.00 3rd Qu.: 7.000 3rd Qu.: 246.87
## Max. :992.00 Max. :51.000 Max. :9636.92
summary(d[, c("chard", "sav_blanc", "syrah", "cab")])
## chard sav_blanc syrah cab
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.00
## Mean : 73.31 Mean : 72.45 Mean : 26.68 Mean : 16.35
## 3rd Qu.: 54.06 3rd Qu.: 57.42 3rd Qu.: 20.91 3rd Qu.: 12.96
## Max. :9636.92 Max. :6609.92 Max. :2880.15 Max. :2365.90
Whoa! That’s a lot of chardonnay for one customer!
What is the first question you should ask about an A/B test?
Did the treatment affect the response?
Was the randomization done correctly?
How could we check the randomization with the data?
Randomization checks confirm that the baseline variables are distributed similarly for the treatment and control groups.
Averages of baseline variables by treatment group
d %>% group_by(group) %>% summarize(mean(days_since), mean(visits), mean(past_purch))
## # A tibble: 3 x 4
## group `mean(days_since)` `mean(visits)` `mean(past_purch)`
## <fct> <dbl> <dbl> <dbl>
## 1 ctrl 90.0 5.95 188.
## 2 email_A 90.2 5.95 188.
## 3 email_B 89.8 5.94 190.
Group means are similar across groups.
Purchase incidence by group is also similar.
## # A tibble: 3 x 2
## group `mean(past_purch > 0)`
## <fct> <dbl>
## 1 ctrl 0.744
## 2 email_A 0.741
## 3 email_B 0.741
About 3/4 of the email list has purchased in the past, and this is similar across the randomized treatments.
The full distributions of baseline variables should also be the same between treatment groups.
Compare the past purchases in each wine category (cab, etc.) to confirm that the randomization produced groups with similar distributions.
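One way to compare full distributions is the Kolmogorov-Smirnov test (ks.test), which compares the empirical CDFs of two groups. A minimal sketch, using simulated data in place of test_data.csv (the variable names and distribution here are stand-ins, not the retailer's actual data):

```r
# Simulated stand-in for the wine retailer data
set.seed(19104)
sim <- data.frame(
  group = rep(c("ctrl", "email_A"), each = 5000),
  chard = rexp(10000, rate = 1/73)  # same distribution in both groups by construction
)
# Kolmogorov-Smirnov test: do the two groups have the same distribution of chard spending?
ks <- ks.test(sim$chard[sim$group == "ctrl"],
              sim$chard[sim$group == "email_A"])
ks$p.value  # a large p-value is consistent with successful randomization
```

With a correct randomization you would run this (or overlay density plots) for each baseline category and expect no meaningful differences.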
d %>% group_by(group) %>% summarize(mean(open), mean(click), mean(purch))
## # A tibble: 3 x 4
## group `mean(open)` `mean(click)` `mean(purch)`
## <fct> <dbl> <dbl> <dbl>
## 1 ctrl 0 0 12.4
## 2 email_A 0.718 0.132 25.6
## 3 email_B 0.652 0.0934 25.9
Email A looks better for opens and clicks, but maybe not purchases. Both emails seem to generate higher average purchases than the control.
Create a new data set with just the customers who received emails.
d_treat <- d[d$group != "ctrl",]
d_treat$group <- droplevels(d_treat$group)
xtabs(~ group + open, data=d_treat)
## open
## group 0 1
## email_A 11643 29686
## email_B 14395 26934
Excluding one treatment group from the analysis is legitimate: because assignment was random, the two remaining groups still form a valid two-arm experiment.
prop.test(xtabs(~ group + open, data=d_treat)[,2:1])
##
## 2-sample test for equality of proportions with continuity correction
##
## data: xtabs(~group + open, data = d_treat)[, 2:1]
## X-squared = 424.32, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.06024628 0.07292897
## sample estimates:
## prop 1 prop 2
## 0.7182850 0.6516974
mosaicplot(xtabs(~ group + open, data=d_treat), main="Wine Retailer Test: Email Opens")
xtabs(~ group + click, data=d_treat)
## click
## group 0 1
## email_A 35887 5442
## email_B 37468 3861
prop.test(xtabs(~ group + click, data=d_treat)[,2:1])
##
## 2-sample test for equality of proportions with continuity correction
##
## data: xtabs(~group + click, data = d_treat)[, 2:1]
## X-squared = 302.38, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.03392871 0.04257931
## sample estimates:
## prop 1 prop 2
## 0.13167509 0.09342108
Note that we analyze click rate among all who received the email, ignoring whether or not they opened the email. There may be systematic differences in the types of customers who opened email A versus email B.
Average 30-day purchase amount by group
d %>% group_by(group) %>% summarize(mean(purch))
## # A tibble: 3 x 2
## group `mean(purch)`
## <fct> <dbl>
## 1 ctrl 12.4
## 2 email_A 25.6
## 3 email_B 25.9
Visualization: boxplot (old school)
Visualization: Violin plots with log scale
Visualization: Dotplot with log scale
t.test(purch ~ group, data=d[d$group != "ctrl",])
##
## Welch Two Sample t-test
##
## data: purch by group
## t = -0.59169, df = 82644, p-value = 0.5541
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.0498820 0.5629813
## sample estimates:
## mean in group email_A mean in group email_B
## 25.62284 25.86629
There is not a significant difference in average purchases between email A and email B.
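Because 30-day purchase amounts are heavily skewed (most are zero), a rank-based test is a sensible robustness check alongside the t-test. A sketch with simulated purchase amounts standing in for the retailer data (the 15% purchase rate and $170 mean are illustrative assumptions):

```r
set.seed(20190616)
# Simulated skewed purchases: most customers buy nothing, buyers have a long right tail
purch_A <- ifelse(runif(5000) < 0.15, rexp(5000, rate = 1/170), 0)
purch_B <- ifelse(runif(5000) < 0.15, rexp(5000, rate = 1/170), 0)
# Wilcoxon rank-sum test is much less sensitive to the long right tail than the t-test
wt <- wilcox.test(purch_A, purch_B)
wt$p.value
```

If the t-test and the rank-based test agree (as they do for emails A vs. B here), the conclusion is not being driven by a few huge purchases.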
t.test(purch ~ email, data=d)
##
## Welch Two Sample t-test
##
## data: purch by email
## t = -44.823, df = 107015, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -13.90691 -12.74164
## sample estimates:
## mean in group FALSE mean in group TRUE
## 12.42029 25.74456
Those who received an email have higher average purchases (difference in means ≈ $13.32, 95% CI = [$12.74, $13.91]).
If you can answer these questions, you have a test plan.
Business questions: Does email work? If so which email is better?
Test setting: email to retailer customers
Unit: email address
Treatments: email version A, email version B, holdout
Response: open, click, and 30-day purchase ($)
Selection: all active emails on email list (open in last 12 months)
Assignment: randomly assigned (1/3 each)
Sample size: 123,988 emails
Business question: Which version of a webpage?
Test setting: website (field)
Unit of analysis: visitor (cookie-tracked)
Treatments: versions A and B
Response variable: clicks, conversions
Selection of units: all who visit
Assignment to treatments: random (by testing sw)
Sample size: ???
If you repeatedly test for significance as the data come in and stop as soon as you see a significant difference, significance tests will erroneously detect effects that aren't there.
sig <- rep(0, 1000)
for (r in 1:1000) {
  A <- rnorm(101); B <- rnorm(101)
  pval <- rep(NA, 100)
  for (n in 1:100) pval[n] <- t.test(A[1:(n+1)], B[1:(n+1)])$p.value  # repeated testing
  if (min(pval) < 0.05) sig[r] <- 1  # any significance along the way
}
mean(sig)  # bigger than the nominal significance of 5%
## [1] 0.359
The standard recommendation is to set the sample size in advance and not test for significance until all the data are in.
WTF? Seriously? More on this later.
The recommended sample size is:
\(n_1 = n_2 \approx (z_{1-\alpha/2} + z_\beta)^2 \left( \frac{2 s^2}{d^2} \right)\)
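The formula can be evaluated directly in R and checked against power.t.test(). In this sketch, the SD of 55 is a stand-in for sd(d$purch) ≈ 54.8, since the data file isn't bundled here:

```r
# Sample size per group for a continuous response (normal approximation)
alpha <- 0.05   # two-sided significance level
pow   <- 0.80   # desired power
s     <- 55     # outcome SD (stand-in for sd(d$purch))
delta <- 1      # minimum detectable difference in means

# Hand calculation from the formula: (z_{1-alpha/2} + z_power)^2 * 2 s^2 / d^2
n_formula <- (qnorm(1 - alpha/2) + qnorm(pow))^2 * (2 * s^2 / delta^2)

# R's built-in calculator solves the same problem exactly (t-distribution)
n_exact <- power.t.test(sd = s, delta = delta, sig.level = alpha, power = pow)$n

round(c(formula = n_formula, power.t.test = n_exact))
```

For samples this large the normal approximation and power.t.test() agree to within a few units.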
Sample size to detect a $1 difference in average 30-day purchases:
sd(d$purch)
## [1] 54.82613
power.t.test(sd=sd(d$purch), delta=1, sig.level=0.05, power=0.80)
We need roughly 47,000 in each group.
There is a slightly different formula for:
Continuous response (e.g., money, time-on-site)
\(n_1 = n_2 \approx (z_{1-\alpha/2} + z_\beta)^2 \left( \frac{2 s^2}{d^2} \right)\)
Binary response (e.g., conversions)
\(n_1 = n_2 \approx (z_{1-\alpha/2} + z_\beta)^2 \left( \frac{2 p (1-p)}{d^2} \right)\)
Binary response
power.prop.test(p1=0.07, p2=0.07 + 0.01, sig.level=0.05, power=0.80)
##
## Two-sample comparison of proportions power calculation
##
## n = 10889.14
## p1 = 0.07
## p2 = 0.08
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
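As a check, the binary-response formula reproduces power.prop.test() almost exactly; the tiny gap is because power.prop.test() uses p1(1-p1) + p2(1-p2) rather than the pooled 2p(1-p):

```r
# Sample size per group for a binary response (normal approximation)
alpha <- 0.05; pow <- 0.80
p1 <- 0.07; p2 <- 0.08
p  <- (p1 + p2) / 2   # pooled rate
d  <- p2 - p1         # minimum detectable lift

# Hand calculation: (z_{1-alpha/2} + z_power)^2 * 2 p (1-p) / d^2
n_formula <- (qnorm(1 - alpha/2) + qnorm(pow))^2 * (2 * p * (1 - p) / d^2)

# R's built-in calculator for comparison
n_exact <- power.prop.test(p1 = p1, p2 = p2, sig.level = alpha, power = pow)$n

round(c(formula = n_formula, power.prop.test = n_exact))
```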
There are different sample size formulas floating around. These formulas differ in the assumptions they make about what you are trying to do, and it can be very hard to figure out which assumptions are being made (even for experts).
A decent sample size calculation will help you identify whether you are likely to end up with way too much or too little data.
prop.test() for significance of a difference in proportions
t.test() for significance of a difference in means