
Test data

Example email A/B test

The email A/B test we will analyze was conducted by an online wine store.

Source: Total Wine & More

Wine retailer email test

Test setting: email to retailer email list

Unit: email address

Treatments: email version A, email version B, holdout

Response: open, click, and 30-day purchase ($)

Selection: all active customers

Assignment: randomly assigned (1/3 each)

Wine retailer email test data

d <- read.csv("test_data.csv")
head(d)
##   user_id   cpgn_id   group email open click  purch  chard sav_blanc syrah
## 1 1000001 1901Email    ctrl FALSE    0     0   0.00   0.00      0.00 33.94
## 2 1000002 1901Email email_B  TRUE    1     0   0.00   0.00      0.00 16.23
## 3 1000003 1901Email email_A  TRUE    1     1 200.51 516.39      0.00 16.63
## 4 1000004 1901Email email_A  TRUE    1     0   0.00   0.00      0.00  0.00
## 5 1000005 1901Email email_A  TRUE    1     1 158.30 426.53   1222.48  0.00
## 6 1000006 1901Email email_B  TRUE    1     0   0.00   0.00      0.00  0.00
##     cab past_purch days_since visits
## 1  0.00      33.94        119     11
## 2 76.31      92.54         60      3
## 3  0.00     533.02          9      9
## 4 41.21      41.21        195      6
## 5  0.00    1649.01         48      9
## 6  0.00       0.00        149      6

Types of variables associated with a test

  • Treatment indicator (x)
    • Which (randomized) treatment was received
  • Response (y’s)
    • Outcome(s) measured for each customer, AKA the dependent variable
  • Baseline variables (z’s)
    • Other stuff we know about customers prior to the randomization
    • Sometimes called “pre-randomization covariates” or “observables”

Everything measured after the randomization that could possibly be affected by the treatment is an outcome.

Treatment indicator (x)

summary(d$group)
##    ctrl email_A email_B 
##   41330   41329   41329

This is a completely randomized experiment.

Responses (y’s)

  • open test email (load images)
  • click test email to visit website
  • purchases ($) in 30 days after email sent
summary(d[,c("open", "click", "purch")])
##       open            click             purch        
##  Min.   :0.0000   Min.   :0.00000   Min.   :   0.00  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:   0.00  
##  Median :0.0000   Median :0.00000   Median :   0.00  
##  Mean   :0.4567   Mean   :0.07503   Mean   :  21.30  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:  21.86  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1607.40

Baseline variables (z)

  • days since last activity
  • website visits
  • total past purchases ($)
summary(d[,c("days_since", "visits", "past_purch")]) 
##    days_since         visits         past_purch     
##  Min.   :  0.00   Min.   : 0.000   Min.   :   0.00  
##  1st Qu.: 26.00   1st Qu.: 4.000   1st Qu.:   0.00  
##  Median : 63.00   Median : 6.000   Median :  91.22  
##  Mean   : 89.98   Mean   : 5.946   Mean   : 188.79  
##  3rd Qu.:125.00   3rd Qu.: 7.000   3rd Qu.: 246.87  
##  Max.   :992.00   Max.   :51.000   Max.   :9636.92

More baseline variables

  • total past purchases by category ($)
summary(d[, c("chard", "sav_blanc", "syrah", "cab")]) 
##      chard           sav_blanc           syrah              cab         
##  Min.   :   0.00   Min.   :   0.00   Min.   :   0.00   Min.   :   0.00  
##  1st Qu.:   0.00   1st Qu.:   0.00   1st Qu.:   0.00   1st Qu.:   0.00  
##  Median :   0.00   Median :   0.00   Median :   0.00   Median :   0.00  
##  Mean   :  73.31   Mean   :  72.45   Mean   :  26.68   Mean   :  16.35  
##  3rd Qu.:  54.06   3rd Qu.:  57.42   3rd Qu.:  20.91   3rd Qu.:  12.96  
##  Max.   :9636.92   Max.   :6609.92   Max.   :2880.15   Max.   :2365.90


Whoa! That’s a lot of chardonnay for one customer!

Analysis of A/B tests

What is the first question you should ask about an A/B test?

Did the treatment affect the response?

Was the randomization done correctly?

 

How could we check the randomization with the data?

Randomization checks

Randomization checks confirm that the baseline variables are distributed similarly for the treatment and control groups.

Averages of baseline variables by treatment group

d %>% group_by(group) %>% summarize(mean(days_since), mean(visits), mean(past_purch))
## # A tibble: 3 x 4
##   group   `mean(days_since)` `mean(visits)` `mean(past_purch)`
##   <fct>                <dbl>          <dbl>              <dbl>
## 1 ctrl                  90.0           5.95               188.
## 2 email_A               90.2           5.95               188.
## 3 email_B               89.8           5.94               190.

Group means are similar between groups.

Randomization checks

Purchase incidence by group is also similar.

d %>% group_by(group) %>% summarize(mean(past_purch > 0))
## # A tibble: 3 x 2
##   group   `mean(past_purch > 0)`
##   <fct>                    <dbl>
## 1 ctrl                     0.744
## 2 email_A                  0.741
## 3 email_B                  0.741

About 3/4 of email list has purchased in the past and this is similar across randomized treatments.

Randomization checks

The full distributions of baseline variables should also be the same between treatment groups.
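
For example, one quick visual check overlays histograms of a baseline variable by treatment group. A minimal sketch using ggplot2 (the original slides presumably showed a similar plot; this exact code is illustrative):

library(ggplot2)
# Overlaid histograms: the three groups should look nearly identical
ggplot(d, aes(x = days_since, fill = group)) +
  geom_histogram(position = "identity", alpha = 0.3, bins = 40) +
  labs(x = "Days since last activity", y = "Count")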


Exercise

Compare the past purchases in each wine category (cab, etc.) to confirm that the randomization produced groups with similar distributions.
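
One possible starting point (group means by category; a fuller check would compare entire distributions, e.g., with histograms like the sketch above):

d %>% group_by(group) %>%
  summarize(mean(chard), mean(sav_blanc), mean(syrah), mean(cab))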

Randomization checks out! On to the treatment effects.

Did the treatments affect the responses?

d %>% group_by(group) %>% summarize(mean(open), mean(click), mean(purch))
## # A tibble: 3 x 4
##   group   `mean(open)` `mean(click)` `mean(purch)`
##   <fct>          <dbl>         <dbl>         <dbl>
## 1 ctrl           0            0               12.4
## 2 email_A        0.718        0.132           25.6
## 3 email_B        0.652        0.0934          25.9

Email A looks better for opens and clicks, but maybe not purchases. Both emails seem to generate higher average purchases than the control.

Does email A have higher open rate than B?

Create a new data set with just the customers who received emails.

d_treat <- d[d$group != "ctrl",]            # keep only customers who got an email
d_treat$group <- droplevels(d_treat$group)  # drop the unused "ctrl" factor level
xtabs(~ group + open, data=d_treat)
##          open
## group         0     1
##   email_A 11643 29686
##   email_B 14395 26934

Excluding one randomized group from an analysis is legitimate: the remaining groups were still randomly assigned, so comparisons between them are still valid.

Confirm significance with proportions test

prop.test(xtabs(~ group + open, data=d_treat)[,2:1]) 
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  xtabs(~group + open, data = d_treat)[, 2:1]
## X-squared = 424.32, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.06024628 0.07292897
## sample estimates:
##    prop 1    prop 2 
## 0.7182850 0.6516974

Visualization: open rates for emails A & B

mosaicplot(xtabs(~ group + open, data=d_treat), 
           main="Wine Retailer Test: Email Opens")

Does email A have a higher click rate than B?

xtabs(~ group + click, data=d_treat)
##          click
## group         0     1
##   email_A 35887  5442
##   email_B 37468  3861

Confirm significance with proportions test

prop.test(xtabs(~ group + click, data=d_treat)[,2:1])
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  xtabs(~group + click, data = d_treat)[, 2:1]
## X-squared = 302.38, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.03392871 0.04257931
## sample estimates:
##     prop 1     prop 2 
## 0.13167509 0.09342108

Note that we analyze click rate among all who received the email, ignoring whether or not they opened it. There may be systematic differences in the types of customers who opened email A versus email B, so comparing click rates among openers only would not be an apples-to-apples comparison.

Visualization: barplot of clicks and opens for emails A & B
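
A minimal sketch in base R (the plot from the original slides is not reproduced here):

# Open and click rates by email version, as side-by-side bars
rates <- rbind(open  = tapply(d_treat$open,  d_treat$group, mean),
               click = tapply(d_treat$click, d_treat$group, mean))
barplot(rates, beside = TRUE, legend.text = rownames(rates),
        ylab = "Rate", main = "Wine Retailer Test: Opens and Clicks")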

Do any groups have higher average purchases?

Average 30-day purchase amount by group

d %>% group_by(group) %>% summarize(mean(purch))
## # A tibble: 3 x 2
##   group   `mean(purch)`
##   <fct>           <dbl>
## 1 ctrl             12.4
## 2 email_A          25.6
## 3 email_B          25.9

Do any groups have higher average purchases?

Visualization: boxplot (old school)
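
A one-line version in base R (a sketch; the original plot is not reproduced here):

boxplot(purch ~ group, data = d, ylab = "30-day purchases ($)")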

Do any groups have higher average purchases?

Visualization: Violin plots with log scale
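
A sketch using ggplot2, dropping zero purchases so the log scale is defined (an assumption about how the original plot handled zeros):

library(ggplot2)
ggplot(d[d$purch > 0, ], aes(x = group, y = purch)) +
  geom_violin() +
  scale_y_log10() +
  labs(y = "30-day purchases among purchasers ($, log scale)")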

Do any groups have higher average purchases?

Visualization: Dotplot with log scale
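
A sketch of a jittered dot plot on the same log scale (again excluding zero purchases):

library(ggplot2)
ggplot(d[d$purch > 0, ], aes(x = group, y = purch)) +
  geom_jitter(width = 0.2, alpha = 0.1) +
  scale_y_log10() +
  labs(y = "30-day purchases among purchasers ($, log scale)")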

Test significance with a t-test

t.test(purch ~ group, data=d[d$group != "ctrl",])
## 
##  Welch Two Sample t-test
## 
## data:  purch by group
## t = -0.59169, df = 82644, p-value = 0.5541
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.0498820  0.5629813
## sample estimates:
## mean in group email_A mean in group email_B 
##              25.62284              25.86629

There is not a significant difference in average purchases between email A and email B.

Do emails generate higher purchases?

t.test(purch ~ email, data=d)
## 
##  Welch Two Sample t-test
## 
## data:  purch by email
## t = -44.823, df = 107015, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -13.90691 -12.74164
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            12.42029            25.74456

Those who received an email have higher average purchases (difference ≈ $13.32, 95% CI = [$12.74, $13.91]).

Summary of findings (suitable for texting)

  • Email A has significantly higher opens and clicks than email B, but purchases are similar for both emails \(\rightarrow\) Send email A!
  • Both emails generate higher average purchases than the control \(\rightarrow\) Send emails!

Design of A/B tests

Eight key questions

  1. Business question
  2. Test setting (lab v. field)
  3. Unit of analysis (visit, customer, store)
  4. Treatments
  5. Response variable(s)
  6. Selection of units
  7. Assignment to treatments
  8. Sample size

If you can answer these questions, you have a test plan.

Email test

Business question: Does email work? If so, which email is better?

Test setting: email to retailer customers

Unit: email address

Treatments: email version A, email version B, holdout

Response: open, click, and 30-day purchase ($)

Selection: all active emails on email list (open in last 12 months)

Assignment: randomly assigned (1/3 each)

Sample size: 123,988 emails

Typical website test

Business question: Which version of a webpage?

Test setting: website (field)

Unit of analysis: visitor (cookie-tracked)

Treatments: versions A and B

Response variable: clicks, conversions

Selection of units: all who visit

Assignment to treatments: random (by testing software)

Sample size: ???

Sample size planning

If you repeatedly test for significance as the data comes in and stop as soon as you see a significant difference, significance tests will erroneously detect effects that aren't there.

sig <- rep(0, 1000)
for (r in 1:1000) {
  A <- rnorm(101); B <- rnorm(101)
  pval <- rep(NA, 100)
  for (n in 1:100) pval[n] <- t.test(A[1:(n+1)], B[1:(n+1)])$p.value  # repeated testing
  if (min(pval) < 0.05) sig[r] <- 1  # any significance along the way
}
mean(sig)   # bigger than the nominal significance of 5%
## [1] 0.359

Sample size planning

The standard recommendation is to set the sample size in advance and not test for significance until all the data is in.

WTF? Seriously? More on this later.

The recommended sample size is:

\(n_1 = n_2 \approx (z_{1-\alpha/2} + z_{1-\beta})^2 \left( \frac{2 s^2}{d^2} \right)\)

Sample size planning: key ideas

  • My data is noisy, so the group with the higher average in the test will not always have the higher long-run response.
  • There are two mistakes you can make:
    • Declare the treatments different, when they are the same (Type I)
    • Declare the treatments the same, when they are different (Type II)
  • I want a low probability of both of those mistakes (\(\alpha\), \(\beta\)), given a specific difference between treatments that I want to detect (\(d\)) and the noise in my response (\(s\))

\(n_1 = n_2 \approx (z_{1-\alpha/2} + z_{1-\beta})^2 \left( \frac{2 s^2}{d^2} \right)\)

Interpreting the sample size formula

\(n_1 = n_2 \approx (z_{1-\alpha/2} + z_{1-\beta})^2 \left( \frac{2 s^2}{d^2} \right)\)

  • More noise \(\rightarrow\) larger sample size
  • Smaller difference to detect \(\rightarrow\) larger sample size
  • Fewer errors \(\rightarrow\) larger sample size

Sample size calculator in R

Sample size to detect a $1 difference in average 30-day purchases:

sd(d$purch)
## [1] 54.82613
power.t.test(sd=sd(d$purch), delta=1, sig.level=0.05, power=0.80)

We need roughly 47,000 in each group: detecting a $1 difference in a response this noisy takes a very large test.

Sample size planning

There is a slightly different formula for:

Continuous response (e.g., money, time-on-site)

\(n_1 = n_2 \approx (z_{1-\alpha/2} + z_{1-\beta})^2 \left( \frac{2 s^2}{d^2} \right)\)

Binary response (e.g., conversions)

\(n_1 = n_2 \approx (z_{1-\alpha/2} + z_{1-\beta})^2 \left( \frac{2 p (1-p)}{d^2} \right)\)

Sample size calculator in R

Binary response

power.prop.test(p1=0.07, p2=0.07 + 0.01, sig.level=0.05, power=0.80)
## 
##      Two-sample comparison of proportions power calculation 
## 
##               n = 10889.14
##              p1 = 0.07
##              p2 = 0.08
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

Sample size calculator

A word of caution about sample size calculators

There are different sample size formulas floating around. These formulas differ in the assumptions they make about what you are trying to do, and it can be very hard to figure out what assumptions are being made (even for experts).

A decent sample size calculation will help you identify whether you are likely to end up with way too much or too little data.
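
To see this concretely, here is the binary-response formula from above computed by hand (a sketch; the variable names are illustrative). Which p you plug in is exactly the kind of assumption that differs across calculators:

z2 <- (qnorm(0.975) + qnorm(0.80))^2   # (z_{1-alpha/2} + z_{1-beta})^2 for alpha=0.05, power=0.80
z2 * 2 * 0.07  * (1 - 0.07)  / 0.01^2  # with p = 0.07 (baseline rate): about 10,219 per group
z2 * 2 * 0.075 * (1 - 0.075) / 0.01^2  # with p = 0.075 (average of p1, p2): about 10,890 per group

Plugging in the average of p1 and p2 essentially reproduces power.prop.test()'s 10,889; plugging in the baseline rate gives a noticeably smaller n. Both are in the same ballpark, which is what a decent calculator should deliver.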

Tips for getting started with A/B testing

  • Keep it simple
  • Be prepared to find no effect
  • Choose “strong” treatments
  • Run many tests in fast succession
  • You are searching for a few “golden tickets”

Things you just learned (or reviewed)

  • Three types of variables in test data
    • Treatment (x’s)
    • Response (y’s)
    • Baseline variables (z’s)
  • Analyzing tests with binary response
    • Bar plot or mosaic plot
    • prop.test() for significance
  • Analyzing tests with continuous response
    • Dot plots or violin plots
    • t.test() for significance
  • Eight key questions that define a test plan
  • Sample size calculations
    • Continuous responses
    • Binary responses