7/16/2020

Thanks to Drexel & Wharton

A/B Testing

Holdout Testing

Workshop Goals

In this session, I’d like to introduce you to an alternative way of planning and analyzing an A/B test, which is based on Bayesian decision theory. We think this approach has a lot of advantages, so Ron and I gave it a name: Test & Roll.

We are marketing professors, after all!

Workshop materials

Slides

Those familiar with R Markdown can open the R Markdown file that created the slides in R Studio and run the code as we go along.
- Requires rstan and dplyr packages
- Requires nn_functions.R with some additional functions I wrote.

Those unfamiliar with R Markdown (or just tired) should follow the slides without trying to run the code.

What I expect participants already know

I just told you about A/B tests, but ask questions!

I will assume you know what a probability distribution like the normal distribution is, but ask questions!

I expect you are comfortable reading R. I will use mathematical calculations, for loops, and plotting, but ask questions!

You do not need to know how to plan and analyze an A/B test using hypothesis testing, but see the previous R Ladies Philly workshop on A/B testing.

How to analyze an A/B test using Bayesian inference

Bayesian Analysis for A/B Tests

The data from an A/B test comparing the time users spend on your website for two versions of the homepage is in the data frame test_data. A summary of the data looks like this:

test_data %>% 
  group_by(group) %>% summarize(mean=mean(time_on_site), sd=sd(time_on_site), n=n())
## # A tibble: 2 x 4
##   group  mean    sd     n
##   <chr> <dbl> <dbl> <int>
## 1 A      5.33  1.95   500
## 2 B      5.37  2.28   500

It looks like the B version keeps users on the site a bit longer, but how sure are we that B produces longer visits on average? We’ve only seen 500 visitors in each group.

Prior beliefs

I would like to know what the mean time-on-site is for the A group and the B group.

Before I saw this data, I knew nothing about how long people might spend on this website. They might stay for 5 seconds or they might stay for 5 hours.

Formally, I can describe my prior beliefs with a prior distribution: \[\textrm{mean time-on-site for group} \sim \mathcal{N}(0, 100^2)\]

Prior beliefs

I find it easier to draw a picture of my prior.
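
The original slide shows a plot here. A minimal sketch of how such a picture could be drawn in base R (my code, not the original slide chunk):

# Sketch: the N(0, 100^2) prior is very spread out relative to plausible
# times-on-site, so it looks almost flat over any realistic range.
curve(dnorm(x, mean = 0, sd = 100), from = -300, to = 300,
      xlab = "mean time-on-site (minutes)", ylab = "prior density")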

Posterior beliefs

Bayes rule tells you how you should update your beliefs after you see some data. This is also easier to see with a picture.

Model details (mathematically)

If we assume that the time-on-site for each customer is distributed normally: \[\textrm{time-on-site} \sim \mathcal{N}(\textrm{mean time-on-site for group}, s^2)\] then Bayes rule tells us that the posterior distribution for mean time-on-site for each group is (approximately, since our prior is nearly flat): \[\textrm{mean time-on-site for group given data} \sim \mathcal{N}\left(\overline{y}, \left(\frac{1}{100^2} + \frac{n}{s^2}\right)^{-1}\right)\]

I’m skipping the derivation. Hope you don’t mind!

Code for posterior updating

n_A <- sum(test_data$group=="A") # obs for A
n_B <- sum(test_data$group=="B") # obs for B
s <- sd(test_data$time_on_site) # standard deviation of data (approx)

# Posterior mean is just the mean for each group
post_mean_A <- mean(test_data$time_on_site[test_data$group=="A"])
post_mean_B <- mean(test_data$time_on_site[test_data$group=="B"])

# Posterior standard deviation follows this formula
post_sd_A <- (1/100^2 + n_A/s^2)^-(1/2)
post_sd_B <- (1/100^2 + n_B/s^2)^-(1/2)
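
To visualize the update, we can overlay the prior and the two posteriors. Here is a minimal base-R sketch (my plotting code, not the original slide chunk), using the objects computed above:

# Sketch: posterior densities for A and B, with the nearly-flat prior for
# comparison (the prior is barely visible at this scale).
x <- seq(4, 7, length.out = 200)
plot(x, dnorm(x, post_mean_A, post_sd_A), type = "l", col = "red",
     xlab = "mean time-on-site", ylab = "density")
lines(x, dnorm(x, post_mean_B, post_sd_B), col = "blue")
lines(x, dnorm(x, 0, 100), lty = 2)
legend("topright", legend = c("posterior A", "posterior B", "prior"),
       col = c("red", "blue", "black"), lty = c(1, 1, 2))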

Credible intervals for groups

If you like intervals, we can compute a 95% credible interval for each group by cutting off the left and right 2.5% of the posterior distribution. In this case, our posterior is normal, so we use the qnorm() function, which computes quantiles of the normal distribution.

There is a 95% probability that the average time-on-site for treatment A is:

qnorm(c(0.025, 0.975), mean=post_mean_A, sd=post_sd_A) # CI for A
## [1] 5.139694 5.511620

There is a 95% probability that the average time-on-site for treatment B is:

qnorm(c(0.025, 0.975), mean=post_mean_B, sd=post_sd_B) # CI for B
## [1] 5.184622 5.556548

Posterior for the difference between A and B

We can also compute the posterior distribution for the difference between the mean time-on-site for the B group and the mean time-on-site for the A group.

Posterior for the difference between A and B

Since the posterior for A and the posterior for B are both normal, the posterior for the difference is also normal with mean and standard deviation:

post_mean_diff <- post_mean_B - post_mean_A
post_sd_diff <- sqrt(post_sd_B^2 + post_sd_A^2)

Once we have the distribution for the difference in the mean time-on-site, we can compute the probability that the mean of B is greater than the mean of A.

1-pnorm(0, mean=post_mean_diff, sd=post_sd_diff)
## [1] 0.6311218

It is up to the decision maker to decide whether they would like to use version B, knowing that there is a 63% chance that B is better than A. This depends on how costly it is to deploy B.
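
For instance, here is a hypothetical back-of-the-envelope check. The remaining audience size, the dollar value of a minute on site, and the deployment cost are made-up numbers for illustration only:

# Hypothetical decision sketch: is the expected gain from deploying B worth
# the cost of switching? All three numbers below are illustrative assumptions.
n_remaining <- 10000      # customers who will see whichever page we deploy
value_per_minute <- 0.05  # assumed dollar value of one extra minute on site
deploy_cost <- 500        # assumed one-time cost of rolling out version B
expected_gain <- post_mean_diff * n_remaining * value_per_minute
expected_gain > deploy_cost  # deploy B only if the expected gain covers the cost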

More on prior beliefs

Many people get hung up on priors. Don’t!

As you get more data, the posterior becomes more and more influenced by the data. In most practical situations you have enough data that the priors do not affect the analysis much.

If you have prior information, priors are a principled way to bring it in.

Priors are also useful when planning an A/B test (more later).

More on prior beliefs

When we don’t want our priors to influence the outcome, we use “flat” priors. The prior we used puts nearly equal weight on 4.5 minutes versus 6 minutes, so it is pretty flat.
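
You can check this directly by evaluating the prior density at both values:

# The prior density is nearly identical at 4.5 and 6 minutes, so the prior
# barely favors one plausible value over the other.
dnorm(c(4.5, 6), mean = 0, sd = 100)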

More on prior beliefs

In this analysis I used the same prior for both A and B because I know nothing about A and B.

Important: using the same prior for A and B is not the same as assuming A and B have the same mean time-on-site. Priors reflect our uncertainty about A and B. Because we are uncertain, A might be better than B or B might be better than A.

You can use priors that reflect your (justifiable) prior beliefs. For instance, if A is a discount and B is no discount and our outcome is how much you purchase, then I’m pretty sure A will be as good as or better than B.

How to analyze an A/B test using Bayesian analysis

  1. Pick treatments and an outcome to measure
  2. Randomly assign A and B to some users and record outcomes
  3. Quantify your beliefs about the outcome with a prior distribution
  4. Update your beliefs according to Bayes rule (posterior distribution)
  5. Plot your updated belief distribution; compute intervals and probabilities
  6. Make a decision or go to 3

Questions?

How to Test & Roll

Typical A/B email test setup screen

A/B test planning as a decision problem

Test

Choose \(n_1^*\) and \(n_2^*\) customers to send the treatments to.
Collect data on profit for both treatments.

Roll

Choose a treatment to deploy to the remaining \(N - n_1^* - n_2^*\) customers.

Objective

Maximize combined profit for the test stage and the roll stage.
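
Written out a little more formally (my notation, not from the slides; \(m_d\) denotes the true mean profit of whichever treatment has the higher posterior mean after the test), the objective is:

\[\max_{n_1, n_2} \; E\left[\underbrace{n_1 m_1 + n_2 m_2}_{\text{test stage}} + \underbrace{(N - n_1 - n_2)\, m_d}_{\text{roll stage}}\right]\]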

Profit-maximizing sample size

If you have a test where the profit earned for each customer is:
\(y \sim \mathcal{N}(m_1, s)\) for group 1
\(y \sim \mathcal{N}(m_2, s)\) for group 2

and your priors are
\(m_1, m_2 \sim \mathcal{N}(\mu, \sigma)\)

the profit-maximizing sample size is:
\[n_1 = n_2 = \sqrt{\frac{N}{4}\left( \frac{s}{\sigma} \right)^2 + \left( \frac{3}{4} \left( \frac{s}{\sigma} \right)^2 \right)^2 } - \frac{3}{4} \left(\frac{s}{\sigma} \right)^2\] This new sample size formula is derived in Feit and Berman (2019) Marketing Science.
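
As a sanity check (a sketch I am adding, not part of the original slides), the formula can be coded directly; with the inputs defined on the next slide it should agree with test_size_nn():

# Direct implementation of the sample size formula above (a sketch; the
# workshop uses test_size_nn() from nn_functions.R instead).
test_size_formula <- function(N, s, sigma) {
  r <- (s / sigma)^2
  sqrt(N / 4 * r + (3 / 4 * r)^2) - 3 / 4 * r
}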

Compute the sample size in R

source("~/Documents/GitHub/testandroll/nn_functions.R") # some functions I wrote
N <- 100000 # available population
mu <- 0.68  # average conversion rate across previous treatments
sigma <- 0.03 # range of expected conversion rates across previous treatments
s <- mu*(1-mu) # binomial approximation
test_size_nn(N=N, s=s, mu=mu, sigma=sigma) # compute the optimal test size
## [1] 1108.073 1108.073

Why is this the profit-maximizing test size?

test_eval_nn() computes the profit of a Test & Roll.

# Optimal test size
n_star <- test_size_nn(N=N, s=s, mu=mu, sigma=sigma)
test_eval_nn(n=n_star, N=N, s=s, mu=mu, sigma=sigma)
##         n1       n2 profit_per_cust   profit profit_test profit_deploy
## 1 1108.073 1108.073       0.6961711 69617.11    1506.979      68110.13
##   profit_rand profit_perfect profit_gain      regret error_rate deploy_1_rate
## 1       68000       69692.57   0.9554201 0.001082677 0.06829165           0.5
##   tie_rate
## 1        0

Why is this the profit-maximizing test size?

# Bigger test
test_eval_nn(n=c(10000, 10000), N=N, s=s, mu=mu, sigma=sigma)
##      n1    n2 profit_per_cust   profit profit_test profit_deploy profit_rand
## 1 10000 10000       0.6935051 69350.51       13600      55750.51       68000
##   profit_perfect profit_gain      regret error_rate deploy_1_rate tie_rate
## 1       69692.57   0.7979038 0.004908151 0.02304771           0.5        0
# Smaller test
test_eval_nn(n=c(100, 100), N=N, s=s, mu=mu, sigma=sigma)
##    n1  n2 profit_per_cust   profit profit_test profit_deploy profit_rand
## 1 100 100       0.6936736 69367.36         136      69231.36       68000
##   profit_perfect profit_gain      regret error_rate deploy_1_rate tie_rate
## 1       69692.57   0.8078632 0.004666275  0.1997479           0.5        0

Why is this the profit-maximizing test size?
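
The original slide shows a profit curve here. A sketch of how such a curve could be produced with test_eval_nn() (the grid of test sizes is my choice):

# Sketch: evaluate expected total profit over a grid of (equal) test sizes.
# The maximum should occur near the n_star computed earlier.
n_grid <- seq(100, 20000, by = 100)
profit <- sapply(n_grid, function(n)
  test_eval_nn(n = c(n, n), N = N, s = s, mu = mu, sigma = sigma)$profit)
plot(n_grid, profit, type = "l",
     xlab = "test size in each group", ylab = "expected total profit")
abline(v = n_star[1], lty = 2)  # optimal test size from test_size_nn()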