- Choose \(n_1^*\) and \(n_2^*\) customers and send them the treatments.
- Collect data on the response.
- Choose a treatment to deploy to the remaining \(N - n_1^* - n_2^*\) customers.

The sample sizes \(n_1^*\) and \(n_2^*\) are chosen to maximize the combined profit from the test stage and the roll stage.
For the case where the response is normally distributed with standard deviation \(s\) and a symmetric normal prior on the mean response (\(m_1, m_2 \sim N(\mu, \sigma)\)), the profit-maximizing sample size is
\[n_1 = n_2 = \sqrt{\frac{N}{4}\left( \frac{s}{\sigma} \right)^2 + \left( \frac{3}{4} \left( \frac{s}{\sigma} \right)^2 \right)^2 } - \frac{3}{4} \left(\frac{s}{\sigma} \right)^2\] If the priors differ between the two groups (e.g., a holdout test), the optimal sample sizes can be found numerically. This sample size formula was derived by Feit and Berman (2019), Marketing Science.
Response
\[y_1 \sim N(m_1, s) \,\,\,\,\,\,\, y_2 \sim N(m_2, s)\]
Priors
\[m_1 \sim N(\mu, \sigma) \,\,\,\,\,\,\, m_2 \sim N(\mu, \sigma)\]
Profit-maximizing sample size \[n_1 = n_2 = \sqrt{\frac{N}{4}\left( \frac{s}{\sigma} \right)^2 + \left( \frac{3}{4} \left( \frac{s}{\sigma} \right)^2 \right)^2 } - \frac{3}{4} \left(\frac{s}{\sigma} \right)^2\]
Bigger population (\(N\)) \(\rightarrow\) bigger test
More noise in the response (\(s\)) \(\rightarrow\) bigger test
More prior difference between treatments (\(\sigma\)) \(\rightarrow\) smaller test
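As a quick illustration, the closed-form sample size can be computed directly in R. This is a minimal sketch of the formula above; the function name and example values are illustrative only, and the test_size_nn() call used later serves the same purpose.

# Minimal sketch of the profit-maximizing sample-size formula above.
# The function name and the example values are hypothetical.
test_size_formula <- function(N, s, sigma) {
  r <- (s / sigma)^2
  sqrt(N / 4 * r + (3 / 4 * r)^2) - 3 / 4 * r
}
test_size_formula(N = 100000, s = 0.5, sigma = 0.2)  # sample size per treatment arm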
Hierarchical Stan model for past experiments
// Stan code for Lewis and Rao 2015 data
// L&R only report the mean and standard deviation for the control group for each experiment
data {
  int<lower=1> nexpt;        // number of experiments
  real<lower=2> nobs[nexpt]; // sample size for control group
  real ybar[nexpt];          // observed mean for control group
  real<lower=0> s[nexpt];    // observed standard deviation for experiment (pooled)
}
parameters {
  real m[nexpt];             // true mean for control group in experiment
  real mu;                   // mean across experiments
  real<lower=0> sigma;       // standard deviation across experiments
}
model {
  // priors
  mu ~ normal(0, 10);
  sigma ~ normal(0, 3);
  // likelihood
  for (i in 1:nexpt) {
    m[i] ~ normal(mu, sigma);
    ybar[i] ~ normal(m[i], s[i]/sqrt(nobs[i]));
  }
}
library(rstan)  # provides stan()
lr <- read.csv("display_LewisRao2015Retail.csv")  # data taken from tables 1 and 2 of Lewis and Rao (2015)
c <- c(1:3, 5:6)  # include only advertiser 1 and eliminate exp 4
d1 <- list(nexpt=length(c), nobs=lr$n1[c], ybar=lr$m[c], s=lr$s[c])
m1 <- stan(file="test_roll_model.stan", data=d1, seed=20030601, iter=10000)
## (Stan sampling progress output omitted: 4 chains, 10,000 iterations each
## with 5,000 warmup; each chain finished in under 0.5 seconds.)
summary(m1)$summary[,c(1,3,5,8)]
##              mean         sd        25%      97.5%
## m[1]     9.490377 0.08467993   9.433258   9.655464
## m[2]    10.500796 0.10008523  10.433861  10.698501
## m[3]     4.860131 0.06181156   4.818066   4.981305
## m[4]    11.470157 0.07017636  11.422815  11.609507
## m[5]    17.615434 0.09047702  17.553716  17.792088
## mu      10.352358 2.00230021   9.138484  14.169450
## sigma    4.398852 1.17817514   3.540116   7.193112
## lp__   -13.182626 1.90558956 -14.218789 -10.511587
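The posterior means of mu and sigma are what get plugged into the sample-size calculation below. A minimal way to pull them out is shown here; the values may differ slightly from the ones used below because of MCMC variation.

post <- summary(m1)$summary
mu_hat    <- post["mu", "mean"]      # roughly 10.35 in this run
sigma_hat <- post["sigma", "mean"]   # roughly 4.40 in this run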
source("nn_functions.R")
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
(n <- test_size_nn(N=1000000, s=mean(d1$s), mu=10.36044, sigma=4.39646))
## [1] 11390.89 11390.89
(eval <- test_eval_nn(n=n, N=1000000, s=mean(d1$s), mu=10.36044, sigma=4.39646))
##         n1       n2 profit_per_cust   profit profit_test profit_deploy
## 1 11390.89 11390.89        12.72715 12727151    236029.3      12491122
##   profit_rand profit_perfect profit_gain      regret error_rate deploy_1_rate
## 1    10360440       12840843   0.9541639 0.008853939 0.06927924           0.5
##   tie_rate
## 1        0
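As a quick sanity check on this output, the total profit is the sum of the test-stage and roll-stage profits, and profit per customer times N (here 1,000,000) recovers the same total. A small illustrative check, assuming eval is the data frame returned above:

# total profit decomposes into test-stage plus roll-stage profit
with(eval, c(profit, profit_test + profit_deploy, profit_per_cust * 1e6))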
Null hypothesis test size to detect the difference between:

- display ads that have no effect (ROI = -100%)
- display ads that are exactly worth their cost (ROI = 0%)
margin <- 0.5                   # assumed gross margin on sales
d <- mean(lr$cost[c])/margin    # lift in sales per customer needed to just cover the ad cost
(n_nht <- test_size_nht(s=mean(d1$s), d=d))
## [1] 4782433
(n_fpc <- test_size_nht(s=mean(d1$s), d=d, N=1000000))
## [1] 452673.4
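For reference, here is a minimal sketch of the power calculation behind these numbers, assuming the conventional 95% confidence and 80% power; the actual test_size_nht() in nn_functions.R may differ in its details. The finite-population adjustment shown is consistent with the outputs above.

# per-group sample size to detect a difference d between two means with
# common standard deviation s (two-sided test); illustrative sketch only
nht_size_sketch <- function(s, d, conf = 0.95, power = 0.8, N = NULL) {
  z <- qnorm(1 - (1 - conf) / 2) + qnorm(power)
  n <- z^2 * 2 * s^2 / d^2
  if (!is.null(N)) n <- n / (1 + 2 * n / N)  # finite-population adjustment
  n
}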
(eval_fpc <- test_eval_nn(c(n_fpc, n_fpc), N=1000000, s=mean(d1$s), mu=10.36044, sigma=4.39646))
##         n1       n2 profit_per_cust   profit profit_test profit_deploy
## 1 452673.4 452673.4        10.59508 10595077     9379790       1215287
##   profit_rand profit_perfect profit_gain    regret error_rate deploy_1_rate
## 1    10360440       12840877  0.09459509 0.1748946 0.01116195           0.5
##   tie_rate
## 1        0
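Putting the two plans side by side makes the trade-off explicit. A small illustrative comparison using the eval and eval_fpc objects computed above:

comparison <- rbind(eval[, c("n1", "profit", "error_rate", "regret")],
                    eval_fpc[, c("n1", "profit", "error_rate", "regret")])
rownames(comparison) <- c("test_roll", "nht_fpc")
comparison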
Multi-armed bandits are a dynamic profit-maximizing approach that is more flexible than a test & roll experiment. They are often described as “machine learning for the A/B testing world.”
Source: personal photo from Caesars Palace, Las Vegas
A popular approach to multi-armed bandit problems, Thompson sampling, was proposed by Thompson in 1933.
There are other methods that work better in specific contexts, but Thompson sampling is very robust.
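A minimal sketch of Thompson sampling for two arms, using the same normal response and normal prior setup as above. The function and its arguments are illustrative, not taken from the paper's code.

thompson_2arm <- function(N, true_m, s, mu, sigma) {
  prec <- rep(1 / sigma^2, 2)   # posterior precisions, initialized at the prior
  pm   <- rep(mu, 2)            # posterior means, initialized at the prior
  reward <- numeric(N)
  for (i in 1:N) {
    draw <- rnorm(2, mean = pm, sd = 1 / sqrt(prec))  # one draw of each arm's mean
    k <- which.max(draw)                              # play the arm that looks best
    y <- rnorm(1, true_m[k], s)                       # observe its response
    # conjugate normal update (known response sd s)
    pm[k]   <- (prec[k] * pm[k] + y / s^2) / (prec[k] + 1 / s^2)
    prec[k] <- prec[k] + 1 / s^2
    reward[i] <- y
  }
  mean(reward)  # average realized response per customer
}
# e.g., thompson_2arm(N = 10000, true_m = c(10, 10.5), s = 5, mu = 10.4, sigma = 4.4)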
Source: eigenfoo.xyz
Both methods are profit-maximizing. We can compare them based on how much profit they generate.
Thompson sampling is less constrained, so it will always produce more profit on average.
Statisticians are a pessimistic lot, so we prefer to compute regret for an algorithm, which is the difference between the profit with perfect information and the profit achieved by the algorithm.
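Expressed relative to the perfect-information profit, this appears to be how the regret column in the test_eval_nn() output above is computed:

# relative regret: shortfall vs. perfect information, as a share of perfect-information profit
with(eval, (profit_perfect - profit) / profit_perfect)       # about 0.9% for test & roll
with(eval_fpc, (profit_perfect - profit) / profit_perfect)   # about 17.5% for the NHT-sized test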
Source: Feit and Berman 2019
The Test & Roll profit-maximizing sample size can be used as a conservative estimate of how long to run a bandit algorithm.