“When customers are randomly assigned to treatment and control groups, and there are many customers in each group, then you may effectively have multiple experiments to analyze.”
Test setting: email to retailer mailing list
Unit: email address
Treatments: email version A, email version B, holdout
Response: open, click, and 1-month purchase ($)
Selection: all active customers
Assignment: randomly assigned (1/3 each)
Consider the customers who have been active in the last 60 days.
Within that subset, customers were randomly assigned to receive email A, email B, or no email.
So, we can analyze the data for a subgroup as its own test by slicing down to that subgroup and then re-analyzing.
However, we will only find significant results if we have a large enough sample in the subgroup.
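Before slicing, it helps to ask how large a subgroup has to be to detect the effect size you care about. A minimal power-calculation sketch, using hypothetical numbers (a $5 lift and a $50 standard deviation, chosen for illustration rather than taken from this test):

# hypothetical inputs: detect a $5 lift with sd = $50 at 5% significance and 80% power
power.t.test(delta = 5, sd = 50, sig.level = 0.05, power = 0.8)
# the result reports the sample size needed per group for a two-sample t-test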
d %>% group_by((days_since < 60), group) %>% summarize(mean(open), mean(click), mean(purch))
## # A tibble: 6 x 5
## # Groups:   (days_since < 60) [2]
##   `(days_since < 60)` group   `mean(open)` `mean(click)` `mean(purch)`
##   <lgl>               <fct>          <dbl>         <dbl>         <dbl>
## 1 FALSE               ctrl           0            0               6.80
## 2 FALSE               email_A        0.582        0.106          17.1 
## 3 FALSE               email_B        0.503        0.0715         17.0 
## 4 TRUE                ctrl           0            0              18.5 
## 5 TRUE                email_A        0.865        0.160          34.8 
## 6 TRUE                email_B        0.812        0.117          35.4
d %>% filter(email==TRUE) %>%
  ggplot(aes(y=purch, x=group)) +
  geom_dotplot(binaxis='y', stackdir='center', stackratio=0.1, dotsize=0.1, binwidth=0.1) +
  ylab("30-Day Purchases ($)") + xlab("") +
  scale_y_log10()
t.test(purch ~ email, data=d[d$days_since < 60,])
## 
##  Welch Two Sample t-test
## 
## data:  purch by email
## t = -33.51, df = 50513, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -17.5776 -15.6350
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            18.48809            35.09439
t.test(purch ~ email, data=d[d$days_since > 60,])
## 
##  Welch Two Sample t-test
## 
## data:  purch by email
## t = -30.257, df = 56220, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -10.752048  -9.443798
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            6.792411           16.890335
Anyone who keeps historic data on customers or visitors has lots of baseline variables available for slicing and dicing:
Re-analyze the opens, clicks and purchases for people who have bought syrah in the past.
summary(d$syrah > 0)
##    Mode   FALSE    TRUE 
## logical   88359   35629
mean(d$syrah > 0)
## [1] 0.2873585
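To carry out that re-analysis, slice down to past syrah buyers and recompute the group means, mirroring the 60-day slice above (a sketch; output not shown):

# restrict to customers who have bought syrah, then compare groups within that slice
d %>% filter(syrah > 0) %>% group_by(group) %>% summarize(mean(open), mean(click), mean(purch))
t.test(purch ~ email, data=d[d$syrah > 0,])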
Slicing and dicing means you will run many significance tests.
You may remember from intro stats that 1 in 20 significance tests at 95% confidence will be significant even when there is no effect. You will get false positives, especially when slicing and dicing.
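For example, if you run 10 independent subgroup tests when there are no true effects anywhere, the chance of at least one false positive is \(1 - 0.95^{10} \approx 0.40\).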
When you think you’ve found a golden ticket, re-test before betting the company.
Slicing and dicing will reveal two things about subgroups of customers: some subgroups respond more overall (e.g., recently active customers purchase more even without an email), and some subgroups respond differently to the treatment itself.
“Experiments are used because they provide credible estimates of the effect of an intervention for a sample population. But underlying this average effect for a sample may be substantial variation in how particular respondents respond to treatments: there may be heterogeneous treatment effects.”
– Athey and Imbens, 2015
Marketers should be interested in heterogeneous treatment effects when there is an opportunity to apply different treatments to each subgroup (i.e., targeting).
email \(\rightarrow\) high potential for targeting
website \(\rightarrow\) less potential for targeting
We use a regression model to define a relationship between the response (\(y\)) and the treatment (\(x\)).
\(y = a + b \times x + \varepsilon\)
The model literally says that we get the average response by multiplying the treatment indicator \(x\) by \(b\) and adding that to \(a\). When we fit a model, we use data to estimate \(a\) and \(b\).
In R, we shorthand the model equation with an R formula:
purch ~ email
This means exactly the same thing as:
\(\text{purch} = a + b \times \text{email} + \varepsilon\)
where we estimate \(a\) and \(b\) from data.
m1 <- lm(purch ~ email, data=d)
summary(m1)
## 
## Call:
## lm(formula = purch ~ email, data = d)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -25.74 -25.74 -12.42  -1.23 1581.66 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12.4203     0.2679   46.36   <2e-16 ***
## emailTRUE    13.3243     0.3281   40.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 54.47 on 123986 degrees of freedom
## Multiple R-squared:  0.01313, Adjusted R-squared:  0.01312 
## F-statistic:  1649 on 1 and 123986 DF,  p-value: < 2.2e-16
Email increases sales by ~$13.32 on average across all email addresses.
Regression model
summary(m1)$coef
##             Estimate Std. Error  t value Pr(>|t|)
## (Intercept) 12.42029  0.2679092 46.36005        0
## emailTRUE   13.32428  0.3281218 40.60772        0
Significance test
t.test(purch ~ email, data=d, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  purch by email
## t = -40.608, df = 123986, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -13.96739 -12.68116
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            12.42029            25.74456
If you like regression, you can use regression to analyze all your tests.
If you don’t like regression, you should still try it, because it lets you pull in baseline variables.
m2 <- lm(purch ~ email + (days_since < 60), data=d)
summary(m2)$coef
##                      Estimate Std. Error  t value     Pr(>|t|)
## (Intercept)          4.764104  0.3031659 15.71451 1.361618e-55
## emailTRUE           13.301929  0.3246010 40.97932 0.000000e+00
## days_since < 60TRUE 15.929019  0.3062459 52.01382 0.000000e+00
Aged customers in the control group purchased on average $4.76 in the 30 days after the email was sent. Recently active customers in the control group purchased an additional $15.93. The average effect of the email was $13.30.
Controlling for baseline variables increases the likelihood of finding significant effects, because the baseline variables soak up some of the residual variation in the response. This is sometimes called “regression correction.”
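One way to see this with the models above is to compare the standard error on the email coefficient with and without the baseline control (a small sketch; from the outputs shown, the reduction here is modest, about 0.328 versus 0.325):

# standard error on the email effect, without and with the days_since control
summary(m1)$coef["emailTRUE", "Std. Error"]
summary(m2)$coef["emailTRUE", "Std. Error"]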
To incorporate heterogeneous treatment effects, we need an interaction between the treatment indicator (\(x\)) and a baseline variable (\(z\)).
When we interact two terms, we are defining a model that multiplies the two terms together:
\(y = a + b x + c z + d (x z) + \varepsilon\)
The R formula for this model is:
purch ~ email + (days_since < 60) + email:(days_since < 60)
or equivalently
purch ~ email*(days_since < 60)
m3 <- lm(purch ~ email + (days_since < 60) + email:(days_since < 60), data=d)
summary(m3)$coef
##                                Estimate Std. Error   t value      Pr(>|t|)
## (Intercept)                    6.804775  0.3676230 18.510196  2.167184e-76
## emailTRUE                     10.238134  0.4504495 22.728707 3.994762e-114
## days_since < 60TRUE           11.683315  0.5302628 22.033066 2.234772e-107
## emailTRUE:days_since < 60TRUE  6.368163  0.6494166  9.805976  1.081541e-22
The email effect is $10.24 for aged customers plus an additional $6.37 for recent customers (a total of $16.61).
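The effect for recently active customers can be recovered directly from the fitted coefficients (a sketch using the coefficient names printed above):

# email effect for recently active customers = main email effect + interaction
coef(m3)["emailTRUE"] + coef(m3)["emailTRUE:days_since < 60TRUE"]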
An uplift model is a regression model that incorporates many baseline variables. For example:
m4 <- lm(purch ~ email*(days_since < 60) + email*(past_purch > 50) + email*(visits > 3), data=d)
summary(m4)$coef
##                                Estimate Std. Error    t value      Pr(>|t|)
## (Intercept)                   -0.685228  0.7058724 -0.9707534  3.316730e-01
## emailTRUE                     -1.978801  0.8626010 -2.2939930  2.179256e-02
## days_since < 60TRUE           11.743406  0.5236114 22.4277149 3.520186e-111
## past_purch > 50TRUE            8.759740  0.5406815 16.2012939  5.678508e-59
## visits > 3TRUE                 2.653066  0.6819410  3.8904628  1.001054e-04
## emailTRUE:days_since < 60TRUE  6.321222  0.6412638  9.8574453  6.489627e-23
## emailTRUE:past_purch > 50TRUE  7.795652  0.6624463 11.7679759  5.935940e-32
## emailTRUE:visits > 3TRUE       9.260021  0.8351709 11.0875773  1.486693e-28
If you have someone who wasn’t in the test, but you know their baseline variables, you can use an uplift model to predict likely treatment effect.
new_cust <- data.frame(past_purch=rep(38.12,2), days_since=rep(19,2), visits=rep(3,2))
(pred <- predict(m4, cbind(email=c(TRUE, FALSE), new_cust)))
##        1        2 
## 15.40060 11.05818
(lift <- pred[1] - pred[2])
##        1 
## 4.342422
This new customer is predicted to buy $15.40 if they get an email or $11.06 without, for an uplift of $4.34.
new_cust <- data.frame(past_purch=rep(127.88,2), days_since=rep(19,2), visits=rep(40,2))
(pred <- predict(m4, cbind(email=c(TRUE, FALSE), new_cust)))
##        1        2 
## 43.86908 22.47098
(lift <- pred[1] - pred[2])
##        1 
## 21.39809
This customer is a better target, with an uplift of $21.40.
For costly treatments (e.g., catalogs, discounts) we should target customers that we predict will have a positive effect that exceeds costs.
Source: Predictive Analytics Times
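A minimal targeting sketch along those lines, assuming a hypothetical $1 cost per email (the cost figure is illustrative, not from this test):

# predicted lift for every customer on the list, from the uplift model m4
base_vars <- d[, c("days_since", "past_purch", "visits")]
lift <- predict(m4, cbind(email=TRUE, base_vars)) - predict(m4, cbind(email=FALSE, base_vars))
# target only customers whose predicted lift exceeds the $1 cost of sending
target <- lift > 1
mean(target)  # share of the list worth emailing under this rule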
We can also build an uplift model for click probability, but we should use a logistic regression for binary outcomes.
m5 <- glm(click ~ group*(days_since < 60) + group*(past_purch > 50) + group*(visits > 3) +
            group*(syrah > 0) + group*(cab > 0) + group*(sav_blanc > 0) + group*(chard > 0),
          family = binomial, data=d[d$group != "ctrl",])
While email B has a lower overall click rate, customers who have purchased syrah in the past are more likely to click if they get email B (which promoted syrah).
summary(m5)$coef
##                                     Estimate Std. Error     z value
## (Intercept)                      -2.48230897 0.04332267 -57.2981492
## groupemail_B                     -0.67064544 0.06848652  -9.7923721
## days_since < 60TRUE               0.48161494 0.02956872  16.2879878
## past_purch > 50TRUE               0.41923694 0.04474487   9.3694975
## visits > 3TRUE                    0.01528802 0.03920836   0.3899174
## syrah > 0TRUE                     0.04624640 0.03409196   1.3565191
## cab > 0TRUE                      -0.01426477 0.03400629  -0.4194747
## sav_blanc > 0TRUE                 0.04214018 0.03699615   1.1390423
## chard > 0TRUE                     0.11199369 0.03684271   3.0397784
## groupemail_B:days_since < 60TRUE  0.06953462 0.04562008   1.5242109
## groupemail_B:past_purch > 50TRUE -0.07434038 0.06864675  -1.0829410
## groupemail_B:visits > 3TRUE       0.01599524 0.06104585   0.2620200
## groupemail_B:syrah > 0TRUE        0.51265960 0.05076735  10.0982156
## groupemail_B:cab > 0TRUE          0.08589696 0.05166828   1.6624698
## groupemail_B:sav_blanc > 0TRUE    0.08704272 0.05553677   1.5672989
## groupemail_B:chard > 0TRUE        0.11339008 0.05543572   2.0454336
##                                      Pr(>|z|)
## (Intercept)                      0.000000e+00
## groupemail_B                     1.214135e-22
## days_since < 60TRUE              1.201167e-59
## past_purch > 50TRUE              7.287869e-21
## visits > 3TRUE                   6.965976e-01
## syrah > 0TRUE                    1.749341e-01
## cab > 0TRUE                      6.748693e-01
## sav_blanc > 0TRUE                2.546855e-01
## chard > 0TRUE                    2.367523e-03
## groupemail_B:days_since < 60TRUE 1.274560e-01
## groupemail_B:past_purch > 50TRUE 2.788346e-01
## groupemail_B:visits > 3TRUE      7.933060e-01
## groupemail_B:syrah > 0TRUE       5.625649e-24
## groupemail_B:cab > 0TRUE         9.641861e-02
## groupemail_B:sav_blanc > 0TRUE   1.170448e-01
## groupemail_B:chard > 0TRUE       4.081212e-02
Uplift models can include many, many baseline variables. Creating these variables from source data (CRM, web analytics data, etc) is called feature engineering.
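For example, the flags used in the models above can be built from raw fields with a few lines of dplyr (a sketch, assuming days_since, past_purch and visits come straight from the source systems; the new column names are just illustrative):

# turn raw CRM / web-analytics fields into baseline features for an uplift model
d <- d %>% mutate(recent = days_since < 60, big_spender = past_purch > 50, frequent = visits > 3)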
Causal forests are an alternative to regression for identifying heterogeneous treatment effects and scoring customers based on predicted treatment effect uplift.
Where regression models predict customer outcomes with a linear equation, CART models (classification and regression trees) predict customer outcomes using a tree structure. A CART is estimated by searching for the tree structure that classifies (or predicts) the observed outcomes most accurately.
Random forests are collections of different CARTs each fit to a subset of the data. Each tree in the forest classifies customers slightly differently. Unlike a regression, a random forest can pick up non-linear relationships.
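For intuition, a single CART of the kind that makes up a random forest can be fit with the rpart package (an illustration only, not part of the causal forest workflow below):

# a single regression tree predicting purchases from baseline variables
library(rpart)
tree <- rpart(purch ~ days_since + past_purch + visits, data = d)
print(tree)  # shows the splits the tree chose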
Causal forests are random forests designed to categorize customers according to their treatment effect in an experiment. The customers in each leaf are assumed to have homogeneous treatment effects, with heterogeneous treatment effects between leaves.
Advantages
- Works well with a large number of baseline variables
- Doesn’t require the analyst to define cut-offs for continuous baseline variables
- Will fit non-linear relationships between baseline variables and uplift
library(grf)   # causal_forest() comes from the grf package
treat <- d$email
response <- d$purch
baseline <- d[, c("days_since", "past_purch", "visits", "chard", "sav_blanc", "syrah", "cab")]
cf <- causal_forest(baseline, response, treat)
print(cf)
## GRF forest object of type causal_forest 
## Number of trees: 2000 
## Number of training samples: 123988 
## Variable importance: 
##     1     2     3     4     5     6     7 
## 0.165 0.667 0.093 0.023 0.031 0.012 0.008
average_treatment_effect(cf, method="AIPW")
##   estimate    std.err 
## 13.2946030  0.2876234
This is similar to the estimate from our simple regression, which was 13.32 (0.33).
Just like any uplift model, we can use the model to predict the email effect for new customers.
new_cust <- data.frame(chard=38.12, sav_blanc=0, syrah=0, cab=0, past_purch=38.12, days_since=19, visits=3)
predict(cf, new_cust, estimate.variance = TRUE)
##   predictions variance.estimates
## 1  -0.3514193           10.04156
hist(predict(cf)$predictions, main="Histogram of Purchase Lift", xlab="Purchase Lift for Email", ylab="Customers")
trans_gray <- rgb(0.1, 0.1, 0.1, alpha=0.1)
plot(d$past_purch, predict(cf)$predictions, cex=0.5, col=trans_gray, xlab="Past Purchase Amount ($)", ylab="Predicted Treatment Effect ($)")
trans_gray <- rgb(0.1, 0.1, 0.1, alpha=0.1)
plot(d$days_since, predict(cf)$predictions, cex=0.5, col=trans_gray, xlab="Days Since Last Active", ylab="Predicted Treatment Effect ($)")