Outcome Modeling


Because the causal effect of A on Y is identified by adjusting for the confounders L1 and L2, we can estimate that effect by outcome modeling in three steps.

  1. Model \(E(Y\mid A, L_1, L_2)\), the conditional mean of \(Y\) given the treatment and confounders
  2. Predict potential outcomes
    • set A = 1 for every unit. Predict \(Y^1\)
    • set A = 0 for every unit. Predict \(Y^0\)
  3. Aggregate to the average causal effect
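Written as a single formula, the three steps amount to the estimator below. The symbols \(\hat{\tau}\) (the estimate), \(\hat{E}\) (the fitted outcome model), \(n\) (the number of units), and \(\ell_{1i}, \ell_{2i}\) (the confounder values of unit \(i\)) are notation introduced here for illustration.

\[
\hat{\tau} = \frac{1}{n}\sum_{i=1}^{n}\left[\hat{E}(Y\mid A = 1, L_1 = \ell_{1i}, L_2 = \ell_{2i}) - \hat{E}(Y\mid A = 0, L_1 = \ell_{1i}, L_2 = \ell_{2i})\right]
\]

The two terms inside the brackets are the predictions from step 2, and the average over units is step 3.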

First, we load simulated data.

library(tidyverse)
data <- read_csv("https://ilundberg.github.io/causalestimators/data/data.csv")

1) Model

The code below uses Ordinary Least Squares to estimate an outcome model.

model <- lm(Y ~ A*(L1 + L2), data = data)
summary(model)

Call:
lm(formula = Y ~ A * (L1 + L2), data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.1448 -0.7105  0.0097  0.6998  3.1743 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.01606    0.05699   0.282  0.77827    
A            1.11555    0.18021   6.190 1.26e-09 ***
L1           1.06333    0.05938  17.907  < 2e-16 ***
L2           1.11199    0.05951  18.685  < 2e-16 ***
A:L1        -0.39475    0.14279  -2.765  0.00591 ** 
A:L2        -0.28935    0.13940  -2.076  0.03844 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.111 on 494 degrees of freedom
Multiple R-squared:  0.6732,    Adjusted R-squared:  0.6699 
F-statistic: 203.6 on 5 and 494 DF,  p-value: < 2.2e-16

We chose a model in which treatment A is interacted with an additive function of the confounders, L1 + L2. This specification is also known as a t-learner (Künzel et al. 2019) because it is equivalent to estimating two separate regression models of the outcome on the confounders: one among those for whom A == 1 and one among those for whom A == 0.
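The equivalence can be checked numerically. The sketch below is a minimal illustration in base R, not part of the original analysis: the data frame `sim` and its data-generating process are assumptions made up for this check, standing in for the tutorial's data file. It fits the interacted model once, fits two group-specific models, and compares the predicted potential outcomes from each approach.

```r
# Simulated stand-in data (hypothetical; not the tutorial's data file)
set.seed(1)
n <- 500
sim <- data.frame(L1 = rnorm(n), L2 = rnorm(n))
sim$A <- rbinom(n, 1, plogis(sim$L1 + sim$L2))
sim$Y <- sim$L1 + sim$L2 + sim$A * (1 - 0.3 * sim$L1) + rnorm(n)

# One model with A interacted with L1 + L2, as in the text
fit_joint <- lm(Y ~ A * (L1 + L2), data = sim)

# Two separate models, one per treatment group
fit_treated <- lm(Y ~ L1 + L2, data = sim[sim$A == 1, ])
fit_control <- lm(Y ~ L1 + L2, data = sim[sim$A == 0, ])

# Predict both potential outcomes for every unit, each way
y1_joint <- predict(fit_joint, newdata = transform(sim, A = 1))
y0_joint <- predict(fit_joint, newdata = transform(sim, A = 0))
y1_separate <- predict(fit_treated, newdata = sim)
y0_separate <- predict(fit_control, newdata = sim)

# Largest absolute disagreement between the two approaches
max_gap <- max(abs(c(y1_joint - y1_separate, y0_joint - y0_separate)))
max_gap
```

Any disagreement should be at the level of floating-point error. The match concerns point predictions; standard errors can differ because the single interacted model pools the residual variance across treatment groups.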

2) Predict

The code below predicts the conditional average potential outcome under treatment and control at the confounder values of each observation.

First, we create data with A set to the value 1.

data_1 <- data |>
  mutate(A = 1)
# A tibble: 500 × 7
        L1     L2      Y0    Y1 propensity_score     A       Y
     <dbl>  <dbl>   <dbl> <dbl>            <dbl> <dbl>   <dbl>
1  0.00304  1.03   0.677   1.59          0.276       1  0.677 
2 -2.35    -1.66  -4.09   -3.53          0.00244     1 -4.09  
3  0.104   -0.912  0.0659  1.31          0.0569      1  0.0659
# ℹ 497 more rows

Then, we create data with A set to the value 0.

data_0 <- data |>
  mutate(A = 0)
# A tibble: 500 × 7
        L1     L2      Y0    Y1 propensity_score     A       Y
     <dbl>  <dbl>   <dbl> <dbl>            <dbl> <dbl>   <dbl>
1  0.00304  1.03   0.677   1.59          0.276       0  0.677 
2 -2.35    -1.66  -4.09   -3.53          0.00244     0 -4.09  
3  0.104   -0.912  0.0659  1.31          0.0569      0  0.0659
# ℹ 497 more rows

We use our outcome model to predict the conditional mean of the potential outcome under each scenario.

predicted <- data |>
  mutate(
    Y1_predicted = predict(model, newdata = data_1),
    Y0_predicted = predict(model, newdata = data_0),
    effect_predicted = Y1_predicted - Y0_predicted
  )
# A tibble: 500 × 10
        L1     L2      Y0    Y1 propensity_score     A       Y Y1_predicted
     <dbl>  <dbl>   <dbl> <dbl>            <dbl> <dbl>   <dbl>        <dbl>
1  0.00304  1.03   0.677   1.59          0.276       0  0.677         1.98 
2 -2.35    -1.66  -4.09   -3.53          0.00244     0 -4.09         -1.81 
3  0.104   -0.912  0.0659  1.31          0.0569      0  0.0659        0.451
# ℹ 497 more rows
# ℹ 2 more variables: Y0_predicted <dbl>, effect_predicted <dbl>
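For this particular linear specification, the predicted effect for each unit can also be read directly off the coefficients: it equals the coefficient on A plus the two interaction coefficients multiplied by that unit's L1 and L2. With the coefficients printed above, the first row (L1 ≈ 0.00304, L2 ≈ 1.03) gives roughly 1.116 - 0.395 × 0.00304 - 0.289 × 1.03 ≈ 0.82. The sketch below checks this identity on simulated stand-in data; the data frame `sim` and its data-generating process are assumptions for illustration, not the tutorial's data file.

```r
# Simulated stand-in data (hypothetical; not the tutorial's data file)
set.seed(2)
n <- 500
sim <- data.frame(L1 = rnorm(n), L2 = rnorm(n))
sim$A <- rbinom(n, 1, plogis(sim$L1 + sim$L2))
sim$Y <- sim$L1 + sim$L2 + sim$A * (1 - 0.3 * sim$L1) + rnorm(n)

fit <- lm(Y ~ A * (L1 + L2), data = sim)

# Route 1: difference of predictions with A set to 1 and to 0
effect_by_prediction <- predict(fit, newdata = transform(sim, A = 1)) -
  predict(fit, newdata = transform(sim, A = 0))

# Route 2: the same quantity assembled from the coefficients
b <- coef(fit)
effect_by_coefficients <- b[["A"]] + b[["A:L1"]] * sim$L1 + b[["A:L2"]] * sim$L2

# Largest absolute disagreement between the two routes
gap <- max(abs(effect_by_prediction - effect_by_coefficients))
gap
```

The predict-based route is the more general one: it keeps working unchanged if the outcome model is swapped for something without a simple coefficient formula.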

3) Aggregate

The final step is to aggregate to an average causal effect estimate.

aggregated <- predicted |>
  summarize(average_effect_estimate = mean(effect_predicted))
# A tibble: 1 × 1
  average_effect_estimate
                    <dbl>
1                    1.13

Closing thoughts

Outcome modeling is a powerful strategy because it connects nonparametric causal identification with the longstanding practice of modeling outcomes by parametric regression.

Here are a few things you could try next:

  • replace step (1) with another approach to estimate conditional mean outcomes, such as a different functional form or a machine learning method
  • evaluate performance over many repeated simulations
  • evaluate performance at different simulated sample sizes
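One way to set up these extensions is sketched below in base R. Everything named here is hypothetical and introduced only for illustration: `simulate_data()` is a made-up data-generating process (not the one behind the tutorial's data file), `estimate_ate()` wraps the three steps with the model formula as a swappable argument, and the sample sizes and number of repetitions are arbitrary choices.

```r
# Hypothetical data-generating process. The true average effect is 1:
# the unit-level effect is 1 - 0.3 * L1 and L1 has mean zero.
simulate_data <- function(n) {
  L1 <- rnorm(n)
  L2 <- rnorm(n)
  A <- rbinom(n, 1, plogis(L1 + L2))
  Y <- L1 + L2 + A * (1 - 0.3 * L1) + rnorm(n)
  data.frame(L1, L2, A, Y)
}

# Steps 1 to 3 in one function; `formula` is the piece to swap out
# when trying a different functional form
estimate_ate <- function(data, formula = Y ~ A * (L1 + L2)) {
  fit <- lm(formula, data = data)
  y1 <- predict(fit, newdata = transform(data, A = 1))
  y0 <- predict(fit, newdata = transform(data, A = 0))
  mean(y1 - y0)
}

set.seed(3)
sample_sizes <- c(100, 500, 2000)  # arbitrary illustrative choices
n_reps <- 200                      # repetitions per sample size
true_ate <- 1

# For each sample size, repeat simulate-then-estimate and summarize
results <- do.call(rbind, lapply(sample_sizes, function(n) {
  estimates <- replicate(n_reps, estimate_ate(simulate_data(n)))
  data.frame(
    n = n,
    bias = mean(estimates) - true_ate,
    sd = sd(estimates),
    rmse = sqrt(mean((estimates - true_ate)^2))
  )
}))
results
```

To compare functional forms, call `estimate_ate()` with another formula, for example `Y ~ A + L1 + L2`, inside the same loop. A machine learning method would need the `lm()` call replaced with that method's own fitting and prediction functions.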

Exercise

Try outcome modeling with the realistic simulation data.