Inverse Probability of Treatment Weighting

Concept of weighting

Consider an experiment with 5 people.

  • Persons (1,2) are a subgroup. Person 1 is treated.
  • Persons (3,4,5) are a subgroup. Person 3 is treated.
ID   \(x_i\)   \(\text{P}(A = 1\mid X = x_i)\)   \(A\)   \(Y^1\)   \(Y^0\)
1    1         \(\frac{1}{2}\)                   1       10        ?
2    1         \(\frac{1}{2}\)                   0       ?         8
3    2         \(\frac{1}{3}\)                   1       15        ?
4    2         \(\frac{1}{3}\)                   0       ?         9
5    2         \(\frac{1}{3}\)                   0       ?         12

What is the average outcome under treatment, \(\text{E}(Y^1)\), taken over all 5 people? We do not observe all 5 values of \(Y^1\). But we do observe a sample (Persons 1 and 3), and this sample was drawn with unequal probabilities.

How many \(Y^1\) values does person 1’s outcome \(Y_1^1\) represent?

Person 1 represents \(w_1 = 2\) people.

  • We observed their treated outcome \(Y_1^1\) with probability \(1 / 2\).
  • Their inverse probability of treatment weight is \(\frac{1}{1 / 2} = 2\).

How many \(Y^1\) values does person 3’s outcome \(Y_3^1\) represent?

Person 3 represents \(w_3 = 3\) people.

  • We observed their treated outcome \(Y_3^1\) with probability \(1 / 3\).
  • Their inverse probability of treatment weight is \(\frac{1}{1 / 3} = 3\).

When estimating the mean outcome under control \(\text{E}(Y^0)\), we would do the same thing except that the denominator is the probability of being untreated.
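Applying these weights to the five-person table gives concrete estimates. The treated units are Persons 1 and 3, with weights 2 and 3:

\[ \hat{\text{E}}\left(Y^1\right) = \frac{2\times 10 + 3\times 15}{2 + 3} = \frac{65}{5} = 13 \]

For the mean under control, the untreated units are Person 2 (untreated with probability \(1/2\), so weight \(2\)) and Persons 4 and 5 (each untreated with probability \(2/3\), so weight \(3/2\)):

\[ \hat{\text{E}}\left(Y^0\right) = \frac{2\times 8 + \frac{3}{2}\times 9 + \frac{3}{2}\times 12}{2 + \frac{3}{2} + \frac{3}{2}} = \frac{47.5}{5} = 9.5 \]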

In math

More generally, define a function that takes a treatment value \(a\) and confounder vector \(\vec{x}\) and returns an inverse probability of treatment weight.

\[ w(a,\vec{x}) = \frac{1}{\text{P}(A = a \mid\vec{X} = \vec{x})} \]

The weight \(w(a,\vec{x})\) answers the question: if I see a person \(i\) with \(A_i = a\) and I want to study a counterfactual pseudo-population in which everyone receives treatment value \(a\), how many people does person \(i\) represent?

Inverse probability of treatment weights produce an estimator for the potential outcome under any particular treatment value \(a\),

\[ \hat{\text{E}}\left(Y^a\right) = \frac{\sum_{i:A_i=a} Y_i w(A_i,\vec{X}_i)}{\sum_{i:A_i=a}w(A_i,\vec{X}_i)} \]

which is the weighted sample mean of \(Y_i\) among the units with \(A_i = a\).
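As a check, here is a minimal R sketch of this estimator applied to the five-person example above. The data frame `toy` and the helper `estimate_mean` are illustrative names, not part of any package; `pi` holds the probability of each person's observed treatment value.

```r
# The five-person example, entered by hand
toy <- data.frame(
  A  = c(1, 0, 1, 0, 0),
  Y  = c(10, 8, 15, 9, 12),
  pi = c(1/2, 1/2, 1/3, 2/3, 2/3)  # P(A = observed value | X)
)

# Weighted sample mean of Y among units with A = a,
# weighted by the inverse probability of treatment
estimate_mean <- function(data, a) {
  with(subset(data, A == a), weighted.mean(Y, w = 1 / pi))
}

estimate_mean(toy, 1)  # 13
estimate_mean(toy, 0)  # 9.5
```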

Observational studies

In randomized experiments, the treatment assignment probabilities \(\text{P}(A = a\mid \vec{X} = \vec{x})\) are known. In observational studies, these probabilities are not known. Instead, we have to estimate them.

A common way to estimate these probabilities for a binary treatment is logistic regression.

\[ \text{logit}\left(\text{P}(A = 1\mid\vec{X} = \vec{x})\right) = \alpha + \vec{x}'\vec\beta \]

But this is not the only way. Any prediction function \(f:\vec{x}\rightarrow\hat{\text{P}}(A = 1\mid \vec{X} = \vec{x})\) that takes \(\vec{x}\) and returns a predicted probability can serve this role. For example, you could use a random forest, as in Lee et al. (2009).
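A rough sketch of the random forest approach, assuming the randomForest package is installed (the simulated confounders here are illustrative, not the data used later in this page):

```r
library(randomForest)

set.seed(1)
# Illustrative simulated data: two confounders and a binary treatment
sim <- data.frame(L1 = rnorm(500), L2 = rnorm(500))
sim$A <- rbinom(500, 1, plogis(-1 + sim$L1 + sim$L2))

# Fit a classification forest for treatment given confounders
forest <- randomForest(factor(A) ~ L1 + L2, data = sim)

# Out-of-bag predicted probability that A = 1 for each unit
p_hat <- predict(forest, type = "prob")[, "1"]
```

The predicted probabilities `p_hat` then play the same role as the fitted values from logistic regression in the weighting steps below.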

Weighting in code

Because the causal effect of \(A\) on \(Y\) is identified by adjusting for the confounders \(L_1\) and \(L_2\), we can estimate it by inverse probability of treatment weighting.

  1. Model \(P(A = 1\mid L_1, L_2)\), the conditional probability of treatment given confounders
  2. Predict the conditional probability \(\pi\) of each unit’s observed treatment
    • if A = 1, predict \(\pi = P(A = 1 \mid L_1,L_2)\)
    • if A = 0, predict \(\pi = P(A = 0 \mid L_1,L_2)\)
  3. Aggregate to the average causal effect
    • estimate the expected outcome under treatment \(E(Y^1)\) by a weighted mean of treated units’ outcomes, weighted by \(\frac{1}{\pi}\)
    • estimate the expected outcome under control \(E(Y^0)\) by a weighted mean of untreated units’ outcomes, weighted by \(\frac{1}{\pi}\)
    • estimate the average causal effect by the difference

First, we load simulated data.

library(tidyverse)
data <- read_csv("https://ilundberg.github.io/causalestimators/data/data.csv")

1) Model

The code below uses logistic regression to model the conditional probability of treatment.

model <- glm(
  A ~ L1 + L2, 
  data = data, 
  family = binomial
)

Call:
glm(formula = A ~ L1 + L2, family = binomial, data = data)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.0740     0.1737 -11.938  < 2e-16 ***
L1            1.0845     0.1639   6.618 3.65e-11 ***
L2            1.1877     0.1548   7.673 1.68e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 474.41  on 499  degrees of freedom
Residual deviance: 354.86  on 497  degrees of freedom
AIC: 360.86

Number of Fisher Scoring iterations: 5

2) Predict

The code below predicts the conditional probability of each unit’s observed treatment value, also known as the propensity score.

predicted <- data |>
  # Predict the probabilities that A = 1 and A = 0
  mutate(
    p_A_equals_1 = predict(model, type = "response"),
    p_A_equals_0 = 1 - p_A_equals_1
  ) |>
  # Assign the propensity score based on the observed treatment
  mutate(
    pi = case_when(
      A == 1 ~ p_A_equals_1,
      A == 0 ~ p_A_equals_0
    )
  )
predicted
# A tibble: 500 × 10
        L1     L2      Y0    Y1 propensity_score     A       Y p_A_equals_1
     <dbl>  <dbl>   <dbl> <dbl>            <dbl> <dbl>   <dbl>        <dbl>
1  0.00304  1.03   0.677   1.59          0.276       0  0.677       0.300  
2 -2.35    -1.66  -4.09   -3.53          0.00244     0 -4.09        0.00136
3  0.104   -0.912  0.0659  1.31          0.0569      0  0.0659      0.0455 
# ℹ 497 more rows
# ℹ 2 more variables: p_A_equals_0 <dbl>, pi <dbl>

3) Aggregate

The final step is to aggregate to an average causal effect estimate.

aggregated_Y1 <- predicted |>
  # Restrict to cases with A == 1
  filter(A == 1) |>
  # Calculate the weighted mean outcome
  summarize(estimate = weighted.mean(Y, w = 1 / pi))

aggregated_Y0 <- predicted |>
  # Restrict to cases with A == 0
  filter(A == 0) |>
  # Calculate the weighted mean outcome
  summarize(estimate = weighted.mean(Y, w = 1 / pi))

average_effect_estimate <- aggregated_Y1 - aggregated_Y0
average_effect_estimate
  estimate
1 1.288555

Reminder: Sampling connection

Inverse probability of treatment weighting is a powerful strategy because it bridges nonparametric causal identification to longstanding strategies from survey sampling.

  • In sampling, unit \(i\) is sampled (\(S_i =1\)) with probability \(\pi_i\) and has inverse probability weight \(\frac{1}{\pi_i}\) (number of population members’ \(Y\) values represented)
  • In causal inference, unit \(i\) is treated (\(A_i=1\)) with probability \(\pi_i\) and has inverse probability of treatment weight \(\frac{1}{\pi_i}\) (number of sample members’ \(Y^1\) values represented)

You can also put the two together: if a unit is sampled with probability \(\pi_i\) and assigned treatment with probability \(\eta_i\), they represent \(\frac{1}{\pi_i\eta_i}\) \(Y^1\) values in the population. This reasoning is useful for generalizing from sample average causal effects to population average causal effects. If you are interested in generalizing sample average causal effects to a population average or to other populations, you may be interested in the framework of Egami & Hartman (2023). You may also be interested in examples motivated by specific applications; one such example is Cole and Stuart (2010).
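As a concrete instance: if unit \(i\) is sampled with probability \(\pi_i = \frac{1}{4}\) and, once sampled, treated with probability \(\eta_i = \frac{1}{2}\), then

\[ w_i = \frac{1}{\pi_i\,\eta_i} = \frac{1}{(1/4)(1/2)} = 8, \]

so unit \(i\)’s observed \(Y^1\) stands in for the \(Y^1\) values of 8 population members.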

Exercise

Try inverse probability of treatment weighting with the realistic simulated data.