library(dplyr)
Simulate Data
The code below will generate a dataset of \(n = 100\) observations. Each observation contains several observed variables:
L1
A numeric confounderL2
A numeric confounderA
A binary treatmentY
A numeric outcome
Each observation also contains outcomes that we know only because the data are simulated. These variables are useful as ground truth in simulations.
propensity_score
The true propensity score \(P(A = 1 \mid \vec{L})\)Y0
The potential outcome under controlY1
The potential outcome under treatment
To run this code, you will need the dplyr
package. If you don’t have it, first run the line install.packages("dplyr")
in your R console. Then, add this line to your R script to load the package.
If you want your simulation to match our numbers exactly, add a line to set your seed.
set.seed(90095)
<- 500
n <- tibble(
data L1 = rnorm(n),
L2 = rnorm(n)
|>
) # Generate potential outcomes as functions of L
mutate(Y0 = rnorm(n(), mean = L1 + L2, sd = 1),
Y1 = rnorm(n(), mean = Y0 + 1, sd = 1)) |>
# Generate treatment as a function of L
mutate(propensity_score = plogis(-2 + L1 + L2)) |>
mutate(A = rbinom(n(), 1, propensity_score)) |>
# Generate factual outcome
mutate(Y = case_when(A == 0 ~ Y0,
== 1 ~ Y1)) A
A simulation is nice because the answer is known. In this simulation, the conditional average causal effect of A
on Y
equals 1 at any value of L1
and L_2
.