Simulate Data

The code below will generate a dataset of \(n = 500\) observations. Each observation contains several observed variables: two confounders (L1 and L2), a binary treatment (A), and an outcome (Y).

Each observation also contains variables that we know only because the data are simulated: the potential outcomes (Y0 and Y1) and the propensity score (propensity_score). These variables are useful as ground truth in simulations.

To run this code, you will need the dplyr package. If you don’t have it, first run the line install.packages("dplyr") in your R console. Then, add this line to your R script to load the package.

library(dplyr)

If you want your simulation to match our numbers exactly, add a line to set your seed.

set.seed(90095)
n <- 500
data <- tibble(
  L1 = rnorm(n),
  L2 = rnorm(n)
) |>
  # Generate potential outcomes as functions of L
  mutate(Y0 = rnorm(n(), mean = L1 + L2, sd = 1),
         Y1 = rnorm(n(), mean = Y0 + 1, sd = 1)) |>
  # Generate treatment as a function of L
  mutate(propensity_score = plogis(-2 + L1 + L2)) |>
  mutate(A = rbinom(n(), 1, propensity_score)) |>
  # Generate factual outcome
  mutate(Y = case_when(A == 0 ~ Y0,
                       A == 1 ~ Y1))

A simulation is nice because the answer is known. In this simulation, Y1 is drawn with a mean of Y0 + 1, so the conditional average causal effect of A on Y equals 1 at any value of L1 and L2.
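
One way to convince yourself of this claim is to check it numerically. The sketch below is only an illustration: it rewrites the same data-generating process in base R inside a hypothetical helper called simulate_data() (that name, the object big, and the sample size of 100,000 are our assumptions, not part of the code above), so that it runs on its own. It then compares the ground-truth effect with a naive difference in means and with a regression that adjusts for L1 and L2.

```r
# Hypothetical helper (not part of the original code): the same
# data-generating process as above, written in base R.
simulate_data <- function(n) {
  L1 <- rnorm(n)
  L2 <- rnorm(n)
  # Potential outcomes as functions of L
  Y0 <- rnorm(n, mean = L1 + L2, sd = 1)
  Y1 <- rnorm(n, mean = Y0 + 1, sd = 1)
  # Treatment as a function of L
  propensity_score <- plogis(-2 + L1 + L2)
  A <- rbinom(n, 1, propensity_score)
  # Factual outcome
  Y <- ifelse(A == 1, Y1, Y0)
  data.frame(L1, L2, Y0, Y1, propensity_score, A, Y)
}

set.seed(90095)
# A large sample (an assumption made here) so sampling noise is small
big <- simulate_data(100000)

# Ground truth: the average of Y1 - Y0 should be close to 1
true_ate <- mean(big$Y1 - big$Y0)

# The same average within subgroups of L1 should also be close to 1
ate_by_L1 <- tapply(big$Y1 - big$Y0, big$L1 > 0, mean)

# Naive difference in observed means by treatment: confounded by L1 and L2
naive <- mean(big$Y[big$A == 1]) - mean(big$Y[big$A == 0])

# Linear regression adjusting for L1 and L2: coefficient on A
adjusted <- unname(coef(lm(Y ~ A + L1 + L2, data = big))["A"])

print(c(true_ate = true_ate, naive = naive, adjusted = adjusted))
print(ate_by_L1)
```

The naive comparison should land well above 1, because treated units tend to have larger values of L1 and L2 and therefore larger values of Y0. The adjusted estimate should be close to 1. The linear adjustment works here only because the simulated outcome is linear in L1 and L2; it is not a general recipe.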