Machine Learning as an Outcome Model

Here are slides on the content and slides on an exercise.

What is machine learning?

A supervised machine learning algorithm \(\hat{f}\) is an input-output machine.

  • Input is predictors, e.g. some variables \(\{\vec{X},A\}\)
  • Output is a prediction, e.g. \(\hat{f}(\vec{X},A) = \hat{Y}\)

My favorite machine learning algorithm is Ordinary Least Squares (OLS). Why is this a machine learning algorithm?

  1. Using some data, it learns the values of the coefficients \(\hat{\vec\beta}\)
  2. It has a loss function: minimize the mean squared prediction error \(\frac{1}{n}\sum_{i}(\hat{Y}_i - Y_i)^2\)
    • Thus Ordinary Least Squares
  3. With \(\hat{\vec\beta}\), we can make predictions in new data
    • For example, with a treatment value changed
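To make the input-output idea concrete, here is a minimal pure-Python sketch of OLS as a machine learning algorithm (the course materials use R; this is an illustration, not the course code). It learns \(\hat{\vec\beta}\) from data by solving the normal equations, and the learned machine then predicts in new data, including with the treatment value changed.

```python
# A sketch of OLS as an input-output machine. Illustrative only; in
# practice you would use lm() in R or an equivalent library routine.

def fit_ols(X, y):
    """Learn coefficients beta-hat by solving the normal equations X'X b = X'y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    # Solve the k x k system by Gauss-Jordan elimination with partial pivoting
    aug = [xtx[i] + [xty[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]

def predict(beta, x, a):
    """The learned machine f-hat: input (x, a), output a prediction y-hat."""
    return beta[0] + beta[1] * x + beta[2] * a

# Toy data generated from y = 1 + 2x + 3a (no noise, so OLS recovers it exactly)
units = [(x, a) for x in (0.0, 1.0, 2.0, 3.0) for a in (0, 1)]
X = [[1.0, x, a] for x, a in units]          # leading 1s for the intercept
y = [1.0 + 2.0 * x + 3.0 * a for x, a in units]

beta = fit_ols(X, y)
print([round(b, 6) for b in beta])           # approximately [1.0, 2.0, 3.0]
# Prediction in new data with the treatment value changed:
print(predict(beta, 2.0, 1) - predict(beta, 2.0, 0))  # approximately 3.0
```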

OLS is one example of a broad set of machine learning algorithms that minimize mean squared prediction error. These algorithms can all be interpreted as estimators of the conditional mean function. Why? Because the prediction function that minimizes mean squared prediction error is the true (but unknown) conditional mean function \(\text{E}(Y\mid\vec{X},A)\).

\[ \hat{f}(\vec{X},A) \approx \text{E}(Y\mid \vec{X}, A) \]
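To see why, note that for any candidate prediction function \(f\), the expected squared error decomposes (the cross term vanishes by iterated expectations) as

\[ \text{E}\big[(Y - f(\vec{X},A))^2\big] = \text{E}\big[(Y - \text{E}(Y\mid\vec{X},A))^2\big] + \text{E}\big[(\text{E}(Y\mid\vec{X},A) - f(\vec{X},A))^2\big] \]

The first term does not involve \(f\), and the second term is minimized by choosing \(f(\vec{X},A) = \text{E}(Y\mid\vec{X},A)\).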

Machine learning for causal inference

For causal inference under measured confounding, we need to estimate \(\text{E}(Y\mid A,\vec{X})\). A machine learning algorithm can be our estimator.

  1. Use data to learn the algorithm \(\hat{f}\)
  2. Predict \(\hat{Y}_i^a = \hat{f}(\vec{X}_i,a)\) for each unit \(i\) under treatment value \(a\)
  3. Difference and average over units \[ \hat{\text{E}}(Y^1 - Y^0) = \frac{1}{n}\sum_i \bigg(\underbrace{\hat{f}(\vec{x}_i,1)}_{\substack{\text{Predicted Outcome}\\\text{Under Treatment}}} - \underbrace{\hat{f}(\vec{x}_i,0)}_{\substack{\text{Predicted Outcome}\\\text{Under Control}}}\bigg) \]
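The three steps above can be sketched as follows. This toy example uses hypothetical data and a deliberately simple "algorithm" (stratum-specific means, which estimate \(\text{E}(Y\mid\vec{X},A)\) when \(\vec{X}\) is discrete); it is an illustration of the recipe, not the estimator you would use in practice.

```python
# A sketch of the plug-in (g-computation) recipe with a simple learner.
from collections import defaultdict

def learn_f(data):
    """Step 1: learn f-hat from data (here, mean of Y within each (x, a) stratum)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, a, y in data:
        sums[(x, a)] += y
        counts[(x, a)] += 1
    return lambda x, a: sums[(x, a)] / counts[(x, a)]

# Hypothetical data: confounder X in {0, 1}, treatment A, outcome Y
data = [
    (0, 0, 1.0), (0, 0, 1.2), (0, 1, 2.0), (0, 1, 2.2),
    (1, 0, 3.0), (1, 0, 3.2), (1, 1, 5.0), (1, 1, 5.2),
]
f_hat = learn_f(data)

# Steps 2-3: predict Y for every unit under a = 1 and a = 0, difference, average
ate_hat = sum(f_hat(x, 1) - f_hat(x, 0) for x, _, _ in data) / len(data)
print(ate_hat)  # approximately 1.5
```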

Note that this all rests on the assumption that \(\vec{X}\) is a sufficient adjustment set to block all confounding (see page on Directed Acyclic Graphs).

An example

A widely cited early application is Hill (2011), which used Bayesian Additive Regression Trees (BART) to model the response surface and then used the model's predictions to estimate average causal effects and many conditional average effects. By outsourcing the functional form to an algorithm, approaches like this free the researcher to focus on the causal question and the DAG rather than on the assumed functional form of statistical relationships. These algorithmic approaches often performed well in competitions in which statisticians applied a series of estimators to simulated data to see who would come closest to the true causal effects, which are known in simulation (see Dorie et al. 2019). More recently, new developments have extended tree and forest estimators to explicitly address causal questions (e.g., Athey & Imbens 2016).

Exercise

The file dorie.csv contains one of the simulated datasets from Dorie et al. (2019). I updated this file to randomly assign a variable named set, taking the values train or test, so that you can easily practice a train-test split. The task with these data is to estimate the Sample Average Treatment Effect. The code dorie_example_code.R will get you started if you want to attempt this task with various estimators.
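If you attempt the exercise in Python rather than R, the split pattern looks like the sketch below. The set column comes from the text above; the other column names (z, y, x1) are assumptions for illustration only, so check the actual columns in dorie.csv. A small in-memory CSV stands in for the file here so the sketch is self-contained.

```python
# A hedged sketch of reading the data and forming the train/test split.
import csv
import io

# Stand-in for open("dorie.csv"); hypothetical columns except "set"
csv_text = """set,z,y,x1
train,1,3.0,0.2
train,0,1.0,0.5
test,1,3.5,0.1
test,0,1.2,0.4
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
train = [r for r in rows if r["set"] == "train"]
test = [r for r in rows if r["set"] == "test"]
print(len(train), len(test))  # 2 2

# Fit your outcome model on `train`, then evaluate its predictions on `test`.
```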

What to learn next

You may want to learn more algorithms for prediction, or you may want to learn more advanced ways to use them for causal inference.

Algorithms for prediction

This is the topic of many data science courses, and is not included on this website. But you can see the Soc 212B page on algorithms for prediction for an intro to some prediction algorithms.

Advanced ways to use algorithms for causal inference

Machine learning estimators are typically biased: to avoid high-variance predictions, these algorithms regularize estimates toward the overall mean. A familiar statistical estimator with this kind of bias is the hierarchical (multilevel) linear model, which shrinks group-specific estimates toward the grand mean.
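A toy illustration of this regularization bias (not from the text): ridge regression shrinks the estimated slope toward zero, which pulls predictions toward the overall mean of the outcome. With a single centered predictor, the ridge slope has the closed form \(\sum_i x_i y_i \,/\, (\sum_i x_i^2 + \lambda)\).

```python
# Ridge regression with one predictor: larger lambda means more shrinkage,
# hence more bias toward the mean prediction (and lower variance).

def ridge_slope(x, y, lam):
    xbar = sum(x) / len(x)
    ybar = sum(y) / len(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / (sxx + lam)

x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]          # exactly linear with true slope 2
print(ridge_slope(x, y, 0.0))     # 2.0 (lambda = 0 is just OLS)
print(ridge_slope(x, y, 5.0))     # 1.0 (biased toward the mean prediction)
```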

Biased prediction algorithms can make causal effect estimates biased, and the bias may disappear slowly as the sample size grows. For this reason, more advanced approaches combine a model for treatment probabilities with a model for the conditional mean outcomes to make the bias disappear faster. This is not covered in this workshop, but you can learn more on the Soc 212B course page on doubly robust estimation.
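As a preview of that material, one standard estimator of this type is augmented inverse probability weighting (AIPW), which estimates each mean potential outcome as

\[ \hat{\text{E}}(Y^a) = \frac{1}{n}\sum_i \left[ \hat{f}(\vec{x}_i,a) + \frac{\mathbb{1}(A_i = a)}{\hat{\pi}_a(\vec{x}_i)}\Big(Y_i - \hat{f}(\vec{x}_i,a)\Big) \right] \]

where \(\hat{\pi}_a(\vec{x})\) is an estimate of the treatment probability \(\text{P}(A = a\mid\vec{X} = \vec{x})\). The second term uses the treatment model to correct residual bias in the outcome model's predictions, which is what makes the estimator doubly robust.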