Description: Using models to describe
This page introduces an ongoing collaborative project by Ian Lundberg and Kristin Liao at UCLA.
Descriptive research summarizes the world as it exists. Description may not require a model—the mean of an outcome in a simple random sample can be a powerful form of description. This tutorial first considers model-free description and then pivots to a view of model-based description.
We take a \(\hat{Y}\) view as opposed to a \(\hat\beta\) view of model-based description. This view is in some sense both radical and conventional.
- radical: we never report \(\hat\beta\); we report only \(\hat{Y}\)
- conventional: we consider a model as a tool to estimate subgroup means
Because our view pushes beyond \(\hat\beta\), an additional upside is that it opens the door to machine learning estimators for description that may not be parameterized by coefficients.
Concrete setting
Using data from the 2010–2019 American Community Survey (ACS), we describe sex gaps in pay. We focus on the subgroup of adults ages 30–50 who worked for pay full-time (35+ hours per week) and for the full year (50+ weeks). Our outcome \(Y\) is annual wage and salary income. We summarize by the geometric mean (the exponentiated mean of log income), and we report the female / male ratio of geometric mean pay.
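To make the reported quantity concrete, the ratio can be written on the log scale. Here \(\text{GM}\) stands for the geometric mean, and \(\text{Female}\) and \(\text{Male}\) are shorthand labels for the two subgroups, introduced only for this illustration.
\[\frac{\text{GM}(Y\mid\text{Female})}{\text{GM}(Y\mid\text{Male})} = \frac{\text{exp}\left(E(\text{log}(Y)\mid\text{Female})\right)}{\text{exp}\left(E(\text{log}(Y)\mid\text{Male})\right)} = \text{exp}\left(E(\text{log}(Y)\mid\text{Female}) - E(\text{log}(Y)\mid\text{Male})\right)\]
The ratio of geometric means is therefore the exponentiated difference in mean log income between the two groups.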
Model-free description
Let \(Y\) be the income of a randomly sampled person from our population. With a large sample, one could estimate the geometric mean of \(Y\) by exponentiating the sample mean of log income.
\[\widehat{\text{GM}}(Y) = \text{exp}\left(\frac{1}{n}\sum_{i=1}^n \text{log}(y_i)\right)\]
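As a minimal sketch of the estimator above, the Python snippet below exponentiates the mean of logged values. The function name `geometric_mean` and the three incomes are hypothetical, made up for illustration rather than taken from the ACS. The check for positive values is an assumption of this sketch, added because the log is undefined at zero.

```python
import math

def geometric_mean(y):
    """Exponentiated mean of log values; assumes every value is strictly positive."""
    if len(y) == 0:
        raise ValueError("need at least one value")
    if any(v <= 0 for v in y):
        raise ValueError("geometric mean requires strictly positive values")
    return math.exp(sum(math.log(v) for v in y) / len(y))

# Hypothetical incomes, invented for illustration (not ACS data)
incomes = [20_000, 40_000, 80_000]
gm = geometric_mean(incomes)           # cube root of the product of the three values
arithmetic_mean = sum(incomes) / len(incomes)
```

For right-skewed values like these, the geometric mean sits below the arithmetic mean because the log scale gives less weight to the largest incomes.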
We next consider a subgroup summary: the geometric mean of income among 30-year-old female respondents. Letting \(\vec{X}\) denote the values of these two features (sex and age) for a randomly sampled person and \(\vec{x}\) denote the particular values of interest, we could estimate this quantity by applying the same estimator within the target subgroup.
\[\widehat{\text{GM}}(Y\mid\vec{X} = \vec{x}) = \text{exp}\left(\frac{1}{n_{\vec{x}}}\sum_{i:\vec{X}_i=\vec{x}} \text{log}(y_i)\right)\] where the sum is over people whose feature vector \(\vec{X}_i\) equals the target value \(\vec{x}\) (e.g., female respondents age 30) and the number of people in the subgroup is \(n_{\vec{x}}\).
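The subgroup estimator can be sketched the same way: keep only the records whose features match the target values, then exponentiate the mean of their logged incomes. The records, the field names (`female`, `age`, `income`), and the helper `subgroup_geometric_mean` are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical records, invented for illustration (not ACS data)
people = [
    {"female": 1, "age": 30, "income": 30_000},
    {"female": 1, "age": 30, "income": 60_000},
    {"female": 1, "age": 31, "income": 50_000},
    {"female": 0, "age": 30, "income": 45_000},
    {"female": 0, "age": 30, "income": 55_000},
    {"female": 1, "age": 30, "income": 120_000},
]

def subgroup_geometric_mean(records, target):
    """Geometric mean of income among records whose features equal `target`.

    Returns the estimate together with the subgroup size n_x.
    """
    logs = [
        math.log(r["income"])
        for r in records
        if all(r[key] == value for key, value in target.items())
    ]
    n_x = len(logs)
    if n_x == 0:
        raise ValueError("no cases in the target subgroup")
    return math.exp(sum(logs) / n_x), n_x

gm_f30, n_f30 = subgroup_geometric_mean(people, {"female": 1, "age": 30})
```

Returning the subgroup size alongside the estimate makes it easy to see when an estimate rests on only a handful of cases.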
Small sample sizes become a problem for model-free subgroup description. Even in a large sample, there may be few female respondents who are 30 years old.
Model-based description
In a sample with very few 30-year-old female respondents, one might ask whether other respondents are informative. Perhaps 31-year-old female respondents or 30-year-old male respondents provide data that tell us something about the pay of 30-year-old female respondents.
For us, a model is a tool to pool information from units outside the target subgroup in order to produce a better estimate within the target subgroup.
Formally, let \(\hat{f}()\) be a learned model: a function that maps a feature vector \(\vec{x}\) to a predicted outcome \(\hat{f}(\vec{x})\). The predicted value is an estimate of some summary of the conditional distribution of \(Y\) among those with the feature set \(\vec{X} = \vec{x}\).
For example, we might fit a linear regression model for log income.
\[\begin{aligned} &\widehat{E}(\text{log}(Y)\mid \vec{X} = \vec{x}) \\&= \vec{x}'\hat{\vec\beta} \\&= \hat\beta_0 + \hat\beta_1(\text{Female}) + \hat\beta_2(\text{Age}) + \hat\beta_3(\text{Female}\times\text{Age}) \end{aligned}\]
The prediction function for geometric mean pay would then be the exponentiated value of predicted log pay.
\[\widehat{\text{GM}}(Y\mid \vec{X} = \vec{x}) = \hat{f}(\vec{x}) = \text{exp}(\vec{x}'\hat{\vec\beta})\]
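The following is a minimal sketch of this prediction function, assuming NumPy is available. It fits the interacted linear model for log income by least squares and exponentiates the fitted values. The data are simulated with made-up coefficients (they are not ACS estimates), and the helper names `design` and `f_hat` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated data under made-up parameters (not ACS data)
female = rng.integers(0, 2, size=n)   # 1 = female, 0 = male
age = rng.integers(30, 51, size=n)    # integer ages 30 through 50
log_y = (10.0 + 0.05 * female + 0.02 * age - 0.005 * female * age
         + rng.normal(0.0, 0.5, size=n))

def design(female, age):
    """Design matrix with columns: intercept, Female, Age, Female x Age."""
    female = np.atleast_1d(np.asarray(female, dtype=float))
    age = np.atleast_1d(np.asarray(age, dtype=float))
    return np.column_stack([np.ones_like(female), female, age, female * age])

# Least-squares estimate of the coefficient vector for log income
X = design(female, age)
beta_hat, *_ = np.linalg.lstsq(X, log_y, rcond=None)

def f_hat(female, age):
    """Predicted geometric mean income: exponentiated predicted log income."""
    return np.exp(design(female, age) @ beta_hat)

# Report predictions rather than coefficients
gm_f30 = float(f_hat(1, 30)[0])
gm_m30 = float(f_hat(0, 30)[0])
ratio = gm_f30 / gm_m30               # female / male ratio at age 30
```

Only `f_hat` is used for reporting. The coefficients stay inside the function, in keeping with the \(\hat{Y}\) view, and every simulated respondent contributes to the prediction for 30-year-old women.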
The choice
We would prefer
- model-free description when there are enough cases
- model-based description when data are scarce, as long as our model pools information effectively
Data can help us decide!
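One way (among others) to let data inform the choice is to compare how well each estimator predicts log income for held-out cases. The sketch below assumes NumPy and uses simulated data with made-up parameters. The simulated truth deliberately includes a curve in age that the linear model cannot capture, so the comparison is an illustration under these assumptions rather than a general result. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
AGES = np.arange(30, 51)

def simulate(n):
    """Simulate sex, age, and log income under made-up parameters (not ACS data)."""
    female = rng.integers(0, 2, size=n)
    age = rng.integers(30, 51, size=n)
    # Assumed truth: a line in age for each sex plus a curve the OLS model omits
    mean_log_y = (10.0 + 0.05 * female + 0.02 * age - 0.005 * female * age
                  - 0.003 * (age - 40) ** 2)
    return female, age, mean_log_y + rng.normal(0.0, 0.5, size=n)

def design(female, age):
    """Design matrix with columns: intercept, Female, Age, Female x Age."""
    return np.column_stack([np.ones(len(age)), female, age, female * age])

def holdout_mse(n_train, n_test=20_000):
    """Held-out mean squared error on log income for each estimator."""
    f_tr, a_tr, y_tr = simulate(n_train)
    f_te, a_te, y_te = simulate(n_test)

    # Model-based: OLS pools information across all sex-by-age cells
    beta_hat, *_ = np.linalg.lstsq(design(f_tr, a_tr), y_tr, rcond=None)
    pred_model = design(f_te, a_te) @ beta_hat

    # Model-free: mean log income within each cell; an empty training cell
    # falls back to the overall training mean (a choice made for this sketch)
    pred_free = np.full(n_test, y_tr.mean())
    for f in (0, 1):
        for a in AGES:
            cell = (f_tr == f) & (a_tr == a)
            if cell.any():
                pred_free[(f_te == f) & (a_te == a)] = y_tr[cell].mean()

    mse_free = float(np.mean((y_te - pred_free) ** 2))
    mse_model = float(np.mean((y_te - pred_model) ** 2))
    return mse_free, mse_model

mse_free_small, mse_model_small = holdout_mse(n_train=100)
mse_free_large, mse_model_large = holdout_mse(n_train=20_000)
```

In this setup the expectation is that the model-based estimator predicts better with 100 training cases, where cell means are noisy, and that the model-free estimator predicts better with 20,000 training cases, where the linear model's misspecification dominates.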
What comes next:
- first generate some simulated data
- then apply a model-free estimator
- then apply an OLS model-based estimator
- then apply a more flexible spline estimator