Evaluate Models
This page presents a \(\hat{Y}\) view of what it means for one model to outperform another model. We first discuss a simulated setting in which we generate many samples from the population and directly observe the performance of estimators across those simulated samples. Then, we discuss how one can evaluate performance in the more realistic setting in which only one sample from the population is available.
To run the code on this page, you will need the tidyverse package.
library(tidyverse)
We will also set the seed so that it is possible to exactly reproduce these results.
set.seed(90095)
Simulation: Many samples from the population
- simulate a sample of 100 cases
- apply each model
- compare to the full-data estimate for the target subgroup
See the previous page for code to simulate a sample.
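The simulate() function is defined on that page. If you are working through this page on its own, the following is a purely hypothetical stand-in with the same interface (a tibble with sex, age, and income columns); its numbers are made up and will not reproduce the results shown below.
# Hypothetical stand-in for simulate() from the previous page.
# The real function draws from an ACS-based data generating process;
# this placeholder only mimics the interface so the later code can run.
simulate <- function(n) {
  tibble(
    sex = sample(c("female", "male"), size = n, replace = TRUE),
    age = sample(18:65, size = n, replace = TRUE),
    income = exp(rnorm(n, mean = 10 + 0.01 * age + 0.1 * (sex == "male"), sd = 0.8))
  )
}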
Estimator function
The function below is an estimator: it takes data in and returns estimates. This function can be applied with the flat, linear, and quadratic models defined on the previous page.
estimator <- function(
    data,       # from simulate()
    model_name  # one of "flat", "linear", "quadratic"
) {
  # Estimate a regression model
  if (model_name == "flat") {
    fit <- lm(log(income) ~ sex, data = data)
  } else if (model_name == "linear") {
    fit <- lm(log(income) ~ sex * age, data = data)
  } else if (model_name == "quadratic") {
    fit <- lm(log(income) ~ sex * poly(age, 2), data = data)
  }
  # Define x-values at which to make predictions
  to_predict <- tibble(
    sex = c("female", "male"),
    age = c(30, 30)
  )
  # Make predictions
  predicted <- to_predict |>
    mutate(estimate = predict(fit, newdata = to_predict)) |>
    # Transform from log scale to dollars scale
    mutate(estimate = exp(estimate)) |>
    # Append information for summarizing later
    mutate(
      model_name = model_name,
      sample_size = nrow(data)
    )
  # Return the predicted estimates
  return(predicted)
}
As an illustration, here is the linear estimator applied to a simulated sample.
simulated <- simulate(n = 100)
estimator(data = simulated, model_name = "linear")
# A tibble: 2 × 5
sex age estimate model_name sample_size
<chr> <dbl> <dbl> <chr> <int>
1 female 30 38436. linear 100
2 male 30 51877. linear 100
Apply estimators in repeated samples
How do our estimators perform in repeated samples? In actual problems, one has only one sample. But we created this exercise so that we can use simulate() to draw many samples from a known data generating process. The code below applies each estimator to many simulated samples at each of several sample sizes.
We first prepare for parallel computing.
library(foreach)
library(doParallel)
library(doRNG)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
Then we apply the estimator many times at each of a series of sample sizes.
simulations <- foreach(
  repetition = 1:1000,
  .combine = "rbind",
  .packages = "tidyverse"
) %dorng% {
  foreach(n_value = c(50, 100, 200, 500, 1000), .combine = "rbind") %do% {
    # Simulate data
    simulated <- simulate(n = n_value)
    # Apply the three estimators
    flat <- estimator(data = simulated, model_name = "flat")
    linear <- estimator(data = simulated, model_name = "linear")
    quadratic <- estimator(data = simulated, model_name = "quadratic")
    # Return estimates
    all_estimates <- rbind(flat, linear, quadratic)
    return(all_estimates)
  }
}
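When the simulation is done, you may want to release the worker processes. This step is not shown on the original page; stopCluster() comes from the parallel package, which doParallel loads.
# Release the parallel workers now that the simulation has finished
stopCluster(cl)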
Visualize estimator performance
For simplicity, we first focus on one estimand and sample size: modeling the geometric mean of 30-year-old female incomes with a sample size of \(n = 100\). Despite having good performance in the population, the quadratic model has poor performance at this sample size because it has high variance!
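One way to see the high variance (a sketch, not the original page's figure) is to plot the sampling distribution of the estimates for this subgroup and sample size across the simulated samples.
# Sketch: sampling distribution of each estimator for 30-year-old women
# at a sample size of 100, across the simulated samples
simulations |>
  filter(sex == "female", sample_size == 100) |>
  ggplot(aes(x = estimate)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ model_name, ncol = 1) +
  scale_x_log10() +
  labs(
    x = "Estimated geometric mean income (dollars, log scale)",
    y = "Count of simulated samples"
  )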
To aggregate simulations to a summary score, we use mean squared error (MSE) on the scale of log incomes.
\[\begin{aligned} \theta(\vec{x}) &= \text{True geometric mean in subgroup }\vec{x} \\ \hat\theta_r(\vec{x}) &= \text{Estimated geometric mean in subgroup }\vec{x}\text{ in simulated sample }r \\ \widehat{\text{MSE}}\bigg(\hat\theta(\vec{x})\bigg) &= \frac{1}{R}\sum_{r=1}^R \left(\text{log}\left(\hat\theta_r(\vec{x})\right) - \text{log}\left(\theta(\vec{x})\right)\right)^2 \end{aligned}\]
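The code further below also reports bias and variance, which relate to the MSE through the usual decomposition (exact when the variance is computed with a \(1/R\) denominator, and approximate for var(), which divides by \(R - 1\)):
\[\widehat{\text{MSE}}\bigg(\hat\theta(\vec{x})\bigg) \approx \widehat{\text{Bias}}\bigg(\hat\theta(\vec{x})\bigg)^2 + \widehat{\text{Var}}\bigg(\hat\theta(\vec{x})\bigg)\]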
To calculate the MSE, we need the true values from our simulation. The code below is somewhat opaque because it requires you to know something about how we generated the data. In this code, meanlog is the mean log income in each sex \(\times\) age \(\times\) year subgroup, estimated in the full ACS. The weight corresponds to the size of the subgroup; for example, later years have greater weight because the total population was greater then. We aggregate to truth within sex \(\times\) age subgroups.
truth <- read_csv("https://ilundberg.github.io/description/assets/truth.csv") |>
  group_by(sex, age) |>
  summarize(
    truth = exp(weighted.mean(meanlog, w = weight)),
    .groups = "drop"
  )
The code below merges the truth with the simulated estimates and calculates aggregate performance metrics: bias, variance, and mean squared error.
aggregate_performance <- simulations |>
  # Merge in the true values from the simulation
  left_join(truth, by = join_by(sex, age)) |>
  # Convert to log scale
  mutate(
    estimate = log(estimate),
    truth = log(truth)
  ) |>
  # Calculate aggregate performance within
  # population subgroups (sex, age)
  # and simulation settings (model_name, sample_size)
  group_by(sex, age, model_name, sample_size) |>
  summarize(
    bias = mean(estimate - truth),
    variance = var(estimate),
    mse = mean((estimate - truth) ^ 2),
    .groups = "drop"
  )
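As a quick sanity check (a sketch, not part of the original analysis), squared bias plus variance should approximately reproduce the mse column:
# Check the bias-variance decomposition: bias^2 + variance should be
# close to mse (approximate because var() divides by R - 1, not R)
aggregate_performance |>
  mutate(check = bias ^ 2 + variance) |>
  select(sex, model_name, sample_size, mse, check)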
The graph below visualizes aggregate performance measured by mean squared error.
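The original figure is produced by plotting code not shown on this page; the following is a minimal sketch of one way to draw such a graph from aggregate_performance.
# Sketch: MSE by sample size, one line per model, one panel per sex
aggregate_performance |>
  ggplot(aes(x = sample_size, y = mse, color = model_name)) +
  geom_line() +
  geom_point() +
  scale_x_log10() +
  facet_wrap(~ sex) +
  labs(
    x = "Sample size",
    y = "Mean squared error (log income scale)",
    color = "Model"
  )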
Conclusions: Many-samples simulation
The results show that more complex models outperform simpler models only at larger sample sizes.
For the male subgroup,
- a flat model is best at \(n = 50\)
- a linear model becomes best at \(n = 100\)
- a quadratic model becomes best at \(n = 1{,}000\)
For the female subgroup,
- a flat model is best at \(n = 50, 100, 200\)
- a linear model becomes best at \(n = 500\)
- a quadratic model may become best at an untested sample size greater than \(n = 1{,}000\)
There are two main conclusions from this illustration. Which estimator is best is a question that
- depends on the estimand (male or female subgroup), and
- depends on the sample size
Further, although the quadratic fit is best in the population (previous page), a very large sample size is needed before it is best in a sample. This is a reminder that more complex models do not necessarily outperform simpler models, especially in small samples.
Realistic: One sample from the population
In the real world, we cannot evaluate performance over many samples from the population. We typically have only one sample! This section is about evaluating performance when only one sample is available.
The code below generates one sample.
simulated <- simulate(n = 100) |>
  # To track the units in this sample, we will add an id column
  mutate(id = 1:n())
One way to evaluate in one sample is to use a split-sample procedure:
- Split into learning and evaluation sets
- Estimate models in the learning set
- Evaluate in the evaluation set
The advantage of this procedure is that it guards against overfitting: a complex model that performs well in the learning set may not generalize well to the evaluation set, and the evaluation set will reveal this. Through sample splitting, we can select a model that performs well at roughly our actual sample size (at least, at 50% of our sample size).
Split into learning and evaluation sets
Create a learning set with half of the cases.
learning <- simulated |>
  slice_sample(prop = .5)
Create an evaluation set with the other half.
evaluation <- simulated |>
  anti_join(learning, by = join_by(id))
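As a quick check (not on the original page), the two sets should be disjoint and together cover all 100 cases:
# The learning and evaluation sets should partition the sample
nrow(learning)                                 # 50 cases
nrow(evaluation)                               # 50 cases
length(intersect(learning$id, evaluation$id))  # 0 cases in both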
Estimate models in the learning set
Next, estimate the models in the learning set.
flat <- lm(log(income) ~ sex, data = learning)
linear <- lm(log(income) ~ sex * age, data = learning)
quadratic <- lm(log(income) ~ sex * poly(age, 2), data = learning)
Evaluate in the evaluation set
Use them to predict in the evaluation set.
predicted <- evaluation |>
  mutate(
    flat = predict(flat, newdata = evaluation),
    linear = predict(linear, newdata = evaluation),
    quadratic = predict(quadratic, newdata = evaluation)
  )
Aggregate prediction errors in the evaluation set to produce mean squared error estimates for each model.
performance <- predicted |>
  # Select the actual and predicted values
  mutate(actual = log(income)) |>
  select(actual, flat, linear, quadratic) |>
  # Make a long dataset for ease of analysis
  pivot_longer(cols = -actual, names_to = "model_name", values_to = "prediction") |>
  # Create a column with errors
  mutate(squared_error = (actual - prediction) ^ 2) |>
  # Summarize mean squared error
  group_by(model_name) |>
  summarize(mse = mean(squared_error)) |>
  print()
# A tibble: 3 × 2
model_name mse
<chr> <dbl>
1 flat 0.635
2 linear 0.632
3 quadratic 0.711
Our split-sample procedure estimates that the linear model has the best performance!
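To pick the winner programmatically rather than by inspection (a small addition, not on the original page), one could take the row with the smallest estimated MSE:
# Select the model with the lowest estimated mean squared error
performance |>
  slice_min(mse, n = 1)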
Difficulties in split-sample model evaluation
- the estimated performance may itself be statistically uncertain (one rough way to gauge this uncertainty is sketched below)
- if the best model in one subgroup differs from the best model in another subgroup, then a sample-average MSE may not be the right criterion for selecting a model for our task
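For instance, one rough way to gauge the uncertainty in the first point (a sketch, assuming squared errors are roughly independent across evaluation cases) is to attach a standard error to each model's estimated MSE:
# Estimate MSE with a rough standard error based on the spread of
# squared errors across the evaluation cases
predicted |>
  mutate(actual = log(income)) |>
  pivot_longer(
    cols = c(flat, linear, quadratic),
    names_to = "model_name", values_to = "prediction"
  ) |>
  mutate(squared_error = (actual - prediction) ^ 2) |>
  group_by(model_name) |>
  summarize(
    mse = mean(squared_error),
    se = sd(squared_error) / sqrt(n())
  )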