Challenge Exercise

This exercise uses models to make predictions for a setting in which we have no data at all.

Generate data as described on the Simulate Data page. You can make the sample size as large as you want, and you can use any model you want.

Report your prediction in this Google Form.

Things you might discuss

You might discuss methodological choices:

  • how would you generate an evaluation set for this problem?
  • what models do you think would work well?
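One possible answer to the evaluation-set question is to hold out the most recent observed years, so that the evaluation mimics predicting beyond the range of the learning data. The sketch below is only an illustration under stated assumptions: `toy` is a made-up stand-in for the output of `simulate()`, with column names (`year`, `sex`, `income`) taken from the example code on this page and invented numbers, and the cutoff year of 2017 is arbitrary.

```r
# Hypothetical stand-in for data from the Simulate Data page.
# Column names follow the example code on this page; the values are invented.
set.seed(90095)
toy <- data.frame(
  year = rep(2010:2019, each = 20),
  sex  = rep(c("female", "male"), times = 100)
)
toy$income <- exp(10 + 0.02 * (toy$year - 2010) + 0.1 * (toy$sex == "male") +
                    rnorm(nrow(toy), sd = 0.5))

# Learn on earlier years and evaluate on the latest years,
# so the evaluation set lies beyond the range of the learning set.
cutoff     <- 2017  # arbitrary choice for illustration
learning   <- toy[toy$year <= cutoff, ]
evaluation <- toy[toy$year >  cutoff, ]

fit <- lm(log(income) ~ sex * year, data = learning)

# Mean squared error on the log scale in the held-out years
evaluation$predicted <- predict(fit, newdata = evaluation)
mse <- mean((log(evaluation$income) - evaluation$predicted)^2)
mse
```

A random split would instead evaluate interpolation within the observed years, which is a different task from the one in this exercise.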

You might also discuss conceptual issues:

  • why might we be hesitant to carry out this extrapolation?
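One way to see the concern is to compare two functional forms that agree closely in the observed years but disagree once you leave them. The sketch below is an illustration only: `toy` is a made-up stand-in for simulated data (the years 2010 to 2019 and all values are assumptions), and the linear and quadratic models are just two candidates among many.

```r
# Hypothetical stand-in data: invented values, balanced across years.
set.seed(90095)
toy <- data.frame(year = rep(2010:2019, each = 20))
toy$income <- exp(10 + 0.02 * (toy$year - 2010) + rnorm(nrow(toy), sd = 0.5))

# Two candidate functional forms for the trend in log income
linear    <- lm(log(income) ~ year, data = toy)
quadratic <- lm(log(income) ~ poly(year, 2), data = toy)

observed_years <- data.frame(year = 2010:2019)
future_year    <- data.frame(year = 2022)

# Difference between the two fitted curves, inside and outside the data
gap_observed <- predict(quadratic, newdata = observed_years) -
  predict(linear, newdata = observed_years)
gap_future <- predict(quadratic, newdata = future_year) -
  predict(linear, newdata = future_year)

# Largest disagreement within the observed years versus disagreement in 2022
max(abs(gap_observed))
abs(unname(gap_future))
```

In this setup the two models disagree more in 2022 than in any observed year, because the quadratic term grows with distance from the data, yet the observed data alone give little basis for choosing between them.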

Example R code to get you started

To run the code on this page, you will need the tidyverse package.

library(tidyverse)

We will also set the seed so that it is possible to exactly reproduce these results.

set.seed(90095)

As a simple example, you might simulate a sample of size 100,

simulated <- simulate(n = 100)

estimate a linear model on those data,

fit <- lm(log(income) ~ sex * year, data = simulated)

and report predictions in 2022.

to_predict <- tibble(
  sex = c("female","male"),
  year = c(2022,2022)
)
to_predict |>
  mutate(
    # Make prediction
    estimate = predict(fit, newdata = to_predict),
    # Exponentiate to dollars
    estimate = exp(estimate)
  )
# A tibble: 2 × 3
  sex     year estimate
  <chr>  <dbl>    <dbl>
1 female  2022   81777.
2 male    2022   76336.

Example Stata code to get you started

First generate your learning dataset using the Stata code at the bottom of the Simulate Data page. Then save this file.

save learning

Then generate your dataset in which to make predictions.

use learning
* Update the year to 2022
replace year = 2022
* Keep only the year and sex variables
keep year sex
* Keep only one observation in each group
bysort year sex: gen index = _n
keep if index == 1
drop index
* Save the prediction dataset so it can be loaded later
save to_predict

Fit a regression model in the learning set.

clear all
use learning
* c.year treats year as continuous so the fitted trend can extend to 2022
reg log_income c.year##i.sex

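Load the prediction set and make predictions from that fitted model.

* Use the clear option here: clear all would erase the stored regression results
use to_predict, clear
predict predicted
* Exponentiate to dollars
gen predicted_income = exp(predicted)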

What to try next

You might consider different functional forms, the overall mean, or machine learning estimators.
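As one reading of "the overall mean," you might predict the mean of log income within each sex, pooled over all observed years, which amounts to a model with no year term. The sketch below is an illustration under assumptions: `toy` is a made-up stand-in for the output of `simulate()`, with invented values and column names borrowed from the example code above.

```r
# Hypothetical stand-in for data from the Simulate Data page (invented values).
set.seed(90095)
toy <- data.frame(
  year = rep(2010:2019, each = 20),
  sex  = rep(c("female", "male"), times = 100)
)
toy$income <- exp(10 + 0.02 * (toy$year - 2010) + 0.1 * (toy$sex == "male") +
                    rnorm(nrow(toy), sd = 0.5))

# With no year term, the model predicts the same value in every year:
# the mean of log income within each sex, pooled over all observed years.
mean_fit <- lm(log(income) ~ sex, data = toy)

to_predict <- data.frame(sex = c("female", "male"), year = c(2022, 2022))
# Predict on the log scale, then exponentiate to dollars
to_predict$estimate <- exp(predict(mean_fit, newdata = to_predict))
to_predict
```

This baseline assumes no trend at all, so comparing it with a trend model on a held-out set of later years is one way to judge whether the trend helps.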