library(tidyverse)
Challenge Exercise
This exercise uses models to make predictions for a setting where we have no data at all: a year beyond the observed period.
- data: male and female incomes at age 30–50 in 2010–2019
- task: forecast male and female geometric mean pay at age 30–50 in 2022
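The geometric mean in the task is the exponentiated mean of log pay, which is why a model fit to log income can be exponentiated to produce the target. A minimal sketch of that definition, using a made-up `pay` vector that is purely illustrative and not drawn from the simulated data:

```r
# Hypothetical incomes, chosen only to illustrate the definition
pay <- c(20000, 50000, 125000)

# Geometric mean: exponentiate the mean of the logs
geometric_mean <- exp(mean(log(pay)))

# Equivalent definition: the nth root of the product
nth_root <- prod(pay)^(1 / length(pay))

# Both are approximately 50000; the arithmetic mean is 65000
geometric_mean
nth_root
mean(pay)
```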
Generate data as on the Simulate Data page. You can make the sample size as large as you want, and you can use any model you want.
Report your prediction in this Google Form.
Things you might discuss
You might discuss methodological choices:
- how would you generate an evaluation set for this problem?
- what models do you think would work well?
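One possible way to build an evaluation set is to mimic the forecasting task inside the observed years: learn from 2010 to 2016 and evaluate in 2019, which is three years past the last learning year, just as 2022 is three years past 2019. The sketch below is only one option. The `toy_simulate()` function and its parameters are hypothetical stand-ins for the `simulate()` function from the Simulate Data page, so the numbers it produces mean nothing beyond illustration.

```r
set.seed(1)

# Hypothetical stand-in for simulate(); the data-generating process is assumed
toy_simulate <- function(n) {
  sex <- sample(c("female", "male"), n, replace = TRUE)
  year <- sample(2010:2019, n, replace = TRUE)
  log_income <- 10.5 + 0.2 * (sex == "male") + 0.01 * (year - 2010) +
    rnorm(n, sd = 0.5)
  data.frame(sex = sex, year = year, income = exp(log_income))
}

toy <- toy_simulate(n = 5000)

# Learn from early years; evaluate three years past the last learning year
train <- subset(toy, year <= 2016)
evaluation <- subset(toy, year == 2019)

fit <- lm(log(income) ~ sex * year, data = train)

# Target in the evaluation year: mean log income by sex
# (the log of the geometric mean)
evaluation$log_income <- log(evaluation$income)
truth <- aggregate(log_income ~ sex, data = evaluation, FUN = mean)
truth$year <- 2019

# Forecast error on the log scale
truth$predicted_log <- predict(fit, newdata = truth)
truth$error <- truth$predicted_log - truth$log_income
truth
```

Repeating this over many simulated samples and several candidate models would give a distribution of forecast errors to compare.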
You might also discuss conceptual issues:
- why might we be hesitant to carry out this extrapolation?
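One statistical symptom of the problem can be sketched in code: under a linear model, the standard error of a predicted mean grows as the prediction year moves away from the observed years. The sketch uses a hypothetical toy data-generating process, not the `simulate()` function. It also understates the concern, because the standard error assumes the linear trend still holds in 2022, which no data from 2010 to 2019 can check.

```r
set.seed(3)

# Hypothetical toy data: an assumed stand-in, not the simulate() function
n <- 1000
toy <- data.frame(
  sex = sample(c("female", "male"), n, replace = TRUE),
  year = sample(2010:2019, n, replace = TRUE)
)
toy$income <- exp(
  10.5 + 0.2 * (toy$sex == "male") + 0.01 * (toy$year - 2010) +
    rnorm(n, sd = 0.5)
)

fit <- lm(log(income) ~ sex * year, data = toy)

# Compare a year inside the data (2015) with the extrapolation year (2022)
to_predict <- data.frame(
  sex = c("female", "female"),
  year = c(2015, 2022)
)
predicted <- predict(fit, newdata = to_predict, se.fit = TRUE)

# Standard error of the predicted mean log income in each year
to_predict$se <- predicted$se.fit
to_predict
```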
Example R code to get you started
To run the code on this page, you will need the tidyverse package.
We will also set the seed so that it is possible to exactly reproduce these results.
set.seed(90095)
As a simple example, you might simulate a sample of size 100,
simulated <- simulate(n = 100)
estimate a linear model on those data,
fit <- lm(log(income) ~ sex * year, data = simulated)
and report predictions in 2022.
to_predict <- tibble(
  sex = c("female", "male"),
  year = c(2022, 2022)
)
to_predict |>
  mutate(
    # Make prediction
    estimate = predict(fit, newdata = to_predict),
    # Exponentiate to dollars
    estimate = exp(estimate)
  )
# A tibble: 2 × 3
  sex     year estimate
  <chr>  <dbl>    <dbl>
1 female  2022   81777.
2 male    2022   76336.
Example Stata code to get you started
First generate your learning dataset. Use the Stata code at the bottom of Simulate Data. Save this file.
save learning
Then generate the dataset in which you will make predictions.
use learning
* Update the year to 2022
replace year = 2022
* Keep only the year and sex variables
keep year sex
* Keep only one observation in each group
bysort year sex: gen index = _n
keep if index == 1
* Save so that the prediction step can load this file
save to_predict
Fit a regression model in the learning set.
clear all
use learning
* c.year treats year as continuous so that the fitted trend extends to 2022
reg log_income c.year##i.sex
Load the predict set and make predictions from that fitted model.
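* Use clear here: clear all would also drop the stored estimates
clear
use to_predict
predict predicted
* Exponentiate to dollars
gen estimate = exp(predicted)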
What to try next
You might consider different functional forms, the overall mean, or machine learning estimators.
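As a rough sketch of how functional form can matter, the code below fits three candidate models to the same toy sample and compares their 2022 predictions. The toy data-generating process is hypothetical (not the `simulate()` function), and reading "the overall mean" as the mean log income by sex pooled over all years is an assumption.

```r
set.seed(2)

# Hypothetical toy data: an assumed stand-in, not the simulate() function
n <- 2000
toy <- data.frame(
  sex = sample(c("female", "male"), n, replace = TRUE),
  year = sample(2010:2019, n, replace = TRUE)
)
toy$income <- exp(
  10.5 + 0.2 * (toy$sex == "male") + 0.01 * (toy$year - 2010) +
    rnorm(n, sd = 0.5)
)

# Three candidate functional forms for log income
models <- list(
  pooled_mean = lm(log(income) ~ sex, data = toy),
  linear = lm(log(income) ~ sex * year, data = toy),
  quadratic = lm(log(income) ~ sex * poly(year, 2), data = toy)
)

to_predict <- data.frame(sex = c("female", "male"), year = 2022)

# Predicted geometric mean pay in 2022: one row per sex, one column per model
predictions <- sapply(
  models,
  function(model) exp(predict(model, newdata = to_predict))
)
rownames(predictions) <- to_predict$sex
predictions
```

Models that fit the observed years about equally well can still disagree in 2022, and the observed data alone cannot say which is right.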