Simulate Data
This exercise works with simulated samples. Taking the nonparametric estimates from 5 million cases as the truth, you will generate a simulated sample of a much smaller size using the code below.
If you are a Stata user, see the bottom of this page for code. The page mainly supports coding in R.
Prepare the environment by loading the tidyverse package.
library(tidyverse)
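The simulation on this page rests on two steps: resampling rows of a truth table in proportion to their weights, then drawing one log-normal income per sampled row. As a minimal offline sketch of that idea, the block below uses a small made-up table called `toy_truth`. Its name and values are hypothetical and are not taken from truth.csv; only the column names mirror the real file.

```r
library(dplyr)

# Hypothetical stand-in for the truth table: one row per subgroup, with the
# log-scale mean and standard deviation of income and a population weight.
# These numbers are invented for illustration only.
toy_truth <- tibble(
  year    = c(2012, 2015, 2018),
  age     = c(46, 48, 49),
  sex     = c("male", "female", "male"),
  meanlog = c(10.5, 10.6, 10.8),
  sdlog   = c(0.8, 0.7, 0.9),
  weight  = c(0.2, 0.3, 0.5)
)

toy_sample <- toy_truth |>
  # Step 1: resample rows with replacement, with probability proportional to weight
  slice_sample(n = 5, weight_by = weight, replace = TRUE) |>
  # Step 2: draw one log-normal income per sampled row
  mutate(income = exp(rnorm(n(), meanlog, sdlog))) |>
  # Keep only the variables an analyst would observe
  select(year, age, sex, income)

toy_sample
```

Because `exp(rnorm(n, meanlog, sdlog))` has the same distribution as `rlnorm(n, meanlog, sdlog)`, either form could be used for the second step.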
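The function below simulates a sample of n cases (100 by default). It draws rows of the truth table with replacement, with probability proportional to weight, and then draws an income for each sampled row from a log-normal distribution with that row's meanlog and sdlog.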
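simulate <- function(n = 100) {
  read_csv("https://ilundberg.github.io/description/assets/truth.csv") |>
    # Draw n rows with replacement, with probability proportional to weight
    slice_sample(n = n, weight_by = weight, replace = TRUE) |>
    # Simulate a log-normal income for each sampled row
    mutate(income = exp(rnorm(n(), meanlog, sdlog))) |>
    # Keep only the variables to work with
    select(year, age, sex, income)
}
We can see how it works below,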
simulated <- simulate(n = 100)
Rows: 420 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): sex
dbl (5): year, age, meanlog, sdlog, weight
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
and can print a bit of the output.
simulated |> print(n = 3)
# A tibble: 100 × 4
   year   age sex    income
  <dbl> <dbl> <chr>   <dbl>
1  2018    49 male   34994.
2  2015    48 female 40603.
3  2012    46 male   51381.
# ℹ 97 more rows
Code for Stata users
I am mostly not a Stata user, and this code is provided for secondary pedagogical purposes in case some people do not use R. If you are a Stata user, feel free to let me know how to improve this code.
set seed 90095
* Load true population data
import delimited https://ilundberg.github.io/description/assets/truth.csv
* Draw a sample of 100 X-values
* Need two supporting packages (uncomment the next two lines to install them once)
*ssc install moremata
*ssc install gsample
* Draw the sample
gsample 100 [w = weight]
* Simulate individual income data
gen log_income = meanlog + sdlog * rnormal()
gen income = exp(log_income)
* Keep variables to work with
encode sex, gen(factorsex)
keep year age factorsex log_income income
rename factorsex sex
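The Stata code above fixes the random seed, while the R code on this page does not, so repeated R runs give different samples. A minimal sketch of the R counterpart follows, assuming you want repeatable draws. The seed value 90095 is simply borrowed from the Stata code, and the numbers drawn in R will not match those drawn in Stata.

```r
# Setting the same seed before each draw reproduces the same random numbers.
set.seed(90095)
first_draw <- rnorm(3)

set.seed(90095)
second_draw <- rnorm(3)

# TRUE when the two draws are exactly the same
identical(first_draw, second_draw)
```

In the same way, calling `set.seed()` immediately before `simulate(n = 100)` should make the simulated sample repeatable across runs, since both the row sampling and the income draws rely on R's random number generator.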