Simulate Data
This exercise works with simulated samples. Taking the nonparametric estimates from 5 million cases as the truth, you will generate a simulated sample of a much smaller size using the code below.
If you are a Stata user, see the bottom of this page for code. The page mainly supports coding in R.
Prepare the environment by loading the tidyverse package.
library(tidyverse)
The function below simulates a sample of n cases (100 by default).
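simulate <- function(n = 100) {
  # Load the truth file (columns: year, age, sex, meanlog, sdlog, weight)
  read_csv("https://ilundberg.github.io/description/assets/truth.csv") |>
    # Draw n rows with replacement, with probability proportional to weight
    slice_sample(n = n, weight_by = weight, replace = TRUE) |>
    # Draw one log-normal income for each sampled row
    mutate(income = exp(rnorm(n(), meanlog, sdlog))) |>
    select(year, age, sex, income)
}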
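As a minimal offline sketch of the two random steps inside simulate(), the code below applies them to a small made-up table. The names toy_truth and toy_sample, the seed, and every parameter value are hypothetical; only the column names match the truth file.

```r
library(tidyverse)

# Hypothetical stand-in for truth.csv: three rows with made-up parameters
toy_truth <- tibble(
  year    = c(2012, 2017, 2017),
  age     = c(34, 30, 41),
  sex     = c("male", "female", "female"),
  meanlog = c(10.8, 10.4, 10.5),
  sdlog   = c(0.9, 0.8, 0.8),
  weight  = c(0.2, 0.5, 0.3)
)

set.seed(1)  # arbitrary seed, only so the sketch is repeatable

toy_sample <- toy_truth |>
  # Step 1: draw rows with replacement, with probability proportional to weight
  slice_sample(n = 5, weight_by = weight, replace = TRUE) |>
  # Step 2: draw one log-normal income per sampled row
  mutate(income = exp(rnorm(n(), meanlog, sdlog))) |>
  select(year, age, sex, income)

print(toy_sample)
```

Rows with larger weights tend to appear more often, and the same row can appear more than once because sampling is with replacement.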
We can see how the function works below,
simulated <- simulate(n = 100)
Rows: 420 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): sex
dbl (5): year, age, meanlog, sdlog, weight
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
and can print a bit of the output.
simulated |> print(n = 3)
# A tibble: 100 × 4
   year   age sex     income
  <dbl> <dbl> <chr>    <dbl>
1  2017    30 female  28993.
2  2017    41 female  31110.
3  2012    34 male   271444.
# ℹ 97 more rows
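Because the sample is drawn at random, the rows printed on your machine will differ from those above. One common way to make a simulation repeatable in R is to call set.seed() first. The sketch below illustrates the idea on a log-normal draw alone; the seed value is borrowed from the Stata code at the bottom of this page, and the mean and standard deviation are placeholders rather than values from the truth file.

```r
# Two draws made after setting the same seed are identical
set.seed(90095)
first_draw <- exp(rnorm(3, mean = 10, sd = 1))

set.seed(90095)
second_draw <- exp(rnorm(3, mean = 10, sd = 1))

print(identical(first_draw, second_draw))
```

Placing the same set.seed() call immediately before simulate(n = 100) should make the simulated sample repeatable as well, since both the row sampling and the income draws use R's random number generator.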
Code for Stata users
I am not primarily a Stata user; this code is provided as a secondary pedagogical resource for those who do not use R. If you are a Stata user, feel free to let me know how to improve it.
set seed 90095
* Load true population data
import delimited https://ilundberg.github.io/description/assets/truth.csv
* Draw a sample of 100 X-values
* Need two supporting packages
*ssc install moremata
*ssc install gsample
* Draw the sample
gsample 100 [w = weight]
* Simulate individual income data
gen log_income = meanlog + sdlog * rnormal()
gen income = exp(log_income)
* Keep variables to work with
encode sex, gen(factorsex)
keep year age factorsex log_income income
rename factorsex sex