Built using Zelig version 220.127.116.11000
Many researchers integrate tools from the “tidyverse” into their data workflows. The tidyverse includes a number of packages to read data into R–readr–, transform it for analysis– tibble, tidyr, purr, and dplyr, and visualise the results with ggplot2. All of these packages can be loaded in R by loading the tidyverse package:
Zelig can be easily slotted into this workflow for model estimation and identifying quantities of interest. Most Zelig estimation models require underlying data to be in the “tidy” format that the tidyverse creates. Zelig’s three step workflow can be organized very well using pipes (
%>%) that the tidyverse advocates. Using pipes with Zelig will make your code more legible and computationally efficient.
“Tidy” data is defined as having three qualities (Wickham 2014, 4):
Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.
Data formatted in this way is ideal for most Zelig models.
This example shows one way to integrate the Tidyverse and Zelig into one workflow.
Imagine we used the Tidyverse to create a data set in tidy format that we want to analyse with Zelig:
## Fertility Agriculture Examination ## Courtelary 80.2 17.0 15 ## Delemont 83.1 45.1 6 ## Franches-Mnt 92.5 39.7 5 ## Moutier 85.8 36.5 12 ## Neuveville 76.9 43.5 17 ## Porrentruy 76.1 35.3 9
We could then analyse it with Zelig and plot quantities of interest using Tidyverse’s pipe operator. Note the use of
data = .. This tells the pipe to enter the
swiss data object as the value of
zelig’s data argument:
We could also find first differences by including
setx1 in the workflow:
Note that piping does not work with Zelig 5’s reference class syntax.
Zelig’s in-house plots are built using base R plotting functionality. You may want to take the quantities of interest simulated by Zelig and create custom plots with other tools, such as ggplot2.
zelig_qi_to_df extracts quantities of interest simulated by
sim and returns them in a data frame formatted to follow tidy data principles. This output could be easily plotted with ggplot2.
Using the example analysis from before:
sims.full <- swiss %>% zelig(Fertility ~ Agriculture + Examination, model = 'ls', data = ., cite = FALSE) %>% setx(Agriculture = seq(1, 90, by = 5)) %>% sim() %>% zelig_qi_to_df() head(sims.full)
## setx_value Agriculture Examination expected_value predicted_value ## 1 x 1 16.48936 79.52943 81.78669 ## 2 x 1 16.48936 79.02154 76.65962 ## 3 x 1 16.48936 81.17413 73.50355 ## 4 x 1 16.48936 76.50860 84.55060 ## 5 x 1 16.48936 72.53544 70.23523 ## 6 x 1 16.48936 70.64250 65.17009
Each row contains information for an individual “observation”, i.e. a quantity of interest calculated from one draw of the model parameters from the multivariate normal distribution.
The first three columns of the object returned by
zelig_qi_to_df in this case contain information that identifies the scenario that the quantity of interest is from. For example, the first row is from a scenario where
Agriculture is fitted at
Examination is fitted at the sample mean 16.4893617. The
setx_value column identifies if the values were fitted by a
setx call as
x1–a contrasting scenario used to find first differences.
The final two columns contain the expected and predicted value of the quantity of interest, respectively.1
Typically when we plot simulated quantities of interest, we would be interested in showing the central interval of the simulated distribution. This involves finding the central interval for each simulation scenario for some range, e.g. the central 95% of simulations. Zelig now includes a helper function
qi_slimmer that takes the output of
zelig_qi_to_df and finds the desired central interval for each scenario. For example:
sims.slimmed <- qi_slimmer(sims.full)
## Slimming Expected Values . . .
## setx_value Agriculture Examination qi_ci_min qi_ci_median qi_ci_max ## 1 x 1 16.48936 66.28557 74.87334 83.65955 ## 2 x 6 16.48936 66.57396 74.38637 82.46757 ## 3 x 11 16.48936 66.84942 73.92589 81.21538 ## 4 x 16 16.48936 67.09725 73.46384 79.97733 ## 5 x 21 16.48936 67.44999 73.03007 78.77376 ## 6 x 26 16.48936 67.67556 72.52319 77.62708
Now we have one row for each scenario. This contains the columns
qi_ci_max for the central interval of the simulations from each scenario. By default
qi_ci_max contain central interval bounds at the lower 2.5 and upper 97.5 percentiles, respectively.
qi_ci_median contains the simulated distribution’s median. The interval can be changed with the
ci argument. You can choose expected value (
ev) or predicted value (
pv) with the
Now we have everything to make a ggplot2 plot summarizing the simulated quantity of interest:
ggplot(sims.slimmed, aes(Agriculture, qi_ci_median)) + geom_ribbon(aes(ymin = qi_ci_min, ymax = qi_ci_max), alpha = 0.3) + geom_line() + ylab('Expected Fertility') + theme_bw()
“Depending on the issue being studied, the expected or mean value of the dependent variable may be more interesting than a predicted value. The difference is subtle but important. A predicted value contains both fundamental and estimation uncertainty, whereas an expected value averages over the fundamental variability arising from sheer randomness in the world, leaving only the estimation uncertainty caused by not having an infinite number of observations. Thus, predicted values have a larger variance than expected values, even though the average should be nearly the same in both cases.” From King, Tomz, and Wittenberg (2000, 350)↩