Sunday, April 14, 2013

Stan as a unified statistical estimation and interpretation engine

Applied researchers who are used to Stata or R need a reason to learn and use Stan. Beyond the usual Bayesian vs. frequentist debate, there is also a practical reason.

Stan provides a unified interface for statistical estimation and interpretation. For the past few years I have been using R with the Zelig package for estimation and interpretation, which is great. The problem is that Zelig is built on a large number of existing R packages written by many different researchers, and the quality of these packages varies greatly, so working with Zelig means working with all these other packages and researchers as well. In addition, even though in theory you can modify the source of these packages to suit your needs, applied researchers rarely have the energy or skills to tweak the FORTRAN or C code.

Stan provides a modeling language, which makes it easy for users to tweak their models (of course, the underlying C++ code is also available). It is simulation-based and uses the posterior distribution for inference, which means that there is no need for an additional simulation step (such as the one Zelig adds to frequentist models). After some testing, I have come to the conclusion that Stan is fast and stable enough for my daily data analysis work.

Best of all, this package comes from a research group with a very good reputation, and their discussion list is unbelievably helpful.

These are good enough reasons for me to switch to Stan.


Danilo said...

Hello Shige! Congratulations on your great blog! I'm also a former Zelig user who's slowly moving to Stan. I'm still quite a newbie, so I'm having a few difficulties along the way. Could you please share an example where you simulate quantities of interest with Stan the way we do with Zelig's setx() and sim() commands? Thanks a lot!

Shige said...

Not in the same way. Zelig simulates the sampling distribution, whereas Stan simulates the posterior distribution. The Stan way is to first treat the predicted values as missing values and let Stan impute them; you can then use the imputed posteriors to compute first differences, second differences, etc.

I have an example here:

Danilo said...

Thank you very much for your reply. Your example is indeed very good, and I managed to simulate the probabilities I needed from a logit model using the 'generated quantities' block and the inv_logit function. Thanks a lot for your help!
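
For readers following this thread, here is a minimal sketch of the approach discussed above: a logit model whose 'generated quantities' block computes predicted probabilities with inv_logit, plus a first difference between two scenarios. It is not the example linked in the thread. The single predictor, the scenario inputs x_lo and x_hi, the output names p_lo, p_hi and first_diff, and the normal(0, 5) priors are all assumptions made for illustration, and the array syntax assumes a recent Stan release (2.26 or later).

```stan
data {
  int<lower=0> N;                      // number of observations
  vector[N] x;                         // single predictor (assumed for illustration)
  array[N] int<lower=0, upper=1> y;    // binary outcome
  real x_lo;                           // hypothetical "low" scenario value
  real x_hi;                           // hypothetical "high" scenario value
}
parameters {
  real alpha;                          // intercept
  real beta;                           // slope on x
}
model {
  // Weakly informative priors; an assumption, not taken from the post.
  alpha ~ normal(0, 5);
  beta ~ normal(0, 5);
  // Logit likelihood on the linear predictor.
  y ~ bernoulli_logit(alpha + beta * x);
}
generated quantities {
  // Evaluated once per posterior draw, so each quantity below
  // gets its own posterior distribution.
  real p_lo = inv_logit(alpha + beta * x_lo);  // Pr(y = 1) at x_lo
  real p_hi = inv_logit(alpha + beta * x_hi);  // Pr(y = 1) at x_hi
  real first_diff = p_hi - p_lo;               // first difference
}
```

Each posterior draw of alpha and beta yields one draw of first_diff, so summarizing those draws (mean, quantiles) plays roughly the role of Zelig's sim() output, with x_lo and x_hi standing in for the values passed to setx().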