Monday, December 31, 2007


The idea underlying Sweave is very appealing: you can compose a complete research paper with equations, tables, and figures without manual intervention (copying, pasting, adjusting margins, etc.).

The package "odfWeave" extends this idea from LaTeX to OpenOffice. I just gave it a test-drive, and it rocks!

Sunday, December 30, 2007


As a demographer, I find it really exciting to discover a toolkit designed for demographic analysis:

This is a short paper explaining how to use it:

Saturday, December 29, 2007

How to "reshape" a data set in R

Here is a helpful post explaining how to use R to "reshape" a data set, a convenient feature in Stata:
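
For readers who have not used it, a "reshape" from wide to long turns one row per subject into one row per subject and time point. The following is only a minimal Python sketch of that idea, not the method described in the post mentioned above. The function name `wide_to_long`, the column stub `inc`, and the sample data are all made up for illustration.

```python
def wide_to_long(rows, id_var, stub, j_name="year"):
    """Turn one row per subject (wide) into one row per subject-period (long).

    Columns named stub + digits (e.g. "inc2005") are treated as repeated
    measurements; the digits become the value of the new j_name column.
    """
    long_rows = []
    for row in rows:
        for key in sorted(row):
            suffix = key[len(stub):]
            # Keep only columns that look like "<stub><number>"
            if key.startswith(stub) and suffix.isdigit():
                long_rows.append(
                    {id_var: row[id_var], j_name: int(suffix), stub: row[key]}
                )
    return long_rows


# Hypothetical wide data: income measured in 2005 and 2006
wide = [
    {"id": 1, "inc2005": 10, "inc2006": 12},
    {"id": 2, "inc2005": 20, "inc2006": 21},
]
long_data = wide_to_long(wide, "id", "inc")
for record in long_data:
    print(record)
```

Each subject now contributes two rows, one per year, which is the layout most longitudinal estimation commands expect.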


Gary King's crew is doing some really nice work with Zelig. The first goal is to standardize statistical analysis syntax in R, which is very important. It can also analyze multiply imputed data, simulate posterior distributions, etc.

Keep up the good work, it looks really promising.

Thursday, December 27, 2007

The "per thousand" symbol

In OpenOffice, the "per thousand" symbol can be entered either as an equation or as a special character. The special character version can usually be found under "General Punctuation" in most fonts.

Saturday, December 22, 2007

Sunday, December 09, 2007

Here is how I choose between Stata, Mplus and aML

in dealing with a specific research question:
  1. Use Stata (or R) for routine statistical analysis; they are getting better and better;
  2. Use Mplus for multilevel Cox regression; the results can be compared with those from Stata or R;
  3. Use aML to handle multiple-clock situations (i.e. APC models).

Have some fun with GreenFoot

An agent-based simulation environment and a good teaching tool for Java programming:

Thursday, November 29, 2007


The one complaint I have about OpenOffice is its math presentation. Compared to LaTeX or MathType, it does not look very nice. Here comes the rescue:

Here is how they look:

Wednesday, November 21, 2007

Just Another Gibbs Sampler

This project looks very promising.

Saturday, November 10, 2007


of a Freebasic debugging session on Linux.

Saturday, November 03, 2007

Geany the tiny IDE

Geany is a tiny IDE for C/C++. It is convenient to use in cases when one needs to develop short single-file programs like the ones presented in the book "Simulating Ecological and Evolutionary Systems in C".

The address is here:

It also supports FreeBasic and FreePascal.

Thursday, November 01, 2007

Zotero again

Cool. Zotero 1.0 is out. It fixes the problem with the ASA reference style. My next paper will be done in OpenOffice+Zotero!

Also a screenshot of OpenOffice and Zotero.

Tuesday, October 30, 2007

Chinese fonts

High quality Chinese fonts:

Free and open source.

Saturday, October 27, 2007

Linux & virus

An interesting post about viruses on Linux:

Screenshot of my new desktop

Looks pretty cool.

Wednesday, October 24, 2007

Wow, it's finally here...

A reference manager integrated seamlessly with OpenOffice, on all platforms! Its name is Zotero. There is no need to stay with MS Office just for the convenience of EndNote, and there is no need to stay with Windows just for the combination of Office and EndNote.

It has some very nice features that neither EndNote nor NoteExpress has, such as grabbing more than one reference at a time from Google Scholar. I realize this is something serious ...

Monday, October 22, 2007

Upgrading the RGL package on Ubuntu

Remember first to type "sudo apt-get build-dep r-cran-rgl" to get all the necessary files, then do the upgrading.

Sunday, October 21, 2007

Ubuntu 7.10 and Stata 10

Ubuntu 7.10 is no doubt the best Linux distro so far. I had some problems installing Stata 10 on my Ubuntu box. When I typed "xstata-se", I got "./xstata: error while loading shared libraries: cannot open shared object file: No such file or directory". A Google search pointed me to this url:

and this:

Fortunately, the solution is simple: just download the file "tiff-3.7.4.tar.gz", compile and install it, and Stata works just fine.

Saturday, October 06, 2007

Period effect in Stata

If the time variable is age, how do you get Stata to estimate a period effect? For a long time, I thought the only way was to use a discrete-time approach: create a person-year data set, then estimate either a LOGIT or a CLOGLOG model. The problem with this approach is that, with large data sets, it becomes impossible to expand the data 10 to 20 times or even more (depending on the duration and the time scale). The only feasible way to get a period effect, it seemed, was to use aML's multiple clock capability.

After carefully reading the manual, I realized there is a way to do this in Stata. In my schizophrenia case, I do the following:
  • stset dura_cal, f(event) origin(time birth) id(id)
  • stsplit p, at(0, 16, 26) after(time=1949)
Now the STSPLIT command with the "after()" option has correctly split the data into four segments that represent "before 1949", "1949-1965", "1966-1976", and "after 1976". A simple Cox model can be used to see if the rate of schizophrenia is particularly high in any of the periods:
  • xi: stcox i.p
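
To make the splitting step concrete, here is a minimal Python sketch of episode splitting in general. It is an illustration only, not Stata's implementation: the function name `split_by_period`, the tuple layout, and the use of calendar-year cut points 1949, 1965, and 1975 (one reading of at(0, 16, 26) counted from 1949) are assumptions.

```python
def split_by_period(start, stop, event, cuts):
    """Split one follow-up interval [start, stop) at calendar cut points.

    Returns a list of (start, stop, event, period) tuples. The event flag
    is carried only by the last piece; period counts how many cut points
    lie at or before the start of the piece (0 = before the first cut).
    """
    inner = [c for c in sorted(cuts) if start < c < stop]
    bounds = [start] + inner + [stop]
    pieces = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        period = sum(1 for c in cuts if c <= lo)
        is_last = hi == stop
        pieces.append((lo, hi, event if is_last else 0, period))
    return pieces


# Hypothetical subject observed from 1940 until an event in 1970
cuts = [1949, 1965, 1975]
episodes = split_by_period(1940, 1970, 1, cuts)
for episode in episodes:
    print(episode)
```

The subject contributes one record per period actually lived through, rather than one record per person-year, which is why this approach stays manageable with large data sets.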

Wednesday, October 03, 2007

Split-population model (cure model, long-term survivor model)

When a portion of respondents will never experience the event (they are "immortal"), ordinary survival modeling techniques are not adequate. Special models designed to handle this kind of situation are called split-population models, cure models, or long-term survivor models.
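
The usual way to write such a model is as a mixture: a fraction of the population is never at risk, and the rest follow an ordinary survival distribution. Below is a minimal Python sketch of the resulting population survival function. The exponential distribution for the susceptible group, the function name, and the parameter values are illustrative assumptions, not taken from any of the packages discussed here.

```python
from math import exp


def split_population_survival(t, cure_fraction, rate):
    """Population survival S(t) = c + (1 - c) * S_u(t).

    c is the fraction who never experience the event, and S_u(t) is the
    survival function of the susceptible group, here exponential with
    the given hazard rate (an illustrative choice).
    """
    susceptible_survival = exp(-rate * t)
    return cure_fraction + (1.0 - cure_fraction) * susceptible_survival


# Hypothetical values: 30% never experience the event, hazard 0.1 per year
for years in (0, 5, 20, 100):
    print(years, round(split_population_survival(years, 0.3, 0.1), 4))
```

Unlike an ordinary survival curve, this one levels off at the cure fraction instead of going to zero, which is the feature that standard models cannot reproduce.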

aML does not handle split-population models; Mplus handles them by imposing constraints on a two-class mixture model; Stata has the following facilities:

  1. lncure: log-normal model with split-population;
  2. spsurv: discrete time split-population model;
  3. cureregr: split-population model with Weibull, lognormal, logistic, gamma, and exponential distributions;
  4. strxmix and strsnmix: split-population model with Weibull, lognormal, gamma, and some mixture distributions.

Among the above, 1-3 are not well documented, while 4 is described in the most recent issue of Stata Journal (7-3).

For discrete-time models, there are only two alternatives: Mplus or spsurv.

Monday, October 01, 2007

Research on schizophrenia

This article is very interesting:

  • St Clair, D., M. Xu, P. Wang, Y. Yu, Y. Fang, F. Zhang, X. Zheng, N. Gu, G. Feng, and P. Sham. 2005. "Rates of Adult Schizophrenia Following Prenatal Exposure to the Chinese Famine of 1959-1961." Journal of the American Medical Association 294:557-562.

But my results do not support their findings. Worth some further exploration.

Sunday, September 30, 2007

Enter and die at the same time

Stata does not allow a subject to enter the risk set and die at the same time; here is an explanation:

Saturday, September 08, 2007

Human interface device

Sometimes not all tray icons show up after a fresh boot. This can be caused by an unexpected stop of the Human Interface Device service. Restarting that service and rebooting the computer should solve the problem.

Friday, August 31, 2007


It is getting cooler here in Beijing. Hopefully Hangzhou will cool off in the next half month so that we can fully enjoy our bike trip between Hangzhou and Suzhou.

Wednesday, August 29, 2007

Stata 10 graphics editor, bad idea!

The new graphics editor provided in Stata 10 is, to me, a bad idea. Instead of focusing on getting the figure right in a do-file, many people, especially beginners, will take shortcuts and rely heavily on the new editor. In the long run, this is harmful: once people start doing that, they lose the ability to replicate exactly what they did. If you spend 10 minutes editing a figure, then realize the data are not quite right and need to run the graph command again, you need another 10 minutes to edit it... you get the idea.

It is a bit disappointing to find that Unicode support is still not there. I have been waiting for this for too long...

Wednesday, August 22, 2007

aML reported random effects

The random effects that aML reports are standard deviation and correlation, not variance and covariance.
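
To compare such output with software that reports variances and covariances, the conversion is var_i = sd_i^2 and cov_ij = corr_ij * sd_i * sd_j. Here is a minimal Python sketch of that arithmetic; the function name and the numbers are made up and do not come from any aML output.

```python
def sd_corr_to_cov(sds, corr):
    """Build a covariance matrix from standard deviations and correlations.

    Entry (i, j) is corr[i][j] * sds[i] * sds[j]; with corr[i][i] == 1
    the diagonal holds the variances sds[i] ** 2.
    """
    n = len(sds)
    return [[corr[i][j] * sds[i] * sds[j] for j in range(n)] for i in range(n)]


# Hypothetical random intercept (SD 2.0) and slope (SD 0.5), correlation 0.3
sds = [2.0, 0.5]
corr = [[1.0, 0.3], [0.3, 1.0]]
cov = sd_corr_to_cov(sds, corr)
for row in cov:
    print(row)
```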

Tuesday, August 21, 2007

Age, period, and cohort effect

A model with age, period, and cohort effects can only be identified by doing one of the following:

1) Constrain two or more of the age, period, or cohort coefficients to be equal;
2) Use a proxy variable approach that assumes the cohort or period effects are proportional to certain measured variables (e.g. cohort size);
3) Transform at least one of the age, period, or cohort variables so that its relationship to the others is nonlinear.

A piecewise linear hazard rate model can usually be identified because of (3).
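
The underlying problem is the exact linear dependency cohort = period - age, which makes the design matrix rank deficient. The following Python sketch checks this on a small made-up grid of ages and periods, and shows that a nonlinear transformation of age (here simply age squared, chosen for illustration) removes the dependency. The helper `matrix_rank` is a plain Gaussian elimination written for this example, not a library routine.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix via Gaussian elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below the current rank with a nonzero entry
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Hypothetical grid of ages and periods; cohort is period minus age
obs = [(age, period) for age in (20, 30, 40) for period in (1980, 1990, 2000)]

# Columns: constant, age, period, cohort (all linear)
linear_design = [[1, a, p, p - a] for a, p in obs]
# Same design, but with age entered as age squared
transformed_design = [[1, a * a, p, p - a] for a, p in obs]

print(matrix_rank(linear_design))
print(matrix_rank(transformed_design))
```

With four columns, the linear design has rank 3, so one coefficient cannot be estimated; the transformed design has full rank 4.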

Friday, August 10, 2007

Ideal calendar solution

The combination of Google Calendar, Mozilla Sunbird (or Thunderbird + Lightning), and the Google Calendar provider provides an ideal calendar solution. More details can be found here:

Thursday, August 09, 2007

My working paper

My most recent working paper "Does Son Preference Influence Children's Growth in Height? A Comparative Study of Chinese and Filipino Children" is available online:

Another one (impact of famine on mortality) will be available soon.

Fall begins



By the way, Linux is a superior platform for number crunching compared to Windows. One needs to install lots of extra stuff to have a more or less comparable working environment under Windows. This is one of the reasons I am so happy after getting aML working on my Ubuntu box.

Wednesday, August 08, 2007

Compiling aML under Ubuntu 7.04, again!

I decided to give it another try. I wrote to Stan, reporting the problem I had when trying to compile aML on my Ubuntu 7.04 box. He pointed out that it is likely to be a bug in the compiler. Then I asked myself: why not try a different compiler?

A Google search shows that alternative FORTRAN 77 compilers for Linux include PGI, Intel, and Absoft, among others. PGI offers a 30-day trial, so I decided to give it a try. The pgf77 compiler generates correct binaries for "aml", "bigaml", and "hugeaml", but it creates problems for "mktab", a utility that creates tabular results out of aML output files. I needed to re-compile this utility using the old g77 compiler.

In short, an easy solution to the problem is to use pgf77 to generate the main binaries and g77 to generate the auxiliary binaries, then put them together in one place (in the system path).

I have checked using both the provided samples and my own data, so far so good.

Tuesday, August 07, 2007

Access blogpost in China

I just realized that this site can be accessed in China without using a VPN or proxy. This just happened today.

Monday, August 06, 2007

More aML stuff

This software review is also a good introduction to aML:

Dan Powers also has some aML-related materials, scattered around in different places.

Saturday, August 04, 2007

Fun blogs

These blogs look fun:

"Censoring Due to Death"

Interesting blog post I found:

New data wave from CLHNS

New wave of the Cebu Longitudinal Health and Nutrition Survey is out. I need to update my data and run the Mplus program for catch-up growth and see if anything changes.

Tuesday, July 31, 2007

Mplus vs. aML

I am working on a joint model of miscarriage and child mortality. The idea is that a higher rate of miscarriage may result in more highly selected newborns, who are less vulnerable to premature death. This is a Heckman selection type of model, except that the main equation is a hazard model instead of a model for a continuous outcome. I have been trying to make Mplus do the job, as hinted by Bengt. Then I got a direct response from Linda that this type of model cannot be estimated using Mplus. I guess this leaves me no alternative but to go to aML.

aML is a fine piece of software. I used it in 2004 for a chapter of my dissertation (at that time it cost about 1,000 bucks). Now it is free and open-source, and everyone can look at the code (in FORTRAN) to see how certain things are done. I am a little surprised that it has not attracted more attention.

What I would really like to see is a wrapper program in Stata that can 1) export data to aML, 2) automatically generate sensible starting values and feed them to aML, and 3) gather coefficients estimated in aML for post-estimation manipulation.

If nobody has already started working on this, I might end up doing it myself.

Friday, July 27, 2007

LyX 1.5.0 is out, with Unicode support

LyX 1.5.0 seems to be a real alternative to MS Office and OpenOffice for serious writing. Now it has Unicode support built in. I have tried it on both Windows and Linux and it works just fine. What I have not figured out is how to get the file compiled correctly with LaTeX into a DVI file when there are double-byte characters (Chinese, for example).

Saturday, June 16, 2007

Thursday, June 07, 2007


An easy-to-use programming language is an invaluable tool for quantitative social scientists. I have learned several programming languages, including C/C++, Java, Python, and Pascal. My favorite language right now is FreeBasic. It is highly compatible with the once widely used QBasic and is 100% free. There are several editors that can be used with FreeBasic on Windows, including the one I am using, FbEdit. The picture on the left shows how to debug a FreeBasic program using Insight, a front end to the GNU debugger, GDB.

Sunday, June 03, 2007

Editing and comparing huge text files

I need to work on several huge text files, each around 30 MB in size. The work involves cleaning them and comparing them to each other. I have worked with several pretty good text editors before, some free and some commercial. For this particular work, I tried UltraEdit, EmEditor, VEDIT, MadEdit, and Multi-Edit. For text comparison, I tried Beyond Compare and UltraCompare. Overall, I would say that the winner is Multi-Edit, for several reasons.

First of all, even though all the editors I tested can handle large files, the process is not painless. Most editors show a significant slowdown after loading the files, even on my Core 2 Duo machine with 2 GB of memory. Multi-Edit does not have this problem: the 30 MB file loads instantly, and I can scroll to anywhere in the file without delay. Second, most editors have very limited built-in file comparison functions, which are not suitable for the work I have at hand. I tried UltraCompare, but it did not work the way it is supposed to, and I gave up after several attempts (I am not a very patient man). Beyond Compare delivers good results, and then I realized that Multi-Edit has a copy of Beyond Compare built in!

The price for Multi-Edit is a bit steep. While most other editors cost around $50 or less, it costs three times that price ($149). That is probably why it is not used as widely as it could be...

Friday, June 01, 2007

Best Linux Distro

I began using Linux in 2000, when I was a graduate student at UCLA. Since then I have tried various distros, including SuSE (openSUSE), RedHat (Fedora), Mandrake, Turbolinux, Debian, and Slackware. I even bought a copy of a now discontinued distro named "Libranet". I settled on openSUSE for the past several years until I discovered Ubuntu. I have been running Ubuntu 7.04 on my desktop for more than a month and it has been a very pleasant experience so far.

The next distro I want to try is PCLinuxOS. It looks very nice and (according to various reviewers) it runs very fast. I am going to install it once I have a new machine. Maybe by then I will have a good answer as to which of the two, Ubuntu or PCLinuxOS, is the best Linux distro.

Sunday, May 06, 2007

Drawing maps using Stata

Don't need expensive specialized GIS anymore, just Stata.

Saturday, April 21, 2007

A good cross-platform text editor

I use both Linux and Windows and have been searching for a good cross-platform text editor for a long time. Now this small piece of software called "MadEdit" seems to be very promising. It is cross-platform, Unicode-enabled, free, and open source. Check it out at:

Sunday, April 15, 2007

A good new book on Monte Carlo methods

Title: Simulation and Monte Carlo: With applications in finance and MCMC
Author: J. S. Dagpunar
Publisher: Wiley

This new book seems really cool and I will try to remember to buy it next time I visit the US.

Wednesday, April 11, 2007

Maple, a good tool to teach maximum likelihood method

The new Maple has a very convenient built-in tool for demonstrating the process of maximum likelihood estimation in an intuitive way. Unlike software such as Stata, SAS, or Matlab, where the computation is done numerically under the surface, Maple solves the problem analytically, with the whole process showing up on the screen. This way, students can clearly see what is happening: how to get the log-likelihood function, how to solve it, what the Hessian matrix looks like, etc.
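
As a small worked case of the steps listed above (this is not Maple output), take an exponential model with rate r and data x_1, ..., x_n. The log likelihood is n*log(r) - r*sum(x), setting its derivative n/r - sum(x) to zero gives r_hat = n/sum(x), and the Hessian is -n/r^2, which is negative, so r_hat is a maximum. The Python sketch below just evaluates these closed-form results; the function names and the data are made up for illustration.

```python
from math import log


def exp_loglik(rate, data):
    """Exponential log likelihood: n * log(rate) - rate * sum(data)."""
    return len(data) * log(rate) - rate * sum(data)


def exp_mle(data):
    """Closed-form maximizer: the solution of n / rate - sum(data) = 0."""
    return len(data) / sum(data)


def exp_hessian(rate, data):
    """Second derivative of the log likelihood with respect to rate."""
    return -len(data) / rate ** 2


# Hypothetical durations
data = [0.5, 1.5, 2.0, 4.0]
rate_hat = exp_mle(data)
print(rate_hat)
print(exp_loglik(rate_hat, data))
print(exp_hessian(rate_hat, data))
```

The negative inverse of the Hessian at r_hat, here r_hat^2/n, is the usual estimate of the sampling variance of the estimator.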

Very good teaching tool.

Wednesday, March 21, 2007

Multiple imputation with growth modeling

Several days ago, a new user-contributed Stata module, "MIM", appeared in the Stata software repository. This module can automate the process of combining estimates from multiply imputed data sets and calculating confidence intervals for a wide array of Stata estimation commands, including "XTMIXED". This means that the complete process of imputation, estimation, and post-estimation can be done without leaving Stata.
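
The standard way to combine estimates across imputed data sets is Rubin's rules: average the point estimates, and add the average within-imputation variance to the between-imputation variance inflated by (1 + 1/m). The Python sketch below shows that arithmetic for a single coefficient, assuming Rubin's rules are what is being automated; the function name and the numbers are made up, and the degrees-of-freedom calculation needed for confidence intervals is left out.

```python
from math import sqrt
from statistics import mean, variance


def rubin_combine(estimates, variances):
    """Combine one coefficient across m imputed data sets (Rubin's rules).

    estimates: point estimates, one per imputed data set
    variances: squared standard errors, one per imputed data set
    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)        # average sampling variance
    between = variance(estimates)   # sample variance across imputations
    total = within + (1.0 + 1.0 / m) * between
    return pooled, sqrt(total)


# Hypothetical results from three imputed data sets
estimates = [1.0, 1.2, 1.1]
variances = [0.04, 0.05, 0.045]
pooled, pooled_se = rubin_combine(estimates, variances)
print(pooled, pooled_se)
```

The pooled standard error is larger than any single-data-set standard error would suggest, because it also reflects the uncertainty due to the missing values.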

Saturday, March 10, 2007

Multiple imputation with longitudinal data

I have been working with Sarah on revising our comparative paper. One problem we need to solve in this revision is the presence of a large amount of missing values. I use "ICE", a user-contributed module in Stata, to do the imputation. Unfortunately, the estimation procedures it supports do not include "XTMIXED" or "GLLAMM". I have to import the imputed data sets into Mplus and do the estimation there.

Thursday, February 22, 2007

A list of survey data that measure well-being

Here is a list of (US) survey data that measure well-being:

Sunday, February 11, 2007

Latent Growth Model and Repeated Measurement

I had a brief conversation with Linda Adair last Wednesday about the issue of catch-up growth. She mentioned that she would have done the analysis differently had she known about latent growth modeling in the 1990s. The idea of standardizing the measurements and calculating difference scores does not make much sense.

I am thinking about doing a Monte Carlo simulation to compare two sets of relationships: those revealed by a latent growth model and those revealed by standardized difference scores.

Tuesday, February 06, 2007

Inequality dropped

Alas, since the mechanism linking inequality and obesity has not been fully developed in the literature, and it would be too ambitious for us to tackle this issue in this small project, we have to drop it for now.

Will get it back, soon.

Wednesday, January 31, 2007

Economic inequality as an important moderator

Barry Popkin gave a talk at the CCPR seminar today on the global nutrition transition. His research is inspiring in many ways. I had a brief chat with him after the seminar about my idea of economic inequality as an important moderator of the links among obesity, SES, and economic development. He thinks it is a good idea and certainly worth the effort to explore further. I am glad.