Shige's Research Blog: 2016

Apache Bigtop is by far the easiest Hadoop distribution to install. I was able to get it to work on a CentOS VM. Unfortunately, the bundled Spark is 1.5.1, which does not work well with the sparklyr package. I have not yet figured out how to make the YARN resource manager work with a third-party copy of Spark, so I need to wait for the Bigtop distribution to upgrade.

Saturday, November 26, 2016

The importance of googling ...

Trying to submit a manuscript today. The main manuscript was written in R Markdown, whereas the supplementary materials were in Word. The online submission system kept complaining about my PDF file. After some googling and tweaking, it turned out the problem was caused by the indicator function "\mathbbm{1}". Following the suggestion given here, I replaced it with "\mathds{1}". Problem solved!
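A minimal sketch of the replacement, assuming the dsfont package (which provides \mathds) is available in the LaTeX installation. The set A and the variable x are only illustrative and do not come from the manuscript.

```latex
\documentclass{article}
\usepackage{amsmath}  % cases environment and \text
\usepackage{dsfont}   % provides \mathds, used here instead of \mathbbm from bbm

\begin{document}
\[
  \mathds{1}\{x \in A\} =
  \begin{cases}
    1 & \text{if } x \in A, \\
    0 & \text{otherwise.}
  \end{cases}
\]
\end{document}
```

In an R Markdown document, the \usepackage line would typically go under header-includes in the YAML header rather than in a preamble.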

Sunday, November 20, 2016

Building Scalable Data Pipelines with Microsoft R Server and Azure Data Factory

Useful information on big data computation using the Microsoft platform.

Saturday, November 12, 2016

RStudio IDE Easy Tricks You Might’ve Missed

I missed a few of the tricks mentioned here. Very neat indeed!

Saturday, November 05, 2016

Tidy Text Mining with R

Cool text mining book.

Rmazon

This is a simple but useful package for downloading product information and reviews from Amazon.com.

Saturday, October 22, 2016

Scalable R on Spark with SparkR, sparklyr and RevoScaleR

Useful tutorial here.

Monday, October 17, 2016

Interview with J.J. Allaire

Very informative interview here.

Sunday, October 09, 2016

Real-World Machine Learning

I find this book very helpful. The introduction chapter is freely available.

Monday, August 15, 2016

Sparklyr

The new sparklyr package from RStudio provides a convenient interface between R/RStudio and Spark. It runs well on Linux; it also works on Windows with Spark 1.6.2 and lower. For some reason, it does not work with Spark 2.0 on Windows. I assume this will be fixed in subsequent releases.

Saturday, July 16, 2016

LaplacesDemon is back

Looks like the LaplacesDemon package is back. Now we have a pure R-based Bayesian computation platform.

Friday, July 01, 2016

Microsoft Analytics in 2016

Here is a thorough introduction to the data science solutions offered by Microsoft.

Saturday, June 18, 2016

Making Causal Impact Analysis Easy

Very helpful blog post regarding the CausalImpact package and MarketMatching package.

Wednesday, May 25, 2016

Setting Up New R Notebook

This is extremely cool!

Wednesday, April 06, 2016

Microsoft and Linux: True Romance or Toxic Love?

Insightful article.

Thursday, March 31, 2016

tm reading in data frame and keep text's id

Here are some useful discussions.

Friday, February 26, 2016

Multiple imputation using R

R has a long list of packages for multiple imputation. The main problem is integration: statistical procedures in other packages may or may not work with the imputation procedures. I have been using Amelia together with Zelig. Because they were written by the same group, they work well together. However, I have been having trouble getting multiple imputation to work with the plm package. After searching the internet, I found the following solution:

Impute the missing data using Amelia or mice.
Estimate the model on each imputed data set.
Use the mitools package to extract and combine the results.

Here is a simple example:

...

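library(mice)     # mice(), complete()
library(mitools)  # imputationList(), MIextract(), MIcombine()
library(plm)      # plm()

# Impute the missing values (mice creates 5 imputed data sets by default)
imp <- mice(d)
mydata <- imputationList(lapply(1:5, function(i) complete(imp, i)))

# Fit the same pooled model on each imputed data set
fit <- lapply(mydata$imputations, function(x) {
  plm(cog3pl ~ oc + grade9 + boy + han + ruralbirth, data = x,
      index = c("schids"), model = "pooling")
})

# Extract the coefficients and covariance matrices, then combine them
betas <- MIextract(fit, fun = coef)
vars <- MIextract(fit, fun = vcov)
summary(MIcombine(betas, vars))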

I bet this will work for most, if not all, estimation procedures in R.
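To make the combining step less of a black box, here is a minimal base R sketch of Rubin's rules, which is, as far as I understand, what MIcombine applies to the point estimates and their variances. The helper name pool_rubin and all the numbers are made up for illustration; they are not output from the model above.

```r
# Hypothetical coefficient vectors from m = 3 imputed data sets (made-up numbers)
betas <- list(c(a = 1.0, b = 2.0), c(a = 1.2, b = 1.8), c(a = 0.8, b = 2.2))
# Hypothetical covariance matrices of those coefficients (diagonal for simplicity)
vars <- list(diag(c(0.10, 0.20)), diag(c(0.12, 0.18)), diag(c(0.08, 0.22)))

# Illustrative helper (not part of mitools): pool estimates with Rubin's rules
pool_rubin <- function(betas, vars) {
  m <- length(betas)
  est <- do.call(rbind, betas)      # m x p matrix, one row per imputation
  qbar <- colMeans(est)             # pooled point estimates
  ubar <- Reduce(`+`, vars) / m     # average within-imputation covariance
  b <- var(est)                     # between-imputation covariance
  total <- ubar + (1 + 1 / m) * b   # total covariance
  list(coefficients = qbar, variance = total, se = sqrt(diag(total)))
}

pooled <- pool_rubin(betas, vars)
print(pooled$coefficients)
print(pooled$se)
```

The pooled standard errors are larger than the average within-imputation ones because the between-imputation spread is added on top, which is the whole point of combining rather than analyzing a single imputed data set.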

Sunday, February 21, 2016

Another text analysis package

Quanteda seems to be a serious contender for analyzing textual data using R.

Wednesday, February 10, 2016

RStudio 0.99.878 becomes official

RStudio 0.99.878 has become official. There is a server version in the AUR that can be used. However, the pandoc that comes with that version seems to have problems on Arch/Manjaro. The discussion here is very helpful. One easy solution is to replace the built-in pandoc with the one that comes with the system:

sudo mv /usr/lib/rstudio-server/bin/pandoc/pandoc /usr/lib/rstudio-server/bin/pandoc/pandoc_old
sudo mv /usr/lib/rstudio-server/bin/pandoc/pandoc-citeproc /usr/lib/rstudio-server/bin/pandoc/pandoc-citeproc_old

sudo ln -s /usr/bin/pandoc /usr/lib/rstudio-server/bin/pandoc/pandoc
sudo ln -s /usr/bin/pandoc-citeproc /usr/lib/rstudio-server/bin/pandoc/pandoc-citeproc

Friday, February 05, 2016

Alternate R Markdown Templates

These alternative R Markdown templates look great.

Tuesday, February 02, 2016

ggplot2 extensions

Here is a list of ggplot2 extension packages.

Tuesday, January 19, 2016

Notebook interface for everything

Zeppelin provides a unified interface for nearly all the major data processing engines, including Spark. I quickly set it up on a virtual machine and gave it a test run. It works great.

This one has SparkR support built-in.

I was never really into the IPython/Jupyter notebook, mainly because there is nothing it can do that a good IDE such as PyCharm or Rodeo cannot. Zeppelin is different: its ability to tightly integrate different Spark front-ends, including Scala, Python, and R, is uniquely powerful. I would call this revolutionary.

Thursday, January 14, 2016

R Users Will Now Inevitably Become Bayesians

Good post here. I would also add that the rethinking package is a third option that helps R users become Bayesians.
