Sunday, June 14, 2015

How to use SparkR within RStudio?

Setting up Spark and SparkR is quite easy (assuming you are running v1.4): just grab one of the pre-built binaries and unzip it to a folder. There is also a shell script to start SparkR from the command line. The documentation suggests putting the following lines

Sys.setenv(SPARK_HOME="/home/shige/bin/spark")  # point R at the unzipped Spark folder
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))  # make the bundled SparkR package visible
library(SparkR)
sc <- sparkR.init(master="local")  # start a local Spark context

into the .Rprofile file. This, however, has the undesirable side effect of adding yet another directory into which R packages can be installed.

My solution is as follows (a combined sketch appears after the steps):

1. Create a soft link to the SparkR directory in the directory where your other R packages are installed (ln -s /home/shige/bin/spark/R/lib/SparkR /home/shige/R/x86_64-pc-linux-gnu-library/3.2).
2. Add Sys.setenv(SPARK_HOME="/home/shige/bin/spark") to the .Rprofile file.
3. Add Sys.setenv(SPARKR_SUBMIT_ARGS='"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"') to the .Rprofile as well, so the spark-csv package is loaded when the SparkR backend starts.
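
Taken together, the two lines added to .Rprofile look like this (a sketch using my paths; adjust SPARK_HOME and the spark-csv version to match your own setup):

Sys.setenv(SPARK_HOME="/home/shige/bin/spark")
Sys.setenv(SPARKR_SUBMIT_ARGS='"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"')

After restarting R, the soft link from step 1 lets SparkR load like any other package:

library(SparkR)
sc <- sparkR.init(master="local")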

All set.
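
As a quick sanity check (a sketch; the csv file and its path are just examples), the SPARKR_SUBMIT_ARGS setting from step 3 should make the spark-csv data source available inside RStudio:

sqlContext <- sparkRSQL.init(sc)  # create a SQLContext from the Spark context
flights <- read.df(sqlContext, "./nycflights13.csv", source="com.databricks.spark.csv", header="true")
head(flights)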

10 comments:

Unknown said...

How do I set this up in a Mac environment?
Thanks a lot.

Shige said...

I assume it'll also work on Windows and Mac. Give it a try and let me know.

Unknown said...

Hi, thanks for replying to me.

I have downloaded "spark-1.4.0-bin-hadoop2.6.tgz" and unzipped it.

ln -s /Users/frankie/Desktop/spark-1.4.0-bin-without-hadoop/R/lib/SparkR /????? (I don't know what this path is for?)

Thanks

Unknown said...

I got SparkR running in RStudio, but I have no idea how to read a .csv file from RStudio.
When launching the SparkR shell with
./bin/sparkR --master local[7] --packages com.databricks:spark-csv_2.10:1.0.3
I can read a file as follows:
flights <- read.df(sqlContext, "./nycflights13.csv", "com.databricks.spark.csv", header="true")
But in RStudio this no longer works; I get the following error:
Caused by: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv
Do you have any idea how to solve this? Can I import com.databricks:spark-csv_2.10:1.0.3 in .Rprofile or somewhere else? You can also check my question on Stack Overflow: http://stackoverflow.com/questions/30870379/loading-com-databricks-spark-csv-via-rstudio

Shige said...

Hi Wannes,

Thanks for the post. Unfortunately, I do not have an answer. Hopefully someone can answer your question on StackOverflow.

Shige

Unknown said...

Hi Shige,

I've found a solution; if you're interested, check Stack Overflow.

Shige said...

Thanks, I'll give it a try.

Kenahoo said...

On a Mac, you can install using Homebrew. You do have to adjust the path a little:

https://gist.github.com/kenahoo/0f4c08fe10337a53836d

Unknown said...

I got an error when running sc <- sparkR.init():

Error in socketConnection(port = monitorPort) :
cannot open the connection
In addition: Warning message:
In socketConnection(port = monitorPort) : localhost:64143 cannot be opened

Here's the info for hadoop and spark. Any ideas? Thanks!

$ brew info hadoop
hadoop: stable 2.7.2
Framework for distributed processing of large data sets
https://hadoop.apache.org/
/usr/local/Cellar/hadoop/2.7.2 (6,304 files, 310M) *
Built from source on 2016-05-27 at 02:34:15
From: https://github.com/Homebrew/ho...
==> Caveats
In Hadoop's config file:
/usr/local/Cellar/hadoop/2.7.2/libexec/etc/hadoop/hadoop-env.sh,
/usr/local/Cellar/hadoop/2.7.2/libexec/etc/hadoop/mapred-env.sh and
/usr/local/Cellar/hadoop/2.7.2/libexec/etc/hadoop/yarn-env.sh
$JAVA_HOME has been set to be the output of:
/usr/libexec/java_home

$ brew info apache-spark
apache-spark: stable 1.6.1, HEAD
Engine for large-scale data processing
https://spark.apache.org/
/usr/local/Cellar/apache-spark/1.6.1 (736 files, 372M) *
Built from source on 2016-05-27 at 02:24:45
From: https://github.com/Homebrew/ho...

Shige said...

A much easier solution is now provided by RStudio in the form of a new package called "sparklyr". It lets you download and set up Spark on your local machine automatically, without manually tweaking any settings.
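
A minimal sketch, assuming a recent sparklyr release:

install.packages("sparklyr")
library(sparklyr)
spark_install()  # downloads and installs a local copy of Spark
sc <- spark_connect(master="local")
spark_disconnect(sc)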
