Sunday, June 14, 2015

How to use SparkR within RStudio?

Setting up Spark and SparkR is quite easy (assuming you are running v1.4): just grab one of the pre-built binaries and unzip it to a folder. There is also a shell script to start SparkR from the command line. The documentation suggests putting the following lines

Sys.setenv(SPARK_HOME="/home/shige/bin/spark")  # point R at the Spark installation
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))  # add bundled SparkR to the library path
library(SparkR)
sc <- sparkR.init(master="local")  # start a local Spark context

into the .Rprofile file. This, however, has the undesirable side effect of adding yet another directory to which R packages can be installed.

My solution is:

1. Create a soft link to the SparkR directory inside the directory where your other R packages are installed: ln -s /home/shige/bin/spark/R/lib/SparkR /home/shige/R/x86_64-pc-linux-gnu-library/3.2
2. Add Sys.setenv(SPARK_HOME="/home/shige/bin/spark") to the .Rprofile file.
3. Add Sys.setenv(SPARKR_SUBMIT_ARGS='"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"') to the .Rprofile as well.
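The symlink in step 1 can be scripted. The paths below are the ones from my setup; substitute your own Spark location and R library directory:

```shell
# Paths from the steps above; adjust to your own machine
SPARK_HOME="$HOME/bin/spark"
R_LIBS_USER="$HOME/R/x86_64-pc-linux-gnu-library/3.2"
mkdir -p "$R_LIBS_USER"
# Link the bundled SparkR package into the user library so that
# library(SparkR) finds it without an extra .libPaths() entry
ln -sfn "$SPARK_HOME/R/lib/SparkR" "$R_LIBS_USER/SparkR"
```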

All set.
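Put together, the .Rprofile additions from steps 2 and 3 amount to this fragment (same example paths as above; adjust to your machine):

```r
# ~/.Rprofile
Sys.setenv(SPARK_HOME = "/home/shige/bin/spark")
Sys.setenv(SPARKR_SUBMIT_ARGS =
  '"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"')
```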

10 comments:

  1. How do I set this up in a Mac environment?
    Thanks a lot.

    ReplyDelete
  2. I assume it'll also work on Windows and Mac. Give it a try and let me know.

    ReplyDelete
    Hi, thanks for replying.

    I have downloaded "spark-1.4.0-bin-hadoop2.6.tgz" and unzipped it.

    ln -s /Users/frankie/Desktop/spark-1.4.0-bin-without-hadoop/R/lib/SparkR /????? (I don't know what this path should be?)

    Thanks

    ReplyDelete
  4. I got SparkR running on RStudio, but I have no idea how to read a .csv file from RStudio.
    When launching the sparkR shell
    ./bin/sparkR --master local[7] --packages com.databricks:spark-csv_2.10:1.0.3
    I can read a file as follows:
    flights <- read.df(sqlContext, "./nycflights13.csv", "com.databricks.spark.csv", header="true")
    But in RStudio this is no longer true, I get the following error:
    Caused by: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv
    Do you have any idea how to solve this? Can I import com.databricks:spark-csv_2.10:1.0.3 in .Rprofile or somewhere else? You can also check my question on Stack Overflow: http://stackoverflow.com/questions/30870379/loading-com-databricks-spark-csv-via-rstudio

    ReplyDelete
  5. Hi Wannes,

    Thanks for the post. Unfortunately, I do not have an answer. Hopefully someone can answer your question on StackOverflow.

    Shige

    ReplyDelete
  6. Hi Shige,

    I've found a solution, if you're interested, check stack overflow.

    ReplyDelete
  7. Thanks, I'll give it a try.

    ReplyDelete
  8. On a Mac, you can install using Homebrew. You do have to adjust the path a little:

    https://gist.github.com/kenahoo/0f4c08fe10337a53836d

    ReplyDelete
  9. I got an error when calling sc <- sparkR.init():

    Error in socketConnection(port = monitorPort) :
    cannot open the connection
    In addition: Warning message:
    In socketConnection(port = monitorPort) : localhost:64143 cannot be opened

    Here's the info of hadoop and spark. Any idea? Thanks!

    $ brew info hadoop
    hadoop: stable 2.7.2
    Framework for distributed processing of large data sets
    https://hadoop.apache.org/
    /usr/local/Cellar/hadoop/2.7.2 (6,304 files, 310M) *
    Built from source on 2016-05-27 at 02:34:15
    From: https://github.com/Homebrew/ho...
    ==> Caveats
    In Hadoop's config file:
    /usr/local/Cellar/hadoop/2.7.2/libexec/etc/hadoop/hadoop-env.sh,
    /usr/local/Cellar/hadoop/2.7.2/libexec/etc/hadoop/mapred-env.sh and
    /usr/local/Cellar/hadoop/2.7.2/libexec/etc/hadoop/yarn-env.sh
    $JAVA_HOME has been set to be the output of:
    /usr/libexec/java_home

    $ brew info apache-spark
    apache-spark: stable 1.6.1, HEAD
    Engine for large-scale data processing
    https://spark.apache.org/
    /usr/local/Cellar/apache-spark/1.6.1 (736 files, 372M) *
    Built from source on 2016-05-27 at 02:24:45
    From: https://github.com/Homebrew/ho...

    ReplyDelete
  10. A much easier solution is provided by RStudio in the form of a new package called "sparklyr". It lets you download and set up Spark on your local machine automatically, without manually tweaking any settings.
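    A minimal sketch of that sparklyr route, assuming the package is installed and using the function names from its early releases (spark_install, spark_connect):

    ```r
    library(sparklyr)
    spark_install()                        # downloads and unpacks a local Spark build
    sc <- spark_connect(master = "local")  # connect to it from R
    ```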

    ReplyDelete