Working with RStudio and a remote Spark cluster (SparkR)

  1. Download and install RStudio Server

After the EMR cluster is up and running, SSH to the master node as the `hadoop` user, then download RStudio Server and install it using `yum install`:

$ wget

$ sudo yum install --nogpgcheck rstudio-server-rhel-1.1.442-x86_64.rpm

Finally, add a user to access the RStudio Web console:

$ sudo su

$ sudo useradd <username>

$ echo <username>:<password> | sudo chpasswd
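A common pitfall in this step: in `sudo echo <username>:<password> | chpasswd`, `sudo` elevates only `echo`, while `chpasswd` runs unprivileged, so the pipe must feed into `sudo chpasswd` instead. A minimal sketch of the idea (the `RSTUDIO_USER`/`RSTUDIO_PASS` values and the `cred_line` helper are placeholders of mine, and the privileged commands are only printed, not executed):

```shell
# Sketch only: builds the "user:password" line that chpasswd reads on stdin.
# RSTUDIO_USER / RSTUDIO_PASS are placeholder values -- substitute your own.
RSTUDIO_USER="ruser"
RSTUDIO_PASS="changeme"

# chpasswd expects one "user:password" pair per line on stdin.
cred_line() {
  printf '%s:%s\n' "$1" "$2"
}

# On the EMR master node you would then run (printed here, not executed):
echo "sudo useradd $RSTUDIO_USER"
echo "cred_line output piped into: sudo chpasswd"
cred_line "$RSTUDIO_USER" "$RSTUDIO_PASS"
```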

  2. To access the RStudio Web console, create an SSH tunnel from your machine to the EMR master node for local port forwarding, like below:

$ ssh -N -L 8787:localhost:8787 hadoop@ec2-<emr-master-node-ip>
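Note that `-L` takes a `local_port:target_host:target_port` triple; forwarding local port 8787 to port 8787 on the master node is what makes the console appear at `http://localhost:8787`. A small sketch that assembles the command (the hostname is a placeholder):

```shell
# Sketch: assemble the RStudio port-forwarding command.
# EMR_MASTER is a placeholder -- substitute your master node's public DNS.
EMR_MASTER="ec2-<emr-master-node-ip>"

# -N: run no remote command; -L local_port:target_host:target_port
tunnel_cmd() {
  printf 'ssh -N -L 8787:localhost:8787 hadoop@%s\n' "$1"
}

tunnel_cmd "$EMR_MASTER"
```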

  3. Now open any browser and go to `http://localhost:8787` to reach the RStudio Web console, and use the `<username>:<password>` combo to log in.


  4. To install the required R packages, you first need to install `libcurl` and related development libraries on the master node, like below:

$ sudo yum update

$ sudo yum install -y curl

$ sudo yum install -y openssl

$ sudo yum install -y libcurl-devel

$ sudo yum install -y openssl-devel

$ sudo yum install -y libssh2-devel
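The packages above can also go into a single `yum` transaction; a sketch (printed rather than run here, since installing needs root on the master node):

```shell
# Sketch: one yum invocation for all the dependencies listed above.
# Echoed instead of executed, since installing requires root on the master node.
pkgs="curl openssl libcurl-devel openssl-devel libssh2-devel"
echo "sudo yum install -y $pkgs"
```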

  5. Resolve HDFS permission issues with:

$ sudo -u hdfs hadoop fs -mkdir /user/<username>

$ sudo -u hdfs hadoop fs -chown <username> /user/<username>
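The two commands above create an HDFS home directory for the RStudio user and hand ownership over. A hedged helper that prints them for a given user (`hdfs_home_setup` is a placeholder name of mine, and the commands are only echoed because `hadoop fs` needs a live cluster):

```shell
# Sketch: print the HDFS home-directory setup commands for a given user.
# Echoed (not run) because 'hadoop fs' requires a live EMR cluster.
hdfs_home_setup() {
  user="$1"
  echo "sudo -u hdfs hadoop fs -mkdir /user/$user"
  echo "sudo -u hdfs hadoop fs -chown $user /user/$user"
}

hdfs_home_setup "<username>"
```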

Otherwise, you may see errors like the one below while trying to create a Spark session from RStudio:


Error: Failed during initialize_connection() Permission denied: user=<username>, access=WRITE, inode="/user/<username>/.sparkStaging/application_1476072880868_0008":hdfs:hadoop:drwxr-xr-x


  6. Install all the necessary packages in RStudio (exactly as you would on a local machine):


devtools::install_github('apache/spark@v2.2.1', subdir='R/pkg')





You might need to install additional dependencies along with the above, as my setup may differ from yours, so I am leaving this open.


  7. Once the required packages are installed and loaded, you can create a Spark session on the remote EMR/Spark cluster and interact with your SparkR application using the commands below:

> sc <- spark_connect(master = "yourEMRmasterNodeDNS:8998", method = "livy")

> copy_to(sc, iris)
# Source:   table<iris> [?? x 5]
# Database: spark_connection
   Sepal_Length Sepal_Width Petal_Length Petal_Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <chr>
 1         5.10        3.50         1.40       0.200 setosa


Need to learn more about AWS Big Data (Demystified)?


I put a lot of thought into these blogs so I could share the information in a clear and useful way. If you have any comments, thoughts, or questions, or you need someone to consult with, feel free to contact me:
