Spark

Working with RStudio and a remote Spark cluster (SparkR)

  1. Download and install RStudio Server:

After the EMR cluster is up and running, SSH to the master node as user `hadoop`, download the RStudio Server RPM, and install it with `yum install`:

$ wget https://download2.rstudio.org/rstudio-server-rhel-1.1.442-x86_64.rpm

$ sudo yum install --nogpgcheck rstudio-server-rhel-1.1.442-x86_64.rpm

Finally, add a user that will log in to the RStudio web console:

$ sudo useradd <username>

$ echo <username>:<password> | sudo chpasswd

  2. To access the RStudio web console, you need to create an SSH tunnel from your machine to the EMR master node for local port forwarding, like below:

$ ssh -NL 8787:ec2-<emr-master-node-ip>.compute-1.amazonaws.com:8787 hadoop@ec2-<emr-master-node-ip>.compute-1.amazonaws.com&

  3. Now open any browser, go to `http://localhost:8787` to reach the RStudio web console, and use the `<username>:<password>` combo to log in.

 

  4. To install the required R packages, you first need to install `libcurl` and related system libraries on the master node, like below:

$ sudo yum update

$ sudo yum install -y curl

$ sudo yum install -y openssl

$ sudo yum install -y libcurl-devel

$ sudo yum install -y openssl-devel

$ sudo yum install -y libssh2-devel

  5. Create an HDFS home directory for the new user to resolve permission issues:

$ sudo -u hdfs hadoop fs -mkdir /user/<username>

$ sudo -u hdfs hadoop fs -chown <username> /user/<username>

Otherwise, you may see errors like the one below while trying to create a Spark session from RStudio:

Error: Failed during initialize_connection() org.apache.hadoop.security.AccessControlException: Permission denied: user=<username>, access=WRITE, inode="/user/<username>/.sparkStaging/application_1476072880868_0008":hdfs:hadoop:drwxr-xr-x

 

  6. Install all the necessary packages in RStudio (exactly as you would on a local machine):

install.packages('devtools')

devtools::install_github('apache/spark@v2.2.1', subdir='R/pkg')

install.packages('sparklyr')

library(SparkR)

library(sparklyr)

library(dplyr)

You might need to install additional dependencies beyond the above, as your setup may differ from mine, so I am leaving this open.
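As a quick sanity check that everything installed correctly, you can print the package versions from the R console (a minimal sketch; the exact versions will depend on your setup):

> packageVersion('SparkR')     # should match the Spark tag installed above (2.2.1)

> packageVersion('sparklyr')

> packageVersion('dplyr')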

 

  7. Once the required packages are installed and loaded, you can create a Spark session on the remote EMR/Spark cluster and interact with your SparkR application using the commands below:

> sc <- spark_connect(master = "yourEMRmasterNodeDNS:8998", method = "livy")

> copy_to(sc, iris)

# Source:   table<iris> [?? x 5]
# Database: spark_connection
   Sepal_Length Sepal_Width Petal_Length Petal_Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <chr>
 1         5.10        3.50         1.40       0.200 setosa
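Once the data is in the cluster, you can query it with dplyr verbs, which sparklyr translates to Spark SQL and executes remotely, so only the small aggregated result travels back to your R session. A minimal sketch (the table name `iris` comes from the `copy_to()` call above; the aggregation itself is just an illustrative example):

> iris_tbl <- tbl(sc, "iris")   # reference the table registered by copy_to()

> iris_tbl %>%
+   group_by(Species) %>%
+   summarise(mean_petal_length = mean(Petal_Length, na.rm = TRUE)) %>%
+   collect()                   # pull the aggregated result back into R

> spark_disconnect(sc)          # close the Livy session when you are done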

 

Need to learn more about AWS Big Data (Demystified)?



——————————————————————————————————————————

I put a lot of thought into these blogs, trying to share the information in a clear and useful way. If you have any comments, thoughts, or questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/


