Spark

Working with RStudio and a remote Spark cluster (SparkR)

  1. Download and install RStudio Server:

After the EMR cluster is up and running, SSH to the master node as user `hadoop`, download the RStudio Server RPM, and install it with `yum install`:

$ wget https://download2.rstudio.org/rstudio-server-rhel-1.1.442-x86_64.rpm

$ sudo yum install --nogpgcheck rstudio-server-rhel-1.1.442-x86_64.rpm

Finally, add a user that will log in to the RStudio web console:

$ sudo useradd <username>

$ echo <username>:<password> | sudo chpasswd

  2. To access the RStudio web console, you need to create an SSH tunnel from your machine to the EMR master node for local port forwarding, like below:

$ ssh -NL 8787:ec2-<emr-master-node-ip>.compute-1.amazonaws.com:8787 hadoop@ec2-<emr-master-node-ip>.compute-1.amazonaws.com&

  3. Now open any browser, go to `http://localhost:8787` to reach the RStudio web console, and use the `<username>:<password>` combo to log in.

 

  4. To install the required R packages, you first need to install `libcurl` and related system libraries on the master node, like below:

$ sudo yum update

$ sudo yum install -y curl

$ sudo yum install -y openssl

$ sudo yum install -y libcurl-devel

$ sudo yum install -y openssl-devel

$ sudo yum install -y libssh2-devel

  5. Create an HDFS home directory for the new user to resolve permission issues:

$ sudo -u hdfs hadoop fs -mkdir /user/<username>

$ sudo -u hdfs hadoop fs -chown <username> /user/<username>

Otherwise, you may see errors like the one below while trying to create a Spark session from RStudio:

Error: Failed during initialize_connection() org.apache.hadoop.security.AccessControlException: Permission denied: user=<username>, access=WRITE, inode="/user/<username>/.sparkStaging/application_1476072880868_0008":hdfs:hadoop:drwxr-xr-x

 

  6. Install all the necessary packages in RStudio (exactly as you would on a local machine):

install.packages('devtools')

devtools::install_github('apache/spark@v2.2.1', subdir='R/pkg')

install.packages('sparklyr')

library(SparkR)

library(sparklyr)

library(dplyr)

You might need to install additional dependencies beyond the above, as your setup may differ from mine, so I am leaving this open.
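As a quick sanity check that everything installed correctly, you can print the package versions from the R console (a minimal sketch; the exact versions will depend on your setup):

> packageVersion('SparkR')     # should match the Spark tag installed above (2.2.1)

> packageVersion('sparklyr')

> packageVersion('dplyr')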

 

  7. Once the required packages are installed and loaded, you can create a Spark session on the remote EMR/Spark cluster and interact with your SparkR application using the commands below:

> sc <- spark_connect(master = "yourEMRmasterNodeDNS:8998", method = "livy")

> copy_to(sc, iris)

# Source:   table<iris> [?? x 5]
# Database: spark_connection
   Sepal_Length Sepal_Width Petal_Length Petal_Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <chr>
 1         5.10        3.50         1.40       0.200 setosa
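Once the data is in the cluster, you can query it with dplyr verbs, which sparklyr translates to Spark SQL and executes remotely, so only the small aggregated result travels back to your R session. A minimal sketch (the table name `iris` comes from the `copy_to()` call above; the aggregation itself is just an illustrative example):

> iris_tbl <- tbl(sc, "iris")   # reference the table registered by copy_to()

> iris_tbl %>%
+   group_by(Species) %>%
+   summarise(mean_petal_length = mean(Petal_Length, na.rm = TRUE)) %>%
+   collect()                   # pull the aggregated result back into R

> spark_disconnect(sc)          # close the Livy session when you are done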

 

Need to learn more about AWS Big Data (Demystified)?



——————————————————————————————————————————

I put a lot of thought into these blogs, trying to share the information in a clear and useful way. If you have any comments, thoughts, or questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/


