- Download and install RStudio server:
After the EMR cluster is up and running, SSH to the master node as user `hadoop`, download RStudio Server, and install it with `yum install`:
$ wget https://download2.rstudio.org/rstudio-server-rhel-1.1.442-x86_64.rpm
$ sudo yum install --nogpgcheck rstudio-server-rhel-1.1.442-x86_64.rpm
(An older release such as 1.1.383 can be installed the same way if you need it.)
Finally, add a user that will access the RStudio Web console:
$ sudo useradd <username>
$ echo "<username>:<password>" | sudo chpasswd
- To access the RStudio Web console, you need to create an SSH tunnel from your machine to the EMR master node for local port forwarding, like below:
$ ssh -NL 8787:ec2-<emr-master-node-ip>.compute-1.amazonaws.com:8787 hadoop@ec2-<emr-master-node-ip>.compute-1.amazonaws.com&
- Now open any browser and go to `http://localhost:8787` to reach the RStudio Web console, then log in with the `<username>:<password>` combo you created above.
- To install the required R packages, you first need to install `libcurl` and a few other system libraries on the master node, like below:
$ sudo yum update
$ sudo yum install -y curl
$ sudo yum install -y openssl
$ sudo yum install -y libcurl-devel
$ sudo yum install -y openssl-devel
$ sudo yum install -y libssh2-devel
- Resolve HDFS permission issues by creating a home directory for the new user and handing over its ownership:
$ sudo -u hdfs hadoop fs -mkdir /user/<username>
$ sudo -u hdfs hadoop fs -chown <username> /user/<username>
Otherwise, you may see an error like the one below when trying to create a Spark session from RStudio:
Error: Failed during initialize_connection() org.apache.hadoop.security.AccessControlException: Permission denied: user=<username>, access=WRITE, inode="/user/<username>/.sparkStaging/application_1476072880868_0008":hdfs:hadoop:drwxr-xr-x
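After creating the HDFS directory, a quick way to verify the fix is to start a SparkR session directly from the RStudio console. This is a minimal sketch, assuming Spark lives under /usr/lib/spark (the default on EMR) and using an app name of my own choosing; adjust SPARK_HOME if your layout differs:
# Point SparkR at the Spark installation shipped with EMR (assumed path)
Sys.setenv(SPARK_HOME = "/usr/lib/spark")
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
# Starting a YARN session writes to /user/<username>/.sparkStaging, so this
# fails with the AccessControlException above if the permissions are still wrong
sparkR.session(master = "yarn", appName = "permission-check")
sparkR.session.stop()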
- Install all the necessary packages in RStudio (exactly as you would on a local machine):
install.packages('devtools')
devtools::install_github('apache/spark@v2.2.1', subdir='R/pkg')
install.packages('sparklyr')
library(SparkR)
library(sparklyr)
library(dplyr)
You might need to install additional dependencies beyond the above, as my setup may differ from yours, so I am leaving this open; a quick sanity check is sketched below.
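A minimal sanity check (my own addition, not part of the original setup) to confirm the packages load and to see which versions actually got installed:
> # Confirm the packages load and report their installed versions
> library(sparklyr)
> library(dplyr)
> packageVersion('sparklyr')
> packageVersion('SparkR')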
- Once the required packages are installed and loaded, you can create a Spark session against the remote EMR/Spark cluster and interact with your SparkR application using the commands below:
> sc <- spark_connect(master = "yourEMRmasterNodeDNS:8998", method = "livy")
> copy_to(sc, iris)
# Source: table<iris> [?? x 5]
# Database: spark_connection
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
<dbl> <dbl> <dbl> <dbl> <chr>
1 5.10 3.50 1.40 0.200 setosa
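From here, standard dplyr verbs run against the cluster; sparklyr translates them to Spark SQL behind the scenes. A short sketch continuing the session above (the `iris_tbl` name is my own choice, not from the original):
> # Keep a handle to the remote table and aggregate it on the cluster
> iris_tbl <- copy_to(sc, iris, overwrite = TRUE)
> iris_tbl %>%
+   group_by(Species) %>%
+   summarise(mean_petal_length = mean(Petal_Length)) %>%
+   collect()
> spark_disconnect(sc)
Note that copy_to() replaces the dots in the iris column names with underscores, which is why the column is Petal_Length rather than Petal.Length, as the output above also shows.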
Need to learn more about AWS Big Data (Demystified)?
- Contact me via LinkedIn: Omid Vahdaty
- Website: https://amazon-aws-big-data-demystified.ninja/
- Join our meetup, Facebook group, and YouTube channel:
- Join our meetup: https://www.meetup.com/AWS-Big-Data-Demystified/
- Join our Facebook group: https://www.facebook.com/groups/amazon.aws.big.data.demystified/
- Subscribe to our YouTube channel: https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
——————————————————————————————————————————
I put a lot of thought into these blogs so that I can share the information in a clear and useful way. If you have any comments, thoughts, or questions, or if you need someone to consult with, feel free to contact me:
https://www.linkedin.com/in/omid-vahdaty/