EMR and watchdog: service-nanny?

Want a watchdog that restarts a service if it crashes for any reason?

There are many ways to solve this; here are two. For the record, I needed this because I have to keep the Spark Thrift service for JDBC running, and it crashes every time there is an out-of-memory error.

Solution 1: Linux CRON

Use a good old cron entry that runs a small script every 5 minutes (easily customizable). The script checks whether the process is running and, if it is not, starts it. If the Thrift server should run only on the master node, you can use an EMR step to set this up: script-runner can run a custom script stored in S3 [1].
The advantage of this approach is that it is simple to code and maintain. It also does not depend on a particular EMR version or service.
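As a rough illustration of the check-and-start script described above, here is a minimal sketch. The script name, the function names, the match pattern and the start command are all assumptions for illustration, not something taken from a tested setup; adjust them to whatever actually starts your service.

```shell
#!/bin/bash
# Hypothetical watchdog script, e.g. saved as /home/hadoop/watchdog.sh.
# Usage: watchdog.sh <pattern> <start command...>

# Succeeds if any process command line matches the given pattern.
is_running() {
    pgrep -f "$1" > /dev/null 2>&1
}

# Runs the start command only when no matching process is found.
ensure_running() {
    local pattern="$1"
    shift
    if is_running "$pattern"; then
        return 0
    fi
    echo "$(date): no process matching '$pattern', running: $*"
    "$@"
}

# Only act when called with a pattern and a start command.
if [ "$#" -ge 2 ]; then
    ensure_running "$@"
fi
```

A crontab entry along these lines would run it every 5 minutes (the Thrift start script path and the log path are assumptions, so verify them on your cluster):

*/5 * * * * /home/hadoop/watchdog.sh '[H]iveThriftServer2' sudo /usr/lib/spark/sbin/start-thriftserver.sh >> /tmp/watchdog.log 2>&1

The pattern is written as [H]iveThriftServer2 on purpose: pgrep -f matches against full command lines, and the brackets keep the watchdog's own command line (which contains the pattern text) from counting as a match.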

Example of creating an EMR cluster with a script-runner step:

aws emr create-cluster --name "Test cluster" --release-label emr-5.16.0 --applications Name=Hive Name=Pig --use-default-roles --ec2-attributes KeyName=myKey --instance-type m4.large --instance-count 3 --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://region.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://mybucket/script-path/"]

Solution 2: service-nanny (Note: this is untested)

This solution uses service-nanny, the service watchdog present on all EMR clusters.
Create a service-nanny configuration file (/etc/service-nanny/yourservice.conf) containing some basic information about the process. Put the conf file in S3 and download it via a step (useful if you only want it on the master node). Once the file is in place, restart service-nanny. You can stop and start service-nanny with the commands below:

sudo /etc/init.d/service-nanny stop
sudo /etc/init.d/service-nanny start
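
The download-and-restart steps above could be scripted roughly as follows. This is only a sketch and, like the rest of Solution 2, untested on a real cluster: the function name install_nanny_conf and the S3 location are hypothetical, and the contents of the conf file itself are not shown here.

```shell
#!/bin/bash
# Hypothetical helper: fetch a service-nanny conf from S3, install it,
# then restart service-nanny so it picks up the new file.
# Usage: install-nanny-conf.sh <s3 uri> <conf file name>

install_nanny_conf() {
    local s3_uri="$1"
    local conf_name="$2"
    local tmp_file
    tmp_file="$(mktemp)" || return 1
    # Download to a temporary file first, then copy into place as root.
    aws s3 cp "$s3_uri" "$tmp_file" || return 1
    sudo cp "$tmp_file" "/etc/service-nanny/$conf_name" || return 1
    rm -f "$tmp_file"
    # Same stop/start commands as shown above.
    sudo /etc/init.d/service-nanny stop
    sudo /etc/init.d/service-nanny start
}

# Only act when called with both arguments.
if [ "$#" -eq 2 ]; then
    install_nanny_conf "$@"
fi
```

For example, with an assumed bucket layout: install-nanny-conf.sh s3://mybucket/yourservice.conf yourservice.conf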

You can find sample configurations under /usr/lib/service-nanny/example. A possible disadvantage: if EMR removes service-nanny in a future release, you may need to fall back to Solution 1.

Note: Solution 2 is untested, so please test it thoroughly before using it in production.





