If you are looking for ways to speed up queries that repeatedly read subsets of the same data, and you want to know whether there is an AWS solution for caching frequently accessed data, here are a few options.
If you are using Hive, you may use LLAP (if you are not already). LLAP is effectively a daemon that caches metadata as well as the data itself. There is an AWS blog post on enabling LLAP using a bootstrap action and then executing your queries. Take a look at [1] and let me know if you have any questions about it. LLAP daemons are launched under YARN management to ensure that the nodes don't get overloaded with the compute resources of these daemons. You can specify the number of daemon instances, the memory allocation, the number of executors per instance, and so forth. Each option also has a default value.
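As a rough illustration of how those defaults play out, the sketch below applies the percentages from the flag list in this section to one hypothetical node type. The function name, the megabyte units, the integer rounding, and the example node figures are all assumptions made for illustration; the actual bootstrap script may compute or round these values differently.

```python
def llap_default_sizing(core_nodes, physical_mem_mb, available_mem_mb, cpu_cores):
    """Estimate LLAP defaults from the percentages documented for the bootstrap action.

    All inputs describe a single hypothetical node type; memory is in MB.
    """
    # YARN container memory: 50% of the memory available on a node.
    container_mb = available_mem_mb // 2
    return {
        "instances": core_nodes,                  # one daemon per slave node
        "cache": physical_mem_mb * 20 // 100,     # 20% of physical memory
        "executors": cpu_cores,                   # one executor per CPU core
        "iothreads": cpu_cores,                   # one IO thread per CPU core
        "size": container_mb,
        "xmx": container_mb // 2,                 # 50% of container memory
        "log-level": "INFO",
    }


if __name__ == "__main__":
    # Hypothetical cluster: 4 nodes, 64 GB RAM, 48 GB available to YARN, 16 cores.
    for option, value in llap_default_sizing(4, 65536, 49152, 16).items():
        print(option, value)
```

Running the numbers like this before launching a cluster makes it easier to see whether the default cache is large enough for the subset of data you query most often, or whether you should override it.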
# --instances – number of LLAP daemon instances, defaults to the number of slave nodes
# --cache – LLAP cache for each daemon, defaults to 20% of physical memory
# --executors – number of executors per daemon, defaults to the number of CPU cores
# --iothreads – number of IO threads, defaults to the number of CPU cores
# --size – YARN container memory, defaults to 50% of available memory on a node
# --xmx – LLAP daemon memory, defaults to 50% of container memory
# --log-level – log level, defaults to INFO
If you are using Spark, RDD persistence is one of the mechanisms you can use to cache data in memory across operations. You can choose among several storage levels, such as memory only, or memory and disk, among others listed in [2]. You mark an RDD to be persisted by calling the persist() or cache() method on it.
Tachyon (now Alluxio) works along similar lines. It sits between HDFS and Spark and provides an in-memory file system, acting like virtual distributed storage. Integration of Alluxio in EMR is currently in development stages [3].
I have not personally tested the solutions above, but I am planning to, and I will update this post when I do. Have you tested any of this yourself? Please contact me with your feedback.
References
[1] AWS Blog LLAP – https://aws.amazon.com/blogs/big-data/turbocharge-your-apache-hive-queries-on-amazon-emr-using-llap/
[2] RDD Persistence – https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#rdd-persistence
[3] Alluxio Docs – http://www.alluxio.org/docs/master/en/Running-Spark-on-Alluxio.html#class-alluxiohadoopfilesystem-not-found-issues-with-sparksql-and-hive-metastore
[4] LLAP Wiki – https://cwiki.apache.org/confluence/display/Hive/LLAP#LLAP-Caching
[5] LLAP benchmark – https://www.slideshare.net/Hadoop_Summit/hadoop-query-performance-smackdown
[6] Hive LLAP benchmark vs. Impala – https://dzone.com/articles/3x-faster-interactive-query-with-apache-hive-llap
Need to learn more about AWS Big Data (Demystified)?
- Contact me via LinkedIn: Omid Vahdaty
- Website: https://amazon-aws-big-data-demystified.ninja/
- Join our meetup, Facebook group, and YouTube channel
- Join our meetup: https://www.meetup.com/AWS-Big-Data-Demystified/
- Join our Facebook group: https://www.facebook.com/groups/amazon.aws.big.data.demystified/
- Subscribe to our YouTube channel: https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
——————————————————————————————————————————
I put a lot of thought into these blog posts so that I can share the information in a clear and useful way. If you have any comments, thoughts, or questions, or you need someone to consult with, feel free to contact me:
https://www.linkedin.com/in/omid-vahdaty/