Azure HDInsight features and advantages:
- Hive with LLAP
- Spark, Spark SQL, ML, Streaming
- Pig
- HBase
- Storm
- U-SQL – C# and SQL
- Federated query across several data sources
- Kafka! (with rack awareness for Azure), replication with MirrorMaker
- Microsoft R Server!
- Zeppelin and Jupyter integration.
- Apache Ambari View.
- Sqoop
- Oozie
- ZooKeeper, for leader election of head nodes (master nodes)
- Mahout, discontinued in v4.0
- Phoenix (SQL over HBase)
- Mono – open source implementation of C# and .NET.
- Apache Slider – deploys and manages long-running applications on YARN (https://www.slideshare.net/duttashivaji/apache-slider); discontinued in v4.0
- Apache Livy
- Security – Kerberos, Active Directory, and Apache Ranger
- External Hive metastore
- Very rich documentation: https://docs.microsoft.com/en-us/azure/hdinsight/
- Rich developer plugins
  - Zeppelin
  - IntelliJ
  - Eclipse
  - R
  - Visual Studio
  - Jupyter
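
Apache Livy, listed above, exposes Spark job submission over REST. Below is a minimal sketch of what a batch submission could look like, assuming the cluster serves Livy at `https://<cluster>.azurehdinsight.net/livy/batches` and accepts basic authentication with the cluster login. The cluster name, credentials, jar path, and class name are all hypothetical placeholders, and the request is only built here, not sent.

```python
import base64
import json
import urllib.request


def build_livy_batch_request(cluster, user, password, jar_path, class_name, args=()):
    """Build (but do not send) a Livy POST /batches request for a Spark job."""
    # Assumed endpoint layout for an HDInsight cluster; verify against your cluster.
    url = f"https://{cluster}.azurehdinsight.net/livy/batches"
    # Livy batch payload: the application file, its main class, and its arguments.
    payload = {"file": jar_path, "className": class_name, "args": list(args)}
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Content-Type": "application/json",
        "X-Requested-By": user,
        "Authorization": f"Basic {token}",
    }
    return urllib.request.Request(
        url, data=json.dumps(payload).encode("utf-8"), headers=headers, method="POST"
    )


# Hypothetical values, for illustration only.
request = build_livy_batch_request(
    cluster="mycluster",
    user="admin",
    password="<cluster-login-password>",
    jar_path="wasbs:///example/jars/wordcount.jar",
    class_name="com.example.WordCount",
    args=["wasbs:///example/data/input.txt"],
)
print(request.get_method(), request.full_url)
# To actually submit the job: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the payload before anything reaches the cluster.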
Ecosystem
- Data Lake Analytics
- Machine Learning
- Power BI!!
- Azure Cosmos DB – an extension of Azure DocumentDB, basically NoSQL
- Azure Data Factory – orchestration
- Azure Event Hubs
- ISV data science
  - H2O
  - Dataiku
More advantages
- Worker nodes can be configured with different sizes.
- Hive ODBC
- Hive add-on for Excel
- Autoscaling.
Architecture
- Gateway nodes – management and security.
- Head nodes – like the name node, in high availability.
- Edge nodes – not for data processing; they are for developers and data scientists to test jobs.
- Worker nodes – like data nodes.
- ZooKeeper nodes – for leader election of head nodes.
- Nimbus nodes – with Storm.
- Hive metastore – Azure SQL
- Azure Data Lake Store and Azure Blob storage
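
Because storage sits outside the cluster (Azure Blob storage or Azure Data Lake Store), jobs address data by URI rather than by a local HDFS path. The sketch below shows how such URIs are typically formed, assuming the `wasbs://` scheme for Blob storage and the `adl://` scheme for Data Lake Store (Gen1). The helper names and the account, container, and path values are hypothetical.

```python
def blob_uri(container, account, path, secure=True):
    """Build a URI for a file in Azure Blob storage (wasbs = TLS, wasb = plain)."""
    scheme = "wasbs" if secure else "wasb"
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def data_lake_uri(account, path):
    """Build a URI for a file in Azure Data Lake Store (Gen1)."""
    return f"adl://{account}.azuredatalakestore.net/{path.lstrip('/')}"


# Hypothetical account, container, and path names, for illustration only.
print(blob_uri("data", "mystorage", "/logs/2019/"))
# → wasbs://data@mystorage.blob.core.windows.net/logs/2019/
print(data_lake_uri("mylake", "/clusters/demo/input.csv"))
# → adl://mylake.azuredatalakestore.net/clusters/demo/input.csv
```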
Deployment
- Azure CLI to create clusters
- Airflow – open source.
- TBD.
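
As a rough illustration of the Azure CLI route, the sketch below assembles an `az hdinsight create` command from Python, for example so it could be called from an orchestration task. The helper function and all names, sizes, and credentials are hypothetical, and the flags should be checked against `az hdinsight create --help` for your CLI version. The command is only printed here, not executed.

```python
import shlex


def build_create_cluster_command(
    name,
    resource_group,
    cluster_type,
    http_password,
    storage_account,
    workernode_count=3,
    workernode_size=None,
):
    """Return the argv list for creating an HDInsight cluster with the Azure CLI."""
    cmd = [
        "az", "hdinsight", "create",
        "--name", name,
        "--resource-group", resource_group,
        "--type", cluster_type,
        "--http-password", http_password,
        "--storage-account", storage_account,
        "--workernode-count", str(workernode_count),
    ]
    # Worker node size is optional; the CLI picks a default when it is omitted.
    if workernode_size:
        cmd += ["--workernode-size", workernode_size]
    return cmd


# Hypothetical values, for illustration only.
cmd = build_create_cluster_command(
    name="demo-spark",
    resource_group="demo-rg",
    cluster_type="spark",
    http_password="<cluster-login-password>",
    storage_account="demostorage",
    workernode_count=4,
    workernode_size="Standard_D4_v2",
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Keeping the command as a list instead of one string avoids shell-quoting problems when a password or name contains special characters.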
——————————————————————————————————————————
I put a lot of thought into these blog posts so that I can share the information in a clear and useful way. If you have any comments, thoughts, or questions, or you need someone to consult with, feel free to contact me: