AWS athena, AWS Aurora, AWS Big Data Demystified, AWS EMR, AWS Lambda, AWS Redshift, Hive

200KM/h overview on Big Data in AWS – Part 1

12th February 2020 / 6th August 2020, Omid

In this lecture we cover the AWS Big Data PaaS technologies used to ingest and transform data. We also demonstrate a business use case, a suggested architecture, and some basic big data architecture rules of thumb.

AWS Big Data in 200 km h v1.1 from Omid Vahdaty

For more meetups:
https://www.meetup.com/Big-Data-Demystified/

——————————————————————————————————————————

I put a lot of thought into these blog posts so that I can share the information in a clear and useful way. If you have any comments, thoughts, or questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

AWS EMR, Hive

Cherry pick source files in Hive external table example

6th December 2019 / 15th January 2020, Omid

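A cool way to filter the files in your bucket for an external table in Hive! The second string in the LOCATION clause is a regular expression that selects which files to include: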

CREATE EXTERNAL TABLE mytable1 ( a string, b string, c string )
STORED AS TEXTFILE
LOCATION 's3://my.bucket/' 'folder/2009.*\.bz2$';
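
To see which files a pattern like the one above would pick up, here is a minimal Python sketch. It assumes the pattern is matched from the start of the object key, relative to the bucket root; how Hive on EMR applies it internally is not confirmed here. The object keys and the `select_keys` helper are made up for illustration.

```python
import re

# Pattern copied from the LOCATION clause above.
PATTERN = re.compile(r"folder/2009.*\.bz2$")


def select_keys(keys):
    """Return the keys the pattern selects.

    Assumption: the pattern is matched from the start of the object key,
    relative to the bucket root.
    """
    return [k for k in keys if PATTERN.match(k)]


# Hypothetical object keys, for illustration only.
keys = [
    "folder/2009-01-01.bz2",
    "folder/2009/12/part-0000.bz2",
    "folder/2010-01-01.bz2",   # wrong year
    "folder/2009-01-01.gz",    # wrong compression suffix
    "other/2009-01-01.bz2",    # wrong prefix
]

print(select_keys(keys))
```

Only the first two keys survive the filter: they start with folder/2009 and end with .bz2.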

——————————————————————————————————————————


architecture, AWS athena, AWS EMR, Cloud, Data Engineering, Spark

AWS Big Data in 200KM/h

4th August 2019 / 12th May 2022, Omid


Lecturer: Omid Vahdaty, 10.5.2022

AWS Big Data ecosystem and architecture best practices. We will provide a quick overview of all the different big data services in AWS.

Video

Slides

Lecturer: Omid Vahdaty, 4.8.2019

Hebrew meetup

How to transform data (TXT, CSV, TSV, JSON) into Parquet? Which technology should we use to model the data: EMR, Athena, Redshift, Spectrum, Glue, Spark, or SparkSQL? How to handle streaming? How to manage costs? Plus performance tips, security tips, and cloud best-practice tips.

Hebrew Video

Slides

——————————————————————————————————————————

architecture, AWS athena, AWS Big Data Demystified, AWS EMR, AWS Redshift, Data Engineering, EMR, Spark

AWS Big Data Demystified – Part 1 [English]

2nd April 2019 / 9th August 2020, Omid

A while ago I entered the challenging world of Big Data. As an engineer, at first, I was not so impressed with this field. As time went by, I realised more and more that the technological challenges in this area are too great for one person to master. Just look at the picture in this article: it covers only a small fraction of the technologies in the Big Data industry…

Consequently, I created a meetup detailing all the challenges of Big Data, especially in the world of cloud. I use AWS, GCP, and data center infrastructure to answer the basic questions of anyone starting out in the big data world.

How to transform data (TXT, CSV, TSV, JSON) into Parquet, ORC, or Avro? Which technology should we use to model the data: EMR, Athena, Redshift, Spectrum, Glue, Spark, SparkSQL, GCS, BigQuery, Dataflow, Datalab, or TensorFlow? How to handle streaming? How to manage costs? Plus performance tips, security tips, and cloud best-practice tips.

In this meetup we host lecturers working with several cloud vendors, on various big data platforms such as Hadoop and data warehouses, and at startups working on big data products. Basically, if it is related to big data, this is THE meetup.

Some of our online materials (mixed content from several cloud vendors):

Website:

https://big-data-demystified.ninja (under construction)

Meetups:

Big Data Demystified (Tel Aviv-Yafo, IL; 494 members)
Next meetup: Big Data Demystified | From Redshift to SnowFlake, Sunday, May 12, 2019, 6:00 PM (23 attending)

AWS Big Data Demystified (Tel Aviv-Yafo, IL; 635 members)

YouTube channels:

https://www.youtube.com/channel/UCMSdNB0fGmX5dXI7S7Y_LFA?view_as=subscriber

https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber

Audience:

Data Engineers
Data Scientists
DevOps Engineers
Big Data Architects
Solution Architects
CTO
VP R&D

——————————————————————————————————————————


AWS EMR

Working CloudFormation example with EMR – adding volumes to the Core group and Master group and increasing the root partition

12th August 2018 / 8th April 2020, Omid

Below is a complete working example of a CloudFormation template for an EMR cluster with:

1 x master node, on demand.

2 x core nodes, on demand.

No task group and no auto scaling.

A Route 53 DNS record for mydomain.

Notice the MasterInstanceGroup and CoreInstanceGroup sections in the JSON: each adds a 320 GB volume per node, for both the core and master groups. EbsRootVolumeSize increases the root partition to 100 GB (maximum supported). I am sharing this because the examples provided by AWS are not good enough, and it is very confusing to connect the dots.
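
The two instance group sections share the same shape, so one way to keep them in sync is to generate them. The following is only a sketch: the `instance_group` helper is hypothetical and not part of any AWS library, and it emits numbers and booleans where the template uses strings such as "320" and "true". The names, instance type, and sizes are the ones described in this post.

```python
import json


def instance_group(name, instance_type, count, ebs_size_gb=320):
    """Build one instance group with a single gp2 EBS volume per instance.

    Hypothetical helper; the keys mirror the CoreInstanceGroup and
    MasterInstanceGroup sections of an AWS::EMR::Cluster resource.
    """
    return {
        "EbsConfiguration": {
            "EbsBlockDeviceConfigs": [
                {
                    "VolumeSpecification": {
                        "SizeInGB": ebs_size_gb,
                        "VolumeType": "gp2",
                    },
                    "VolumesPerInstance": 1,
                }
            ],
            "EbsOptimized": True,
        },
        "InstanceCount": count,
        "InstanceType": instance_type,
        "Market": "ON_DEMAND",
        "Name": name,
    }


instances = {
    "MasterInstanceGroup": instance_group("masterNinja", "r4.4xlarge", 1),
    "CoreInstanceGroup": instance_group("coreNinja", "r4.4xlarge", 2),
}

# Extra EBS capacity across the cluster: one 320 GB volume on each of 3 nodes.
total_extra_gb = sum(
    group["InstanceCount"]
    * cfg["VolumesPerInstance"]
    * cfg["VolumeSpecification"]["SizeInGB"]
    for group in instances.values()
    for cfg in group["EbsConfiguration"]["EbsBlockDeviceConfigs"]
)

print(total_extra_gb)  # 960
print(json.dumps(instances, indent=2))
```

The printed JSON can be pasted under the "Instances" key of the cluster resource, next to Ec2KeyName and Ec2SubnetId.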

If this helps you, please “like” and subscribe 🙂

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Conditions": {
    "Hbase": {
      "Fn::Equals": [
        {
          "Ref": "Applications"
        },
        "Hbase"
      ]
    },
    "Spark": {
      "Fn::Equals": [
        {
          "Ref": "Applications"
        },
        "Spark"
      ]
    }
  },
  "Description": "myProdCluster",
  "Mappings": {
    
  },
  "Outputs": {
    
  },
  "Parameters": {
    "Applications": {
      "AllowedValues": [
        "Spark",
        "TBD"
      ],
      "Description": "Cluster setup:",
      "Type": "String"
    },
    "CoreInstanceType": {
      "Default": "r4.4xlarge",
      "Description": "Instance type to be used for core instances.",
      "Type": "String"
    },
    "EMRClusterName": {
      "Default": "myProdCluster",
      "Description": "Name of the cluster",
      "Type": "String"
    },
    "KeyName": {
      "Default": "walla_omid",
      "Description": "Must be an existing Keyname",
      "Type": "String"
    },
    "LogUri": {
      "Default": "s3://aws-logs-1123-eu-west-1/elasticmapreduce/",
      "Description": "Must be a valid S3 URL",
      "Type": "String"
    },
    "MasterInstacneType": {
      "Default": "r4.4xlarge",
      "Description": "Instance type to be used for the master instance.",
      "Type": "String"
    },
    "NumberOfCoreInstances": {
      "Default": 2,
      "Description": "Must be a valid number",
      "Type": "Number"
    },
    "ReleaseLabel": {
      "Default": "emr-5.13.0",
      "Description": "Must be a valid EMR release version",
      "Type": "String"
    },
    "S3DataUri": {
      "Default": "s3://aws-logs-1234-eu-west-1/elasticmapreduce/",
      "Description": "Must be a valid S3 bucket URL ",
      "Type": "String"
    },
    "SubnetID": {
      "Default": "subnet-0647325e",
      "Description": "Must be Valid public subnet ID",
      "Type": "String"
    }
  },
  "Resources": {
    "EMRCluster": {
      "DependsOn": [
        "EMRClusterServiceRole",
        "EMRClusterinstanceProfileRole",
        "EMRClusterinstanceProfile"
      ],
      "Properties": {
        "Applications": [
          {
            "Name": "Ganglia"
          },
          {
            "Name": "Spark"
          },
          {
            "Name": "Hive"
          },
          {
            "Name": "Tez"
          },
          {
            "Name": "Zeppelin"
          },
          {
            "Name": "Oozie"
          },
          {
            "Name": "Hue"
          },
          {
            "Name": "Presto"
          },
          {
            "Name": "Livy"
          }
        ],
        "AutoScalingRole": "EMR_AutoScaling_DefaultRole",
        "Configurations": [
          {
            "Classification": "hive-site",
            "ConfigurationProperties": {
              "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
            }
          },
          {
            "Classification": "spark",
            "ConfigurationProperties": {
              "maximizeResourceAllocation": "true"
            }
          },
          {
            "Classification": "presto-connector-hive",
            "ConfigurationProperties": {
              "hive.metastore.glue.datacatalog.enabled": "true"
            }
          },
          {
            "Classification": "spark-hive-site",
            "ConfigurationProperties": {
              "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
            }
          }
        ],
        "EbsRootVolumeSize": 100,
        "Instances": {
          "AdditionalMasterSecurityGroups": [
            "sg-111"
          ],
          "AdditionalSlaveSecurityGroups": [
            "sg-112"
          ],
          "CoreInstanceGroup": {
            "EbsConfiguration": {
              "EbsBlockDeviceConfigs": [
                {
                  "VolumeSpecification": {
                    "SizeInGB": "320",
                    "VolumeType": "gp2"
                  },
                  "VolumesPerInstance": "1"
                }
              ],
              "EbsOptimized": "true"
            },
            "InstanceCount": 2,
            "InstanceType": "r4.4xlarge",
            "Market": "ON_DEMAND",
            "Name": "coreNinja"
          },
          "MasterInstanceGroup": {
            "EbsConfiguration": {
              "EbsBlockDeviceConfigs": [
                {
                  "VolumeSpecification": {
                    "SizeInGB": "320",
                    "VolumeType": "gp2"
                  },
                  "VolumesPerInstance": "1"
                }
              ],
              "EbsOptimized": "true"
            },
            "InstanceCount": 1,
            "InstanceType": "r4.4xlarge",
            "Market": "ON_DEMAND",
            "Name": "masterNinja"
          },
          "Ec2KeyName": {
            "Ref": "KeyName"
          },
          "Ec2SubnetId": {
            "Ref": "SubnetID"
          },
          "TerminationProtected": false
        },
        "JobFlowRole": {
          "Ref": "EMRClusterinstanceProfile"
        },
        "LogUri": {
          "Ref": "LogUri"
        },
        "Name": {
          "Ref": "EMRClusterName"
        },
        "ReleaseLabel": {
          "Ref": "ReleaseLabel"
        },
        "ServiceRole": {
          "Ref": "EMRClusterServiceRole"
        },
        "VisibleToAllUsers": true
      },
      "Type": "AWS::EMR::Cluster"
    },
    "EMRClusterServiceRole": {
      "Properties": {
        "AssumeRolePolicyDocument": {
          "Statement": [
            {
              "Action": [
                "sts:AssumeRole"
              ],
              "Effect": "Allow",
              "Principal": {
                "Service": [
                  "elasticmapreduce.amazonaws.com"
                ]
              }
            }
          ],
          "Version": "2012-10-17"
        },
        "ManagedPolicyArns": [
          "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole"
        ],
        "Path": "/",
        "Policies": [
          {
            "PolicyName": "s3fullaccess",
            "PolicyDocument": {
              "Version": "2012-10-17",
              "Statement": [
                {
                  "Effect": "Allow",
                  "Action": "s3:*",
                  "Resource": "*"
                }
              ]
            }
          }
        ]
      },
      "Type": "AWS::IAM::Role"
    },
    "EMRClusterinstanceProfile": {
      "Properties": {
        "Path": "/",
        "Roles": [
          {
            "Ref": "EMRClusterinstanceProfileRole"
          }
        ]
      },
      "Type": "AWS::IAM::InstanceProfile"
    },
    "EMRClusterinstanceProfileRole": {
      "Properties": {
        "AssumeRolePolicyDocument": {
          "Statement": [
            {
              "Action": [
                "sts:AssumeRole"
              ],
              "Effect": "Allow",
              "Principal": {
                "Service": [
                  "ec2.amazonaws.com"
                ]
              }
            }
          ],
          "Version": "2012-10-17"
        },
        "ManagedPolicyArns": [
          "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role"
        ],
        "Path": "/"
      },
      "Type": "AWS::IAM::Role"
    },
    "TestStep": {
      "Type": "AWS::EMR::Step",
      "Properties": {
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
          "Args": [
            "s3://byoo-emr-bootstrap/bootstrap-emr.sh"
          ],
          "Jar": "s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar"
        },
        "Name": "CustomBootstrap",
        "JobFlowId": {
          "Ref": "EMRCluster"
        }
      }
    },
    "myDNSRecord": {
      "Type": "AWS::Route53::RecordSet",
      "DependsOn": [
        "EMRCluster"
      ],
      "Properties": {
        "HostedZoneName": "myDomain.",
        "Comment": "DNS name for my instance. for emr. cloud formation",
        "Name": "mydomain.com",
        "Type": "CNAME",
        "TTL": "600",
        "ResourceRecords": [
          {
            "Fn::GetAtt": [
              "EMRCluster",
              "MasterPublicDNS"
            ]
          }
        ]
      }
    }
  }
}
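
When adapting a template like this one, it helps to check which parameters actually change anything. For instance, the core instance group above hardcodes r4.4xlarge instead of referencing CoreInstanceType. The sketch below walks a template and lists the parameters that nothing references. It only follows Ref and Fn::GetAtt (not Fn::Sub strings), the helper names are made up, and the excerpt is a trimmed stand-in for the full template.

```python
import json


def collect_refs(node, found=None):
    """Recursively gather every name used in a Ref or as an Fn::GetAtt target."""
    if found is None:
        found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "Ref" and isinstance(value, str):
                found.add(value)
            elif key == "Fn::GetAtt" and isinstance(value, list) and value:
                found.add(value[0])
            else:
                collect_refs(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_refs(item, found)
    return found


def unused_parameters(template):
    """Return the declared parameters that are never referenced, sorted."""
    refs = collect_refs(template)
    return sorted(set(template.get("Parameters", {})) - refs)


# Trimmed excerpt of the template above, for illustration only.
EXCERPT = json.loads("""
{
  "Parameters": {
    "CoreInstanceType": {"Default": "r4.4xlarge", "Type": "String"},
    "KeyName": {"Default": "walla_omid", "Type": "String"}
  },
  "Resources": {
    "EMRCluster": {
      "Type": "AWS::EMR::Cluster",
      "Properties": {
        "Instances": {
          "CoreInstanceGroup": {"InstanceCount": 2, "InstanceType": "r4.4xlarge"},
          "Ec2KeyName": {"Ref": "KeyName"}
        }
      }
    }
  }
}
""")

print(unused_parameters(EXCERPT))  # ['CoreInstanceType']
```

To run it against the full template, save the JSON to a file and pass the result of json.load to unused_parameters.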

——————————————————————————————————————————
