The table is sharded, not partitioned: each day gets its own table name, e.g. ga_sessions_20190202.
The GA export also keeps temporary intraday data in tables such as ga_sessions_intraday_20190505,
so when you select * from ga_sessions_* you will also pick up some unexpected ga_sessions_intraday_* shards.
Below is an example of querying the last four days from the ga_sessions_* tables, including the intraday shards:
select *
from `myProject.MyDataset.ga_sessions_*`
where _TABLE_SUFFIX BETWEEN FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 4 DAY))
                        AND FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
   OR _TABLE_SUFFIX = CONCAT('intraday_', FORMAT_DATE("%Y%m%d", CURRENT_DATE()))
   OR _TABLE_SUFFIX = CONCAT('intraday_', FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)))
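If you want to run this kind of wildcard query from Python, a minimal sketch using the google-cloud-bigquery client library might look as follows; myProject and MyDataset are placeholders as above, and credentials are assumed to be already configured in the environment.

# Sketch: count sessions per day over the last four daily shards with google-cloud-bigquery.
# myProject / MyDataset are placeholders; credentials come from the environment.
from google.cloud import bigquery

client = bigquery.Client(project="myProject")  # placeholder project id

query = """
SELECT date, COUNT(*) AS sessions
FROM `myProject.MyDataset.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 4 DAY))
                        AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY date
ORDER BY date
"""

for row in client.query(query).result():
    print(row.date, row.sessions)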
Below are LOAD DATA FROM S3 examples for loading into Aurora MySQL (ignoring the header row or handling quoted fields):
LOAD DATA FROM S3 's3://bucket/sample_triggers.csv' INTO TABLE engine_triggers FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' IGNORE 1 LINES;
LOAD DATA FROM S3 's3://bucket/sample_triggers.csv' INTO TABLE engine_triggers FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '"';
Ignoring the header and using quoted fields:
LOAD DATA FROM S3 's3://bucket/sample_triggers.csv' INTO TABLE engine_triggers FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 LINES;
You can automate it with an hourly crontab entry as follows (note the extra escape character before \n and inside ENCLOSED BY "):
0 * * * * mysql -u User -pPassword -hClusterDomainName -e "use myDatabase; truncate myDatabase.engine_triggers; LOAD DATA FROM S3 's3://bucket/file.csv' INTO TABLE engine_triggers FIELDS TERMINATED BY ',' ENCLOSED BY '\"' LINES TERMINATED BY '\\n' IGNORE 1 LINES"
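If you would rather drive the same load from Python instead of cron plus the mysql client (for example from a scheduled job, or from the Lambda setup described in the next section), a rough sketch using pymysql might look like this; the host, credentials, bucket path and table name are placeholders.

# Sketch: truncate engine_triggers and reload it from S3 via pymysql.
# Host, credentials, bucket path and table name below are placeholders.
import pymysql

LOAD_SQL = """
LOAD DATA FROM S3 's3://bucket/file.csv'
INTO TABLE engine_triggers
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\\n'
IGNORE 1 LINES
"""

conn = pymysql.connect(host="ClusterDomainName", user="User",
                       password="Password", db="myDatabase")
try:
    with conn.cursor() as cur:
        cur.execute("TRUNCATE TABLE engine_triggers")
        cur.execute(LOAD_SQL)
    conn.commit()
finally:
    conn.close()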
I also tested connecting to my Aurora instance from Lambda. These are the steps I took to achieve that:
Create an Aurora cluster and connect to the writer instance using the cluster endpoint. Create a sample database and table. (Make sure the instance's security group allows the correct source IP addresses so the connection succeeds.)
Now, to create a Lambda function that accesses the Aurora instance:
To start with, we first need to create an execution role that gives your Lambda function permission to access AWS resources.
Please follow these steps to create an execution role:
1. Open the Roles page in the IAM console: https://console.aws.amazon.com/iam/home#/role
2. Choose Roles from the left dashboard and select Create role.
3. Under "Choose the service that will use this role", select Lambda and then Next: Permissions.
4. Search for "AWSLambdaVPCAccessExecutionRole", select it, and then choose Next: Tags.
5. Provide a tag and a role name (e.g. lambda-vpc-role), and then choose Create role.
The AWSLambdaVPCAccessExecutionRole has the permissions that the function needs to manage network connections to a VPC.
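If you prefer to script the role creation rather than click through the console, a rough boto3 sketch of the same steps might look like this; it assumes your local AWS credentials are allowed to manage IAM, and uses the same role name, lambda-vpc-role.

# Sketch: create the Lambda execution role and attach the VPC access policy with boto3.
# Assumes locally configured AWS credentials with permission to manage IAM.
import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the Lambda service assume this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="lambda-vpc-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# AWS managed policy that grants the ENI permissions a VPC-attached Lambda needs.
iam.attach_role_policy(
    RoleName="lambda-vpc-role",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole",
)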
Creating Lambda Function
Please follow the below steps to create a Lambda function:
1. Open the Lambda Management Console: https://console.aws.amazon.com/lambda
2. Choose Create a function.
3. Choose Author from scratch, and then do the following:
   * In Name*, specify your Lambda function name.
   * In Runtime*, choose Python 2.7.
   * In Execution role*, choose "Use an existing role".
   * In Role name*, enter the role created earlier, "lambda-vpc-role".
4. Choose Create function.
5. Once you have created the Lambda function, navigate to the function page.
6. On the function page, under the Network section, do the following:
   * In VPC, choose the default VPC.
   * In Subnets*, choose any two subnets.
   * In Security groups*, choose the default security group.
7. Click Save.
Setting up Lambda Deployment Environment
Next, you will need to set up a deployment environment to deploy the Python code that connects to the RDS database. To connect to Aurora from Python you need the pymysql module, so we install the dependencies with pip and create a deployment package. Execute the following commands in your local environment.
1. Create a local directory which will become the deployment package, and change into it: $ mkdir rds_lambda; cd rds_lambda
2. Install the dependency with pip into that directory: $ pip install pymysql -t .
By executing the above command you install the pymysql module into your current directory.
3. Next, create a Python file containing the code to connect to the RDS instance: $ sudo nano connectdb.py
I have attached the file "connectdb.py", which has the Python code to connect to the RDS instance; a minimal sketch of what such a file can look like is shown below.
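The attached file is not reproduced here, but a minimal pymysql-based handler along those lines might look like the following. The hostname, credentials and database name are placeholders, and the handler name (main) should match whatever you configure in the console later (e.g. connectdb.main).

# connectdb.py - minimal sketch of a Lambda handler that connects to Aurora.
# rds_host, user_name, password and db_name are placeholders; use your own
# cluster endpoint and credentials (ideally read from environment variables).
import logging
import pymysql

rds_host = "mycluster.cluster-example.us-east-1.rds.amazonaws.com"  # placeholder
user_name = "User"       # placeholder
password = "Password"    # placeholder
db_name = "myDatabase"   # placeholder

logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Open the connection outside the handler so it can be reused across invocations.
try:
    conn = pymysql.connect(host=rds_host, user=user_name,
                           password=password, db=db_name, connect_timeout=5)
except pymysql.MySQLError as e:
    logger.error("Could not connect to the Aurora instance: %s", e)
    raise

def main(event, context):
    """Handler referenced as connectdb.main in the Lambda console."""
    with conn.cursor() as cur:
        cur.execute("SELECT NOW()")
        row = cur.fetchone()
    logger.info("Connected, server time is %s", row[0])
    return "Execution successful"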
4. Next, we need to zip the contents of the current directory and upload it to the Lambda function: $ zip -r rds_lambda.zip `ls`
The above command creates a zip file, rds_lambda.zip, which we will upload to the Lambda function. Navigate to the console page of the newly created Lambda function:
1. In the Function code section, under Code entry type, select Upload a .zip file from the drop-down.
2. Browse to the zip file in your local directory.
3. Still in the Function code section, change the Handler to pythonfilename.function (e.g. connectdb.main).
4. Click Save.
5. Next, add the security group of the Lambda function to your RDS instance's security group as an allowed inbound source.
6. After that, test the connection by creating a test event.
If the execution is successful, the connection has been made.
A while ago I entered the challenging world of Big Data. As an engineer, at first I was not so impressed with this field. As time went by, I realised more and more that the technological challenges in this area are too great for one person to master. Just look at the picture in this article; it only covers a small fraction of the technologies in the Big Data industry…
Consequently, I created a meetup covering the challenges of Big Data, especially in the cloud. I use AWS, GCP and data-center infrastructure to answer the basic questions of anyone starting out in the big data world.
How do you transform data (TXT, CSV, TSV, JSON) into Parquet, ORC or Avro? Which technology should you use to model the data? EMR? Athena? Redshift? Spectrum? Glue? Spark? Spark SQL? GCS? BigQuery? Dataflow? Datalab? TensorFlow? How do you handle streaming? How do you manage costs? Performance tips? Security tips? Cloud best-practice tips?
In this meetup we present lecturers working across several cloud vendors, various big data platforms such as Hadoop, data warehouses, and startups working on big data products. Basically, if it is related to big data, this is THE meetup.
Some of our online materials (mixed content from several cloud vendors):
Personally, I have been waiting for over a year to host this lecture at our meetup. At the time, at Walla News, I wanted to test-drive their solution to accelerate Hive and Spark SQL over S3 and external tables. If you are into caching, performance, and unifying your multiple storage solutions (GCS, S3, etc.), you might want to hear the wonderful lecturer Bin Fan, PhD, Founding Engineer and VP of Open Source at Alluxio.
This post will be updated with more soon! Stay tuned. For now, you are welcome to join our meetup.