Big Query, Cloud, Data Engineering, GCP

How can I get BigQuery cost per query per user?

How can I monitor the cost breakdown in GCP BigQuery per user, per query?

To get the cost of individual jobs, you can use BigQuery Audit Logs [1]. Create a BigQuery logging sink [2], then query the resulting table to get cost breakdowns, for example by creating a view on top of it, as is done in these examples [3].

How do I parse the query log data from Stackdriver?

After you create the logging sink, you will get unpartitioned tables (date-sharded, hence the wildcard suffix) named:

cloudaudit_googleapis_com_data_access_*

The table is highly nested. Here is a quick snippet that extracts the user, the query text, and the cost of each query, so you can spot the costly ones:

-- Extract project, user, start time, user agent, query text, and estimated cost from the audit log export
select
  resource.labels.project_id as project_id,
  protopayload_auditlog.authenticationInfo.principalEmail as user,
  protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.startTime as startTime,
  cast(protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.startTime as date) as date,
  protopayload_auditlog.requestMetadata.callerSuppliedUserAgent as userAgent,
  protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.query as query,
  -- billed bytes converted to TiB, multiplied by 5 (USD per TiB, the on-demand price assumed here)
  protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalBilledBytes / 1024 / 1024 / 1024 / 1024 * 5 as cost
from
  `MyDataSet.cloudaudit_googleapis_com_data_access_*`

Note that this query is also committed in our Big Data Demystified GitHub.
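To go from cost per query to cost per user, the rows returned by the query can be summed per user. The following is a minimal Python sketch of that arithmetic, not part of the original query. The function names and the sample rows are hypothetical, and the price of 5 USD per TiB is simply the `* 5` factor from the query; check it against your own pricing.

```python
from collections import defaultdict

# Assumption: on-demand price in USD per TiB, taken from the "* 5" in the query.
PRICE_PER_TIB_USD = 5.0
BYTES_PER_TIB = 1024 ** 4


def query_cost_usd(total_billed_bytes):
    """Mirror the SQL expression: totalBilledBytes / 1024^4 * 5."""
    return total_billed_bytes / BYTES_PER_TIB * PRICE_PER_TIB_USD


def cost_per_user(rows):
    """Sum the cost per user; `rows` are (user, total_billed_bytes) pairs."""
    totals = defaultdict(float)
    for user, billed_bytes in rows:
        # Assumption: rows without billed bytes (None) count as zero cost.
        totals[user] += query_cost_usd(billed_bytes or 0)
    return dict(totals)


# Hypothetical sample rows, not real audit log data.
sample_rows = [
    ("alice@example.com", 2 * BYTES_PER_TIB),
    ("bob@example.com", BYTES_PER_TIB // 2),
    ("alice@example.com", None),
]

per_user = cost_per_user(sample_rows)
# Print the most expensive users first.
for user, cost in sorted(per_user.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{user}: ${cost:.2f}")
```

In practice the same aggregation could be done in SQL with a group by on the user column; the Python version is only meant to make the cost formula explicit.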

Why should I monitor costs in BigQuery?

  1. Sometimes you will find scheduled queries that no longer need to run.
  2. Sometimes the ETL costs much more than you think.
  3. You can evaluate the ROI of pay-as-you-go pricing versus flat-rate pricing.
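The pay-as-you-go versus flat-rate comparison in the last item comes down to a break-even calculation. Below is a rough Python sketch under stated assumptions: the price of 5 USD per TiB is the factor used in the query in this post, while the flat-rate monthly amount is a purely hypothetical placeholder, not a real quote. The function names are illustrative only.

```python
# Assumption: on-demand price in USD per TiB, as used in this post's query.
PRICE_PER_TIB_USD = 5.0


def on_demand_monthly_cost(tib_billed_per_month, price_per_tib=PRICE_PER_TIB_USD):
    """Monthly pay-as-you-go cost for a given volume of billed TiB."""
    return tib_billed_per_month * price_per_tib


def break_even_tib(flat_rate_monthly_usd, price_per_tib=PRICE_PER_TIB_USD):
    """Billed TiB per month above which flat-rate becomes the cheaper option."""
    return flat_rate_monthly_usd / price_per_tib


def cheaper_plan(tib_billed_per_month, flat_rate_monthly_usd):
    """Return which plan costs less for the given monthly volume."""
    on_demand = on_demand_monthly_cost(tib_billed_per_month)
    return "flat-rate" if on_demand > flat_rate_monthly_usd else "on-demand"


# Hypothetical flat-rate commitment; replace with your actual monthly price.
HYPOTHETICAL_FLAT_RATE_USD = 10_000.0

print(break_even_tib(HYPOTHETICAL_FLAT_RATE_USD))
print(cheaper_plan(500, HYPOTHETICAL_FLAT_RATE_USD))
print(cheaper_plan(3000, HYPOTHETICAL_FLAT_RATE_USD))
```

The monthly billed volume to feed into this comparison is exactly what the audit log query gives you once you sum the billed bytes per month.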

[1] https://cloud.google.com/bigquery/docs/reference/auditlogs/
[2] https://cloud.google.com/bigquery/docs/reference/auditlogs/#defining_a_bigquery_log_sink_using_gcloud
[3] https://cloud.google.com/bigquery/docs/reference/auditlogs/#auditdata_examples

——————————————————————————————————————————

I put a lot of thought into these blogs so that I can share the information in a clear and useful way. If you have any comments, thoughts, or questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

architecture, AWS, AWS athena, AWS EMR, Cloud, Data Engineering, Spark

Big Data in 200KM/h | Big Data Demystified

What we’re about

A while ago I entered the challenging world of Big Data. As an engineer, at first, I was not so impressed with this field. As time went by, I realised more and more that the technological challenges in this area are too great for one person to master. Just look at the picture in this article; it covers only a small fraction of the technologies in the Big Data industry…

Consequently, I created a meetup detailing all the challenges of Big Data, especially in the world of cloud. I am using AWS infrastructure to answer the basic questions of anyone starting their way in the big data world.

How to transform data (TXT, CSV, TSV, JSON) into Parquet, ORC?
Which technology should we use to model the data? EMR? Athena? Redshift? Spectrum? Glue? Spark? SparkSQL?
How to handle streaming?
How to manage costs?
Performance tips?
Security tips?
Cloud best practices tips?

Some of our online materials:

Website:

https://big-data-demystified.ninja/

Youtube channels:

https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber

https://www.youtube.com/channel/UCMSdNB0fGmX5dXI7S7Y_LFA?view_as=subscriber

Meetup:

https://www.meetup.com/AWS-Big-Data-Demystified/

https://www.meetup.com/Big-Data-Demystified

Facebook Group :

https://www.facebook.com/groups/amazon.aws.big.data.demystified/

Facebook page (https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/)

Audience:

Data Engineers
Data Scientists
DevOps Engineers
Big Data Architects
Solution Architects
CTO
VP R&D

——————————————————————————————————————————

architecture, Cloud, Data Engineering, meetup, Performance

Alluxio Demystified | Unify Data Analytics Any Stack Any Cloud

Personally, I have been waiting for over a year to host this lecture at our meetup. Back when I was at Walla News, I wanted to test-drive their solution to accelerate Hive and Spark SQL over S3 and external tables. If you are into caching, performance, and unifying your multiple storage solutions (GCS, S3, etc.), you might want to hear the wonderful lecturer Bin Fan, PhD, Founding Engineer and VP of Open Source at Alluxio.

This post will be updated with more details soon, so stay tuned. For now, you are welcome to join our meetup.

Unify Data Analytics: Any Stack Any Cloud | Webinar | Big Data Demystified

Tuesday, Mar 19, 2019, 7:00 PM

Kikar Halehem
Kdoshei HaShoa St 63 Herzliya, IL

22 Members Went

**** This is the first webinar on this meetup, please be patient **** The webinar will be broadcast via YouTube: https://www.youtube.com/watch?v=5g89Wn6qgc0. If you want to join and become active in this webinar via Hangouts: https://hangouts.google.com/hangouts/_/3cjuacifwrdtpp2htrcrusakaae. If there is a problem, join our meetup group for last minute …


——————————————————————————————————————————

architecture, Big Query, Cloud, Data Engineering, GCP, GCP Big Data Demystified, meetup

GCP Big Data Demystified #1 | Investing.com Big Data Journey

 

How do you get started with Big Data? In the cloud or in a datacenter? What are the challenges? What about the architecture? Google Cloud or AWS? In this blog, I will share with you the slides and the video from a meetup held on 27.1.19 detailing the journey Investing.com has made to big data in the cloud.

——————————————————————————————————————————

Cloud

Big Data and Hadoop options over Microsoft Azure Cloud summary

Azure HDInsight features and advantages:

  1. Hive with LLAP
  2. Spark, Spark SQL, ML, Streaming
  3. Pig
  4. HBase
  5. Storm
  6. U-SQL – C# and SQL
  7. Federated query across several data sources
  8. Kafka! (with rack awareness for Azure), replication with MirrorMaker
  9. Microsoft R Server!
  10. Zeppelin and Jupyter integration.
  11. Apache Ambari View.
  12. Sqoop
  13. Oozie
  14. ZooKeeper, for leader election of head nodes (master node)
  15. Mahout, discontinued in v4.0
  16. Phoenix (SQL over HBase)
  17. Mono – open source C# .NET implementation.
  18. Apache Slider – like YARN (https://www.slideshare.net/duttashivaji/apache-slider). Not in the new version; discontinued in v4.0
  19. Apache Livy
  20. Security – Kerberos, Active Directory, Apache Ranger
  21. External Hive metastore
  22. Very rich documentation: https://docs.microsoft.com/en-us/azure/hdinsight/
  23. Rich developer plugins
    1. Zeppelin
    2. IntelliJ
    3. Eclipse
    4. R
    5. Visual Studio
    6. Jupyter

Ecosystem

  1. Data Lake Analytics
  2. Machine Learning
  3. Power BI!!
  4. Azure Cosmos DB – extension of Azure DocumentDB, basically NoSQL
  5. Azure Data Factory – orchestration
  6. Azure Event Hubs
  7. ISV data science
    1. H2O
    2. Dataiku

More advantages

  1. Each worker can be configured with a different size.
  2. Hive ODBC
  3. Hive add-on for Excel
  4. Autoscaling.

Architecture

  1. Gateway nodes – management and security.
  2. Head nodes – like the NameNode, in high availability.
  3. Edge nodes – not for data processing; they are for developers and data scientists to test jobs.
  4. Worker nodes – like DataNodes.
  5. ZooKeeper nodes – for leader election of head nodes.
  6. Nimbus nodes – with Storm.
  7. Hive metastore – Azure SQL.
  8. Azure Data Lake Store and Azure Blob storage.

Deployment

  1. Azure CLI to create clusters
  2. Airflow – open source.
  3. TBD.

 

 

——————————————————————————————————————————
