BI, GCP

Tableau Demystified | Install Tableau Server on GCP

This is a straightforward, "cut the bullshit and give me what I need" manual for installing Tableau Server on a Windows machine on GCP. Naturally, I added my personal tips.

  1. Be sure to use a Windows image with the Desktop Experience (I may try later with Linux and more complex hardware environments, stay tuned).
  2. Machine type: at least 8 cores (to maximize network bandwidth), 30 GB RAM, and a 1,024 GB SSD to maximize I/O. Be sure to create the machine in a region close to your BigQuery dataset region. The examples assume the public IP 1.2.3.4 (see the gcloud sketch after this list).
  3. Assign a password to the Windows user.
  4. Download the Chrome RDP plugin for GCP (optional).
  5. Make sure you added port 3389 (RDP) to your network ingress rules.
  6. RDP to the machine and turn off the Windows Firewall.
  7. Download the Tableau Server installer and run it.
  8. Logging in to TSM requires a Windows user in the local admin group. I created a new user with an easier password for this installation.
  9. You need a license for Tableau Server; usually you can activate it once for PROD and twice for NON-PROD. Note that there is a 14-day trial option, and offline license activation is available for air-gapped environments.
  10. Once the installation is done, configure the admin user and open port 80 to the instance. Access the server from your desktop using a web browser and your server IP, e.g. http://1.2.3.4.
  11. From Tableau Desktop, sign in to Tableau Server using the credentials from step 10. You should now be able to publish.
  12. Note that when a scheduled extract runs, the expected metrics are about 40% CPU utilization(!) and about 15 GB of RAM. The network is unlikely to be a bottleneck, but it will oscillate between 0 and 5 Mbps (about 10,000 per second). See the extract optimization resources below. Also, log in to TSM and add more process instances based on the number of CPU cores.
  13. Note that after a restart, you need to log in to TSM and start Tableau Server.
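For reference, the GCP side of steps 2, 3, 5 and 10 can be scripted roughly as below. This is a minimal sketch: the instance name, zone, disk type and firewall rule names are assumptions, and n1-standard-8 is simply the machine type that matches the 8-core / 30 GB spec above.

# Create the Windows Server VM (n1-standard-8 = 8 vCPUs / 30 GB RAM).
gcloud compute instances create tableau-server \
    --zone=us-central1-a \
    --machine-type=n1-standard-8 \
    --image-project=windows-cloud --image-family=windows-2019 \
    --boot-disk-size=1024GB --boot-disk-type=pd-ssd

# Open RDP (step 5) and HTTP for the Tableau web UI (step 10).
gcloud compute firewall-rules create allow-rdp --allow=tcp:3389
gcloud compute firewall-rules create allow-tableau-http --allow=tcp:80

# Generate a Windows password for your user (step 3).
gcloud compute reset-windows-password tableau-server --zone=us-central1-a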

Below is an example of CPU core allocation for Tableau Server:
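As a hedged sketch, on a single 8-core node the allocation can be set through TSM like this; the process counts here are assumptions, so tune them to your extract workload.

# Allocate backgrounder and VizQL processes on node1, then apply the change.
tsm topology set-process -n node1 -pr backgrounder -c 2
tsm topology set-process -n node1 -pr vizqlserver -c 2
tsm pending-changes apply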

Video Blog on getting started on Tableau Desktop:

Tableau Demystified | Quick introduction in 10 minutes

Resources to improve the performance of Tableau extracts:

https://help.tableau.com/current/server/en-us/install_config_top.htm

https://www.tableau.com/about/blog/2018/4/zulilys-top-10-tips-self-service-analytics-google-bigquery-and-tableau-84969

https://community.tableau.com/docs/DOC-23150

https://community.tableau.com/docs/DOC-23161

——————————————————————————————————————————

I put a lot of thought into these blogs so that I can share the information in a clear and useful way. If you have any comments, thoughts, or questions, or if you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

airflow, GCP

How to SSH to a remote GCP machine and run a command via Airflow?

Use Airflow's BashOperator, which executes bash commands locally on the node running the Airflow workers, and use the gcloud command to connect to the remote instance. Below (and in our Git repository) I'm sharing an example:

import datetime

from airflow import models
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator

# Setting the start date to yesterday starts the DAG immediately when it is
# detected in the Cloud Storage bucket.
yesterday = datetime.datetime.combine(
    datetime.datetime.today() - datetime.timedelta(1),
    datetime.datetime.min.time())

default_dag_args = {
    'start_date': yesterday,
    # To email on failure or retry, set the 'email' arg to your email and
    # enable emailing here.
    'email_on_failure': False,
    'email_on_retry': False,
    # If a task fails, retry it once after waiting at least 5 minutes.
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'project_id': models.Variable.get('gcp_project')
}

# The command executed on the Airflow worker: it uses gcloud to SSH into the
# remote instance and run "ls /tmp/" there.
bash_cmd = ('gcloud beta compute --project MyProjectName ssh myMachineHostname '
            '--internal-ip --zone us-central1-a --command "ls /tmp/"')

with models.DAG(
        'bash_remote_gcp_machine_example',
        # Run the DAG once; change to e.g. "@daily" for a recurring schedule.
        schedule_interval="@once",
        default_args=default_dag_args) as dag:

    start = DummyOperator(task_id='start')

    end = DummyOperator(task_id='end')

    bash_remote_gcp_machine = BashOperator(
        task_id='bash_remote_gcp_machine_task',
        bash_command=bash_cmd)

    start >> bash_remote_gcp_machine >> end

The above DAG will only work if the service account used by the Airflow machine is allowed to access the remote machine. If not, use something like the authentication command below, wrapped in a BashOperator:

# Authenticating with a service account key (the key file path is a placeholder).
bash_command_auth = BashOperator(
    task_id='auth_bash',
    bash_command='gcloud auth activate-service-account '
                 '--key-file=/<path_to_your_SA_key_file>/<your_service_account_key.json>')
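Alternatively, granting that access is usually an IAM change on the target project. A hedged sketch follows; the service account name is a placeholder and the exact roles depend on your access model (OS Login is assumed here).

# Allow the Airflow/Composer service account to SSH into instances of the
# target project (roles shown are an assumption; adjust to your setup).
gcloud projects add-iam-policy-binding MyProjectName \
    --member="serviceAccount:<your_airflow_sa>@<your_project>.iam.gserviceaccount.com" \
    --role="roles/compute.osLogin"

# Needed if the remote instance itself runs as a service account.
gcloud projects add-iam-policy-binding MyProjectName \
    --member="serviceAccount:<your_airflow_sa>@<your_project>.iam.gserviceaccount.com" \
    --role="roles/iam.serviceAccountUser"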

——————————————————————————————————————————

I put a lot of thought into these blogs so that I can share the information in a clear and useful way. If you have any comments, thoughts, or questions, or if you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

airflow, Big Query, GCP

How to debug a BigQuery query failure in an Airflow DAG?

The problem is that sometimes the complete list of errors is not presented in Airflow.

The log only shows the following error:

Error while reading data, error message: CSV table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.'}. The job was: {'kind': 'bigquery#job', 'etag': 'ldp8TiJ9Ki0q3IiaMV8p5g==', 'id': 'MyProjectName:US.job_R2ig2EmIBy0-wD76zcNOCcuuyVX4',

The trick is to understand that there are more errors, which you can see by running the following on the command line:

 bq show -j US.job_R2ig2EmIBy0-wD76zcNOCcuuyVX4

This will share the list of failure details:

Failure details:
 - gs://myBucket/reviews_utf8/c_201804.csv: Error while reading data,
   error message: CSV table encountered too many errors, giving up.
   Rows: 1; errors: 1. Please look into the errors[] collection for
   more details.
 - gs://myBucket/reviews_utf8/c_201801.csv: Error while reading
   data, error message: Could not parse '3.9.58' as double for field
   App_Version_Name (position 2) starting at location 342
 - gs://myBucket/reviews_utf8/c_201804.csv: Error while reading data,
   error message: Could not parse '2.0.1' as double for field
   App_Version_Name (position 2) starting at location 342

From there, debugging the problem was simply a matter of understanding that there was a mismatch between the values in the CSV and the data types of the table columns.
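If you prefer the full, machine-readable output, the same job can also be inspected as JSON; a small sketch (the job ID is the one from the Airflow log above):

# List recent jobs in the project to find the failing job ID.
bq ls -j -n 10

# Dump the full job metadata, including the complete errors[] collection.
bq show --format=prettyjson -j US.job_R2ig2EmIBy0-wD76zcNOCcuuyVX4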

——————————————————————————————————————————

I put a lot of thought into these blogs so that I can share the information in a clear and useful way. If you have any comments, thoughts, or questions, or if you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

airflow, AWS, Cloud SQL, Data Engineering, GCP

Airflow Demystified | Everything you need to know about installing a DIY LocalExecutor Airflow cluster backed by MySQL Cloud SQL

Whether you are using Google's Composer or you want to experiment with Airflow yourself, there are many reasons to start using Airflow:

  1. First and foremost, full orchestration visibility and management.
  2. Second, it is cloud-vendor agnostic; you can use it in any cloud, like any other open source technology.
  3. A large community.
  4. A massive number of built-in connectors.

Reasons to use an Airflow DIY cluster:

  1. High customization options, such as choosing between the several types of executors.
  2. Cost control: GCP Composer starts with a minimum of 3 nodes, about $300 monthly, compared with a DIY cluster that starts at about $5 monthly for a SequentialExecutor Airflow server, or about $40 monthly for a LocalExecutor Airflow cluster backed by Cloud SQL MySQL (with 1 CPU and 4 GB RAM).
  3. Our motto in the Big Data Demystified community is: Faster, Cheaper, Simpler. It is easier to connect to the data in a DIY cluster in order to build custom analytical reports, such as cost per DAG (if you can't wait for the blog about this, start reading about cost per query per user in BigQuery).

Getting started with Airflow? Want to learn Airflow by example?

Start playing with GCP Composer and get to know the basic functionality of Airflow.
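If you want to spin up a small Composer environment just to experiment, it looks roughly like this; the environment name, location and sizes are assumptions, using the Composer 1 style of the command.

# Create a small Composer environment to play with (Composer 1 syntax).
gcloud composer environments create airflow-playground \
    --location=us-central1 \
    --zone=us-central1-a \
    --node-count=3 \
    --machine-type=n1-standard-1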

We already have a good Airflow blog for you, with examples and slides.

How to install a SequentialExecutor Airflow server (standalone, no concurrency)?

We already have a basic Airflow installation blog for you.

Installing a DIY Airflow cluster in LocalExecutor mode?

Tips for a DIY cluster

  1. My first tip would be RTFM: read the Airflow docs.
  2. Generally speaking, get yourself very familiar with airflow.cfg. If you get lost in the documentation, here is a working example of an airflow.cfg configuration file (see the sketch after this list for the key LocalExecutor settings).
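For reference, the two settings that actually switch a vanilla install to a Cloud SQL backed LocalExecutor look roughly like this. This is a sketch assuming Airflow 1.x, the pymysql driver, and placeholder connection values.

# Point airflow.cfg at the Cloud SQL MySQL instance and switch executors
# (user, password and IP below are placeholders).
sed -i 's/^executor = .*/executor = LocalExecutor/' ~/airflow/airflow.cfg
sed -i 's#^sql_alchemy_conn = .*#sql_alchemy_conn = mysql+pymysql://airflow_user:<password>@<cloud_sql_ip>:3306/airflow#' ~/airflow/airflow.cfg

pip install pymysql      # driver assumed by the connection string above
airflow initdb           # Airflow 1.x command; creates the Airflow schema in Cloud SQL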

There are several things you need to consider when deploying a DIY Airflow cluster:

  1. You must have a backing database for concurrency in LocalExecutor mode.
  2. The backing DB in GCP should be Cloud SQL (a cheaper instance will do), and then you can connect Cloud SQL to BigQuery via federated queries. On AWS, consider using Aurora for maximum billing flexibility. To configure Cloud SQL, read our blog.
  3. If you are using GCP Composer and you want to connect to its backing DB, it is not going to be straightforward: you are going to need the Cloud SQL proxy or SQLAlchemy, and query federation to BigQuery will not be supported.
  4. You are going to need to create an OS-level service to start/stop Airflow, or a basic start/stop Airflow script if you are lazy like me (see the sketch after this list); just don't forget to mount/unmount the FUSE from item 5.
  5. For security at rest and high availability of data, consider using FUSE on top of GCS or AWS S3 for the DAGs and logs folders.
  6. For security in motion, consider adding HTTPS and configuring an HTTPS certificate in your Airflow cluster.
  7. For access security, consider using Airflow Gmail authentication or another solution such as LDAP.
  8. For basic Airflow monitoring at the ETL level, consider the Airflow Slack integration as explained in our blog, or email integration. There are more options; we shall cover them in the future.
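As a concrete example of items 4 and 5, a minimal start/stop pair might look like this; the bucket names, mount points and webserver port are assumptions.

#!/bin/bash
# start_airflow.sh - mount the GCS buckets and start Airflow as daemons.
gcsfuse <my-airflow-dags-bucket> ~/airflow/dags
gcsfuse <my-airflow-logs-bucket> ~/airflow/logs
airflow scheduler -D
airflow webserver -D -p 8080

#!/bin/bash
# stop_airflow.sh - stop Airflow and unmount the FUSE mounts.
pkill -f "airflow scheduler"
pkill -f "airflow webserver"
fusermount -u ~/airflow/dags
fusermount -u ~/airflow/logs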

Some more of our blogs about Airflow

Another installation manual for SequentialExecutor:

Airflow SequentialExecutor Installation manual and basic commands

——————————————————————————————————————————

I put a lot of thought into these blogs so that I can share the information in a clear and useful way. If you have any comments, thoughts, or questions, or if you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

airflow, Cloud SQL, GCP

How to connect to the GCP Composer Cloud SQL database and migrate data to BigQuery? Is it possible? What are my options?

In this blog I will connect to the Airflow Cloud SQL database, which lives in the tenant project of the Composer environment [1].

You can do that by installing the SQLAlchemy toolkit and following these instructions [2].
Alternatively, you can remotely authorize access to your Cloud SQL database from an application, client, or other GCP service; Cloud Composer provides a Cloud SQL proxy in the GKE cluster for this [3].

[1] https://cloud.google.com/composer/docs/concepts/overview#architecture
[2] https://cloud.google.com/composer/docs/how-to/using/installing-python-dependencies#installing_sqlalchemy_to_access_the_airflow_database
[3] https://cloud.google.com/sql/docs/mysql/connect-kubernetes-engine
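In practice, reaching the database through the proxy from inside the environment's GKE cluster looks roughly like the sketch below. Every name here (cluster, proxy service, database) is an assumption you should verify in your own environment.

# Get credentials for the Composer environment's GKE cluster.
gcloud container clusters get-credentials <composer-gke-cluster> --zone <zone>

# Find the Cloud SQL proxy service that Composer runs inside the cluster.
kubectl get services --all-namespaces | grep -i sqlproxy

# Connect from a throwaway MySQL client pod (service host and DB name are assumptions).
kubectl run airflow-db-client --rm -it --restart=Never --image=mysql:5.7 -- \
    mysql -h <sqlproxy-service-host> -u root <airflow-db-name>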

How to migrate data from GCP Composer to BigQuery?

After researching your question, I found this documentation link [1], which tells you how to access Composer's Cloud SQL database from another product, and this other one [2], which explains how to integrate BigQuery and Cloud SQL.

[1]: https://cloud.google.com/composer/docs/concepts/overview#cloud-sql
[2]: http://googlecloudsql.tobigquery.com/
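If the goal is simply to get Airflow metadata into BigQuery, one hedged path (once you can reach the database, for example through the proxy above) is export, stage in GCS, load. The table, bucket and dataset names below are placeholders.

# Export an Airflow metadata table (tab-separated) using any MySQL client.
mysql -h <db_host> -u <db_user> -p <airflow_db_name> \
    -e "SELECT * FROM dag_run" > dag_run.tsv

# Stage the file in GCS and load it into BigQuery.
gsutil cp dag_run.tsv gs://<my_bucket>/dag_run.tsv
bq load --autodetect --source_format=CSV --field_delimiter=tab \
    --skip_leading_rows=1 <my_dataset>.dag_run gs://<my_bucket>/dag_run.tsv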

BigQuery and federated queries against the Composer-backed Cloud SQL database?

Using BigQuery federated queries to access the Cloud SQL instance that backs a Composer environment is not possible. The only way to access the SQL instance data is to use the Cloud SQL proxy from the GKE cluster.

——————————————————————————————————————————

I put a lot of thought into these blogs so that I can share the information in a clear and useful way. If you have any comments, thoughts, or questions, or if you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/