GCE, GCP

Unable to connect to GCE instance via ssh after removing 0.0.0.0/0 rule from the FW

A quick way to debug the problem: try SSH from the CLI, from an authorized IP. If you get a response like the one below, the firewall rule is working.

Koreshs-iMac-2:~ omid$ ssh 35.223.117.66

The authenticity of host '35.223.117.66 (35.223.117.66)' can't be established.
ECDSA key fingerprint is SHA256:FIhYUjgJp+b+F7zuadEg4h7UXWSAzdYpyHVsu8OUg8A.
Are you sure you want to continue connecting (yes/no)?

If not, there are several other possible reasons for the connection to fail. [1]

If you are trying to connect via SSH from the GCE console GUI and it is not working, note that the connection does not come from your IP. The browser-based SSH session originates from a dynamic IP in Google's IP ranges, and the firewall rule does not include those sources.
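If you want to double-check what the firewall currently allows before going further, a quick way (a minimal sketch using the standard gcloud firewall commands; the rule name is a placeholder) is to list and inspect your rules:

gcloud compute firewall-rules list
gcloud compute firewall-rules describe <your-ssh-rule-name>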

Read these:

[1] https://cloud.google.com/compute/docs/ssh-in-browser#ssherror
[2] https://support.google.com/a/answer/60764

A good step-by-step way to get the public IP ranges of Google, based on this blog:

nslookup -q=TXT _netblocks.google.com 8.8.8.8
nslookup -q=TXT _netblocks2.google.com 8.8.8.8
nslookup -q=TXT _netblocks3.google.com 8.8.8.8

Running each of these returns output like the following (shown here for the first one):

nslookup -q=TXT _netblocks.google.com 8.8.8.8
Server: 8.8.8.8
Address: 8.8.8.8#53
Non-authoritative answer:
_netblocks.google.com text = "v=spf1 ip4:35.190.247.0/24 ip4:64.233.160.0/19 ip4:66.102.0.0/20 ip4:66.249.80.0/20 ip4:72.14.192.0/18 ip4:74.125.0.0/16 ip4:108.177.8.0/21 ip4:173.194.0.0/16 ip4:209.85.128.0/17 ip4:216.58.192.0/19 ip4:216.239.32.0/19 ~all"
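With these ranges in hand, one option (a hedged sketch, not an official recommendation; the rule name is a placeholder and only a few of the returned ranges are shown) is to allow SSH from them explicitly instead of re-adding 0.0.0.0/0:

gcloud compute firewall-rules create allow-ssh-from-google-ranges \
    --allow=tcp:22 \
    --source-ranges=35.190.247.0/24,64.233.160.0/19,66.102.0.0/20,66.249.80.0/20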

How can I connect via SSH from my own IP in the terminal, instead of using the GUI?

You can do this from your terminal; documents [4] and [5] explain how. The first covers providing and configuring your public key, and the second explains how to connect using the ssh command.

Good reads:

[4] https://cloud.google.com/compute/docs/instances/adding-removing-ssh-keys#addkey
[5] https://cloud.google.com/compute/docs/instances/connecting-advanced#thirdpartytools
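As a minimal sketch of what [4] and [5] describe (instance name, zone, key path, and username are placeholders): you can either let gcloud manage the key for you, or generate a key pair yourself, add the public key as described in [4], and connect directly as in [5]:

# Option 1: gcloud creates and registers an SSH key for you, then connects.
gcloud compute ssh my-instance --zone=us-central1-a

# Option 2: bring your own key pair, add the public key per [4], then connect per [5].
ssh-keygen -t rsa -f ~/.ssh/gce_key -C my_username
ssh -i ~/.ssh/gce_key my_username@35.223.117.66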


——————————————————————————————————————————

I put a lot of thought into these blogs so I could share the information in a clear and useful way. If you have any comments, thoughts, or questions, or you need someone to consult with, feel free to contact me at:

https://www.linkedin.com/in/omid-vahdaty/

Data Engineering, GCP, python, windows

How to Install Python and PIP on Windows running on GCP?

This is a "cut the bullshit and give me what I need to get started" blog. End to end, even if this is your first time, you will be up and running in about an hour.

The business use case: the data science team needs a server with a GUI to run Python scripts daily, changing them manually until they get the expected POC results.

Technical steps to create a GCE machine with Windows OS and a GUI (a CLI sketch follows the list)

  1. Create a GCE machine like any other, but change the boot disk to run "Windows Server", version "Windows Server 2016 Datacenter with Desktop Experience". In addition, set the access scope to "Allow full access to all Cloud APIs".
  2. Confirm RDP network access on port 3389.
  3. Press the down-facing arrow to the right of the RDP button to set a password for your user.
  4. Press RDP in the GCE console. If you are using Chrome, install the Chrome RDP plugin; it will simplify your RDP experience. Use the password from step 3, and there is no need for a domain. Note that you can change the screen size in the plugin's options.
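Here is the CLI sketch referenced above, covering steps 1-3 (zone, machine type, and the instance, rule, and user names are placeholders and assumptions, not from the original setup):

# Step 1: Windows Server 2016 with Desktop Experience and full Cloud API access.
gcloud compute instances create windows-python-vm \
    --zone=us-central1-a \
    --machine-type=n1-standard-4 \
    --image-family=windows-2016 \
    --image-project=windows-cloud \
    --scopes=cloud-platform

# Step 2: allow RDP (tcp:3389), ideally only from your own IP.
gcloud compute firewall-rules create allow-rdp --allow=tcp:3389 --source-ranges=<your_ip>/32

# Step 3: the password can also be set from the CLI instead of the console arrow menu.
gcloud compute reset-windows-password windows-python-vm --zone=us-central1-a --user=<username>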

Technical steps to install Python and pip on a Windows Server machine (a command-line sketch follows the list)

  1. Disable IE Enhanced Security. I used this manual; basically: Server Manager -> Local Server -> IE Enhanced Security Configuration -> Off.
  2. Install Python using this blog. Don't forget to right-click and "Run as administrator". Browse to https://www.python.org/downloads/windows/ and download the latest version. Customize the installation to ensure all components are installed, including pip, and add Python to PATH.
  3. To ensure Python is accessible everywhere, verify that the PATH is updated using this blog. Sample path: C:\Users\Username\AppData\Local\Programs\Python\Python37
  4. If you have not installed pip, install it using this blog, and don't forget to add it to the PATH.
  5. I used this video to schedule the Python script to run daily. Make sure to mark in the task's properties that the job should run even if the user is not logged in.
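Here is the command-line sketch referenced above (the script path, schedule time, and credentials are placeholders): verify the installation from steps 2-4, then create the daily task from a command prompt instead of the Task Scheduler GUI.

python --version
pip --version

# Equivalent of step 5 via Task Scheduler's CLI (run as administrator);
# supplying /RU and /RP lets the task run even when the user is not logged in.
schtasks /Create /SC DAILY /ST 06:00 /TN "DailyPythonJob" ^
    /TR "python C:\Scripts\my_script.py" /RU <username> /RP <password>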

Furthermore, the official Google Compute Engine tutorials are worth a read.


——————————————————————————————————————————

I put a lot of thought into these blogs so I could share the information in a clear and useful way. If you have any comments, thoughts, or questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

BI, GCP

Tableau Demystified | Install Tableau Server on GCP

This is a straightforward "cut the bullshit and give me what I need" manual for installing Tableau Server on a Windows machine on GCP. Naturally, I added my personal tips. A CLI sketch of the GCP-side steps appears after the list.

  1. Be sure to install Windows with Desktop Experience (I may try later with Linux and more complex hardware environments, stay tuned).
  2. Machine type: minimum 8 cores (to maximize network bandwidth), 30 GB RAM, and a 1024 GB SSD to maximize IO. Be sure to create the machine in a region close to your BigQuery dataset region. The examples below assume a public IP of 1.2.3.4.
  3. Assign a password to the user.
  4. Download the Chrome RDP plugin for GCP (optional).
  5. Make sure you added port 3389 to your network ingress rules.
  6. RDP to the machine and turn off the Windows Firewall.
  7. Download the Tableau Server installer and run it.
  8. Logging in to TSM requires a Windows user in the admin group. I created a new user with an easier password for this installation.
  9. You need a license for Tableau Server; usually you can use it once for PROD and twice for NON-PROD. Note there is a 14-day trial option, and an offline license activation option for air-gapped environments.
  10. Once installation is done, configure the admin user and open port 80 to the instance. Access the server from your desktop using a web browser and your server IP, e.g. http://1.2.3.4.
  11. From Tableau Desktop, sign in to Tableau Server using the credentials from step 10. You should be able to publish now.
  12. Note that when you create a scheduled extract, expected metrics are about 40% CPU utilization(!) and about 15 GB of RAM. Network is unlikely to be the bottleneck, but it will oscillate from 0 to 5 Mbps (about 10,000 per second). This blog covers extract optimization. Also log in to TSM and add more instances based on the number of CPU cores.
  13. Note that after a restart, you need to log in to TSM and start Tableau Server.
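Here is the CLI sketch referenced above: the GCP-side ports from steps 5 and 10, plus the TSM commands I would expect for steps 12-13 (rule name, node ID, process count, and source range are placeholders; double-check the tsm syntax against the Tableau docs for your version):

# Steps 5 and 10: open RDP (3389) and HTTP (80) to the instance.
gcloud compute firewall-rules create allow-tableau-rdp-http \
    --allow=tcp:3389,tcp:80 \
    --source-ranges=<your_ip>/32

# Step 12: add backgrounder processes according to the CPU core count, then apply.
tsm topology set-process -n node1 -pr backgrounder -c 2
tsm pending-changes apply

# Step 13: after a VM restart, check and start Tableau Server.
tsm status -v
tsm start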

Below is an example of CPU core allocation for Tableau.

Video Blog on getting started on Tableau Desktop:

Tableau Demystified | Quick introduction in 10 minutes

Resources to improve the performance of Tableau extracts:

https://help.tableau.com/current/server/en-us/install_config_top.htm

https://www.tableau.com/about/blog/2018/4/zulilys-top-10-tips-self-service-analytics-google-bigquery-and-tableau-84969

https://community.tableau.com/docs/DOC-23150

https://community.tableau.com/docs/DOC-23161

——————————————————————————————————————————

I put a lot of thought into these blogs so I could share the information in a clear and useful way. If you have any comments, thoughts, or questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

airflow, GCP

How to ssh to a remote GCP machine and run a command via Airflow?

Use the Airflow BashOperator, which executes bash commands locally on the node running the Airflow workers, and use the gcloud command to connect to the remote instance. Below, and in our git repository, I'm sharing an example:

import datetime

from airflow import models
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator

yesterday = datetime.datetime.combine(
    datetime.datetime.today() - datetime.timedelta(1),
    datetime.datetime.min.time())

default_dag_args = {
    # Setting start date as yesterday starts the DAG immediately when it is
    # detected in the Cloud Storage bucket.
    'start_date': yesterday,
    # To email on failure or retry set 'email' arg to your email and enable
    # emailing here.
    'email_on_failure': False,
    'email_on_retry': False,
    # If a task fails, retry it once after waiting at least 5 minutes.
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'project_id': models.Variable.get('gcp_project')
}

# SSH to the remote GCE instance via gcloud and run a command there.
bash_cmd = 'gcloud beta compute --project MyProjectName ssh myMachineHostname --internal-ip --zone us-central1-a --command "ls /tmp/"'

with models.DAG(
        'bash_remote_gcp_machine_example',
        # Run the DAG a single time; change to e.g. '@daily' for a daily run.
        schedule_interval="@once",
        default_args=default_dag_args) as dag:

    start = DummyOperator(task_id='start')
    end = DummyOperator(task_id='end')

    bash_remote_gcp_machine = BashOperator(
        task_id='bash_remote_gcp_machine_task',
        bash_command=bash_cmd)

    start >> bash_remote_gcp_machine >> end

The above DAG will only work if the service account used by the Airflow machine is allowed to access the remote machine. If not, run an authentication command first, using another BashOperator like the one below:

# Authenticating with a service account key
bash_command_auth = BashOperator(
    task_id='auth_bash',
    bash_command='gcloud auth activate-service-account --key-file=/<path_to_your_SA_key_file>/<your_service_account_key.json>')

——————————————————————————————————————————

I put a lot of thought into these blogs so I could share the information in a clear and useful way. If you have any comments, thoughts, or questions, or you need someone to consult with, feel free to contact me via LinkedIn:

https://www.linkedin.com/in/omid-vahdaty/

airflow, Big Query, GCP

How to debug BigQuery query failure in Airflow DAG?

The problem is that sometimes the complete list of errors is not presented in Airflow.

The log only shows the following error:

Error while reading data, error message: CSV table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.'}. The job was: {'kind': 'bigquery#job', 'etag': 'ldp8TiJ9Ki0q3IiaMV8p5g==', 'id': 'MyProjectName:US.job_R2ig2EmIBy0-wD76zcNOCcuuyVX4',

The trick is to understand that there are more errors, which you can see by running the following in the CLI:

 bq show -j US.job_R2ig2EmIBy0-wD76zcNOCcuuyVX4

This will show the full list of failure details:

Failure details:
 - gs://myBucket/reviews_utf8/c_201804.csv: Error while reading data,
   error message: CSV table encountered too many errors, giving up.
   Rows: 1; errors: 1. Please look into the errors[] collection for
   more details.
 - gs://myBucket/reviews_utf8/c_201801.csv: Error while reading
   data, error message: Could not parse '3.9.58' as double for field
   App_Version_Name (position 2) starting at location 342
 - gs://myBucket/reviews_utf8/c_201804.csv: Error while reading data,
   error message: Could not parse '2.0.1' as double for field
   App_Version_Name (position 2) starting at location 342

From there, debugging the problem was simply a matter of understanding that there is a mismatch between the values in the CSV and the data types of the table.
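Once the mismatch is clear, the fix is on the load side. As a hedged sketch (the dataset, table, and schema file are placeholders; the GCS path is the one from the error above): either correct the schema so App_Version_Name is STRING rather than a numeric type, or tolerate a few bad rows with --max_bad_records:

bq load --source_format=CSV --skip_leading_rows=1 --max_bad_records=10 \
    mydataset.reviews gs://myBucket/reviews_utf8/c_201804.csv \
    ./reviews_schema.json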

——————————————————————————————————————————

I put a lot of thought into these blogs so I could share the information in a clear and useful way. If you have any comments, thoughts, or questions, or you need someone to consult with, feel free to contact me via LinkedIn:

https://www.linkedin.com/in/omid-vahdaty/