Creating the schema from an AVRO file can be done using a PythonOperator [1].
The process is quite similar to step 6 of the attached blog [2], but instead of specifying avro.schema.url we will specify avro.schema.literal.
First, we have to extract the AVRO schema from the AVRO file located in GCS. I have written the Python code in [3], which outputs the JSON of the AVRO schema.
Now that we have the schema, we can create the Hive job that creates the table. I made a small test with the API [4]; the REST request should look similar to [5].
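Putting the two steps together, below is a minimal sketch of what the Composer DAG could look like (assuming Airflow 1.x import paths, which Composer used at the time; the DAG id, task ids, and the /tmp path are illustrative placeholders). Instead of calling the REST API directly, this sketch uses the DataProcHiveOperator, which submits the same Hive job:

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.operators.dataproc_operator import DataProcHiveOperator


def extract_avro_schema():
    # Same logic as the script in [3]: download the AVRO file from GCS
    # and return the schema JSON embedded in its header. The return
    # value is pushed to XCom automatically.
    from google.cloud import storage
    from avro.datafile import DataFileReader
    from avro.io import DatumReader

    client = storage.Client()
    bucket = client.lookup_bucket('<BUCKET NAME>')
    blob = bucket.blob('<BUCKET PATH>/<AVRO FILE NAME>')
    blob.download_to_filename('/tmp/downloaded_blob.avro')
    with DataFileReader(open('/tmp/downloaded_blob.avro', 'rb'), DatumReader()) as avf:
        return avf.meta['avro.schema'].decode('utf-8')


with DAG(dag_id='create_hive_table_from_avro',
         start_date=datetime(2020, 1, 1),
         schedule_interval=None) as dag:

    get_schema = PythonOperator(
        task_id='get_schema',
        python_callable=extract_avro_schema,
    )

    # The query field of DataProcHiveOperator is templated, so the
    # schema pushed to XCom by the previous task can be pulled in
    # at run time.
    create_table = DataProcHiveOperator(
        task_id='create_table',
        cluster_name='<CLUSTER NAME>',
        region='<REGION>',
        query=(
            "CREATE EXTERNAL TABLE transactions STORED AS AVRO "
            "LOCATION 'gs://<BUCKET NAME>/<PATH TO AVRO FOLDER>/*' "
            "TBLPROPERTIES ('avro.schema.literal'="
            "'{{ task_instance.xcom_pull(task_ids='get_schema') }}')"
        ),
    )

    get_schema >> create_table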
────────────────────
[1]: https://cloud.google.com/composer/docs/how-to/using/writing-dags#pythonoperator
[3]:
from google.cloud import storage
from avro.datafile import DataFileReader
from avro.io import DatumReader

bucket_name = "<Bucket Name>"
blob_name = "<Bucket Path>/<AVRO FILE NAME>"
downloaded_blob = "downloaded_blob.avro"

# Download the AVRO file from GCS to a local file.
client = storage.Client()
bucket = client.lookup_bucket(bucket_name)
blob = bucket.blob(blob_name)
blob.download_to_filename(downloaded_blob)

# The writer's schema is stored in the AVRO file header metadata
# under the 'avro.schema' key.
with DataFileReader(open(downloaded_blob, 'rb'), DatumReader()) as avf:
    schema = avf.meta['avro.schema'].decode("utf-8")

print(schema)
# Example output:
# {"type":"record","name":"Root","fields":[{"name":"string_field_0","type":["null","string"]}]}
[4]: https://cloud.google.com/dataproc/docs/reference/rest/v1beta2/projects.regions.jobs/submit
[5]:
{ "job": { "hiveJob": { "queryList": { "queries": [ "CREATE EXTERNAL TABLE transactions \nSTORED AS AVRO LOCATION 'gs://<BUCKET NAME>/<PATH TO AVRO FOLDER>/*' TBLPROPERTIES ('avro.schema.literal'='<OUTPUT OF THE PYTHON SCRIPT>')" ] } }, "placement": { "clusterName": "<CLUSTER NAME>" }, "reference": { "jobId": "<JOB NAME>" } } }
——————————————————————————————————————————
I put a lot of thought into these blogs so that I could share the information in a clear and useful way. If you have any comments, thoughts, or questions, or if you need someone to consult with, feel free to contact me: