Creating the schema from an AVRO file can be done using a PythonOperator [1].
The process is quite similar to step 6 of the attached blog [2], but instead of specifying avro.schema.url we will specify avro.schema.literal.
First, we have to extract the AVRO schema from the AVRO file located in GCS. I have written the Python code in [3], which outputs the JSON of the AVRO schema.
Now that we have the schema, we can create the Hive job that creates the table. I made a small test with the API [4]; the REST request should look similar to [5].
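Putting the two steps together, below is a minimal sketch of what the Composer DAG could look like (assuming Airflow 1.x import paths, which Composer used at the time; the DAG id, task ids, and the /tmp path are illustrative placeholders). Instead of calling the REST API directly, this sketch uses the DataProcHiveOperator, which submits the same Hive job:

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.operators.dataproc_operator import DataProcHiveOperator


def extract_avro_schema():
    # Same logic as the script in [3]: download the AVRO file from GCS
    # and return the schema JSON embedded in its header. The return
    # value is pushed to XCom automatically.
    from google.cloud import storage
    from avro.datafile import DataFileReader
    from avro.io import DatumReader

    client = storage.Client()
    bucket = client.lookup_bucket('<BUCKET NAME>')
    blob = bucket.blob('<BUCKET PATH>/<AVRO FILE NAME>')
    blob.download_to_filename('/tmp/downloaded_blob.avro')
    with DataFileReader(open('/tmp/downloaded_blob.avro', 'rb'), DatumReader()) as avf:
        return avf.meta['avro.schema'].decode('utf-8')


with DAG(dag_id='create_hive_table_from_avro',
         start_date=datetime(2020, 1, 1),
         schedule_interval=None) as dag:

    get_schema = PythonOperator(
        task_id='get_schema',
        python_callable=extract_avro_schema,
    )

    # The query field of DataProcHiveOperator is templated, so the
    # schema pushed to XCom by the previous task can be pulled in
    # at run time.
    create_table = DataProcHiveOperator(
        task_id='create_table',
        cluster_name='<CLUSTER NAME>',
        region='<REGION>',
        query=(
            "CREATE EXTERNAL TABLE transactions STORED AS AVRO "
            "LOCATION 'gs://<BUCKET NAME>/<PATH TO AVRO FOLDER>/*' "
            "TBLPROPERTIES ('avro.schema.literal'="
            "'{{ task_instance.xcom_pull(task_ids='get_schema') }}')"
        ),
    )

    get_schema >> create_table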
────────────────────
[1]: https://cloud.google.com/composer/docs/how-to/using/writing-dags#pythonoperator
[3]:
from google.cloud import storage
from avro.datafile import DataFileReader
from avro.io import DatumReader

bucket_name = "<Bucket Name>"
blob_name = "<Bucket Path>/<AVRO FILE NAME>"
downloaded_blob = "downloaded_blob.avro"

# Download the AVRO file from GCS to a local file.
client = storage.Client()
bucket = client.lookup_bucket(bucket_name)
blob = bucket.blob(blob_name)
blob.download_to_filename(downloaded_blob)

# The writer's schema is stored in the AVRO file header metadata
# under the 'avro.schema' key.
with DataFileReader(open(downloaded_blob, 'rb'), DatumReader()) as avf:
    schema = avf.meta['avro.schema'].decode("utf-8")

print(schema)
# Example output:
# {"type":"record","name":"Root","fields":[{"name":"string_field_0","type":["null","string"]}]}
[4]: https://cloud.google.com/dataproc/docs/reference/rest/v1beta2/projects.regions.jobs/submit
[5]:
{ "job": { "hiveJob": { "queryList": { "queries": [ "CREATE EXTERNAL TABLE transactions \nSTORED AS AVRO LOCATION 'gs://<BUCKET NAME>/<PATH TO AVRO FOLDER>/*' TBLPROPERTIES ('avro.schema.literal'='<OUTPUT OF THE PYTHON SCRIPT>')" ] } }, "placement": { "clusterName": "<CLUSTER NAME>" }, "reference": { "jobId": "<JOB NAME>" } } }
——————————————————————————————————————————
I put a lot of thought into these blogs so that I could share the information in a clear and useful way. If you have any comments, thoughts, or questions, or if you need someone to consult with, feel free to contact me: