GCP Dataproc

What is Dataproc?

Dataproc is Google Cloud's managed Apache Spark and Hadoop service (Hive included). It lets you use open source Apache data projects for batch processing, querying, streaming, and machine learning.

Dataproc advantages:

  1. Fully supported by Apache Airflow, whose Google provider package includes Dataproc operators.
  2. Easy to use.
  3. Very fast to get started.
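
To make the Airflow point concrete, here is a minimal sketch of the payloads that the Dataproc operators in the apache-airflow-providers-google package accept. The project, region, cluster, and bucket names are hypothetical placeholders, and the helper functions are illustrative only. The code builds plain dictionaries, so it runs without Airflow installed.

```python
# Minimal sketch: build the dictionaries that Airflow's Dataproc operators accept.
# Every name below (project, region, cluster, bucket) is a hypothetical placeholder.

PROJECT_ID = "my-project"      # hypothetical
REGION = "us-central1"         # hypothetical
CLUSTER_NAME = "etl-cluster"   # hypothetical


def build_cluster_config(num_workers: int = 2,
                         machine_type: str = "n1-standard-4") -> dict:
    """Return a dict shaped like the Dataproc ClusterConfig resource."""
    return {
        "master_config": {"num_instances": 1, "machine_type_uri": machine_type},
        "worker_config": {"num_instances": num_workers, "machine_type_uri": machine_type},
    }


def build_pyspark_job(main_file_uri: str) -> dict:
    """Return a dict shaped like the Dataproc Job resource for a PySpark job."""
    return {
        "reference": {"project_id": PROJECT_ID},
        "placement": {"cluster_name": CLUSTER_NAME},
        "pyspark_job": {"main_python_file_uri": main_file_uri},
    }


cluster_config = build_cluster_config(num_workers=4)
job = build_pyspark_job("gs://my-bucket/jobs/transform.py")

# In a DAG file these dicts would be passed roughly as follows
# (operators live in airflow.providers.google.cloud.operators.dataproc):
#   DataprocCreateClusterOperator(task_id="create_cluster", project_id=PROJECT_ID,
#       region=REGION, cluster_name=CLUSTER_NAME, cluster_config=cluster_config)
#   DataprocSubmitJobOperator(task_id="transform", project_id=PROJECT_ID,
#       region=REGION, job=job)
#   DataprocDeleteClusterOperator(task_id="delete_cluster", project_id=PROJECT_ID,
#       region=REGION, cluster_name=CLUSTER_NAME)
print(job["pyspark_job"]["main_python_file_uri"])
```

Creating the cluster, submitting the job, and deleting the cluster in one DAG keeps the cluster ephemeral, which is one common way to run batch transformations; a long-lived cluster works with the same job payload.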

Dataproc use cases:

Large-scale data transformation. For this kind of workload, Dataproc should be about 50% cheaper than BigQuery.

Dataproc anti-patterns (use cases it is not suited for):

  1. BI (business intelligence) workloads.
  2. Operational databases.

Our Dataproc Blogs:

Dataproc video (in Hebrew)