My journey with Google Cloud Data Fusion

Yaling Hu · Published in gft-engineering · 10 min read · Mar 8, 2023

Photo by Joshua Sortino on Unsplash

Modern society is awash with data of every variety and volume. Before that data can be applied to business analytics, however, it usually needs to undergo specific processing and governance, and the quality of the data directly determines the reliability of analysis results and model predictions. This makes it crucial to process the data effectively.

In this article, I would like to introduce Data Fusion, the fully managed Google Cloud product we chose as the data processing tool for our client's needs. I will share what I learned during this journey and our final decision based on its pros and cons.

Why Data Fusion?

What would be your choice?

  • when you need a fully managed Google Cloud Platform product to build ETL pipelines?
  • when minimizing the time to market matters?
  • when you need to ingest data from diverse data sources?
  • when you have both batch and real-time processing needs?
  • and, last but not least, when you require easy maintenance for your data engineering team?

Here are a few candidates in the queue:

How about Apache NiFi?

Apache NiFi is an open-source data ingestion tool that facilitates the processing and distribution of data flows between diverse systems. Its user-friendly interface allows for drag-and-drop integration. However, one challenge with NiFi is that it is not cloud-native: as the data pipeline becomes more complex, more maintenance may be required, which can be a pain point. Secondly, compared to other ETL/ELT platforms, NiFi is a data ingestion tool focused on moving data from source to destination quickly. Additionally, when customized processors are required, it can lead to real headaches.

How about DataFlow?

Google Cloud Dataflow is a fully managed service for executing Apache Beam pipelines within the Google Cloud Platform ecosystem. It provides unified stream and batch data processing that is serverless, fast, and cost-effective.

To build a Dataflow pipeline, you must use a programming language such as Java, Python, or Go. That means you need engineers to write the code and handle maintenance tasks.
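
To give an idea of what that looks like, here is a minimal Beam pipeline sketch in Python; the project, bucket paths, and transforms are placeholders, not an actual production pipeline:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Placeholder options: swap in your own project, region, and buckets.
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="europe-west1",
        temp_location="gs://my-bucket/tmp",
    )

    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")
            | "Parse" >> beam.Map(lambda line: line.split(","))
            | "KeepValid" >> beam.Filter(lambda fields: len(fields) == 3)
            | "Format" >> beam.Map(",".join)
            | "Write" >> beam.io.WriteToText("gs://my-bucket/output/cleaned")
        )

Even a simple read-filter-write flow like this has to be written, tested, and maintained as code, which is exactly the overhead Data Fusion promises to remove.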

Maybe Cloud Data Fusion!

In this article, I will share my journey with Google Cloud Data Fusion. Initially, we considered this product to be the best choice for satisfying our client’s requirements. However, is it truly the best option? It is worth doing a deep dive into the details to find out.

What is Data Fusion?

Photo by Google Cloud

First, Cloud Data Fusion is built on the open-source project CDAP.

CDAP describes itself as a data application platform dedicated to building and managing data analytics applications in hybrid and multi-cloud environments. It provides developers with data and application abstractions that accelerate the development of analytics applications and address a broad range of real-time and batch use cases.

In 2019, Data Fusion became publicly available in beta. It is a graphical ETL tool designed to be as code-free as possible. It offers over 150 pre-configured plugins, now maintained by both the Google and CDAP teams. In most cases, you just point and click through the visual plugins, complete the required fields, and your ETL pipeline is built with no coding at all. However, there are also situations where customized plugins need to be imported manually; a list of them is available in the plugins' GitHub repository.
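
Importing a custom plugin can be done through the web UI by uploading the plugin JAR, but it can also be scripted against the instance's CDAP REST API. The following is a rough sketch in Python, assuming the standard CDAP artifact endpoint; the endpoint URL, artifact name, version, and parent-artifact range are placeholders to replace with your own values:

    import google.auth
    import google.auth.transport.requests
    import requests

    # The instance's API endpoint, shown in the Data Fusion instance details (placeholder).
    CDAP_ENDPOINT = "https://my-instance-my-project-dot-euw1.datafusion.googleusercontent.com/api"

    credentials, _ = google.auth.default()
    credentials.refresh(google.auth.transport.requests.Request())

    headers = {
        "Authorization": f"Bearer {credentials.token}",
        "Artifact-Version": "1.0.0",
        # Parent artifacts this plugin extends (batch pipelines, in this example).
        "Artifact-Extends": "system:cdap-data-pipeline[6.0.0,7.0.0)",
    }

    # Upload the plugin JAR as a new artifact in the default namespace.
    with open("my-plugin-1.0.0.jar", "rb") as jar:
        resp = requests.post(
            f"{CDAP_ENDPOINT}/v3/namespaces/default/artifacts/my-plugin",
            headers=headers,
            data=jar,
        )
    resp.raise_for_status()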

In addition, integrated features support data governance by providing insights through end-to-end data lineage at different levels:

Dataset-level lineage
Field-level lineage

After this brief introduction, let’s get our hands dirty!

Get started with Data Fusion

Permissions

Before implementing the data pipeline, it is vital to grant all the necessary permissions if you are not the project owner!

  • Confirm that the Compute Engine default service account {project-number}-compute@developer.gserviceaccount.com is present and has at least the Editor role assigned.
  • Check in IAM & admin > Service Accounts that the service account service-{project-number}@gcp-sa-datafusion.iam.gserviceaccount.com exists and has the Service Account User or Cloud Data Fusion API Service Agent role. The correct role needs to be set (see the image below), as it is required to launch Dataproc clusters when creating the instance.
  • Once the Data Fusion instance is created, navigate to IAM & admin > IAM and assign the Data Fusion service account the "Cloud Data Fusion API Service Agent" role; a scripted version of this step is sketched after the image below.
Service account
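
If you prefer to grant that last role programmatically rather than through the console, a minimal sketch with the Resource Manager Python client could look like this; the project ID and project number are placeholders, and the exact client calls are an assumption worth verifying against the library documentation:

    from google.cloud import resourcemanager_v3
    from google.iam.v1 import iam_policy_pb2, policy_pb2

    PROJECT_ID = "my-project"          # placeholder
    PROJECT_NUMBER = "123456789012"    # placeholder

    client = resourcemanager_v3.ProjectsClient()
    resource = f"projects/{PROJECT_ID}"

    # Read the current IAM policy of the project.
    policy = client.get_iam_policy(
        request=iam_policy_pb2.GetIamPolicyRequest(resource=resource)
    )

    # Add the Cloud Data Fusion API Service Agent role for the Data Fusion service account.
    policy.bindings.append(
        policy_pb2.Binding(
            role="roles/datafusion.serviceAgent",
            members=[
                f"serviceAccount:service-{PROJECT_NUMBER}@gcp-sa-datafusion.iam.gserviceaccount.com"
            ],
        )
    )

    # Write the updated policy back.
    client.set_iam_policy(
        request=iam_policy_pb2.SetIamPolicyRequest(resource=resource, policy=policy)
    )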

After you have all the necessary access, we can start implementing the pipelines.

Batch Processing

First of all, the Data Fusion instance should be created. There are three editions available: Developer, Basic, and Enterprise. For more details on the differences and pricing, please check this link: https://cloud.google.com/data-fusion/pricing.
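
Instances are usually created from the console, but for completeness, here is a minimal sketch assuming the google-cloud-data-fusion Python client library; the project, region, instance name, and edition are placeholders, and the exact field names are worth double-checking against the library reference:

    from google.cloud import data_fusion_v1

    client = data_fusion_v1.DataFusionClient()

    # Create a Basic-edition instance in a given project and region (placeholders).
    operation = client.create_instance(
        request=data_fusion_v1.CreateInstanceRequest(
            parent="projects/my-project/locations/europe-west1",
            instance_id="my-datafusion-instance",
            instance=data_fusion_v1.Instance(
                type_=data_fusion_v1.Instance.Type.BASIC,
            ),
        )
    )

    instance = operation.result()  # instance creation takes several minutes
    print(instance.service_endpoint)  # the URL behind the "View Instance" link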

After creating the instance, click View Instance, which will prompt a web UI where you can start using this tool.

Data Fusion Web UI Home Page

The Wrangle option allows you to transform data from a chosen data source without writing any code, while Integrate Studio enables you to build the pipeline from scratch. Let's explore how to build a batch pipeline using Wrangle.

You can choose an internal connection such as BigQuery or Google Cloud Storage (GCS), or one of the other connections listed in the left panel, some of which require more custom configuration. For example, if you want to read from PostgreSQL, you should choose Add Connection, select the appropriate plugin, and follow the instructions to complete all the required information, such as the JDBC URL (for example, jdbc:postgresql://<host>:5432/<database>), username, and password. Once the source connection is configured, the data can be imported.

Configuration of connections
Import the data from the Sample Buckets

If you are interested in a quick exploration, you can import sample data from the Sample Buckets. After loading the data, I want to highlight three quick tips about the Wrangle page. Please note that the following points correspond to the ones marked in the image below:

  1. Sometimes, two colors are shown above some columns, indicating whether the data in the column is complete or not. The red progress bar indicates the proportion of null values in that column.
  2. To apply transformations, you can also use the CLI (command-line interface). As you start typing, an auto-fill feature helps you find matching commands; a few example directives are shown after this list.
  3. After all transformations are applied, click Create a Pipeline. A selection box will prompt you to create a batch pipeline or a real-time pipeline.
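
To make point 2 more tangible, below are a few directives of the kind you can type into that CLI. They assume a raw CSV payload in a body column and use illustrative column names, so treat them as a sketch rather than a recipe for the sample data:

    parse-as-csv :body ',' true
    drop :body
    set-type :price double
    fill-null-or-empty :city 'unknown'

The first directive parses the raw body as CSV using the first row as the header, the second drops the original raw column, the third casts a column to a numeric type, and the last fills missing values with a default.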

Now the page will open in Integrate Studio. You will find plenty of preconfigured plugins in the left panel, and you can access more options by clicking the HUB tab, as shown below. In addition, the Configure tab allows you to adjust the resources used to execute the pipeline on the ephemeral Dataproc cluster it provisions. After completing everything, you can preview the transformation process or deploy the pipeline directly. Once you click Run, the batch pipeline should work as planned.

If you need more details about how to build a batch pipeline, try this hands-on lab, which includes more specific instructions: Building Batch Pipelines in Cloud Data Fusion.

Real-Time Processing

The steps for building a real-time pipeline are very similar. Here, we will build a pipeline that reads streaming data from Pub/Sub: we create a connection with Pub/Sub in Data Fusion, use the Wrangle service to perform the data transformation, and finally store the data in GCS.

Compared to the batch pipeline, there are more configurations to take care of, such as batch interval, number of executors, executor capacity, and so on. These configurations are used for creating the ephemeral cluster in Dataproc. When the cluster is running, the pipeline begins actively processing the data.

Creating the ephemeral cluster
Pipeline running

In our real case, we used a Cloud Function to ingest data from a webhook and publish it to Pub/Sub, from which the pipeline pulled the messages. After the transformations in Wrangle, the raw data was stored in GCS, and the processed data was stored in BigQuery.
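
The ingestion side of that setup is straightforward. As an illustration only (our actual function handled client-specific payloads), a minimal HTTP-triggered Cloud Function that forwards a webhook body to Pub/Sub could look like this, with the project and topic names as placeholders:

    # main.py — HTTP-triggered Cloud Function (Python runtime)
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    TOPIC_PATH = publisher.topic_path("my-project", "webhook-events")  # placeholders

    def ingest_webhook(request):
        """Receives the webhook POST and forwards the raw payload to Pub/Sub."""
        payload = request.get_data()  # raw bytes of the webhook body
        if not payload:
            return ("Empty payload", 400)
        future = publisher.publish(TOPIC_PATH, payload, source="webhook")
        future.result()  # wait until Pub/Sub has accepted the message
        return ("", 204)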

Unfortunately, we faced severe latency in this pipeline. After contacting Google Cloud Support, we discovered that it was caused by resource constraints. We then adjusted the configuration to a predefined cluster with one master (1 core, 4 GB RAM, 500 GB disk) and four workers (4 cores, 8 GB RAM, 500 GB disk each), which achieved the best performance during our investigation. However, it still had a latency of around 30 seconds. We continued the investigation with the GCP Support technicians and ultimately found that the way records are written to BigQuery (staging records in GCS, triggering a BigQuery load job, waiting for completion, then continuing processing) means that we cannot get lower latency with the current implementation.
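
To see why that write path puts a floor on latency, the sink behaves conceptually like the following sketch with the BigQuery Python client (bucket, dataset, and format are placeholders): each micro-batch is staged as files in GCS and then loaded with a load job that the pipeline waits on before moving on.

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # placeholder project

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    # Records for one micro-batch were first staged as files in GCS...
    load_job = client.load_table_from_uri(
        "gs://my-staging-bucket/batch-0001/*.json",
        "my-project.my_dataset.processed_events",
        job_config=job_config,
    )

    # ...and the pipeline blocks here until the load job completes.
    load_job.result()

Load jobs are batch-oriented by design, so no amount of cluster tuning turns this pattern into a low-latency streaming write.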

Therefore, if low latency is also significant for your project, Dataflow may be a better choice. When Data Fusion was first released, Google indicated that support for Dataflow as an execution engine was planned for a future release. We will have to wait and see!

Learnings from my journey

Advantages

  • Fully managed and cloud-native

Cloud Data Fusion is marketed as a fully managed, cloud-native, enterprise data integration service that can interact with many GCP services. This guarantees accessibility, security, and scalability, and reduces the maintenance burden.

From GCP Next '19
  • Code-Free pipeline development

As we saw in the section on how to build pipelines in Data Fusion, we mostly only need to point and click while using the Wrangle plugin. This allows pipelines for analytics to be created and managed quickly, without writing code.

  • Diverse plugins and hybrid enablement

With over 150 plugins available in the GitHub repository mentioned above, Data Fusion covers integration with both on-premises and public cloud platforms. Moreover, it is simple to import custom plugins by uploading the plugin JAR file, as described in the readme.md file of each plugin repository.

  • Efficient building of batch processing pipeline

Building pipelines in Data Fusion is efficient thanks to its code-free approach and the variety of predefined plugins. Moreover, the ephemeral cluster created in Dataproc makes batch pipeline execution efficient as well. In our project, developing the batch pipelines was productive and their performance was excellent.

Disadvantages

  • Real-Time Processing Performance

As mentioned, I would recommend something other than Cloud Data Fusion if you are looking for very low-latency real-time processing. It performs better when the pipeline is a single, straightforward flow: the more transformation sub-flows are added to the data source, the more delay occurs, because the sub-flows are executed one by one rather than in parallel. In my opinion, this was the most significant issue we faced.

  • Considerable cost

Data Fusion runs pipelines on Dataproc: by default it automatically provisions ephemeral Dataproc clusters and runs the pipelines on them, although you can also create and configure a static cluster for this purpose. Therefore, when using Data Fusion, both Dataproc and Data Fusion costs must be considered, and the Dataproc cost grows considerably as more pipelines are built.

  • Wasted resources

In my opinion, if only basic transformations are needed, using Data Fusion is a bit of a waste of resources: it needs to create several virtual machines (the Data Fusion instance itself plus the Dataproc clusters) and requires high-capacity cluster configurations to guarantee its best performance. Dataflow, by comparison, needs far fewer resources to run equivalent pipelines.

  • Black box

Because it is a SaaS (Software as a Service) product focused on providing a code-free environment, it can be tough to trace the root cause when errors occur, as the logs offer limited information. Moreover, when more complex transformations are needed, it may require customizing a plugin. In our case, to simplify the data pipeline, we considered using the Spark plugin and passing in a Python script to unify all the sub-transformation flows and minimize latency. However, development was quite painful since we did not know how the pipeline was translated into a Spark job.

That is about all I have to share from my experience with Cloud Data Fusion. I hope this article helps you decide whether this product is the best option for you.

Special thanks to Gonzalo Ruiz de Villa, Esteban Chiner, and Javier Briones
