Google Cloud Data Fusion — Enterprise Configuration Part 1

Ryan Haney
5 min read · Dec 27, 2022

Recently I have been working with a customer that's building a modern data-mesh platform on Google Cloud. One of their key user stories was to support and empower non-technical (i.e., non-programmer) data engineering through low/no-code ETL solutions. Data Fusion was a clear front-runner in that it is a managed, no-code solution with strong integration across the Google Cloud data ecosystem. Cloud Data Fusion's UI lets ETL developers build scalable data integration solutions to clean, prepare, blend, transfer, and transform data without needing to be a Scala or Python expert. Another plus — Data Fusion is powered by the open source project CDAP, meaning customers aren't locked into running their ETL pipelines only on Google Cloud.

Data Fusion clearly fit the requirements that came out of their user story, but we also needed to ensure that it could be run in a secure, enterprise-ready manner (and, ideally, as cost-effectively as possible).

Data Fusion has two primary components we're concerned with — the ETL development UI, and the compute engine the pipelines are executed on. The Fusion UI is deployed through the Data Fusion service console in Google Cloud, and the compute engine can be any Hadoop-compliant cluster, though there are built-in integrations for Dataproc and EMR (and most customers use Dataproc as the compute engine when running Data Fusion on Google Cloud).

In a perfect world, we'd be able to run a central Data Fusion UI instance in a "shared" analytics services project, but have the data pipelines developed there run in the project that belongs to each data domain in a data-mesh architecture.

Enterprise Data Fusion configuration

This architecture has a number of advantages — we can leverage project-level isolation for cost attribution, tie data-domain-specific permissions to service accounts in the constituent domain projects, and use CDAP's namespace capabilities along with Data Fusion's RBAC support to create a secure, multi-tenant development UI. In this architecture, our smallest unit of permission is a data domain — we're assuming that data engineers within a domain share the same permissions for ETL development.

Our first task is to create the Data Fusion UI instance. When creating the instance, make sure you select the Enterprise edition, which is required to use namespaces and RBAC for access control and isolation.

Select the “Enterprise” edition

Also be sure to select the Granular Role Based Access Control option to enable RBAC.

Select the “Enable Granular Role Based Access Control” option

For a more complete overview of options when creating a Data Fusion instance, see the official documentation. I’ll be writing another post on setting up and testing a private cluster soon!
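If you'd rather script the instance creation than click through the console, the same two settings can be applied through the Cloud Data Fusion REST API. Here's a minimal sketch using the google-auth library; the project, region, and instance ID are placeholders, and I'm assuming the enableRbac field on the instance resource is what backs the RBAC checkbox.

```python
# Sketch: create an Enterprise-edition Data Fusion instance with granular RBAC
# enabled via the Cloud Data Fusion REST API. Project, location, and instance
# ID are placeholders; substitute your own values.
import google.auth
from google.auth.transport.requests import AuthorizedSession

PROJECT = "my-analytics-project"   # placeholder
LOCATION = "us-central1"           # placeholder
INSTANCE_ID = "shared-datafusion"  # placeholder

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
session = AuthorizedSession(credentials)

url = (
    f"https://datafusion.googleapis.com/v1/projects/{PROJECT}"
    f"/locations/{LOCATION}/instances?instanceId={INSTANCE_ID}"
)
body = {
    "type": "ENTERPRISE",  # Enterprise edition is required for namespaces/RBAC
    "enableRbac": True,    # the "Granular Role Based Access Control" checkbox
}

resp = session.post(url, json=body)
resp.raise_for_status()
print(resp.json())  # returns a long-running operation you can poll
```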

Next, log in to the Data Fusion UI, then select “SYSTEM ADMIN” from the upper-right:

Click on SYSTEM ADMIN

Now, select “Configuration” on the upper right, then click “Create New Namespace”:

Configuration -> Create New Namespace

Create a new namespace — here I’m creating one for Human Resources. Only worry about the namespace name and description for now; we’ll go back and add a bit more information in a future section.

Create a new namespace
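The same namespace can also be created through the CDAP REST API that sits behind Data Fusion, which becomes handy once you have more than a couple of domains. A minimal sketch, assuming you've pulled the instance's apiEndpoint from the instance details (the endpoint value below is a placeholder):

```python
# Sketch: create the Human_Resources namespace through the CDAP REST API.
# The API endpoint comes from the instance's apiEndpoint field; the value
# below is a placeholder.
import google.auth
from google.auth.transport.requests import AuthorizedSession

API_ENDPOINT = "https://<your-instance-api-endpoint>/api"  # placeholder

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
session = AuthorizedSession(credentials)

resp = session.put(
    f"{API_ENDPOINT}/v3/namespaces/Human_Resources",
    json={"description": "ETL development for the HR data domain"},
)
resp.raise_for_status()
```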

Now, return to the Data Fusion section of the Google Cloud console, and click on the “Permissions” section on the upper-left.

Permissions section of the Data Fusion console

Click on the "ADD" button, and let's add our first user — the group that represents the data team for the HR data domain.

Enter the group email address in the principal section, and ensure that Namespace User is selected. Instance Admin and Namespace User are how we define instance administrators and regular namespace users when RBAC is enabled. A Namespace User will be limited to seeing objects in Data Fusion (like connections and pipelines) that are scoped to their namespace.

Next, click on Select Namespaces and we’ll choose our newly created HR namespace:

Access rights for the Human_Resources namespace.

Change the role from Cloud Data Fusion Editor to Developer. For a complete breakdown of the pre-defined roles, see the Data Fusion documentation here. Save your changes; now any user added to the gcp-data-eng-hr-dev group will have Developer access to our Data Fusion instance, and their permissions will be limited to objects scoped to the Human_Resources namespace.
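If you'd rather script this role assignment, my understanding is that the Permissions page is backed by IAM policies on the namespace resource in the Data Fusion API (v1beta1 at the time of writing). The sketch below assumes the namespaces:setIamPolicy method and the resource-path format shown, so verify both against the current REST reference; the group address is a placeholder.

```python
# Sketch: grant the HR data engineering group the Developer role, scoped to
# the Human_Resources namespace. Assumes the v1beta1 namespaces:setIamPolicy
# method and resource path; project, location, instance, and group address
# are placeholders.
import google.auth
from google.auth.transport.requests import AuthorizedSession

NAMESPACE_RESOURCE = (
    "projects/my-analytics-project/locations/us-central1"
    "/instances/shared-datafusion/namespaces/Human_Resources"
)
GROUP = "group:gcp-data-eng-hr-dev@example.com"

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
session = AuthorizedSession(credentials)

policy = {
    "bindings": [
        {"role": "roles/datafusion.developer", "members": [GROUP]},
    ]
}
resp = session.post(
    f"https://datafusion.googleapis.com/v1beta1/{NAMESPACE_RESOURCE}:setIamPolicy",
    json={"policy": policy},
)
resp.raise_for_status()
```

Keep in mind that setIamPolicy replaces the whole policy, so in a real script you'd read the existing policy first, merge in the new binding, and write it back with the returned etag; the console handles that bookkeeping for you.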

Let's recap! We've created a namespace in Data Fusion, then used the Permissions feature to assign a group identity principal as a namespace user via RBAC — this means that anyone who is a member of that group now has Data Fusion Developer privileges scoped to the Human_Resources namespace. We can expand this concept and create a namespace per data domain in our data mesh, then link those namespaces to a group or groups in our IAM structure, allowing us to securely use a single Data Fusion instance for ETL development, saving on cost and lowering operational overhead.
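If you want to script that fan-out, the namespace call from the earlier sketch repeats cleanly per domain (the domain list and endpoint below are placeholders):

```python
# Sketch: one namespace per data domain, reusing the CDAP namespace API call
# shown earlier. Domain names and the endpoint are placeholders.
import google.auth
from google.auth.transport.requests import AuthorizedSession

API_ENDPOINT = "https://<your-instance-api-endpoint>/api"  # placeholder
DOMAINS = {
    "Human_Resources": "ETL development for the HR data domain",
    "Finance": "ETL development for the Finance data domain",
    "Marketing": "ETL development for the Marketing data domain",
}

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
session = AuthorizedSession(credentials)

for namespace, description in DOMAINS.items():
    resp = session.put(
        f"{API_ENDPOINT}/v3/namespaces/{namespace}",
        json={"description": description},
    )
    resp.raise_for_status()
    print(f"Created namespace {namespace}")
```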

In the next post I’ll go over the service accounts, IAM permissions, and configuration to enable each namespace to run its pipelines in a particular project on Dataproc with a defined service account.
