Dataform, what’s the story?

Just exactly what is Google Cloud’s Dataform service and how easy is it to implement . . .

Lee Doolan
Appsbroker CTS Google Cloud Tech Blog

--

https://cloud.google.com/blog/products/data-analytics/introducing-dataform-in-ga

Introduction

In this blog post, I'll explain what Dataform is, how it compares to similar products like dbt, and which one you might choose.

I'll then show how you can configure Dataform and easily build, execute, and schedule a basic data pipeline.

What is Dataform?

Dataform is a cloud-based platform that helps data teams build, version control, and orchestrate SQL workflows in BigQuery.

It has finally reached general availability on Google Cloud after being acquired by Google in late 2020, but it was around for many years before that, available as an open-source download or via a free cloud-based IDE.

Now offered as a fully managed Google Cloud service, it provides many features to help data professionals, such as data engineers and data analysts, easily build and manage data pipelines. These features include:

  • A SQL-based language for defining data transformations.
  • A serverless orchestration engine to schedule transformations and determine dependencies.
  • A cloud development environment IDE with integrated version control. Code can be deployed to development, test, and production environments easily and securely.
  • A built-in quality assurance framework of assertions to check data quality (a small sketch of this follows below).
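To give a flavour of that last bullet, data quality rules can be declared directly in a table's config block using Dataform's built-in assertions. The sketch below is illustrative only; the table, column, and source names are hypothetical:

config {
  type: "table",          // a hypothetical table used purely for illustration
  name: "my_table",
  assertions: {
    nonNull: ["id"],               // fail the run if any id is NULL
    uniqueKey: ["id"],             // fail the run if id is not unique
    rowConditions: ["value >= 0"]  // fail the run if any row breaks this rule
  }
}

SELECT id, value
FROM ${ref("my_source_table")}  -- hypothetical upstream object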

Is Dataform like dbt?

Yes, it's very similar to dbt; they both do the same thing under the hood, i.e. run and orchestrate SQL commands in the underlying database.

They both implement the Transform stage of the modern data lake ELT approach, meaning other tools are required to first load data into the data platform before Dataform or dbt are used.

The biggest difference is that Dataform uses JavaScript as its scripting language whilst dbt opts for Jinja, a web templating engine for Python.

The core language of both is still SQL, so each scripting language is used sparingly and only basic knowledge of it is needed. In my opinion this difference shouldn't be a deciding factor when choosing a product, although JavaScript is arguably the more widely known of the two.
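To illustrate how small that scripting footprint usually is, here is a minimal, hypothetical SQLX sketch in which a js block defines a constant that is then substituted into the SQL (the source table and column names are made up):

config {
  type: "view"
}

js {
  // a simple constant reused in the SQL below
  const lookback_days = 7;
}

SELECT *
FROM ${ref("events")}  -- hypothetical source table
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL ${lookback_days} DAY)

The dbt equivalent would use Jinja, for example {{ ref('events') }}, but in both cases the surrounding SQL is plain SQL.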

It is, however, more appropriate to compare Dataform against the different flavours of dbt, i.e. dbt Core versus dbt Cloud.

Dataform v dbt Core

dbt Core has no IDE, relying instead on CLI commands and configuration via YAML files. It also requires self-hosting, and the additional technical knowledge needed means it's probably only suitable for more technical users such as data engineers.

Dataform v dbt Cloud

Dataform and dbt Cloud, on the other hand, are very much alike. Both have an easy-to-use IDE that allows less technical users to get started more easily.

The major difference is the payment model and cost.

dbt Cloud requires a monthly subscription plus any charges incurred on the data platform . . .
whereas Dataform is free to use with only the data platform charges incurred.

Comparison Summary

The cheapest options are most definitely Dataform and dbt Core, but Dataform wins out by offering the same ease of use as the paid-for dbt Cloud together with many features that dbt Core lacks.

For existing Google BigQuery users I would recommend Dataform. You will get the same functionality as dbt Cloud and can save on the cost of the monthly subscription.

See the summary comparison table below:

Revised from Google Bard chat: ‘Tell me the differences between Dataform, dbt Core and dbt Cloud’

Implementation

For this implementation, I'm going to approach things as a non-technical user would. I want to demonstrate how easy Dataform is to set up, and how quickly you can develop and schedule a pipeline.

So no Infrastructure as Code, CI/CD, or integration with a GitHub repo; I'll save that for another blog.

I’m going to create a repository and pipeline using the Google Cloud GUI, assuming all projects and permissions are already in place, although I will hint at any required dependencies during each step. You can read the Google Cloud documentation on how to do most of these steps here.

Initial Setup

There are at least 3 projects you would normally need to create a Dataform pipeline, although in theory these could all be the same project.

  • A ‘utilities’ project to contain your Dataform repository. You will need the Project Owner role to create the repository.
  • A ‘target data’ project which will eventually contain your BigQuery target datasets and data that you transform.
  • A collection of ‘source data’ projects containing your BigQuery source data that you wish to transform. You can ignore this for this demo as we will use a BigQuery public dataset table.

The Dataform service account, which is auto-created with the repository, will eventually need permissions to read data from the source data projects and write data to the target data project.

Create a Repository

From your utilities project, using the Google Cloud web GUI, follow these steps:

  1. Browse to the Dataform product and enable the API.

2. Once the API is enabled (you may have to re-browse to it), click + CREATE REPOSITORY

3. Now give the repository a name and region.

In future, you can choose to split your workloads across multiple repositories.

For example, your core data warehouse pipelines may live in a secure, controlled repository, whilst your data analysts' downstream pipelines sit in a more open, less controlled repository.

Form to create a repository

4. Your repository will now be reported as created, and the Dataform service account details are shown. Please make a note of the service account as you will need it later.

Create a Development Workspace

After your repository is created, click into the repository to complete the next steps.

  1. Create the Development Workspace by clicking + CREATE DEVELOPMENT WORKSPACE
  2. You will then be prompted to name the development workspace.

Development Workspaces are essentially similar to GitHub feature branches, meaning multiple workspaces will be created for groups of developers working on different pipelines.

Therefore the name isn't overly important, but for clarity it should ideally represent the pipelines being developed in that workspace.

Prompt for development workspace name

Initialise Workspace

After your first Development Workspace is created, you will need to initialise your workspace. Click into the workspace and complete the steps below.

Initialising a workspace adds all the files needed for Dataform and only needs to be carried out once, i.e. at the inception of the repository's first ever workspace.

Subsequent workspaces will ‘pull’ the required files from the committed master branch.

  1. Click the button INITIALIZE WORKSPACE
  2. This will then create all the required files. You must then commit these changes using the COMMIT 6 CHANGES button that now appears.
Initialised development workspace

3. Once you have named and committed the changes, you must push them to the master branch using the ^ PUSH TO DEFAULT BRANCH button that now appears.

Create Pipelines

We are now going to build a basic data pipeline, sourcing data from one project, transforming it, and placing it in another.

To demonstrate dependency we will create an intermediate BigQuery view, although in reality you could create the target table in a single step (a single-file alternative is sketched after step 4 below).

The steps required are as below.

  1. Delete the 2 sample files:

definitions/first_view.sqlx

definitions/second_view.sqlx

As part of the initialisation process these 2 sample files are automatically created.

Although these are a good way to get a flavour of how data pipelines can be built, they are pretty basic and will be deleted for this demo.

2. Create a data source definition file for the public dataset.

This is done by first creating a directory called definitions/london-crime, followed by a file in the newly created folder called crime_by_lsoa.sqlx with the following text . . .

config {
  type: "declaration",
  database: "bigquery-public-data",
  schema: "london_crime",
  name: "crime_by_lsoa",
}

The directory created isn't actually required, but it is a good way to organise your Dataform files. Dataform will process any sqlx files nested below the definitions directory, and each object is built in the schema set in its config block, regardless of which directory the file sits in.

You must also ensure your Dataform service account has permissions to read any data source table declared.

3. We will now create the intermediate view to transform data from this data source.

This is done by first creating a directory definitions/target_dataset, followed by a file in the newly created folder called v_crime_by_year.sqlx with the following text . . .

config {
  type: "view",
  database: "dataform-blog",
  schema: "target_dataset",
  name: "v_crime_by_year",
}

SELECT year AS `Year`,
  SUM(value) AS `No_of_Crimes`
FROM ${ref("london_crime", "crime_by_lsoa")}
GROUP BY 1

Note how the config block defines the type of object to be created, i.e. a view, and where it will be built.

Also note the use of the ${ref()} JavaScript snippet in the SQL block to reference back to our source data declaration.

4. We will now create the final target table from this view.

This is done by creating another file in the directory definitions/target_dataset, called crime_by_year.sqlx with the following text . . .

config {
  type: "table",
  database: "dataform-blog",
  schema: "target_dataset",
  name: "crime_by_year",
}

SELECT *
FROM ${ref("target_dataset", "v_crime_by_year")}
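As mentioned earlier, the intermediate view isn't strictly required. For reference, a single file combining the two steps above could build the same target table directly from the declared source, something like this:

config {
  type: "table",
  database: "dataform-blog",
  schema: "target_dataset",
  name: "crime_by_year",
}

SELECT year AS `Year`,
  SUM(value) AS `No_of_Crimes`
FROM ${ref("london_crime", "crime_by_lsoa")}
GROUP BY 1

The two-step version is used here purely to demonstrate how Dataform handles dependencies between objects.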

Review, Commit & Push Code

Now all the code has been created, you should have a workspace looking like this . . .

Our completed development workspace

And after clicking the COMPILED GRAPH tab, you should be able to see the visualised pipeline, which should look similar to this . . .

Compiled Graph of pipeline

If all is good, the final step is as below:

  1. Commit all files.
  2. Push the commit into the default git branch.

Execute the Pipeline (manually)

The next step is to attempt a manual run of the pipeline, which you can do with the following steps.

You must also ensure your Dataform service account has permissions to read, write and create BigQuery data objects in the target project.

This is done by assigning both the BigQuery Data Editor and BigQuery Job User roles to the service account.

  1. Start the pipeline execution by clicking START EXECUTION and then All actions

As you build more pipelines you can decide which pipelines, tables, and views to execute with the use of tags and dependency selections. This is documented here, and I will perhaps cover it in a later blog; a small tagging example is sketched at the end of this section.

  2. Confirm the selection at the side by clicking START EXECUTION
  3. The pipeline will then start and you can review the execution using the EXECUTIONS tab.
Display all pipeline executions

4. You can also review the actual results of the pipeline in BigQuery itself, where you should see the dataset, view, and table have now all been created.

BigQuery dataset
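As a quick illustration of the tagging mentioned above, tags are simply added to an object's config block and can then be selected when starting an execution. The tag name below is just an example:

config {
  type: "table",
  database: "dataform-blog",
  schema: "target_dataset",
  name: "crime_by_year",
  tags: ["crime_demo"]  // hypothetical tag, selectable at execution time
}

SELECT *
FROM ${ref("target_dataset", "v_crime_by_year")}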

Schedule the Pipeline

The final step is to set up the execution of this pipeline on a schedule. Initially, this is a 2-step process, as below.

  1. First, you need to create a 'Release Configuration' if one doesn't already exist.

This is done at the repository level by clicking + NEW RELEASE CONFIGURATION from the RELEASE CONFIGURATIONS tab.

Release configuration tab

Release configurations are where you would normally set up how and where executions are run depending upon your code lifecycle environment i.e. dev, test and prod.

For this demo we will use the default settings and only create a single configuration.

Set it up with basic default settings like below . . .

default release configuration

2. Next, create the actual schedule, i.e. what Dataform calls a 'Workflow Configuration'.

This is done again at the repository level by clicking + NEW WORKFLOW CONFIGURATION from the WORKFLOW CONFIGURATIONS tab.

We will create one that runs every 5 minutes (a cron schedule of */5 * * * *), configured as below. Note the link back to the release configuration.

3. Lastly, note that you can review all scheduled executions using the WORKFLOW EXECUTION LOGS tab.

Workflow execution log

Closing Thoughts

Thanks very much for reading this blog post. Although a fairly brief run-through, I do hope I’ve made you aware of what Dataform is and its potential importance and power in delivering data workflows.

I’ve tried to illustrate how Dataform compares to dbt and why I feel it’s a preferable product for new and existing Google BigQuery users.

I have hopefully demonstrated how you can quickly enable your data analysts to build, visualise and schedule data pipelines in BigQuery.

And to conclude if you are running SQL queries manually or using BigQuery Scheduled Queries, please stop! Or at least consider Dataform instead!

Future Blogs & Reading

We have only covered the very basics here so I’d definitely recommend taking a look at Google Cloud’s excellent Dataform documentation here.

I also kindly ask you to look out for my future Dataform blogs too, where I hope to:

  • Show how you can reuse SQL code with JavaScript.
  • Validate tables with data quality tests and assertions.
  • Demonstrate how full Kimball-style data warehouses can be built, with slowly changing dimension tables and incrementally growing transactional fact tables.
  • Deploy Dataform infrastructure and pipelines using CI/CD to multiple environments, controlling and securing using GitHub branches and workspaces.

Thanks

Finally, a big thanks to my CTS colleagues Tim Ellis-Smith, Jenn Calland, and James Ng for their help with this publication.

About CTS

CTS is the largest dedicated Google Cloud practice in Europe and one of the world’s leading Google Cloud experts, winning 2020 Google Partner of the Year Awards for both Workspace and GCP.

We offer a unique full-stack Google Cloud solution for businesses, encompassing cloud migration and infrastructure modernisation. Our data practice focuses on analysis and visualisation, providing industry-specific solutions for Retail, Financial Services, and Media and Entertainment.

We’re building talented teams ready to change the world using Google technologies. So if you’re passionate, curious and keen to get stuck in — take a look at our Careers Page and join us for the ride!


Lee Doolan
Appsbroker CTS Google Cloud Tech Blog

Cloud Data Warehouse Architect & Data Engineer | UK Based | https://www.linkedin.com/in/leedoolan77 | Thoughts are my own and not of my employer