Data Analytics

Bring analytics to your data: What’s new with BigQuery federated queries

April 3, 2023

https://storage.googleapis.com/gweb-cloudblog-publish/images/datanalytics2022.max-2500x2500.jpg

Christopher Crosbie

Group Product Manager, Google

Michał Komorowski

Senior Software Engineer

Google Cloud provides a unified, open, and intelligent data cloud for all your operational and analytical data. Whether your data is stored in Cloud Storage, Cloud Bigtable (NoSQL) or even another cloud, with BigQuery, you can run analytical queries directly on the data. One strength in particular for BigQuery is the reduction of toil that normally comes with ingesting data into your data warehouse from your operational databases. If you use a fully managed operational database such as Cloud Spanner, Cloud SQL or Cloud Bigtable, BigQuery can help simplify and unify your operational and analytical databases.

One simplified approach for data movement from MySQL, PostgreSQL, AlloyDB, and Oracle databases directly into BigQuery is Datastream for BigQuery, a serverless and easy-to-use change data capture replication service that allows you to synchronize Cloud SQL or AlloyDB data into BigQuery. Another approach used by thousands of customers for both data movement as well as querying data in place from either Cloud SQL or Cloud Spanner are BigQuery Federated Queries, which let you send a query statement to an operational database and get the result back as BigQuery data (including a conversion to BigQuery data types).

In this post, we will take a look at the different ways BigQuery federated query customers BT Group and MadHive take advantage of the Data Cloud’s unified data warehouse and operational databases. We will also review what’s new in federated queries, including:

SQL Pushdown
Private IP Access
The setting of priority queues for Spanner federation
Spanner to BigQuery JSON type mapping

How customers use federated queries

There are thousands of customers already using federated queries for various reasons and in different ways. We have seen financial institutions who need to maintain regulated projects of customer data use federated queries to push down calculations into their operational stores and then bring back snapshots of aggregated data into BigQuery for further processing. Network security vendors have reduced the pain of maintaining and managing hundreds of ETL jobs by moving to federated queries that are triggered as BigQuery scheduled queries. Healthcare providers who have found themselves restricted on the use of CDC have been able to provide their data analysts with federated query access to the operational data from BigQuery in near real time.

https://storage.googleapis.com/gweb-cloudblog-publish/images/madhive.max-800x800.jpg

Madhive, a leading technology company for modern tv advertising, makes heavy use of BigQuery to perform campaign analytics and data ingestion at petabyte scale. Madhive has adopted federated queries to enrich large amounts of data in their ETL pipelines with data stored in Cloud SQL. As the lead of the data engineering team explains, “federated queries vastly simplify our data pipelines by providing a seamless integration between BigQuery and CloudSQL, that is both easy to maintain and does not require additional tooling.”

https://storage.googleapis.com/gweb-cloudblog-publish/images/bt_group.max-700x700.jpg

BT Group’s Digital unit takes a different approach to federated queries. Instead of using federated queries for ETL, Digital prefers to reduce data duplication. Digital’s teams create views on top of the external query functions and let their analysts query the underlying operational databases via BigQuery views. This performs well enough for most use cases because of a new SQL pushdown feature described below. Crucially, this delivers a speed boost at reduced effort, improving the experience for the teams and enabling the teams to more rapidly support new and existing customer requirements.

New BigQuery federated query features

SQL pushdown

SQL pushdown is an optimization technique. It happens when BigQuery delegates operations like filtering down to the external data source (e.g., Cloud SQL or Cloud Spanner database) instead of performing them on their own. Thanks to that smaller amount of data needs to be transferred back to BigQuery, the overall query execution time is lower and the cost can be significantly reduced since less data is processed. SQL pushdown ecompasses both column pruning (SELECT clauses) and filter pushdowns (WHERE clauses).

Let’s take a look at how pushdowns work under the hood at the example of the SQL that is now generated in Cloud SQL via federated queries.

BigQuery

Cloud SQL - no pushdowns

Cloud SQL - with pushdowns

In the first column you can see a BigQuery query. As you can see it uses the EXTERNAL_QUERY function for communication with a Cloud SQL database. In the middle column you can see the query that would be sent to Cloud SQL without pushdowns. It’s exactly the same query that was provided by a user. If the source table has millions of rows and hundreds of columns, all of them would be sent to BigQuery even though only some of them are really needed. Finally, in the last column you can see how the original query provided by a user would be rewritten with pushdowns. This time only some columns and some rows would be sent back to BigQuery.

SQL pushdowns increase the flexibility with which you can use federated queries in BigQuery. This is the feature that allows BT Group to simply put views on top of their federated query functions and get performant cross-database queries. BT Group’s Digital unit reviewed queries and found that a query similar to the above originally took 10 minutes to be executed by BigQuery federated queries. Now with pushdowns it takes only 26 seconds!

Currently, SQL pushdowns are only applied to queries of the form SELECT * FROM T since these queries constitute a significant percentage of all federated queries. Another limitation is that currently pushdowns are not supported in all cases e.g. not all data types are supported for filter pushdowns. We plan to support more queries and more types of pushdowns, so stay tuned.

Private IP access
Many customers have sensitive datasets they need to keep isolated from the public internet. This new release allows customers to use federated queries on instances with private IP while keeping their data isolated from the public internet.

Manage Spanner execution priority over federation
Cloud Spanner offers a request priority feature, which lets customers assign HIGH, MEDIUM and LOW priorities to specific queries. Queries to Spanner made via a federated BigQuery connection tend to be analytical in nature; for analytical queries,the best practice is to avoid contending with transactional or application requests. However, an analytical query should take precedence over something like a background job or scheduled backup. For most customer use cases, the default setting of MEDIUM meets this requirement.

However, we had requests from gaming companies who wanted to ensure that their games never be interrupted for analytical queries, social media platforms who need to be up 24/7, and automotive companies that did not want to risk any analytical queries contending with their Spanner operations. For these outlier situations where prioritizing operations at all cost is paramount, customers can now set a LOW priority on their federated query. Please just use caution with this setting because LOW priority jobs can be preempted which could lead to failed queries.

Here is a sample query that uses this new feature:

Spanner to BQ JSON type mapping

Both Spanner and BigQuery offer native data types for working with JSON. Unfortunately, since one JSON data type is intended for operational queries and the BigQuery native data type is optimized for analytical queries, the two were not initially compatible. However, with the introduction of a new JSON data type mapping for Spanner, the Spanner JSON type (including the Spanner PostgreSQL JSONB type) will automatically be converted into the BigQuery native JSON data type, allowing you to work with semi-structured, schema-changing data, across both operational and analytical databases

Getting started

There are a few ways to get started with BigQuery. New customers get $300 in free credits to spend on BigQuery. All customers get 10 GB storage and up to 1 TB queries free per month, not charged against their credits. You can get these credits by signing up for the BigQuery free trial. Not ready yet? You can use the BigQuery sandbox without a credit card to see how it works.

To learn more about federated queries with BigQuery, see the documentation. To learn more about Google Cloud databases such as Cloud SQL, Spanner and Bigtable, get started here.

Posted in