Data warehouse, data lake, delta lakes, and multicloud data with Google Cloud’s BigLake

SADA
The SADA Engineering Blog
Apr 12, 2023


Author: Umit Tiric, Manager, Data Engineering, SADA
Co-author: Atnafu Dargaso, Senior Data Engineer, SADA

I remember the first time I had the chance to train on and use a data warehouse running on a database consolidation appliance with built-in high availability and disaster recovery, back in 2012. The full-rack appliance let us store 58 terabytes of data and delivered decent storage capacity and performance, but it had scalability and cost issues, and the amount of data we were processing kept increasing drastically.

Around the same time, data lakes and Hadoop file systems started to handle big data for advanced analytics. Many of those data lakes turned into swamps and became very hard to manage and democratize.

Big data gained traction and promised to provide business insights, help businesses strategize, increase customer satisfaction, and deliver more focused value to customers so companies could get ahead of their competitors. Cloud providers have built many products that let data analysts and data scientists extract insights from big data. While those technologies address different use cases such as analytics, AI/ML, and streaming, each maintains its own internal storage structure to deliver high-performance processing, so the same data ends up living in multiple places. These silos add management overhead, increase storage costs, and create headaches for managing security and democratizing the data. Many organizations have also adopted a multicloud approach for specific use cases, so their business data sits in storage systems across public cloud providers such as Google Cloud, AWS, and Azure.

Google Cloud introduced BigLake in April 2022 to avoid data silos and data duplications and enable unified, fine-grained access to data. According to Google, “Moving to the future requires breaking down data silos and enabling various analytics use cases, regardless of where data is stored.”

BigLake is a storage engine (built on years of investment in the BigQuery platform) that promises to unify data lakes and data warehouses by providing fine-grained access controls across multicloud object stores such as Google Cloud Storage, Amazon S3, and Azure Data Lake Storage (ADLS) Gen2, in open file formats such as Avro, ORC, and Parquet. Under the hood, BigLake uses BigQuery Omni to provide access to data stored in Amazon S3 and Azure ADLS. Since BigLake acts as a logical storage layer, data administrators can define fine-grained access permissions at the BigLake layer without changing access permissions on the underlying storage. When structured data in storage is defined as BigLake tables, administrators can grant row-level or column-level access to users or other data processing systems.

The BigQuery Storage API has also been extended to give other data processing engines such as Apache Spark, Trino, and Presto access to BigLake tables. Those systems can read the data with the same access permissions defined on the BigLake tables, without duplicating data into BigQuery's native storage or creating yet another silo. Data scientists no longer need to import data just to build models; they can access the data in place through the Vertex AI connectors for BigLake. Because the security and access controls already enforced on the tables carry over to every tool, the experience is consistent regardless of which tool they choose.

BigLake key features

Fine-grained security controls: Allows for table-, row-, and column-level security policies on object store tables at the API level, which eliminates the need for granting file-level access to end users and processing engines.

Compute accessibility: Maintains a single copy of data to avoid data silos and make it uniformly accessible across Google Cloud and open-source engines with the same fine-grained security access policies.

Performance acceleration: Provides high performance on data lakes on GCP, AWS, and Azure by capitalizing on proven BigQuery innovations.

Built on open formats: Allows access to the most popular open data formats, including Parquet, Avro, ORC, CSV, and JSON.

BigLake and BigQuery

BigLake enhances BigQuery’s data warehousing capabilities by extending them to Google Cloud, Azure, and AWS storage. It employs Google Cloud API interfaces to offer enhanced access control and accommodates open formats such as CSV, JSON, Avro, Parquet, and ORC, as well as open source processing engines like Apache Spark. By doing so, it eliminates the artificial distinction between managed warehouses and data lakes, allowing for a single source of truth across various cloud platforms without the need for data duplication or copying.

In short: BigLake = GCS (or other object stores) + BigQuery.
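As a small illustration (the table names below are hypothetical), a BigLake table defined over files in object storage can be queried, and even joined with native BigQuery tables, using ordinary GoogleSQL:

-- `sales_biglake` is assumed to be a BigLake table over Parquet files in GCS;
-- `customers` is assumed to be a native BigQuery table
SELECT c.customer_name, SUM(s.amount) AS total_spend
FROM `my_project.my_dataset.sales_biglake` AS s
JOIN `my_project.my_dataset.customers` AS c
  ON s.customer_id = c.customer_id
GROUP BY c.customer_name;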

Advantages

  • Store a single copy of data and use the same features across data warehouses and lakes
  • Apply granular access control over data in multiple formats and locations
  • Remove the need to manage file-level access for end users on data stored in object storage
  • Govern distributed data across GCP, AWS, and Azure with multicloud governance
  • Integrate seamlessly with open source analytics tools and formats

Limitations

Set up access control policy

Access policies in BigLake are the rules and procedures used to control access to specific rows and columns of data. They ensure that only authorized users can view or query those rows or columns while preventing unauthorized access. Before creating BigLake access policies, set up the BigLake features in BigQuery.

To use BigLake features in BigQuery:

  1. Enable the BigQuery Connection API
  2. Add an external data source connection (+ ADD EXTERNAL DATA SOURCE) from the BigQuery UI
  3. Give the connection's service account access to the data in storage
  4. Create a BigLake table (see the sketch after this list)
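As a rough sketch of step 4 (the project, dataset, connection, and bucket names are hypothetical), a BigLake table is created with a CREATE EXTERNAL TABLE statement that references the connection:

-- Assumes a Cloud resource connection `us.gcs_connection` created in step 2
-- and Parquet files under gs://example-bucket/sales/ (hypothetical names)
CREATE EXTERNAL TABLE `my_project.my_dataset.sales_biglake`
WITH CONNECTION `my_project.us.gcs_connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://example-bucket/sales/*.parquet']
);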

1. Row-level access policy

Row-level security allows filtering of data and grants access to particular rows in a table, depending on user qualifications.

  • To create a new access policy, use:
CREATE ROW ACCESS POLICY or CREATE ROW ACCESS POLICY IF NOT EXISTS
  • To create, replace, or update an existing access policy, use:
CREATE OR REPLACE ROW ACCESS POLICY

To grant access to:

  • An individual user:
CREATE OR REPLACE ROW ACCESS POLICY policy_name
ON projectId.dataset_name.table_name
GRANT TO ('user:atnafu@example.com')
FILTER USING (category = 'internal');
  • A group, by changing the GRANT TO clause (see the complete example after this list):
GRANT TO ('group:marketing-asia@example.com')
  • All authenticated users, filtering each user to the rows that list them in a column:
FILTER USING (SESSION_USER() IN UNNEST(column))
  • To query a table with row-level policies, users also need the bigquery.tables.getData permission.

To delete a policy:

  • A single policy:
DROP ROW ACCESS POLICY policy_name ON table_path
  • All policies on a table:
DROP ALL ROW ACCESS POLICIES ON table_path
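Putting the pieces together, a complete group-scoped policy might look like the following (the policy name, table, and region column are hypothetical), and all row-level policies on a table can be dropped with a single statement:

-- Hypothetical policy restricting the APAC marketing group to APAC rows
CREATE OR REPLACE ROW ACCESS POLICY apac_marketing_filter
ON `my_project.my_dataset.sales_biglake`
GRANT TO ('group:marketing-asia@example.com')
FILTER USING (region = 'APAC');

-- Remove every row-level policy on the table when it is no longer needed
DROP ALL ROW ACCESS POLICIES ON `my_project.my_dataset.sales_biglake`;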

2. Column-level access policy

Column-level access policies provide a more granular level of security than traditional access controls, which typically only let you control access to an entire table or view. With column-level access policies, you can control access to individual columns within a table or view, making it easier to enforce data privacy and compliance regulations.

To implement a column-level access policy in BigLake, you first create a taxonomy of policy tags that defines the access rules for the specific columns in your table. You can create the taxonomy through the BigQuery console. Once you have created a policy tag, you associate it with one or more users or roles, which controls their access to the columns covered by that tag.

To restrict data access at the column level:

  • Create a taxonomy of policy tags for fine-grained access to sensitive columns
  • Apply the tags to selected BigLake table columns so that those columns are restricted to users with fine-grained reader roles
  • To authorize users to access a BigLake column labeled with a policy tag, grant them the Fine-Grained Reader role on that policy tag
  • To delete a policy: deleting a policy tag automatically removes the association between the policy tag and any columns it was applied to. You can do this with the "delete tags" button in the Policy Tags section of the BigLake table's Edit Schema page.
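To illustrate the effect (the table and column names are hypothetical): a user who has not been granted the Fine-Grained Reader role on the policy tag protecting a column cannot select that column, but can still query the rest of the table by excluding it:

-- Fails with an access-denied error if customer_email is protected by a
-- policy tag the user cannot read (hypothetical table and column)
SELECT * FROM `my_project.my_dataset.sales_biglake`;

-- Succeeds: the protected column is excluded from the projection
SELECT * EXCEPT (customer_email) FROM `my_project.my_dataset.sales_biglake`;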

Conclusion

BigLake is an excellent solution for organizations working with structured data, both within Google Cloud and across cloud providers, that want to make that data accessible and usable in a simple processing environment for developers, engineers, analysts, and data scientists. It delivers cost-effective, fast query performance over multicloud storage and open formats. Its integration with open source engines such as Spark, and the way it extends BigQuery to bridge disparate environments such as data warehouses and data lakes with fine-grained access control, are significant benefits.

About Umit Tiric

Umit Tiric, Manager of Data Engineering at SADA, has 26+ years of professional Information Technology experience (although his career began in 1989 when he released the first commercial packaged software for policy management for insurance companies). Since then, he has designed and implemented IT projects, including the network layer. He has architected and developed enterprise data solutions and applications using various programming languages as well as multi-cloud technologies and open-source frameworks and tools.

About Atnafu Dargaso

Atnafu Dargaso is a Senior Data Engineer at SADA with more than 12 years of experience in the tech industry. He has worked with some of the world’s most innovative companies, including Walt Disney and Getty Images. Throughout his career, he has been fortunate to work on various exciting projects, such as being a Senior Big Data Engineer in a European Union project. Atnafu has a Bachelor of Science in Computer Science, a Master of Science in Computer and Communication Network Engineering from Turin, Italy, and a Masters in Computer Science from Fairfield, Iowa. He also completed graduate-level coursework in Data Science at Harrisburg University of Science and Technology, Pennsylvania, and is currently pursuing a PhD in Computer Science. In addition to his academic background and work experience, he holds several professional certifications, including Google Cloud Professional Data Engineer, Google Cloud Professional Database Engineer, Google Cloud Professional Network Engineer, and Google Cloud Professional Collaboration Engineer. Atnafu’s time at SADA has been one of the most rewarding experiences of his career. He is currently working on exciting projects, collaborating with talented individuals, and continuing to grow as a data engineer.
