Creating BigQuery Table Snapshots Dynamically

Abhik Saha
Google Cloud - Community
5 min readApr 27, 2023

--

A table snapshot in BigQuery is a way of preserving the contents of a table, referred to as the base table, at a specific moment in time. It is possible to take a snapshot of a table in its current state or create one of a table as it existed at any time within the last seven days. A table snapshot may have an expiration period, after which BigQuery automatically removes it.

Despite being read-only, you can use a snapshot table for querying purposes just like a standard table. Additionally, a snapshot can serve as the source for creating a standard table that can be modified.

One of the advantages of using table snapshots is the ability to keep a record of a table’s contents for longer than seven days. Unlike with BigQuery’s time travel feature, which allows you to access a table’s data from the last seven days or less, snapshots can retain a table’s data from a specific point in time indefinitely.

Moreover, snapshots can help reduce storage costs since BigQuery only stores the bytes that differ between the snapshot and its base table. As a result, a table snapshot typically uses less storage space than making a full copy of the table.

image source: google-cloud

Why are Table Snapshots useful?

Table snapshots are useful in many scenarios where you need to analyse data as it existed at a specific point in time, even if the data in the source table has been modified or deleted since that time. Here are some of the reasons why table snapshots are useful:

  1. Compliance Reporting: Table snapshots can be used to satisfy compliance reporting requirements, where data needs to be preserved for a specific period. For example, financial institutions need to preserve data for auditing purposes, and snapshots can help ensure that the data is preserved and available for analysis.
  2. Debugging: If you are working on a complex data pipeline or ETL process, it can be helpful to create snapshots at various stages of the process. This can help you identify issues and compare the data at various stages to determine where the issue occurred.
  3. Auditing: Table snapshots can be used for auditing purposes, to compare the data in a table at different points in time and identify any discrepancies. This can help ensure that the data is accurate and complete.
  4. Historical Analysis: If you need to analyse data as it existed at a specific point in time, creating a snapshot can help you do so. This can be useful in many scenarios, such as analysing customer behaviour over time, detecting fraud, or identifying trends.
  5. Disaster Recovery: Table snapshots can be used as a backup mechanism in case of data loss or corruption. If you have a snapshot of the data at a specific point in time, you can use it to restore the data in case of a disaster.

Limitation

  1. To take a table snapshot, it must be in the same region and under the same organization as its base table.

2. Table snapshots are read-only, meaning that you cannot update their data directly, but you can modify their metadata such as description, expiration date, and access policy. However, you can create a standard table from a snapshot and update its data as needed.

3. Due to the time travel feature’s seven-day limit, you can only snapshot a table’s data as it was within the last seven days.

4. Snapshots cannot be taken of views, materialized views, or external tables.

5. When creating a table snapshot, you CANNOT OVERWRITE an existing table or snapshot.

6. If a table being snapshot contains data in write-optimized storage, such as a streaming buffer, it will not be included in the snapshot.

7. If a partitioned table with a partition expiration set is snapshotted, the partition expiration information will not be retained. The snapshotted table will use the destination dataset’s default partition expiration. To preserve the partition expiration information, copy the table instead of taking a snapshot.

Syntax to Create/Delete Table Snapshots

You can create a snapshot of a table using the below syntax. Best practice is to create a table snapshot in a different dataset from the base table.

This practice allows the base table to be restored from its table snapshot even if the base table’s dataset is accidentally deleted.

Creating Table Snapshots Dynamically

CREATE OR REPLACE PROCEDURE `plated-field-383807.snapshot_dataset.sp_table_snapshots`()
BEGIN
DECLARE log_date string;
DECLARE custom_message string;
DECLARE v_project_name string;
DECLARE v_dataset_name string;
DECLARE v_table_name string;
DECLARE v_CSSTifnotexists_query string;
DECLARE v_CSStablename string;
DECLARE v_droptable_query string;
DECLARE v_csst_query string;

BEGIN
SET custom_message ='Error during snapshot creation';

FOR source_tables IN
(
/* Selecting the project_name,dataset_name and table name dynamically from
INFORMATION_SCHEMA */
SELECT
table_catalog AS project_name,
table_schema AS dataset_name,
table_name
FROM
`plated-field-383807.manual_input.INFORMATION_SCHEMA.TABLES`
WHERE
table_type = 'BASE TABLE'
AND table_name NOT IN('error_log_table')
/* Filtering out the table(s) we do not need a snapshot of */
)
DO

SET v_project_name = source_tables.project_name;
SET v_dataset_name = source_tables.dataset_name;
SET v_table_name = source_tables.table_name;
SET v_CSStablename=source_tables.table_name || '_tss';

--select v_CSStablename as v_CSStablename;

/* During the initial run, the snapshots tables do not exist.
So we need to create them by default */

SET v_CSSTifnotexists_query = FORMAT("""CREATE SNAPSHOT TABLE IF NOT EXISTS `%s.%s.%s` CLONE `%s.%s.%s`""",
v_project_name,
'snapshot_dataset',
v_CSStablename,
v_project_name,
v_dataset_name,
v_table_name);
--select v_CSSTifnotexists_query as v_CSSTifnotexists_query;

/* Dropping the old snaphot as Create or replace table don't work
incase of snapshot tables */
SET v_droptable_query = FORMAT("""drop snapshot table `%s.%s.%s`;""",v_project_name,'snapshot_dataset',v_CSStablename);
--select v_droptable_query as v_droptable_query;


/* Creating a snaphot table with 2 days of expiration period
i.e, the table will be dropped after two days*/
SET v_csst_query = FORMAT("""CREATE SNAPSHOT TABLE `%s.%s.%s` CLONE `%s.%s.%s` OPTIONS (expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 2 DAY));""",
v_project_name,
'snapshot_dataset',
v_CSStablename,
v_project_name,
v_dataset_name,
v_table_name);
--select v_csst_query as v_csst_query;

EXECUTE IMMEDIATE v_CSSTifnotexists_query;
EXECUTE IMMEDIATE v_droptable_query;
EXECUTE IMMEDIATE v_csst_query;

END FOR;

EXCEPTION WHEN ERROR THEN

SET LOG_DATE=( SELECT CAST(CURRENT_DATETIME() AS STRING) AS LOG_DATE);


EXECUTE IMMEDIATE "insert into `plated-field-383807.manual_input.error_log_table` (LOG_DATE, PROJECT_NAME, DATASET_NAME, PROCEDURE_NAME,ERROR_STATEMENT_TEXT, ERROR_MESSAGE, CUSTOM_MESSAGE) values (?,?,?,?,?,?,?)" USING LOG_DATE,'plated-field-383807','snapshot_dataset','sp_table_snapshots',@@error.statement_text,@@error.message,custom_message;

END;
END;

Explanation

  1. Declare all the necessary variables initially.
  2. Start a for loop and select each table from INFORMATION_SCHEMA of the dataset you need to create a snapshot of.
  3. For the first run of the procedure, the snapshots tables do not exist.
    So, we need to create them initially. We use the ‘CREATE SNAPSHOT TABLE IF NOT EXISTS’ clause for that.
  4. Then we drop the old snapshots of the tables that already exist.

5. Then we create a new snapshot of the tables with two days expiration period.

6. For error logging details, check the article here.

Results

We can see the procedure executed successfully and all the snapshot tables have been created.

Procedure executed successfully

Follow me on LinkedIn and Medium to get more content like these!

--

--

Abhik Saha
Google Cloud - Community

Data Engineer @Accenture India || Writes about Bigquery, Cloud Function, GCP, SQL || LinkedIn ID: https://www.linkedin.com/in/abhik-saha-919646108/