BigQuery ML and Looker: Meeting the Predictive Analytics Challenge

Integrate Machine Learning models at scale into data visualizations and dashboards with BigQuery ML and Looker, from a Data Analyst's perspective.

Axel Thevenot 🐣
Google Cloud - Community


Introduction

Analyzing the past to make decisions for the future is one of the challenges of Business Intelligence (BI), which traditionally focused on analyzing historical data.

What if we could analyze the future directly to make better decisions? This is now made possible through AI and the adoption of a predictive approach in data analysis.

The major challenge of moving from descriptive BI to predictive BI is to be able to integrate large-scale Machine Learning (ML) models into our visualizations and dashboards in a simple and robust way.

In this article, we will explore the BigQuery and Looker tools, and how to combine them to create an end-to-end predictive data analysis solution using BigQuery ML.

We will explore the potential for a Data Analyst to be autonomous at every stage of the predictive analysis process, from data extraction to model training and interpretation.

As a bonus, we will leverage model interpretability features (eXplainable AI) to make the predictions exposed by our model more interpretable.

Image from Author

Summary

Introduction
Moving Towards Predictive BI
Moving Towards ML for Data Analysts
1) “Predictive BI process is too long and too rigid!”
2) BigQuery ML to the Rescue
3) Looker to the Rescue
4) The Ultimate Combination
Complete Demo with ARIMA Forecasting in Looker
1) Dataset
2) ML Model: ARIMA_PLUS
3) ML Model Inference
4) Create Your Dashboard
Bonus: Making Predictions More Interpretable
Conclusion

Moving Towards Predictive BI

Descriptive BI focuses on examining past data to draw conclusions, which is useful for making reactive decisions.

But this approach has a clear limitation: it cannot predict future outcomes.

Predictive BI, as opposed to just reacting to past events, involves using data analysis to anticipate future trends and make proactive decisions.

This can be done with Machine Learning models, which detect patterns and correlations that can automate processes or assist in decision-making.

Predictive BI has many use cases, such as fraud detection (automation), customer retention (decision-making), and supply chain optimization (more hybrid between automation and decision-making).

While predictive BI is not a new paradigm, it follows a common architecture in many companies.

Image from Author

Note: The architecture is simplified. The tools used or the approach can be more complex.

  • The Descriptive BI part depends on data sources in a data warehouse (BigQuery). The BI tool (Looker) generates the dashboards from the historical data. — Data Analysts & Data Engineers.
  • The ML Training part is fully managed by external AI tools (Vertex AI). Training data is extracted from BigQuery and used to train the model, which is then stored in Vertex AI. — Data Scientists & ML Engineers.
  • The Predictive BI part generates predictions from the data sources using the trained model. The output (or inference) is stored in BigQuery and can then be used for predictive dashboards in Looker. — ML Engineers & Data Engineers & Data Analysts.
  • The XAI Predictive BI part is the bonus one. It is the same as the Predictive BI part, with more insights into the model and data. — ML Engineers & Data Engineers & Data Analysts & Data Scientists.

Moving Towards ML for Data Analysts

“Predictive BI process is too long and too rigid!”

The emergence of managed ML tools and workflows has revolutionized the deployment and management of ML models. They are becoming easier and easier for Data Analysts to create and query.

Not all organizations have access to a complete team of:

  • Data Engineers,
  • ML Engineers,
  • Data Scientists,
  • and Data Analysts.

Yes, ML can provide a competitive advantage. But incorporating ML into traditional data analysis workflows is very challenging!

The “common” ML workflow we have seen above has limitations such as longer development cycles and reduced flexibility.

BigQuery ML to the Rescue

BigQuery ML is a game changer in the world of data analysis. Unlike traditional methods where data is exported to external AI tools, BigQuery ML brings AI directly into the data warehouse.

Here are some of the benefits of BigQuery ML:

  • Ease of use: Data Analysts can easily build machine learning models using SQL queries without extensive programming or machine learning knowledge.
  • Integration: BigQuery ML is fully integrated into BigQuery. There is no need to export data; inference runs on the input data and the results are stored, all in BigQuery.
  • Scalability: BigQuery ML can handle large datasets and can scale up or down depending on the size of the data, just like BigQuery.
  • Flexibility: BigQuery ML supports a wide range of ML algorithms, including linear regression, logistic regression, and K-means clustering.

Note: TensorFlow models are also supported. No need to choose between BigQuery ML and your favorite AI tools. You can have the best of both worlds.
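To give a feel for this ease of use, here is a minimal sketch of training a classification model in plain SQL. The dataset, table, and column names below are hypothetical:

CREATE OR REPLACE MODEL `<your_dataset_id>.customer_churn_model`
OPTIONS(
  MODEL_TYPE = 'LOGISTIC_REG',    -- a simple classification algorithm
  INPUT_LABEL_COLS = ['churned']  -- the column to predict
) AS
SELECT
  tenure_months,  -- every non-label column is used as a feature
  monthly_spend,
  churned
FROM `<your_dataset_id>.customers`

Training, evaluation, and prediction all stay inside the data warehouse: once the model exists, a single ML.PREDICT call scores new rows.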

ML is now more accessible to Data Analysts and it offers faster model training and deployment.

Let’s come back to our architecture diagram to see what changes with BigQuery ML.

Image from Author
  • The Descriptive BI stays unchanged. — Data Analysts & Data Engineers.
  • The ML Training part is now fully managed by BigQuery ML in plain SQL. Training data no longer needs to be extracted from BigQuery. — Data Analysts.
  • In the Predictive BI part, model inference is done with BigQuery ML in plain SQL. — Data Engineers & Data Analysts.
  • In the XAI Predictive BI part, model inference is also done with BigQuery ML in plain SQL. — Data Engineers & Data Analysts.

Note: ML Engineers and Data Scientists are no longer present in this diagram. This is intentional, to challenge traditional assumptions about the roles required in Machine Learning workflows.

Data Analysts are more and more autonomous!

Looker to the Rescue

Looker is not just a BI tool, it is a Data Analytics Platform that offers a range of advanced features.

One of its awesome features is the semantic layer. It sits on a direct connection to the underlying database and abstracts it away.

This layer greatly enhances the accessibility of data and helps Data Analysts interact with the database. They can query the data at its most granular level without having to create business-layer tables beforehand.

It is also integrated with BigQuery, making it easy to synchronize the two platforms. With this integration, data can be handled by Looker with views and/or by BigQuery with native tables and views, offering more flexibility in data management.

Image from Author
  • The Descriptive BI is now made from data sources with a direct connection between Looker and BigQuery. — Data Analysts & Data Engineers.
  • The ML Training stays unchanged. — Data Analysts.
  • The Predictive BI is now based on views. Inference tables no longer need to be created beforehand; predictions can be computed on the fly. — Data Engineers & Data Analysts.
  • The XAI Predictive BI is likewise based on views, with explanations computed on the fly. — Data Engineers & Data Analysts.

The Ultimate Combination

Pushed to the fullest potential, nearly anything that can be done with BigQuery can be achieved within the Looker Platform!

Data sources can remain unchanged, and the overall business layer can be constructed inside Looker with views within the semantic layer.

Even the BigQuery ML model training can be declared and scheduled in the semantic layer of Looker!

By combining BigQuery ML and Looker, Data Analysts can create an end-to-end predictive analytics solution that enables them to develop, deploy, and manage ML models all within Looker.

This approach can significantly reduce the time and cost involved in traditional ML workflows.

Image from Author
  • The Descriptive BI stays unchanged. — Data Analysts & Data Engineers.
  • The ML Training is operated within Looker. The model stored in BigQuery is defined in the semantic layer. — Data Analysts.
  • The Predictive BI no longer needs physical tables in BigQuery for inference data; they can be created on the fly. — Data Analysts.
  • The XAI Predictive BI is likewise based on views; inference tables no longer need to be created beforehand. — Data Analysts.

Wonderful! Starting from the data sources, Data Analysts no longer need Data Engineers, Data Scientists, or ML Engineers to deliver predictive BI.

Complete Demo with ARIMA Forecasting in Looker

Dataset

In this demo, we will replicate the architecture shown in the image above using the BigQuery public dataset bigquery-public-data.austin_waste.waste_and_diversion.

This dataset, created by the Austin city government, contains information about waste management, such as who generates it, who collects it, and where it goes.

Our objective is to predict the total load weight for the next year (52 weeks) for each type of waste, including recycling, tires, dead animals, and more.

To achieve this, we will create a historical dataset aggregated on a weekly basis.

CREATE TABLE `<your_dataset_id>.weekly_waste_and_diversion` AS
SELECT
  DATE_TRUNC(load_time, WEEK(MONDAY)) AS `load_week`,
  load_type,
  SUM(load_weight) AS `total_load_weight`,
FROM `bigquery-public-data.austin_waste.waste_and_diversion`
WHERE TRUE
  AND load_time BETWEEN '2005-01-01' AND '2019-12-01'
  /*
  I will select only 8 time series because I want to show them all :)
  */
  AND load_type IN (
    'BRUSH',
    'DEAD ANIMAL',
    'GARBAGE COLLECTIONS',
    'LITTER',
    'RECYCLING - SINGLE STREAM',
    'SWEEPING',
    'TIRES',
    'YARD TRIMMING'
  )
GROUP BY
  load_week,
  load_type
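Optionally, before training anything, a quick sanity check can confirm that each load type has a sensible weekly history. This verification query is a sketch and not part of the original pipeline:

SELECT
  load_type,
  MIN(load_week) AS first_week,
  MAX(load_week) AS last_week,
  COUNT(*) AS weeks_of_history
FROM `<your_dataset_id>.weekly_waste_and_diversion`
GROUP BY load_type
ORDER BY load_type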

LookML is the modeling language (not Machine Learning) used by Looker to declare the semantic layer. The semantic layer defines the relationships between the tables in your database.

You can define the measures (aggregated metrics) and dimensions (attributes) that will be used to analyze the data.

Note: If you are new to LookML, have a look at the comments provided in the code.

# A view represents a table of data. (native or not)
view: weekly_waste_and_diversion {
  sql_table_name: `<your_dataset_id>.weekly_waste_and_diversion` ;;

  # A dimension defines a representation of a column in a table.
  dimension: load_type {
    type: string
    sql: ${TABLE}.load_type ;;
  }

  # A dimension_group defines a set of related dimensions. (here in timeframes)
  dimension_group: load_week {
    type: time
    timeframes: [week, month, year]
    sql: ${TABLE}.load_week ;;
  }

  # A measure defines a field that represents information about multiple rows.
  # These are similar to aggregate functions in SQL.
  measure: total_load_weight {
    type: sum
    sql: ${TABLE}.total_load_weight ;;
  }
}

There are two common ways to define a view in the semantic layer. One way is to create the table beforehand and use it as a data source. This approach was taken here.

Alternatively, you can define the table within the semantic layer by using the derived_table keyword and the SQL statement we have above as the data source. It’s entirely up to you which approach you choose.

ML Model: ARIMA_PLUS

This time, we will use only the semantic layer, with the derived_table keyword, to create the model and schedule its training.

Regarding the training in the semantic layer, the datagroup_trigger will be defined later to automatically and periodically retrain the ML model.

view: weekly_waste_and_diversion_model {

  # A derived_table block defines how a derived table
  # should be calculated and materialized.
  derived_table: {

    # A datagroup_trigger specifies the datagroup the derived table belongs to.
    datagroup_trigger: bigquery_ml_training_datagroup

    # sql_create defines a full SQL CREATE statement
    # to execute to create a persistent derived table.
    sql_create:
      CREATE OR REPLACE MODEL ${SQL_TABLE_NAME}
      OPTIONS(
        MODEL_TYPE = 'ARIMA_PLUS',
        TIME_SERIES_TIMESTAMP_COL = 'load_week',
        TIME_SERIES_ID_COL = 'load_type',
        TIME_SERIES_DATA_COL = 'total_load_weight',
        DATA_FREQUENCY = 'WEEKLY',
        HORIZON = 52
      ) AS
      SELECT
        load_type,
        load_week,
        total_load_weight,
        -- Note: we can reference the training table without hard-coding its name.
      FROM ${weekly_waste_and_diversion.SQL_TABLE_NAME}
    ;;
  }
}

The SQL query here is exactly the same as the one used in the BigQuery Console to create our model and its training.

Even if you are not very familiar with BigQuery ML, you will find that this query is accessible to any Data Analyst.

First, we “create or replace” the ML model. The model type is ARIMA_PLUS, a time series algorithm that uses past observations to forecast future values.

BigQuery’s implementation of ARIMA_PLUS includes a range of features such as automatic model selection, detection and correction of outliers, holiday impacts, trends, seasonal patterns, and more. These features improve forecast accuracy and are included under the hood.

Each of these preprocessing steps is performed for you automatically, so you do not need extensive ML or preprocessing knowledge as a Data Analyst.

You can focus on insights.

Inside the CREATE statement, the model options are defined.

The ones set here indicate that we want to forecast total_load_weight over the WEEKLY time series column load_week, with a horizon of 52 weeks (one year).

The TIME_SERIES_ID_COL argument defined on load_type indicates that we want one forecast for each of the load types.

Even though I only have 8 time series, up to 100,000,000 time series can be forecast simultaneously with a single query using this option.

Finally, we specify the SQL query for the training data. That’s it!
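As a side note, once the model is trained, you can inspect what ARIMA_PLUS selected under the hood with the ML.ARIMA_EVALUATE function. A sketch, assuming the model is addressable by a plain BigQuery name (inside Looker, you would reference ${weekly_waste_and_diversion_model.SQL_TABLE_NAME} instead):

SELECT
  load_type,
  non_seasonal_p,    -- selected autoregressive order
  non_seasonal_d,    -- selected differencing order
  non_seasonal_q,    -- selected moving average order
  has_drift,
  seasonal_periods,  -- detected seasonality, e.g. YEARLY
  AIC                -- criterion used for automatic model selection
FROM ML.ARIMA_EVALUATE(
  MODEL `<your_dataset_id>.weekly_waste_and_diversion_model`,
  STRUCT(FALSE AS show_all_candidate_models)
)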

ML Model Inference

For model inference, it is also possible to bypass the creation of a physical table beforehand in BigQuery. We declare a view that contains both the historical and the forecasted data, combined with a UNION ALL statement.

Again, the SQL query here is exactly the same as we can use in the BigQuery Console to forecast data with a BigQuery ML model.

The forecast is done using the ML.FORECAST function, which takes as input the model and a STRUCT containing the forecasting horizon and the confidence level, which is set to 0.80.

The function returns several columns, including forecast_value, prediction_interval_lower_bound, and prediction_interval_upper_bound, which the view below renames to total_load_weight and its lower and upper bounds.

Note: Confidence intervals will not be shown in this article. Feel free to use them!
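For reference, here is what the standalone forecasting call looks like in the BigQuery Console, including the interval columns the view below keeps and the ones it drops (the model name is hypothetical, as above):

SELECT
  load_type,
  forecast_timestamp,
  forecast_value,
  standard_error,
  prediction_interval_lower_bound,
  prediction_interval_upper_bound,
  confidence_interval_lower_bound,
  confidence_interval_upper_bound
FROM ML.FORECAST(
  MODEL `<your_dataset_id>.weekly_waste_and_diversion_model`,
  STRUCT(52 AS horizon, 0.80 AS confidence_level)
)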

view: weekly_waste_and_diversion_history_and_forecast {
  derived_table: {
    # sql specifies the SQL SELECT statement that will be used to generate
    # this derived table as a CTE or a subquery.
    sql:
      SELECT
        load_type,
        load_week,
        total_load_weight,
        "history" AS `time_serie_type`,
        CAST(NULL AS FLOAT64) AS `total_load_weight_lower_bound`,
        CAST(NULL AS FLOAT64) AS `total_load_weight_upper_bound`,
      FROM ${weekly_waste_and_diversion.SQL_TABLE_NAME}
      UNION ALL
      SELECT
        load_type,
        forecast_timestamp AS `load_week`,
        forecast_value AS `total_load_weight`,
        "forecast" AS `time_serie_type`,
        prediction_interval_lower_bound AS `total_load_weight_lower_bound`,
        prediction_interval_upper_bound AS `total_load_weight_upper_bound`,
      FROM ML.FORECAST(
        MODEL ${weekly_waste_and_diversion_model.SQL_TABLE_NAME},
        STRUCT(52 AS `horizon`, 0.80 AS `confidence_level`)
      )
    ;;
  }

  dimension: load_type {
    type: string
    sql: ${TABLE}.load_type ;;
  }

  dimension_group: load_week {
    type: time
    # timeframes define the set of timeframe dimensions
    # the dimension_group will produce. (accessible in the Looker Explore UI)
    timeframes: [
      raw,
      week,
      month,
      quarter,
      year
    ]
    sql: ${TABLE}.load_week ;;
  }

  dimension: time_serie_type {
    type: string
    sql: ${TABLE}.time_serie_type ;;
  }

  measure: total_load_weight {
    type: sum
    sql: ${TABLE}.total_load_weight ;;
  }

  measure: total_load_weight_lower_bound {
    type: sum
    sql: ${TABLE}.total_load_weight_lower_bound ;;
  }

  measure: total_load_weight_upper_bound {
    type: sum
    sql: ${TABLE}.total_load_weight_upper_bound ;;
  }
}

Create Your Dashboard

Just one more file…

It specifies the connection to the BigQuery database where the data resides and includes all the views we created before.

The datagroup section is the missing piece needed to periodically retrain our model. We can set up a trigger that will retrain the BigQuery ML model weekly.

Finally, the explore, which contains the view combining historical and forecast data, is added to the Explore menu in Looker. The Explore menu of Looker can then be used to create visualizations.

# A connection is defined in the Admin Settings and 
# specifies the database connection from which a model will retrieve data.
connection: "bigquery_ml_waste_and_diversion_conn"

# Include all the views.
include: "/views/**/*.view"

# Retrain every week on Monday.
datagroup: bigquery_ml_training_datagroup {
  # A sql_trigger defines a trigger based on the provided SQL query.
  # It triggers the datagroup each time the query result changes.
  sql_trigger: SELECT DATE_TRUNC(CURRENT_DATE(), WEEK(MONDAY)) ;;
}

# An explore adds an existing view to the Explore menu of Looker.
# It will then be ready to use to create visualizations.
explore: weekly_waste_and_diversion_history_and_forecast {}

A few clicks later in the Explore Menu of Looker and Voilà!

Image from Author

Bonus: Making Predictions More Interpretable

Interpretability is an important aspect of predictive analytics.

End users must understand the reasoning behind the model predictions. eXplainable AI (XAI) provides insights into the model decision-making process, making the predicted outcomes more understandable.

BigQuery ML comes with a lot of XAI features.

Note: XAI functions will vary depending on the selected model type.

With ARIMA_PLUS, you can replace the ML.FORECAST function in the inference SQL query with ML.EXPLAIN_FORECAST. This opens up a new set of columns to explore the time series preprocessing and the detected trends.
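A sketch of the swap, with a hypothetical model name again (the decomposition columns available depend on the model and the data):

SELECT
  load_type,
  time_series_timestamp,
  time_series_type,        -- 'history' or 'forecast'
  time_series_data,
  trend,                   -- long-term trend component
  seasonal_period_yearly,  -- yearly seasonal component
  step_changes,            -- detected step changes
  spikes_and_dips          -- detected and corrected anomalies
FROM ML.EXPLAIN_FORECAST(
  MODEL `<your_dataset_id>.weekly_waste_and_diversion_model`,
  STRUCT(52 AS horizon, 0.80 AS confidence_level)
)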

Combined with Looker, model interpretability becomes accessible to your end users with almost no additional work! The different aspects of model performance and interpretability are generated dynamically in your visualizations.

To maximize the usefulness of XAI, think about the aspects of the predicted values your end-user would like to understand better.

In this example, I emphasize the interpretability of the historical data.

Here, we present the insights from the preprocessing stage of ARIMA_PLUS regarding the influence of seasonal variations on both historical and forecast values. Additionally, we highlight the detected step changes and spikes in the data.

Image from Author

Conclusion

Big changes are happening in how we use data to make decisions.

  • The traditional approach of analyzing historical data in BI is being challenged by the adoption of a predictive approach using Machine Learning models.
  • Machine Learning tools are becoming more accessible, both in their ease of use and integration (SQL with BigQuery ML) as well as in the technical expertise required in ML and preprocessing to set up a model.

These two small revolutions are shaking up the key roles in a company.

By combining BigQuery ML and Looker, Data Analysts can theoretically do without all the other ML and programming skills that traditional ML workflows require.

I hope this article inspires you to explore the combination of BigQuery ML and Looker, and even to create new solutions that help your business make more informed decisions about the future.
