Unleashing the Power of Geospatial Data with DBSCAN Clustering in BigQuery

Rahul Ravi
4 min read · Apr 23, 2023

One of the most powerful tools for analyzing geospatial data is DBSCAN clustering, which can be used to identify patterns and relationships in location data.

[Figure: DBSCAN Algorithm]

With the advent of cloud-based data platforms like BigQuery, performing DBSCAN clustering on large datasets is easier than ever.

[Figure: Geospatial analysis in minutes!]

In this article, we’ll cover the basics of DBSCAN clustering, I’ll demonstrate how to run it in BigQuery for geospatial analysis, and we’ll look at several real-world applications, from optimizing bike rental operations to identifying high-traffic areas for retail businesses. Whether you’re a data scientist, analyst, or business owner, you should come away with a practical sense of what DBSCAN can do with location data.

What is DBSCAN?

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular clustering algorithm in machine learning and data mining. As a density-based algorithm, it groups together data points that are packed closely in space, so each cluster is a region of high point density. It is particularly useful for datasets that contain noise or outliers: points in sparse regions are labeled as noise and excluded from the clusters. This makes it a good fit for applications such as logistics monitoring and planning, or analyzing customer behavior and preferences based on location.

To see how the algorithm works, check out this awesome video by Josh Starmer on Clustering with DBSCAN.
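For readers who prefer code to video, here is a minimal sketch of the textbook algorithm in plain Python. It is an illustration only, not BigQuery's implementation: the function name `dbscan`, its parameters, and the toy points are all made up for this example, and the brute-force neighbor search would be far too slow for a real dataset.

```python
import math
from collections import deque

NOISE = None  # noise points get no cluster label


def dbscan(points, eps, min_points, distance=math.dist):
    """Return one label per point: a cluster number (0, 1, ...) or NOISE.

    A point is a "core" point if at least `min_points` points
    (itself included) lie within `eps` of it.
    """
    n = len(points)

    def neighbors(i):
        # Brute force: compare point i against every point, itself included.
        return [j for j in range(n) if distance(points[i], points[j]) <= eps]

    labels = [NOISE] * n
    visited = [False] * n
    cluster_id = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        seeds = neighbors(i)
        if len(seeds) < min_points:
            continue  # not a core point; stays noise unless a cluster claims it
        labels[i] = cluster_id
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] is NOISE:
                labels[j] = cluster_id  # core or border point joins the cluster
            if visited[j]:
                continue
            visited[j] = True
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_points:
                queue.extend(j_neighbors)  # only core points grow the cluster
        cluster_id += 1
    return labels


# Toy data: two tight squares of points and one far-away outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_points=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, None]
```

The two squares become clusters 0 and 1, while the isolated point has no dense neighborhood and is left as noise.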

BigQuery made DBSCAN easy!

I’m a huge fan of BigQuery. It has quite a few tricks up its sleeve, and one of the most powerful features I have come across is its support for geospatial analysis, including DBSCAN clustering.

Now let's see this in action! I have chosen the New York Citi Bike public dataset in BigQuery for this example.

SELECT
  station_id,
  name,
  latitude,
  longitude,
  -- ST_GEOGPOINT takes longitude first, then latitude
  ST_GEOGPOINT(longitude, latitude) AS location,
  -- epsilon = 500 meters, minimum_geographies = 10; noise rows get NULL
  CAST(
    ST_CLUSTERDBSCAN(ST_GEOGPOINT(longitude, latitude), 500, 10) OVER () AS STRING
  ) AS cluster_label
FROM
  `bigquery-public-data.new_york_citibike.citibike_stations`

We use BigQuery's ST_CLUSTERDBSCAN function to cluster the stations by geographic proximity, with a radius (epsilon) of 500 meters and a minimum of 10 stations per cluster (minimum_geographies).
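To get a feel for what the two numbers mean for a single station, here is a small Python sketch. It assumes a spherical Earth and uses the haversine formula, which only approximates the distances BigQuery computes. The names `haversine_m` and `is_core` and the coordinates are hypothetical, chosen for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; a spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def is_core(station, stations, eps_m=500, min_points=10):
    """True if at least `min_points` stations (itself included) lie within `eps_m`."""
    lat, lon = station
    nearby = sum(1 for s in stations
                 if haversine_m(lat, lon, s[0], s[1]) <= eps_m)
    return nearby >= min_points


# Hypothetical coordinates (not real Citi Bike stations):
# ten points in a line, roughly 44 m apart, plus one isolated point.
stations = [(40.7500 + i * 0.0004, -73.9900) for i in range(10)]
lonely = (40.8000, -73.9500)

print(is_core(stations[0], stations + [lonely]))  # True
print(is_core(lonely, stations + [lonely]))       # False
```

The ten packed points all sit within about 400 m of each other, so each one has the required 10 neighbors; the isolated point has only itself and would end up as noise.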

Stations that are close to each other are assigned to the same cluster, whereas noise points (outliers) get a NULL label. This lets us identify areas with a high or low density of bike rental stations, along with the outliers.
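Once the labels are back, a quick summary of cluster sizes is often the first thing worth looking at. The sketch below assumes the query results have been fetched into Python as (station_id, cluster_label) pairs by a client that maps NULL to None; the rows and the `summarize` helper are made up for illustration.

```python
from collections import Counter

# Hypothetical rows shaped like the query output: (station_id, cluster_label).
rows = [
    ("s1", "0"), ("s2", "0"), ("s3", "0"),
    ("s4", "1"), ("s5", "1"),
    ("s6", None),  # a noise station: NULL label, seen here as None
]


def summarize(rows):
    """Return (stations per cluster, number of noise stations)."""
    sizes = Counter(label for _, label in rows if label is not None)
    noise = sum(1 for _, label in rows if label is None)
    return dict(sizes), noise


cluster_sizes, noise_count = summarize(rows)
print(cluster_sizes)  # {'0': 3, '1': 2}
print(noise_count)    # 1
```

Large clusters point to dense areas, and the noise count tells you how many stations sit on their own.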

Tip: You can start your analysis right away with Explore data -> Explore data with Looker Studio. Plot your data using a Filled map, with location as the Location field and cluster_label as the Color dimension.

It really is as easy as it looks. It took me only a couple of minutes to turn a dataset into a story. Without BigQuery, I would have had to write quite a few lines of Python to achieve the same result.

How can this be used?

Some real-world applications of DBSCAN that I can think of or have used in the past:

  1. Identify areas with high foot traffic, such as those near a popular tourist attraction or shopping center. By clustering geospatial data points with DBSCAN, a business can find the areas with the highest customer density and adjust its marketing and promotional efforts accordingly.
  2. Cluster geospatial data points by the types of food customers typically order in a particular area. This information can then be used to tailor a restaurant’s menu to the needs and preferences of customers in that location.
  3. Identify the areas with the highest concentration of delivery orders, and optimize delivery routes and staffing to meet customer demand.
  4. Cluster bike rental stations by proximity: identify areas with a high density of stations, as well as lower-density areas where additional stations could improve the availability of bikes.
  5. Identify customer groups by rental behavior: DBSCAN can cluster bike rental trips on features such as rental duration, frequency, and distance traveled. This can reveal customer segments with different rental behaviors and preferences, which can inform marketing and promotional efforts.
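One practical note on the last idea: ST_CLUSTERDBSCAN works on geographies, so clustering on behavioral features would most likely need a general-purpose DBSCAN implementation. Because features such as minutes and kilometers live on different scales, a single distance threshold only makes sense after rescaling. Here is a minimal sketch of that step, with made-up trips and a hypothetical `standardize` helper:

```python
import math
from statistics import mean, pstdev

# Hypothetical trips: (duration in minutes, distance in km).
trips = [(10.0, 2.0), (20.0, 4.0), (30.0, 6.0)]


def standardize(rows):
    """Rescale each column to zero mean and unit variance (z-scores)."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]  # guard against constant columns
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds))
            for row in rows]


scaled = standardize(trips)
# After scaling, both features contribute equally to the distance.
print(round(math.dist(scaled[0], scaled[1]), 3))  # 1.732
```

Without this step, the feature with the largest raw numbers would dominate the distance and, with it, the clusters.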

I hope you find this useful and share it with peers who might find it helpful in their day-to-day work. Cheers!

Rahul Ravi

Optimising one Bit at a time. Sharing solutions for day-to-day tech problems. Tech & Analytics Manager for a Billion $ Food & Real-estate Tech company