Cloud Spanner — Is it possible to speed-up index creation?

Published in

Google Cloud - Community

6 min readApr 15, 2023

Like any other schema changes in Cloud Spanner, adding indexes to an existing database is an ONLINE task. This means that it does not require taking the database offline and does not lock entire columns or tables. Also, index backfill runs in the background using nodes resources as low priority, without affecting the performance of the database.

All these characteristics made adding indexes a non-disruptive task without needing a maintenance window to create them in our production environment. However, these tasks could take several hours to finish (this depends on several factors like table size and available resources).

Documentation mention something interesting in regards to index backfill:

In most cases, it is not possible to speed-up the process (e.g., by adding more nodes)

Why adding more resources wouldn’t help to improve index creation time? Does this process have some kind of resource limit that doesn’t allow it to run faster?

Let’s do some tests to understand how index creation tasks work and if it is possible to improve them.

Before starting, we need to set a baseline. Therefore I’m gonna create an index in a one node instance and measure the time needed to complete the task.

There will be no traffic on the instance during the tests to avoid any affectation over index creation performance because of external operations.

Table’s size is about 300Gb:

  spanner> SELECT
        ->   INTERVAL_END,
        ->   TABLE_NAME,
        ->   USED_BYTES/1024/1024/1024 as Used_GB
        -> FROM
        ->   SPANNER_SYS.TABLE_SIZES_STATS_1HOUR
        -> WHERE
        ->   INTERVAL_END = (
        ->   SELECT
        ->     MAX(INTERVAL_END)
        ->   FROM
        ->     SPANNER_SYS.TABLE_SIZES_STATS_1HOUR)
        -> ORDER BY
        ->   USED_BYTES DESC;
  +----------------------+------------+------------+
  | INTERVAL_END         | TABLE_NAME | Used_GB    |
  +----------------------+------------+------------+
  | 2023-04-03T21:00:00Z | usertable  | 329.404947 |
  +----------------------+------------+------------+

To launch index creation and monitoring its progress I’ll use gcloud CLI:

 $ gcloud spanner databases ddl update  dbtest --instance=test-sr --ddl="create index idx_field1 on usertable(field1)" --async
Schema update in progress. Operation name=projects/my-project/instances/test-sr/databases/dbtest/operations/_auto_op_286fa9fa7369685f

Index creation task needed about 3hs to finish:

$  gcloud spanner operations describe _auto_op_286fa9fa7369685f --instance=test-sr --database=dbtest
done: true
metadata:
  '@type': type.googleapis.com/google.spanner.admin.database.v1.UpdateDatabaseDdlMetadata
  commitTimestamps:
  - '2023-04-03T23:49:09.636234Z'
  database: projects/my-project/instances/test-sr/databases/dbtest
  progress:
  - endTime: '2023-04-03T23:49:09.636234Z'
    progressPercent: 100
    startTime: '2023-04-03T20:36:48.901235Z'
  statements:
  - CREATE INDEX idx_field1 ON usertable(field1)
name: projects/my-project/instances/test-sr/databases/dbtest/operations/_auto_op_286fa9fa7369685f
response:
  '@type': type.googleapis.com/google.protobuf.Empty

Now that we have set our baseline, we can repeat the operation with different node counts and compare the results.

Here is a summary of my tests:

As you can see, tripling the number of nodes we improved 2x index creation time.

Evidences:

Node count: 3

$ gcloud spanner instances describe test-sr | grep nodeCount
  nodeCount: 3

 $ gcloud spanner databases ddl update  dbtest --instance=test-sr --ddl="create index idx_field1_3n on usertable(field1)" --async
Schema update in progress. Operation name=projects/my-project/instances/test-sr/databases/dbtest/operations/_auto_op_ec7aea64338852dd

$ gcloud spanner operations describe _auto_op_ec7aea64338852dd  --instance=test-sr --database=dbtest
done: true
metadata:
  '@type': type.googleapis.com/google.spanner.admin.database.v1.UpdateDatabaseDdlMetadata
  commitTimestamps:
  - '2023-04-04T13:57:17.853399Z'
  database: projects/my-project/instances/test-sr/databases/dbtest
  progress:
  - endTime: '2023-04-04T13:57:17.853399Z'
    progressPercent: 100
    startTime: '2023-04-04T12:34:29.063434Z'
  statements:
  - CREATE INDEX idx_field1_3n ON usertable(field1)
name: projects/my-project/instances/test-sr/databases/dbtest/operations/_auto_op_ec7aea64338852dd
response:
  '@type': type.googleapis.com/google.protobuf.Empty

Node count: 9

$ gcloud spanner instances describe test-sr | grep nodeCount
nodeCount: 9

$ gcloud spanner databases ddl update  dbtest --instance=test-sr --ddl="create index idx_field1_9n on usertable(field1)" --async
Schema update in progress. Operation name=projects/my-project/instances/test-sr/databases/dbtest/operations/_auto_op_4997c28687750a60

$ gcloud spanner operations describe _auto_op_4997c28687750a60  --instance=test-sr --database=dbtest
done: true
metadata:
  '@type': type.googleapis.com/google.spanner.admin.database.v1.UpdateDatabaseDdlMetadata
  commitTimestamps:
  - '2023-04-04T15:46:18.772878Z'
  database: projects/my-project/instances/test-sr/databases/dbtest
  progress:
  - endTime: '2023-04-04T15:46:18.772878Z'
    progressPercent: 100
    startTime: '2023-04-04T15:11:39.295105Z'
  statements:
  - CREATE INDEX idx_field1_9n ON usertable(field1)
name: projects/my-project/instances/test-sr/databases/dbtest/operations/_auto_op_4997c28687750a60
response:
  '@type': type.googleapis.com/google.protobuf.Empty

Ok, with these tests we confirmed that adding nodes we can speed-up index creation.

But there is an important detail, we’re scaling up the node count before launching the index creation task. What happens if we add nodes once the backfill process is already running?

Let’s launch an index creation using 3 nodes and then scale up the node count to 9 once the index backfill is running:

You will notice a difference in the timeline between gcloud outputs and Cloud Console graphs. This is because Cloud Spanner instance and Cloud Console have different time zone.

$ gcloud spanner databases ddl update  dbtest --instance=test-sr --ddl="create index idx_field1_3to9n on usertable(field1)" --async
  Schema update in progress. Operation name=projects/my-project/instances/test-sr/databases/dbtest/operations/_auto_op_b892b9d5ca0fe69d


 $ gcloud spanner operations describe _auto_op_b892b9d5ca0fe69d  --instance=test-sr --database=dbtest
done: true
metadata:
  '@type': type.googleapis.com/google.spanner.admin.database.v1.UpdateDatabaseDdlMetadata
  commitTimestamps:
  - '2023-04-04T17:17:51.070423Z'
  database: projects/my-project/instances/test-sr/databases/dbtest
  progress:
  - endTime: '2023-04-04T17:17:51.070423Z'
    progressPercent: 100
    startTime: '2023-04-04T16:01:35.191838Z'
  statements:
  - CREATE INDEX idx_field1_3to9n ON usertable(field1)
name: projects/my-project/instances/test-sr/databases/dbtest/operations/_auto_op_b892b9d5ca0fe69d
response:
  '@type': type.googleapis.com/google.protobuf.Empty

The operation took 70 minutes as well as the first test using 3 nodes.

Let’s take a look to the metrics to understand why there was no improvement in the process:

It’s clear how the CPU usage didn’t scale up despite we’ve added nodes to the instance.

Tripling the number of nodes the cpu usage dropped 3x, indicating that the backfill resources usage is the same as when it started with 3 nodes.

So, scaling up the node count once the backfill process has started doesn’t help to improve it.

Why? Below my thoughts:

After an index creation task starts, there is a “preparation phase” where Cloud Spanner defines the parallelism degree for the index backfill. This parallelism degree is based on the node count of the instance and the available CPU. Therefore, once the backfill process has started, adding nodes won’t speed-up it.

We can observe this phase and the beginning of the index backfill using Key Visualizer Introspection Tool:

Index creation was launched at 1:01PM but its backfill process started ten minutes later. Since we added nodes after 1:11PM (when index backfill started), it didn’t increase backfill parallelism degree.

Ok, if this is true, not only adding nodes does not improve the process but also reducing the number of nodes should not affect the degree of parallelism.

To confirm our conclusions, let’s do one more test. This time we’ll launch the index creation starting with 9 nodes. Then, after the backfill process starts we’ll scale down the instance to a node count of 3.

$ gcloud spanner databases ddl update  dbtest --instance=test-sr --ddl="create index idx_field1_9to3n on usertable(field1)" --async
Schema update in progress. Operation name=projects/my-project/instances/test-sr/databases/dbtest/operations/_auto_op_231262b7205cf15a

After scaling down the number of nodes the total CPU usage growth reaches 100%.

Now, let’s check the CPU utilization in our test running with 3 nodes to compare with our current test:

In that case, total CPU usage reached 50%.

This means that in our last test, after reducing the node counts the index backfill steel runs with the parallelism degree defined with 9 nodes.

What about the amount of time needed to complete the operation?

$ gcloud spanner operations describe _auto_op_231262b7205cf15a  --instance=test-sr --database=dbtest
done: true
metadata:
  '@type': type.googleapis.com/google.spanner.admin.database.v1.UpdateDatabaseDdlMetadata
  commitTimestamps:
  - '2023-04-04T18:41:53.082363Z'
  database: projects/my-project/instances/test-sr/databases/dbtest
  progress:
  - endTime: '2023-04-04T18:41:53.082363Z'
    progressPercent: 100
    startTime: '2023-04-04T17:46:21.186297Z'
  statements:
  - CREATE INDEX idx_field1_9to3n ON usertable(field1)
name: projects/my-project/instances/test-sr/databases/dbtest/operations/_auto_op_231262b7205cf15a
response:
  '@type': type.googleapis.com/google.protobuf.Empty

It took 55 minutes, this is 35% faster than our first test using 3 nodes.

In summary, no matter if you add/remove nodes once the backfill process starts, it will keep using the same parallel degree and resource limits defined in its initial phase.

Conclusions

Cloud Spanner decides parallelism degree and resource limit for index backfill based on instance node count and available CPU.
You can add nodes before the index backfill starts to improve index creation time, but adding nodes after the backfill process starts doesn’t help to speed up the process.
Removing nodes while the index backfill is running could cause up to 100% of total CPU utilization.

Cloud Spanner — Is it possible to speed-up index creation?

Conclusions

Written by Pablo Arrojo