How we mapped out USD 15K+ cost reduction per Month for our Cloud Data Platform

MD Tanvir Alam
The PropertyGuru Tech Blog
8 min read · Apr 4, 2023

--

The recent increase in our cloud bills alarmed us to some extent. Our stakeholders prioritized the investigation, and we started digging into the causes and possible optimizations. Our initial observation was that BigQuery accounted for 60% of the bill and was growing every month, so we took this as our starting point.

Disclaimer: By “mapped out,” I mean planned out. Some of these optimizations delivered results immediately, while others have been approved for implementation and are still in progress.

Cover pic credit: AWS (with edits)

Cost optimisation for BigQuery Analysis, Expected savings of USD 10,000 Per Month

Luckily, we had a dashboard to track BQ usage, including Most Usage by User, Top Tables by Frequency, and Top Queries by Processed Bytes. From there, we noticed that the top users were scanning terabytes of data in single queries, mostly issued from Looker dashboards. We could have attributed this to the desire to build user-friendly dashboards without a proper understanding of what goes on behind the scenes. Instead of assigning blame, we looked for a solution by analyzing the top 20 queries to see if they could benefit from partitioning and clustering.
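For readers without such a dashboard, a view similar to “Top Queries by Processed Bytes” can be approximated from INFORMATION_SCHEMA.JOBS. This is a minimal sketch of my own, not our dashboard’s actual definition; the 30-day window and the LIMIT are assumptions:

-- Top 20 query texts by total bytes processed over the last 30 days (sketch).
SELECT
  user_email,
  query,
  COUNT(*) AS run_count,
  ROUND(SUM(total_bytes_processed) / POW(1024, 4), 2) AS total_tib_processed
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS
WHERE
  creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
GROUP BY
  user_email, query
ORDER BY
  total_tib_processed DESC
LIMIT 20;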

Step 1: Try Partitioning and Clustering

We found that the tables behind all of the top queries were already well partitioned, mostly by data ingestion date, so there was nothing more to gain in this direction.

Then we decided to add clustering to the top tables. Our initial application of clustering resulted in an 88% decrease in data processed, which excited us, and the whole team jumped in to apply clustering everywhere else. However, of the top 20 tables in the experiment, only two showed any further improvement. Most queries filtered on fields nested inside STRUCT columns, which cannot be used as clustering columns. Others filtered on columns with mostly identical values, where clustering cannot help much. This took some time, and we concluded that clustering alone could not reduce our costs.

If you want to know how to do partitioning and clustering, this is the simplest guide we found:
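As a minimal illustration (not the guide referenced above), this is roughly what a partitioned and clustered table declaration looks like in BigQuery DDL; the project, dataset, table, and column names are made up:

-- Illustrative only: hypothetical project, dataset, and column names.
CREATE TABLE `my-project.analytics.page_events`
PARTITION BY DATE(event_timestamp)
CLUSTER BY country_code, device_type
AS
SELECT * FROM `my-project.staging.page_events_raw`;

-- Queries only benefit when they filter on the partition and/or clustering columns,
-- e.g. WHERE DATE(event_timestamp) = '2023-04-01' AND country_code = 'SG'.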

Step 2: Try different pricing model for BQ

Our initial investigation showed that the most expensive queries were not benefiting from basic query optimizations. Under our on-demand pricing model, the number of bytes processed is the main driver of the bill, so we decided to investigate whether a different pricing model could help.

Challenges and Risks involved

The alternative pricing model is flat-rate, slot-based billing. It would let us commit to a number of slots for a period of time and pay a flat rate, so we would no longer have to worry about the terabytes scanned by queries. The problem is identifying the optimum number of slots that can support our workload and reduce costs at the same time. All hell can break loose if we commit to fewer slots than we need; if we commit to more, costs don’t come down. And if we experiment and then want to return to on-demand, we still have to pay for the committed time as an additional charge.
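As a baseline for that comparison, the on-demand side can be approximated directly from INFORMATION_SCHEMA.JOBS. This sketch is my own, not part of our original analysis; the USD 5 per TiB rate is an assumption based on list pricing at the time and should be checked against current pricing:

-- Rough on-demand spend over the last 30 days (the 5 USD/TiB rate is an assumption).
SELECT
  ROUND(SUM(total_bytes_billed) / POW(1024, 4), 2) AS tib_billed,
  ROUND(SUM(total_bytes_billed) / POW(1024, 4) * 5, 2) AS approx_on_demand_usd
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS
WHERE
  creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY';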

How we determined the number of slots needed for us

From BigQuery Monitoring: Not enough data

Our first step was to look at BigQuery Monitoring. We examined the Slot Usage chart with the configuration Chart = Slot Usage, Metric = Average, Group By = Overall. This provided us with the overall slot usage for just one month.

Grouping by Job helped us identify the top jobs consuming extra slots, which surfaced some optimization hints as well.

But one month of usage data was not enough to make decisions for such a large data platform; we needed more. So we started checking INFORMATION_SCHEMA.

From INFORMATION_SCHEMA: jackpot hit

INFORMATION_SCHEMA.JOBS retains job history for the past 180 days (roughly six months), which was enough for our research.

First, we tried a GCP-suggested example to calculate the average slot utilization for all queries over the past 7 days for a given project. The query is the same as the one provided on this GCP page.

https://cloud.google.com/bigquery/docs/information-schema-jobs

SELECT
SUM(total_slot_ms) / (1000 * 60 * 60 * 24 * 7) AS avg_slots
FROM
`region-us`.INFORMATION_SCHEMA.JOBS
WHERE
-- Filter by the partition column first to limit the amount of data scanned.
-- Eight days allows for jobs created before the 7 day end_time filter.
creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 8 DAY) AND CURRENT_TIMESTAMP()
AND job_type = 'QUERY'
AND end_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY) AND CURRENT_TIMESTAMP();

We ran this for our major GCP projects, and the results raised our hopes considerably: the average utilization was very low.

Next, we tried to determine the per-job average number of slots across the full job history.

SELECT
  job_id,
  creation_time,
  end_time,
  total_slot_ms,
  TIMESTAMP_DIFF(end_time, creation_time, MILLISECOND) AS job_duration_ms,
  -- Average slots = slot-milliseconds consumed divided by the job's run time in milliseconds.
  SAFE_DIVIDE(total_slot_ms, TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)) AS avg_slots
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS

WOW! The average number of slots over six months of data hinted that we were going in the right direction. But the maximum values worried us: some jobs showed huge spikes, and those spikes were skewing the overall average.
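To see how far those spikes sat from a typical job, the per-job averages can be summarized as percentiles. This variant is my own addition, not part of our original investigation:

-- Distribution of per-job average slot usage (QUERY jobs only).
WITH per_job AS (
  SELECT
    SAFE_DIVIDE(total_slot_ms, TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)) AS avg_slots
  FROM
    `region-us`.INFORMATION_SCHEMA.JOBS
  WHERE
    job_type = 'QUERY'
)
SELECT
  APPROX_QUANTILES(avg_slots, 100)[OFFSET(50)] AS p50_slots,
  APPROX_QUANTILES(avg_slots, 100)[OFFSET(90)] AS p90_slots,
  APPROX_QUANTILES(avg_slots, 100)[OFFSET(99)] AS p99_slots,
  MAX(avg_slots) AS max_slots
FROM per_job;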

Next, I checked how many jobs were far above the acceptable range by adding a WHERE clause on the accepted maximum, denoted here as XXX. For example, if the charts suggest slot usage should not exceed 1,500 but some jobs show 7,000, then XXX here would be 1500.

SELECT
  job_id,
  creation_time,
  end_time,
  total_slot_ms,
  TIMESTAMP_DIFF(end_time, creation_time, MILLISECOND) AS job_duration_ms,
  SAFE_DIVIDE(total_slot_ms, TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)) AS avg_slots
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS
WHERE
  -- XXX is the accepted maximum average slot usage (1500 in the example above).
  SAFE_DIVIDE(total_slot_ms, TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)) > XXX
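The count of jobs crossing the threshold, and their share of all jobs, can also be folded into a single aggregate. Again a sketch of my own, with XXX standing in for the accepted maximum as above:

-- Share of QUERY jobs whose average slot usage exceeds the accepted maximum (XXX).
SELECT
  COUNTIF(SAFE_DIVIDE(total_slot_ms, TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)) > XXX) AS jobs_over_limit,
  COUNT(*) AS total_jobs,
  ROUND(100 * COUNTIF(SAFE_DIVIDE(total_slot_ms, TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)) > XXX) / COUNT(*), 2) AS pct_over_limit
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS
WHERE
  job_type = 'QUERY';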

OMG, out of millions of records, only about 0.06% crossed the accepted range. This confirmed that we did not need the huge number of slots that GCP suggests for a performance boost. Our BigQuery usage is for analysis, not low-latency performance; for that, we have pipelines that push data into faster serving databases. But was it really that easy? Had we missed something? We were still nervous about making such a big change, since getting it wrong could be very costly.

Consultation with Google Engineers: a safer way to execute with Flex Slot commitments

Next, we requested direct help from GCP. We had live sessions with their senior engineers and shared our findings and plans. The sessions were a great confidence booster: they agreed with our findings and suggested the same direction, but guided us towards a safer way with the least effort, which is to use Flex Slot commitments for 3 to 5 days for the entire organization. Flex Slot commitments are billed hourly, so we would commit to the minimum spend, and if things did not go according to plan, we could switch back to on-demand. Based on the number of slots we planned for, the projected savings came to about USD 10,000 per month, which is the figure used above.

Cost optimisation for BigQuery Storage, Expected savings of USD 3500+ Per Month

Although flat-rate pricing will help with analysis costs, storage costs are not affected by this change. We noticed our storage cost was also huge, and long-term storage was contributing to nearly half of it. A quick note on long-term storage: this is data in tables or partitions that have not been modified for 90 consecutive days. Although cheaper, it is still a cost. Remember, the switch to long-term pricing happens per partition, not for the whole table at once.
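A quick way to see where long-term storage sits is to aggregate INFORMATION_SCHEMA.TABLE_STORAGE per dataset. This sketch is my own; project-xxx is a placeholder, as in the queries below:

-- Active vs long-term logical storage per dataset (project-xxx is a placeholder).
SELECT
  table_schema AS dataset,
  ROUND(SUM(active_logical_bytes) / POW(1024, 4), 2) AS active_tib,
  ROUND(SUM(long_term_logical_bytes) / POW(1024, 4), 2) AS long_term_tib
FROM
  `project-xxx`.`region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
GROUP BY
  table_schema
ORDER BY
  long_term_tib DESC;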

INFORMATION_SCHEMA.TABLE_STORAGE to the rescue again. We first checked tables that had not been modified in the last 365 days; the plan was to eliminate them effortlessly.

SELECT
  project_id,
  table_schema,
  table_name,
  total_logical_bytes,
  active_logical_bytes,
  long_term_logical_bytes,
  storage_last_modified_time,
  deleted,
  total_logical_bytes / POW(1024, 4) AS total_logical_tb
FROM
  `project-xxx`.`region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE
  -- Tables whose storage has not been modified in the last 365 days.
  DATE(storage_last_modified_time) < DATE(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 365 DAY))
ORDER BY
  storage_last_modified_time DESC,
  total_logical_bytes DESC;

Alas, this gave us 39K records but only about 10% savings on the cost of long-term storage. A lot of effort for too little return. Still, we looked for a pattern and discovered that around 50% of that storage came from roughly five tables and their partitions. Phew, still some quick savings.
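Finding those few heavy tables is just a matter of sorting the same stale set by size, roughly along these lines (again a sketch, project placeholder as before):

-- Largest tables among those not modified for 365+ days.
SELECT
  table_schema,
  table_name,
  ROUND(total_logical_bytes / POW(1024, 4), 2) AS total_logical_tb
FROM
  `project-xxx`.`region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE
  DATE(storage_last_modified_time) < DATE(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 365 DAY))
ORDER BY
  total_logical_bytes DESC
LIMIT 20;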

Then came the tricky part. We thought of checking for backup or copy tables. Looking at our BQ dataset naming patterns, we found that most backup tables used “_copy” or “bk*” in their names. Out of curiosity, we searched for them in INFORMATION_SCHEMA by simply adding ‘%copy%’ or ‘%bk%’ to the WHERE clause. This time we did not include the 365-day condition, but we were only interested in tables larger than half a terabyte. So we added

WHERE total_logical_bytes / POW(1024, 4) >= 0.5
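Putting the pattern match and the size filter together, the lookup looked roughly like this; the LIKE patterns and the 0.5 TB cut-off come from our case, while the column list is trimmed for brevity:

-- Candidate backup/copy tables larger than ~0.5 TB (project-xxx is a placeholder).
SELECT
  table_schema,
  table_name,
  ROUND(total_logical_bytes / POW(1024, 4), 2) AS total_logical_tb,
  storage_last_modified_time
FROM
  `project-xxx`.`region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE
  (LOWER(table_name) LIKE '%copy%' OR LOWER(table_name) LIKE '%bk%')
  AND total_logical_bytes / POW(1024, 4) >= 0.5
ORDER BY
  total_logical_bytes DESC;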

WOW! We found only 45 tables, but they accounted for half of our storage costs! Jackpot! That was an easy, satisfying reduction in costs.

Cost optimisation for Elasticsearch, savings USD 3200+ Per Month, Least effort

This was easy. We looked at the AWS billing report for Elasticsearch and noticed several very large instances contributing major costs. Then we checked all our environments. Voila! We found four large Elasticsearch instances that we could have gotten rid of a long time ago. Each of them was incurring a monthly bill of around USD 800 and had very low usage, so we scaled them down to an appropriate size. Wow, that was the quickest 3K; someone just had to take a look :)

Cost optimisation for CloudWatch logs, savings around USD 3500+ Per Month, minimum effort

This one was also surprising. CloudWatch was contributing to our costs at an alarming rate as well. But again, this was a quick fix; someone just had to take a look.

Dashboards

We had created lots of CloudWatch dashboards, and each dashboard incurs a separate charge. We did not need any of them, since we maintain separate dashboards elsewhere for our use. Deleted all, quick bucks.

Alarms

CloudWatch alarms and their underlying metrics were contributing major costs, and we discovered plenty of unnecessary ones. We got rid of about 80% of the alarms, which saved a significant amount.

Logs ingestion

After investing some time, we discovered that many log groups had their retention set to “Never Expire”. For the Dev and Staging environments, we changed all of them to a very low retention period. For Production, we changed them to reasonable retention periods.

For most of the log groups contributing to the high costs, we dug in and found that the volume was coming from debug dumps left in the code. Easy fix.

Credit

Arnab Biswas (Mission Chief)
Tanvir Alam (Lead)
Rahul Baid (Special Agent)
Rizwan Ahmed (Special Agent)
Mahfuzealahi Noman (Special Agent)
Jawahar (Analyst)
