How to run Nextflow in GCP using Cloud Batch?

Anand Sai
4 min read · Jan 4, 2023

Introduction

Nextflow is an open-source workflow engine that makes it easy to run complex pipelines on a variety of platforms, including cloud computing environments. In this article, we will look at how to use Nextflow in Google Cloud Platform (GCP) with Cloud Batch.

Google Cloud Batch is a fully managed service for running batch computing workloads on Google Cloud Platform. It scales workloads on demand and supports a wide variety of batch jobs, including data processing, image rendering, machine learning training, and more. Google Cloud Batch makes it easy to execute batch workloads reliably and at scale, and can help you save money, since you pay only for the compute resources you use.

By following the steps outlined in this article, you will be able to run your Nextflow pipelines on GCP using Cloud Batch, taking advantage of the scalability and reliability of GCP to process large amounts of data. This is particularly useful for pipelines that require a lot of compute resources or that need to be run on a regular basis.

In the following sections, we will cover the prerequisites for running Nextflow in GCP with Cloud Batch, as well as the steps for writing a Nextflow pipeline script, setting up Cloud Storage, creating a Cloud Batch job, and submitting and monitoring a job. Whether you are new to Nextflow or an experienced user, this article will provide you with the information you need to get started with Nextflow in GCP using Cloud Batch.

Prerequisites

  1. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
  2. Make sure that billing is enabled for your Cloud project. Learn how to check whether billing is enabled on a project.
  3. Enable the Cloud Batch, Compute Engine, and Cloud Storage APIs. You can enable all three with the following gcloud command:
gcloud services enable batch.googleapis.com compute.googleapis.com storage.googleapis.com
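As a quick sanity check (optional), you can confirm the APIs were enabled by listing the project's enabled services and filtering for them:

```shell
# Verify that each required API appears among the enabled services.
gcloud services list --enabled | grep -E 'batch|compute|storage'
```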

Create a Cloud Storage bucket

A Google Cloud Storage bucket is required to store input files, temporary files, and output files. In this article we will use a single bucket for both temporary and output files. To create a new bucket, run the following command:

gsutil mb gs://<BUCKET_NAME>

Create a service account and add roles

Create a service account and grant it the required IAM roles using the steps below.

  1. Set up variables for creating the service account. Replace <PROJECT_ID> with your Google Cloud project ID.
export PROJECT_ID=<PROJECT_ID>
export SERVICE_ACCOUNT_NAME=nextflow-service-account
export SERVICE_ACCOUNT_ADDRESS=${SERVICE_ACCOUNT_NAME}@${PROJECT_ID}.iam.gserviceaccount.com

2. Create the service account.

gcloud iam service-accounts create ${SERVICE_ACCOUNT_NAME}

3. The service account needs the following IAM roles:

  • roles/batch.jobsEditor
  • roles/batch.agentReporter
  • roles/iam.serviceAccountUser
  • roles/serviceusage.serviceUsageConsumer
  • roles/storage.objectAdmin
  • roles/logging.admin
for ROLE in \
    roles/batch.jobsEditor \
    roles/batch.agentReporter \
    roles/iam.serviceAccountUser \
    roles/serviceusage.serviceUsageConsumer \
    roles/storage.objectAdmin \
    roles/logging.admin; do
  gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member serviceAccount:${SERVICE_ACCOUNT_ADDRESS} \
    --role ${ROLE}
done
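As an optional check, you can confirm the bindings were applied by inspecting the project's IAM policy; the flatten/filter pattern below is one common way to show only the roles held by this service account:

```shell
# Show the roles granted to the Nextflow service account on the project.
gcloud projects get-iam-policy ${PROJECT_ID} \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:${SERVICE_ACCOUNT_ADDRESS}" \
  --format="table(bindings.role)"
```

The output should list the six roles granted above.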

Install and configure Nextflow in Cloud Shell

Google Cloud Batch support requires a recent (edge) release of Nextflow. To install and configure Nextflow in Cloud Shell, follow these steps:

  1. Install Nextflow.
curl -s https://get.nextflow.io | bash

2. Set the NXF_EDGE variable so that the installer tracks edge releases.

export NXF_EDGE=1

3. Update Nextflow to the latest edge release.

./nextflow self-update

4. To clone the sample pipeline repository, which includes both the pipeline to be executed and the sample data it requires, run the following command:

git clone https://github.com/nextflow-io/rnaseq-nf.git

5. To configure Nextflow, perform the following actions:

5a. Change to the rnaseq-nf folder and check out the v2.0 tag.

cd rnaseq-nf
git checkout v2.0

5b. Add the following profile inside the profiles block of the nextflow.config file. Replace <BUCKET_NAME>, <PROJECT_ID>, and <SERVICE_ACCOUNT_ADDRESS> with your own values.

    'google-batch' {
        params.transcriptome = 'gs://rnaseq-nf/data/ggal/transcript.fa'
        params.reads = 'gs://rnaseq-nf/data/ggal/gut_{1,2}.fq'
        params.multiqc = 'gs://rnaseq-nf/multiqc'
        params.outdir = 'gs://<BUCKET_NAME>/output'
        process.executor = 'google-batch'
        process.container = 'nextflow/rnaseq-nf:latest'
        workDir = 'gs://<BUCKET_NAME>/work_dir'
        google.location = 'us-central1'
        google.region = 'us-central1'
        google.project = '<PROJECT_ID>'
        google.batch.serviceAccountEmail = '<SERVICE_ACCOUNT_ADDRESS>'
    }


5c. Change back to the previous folder.

cd ..

Execute the pipeline using Nextflow

To run the pipeline using Nextflow, execute the following command. Please note that once the pipeline has been initiated, it will keep running until completion, which may take up to 10 minutes.

NXF_VER=22.12.0-edge ./nextflow run rnaseq-nf/main.nf -profile 'google-batch'
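While the pipeline runs, you can watch the underlying Batch jobs from a second Cloud Shell tab. A minimal sketch, assuming the jobs run in us-central1 as configured above; <JOB_NAME> is a placeholder for a job name taken from the list output:

```shell
# List the Cloud Batch jobs created by Nextflow in the configured region.
gcloud batch jobs list --location=us-central1

# Describe a specific job to see its state and task details.
# <JOB_NAME> is a placeholder, not a real job name.
gcloud batch jobs describe <JOB_NAME> --location=us-central1
```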

Upon completion of the pipeline, the following message will be displayed:

N E X T F L O W  ~  version 22.12.0-edge
Launching `rnaseq-nf/main.nf` [cheesy_coulomb] DSL2 - revision: ef908c0bfd
R N A S E Q - N F P I P E L I N E
===================================
transcriptome: gs://rnaseq-nf/data/ggal/transcript.fa
reads : gs://rnaseq-nf/data/ggal/gut_{1,2}.fq
outdir : gs://<BUCKET_NAME>/output

executor > google-batch (4)
[08/b833e8] process > RNASEQ:INDEX (transcript) [100%] 1 of 1 ✔
[42/c5fdb7] process > RNASEQ:FASTQC (FASTQC on gut) [100%] 1 of 1 ✔
[55/d4f890] process > RNASEQ:QUANT (gut) [100%] 1 of 1 ✔
[d3/c3eba6] process > MULTIQC [100%] 1 of 1 ✔

Done! Open the following report in your browser --> gs://<BUCKET_NAME>/output/multiqc_report.html

Completed at: 23-Dec-2022 13:45:42
Duration : 8m 4s
CPU hours : (a few seconds)
Succeeded : 4

Viewing output of Nextflow pipeline

The output files can be found in the following location: gs://<BUCKET_NAME>/output
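To inspect the results locally, the output folder can be listed and copied down with gsutil (a sketch; substitute your own bucket name for <BUCKET_NAME>):

```shell
# List the generated output files in the bucket.
gsutil ls gs://<BUCKET_NAME>/output

# Copy the whole output folder to the current local directory.
gsutil cp -r gs://<BUCKET_NAME>/output .
```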

Troubleshooting

If a pipeline fails, you can troubleshoot it with the following methods:

  1. Use Cloud Logging to check whether any error logs have been generated.
  2. Open the .nextflow.log file in the launch directory, along with the .command.err, .command.log, and .command.out files in the failing task's work directory, to check for error messages.
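Batch task logs can also be pulled from Cloud Logging on the command line. A sketch, assuming Cloud Batch writes task output to the batch_task_logs log name; the filter shown is one common form:

```shell
# Read recent Cloud Batch task log entries for the project.
gcloud logging read 'logName:"batch_task_logs"' \
  --limit=50 \
  --format='value(textPayload)'
```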
