Deploy Flan-T5 XXL on Vertex AI Prediction

Rafa Sanchez
Google Cloud - Community
5 min read · Mar 9, 2023

In a previous post, I wrote about how to fine-tune and deploy a Flan-T5 Base model (220M parameters) in Vertex AI. This post shows how to deploy a Flan-T5 XXL model (11B parameters) in Vertex AI Prediction. The model will be downloaded and embedded in a custom prediction image. You will use an a2-highgpu-1g machine type in Vertex AI Prediction, which contains 1xA100 GPU.

Note that building the image and deploying it into Vertex AI is a process that can take up to 4 hours, and the cost of a Vertex Prediction endpoint (24x7) is around USD 170 per day with 1xA100 GPU.

A demo based on Gradio and deployed in Cloud Run is also provided as a quick setup for you to make requests to the deployed model.

Fig.1 Gradio app on mobile

The model: Flan-T5 XXL (fp16)

If you are familiar with T5, an open-source LLM from Google, Flan-T5 improves on T5 at almost everything. For the same number of parameters, Flan-T5 models have been fine-tuned on more than 1,000 additional tasks covering more languages.

As mentioned in the paper:

We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation, RealToxicityPrompts). […] Flan-PaLM achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU.

Fig.2 How Flan works. Source: Google blog

Flan-T5 is released with different sizes: Small, Base, Large, XL and XXL. XXL is the biggest version of Flan-T5, containing 11B parameters. Original checkpoints from Google for all sizes are available here.

The Flan-T5 XXL model card can be accessed here.

You will use the fp16 version (floating point 16) of the model, which reduces the original size (42 GiB) by half (21 GiB). This reduced version of the model is easier to manage and can be downloaded from Hugging Face.

You must download the model and store it in the predict/flan-t5-xxl-sharded-fp16 directory, with content similar to what is shown below (a download sketch follows the listing). Note that handler.py is not needed since you will not be using TorchServe:

config.json
pytorch_model-00001-of-00012.bin
pytorch_model-00002-of-00012.bin
pytorch_model-00003-of-00012.bin
pytorch_model-00004-of-00012.bin
pytorch_model-00005-of-00012.bin
pytorch_model-00006-of-00012.bin
pytorch_model-00007-of-00012.bin
pytorch_model-00008-of-00012.bin
pytorch_model-00009-of-00012.bin
pytorch_model-00010-of-00012.bin
pytorch_model-00011-of-00012.bin
pytorch_model-00012-of-00012.bin
pytorch_model.bin.index.json
special_tokens_map.json
spiece.model
tokenizer.json
tokenizer_config.json
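
If you want to script the download, a minimal sketch using the huggingface_hub client is shown below. It assumes the fp16 checkpoint is the philschmid/flan-t5-xxl-sharded-fp16 repo on Hugging Face and requires a recent huggingface_hub version that supports local_dir; adjust the repo id if you use a different mirror.

# Minimal download sketch. The repo id is an assumption: adjust it if the
# fp16 checkpoint you use lives somewhere else on Hugging Face.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="philschmid/flan-t5-xxl-sharded-fp16",
    local_dir="predict/flan-t5-xxl-sharded-fp16",
)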

Build the Custom Prediction Container image

A Custom Container image for predictions is required. Vertex AI requires that the container runs an HTTP server: specifically, the container must listen and respond to liveness checks, health checks, and prediction requests.
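
As an illustration, the prediction server inside the container could look like the minimal FastAPI sketch below. The /health and /predict routes and the {"instances": [{"text": ...}]} request schema are the choices assumed for this demo, not Vertex AI requirements, and the actual code in the repo may differ:

# Minimal sketch of the prediction server (the code in the repo may differ).
# Assumes the model files are copied into the image, e.g. under
# app/flan-t5-xxl-sharded-fp16, and that accelerate is installed for device_map.
import torch
from fastapi import FastAPI, Request
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_DIR = "app/flan-t5-xxl-sharded-fp16"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSeq2SeqLM.from_pretrained(
    MODEL_DIR, torch_dtype=torch.float16, device_map="auto"  # fp16 weights on the A100
)

app = FastAPI()

@app.get("/health")
def health():
    # Vertex AI health checks only need an HTTP 200 response.
    return {"status": "healthy"}

@app.post("/predict")
async def predict(request: Request):
    body = await request.json()
    # Assumed request schema: {"instances": [{"text": "..."}]}
    prompts = [instance["text"] for instance in body["instances"]]
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=128)
    predictions = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # Vertex AI expects the response body to contain a "predictions" field.
    return {"predictions": predictions}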

You will use the Tiangolo Uvicorn-Gunicorn web server. It does not have the advanced capabilities of NVIDIA Triton Inference Server or TorchServe, such as serving multiple models, but it is enough for the purposes of this demo. You must build and push the container image to Artifact Registry:

gcloud auth configure-docker europe-west4-docker.pkg.dev
gcloud builds submit --tag europe-west4-docker.pkg.dev/argolis-rafaelsanchez-ml-dev/ml-pipelines-repo/flan-t5-xxl-sharded-fp16 --machine-type=e2-highcpu-8 --timeout="2h" --disk-size=300

This build process can take up to 2 hours even with an e2-highcpu-8 machine, due to the container size (around 20 GiB).

Deploy the model to Vertex AI Prediction

Upload and deploy the image to Vertex AI Prediction using the provided script: python3 upload_custom.py. You will deploy to an a2-highgpu-1g machine (12 vCPU, 85 GB RAM) with 1xA100 GPU (40 GB HBM2). Note the price of $4.22 per node-hour plus $2.90 per hour for the GPU.

The upload and deploy process may take up to 2 hours. Note the parameter deploy_request_timeout to avoid a 504 Deadline Exceeded error during the deployment:

from google.cloud import aiplatform

# Project and region taken from the image URI used above
aiplatform.init(project='argolis-rafaelsanchez-ml-dev', location='europe-west4')

DEPLOY_IMAGE = 'europe-west4-docker.pkg.dev/argolis-rafaelsanchez-ml-dev/ml-pipelines-repo/flan-t5-xxl-uvicorn'
HEALTH_ROUTE = "/health"
PREDICT_ROUTE = "/predict"
SERVING_CONTAINER_PORTS = [7080]

# Upload the custom container as a Vertex AI Model resource
model = aiplatform.Model.upload(
    display_name='custom-finetuning-flan-t5-xxl',
    description='Finetuned Flan T5 XXL model with Uvicorn and FastAPI',
    serving_container_image_uri=DEPLOY_IMAGE,
    serving_container_predict_route=PREDICT_ROUTE,
    serving_container_health_route=HEALTH_ROUTE,
    serving_container_ports=SERVING_CONTAINER_PORTS,
)
print(model.resource_name)

# Retrieve the Model on Vertex
model = aiplatform.Model(model.resource_name)

# Deploy the model to an endpoint backed by one a2-highgpu-1g node with 1xA100
endpoint = model.deploy(
    machine_type='a2-highgpu-1g',
    traffic_split={"0": 100},
    min_replica_count=1,
    max_replica_count=1,
    accelerator_type="NVIDIA_TESLA_A100",
    accelerator_count=1,
    traffic_percentage=100,
    deploy_request_timeout=1200,
    sync=True,
)
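
Once the endpoint is up, you can send a quick smoke test from the Python SDK, reusing the endpoint object returned by model.deploy() above. The {"text": ...} instance format is the request schema assumed by the custom container in this demo, so adapt it if your server expects something different:

# Smoke test against the deployed endpoint (request schema as assumed above)
response = endpoint.predict(instances=[{"text": "Translate to German: How old are you?"}])
print(response.predictions[0])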

Scaling

Estimating the resources required to serve a model is important. In this example, the Flan-T5 XXL (fp16) model is approximately 20 GiB in size, and the machine type used for predictions has 12 vCPU, 85 GB RAM, and 40 GB of GPU memory. It is recommended to run load tests before any production deployment.

Both the Uvicorn-Gunicorn server and Vertex AI Prediction provide several options to scale and handle resources properly:

  • Vertex AI Prediction provides autoscaling of CPUs/GPUs. In this example, no autoscaling is configured, so you will always use one machine.
  • The Uvicorn-Gunicorn server lets you configure the number of workers. In this example, the number of workers is 1. Given the size of the model, you can also try 2 workers and serve two instances of the model (see the configuration sketch after this list).
  • The Uvicorn-Gunicorn server supports pre-loading, which can save RAM by loading the model once before the workers are forked (when using more than one worker). In this example, pre-loading is not used.
  • You can also try the 8-bit version of Flan-T5 XXL, but be aware of its known issues.
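
If you experiment with more workers or with pre-loading, the relevant knobs are standard Gunicorn settings. A minimal gunicorn_conf.py sketch is shown below; how the Tiangolo image picks up a custom Gunicorn configuration is described in its documentation, and the values here are illustrative only:

# gunicorn_conf.py sketch for experimenting with 2 workers and pre-loading.
# These are standard Gunicorn settings; the values are illustrative.
worker_class = "uvicorn.workers.UvicornWorker"
workers = 2           # two copies of the ~20 GiB model: check RAM and GPU memory first
preload_app = True    # load the app once before forking workers
timeout = 600         # text generation on long prompts can be slow
# Note: pre-loading a CUDA-initialized model before forking can be problematic;
# test this configuration carefully before relying on it.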

Scaling a model of this size is not the main focus of this post, and hence has not been analyzed.

Gradio demo UI

You are now ready to send predictions to the deployed model. You can simply use the REST API or the Python SDK, but in this demo you will build a simple UI using Gradio.
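
A minimal sketch of such a Gradio app is shown below. The endpoint ID is a placeholder, and the {"text": ...} instance schema is the same assumption made for the prediction server; the app in the repo may differ:

# Minimal Gradio front end (sketch); the app in the repo may differ.
import gradio as gr
from google.cloud import aiplatform

aiplatform.init(project="argolis-rafaelsanchez-ml-dev", location="europe-west4")
# Placeholder: replace with the ID of the endpoint created above.
endpoint = aiplatform.Endpoint("YOUR_ENDPOINT_ID")

def ask_flan_t5(prompt):
    # Same assumed request schema as the prediction server: {"text": ...}
    response = endpoint.predict(instances=[{"text": prompt}])
    return response.predictions[0]

demo = gr.Interface(fn=ask_flan_t5, inputs="text", outputs="text",
                    title="Flan-T5 XXL on Vertex AI")
# Listen on the port that the Cloud Run service is deployed with (7860 below).
demo.launch(server_name="0.0.0.0", server_port=7860)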

The Gradio app runs on Cloud Run. You need to build the Docker image first, push it to Artifact Registry, and then deploy it to Cloud Run. The following command runs the last step (the Cloud Run deployment):

gcloud run deploy flan-t5-xxl-gradio --port 7860 --image europe-west4-docker.pkg.dev/argolis-rafaelsanchez-ml-dev/ml-pipelines-repo/flan-t5-xxl-gradio --allow-unauthenticated --region=europe-west4 --platform=managed

The image below shows the Gradio app deployed in Cloud Run. You can test the provided examples or try your own. Make sure to read the instructions provided on the UI main page:

Fig.3 Gradio app on desktop

Summary

This post summarizes how to deploy a Flan-T5 XXL fp16 model in Vertex AI Prediction. The model is approximately 20 GiB in size and requires a 1xA100 GPU.

You can find the full code in this repo.

I would like to thank Camus Ma for comments and contributions to this post.

References

[1] Google blog: The Flan Collection: Advancing open source methods for instruction tuning
[2] Research paper: FLAN-T5
[3] Original FLAN-T5 Checkpoints from Google
[4] Medium post: Serving Machine Learning models with Google Vertex AI
[5] Phil Schmid blog: Deploy FLAN-T5 XXL on Amazon SageMaker
[6] Flan-T5 XXL fp16 model
