Deploying a Text-Generation Inference server (TGI) on a Google Cloud TPU instance

Text-Generation-Inference (TGI) enables serving Large Language Models (LLMs) on TPUs, with Optimum TPU delivering a specialized TGI runtime that’s fully optimized for TPU hardware.

TGI also offers an OpenAI-compatible API, making it easy to integrate with numerous tools; an example request against it is shown at the end of this guide.

For a list of supported models, check the Supported Models page.

Deploy TGI on a Cloud TPU Instance

This guide assumes you have a Cloud TPU instance running. If not, please refer to our deployment guide.

You have two options for deploying TGI:

  1. Use our pre-built TGI image (recommended)
  2. Build the image manually for the latest features

Option 1: Using the Pre-built Image

The optimum-tpu image is available at ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi. Check the optimum-tpu container documentation for the latest TGI image version. The tutorial on serving also walks you through starting the TGI container from a pre-built image. Here’s how to deploy it:

docker run -p 8080:80 \
  --shm-size 16GB \
  --privileged \
  --net host \
  -e LOG_LEVEL=text_generation_router=debug \
  -v ~/hf_data:/data \
  -e HF_TOKEN=<your_hf_token_here> \
  ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi \
  --model-id google/gemma-2b-it \
  --max-input-length 512 \
  --max-total-tokens 1024 \
  --max-batch-prefill-tokens 512 \
  --max-batch-total-tokens 1024
You need to replace `<your_hf_token_here>` with a Hugging Face access token, which you can get [here](https://huggingface.co/settings/tokens).
If you have already logged in via `huggingface-cli login`, you can instead set HF_TOKEN=$(cat ~/.cache/huggingface/token) for convenience.
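
On first start the server downloads the model weights and warms the model up, which can take several minutes before requests are accepted. Below is a minimal readiness check, assuming TGI's default /health route and, since the container uses host networking, a server reachable on port 80 of the host:

# Poll TGI's health route until the server reports ready (HTTP 200).
# Assumes host networking (server on port 80) and TGI's default /health route.
until curl -sf localhost/health > /dev/null; do
    echo "Waiting for TGI to become ready..."
    sleep 5
done
echo "TGI is ready to serve requests."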

You can also use the GCP-provided image referenced on the optimum-tpu container page.

Option 2: Manual Image Building

For the latest features (main branch of optimum-tpu) or custom modifications, build the image yourself:

  1. Clone the repository and enter it:
git clone https://github.com/huggingface/optimum-tpu.git
cd optimum-tpu
  2. Build the image:
make tpu-tgi
  3. Run the container:
HF_TOKEN=<your_hf_token_here>
MODEL_ID=google/gemma-2b-it

sudo docker run --net=host \
  --privileged \
  -v $(pwd)/data:/data \
  -e HF_TOKEN=${HF_TOKEN} \
  huggingface/optimum-tpu:latest \
  --model-id ${MODEL_ID} \
  --max-concurrent-requests 4 \
  --max-input-length 32 \
  --max-total-tokens 64 \
  --max-batch-size 1
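
Once the container is running, you can send requests exactly as described in the next section. If you first want to double-check that the build produced the expected image, the sketch below assumes the make target tags it as huggingface/optimum-tpu:latest, the tag referenced in the run command above:

# List the locally built image; the huggingface/optimum-tpu:latest tag is assumed
# to match the one used in the docker run command above.
docker images huggingface/optimum-tpu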

Executing requests against the service

You can query the model using either the /generate or /generate_stream routes:

curl localhost/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'

curl localhost/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
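
As mentioned at the start of this guide, TGI also exposes an OpenAI-compatible API. A minimal sketch of a chat completion request, assuming a TGI version that serves the /v1/chat/completions route on the same host:

# Chat completion via the OpenAI-compatible route (route and payload assumed from
# TGI's Messages API; the "model" field is assumed to accept a placeholder such as "tgi").
curl localhost/v1/chat/completions \
    -X POST \
    -d '{"model":"tgi","messages":[{"role":"user","content":"What is Deep Learning?"}],"max_tokens":20,"stream":false}' \
    -H 'Content-Type: application/json'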