Building on the previous blog post about running Qwen3 Embedding models on Cloud Run, this article focuses on the recently released EmbeddingGemma model from the Gemma family. Discover how to leverage the same techniques to deploy this model on Google Cloud's serverless platform.
You will learn how to:
- Containerize the embedding model with Docker and Ollama
- Deploy the embedding model to Cloud Run with GPUs
- Test the deployed model from a local machine
Before we dive into the code, let's briefly discuss the core components that power this serverless AI solution.
EmbeddingGemma Model
According to the EmbeddingGemma model card:
“EmbeddingGemma is a 308M parameter multilingual text embedding model based on Gemma 3. It is optimized for use in everyday devices, such as phones, laptops, and tablets. The model produces numerical representations of text to be used for downstream tasks like information retrieval, semantic similarity search, classification, and clustering.”
Its optimization for efficiency makes EmbeddingGemma an ideal candidate for serverless deployment on Cloud Run, ensuring high performance and cost-effectiveness for your AI applications.
Cloud Run
Cloud Run is a managed compute platform on Google Cloud that lets you run containerized applications in a serverless environment. Think of it as a middle ground between a simple function-as-a-service (like Cloud Run Functions) and a more customizable GKE cluster. You give it a container image, and it handles all the underlying infrastructure, from provisioning and scaling to managing the runtime.
The beauty of Cloud Run is that it can automatically scale to zero, meaning when there are no requests, you aren't paying for any resources. When traffic picks up, it quickly scales up to handle the load. This makes it perfect for stateless models that need to be highly available and cost-effective.
Deployment
Let's walk through the deployment process step-by-step.
Prepare the environment
First, let's configure the gcloud CLI environment.
Note: if you do not have the gcloud CLI installed, please follow the instructions available here.
- Step 1 - Set your default project:
gcloud config set project PROJECT_ID
- Step 2 - Configure the Google Cloud CLI to use the europe-west1 region for Cloud Run commands:
gcloud config set run/region europe-west1
Important: at the time of writing, GPUs on Cloud Run are available in several regions. To check the closest supported region, please refer to this page.
Containerize
Now we will use Docker and Ollama to run the EmbeddingGemma model. Create a file named Dockerfile containing:
FROM ollama/ollama:latest

# Listen on all interfaces, port 8080
ENV OLLAMA_HOST=0.0.0.0:8080

# Store model weight files in /models
ENV OLLAMA_MODELS=/models

# Reduce logging verbosity
ENV OLLAMA_DEBUG=false

# Never unload model weights from the GPU
ENV OLLAMA_KEEP_ALIVE=-1

# Store the model weights in the container image
ENV MODEL=embeddinggemma:latest
RUN ollama serve & sleep 5 && ollama pull $MODEL

# Start Ollama
ENTRYPOINT ["ollama", "serve"]
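Optionally, you can verify that the image builds correctly before deploying. A local sketch (requires Docker installed; the build step downloads the model weights into the image):

docker build -t embedding-gemma .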
Build and Deploy
We will now use Cloud Run's source deployments. This allows you to achieve the following with one command:
- First, compile the container image from the provided source.
- Next, upload the resulting container image to an Artifact Registry.
- Then, deploy the container to Cloud Run, ensuring that GPU support is enabled using the --gpu and --gpu-type parameters.
- Finally, redirect all incoming traffic to this newly deployed version.
You just need to run:
gcloud run deploy embedding-gemma \
  --source . \
  --concurrency 4 \
  --cpu 8 \
  --set-env-vars OLLAMA_NUM_PARALLEL=4 \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --max-instances 1 \
  --memory 32Gi \
  --no-allow-unauthenticated \
  --no-cpu-throttling \
  --no-gpu-zonal-redundancy \
  --timeout=600 \
  --labels dev-tutorial=blog-embedding-gemma
Note the following important flags in this command:
- --concurrency 4 is set to match the value of the OLLAMA_NUM_PARALLEL environment variable.
- --gpu 1 with --gpu-type nvidia-l4 assigns 1 NVIDIA L4 GPU to every Cloud Run instance in the service.
- --max-instances 1 specifies the maximum number of instances to scale to. It has to be equal to or lower than your project's NVIDIA L4 GPU quota.
- --no-allow-unauthenticated restricts unauthenticated access to the service. By keeping the service private, you can rely on Cloud Run's built-in Identity and Access Management (IAM) authentication for service-to-service communication.
- --no-cpu-throttling is required for enabling GPU.
- --no-gpu-zonal-redundancy turns off GPU zonal redundancy; choose this setting based on your zonal failover requirements and available quota.
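Once the deployment finishes, you can check the service and retrieve its HTTPS URL with the gcloud CLI:

gcloud run services describe embedding-gemma --format 'value(status.url)'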
Test the deployment
Upon successful deployment of the service, you can initiate requests. However, direct API calls will result in an HTTP 401 Unauthorized response from Cloud Run.
This behaviour follows Google’s “secure by default” approach. The model is intended to be called from other services, such as a RAG application, and is therefore not open to public access.
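If you need to call the service directly (for example, from another machine or a CI job), you can attach an identity token to the request. Below is a minimal sketch, assuming the account running gcloud has the Cloud Run Invoker role (roles/run.invoker) on the service:

SERVICE_URL=$(gcloud run services describe embedding-gemma --format 'value(status.url)')
curl "$SERVICE_URL/api/embed" \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  -d '{ "model": "embeddinggemma", "input": "Sample text" }'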
To support local testing of your deployment, the simplest approach is to launch the Cloud Run developer proxy using the following command:
gcloud run services proxy embedding-gemma --port=9090
Afterwards, in a second terminal window, run:
curl http://localhost:9090/api/embed -d '{
  "model": "embeddinggemma",
  "input": "Sample text"
}'
The response will look similar to the following (illustrative output; the embedding values are truncated):
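{
  "model": "embeddinggemma",
  "embeddings": [
    [0.0103, -0.0037, 0.0072, ...]
  ],
  "total_duration": 141021833,
  "load_duration": 35321417,
  "prompt_eval_count": 3
}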
You can also use Python to call the endpoint. Example:
from ollama import Client

client = Client(host="http://localhost:9090")
response = client.embed(model="embeddinggemma", input="Sample text")
print(response)
Congratulations 🎉 The Cloud Run deployment is up and running!
RAG Example
You can use the newly deployed model to build your first RAG application. Here’s how to achieve this:
Step 1 - Generate Embeddings
Start by installing the required dependencies:
pip install ollama chromadb
Create an example.py file containing:
import ollama
import chromadb

documents = [
    "Poland is a country located in Central Europe.",
    "The capital and largest city of Poland is Warsaw.",
    "Poland's official language is Polish, which is a West Slavic language.",
    "Marie Curie, the pioneering scientist who conducted groundbreaking research on radioactivity, was born in Warsaw, Poland.",
    "Poland is famous for its traditional dish called pierogi, which are filled dumplings.",
    "The Białowieża Forest in Poland is one of the last and largest remaining parts of the immense primeval forest that once stretched across the European Plain.",
]

client = chromadb.Client()
collection = client.create_collection(name="docs")

ollama_client = ollama.Client(host="http://localhost:9090")

# Store each document in an in-memory vector embeddings database
for i, d in enumerate(documents):
    response = ollama_client.embed(model="embeddinggemma", input=d)
    embeddings = response["embeddings"]
    collection.add(ids=[str(i)], embeddings=embeddings, documents=[d])
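Optionally, you can append a quick sanity check after the loop to confirm the vectors look as expected; EmbeddingGemma should return 768-dimensional embeddings by default:

# Optional sanity check: the embedding vectors should be 768-dimensional
sample = ollama_client.embed(model="embeddinggemma", input="test")
print(len(sample["embeddings"][0]))  # expected: 768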
Step 2 - Retrieve
Next, with the following code you can search the vector database for the most relevant document (add it to your example.py):
# An example question
question = "What is Poland's official language?"

# Generate an embedding for the input and retrieve the most relevant document
response = ollama_client.embed(model="embeddinggemma", input=question)
results = collection.query(query_embeddings=[response["embeddings"][0]], n_results=1)
data = results["documents"][0][0]
Step 3 - Generate Final Answer
In this final step, we will use a locally installed Gemma 3.
Note: We use Gemma 3 in the generation step, but any other model could work here (e.g., Gemini, Qwen3, Llama, etc.). Nevertheless, it is critical to use the same embedding model in Step 1 (Generate Embeddings) and Step 2 (Retrieve).
To install the gemma3:latest model, run:
ollama pull gemma3
Now we can combine the user's prompt with the search results and generate the final answer (add this code to example.py):
# Final step - generate a response combining the prompt and data we retrieved in step 2
prompt = f"Using this data: {data}. Respond to this prompt: {question}"

output = ollama.generate(
    model="gemma3",
    prompt=prompt,
)

print(f"Prompt: {prompt}")
print(output["response"])
Run the code:
python example.py
The answer should look similar to the one below:
Prompt: Using this data: Poland's official language is Polish, which is a West Slavic language.. Respond to this prompt: What is Poland's official language?

Poland's official language is Polish. It's a West Slavic language.
You have successfully created and run your first RAG application using the EmbeddingGemma model.
Summary
At this point, you have successfully established a Cloud Run service running the EmbeddingGemma model, ready to generate embeddings for semantic search or RAG applications.
This method also allows you to deploy and compare multiple embedding models on Cloud Run (e.g. Qwen3 Embedding or other Ollama-supported models), enabling you to find the best fit for your specific use case without major code changes.
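For example, a quick way to compare two deployed services is to proxy each one locally and embed the same input against both. A rough sketch (the embedding-qwen3 service name, the second proxy port, and the qwen3-embedding model tag are assumptions; adjust them to whatever you actually deploy):

from ollama import Client

# Assumes two local proxies are running, e.g.:
#   gcloud run services proxy embedding-gemma --port=9090
#   gcloud run services proxy embedding-qwen3 --port=9091
clients = {
    "embeddinggemma": Client(host="http://localhost:9090"),
    "qwen3-embedding": Client(host="http://localhost:9091"),
}

for model, client in clients.items():
    response = client.embed(model=model, input="Sample text")
    vector = response["embeddings"][0]
    print(f"{model}: {len(vector)} dimensions, first values: {vector[:3]}")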
Ready to build your own serverless AI applications?
- Start building on Cloud Run today and explore its full potential!
- If you’re interested in learning more about RAG evaluation, this article is a good starting point.
Thanks for reading
If you found this article helpful, please consider following me here and giving it a clap 👏 to help others discover it.
I'm always eager to chat with fellow developers and AI enthusiasts, so feel free to connect with me on LinkedIn or Bluesky.