In my previous posts, we explored how LangChain simplifies AI application development and how to deploy Gemini-powered LangChain applications on GKE. Now, let's take a look at a slightly different approach: running your own instance of Gemma, Google's open large language model, directly within your GKE cluster and integrating it with LangChain.
Why choose Gemma on GKE?
While using an LLM endpoint like Gemini is convenient, running an open model like Gemma 2 on your GKE cluster can offer several advantages:
- Control: You have complete control over the model, its resources, and its scaling. This is particularly important for applications with strict performance or security requirements.
- Customization: You can fine-tune the model on your own datasets to optimize it for specific tasks or domains.
- Cost optimization: For high-volume usage, running your own instance can potentially be more cost-effective than using the API.
- Data locality: Keep your data and model within your controlled environment, which can be crucial for compliance and privacy.
- Experimentation: You can experiment with the latest research and techniques without being limited by the API's features.
Deploying Gemma on GKE
Deploying Gemma on GKE involves several steps, from setting up your GKE cluster to configuring LangChain to use your Gemma instance as its LLM.
Set up credentials
To be able to use the Gemma 2 model, you first need a Hugging Face account. Start by creating one if you don't already have one, then create an access token with `read` permissions from your settings page. Make sure to note down the token value, which we'll need in a bit.
Then, go to the model consent page to accept the terms and conditions of using the Gemma 2 model. Once that is done, we're ready to deploy our open model.
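Before moving on, it can be worth a quick sanity check that the token works. The snippet below is a small sketch: the `whoami-v2` endpoint is, at the time of writing, the one the Hugging Face tooling uses to identify the account behind a token.

```bash
# Keep the token handy in an environment variable for the next steps
export HF_TOKEN=<your Hugging Face token>

# Optional: verify the token; a JSON description of your account means you're good to go
curl -s -H "Authorization: Bearer ${HF_TOKEN}" https://huggingface.co/api/whoami-v2
```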
Set up your GKE Cluster
If you don't already have a GKE cluster, you can create one through the Google Cloud console or with the `gcloud` command-line tool. Make sure to choose a machine type with sufficient resources to run Gemma, such as the `g2-standard` family, which includes an attached NVIDIA L4 GPU. To keep things simple, we can create a GKE Autopilot cluster, which provisions and manages the nodes for us:
```bash
gcloud container clusters create-auto langchain-cluster \
    --project=PROJECT_ID \
    --region=us-central1
```
Deploy a Gemma 2 instance
For this example, we'll deploy an instruction-tuned instance of Gemma 2 using a vLLM image. The following manifest describes a Deployment and corresponding Service for the `gemma-2-2b-it` model, along with a Secret that holds your Hugging Face token. Replace `HUGGINGFACE_TOKEN` with the token you generated earlier.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-2-2b-it
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: model-garden
    spec:
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250114_0916_RC00_maas
        resources:
          requests:
            cpu: 2
            memory: 34Gi
            ephemeral-storage: 10Gi
            nvidia.com/gpu: 1
          limits:
            cpu: 2
            memory: 34Gi
            ephemeral-storage: 10Gi
            nvidia.com/gpu: 1
        args:
        - python
        - -m
        - vllm.entrypoints.api_server
        - --host=0.0.0.0
        - --port=8000
        - --model=google/gemma-2-2b-it
        - --tensor-parallel-size=1
        - --swap-space=16
        - --gpu-memory-utilization=0.95
        - --enable-chunked-prefill
        - --disable-log-stats
        env:
        - name: MODEL_ID
          value: google/gemma-2-2b-it
        - name: DEPLOY_SOURCE
          value: "UI_NATIVE_MODEL"
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-secret
type: Opaque
stringData:
  hf_api_token: HUGGINGFACE_TOKEN
```
Save this to a file called `gemma-2-deployment.yaml`, then deploy it to your cluster:

```bash
kubectl apply -f gemma-2-deployment.yaml
```
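The first startup can take several minutes while the container image is pulled and the model weights are downloaded from Hugging Face. Here's a minimal sketch for checking that the server is up, assuming the labels and service name from the manifest above; depending on the serving image, the completion comes back under a `predictions` or `text` field (the LangChain wrapper we write next expects `predictions`).

```bash
# Watch the Gemma pod until it reports Ready (the first start downloads the model weights)
kubectl get pods -l app=gemma-server --watch

# Forward the service to your machine and send a test prompt to the vLLM server
kubectl port-forward service/llm-service 8000:8000
# ...then, in a second terminal:
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Tell me a fun fact about hummingbirds.", "max_tokens": 128}'
```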
Deploying LangChain on GKE
Now that we have our GKE cluster and Gemma deployed, we need to create our LangChain application and deploy it. If you've followed my previous post, you'll notice that these steps are very similar. The main differences are that we're pointing LangChain to Gemma instead of Gemini, and that our LangChain application uses a custom LLM class to call our local Gemma instance.
Containerize your LangChain application
First, we need to package our LangChain application into a Docker container. This involves creating a `Dockerfile` that specifies the environment and dependencies for our application. Here is a Python application using LangChain and Gemma, which we'll save as `app.py`:
```python
from typing import Any, Optional

import requests
from flask import Flask, request
from langchain_core.callbacks.manager import CallbackManagerForLLMRun
from langchain_core.language_models.llms import LLM
from langchain_core.prompts import ChatPromptTemplate


class VLLMServerLLM(LLM):
    """Custom LangChain LLM that calls the vLLM server running in the cluster."""

    vllm_url: str
    model: Optional[str] = None
    temperature: float = 0.0
    max_tokens: int = 2048

    @property
    def _llm_type(self) -> str:
        return "vllm_server"

    def _call(
        self,
        prompt: str,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> str:
        headers = {"Content-Type": "application/json"}
        payload = {
            "prompt": prompt,
            "temperature": self.temperature,
            "max_tokens": self.max_tokens,
            **kwargs,
        }
        if self.model:
            payload["model"] = self.model

        try:
            response = requests.post(self.vllm_url, headers=headers, json=payload, timeout=120)
            response.raise_for_status()

            json_response = response.json()
            if isinstance(json_response, dict) and "predictions" in json_response:
                text = json_response["predictions"][0]
            else:
                raise ValueError(f"Unexpected response format from vLLM server: {json_response}")

            return text
        except requests.exceptions.RequestException as e:
            raise ValueError(f"Error communicating with vLLM server: {e}")
        except (KeyError, TypeError) as e:
            raise ValueError(f"Error parsing vLLM server response: {e}. Response was: {json_response}")


# Point the custom LLM at the in-cluster service created for Gemma
llm = VLLMServerLLM(
    vllm_url="http://llm-service:8000/generate",
    temperature=0.7,
    max_tokens=512,
)

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful assistant that answers questions about a given topic.",
        ),
        ("human", "{input}"),
    ]
)

chain = prompt | llm


def create_app():
    app = Flask(__name__)

    @app.route("/ask", methods=['POST'])
    def talkToGemini():
        user_input = request.json['input']
        response = chain.invoke({"input": user_input})
        return response

    return app


if __name__ == "__main__":
    app = create_app()
    app.run(host='0.0.0.0', port=80)
```
Then, create a `Dockerfile` to define how to assemble our image:
```dockerfile
# Use an official Python runtime as a parent image
FROM python:3-slim

# Set the working directory in the container
WORKDIR /app

# Copy the current directory contents into the container at /app
COPY . /app

# Install any needed packages specified in requirements.txt
RUN pip install -r requirements.txt

# Make port 80 available to the world outside this container
EXPOSE 80

# Run app.py when the container launches
CMD [ "python", "app.py" ]
```
For our dependencies, create the `requirements.txt` file containing LangChain, the Requests library (used by our custom LLM class), and a web framework, Flask:

```
langchain
flask
requests
```
Finally, build the container image and push it to Artifact Registry. Don't forget to replace `PROJECT_ID` with your Google Cloud project ID.
```bash
# Authenticate with Google Cloud
gcloud auth login

# Create the repository
gcloud artifacts repositories create images \
    --repository-format=docker \
    --location=us

# Configure Docker authentication for the Artifact Registry host
gcloud auth configure-docker us-docker.pkg.dev

# Build the image
docker build -t us-docker.pkg.dev/PROJECT_ID/images/my-langchain-app:v1 .

# Push the image
docker push us-docker.pkg.dev/PROJECT_ID/images/my-langchain-app:v1
```
After a few moments, your container image will be stored in your Artifact Registry repository.
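To double-check that the push went through, you can list the contents of the repository (this assumes the `images` repository name used above):

```bash
# List the images stored in the Artifact Registry repository
gcloud artifacts docker images list us-docker.pkg.dev/PROJECT_ID/images
```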
Deploy to GKE
Create a YAML file with your Kubernetes Deployment and Service manifests. Let's call it `deployment.yaml`, again replacing `PROJECT_ID`.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: langchain-deployment
spec:
  replicas: 3  # Scale as needed
  selector:
    matchLabels:
      app: langchain-app
  template:
    metadata:
      labels:
        app: langchain-app
    spec:
      containers:
      - name: langchain-container
        image: us-docker.pkg.dev/PROJECT_ID/images/my-langchain-app:v1
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: langchain-service
spec:
  selector:
    app: langchain-app
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
  type: LoadBalancer  # Exposes the service externally
```
Apply the manifest to your cluster:
```bash
# Get the context of your cluster
gcloud container clusters get-credentials langchain-cluster --region us-central1

# Deploy the manifest
kubectl apply -f deployment.yaml
```
This creates a deployment with three replicas of your LangChain application and exposes it externally through a load balancer. You can adjust the number of replicas based on your expected load.
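If the initial replica count turns out not to match your traffic, you don't have to edit the manifest; a couple of `kubectl` commands cover the common cases. This is a sketch with one assumption worth calling out: CPU-based autoscaling only works if the container declares CPU requests, which the manifest above doesn't yet.

```bash
# Manually scale the LangChain deployment to five replicas
kubectl scale deployment/langchain-deployment --replicas=5

# Or let Kubernetes adjust the replica count based on CPU usage
# (add CPU requests to the container spec first, otherwise the autoscaler has no target)
kubectl autoscale deployment/langchain-deployment --min=2 --max=10 --cpu-percent=70
```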
Interact with your deployed application
Once the service is deployed, you can get the external IP address of your application using:
```bash
export EXTERNAL_IP=`kubectl get service/langchain-service \
  --output jsonpath='{.status.loadBalancer.ingress[0].ip}'`
```
You can now send requests to your LangChain application running on GKE. For example:
```bash
curl -X POST -H "Content-Type: application/json" \
  -d '{"input": "Tell me a fun fact about hummingbirds"}' \
  http://$EXTERNAL_IP/ask
```
Considerations and enhancements
- Scaling: You can scale your Gemma deployment independently of your LangChain application based on the load generated by the model.
- Monitoring: Use Cloud Monitoring and Cloud Logging to track the performance of both Gemma and your LangChain application. Keep an eye on error rates, latency, and resource utilization; a few quick `kubectl` checks are sketched after this list.
- Fine-tuning: Consider fine-tuning Gemma on your own dataset to improve its performance on your specific use case.
- Security: Implement appropriate security measures, such as network policies and authentication, to protect your Gemma instance.
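As a starting point for the monitoring item above, plain `kubectl` already gives a quick view of what both deployments are doing (the deployment names are the ones used earlier in this post):

```bash
# Tail the logs of the Gemma inference server and the LangChain application
kubectl logs deployment/gemma-deployment -f
kubectl logs deployment/langchain-deployment -f

# Check CPU and memory usage per pod (GKE provides the required metrics server by default)
kubectl top pods
```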
Conclusion
Deploying Gemma on GKE and integrating it with LangChain provides a powerful and flexible way to build AI-powered applications. You gain fine-grained control over your model and infrastructure while still leveraging the developer-friendly features of LangChain. This approach allows you to tailor your setup to your specific needs, whether it's optimizing for performance, cost, or control.
Next steps:
- Explore the Gemma documentation for more details on the model and its capabilities.
- Check out the LangChain documentation for advanced use cases and integrations.
- Dive deeper into GKE documentation for running production workloads.
In the next post, we will take a look at how to streamline LangChain deployments using LangServe.