
Jesse Williams for Jozu

Originally published at jozu.com

Serving LLMs at Scale with KitOps, Kubeflow, and KServe

Introduction

Over the past few years, large language models (LLMs) have transformed how we build intelligent applications. From chatbots to code assistants, these models are used to power production systems across industries. But while training LLMs has become more accessible, deploying them at scale remains a challenge. Models generally come with gigabyte-sized weight files, depend on specific library versions, require careful GPU or CPU resource allocation, and need constant versioning as new checkpoints roll out. More often than not, a model that works in a data scientist's notebook can fail in production because of a mismatched dependency, a missing tokenizer file, or an environment variable that wasn't set.

KitOps (a CNCF project backed by Jozu) offers a solution called ModelKits, which is a standardized artifact that packages an ML model with its dependencies and configuration. This open-source toolkit lets organizations, developers, and data scientists bundle their models into versionable, signable, and portable ModelKits that can be pushed to any OCI-compliant registry. The result is consistent version tracking and reliable model artifacts across all environments, bringing the same level of control we expect from software development to machine learning deployments.

In this guide, we'll show you how to combine KitOps with Kubeflow and KServe to serve large language models at scale. You'll learn how to package an LLM into a ModelKit, deploy it with KServe's inference endpoints, and let Jozu handle the orchestration, all without needing dedicated GPU hardware to follow along. For an even deeper dive into production ML on Kubernetes, you can download our full technical guide to Kubernetes ML.

Learning Objectives

  • Build and package a TensorFlow LLM into a ModelKit using KitOps
  • Pack and push the ModelKit to Jozu, an OCI-compliant registry built for ModelKits
  • Set up Kubeflow and KServe to serve your model in production
  • Scale and secure your model deployments in production environments

Prerequisites and Setup

Before we start deploying LLMs at scale, let's make sure you have the right tools installed and configured. This section walks through everything you need: Python for running your model code, the KitOps CLI for packaging ModelKits, and a Jozu sandbox account for storing and managing your artifacts.

Install Python

For this project, you'll need Python 3.10 or above installed on your system. This ensures compatibility with modern ML libraries like TensorFlow and the dependencies we'll use throughout this guide. If you don't have Python installed yet, grab it from python.org and follow the installation steps for your operating system.
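If Python is already installed, you can quickly confirm that the version meets this requirement:

python3 --version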

Install the KitOps CLI

The Kit CLI is what we'll use to pack, push, and manage ModelKits. Head over to the KitOps installation page and pick the installation method that matches your OS, whether you're on macOS, Linux, or Windows, and install accordingly.

Once you've installed the CLI, verify it's working by running:

kit version 

The output should show the installed version details.

Sign Up for Jozu

Jozu is your OCI-compliant registry for ModelKits. It's where you'll push packaged models and pull them during deployment. To get started with Jozu, head over to jozu.ml and click Sign Up to create an account. Make sure to note your username and password as you'll need them in the next step to authenticate your CLI.

Authenticate with Jozu

Now let's connect your local Kit CLI to your Jozu account. Open a terminal and run:

kit login jozu.ml 

You'll be prompted to enter your username (the email you registered with) and the password you created. If everything is set up correctly, you'll see a confirmation that the login succeeded.

Building a TensorFlow LLM Model

TensorFlow is one of the most popular open-source frameworks for building and training machine learning models. It was developed by Google, and it's particularly well-suited for production environments where you need scalable, efficient model serving across CPUs, GPUs, and TPUs.

TensorFlow shines in enterprise deployments, mobile applications, and in scenarios where you need tight integration with serving infrastructure. In this guide, we'll use TensorFlow to fine-tune a small T5 model that translates corporate jargon into plain language.

Set Up Your Project Directory

Let's start by creating a clean workspace for our model. Run these commands in your terminal to create your project directory:

mkdir corporate-speak
cd corporate-speak

Now create a Python virtual environment to keep dependencies isolated. A virtual environment separates the project's dependencies from your global Python installation, preventing conflicts with other projects and keeping results reproducible:

python3 -m venv env
source env/bin/activate

Install Dependencies

Create a requirements.txt file in your project root with the following libraries:

tensorflow==2.19.1
transformers==4.49.0
huggingface-hub==0.26.0
tf-keras
fastapi
uvicorn
sentencepiece

Install everything with:

pip install -r requirements.txt 

This pulls in TensorFlow for training, Transformers for the T5 model, FastAPI for serving later, and all the supporting libraries we'll need.

Create the Training Data

Before we can train our model, we need some data. Create a data directory in your project root:

mkdir data 

Inside the data directory, create a file called corporate_speak.json and paste this training dataset:

[ { "term": "Circle back", "meaning": "We'll talk about this later because we don't want to deal with it right now." }, { "term": "Synergy", "meaning": "Making two teams do one team's job, but with extra meetings." }, { "term": "Bandwidth", "meaning": "How much energy or patience a person has left." }, { "term": "Low-hanging fruit", "meaning": "The easiest task that still lets us look productive." }, { "term": "Touch base", "meaning": "Talk briefly to pretend progress is being made." }, { "term": "Pivot", "meaning": "Our original idea failed; let's rename it and try again." }, { "term": "Going forward", "meaning": "Forget what we said last time." }, { "term": "Alignment", "meaning": "Make sure no one disagrees publicly." } ] 
Enter fullscreen mode Exit fullscreen mode

This small dataset gives the model eight examples of corporate jargon and their plain-language meanings. It's just enough to fine-tune T5 for our demonstration without requiring heavy compute resources.

Create the Training Script

Next, make a directory for your application code:

mkdir app 

Inside the app directory, create a file called train_llm.py and add this code:

import os
import json
import tensorflow as tf
from transformers import T5Tokenizer, TFT5ForConditionalGeneration

BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
DATA_PATH = os.path.join(BASE_DIR, "data", "corporate_speak.json")

print(f"Base Directory: {BASE_DIR}")
print(f"Data Path: {DATA_PATH}")

def load_data(file_path):
    """Loads JSON data from the specified file path."""
    try:
        with open(file_path, 'r') as f:
            data = json.load(f)
        print(f"Successfully loaded {len(data)} records from data file.")
        return data
    except FileNotFoundError:
        print(f"ERROR: Data file not found at {file_path}")
        print("Please ensure you have created the file 'corporate_speak.json' and the 'data' folder.")
        return None
    except json.JSONDecodeError:
        print(f"ERROR: Could not decode JSON from {file_path}. Check file format.")
        return None

DATA = load_data(DATA_PATH)
if DATA is None:
    exit()  # Stop if data loading failed

prompts = [f"term: {item['term']}" for item in DATA]
responses = [f"meaning: {item['meaning']}" for item in DATA]

MODEL_NAME = 't5-small'
MAX_LENGTH = 128
BATCH_SIZE = 4
LEARNING_RATE = 1e-5
EPOCHS = 15

print(f"\nLoading T5 model and tokenizer: {MODEL_NAME}...")
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = TFT5ForConditionalGeneration.from_pretrained(MODEL_NAME)

tokenized_inputs = tokenizer(
    prompts,
    return_tensors='tf',
    max_length=MAX_LENGTH,
    padding='max_length',
    truncation=True
)
tokenized_targets = tokenizer(
    responses,
    return_tensors='tf',
    max_length=MAX_LENGTH,
    padding='max_length',
    truncation=True
)
labels = tokenized_targets['input_ids']

dataset = tf.data.Dataset.from_tensor_slices(
    (
        {'input_ids': tokenized_inputs['input_ids'],
         'attention_mask': tokenized_inputs['attention_mask']},
        labels
    )
).shuffle(buffer_size=len(DATA)).batch(BATCH_SIZE)

print("\n--- Starting Fine-Tuning ---")
optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
model.compile(optimizer=optimizer)
history = model.fit(
    dataset,
    epochs=EPOCHS,
    verbose=1
)
print("--- Fine-Tuning Complete ---")

print("\n--- Testing Model Generation ---")
test_term_1 = "term: Touch base"
test_input_1 = tokenizer(test_term_1, return_tensors='tf').input_ids
output_tokens_1 = model.generate(test_input_1, max_length=MAX_LENGTH)
decoded_meaning_1 = tokenizer.decode(output_tokens_1[0], skip_special_tokens=True)
print(f"Input: '{test_term_1}'")
print(f"Output: '{decoded_meaning_1}'")

test_term_2 = "term: Alignment"
test_input_2 = tokenizer(test_term_2, return_tensors='tf').input_ids
output_tokens_2 = model.generate(test_input_2, max_length=MAX_LENGTH)
decoded_meaning_2 = tokenizer.decode(output_tokens_2[0], skip_special_tokens=True)
print(f"\nInput: '{test_term_2}'")
print(f"Output: '{decoded_meaning_2}'")

MODEL_SAVE_PATH = os.path.join(BASE_DIR, "1")
os.makedirs(MODEL_SAVE_PATH, exist_ok=True)
model.save(MODEL_SAVE_PATH, save_format='tf')
tokenizer.save_pretrained(MODEL_SAVE_PATH)
print(f"\nModel saved to: {MODEL_SAVE_PATH}")

This script does four things: it loads your training data from a JSON file, tokenizes the inputs and targets for T5, fine-tunes the model for 15 epochs, and saves the trained weights along with the tokenizer to a directory called 1 in your project root.

It's important to save the model in a numbered version directory (here, 1/), because KServe's TensorFlow serving runtime expects models in this layout. Anything that deviates from it will prevent your KServe inference service from loading the model.

Train the Model

To train your model, run the following command from the root directory:

python3 app/train_llm.py

The training process will kick off, and you'll see output showing the model loading, training progress across epochs, test predictions, and finally confirmation that the model has been saved. When complete, you'll have a new directory called 1 containing your model's saved weights (saved_model.pb), variables, tokenizer config files, and all the assets TensorFlow needs to reload and serve your model later.
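For reference, the 1/ directory should contain roughly the following files (exact names can vary slightly with your TensorFlow and Transformers versions):

1/
├── assets/
├── variables/
│   ├── variables.data-00000-of-00001
│   └── variables.index
├── saved_model.pb
├── spiece.model
├── special_tokens_map.json
└── tokenizer_config.json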

Testing the Model with FastAPI

Before we package our model for production, let's make sure it actually works. We'll build a simple FastAPI inference server that loads the trained model and exposes an endpoint for predictions.

Create the Inference Server

In your app directory, create a file called inference.py and add this code:

import os
import tensorflow as tf
from transformers import T5Tokenizer, TFT5ForConditionalGeneration
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import uvicorn

app = FastAPI(
    title="Jargon Decoder LLM API",
    description="A service to translate corporate jargon using a fine-tuned T5 model.",
    version="1.0.0"
)

tokenizer = None
model = None
MAX_LENGTH = 128

BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
MODEL_SAVE_PATH = os.path.join(BASE_DIR, "1")

@app.on_event("startup")
async def load_model_on_startup():
    """Loads the fine-tuned T5 model and tokenizer when the FastAPI application starts."""
    global tokenizer, model
    print(f"Base Directory: {BASE_DIR}")
    print(f"Attempting to load model from: {MODEL_SAVE_PATH}")
    try:
        tokenizer = T5Tokenizer.from_pretrained(MODEL_SAVE_PATH)
        model = TFT5ForConditionalGeneration.from_pretrained(MODEL_SAVE_PATH)
        print("Model and tokenizer loaded successfully! 🚀")
    except Exception as e:
        print(f"FATAL ERROR: Could not load model from {MODEL_SAVE_PATH}.")
        print(f"Details: {e}")

class JargonRequest(BaseModel):
    """Schema for the input request."""
    term: str = "Circle back"

class JargonResponse(BaseModel):
    """Schema for the output response."""
    original_term: str
    decoded_meaning: str

def decode_jargon(term: str, tokenizer, model) -> str:
    """Core function to run inference on the loaded LLM."""
    if not tokenizer or not model:
        raise HTTPException(status_code=503, detail="Model is not loaded or ready.")

    prompt = f"term: {term}"
    input_ids = tokenizer(
        prompt,
        return_tensors='tf',
        max_length=MAX_LENGTH,
        padding='max_length',
        truncation=True
    ).input_ids

    output_tokens = model.generate(
        input_ids,
        max_length=MAX_LENGTH
    )
    decoded_meaning = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

    if decoded_meaning.startswith("meaning: "):
        return decoded_meaning[9:].strip()
    return decoded_meaning.strip()

@app.post("/decode/", response_model=JargonResponse)
async def decode(request: JargonRequest):
    """API endpoint to translate a corporate jargon term into plain meaning."""
    try:
        meaning = decode_jargon(request.term, tokenizer, model)
        return JargonResponse(
            original_term=request.term,
            decoded_meaning=meaning
        )
    except HTTPException as e:
        # Re-raise explicit HTTP exceptions
        raise e
    except Exception as e:
        # Handle unexpected errors
        print(f"Inference Error: {e}")
        raise HTTPException(status_code=500, detail=f"Internal server error during inference: {e}")

if __name__ == "__main__":
    uvicorn.run("inference:app", host="0.0.0.0", port=8000, reload=True)

This inference script sets up a FastAPI application that loads your fine-tuned T5 model on startup. The load_model_on_startup function pulls the tokenizer and model from the saved directory, making them available globally. The decode_jargon function handles the actual inference: it takes a corporate term, formats it as a prompt, runs it through the model, and returns the decoded meaning.

The /decode/ endpoint accepts POST requests with a jargon term and responds with the plain-language translation. Pydantic models ensure type safety for requests and responses, while error handling catches issues like missing models or inference failures.

Start the Server

Run the inference server from your project root:

python3 app/inference.py 

You'll see output showing the model loading and a confirmation that the FastAPI server is running on http://0.0.0.0:8000. The startup event will trigger immediately, pulling your trained weights into memory so they're ready for inference requests.

Test the Endpoint

To test the endpoint, open a new terminal and send a test request with curl:

curl -X POST "http://localhost:8000/decode/" \\ -H "Content-Type: application/json" \\ -d '{"term": "Synergy"}' 
Enter fullscreen mode Exit fullscreen mode

If everything is working, you should see a JSON response with the decoded meaning:

{ "original\_term": "Synergy", "decoded\_meaning": "Synergy" } 
Enter fullscreen mode Exit fullscreen mode

The code and model are working and returning a response in the expected format (with only eight training examples, the generated meanings will still be rough). Now that we've confirmed everything works locally, we can package the entire application code, model, and dependencies into a ModelKit for production deployment.

Packaging with KitOps

To make the workflow repeatable and production-ready, we'll use KitOps to bundle our trained model, inference code, and training data into a single ModelKit.

Initialize the Kitfile

From your project root directory, run:

kit init . 

This creates a Kitfile in your current directory. A Kitfile is a YAML manifest that describes everything needed to reproduce your ML project—model weights, code paths, datasets, and metadata. Think of it like a Dockerfile, but designed specifically for machine learning artifacts. It tells KitOps what to bundle into your ModelKit and how those pieces fit together.

Edit the Kitfile

The generated Kitfile is a good starting point, but it doesn't capture the full structure of our project. Open the Kitfile and replace its contents with this:

manifestVersion: 1.2.0

package:
  name: corporate-speak-model
  description: A lightweight language model fine-tuned on corporate jargon to explain complex corporate terms in simple English.
  authors: [Thoren Oakenshield]

code:
  - path: .
    description: All necessary scripts, configurations, and application logic

model:
  name: T5
  path: ./1/
  framework: Tensorflow
  version: 1.2.0
  description: A lightweight language model fine-tuned on corporate jargon to explain complex corporate terms in simple English.

datasets:
  - name: corporate-jargon-data
    path: ./data/
    description: A small JSON dataset containing corporate terms and their real-world meanings.

Let's break down what this Kitfile does. The package section holds metadata: the package name, a description, and the author. Next, the code section points to your entire project directory, capturing all your scripts, configuration files, and application logic.

Then, the model section specifies where your trained T5 weights live (the ./1/ directory we created during training), what framework they use, and the version. Finally, the datasets section references your training data in ./data/, so anyone pulling this ModelKit knows exactly what data was used to train the model. This single file gives you a complete snapshot of your ML project.

Pack the ModelKit

Now let's bundle everything into a ModelKit, similar to how you build a Docker image. To pack your ModelKit run:

kit pack . -t jozu.ml/<username>/<model-kit-name>:<version> 

Replace <username> with your Jozu username, and <model-kit-name>:<version> with the name and version tag you want for your ModelKit. This command reads your Kitfile, collects all the referenced files (code, model weights, data), and packages them into a single OCI-compliant artifact. You'll see output showing KitOps compressing and layering your files.
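As a concrete illustration, assuming a hypothetical username of thoren and the tag corporate-speak:v1, the command would look like this; kit list then confirms the ModelKit exists in local storage:

kit pack . -t jozu.ml/thoren/corporate-speak:v1

# Confirm the packed ModelKit is in local storage
kit list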

Push to Jozu

Once the pack completes, push your ModelKit to Jozu by running:

kit push jozu.ml/<username>/<model-kit-name>:<version> 

The CLI uploads your ModelKit layers to the registry. When it finishes, head to your Jozu account at jozu.ml, click on My Repositories, and you should see your newly pushed package listed.
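If you want to confirm the artifact round-trips cleanly, you can also pull it back from the registry, for example on another machine or in a CI job (using the same hypothetical tag as above):

kit pull jozu.ml/thoren/corporate-speak:v1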

Setting Up the Serving Infrastructure

Before we can deploy our model with KServe, we need to set up the complete infrastructure stack. This includes Docker for containerization, Kubernetes for orchestration, Kubeflow for ML workflows, and KServe for model serving. Let's walk through each installation step by step.

Install Docker

Docker is the container runtime that Minikube will use. If you're on Linux, run:

sudo apt-get update && sudo apt-get install docker.io -y
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker

For macOS or Windows users, head to the official Docker website and follow the installation instructions for your operating system.

Install kubectl

kubectl is the command-line tool for interacting with Kubernetes clusters. It lets you deploy applications, inspect resources, and manage cluster operations.

To install it, run:

sudo snap install kubectl --classic
kubectl version --client  # Verify installation

Install Minikube

Next is Minikube. It runs a local Kubernetes cluster on your machine, which is perfect for development and testing without needing cloud resources. To download and install it, run:

curl -LO https://github.com/kubernetes/minikube/releases/latest/download/minikube-linux-amd64
sudo install minikube-linux-amd64 /usr/local/bin/minikube && rm minikube-linux-amd64
minikube version

Start Minikube

It's important to start your local Kubernetes cluster with enough resources to handle model serving; otherwise, the cluster will fail partway through serving your model. To start Minikube, run:

minikube start --cpus=4 --memory=10240 --driver=docker
kubectl get nodes
kubectl cluster-info

This spins up a single-node cluster with 4 CPUs and 10GB of memory. The kubectl get nodes command confirms your cluster is running, and kubectl cluster-info shows the control plane endpoint.

Install Kubeflow Pipelines

Kubeflow is an open-source platform for running ML workflows on Kubernetes. It provides tools for orchestrating complex pipelines, tracking experiments, and managing model training. We'll install Kubeflow Pipelines, which handles the deployment and serving orchestration:

export PIPELINE_VERSION=2.4.0
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic?ref=$PIPELINE_VERSION"

This installation can take a few minutes. To check if all components are ready, run:

kubectl get pods -n kubeflow 

Wait until all pods show Running status. You should see output similar to this:

NAME                                               READY   STATUS    RESTARTS      AGE
cache-deployer-deployment-85b76bcb6-fmslx          1/1     Running   0             21h
cache-server-66bd9b7875-rxdvl                      1/1     Running   0             21h
metadata-envoy-deployment-746744dfb8-zdgtx         1/1     Running   0             21h
metadata-grpc-deployment-54654fc5bb-9cvdg          1/1     Running   6 (21h ago)   21h
metadata-writer-68658fdf4b-7zpbn                   1/1     Running   1 (20h ago)   21h
minio-85cd46c575-gt7kp                             1/1     Running   0             21h
ml-pipeline-6978d6f776-p4zt9                       1/1     Running   3 (20h ago)   21h
ml-pipeline-persistenceagent-7d4c675666-28qnz      1/1     Running   1 (20h ago)   21h
ml-pipeline-scheduledworkflow-695b7b8988-swzdj     1/1     Running   0             21h
ml-pipeline-ui-88467988b-4c6md                     1/1     Running   0             21h
ml-pipeline-viewer-crd-bf5dc64dd-5xqv9             1/1     Running   0             21h
ml-pipeline-visualizationserver-5584ff64d7-jr686   1/1     Running   0             21h
mysql-6745b5984c-dn4r6                             1/1     Running   0             21h
workflow-controller-5b84568b94-tjjcz               1/1     Running   0             21h

Install KServe

KServe is a Kubernetes-native platform for serving ML models. It handles autoscaling, canary rollouts, and provides a unified inference protocol across different model frameworks. You can install it with:

curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.14/hack/quick\_install.sh" | bash 
Enter fullscreen mode Exit fullscreen mode

Once the installation completes, verify that KServe and its dependencies are running with the following commands:

kubectl get pods -n kserve
kubectl get pods -n istio-system
kubectl get pods -n knative-serving

You should see output confirming all components are operational:

NAME                                        READY   STATUS    RESTARTS   AGE
kserve-controller-manager-86869697f-mcgrd   2/2     Running   0          20h

NAME                                    READY   STATUS    RESTARTS   AGE
istio-ingressgateway-698fff54fb-bbqh7   1/1     Running   0          20h
istiod-7fdcb55c9c-qtwf5                 1/1     Running   0          20h

NAME                                    READY   STATUS    RESTARTS   AGE
activator-5967d4d645-fgfhw              1/1     Running   0          20h
autoscaler-598c65f5bc-9pdt4             1/1     Running   0          20h
autoscaler-hpa-5b45c655dc-hx4qd         1/1     Running   0          20h
controller-7cf55b567b-x45bn             1/1     Running   0          20h
knative-operator-76b6894f45-58xlt       1/1     Running   0          20h
net-istio-controller-54b458f57b-7cqj7   1/1     Running   0          20h
net-istio-webhook-7bc64cfff6-mslz9      1/1     Running   0          20h
operator-webhook-565c994ff9-f7hzq       1/1     Running   0          20h
webhook-7f575896d6-gc4qc                1/1     Running   0          20h

Create Registry Credentials

KServe needs credentials to pull your ModelKit from Jozu. To set up these credentials in your project directory, create a file called kitops-jozu-secret.yaml and add the following:

apiVersion: v1
kind: Secret
metadata:
  name: jozu-registry-secret
type: Opaque
data:
  KIT_USER: <YOUR USERNAME ENCODED IN BASE 64>
  KIT_PASSWORD: <YOUR PASSWORD ENCODED IN BASE 64>

Replace the base64-encoded values with your own Jozu credentials. You can encode your username and password by running:

echo -n "your-username" | base64 echo -n "your-password" | base64 
Enter fullscreen mode Exit fullscreen mode
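Alternatively, you can let kubectl handle the encoding and create the same jozu-registry-secret directly from literals; if you take this route, you can skip applying kitops-jozu-secret.yaml in the deployment step:

kubectl create secret generic jozu-registry-secret \
  --from-literal=KIT_USER=your-username \
  --from-literal=KIT_PASSWORD=your-password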

Serving the Model with KServe

Now that our infrastructure is ready and our ModelKit is in the registry, let's deploy it with KServe. This section walks through configuring KServe to pull ModelKits, defining the inference service, and making predictions against the deployed endpoint.

Configure the Storage Initializer

KServe uses storage initializers to fetch model artifacts from registries before starting the inference container. We need to tell KServe how to pull ModelKits using the KitOps storage initializer. To do this, create a file called kitops-storage-initializer.yaml:

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterStorageContainer
metadata:
  name: kitops
spec:
  container:
    name: storage-initializer
    image: ghcr.io/kitops-ml/kitops-kserve:latest
    imagePullPolicy: Always
    env:
      - name: KIT_UNPACK_FLAGS
        value: ""
      - name: KIT_USER
        valueFrom:
          secretKeyRef:
            name: jozu-registry-secret
            key: KIT_USER
            optional: true
      - name: KIT_PASSWORD
        valueFrom:
          secretKeyRef:
            name: jozu-registry-secret
            key: KIT_PASSWORD
            optional: true
    resources:
      requests:
        memory: 100Mi
        cpu: 100m
      limits:
        memory: 1Gi
  supportedUriFormats:
    - prefix: kit://

This ClusterStorageContainer defines a custom storage initializer that understands kit:// URIs. When KServe sees a storageUri starting with kit://, it uses this initializer to authenticate with Jozu (via the credentials in the jozu-registry-secret Secret), pull the ModelKit, unpack it, and mount the model artifacts into the inference container. The resource limits ensure the initializer doesn't consume too much memory during the download and unpacking phase.

Create the InferenceService

An InferenceService is KServe's core resource for deploying models. It handles routing, autoscaling, canary deployments, and connects your model to a scalable serving runtime. Create a file called kitops-kserve-inference.yaml:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: corporate-speak-model-tensorflow
spec:
  predictor:
    model:
      modelFormat:
        name: tensorflow
      resources:
        requests:
          cpu: "250m"
          memory: "1Gi"
        limits:
          cpu: "500m"
          memory: "2Gi"
      storageUri: kit://jozu.ml/<username>/<model-kit-name>:<version>

Replace the storageUri with your actual ModelKit reference from Jozu (username, repository name, and tag). The modelFormat: tensorflow tells KServe to use the TensorFlow serving runtime, while the resource requests and limits ensure your model has enough CPU and memory to handle inference without monopolizing cluster resources.

Deploy the Service

Apply all three manifests to your cluster:

kubectl apply -f kitops-jozu-secret.yaml
kubectl apply -f kitops-storage-initializer.yaml
kubectl apply -f kitops-kserve-inference.yaml

If successful, you'll see:

secret/jozu-registry-secret created
clusterstoragecontainer.serving.kserve.io/kitops created
inferenceservice.serving.kserve.io/corporate-speak-model-tensorflow created

The deployment takes a few minutes as KServe pulls the ModelKit, unpacks it, and starts the inference pod. You can monitor the progress with:

kubectl get pods 

Wait until you see your predictor pod running:

NAME                                                              READY   STATUS    RESTARTS   AGE
corporate-speak-model-tensorflow-predictor-00001-deploymenwcc2n   2/2     Running   0          2m
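You can also check the InferenceService resource itself; once its READY column reports True, the endpoint is ready to serve:

kubectl get inferenceservice corporate-speak-model-tensorflow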

Access the Inference Endpoint

Once the pod is running, find the service endpoint. You can do this by running:

kubectl get services | grep corporate-speak-model-tensorflow 

You'll see several services created by KServe:

corporate-speak-model-tensorflow                           ExternalName   <none>           knative-local-gateway.istio-system.svc.cluster.local   <none>                                               20h
corporate-speak-model-tensorflow-predictor                 ExternalName   <none>           knative-local-gateway.istio-system.svc.cluster.local   80/TCP                                               20h
corporate-speak-model-tensorflow-predictor-00001           ClusterIP      10.103.234.235   <none>                                                  80/TCP,443/TCP                                       20h
corporate-speak-model-tensorflow-predictor-00001-private   ClusterIP      10.104.180.43    <none>                                                  80/TCP,443/TCP,9090/TCP,9091/TCP,8022/TCP,8012/TCP   20h

For local testing, forward the private service to your machine:

kubectl port-forward service/corporate-speak-model-tensorflow-predictor-00001-private 8080:80 

You should see:

Forwarding from 127.0.0.1:8080 -> 8012
Forwarding from [::1]:8080 -> 8012

Now you can test your inference service.

Testing the Deployment with Tokenized Input

Before testing, it's important to know that KServe's standard TensorFlow serving runtime expects numerical tensors that correspond to the model's signature. Since our T5 model was fine-tuned using token IDs, we must tokenize the input locally before sending the request.

First, you'll need a quick script to generate the correct numerical payload. Create a temporary Python script called generate_payload.py in your project root to handle the tokenization and write out the JSON payload:

import os
import json
import tensorflow as tf  # Required for Tensors
from transformers import T5Tokenizer

# The script lives in the project root, so the saved model directory is ./1
MODEL_SAVE_PATH = os.path.join(os.path.dirname(os.path.abspath(__file__)), "1")
tokenizer = T5Tokenizer.from_pretrained(MODEL_SAVE_PATH)

MAX_LENGTH = 128
term = "Synergy"  # You can change the term here
prompt = f"term: {term}"  # T5 was trained to expect this prefix

inputs = tokenizer(
    prompt,
    return_tensors='tf',
    max_length=MAX_LENGTH,
    padding='max_length',
    truncation=True
)

input_ids_list = inputs['input_ids'][0].numpy().tolist()
attention_mask_list = inputs['attention_mask'][0].numpy().tolist()

payload = {
    "instances": [
        {
            "input_ids": input_ids_list,
            "attention_mask": attention_mask_list  # KServe needs both for attention
        }
    ]
}

with open('test_payload.json', 'w') as f:
    json.dump(payload, f, indent=2)

In a new terminal, run the script to create the file:

python3 generate_payload.py

Now, use curl to send the generated test_payload.json file to the KServe endpoint.

curl -X POST http://localhost:8080/v1/models/corporate-speak-model-tensorflow:predict \
  -H "Content-Type: application/json" \
  -d @test_payload.json

KServe will route the request containing the numerical IDs to the TensorFlow serving runtime, which passes it directly to the T5 model's generation function. You should see a JSON response with the decoded meaning:

{ "predictions": [ { "output": "Synergy" } ] } 
Enter fullscreen mode Exit fullscreen mode

Scaling and Securing Your Deployment

Running a model in production requires thinking beyond basic functionality. As time goes on you will need autoscaling to handle traffic spikes, resource limits to prevent runaway costs, and security measures to protect your models and data. KServe and KitOps give you the tools to handle all of this without the need to build custom infrastructure.

Autoscaling with KServe

KServe integrates with Knative Serving to provide automatic scaling based on request load. By default, your InferenceService will scale down to zero replicas when idle and scale up as traffic increases. You can customize this behavior by adding autoscaling annotations to your InferenceService manifest.

To do this, edit your kitops-kserve-inference.yaml to include autoscaling configuration:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: corporate-speak-model-tensorflow
  annotations:
    autoscaling.knative.dev/target: "10"
    autoscaling.knative.dev/minScale: "1"
    autoscaling.knative.dev/maxScale: "5"
spec:
  predictor:
    model:
      modelFormat:
        name: tensorflow
      resources:
        requests:
          cpu: "250m"
          memory: "1Gi"
        limits:
          cpu: "500m"
          memory: "2Gi"
      storageUri: kit://jozu.ml/<username>/<model-kit-name>:<version>

The target annotation sets the concurrency target per pod (10 requests), minScale ensures at least one pod is always running for faster response times, and maxScale caps the maximum number of replicas to 5, preventing runaway scaling costs. Knative will automatically add or remove pods based on incoming traffic patterns.
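To watch the autoscaler react, you can send a burst of concurrent requests through the port-forward from earlier and watch the pod count. This is only a rough smoke test (the port-forward pins all traffic to a single pod, so new replicas won't receive these requests), but it's enough to see scale-up happen:

# Fire 50 concurrent requests at the forwarded endpoint
for i in $(seq 1 50); do
  curl -s -X POST http://localhost:8080/v1/models/corporate-speak-model-tensorflow:predict \
    -H "Content-Type: application/json" \
    -d @test_payload.json > /dev/null &
done
wait

# Watch the predictor pods scale (Ctrl+C to stop)
kubectl get pods -w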

Resource Management

The resource limits in your InferenceService prevent a single model from consuming all cluster resources. The requests section tells Kubernetes how much CPU and memory to reserve, while limits sets the maximum the pod can use. For production deployments, you can tune these values based on your model's actual memory footprint and inference latency requirements.

If you're running multiple models, consider creating separate namespaces for isolation:

kubectl create namespace production-models
kubectl apply -f kitops-kserve-inference.yaml -n production-models

This keeps production models separate from staging or experimental deployments and makes it easier to apply different resource quotas and network policies per environment.
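For example, a ResourceQuota like the following caps the total CPU and memory that all models in the production-models namespace can claim. This is a minimal sketch; adjust the numbers to match your cluster's capacity:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-models-quota
  namespace: production-models
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi

Save it to a file (say, production-models-quota.yaml), apply it with kubectl apply -f, and Kubernetes will reject any new pod that would push the namespace past these totals.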

Securing ModelKits with Cosign

ModelKit signing ensures that the artifacts you deploy haven't been tampered with between packaging and deployment. You can use Cosign to sign your ModelKits immediately after pushing them to Jozu:

cosign generate-key-pair
cosign sign jozu.ml/<username>/<model-kit-name>:<version> --key cosign.key

This creates a cryptographic signature attached to your ModelKit. In production, you can configure KServe to verify signatures before pulling models, rejecting any unsigned or tampered artifacts. The signature verification happens during the storage initialization phase, before the model ever loads into memory.
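Anyone consuming the ModelKit, or an admission policy in your cluster, can then check the signature against the public key generated above:

cosign verify jozu.ml/<username>/<model-kit-name>:<version> --key cosign.pub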

Model Versioning and Rollback

One of KitOps' biggest advantages is version control for models. Every ModelKit you push to Jozu is immutable and tagged. If a new model version causes issues in production, rolling back is as simple as updating the storageUri in your InferenceService:

storageUri: kit://jozu.ml/<username>/<model-kit-name>:<the-previous-version> 

Note: When a ModelKit is pushed to Jozu, it is automatically run through 5 different vulnerability scanning tools to ensure that your model is safe and secure. Jozu also creates a downloadable audit log, consisting of the model’s complete lineage.

Apply the change, and KServe will perform a blue-green deployment, spinning up new pods with the old model version while draining traffic from the problematic version. You can also use KServe's canary deployment features to test new model versions with a percentage of traffic before fully rolling out:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: corporate-speak-model-tensorflow
spec:
  predictor:
    model:
      modelFormat:
        name: tensorflow
      storageUri: kit://jozu.ml/<username>/<model-kit-name>:<a-new-version>
    canaryTrafficPercent: 20

This routes 20% of traffic to the new model while keeping 80% on the stable version. Monitor metrics, and if everything looks good, increase the percentage until you're confident enough to promote the canary to full production.
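When you're ready to promote, the usual pattern is to remove canaryTrafficPercent from the spec (or raise it to 100) and re-apply the manifest. A minimal sketch using a JSON patch to drop the field:

kubectl patch inferenceservice corporate-speak-model-tensorflow \
  --type=json -p='[{"op": "remove", "path": "/spec/predictor/canaryTrafficPercent"}]'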

Wrapping Up

Having a good model isn't enough to serve machine learning applications at scale. The combination of KitOps, Kubeflow, KServe, and Jozu brings software development best practices, like containerization, version control, and automated scaling, into the ML workflow. KitOps standardizes your LLM into a portable ModelKit for reproducible packaging and security, while KServe handles reliable, production-grade serving and automated scaling on Kubernetes, eliminating the need for custom engineering.

This guide demonstrated how to build a TensorFlow LLM, package it with KitOps, push it to an OCI registry, and deploy it using KServe on Kubernetes. The steps covered key operational patterns like configuring autoscaling, securing ModelKits with signatures, managing resource allocation across environments, and performing deployment rollbacks. This consistent methodology scales effortlessly from development environments like Minikube to high-volume production clusters like EKS, GKE, or on-premises systems.

To learn more about KitOps visit kitops.org. To try Jozu Hub in your private environment, you can contact the Jozu team to start a free two-week POC.
