from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# load the fine-tuned checkpoint and its tokenizer
model_path = 'path/to/your/finetuned/model/checkpoint'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

SYSTEM_PROMPT = """You are a helpful recipe assistant. You are to extract the generic ingredients from each of the recipes provided."""

USER_PROMPT = """Title: Lemon Drizzle Cake

Ingredients: ["200g unsalted butter", "200g caster sugar", "4 eggs", "200g self-raising flour", "1 tsp baking powder", "zest of 1 lemon", "100ml lemon juice", "150g icing sugar"]

Generic ingredients:"""

conversation = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": USER_PROMPT},
]

# format and tokenize the prompt with the model's chat template
inputs = tokenizer.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# generate a completion and decode only the newly generated tokens
outputs = model.generate(**inputs, max_new_tokens=1000, use_cache=False)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))