How to add the llama3 prompt format to inference.py & minimal_chat.py? Is there a guide or tutorial? #437
I'm new to writing LLM scripts, so I don't understand how to correctly set up the prompt format. I'm trying to use Llama-3-70B-Instruct-exl2 2.4bpw to summarize & extract key words from long captions, which will then be used by my SDXL fine-tuner, so I don't need full chat, just a simple one-turn script. I've modified inference.py & minimal_chat.py to use it, and with a simple prompt it seems to work fine. But the Llama 3 instructions state: "This format has to be exactly reproduced for effective use." Can anyone point me towards a guide or tutorial that shows how to correctly configure the prompt format when using exllamav2?

From https://huggingface.co/blog/llama3:

> How to prompt Llama 3
>
> The base models have no prompt format. Like other base models, they can be used to continue an input sequence with a plausible continuation or for zero-shot/few-shot inference. They are also a great foundation for fine-tuning your own use cases. The Instruct versions use the following conversation structure:
>
> ```
> <|begin_of_text|><|start_header_id|>system<|end_header_id|>
>
> {{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>
>
> {{ user_msg_1 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
>
> {{ model_answer_1 }}<|eot_id|>
> ```
>
> This format has to be exactly reproduced for effective use.
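For context, here is roughly how I'd expect a single summarization turn to look if I just fill in that template myself (the system prompt and caption strings below are placeholders, and I'm not sure this is the right way to do it with exllamav2):

```python
# My guess at filling in the quoted template for one turn.
# system_prompt and caption are placeholder strings, not anything from the library.
system_prompt = "You summarize image captions and extract key words."
caption = "a long image caption goes here"

prompt = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    f"{system_prompt}<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n"
    f"{caption}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)
```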
Replies: 1 comment 1 reply
There are many ways to do it, and I guess the most "normal" way would be to format the entire chat history each round as a single text string and tokenize it all. This would work well enough, since the generator skips inference for any tokens that haven't changed when you call `begin_stream_ex`.

For the minimal example I chose to concatenate tokenized sequences instead. For Llama 3, that could look something like this:

```python
from exllamav2 import *
from exllamav2.generator import *
import sys, torch

print("Loading model...")
config = ExLlamaV2Config("/mnt/str/models/llama3-8b-instruct-exl2/4.0bpw/")
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy = True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)
generator.set_stop_conditions([tokenizer.single_id("<|eot_id|>")])  # <- Set the correct stop condition
gen_settings = ExLlamaV2Sampler.Settings()

system_prompt = "You are a duck."

prompt = "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"  # <- First round starts with BOS + system prompt
prompt += system_prompt
prompt += "<|eot_id|>"

while True:

    print()
    instruction = input("User: ")
    print()
    print("Assistant: ", end = "")

    prompt += "<|start_header_id|>user<|end_header_id|>\n\n"  # <- First prompt is system prompt plus this
    prompt += instruction
    prompt += "<|eot_id|>"
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"

    instruction_ids = tokenizer.encode(prompt, encode_special_tokens = True)  # <- Make sure control tokens are treated as such
    context_ids = instruction_ids if generator.sequence_ids is None \
        else torch.cat([generator.sequence_ids, instruction_ids], dim = -1)

    generator.begin_stream_ex(context_ids, gen_settings)

    while True:
        res = generator.stream_ex()
        if res["eos"]: break
        print(res["chunk"], end = "")
        sys.stdout.flush()

    # Note, when the loop above breaks, generator.sequence_ids contains the tokenized context plus all tokens
    # sampled so far, including whatever caused a stop condition (in this case, <|eot_id|>) even though it is
    # not returned by stream_ex() as part of the streamed response.

    print()
    prompt = ""  # <- Start next round with empty string instead of BOS + system prompt
```
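For completeness, a rough sketch of the first approach (rebuilding and re-tokenizing the whole chat history each round) might look like the following. It assumes the same model/tokenizer/generator setup and `system_prompt`/`gen_settings` as in the example above; `history` and `format_llama3` are just illustrative names, not part of the library:

```python
# Sketch of the "format everything each round" alternative. Reuses the
# tokenizer, generator, gen_settings and system_prompt defined above.
# Token reuse still applies: unchanged prefix tokens are skipped by the generator.

history = []  # list of (role, text) tuples accumulated across rounds

def format_llama3(system_prompt, history):
    # Rebuild the full Llama 3 chat string from scratch every round
    text = "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    text += system_prompt + "<|eot_id|>"
    for role, msg in history:
        text += f"<|start_header_id|>{role}<|end_header_id|>\n\n{msg}<|eot_id|>"
    text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

while True:
    print()
    instruction = input("User: ")
    print()
    print("Assistant: ", end = "")
    history.append(("user", instruction))

    context_ids = tokenizer.encode(format_llama3(system_prompt, history),
                                   encode_special_tokens = True)
    generator.begin_stream_ex(context_ids, gen_settings)

    response = ""
    while True:
        res = generator.stream_ex()
        if res["eos"]: break
        response += res["chunk"]
        print(res["chunk"], end = "")
        sys.stdout.flush()

    history.append(("assistant", response))
    print()
```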