Chat template during finetuning?
Hi, did you use the original Llama-3 chat template while finetuning? The template is now missing from the tokenizer config, so it defaults to ChatML. When I apply a chat template and use the model to follow an instruction, inference times are long. Does this sound familiar?
I did not use the Llama-3 chat template; the model is trained on the ChatML template.
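Since the template is missing from the tokenizer config, you can set the standard ChatML template yourself before calling `apply_chat_template`. A minimal sketch (the model id is a placeholder for the actual finetuned checkpoint):

```python
from transformers import AutoTokenizer

# Placeholder id -- replace with the actual Dutch finetuned checkpoint.
model_id = "your-org/llama-3-8b-dutch-chatml"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# If tokenizer_config.json has no chat_template, set the standard ChatML
# Jinja template explicitly so prompts are formatted the way the model
# was trained.
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{{ '<|im_start|>' + message['role'] + '\\n' + message['content'] + '<|im_end|>' + '\\n' }}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<|im_start|>assistant\\n' }}{% endif %}"
)

messages = [{"role": "user", "content": "Vat de volgende tekst samen: ..."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```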
I don't fully understand the question: applying a chat template should not lead to longer inference time per token. Unless your conversation is very long, only the first token can take a bit longer.
Okay, thanks. I know this should not be the case. However, I use an instruction to summarize (400 tokens) and supply context (1,000 tokens). With the original Llama-3 8B model, inference finishes within seconds; with the Dutch-finetuned model it takes quite long, 30+ seconds. I will take another look later today to see if something is different in the parameters; a sketch of what I plan to check is below.
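For reference, a minimal timing sketch I intend to rerun. The model id and generation parameters are placeholders; the idea is to rule out the case where the model simply keeps generating until `max_new_tokens` because it never emits the ChatML stop token, which looks like slow inference but is really just a longer output:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/llama-3-8b-dutch-chatml"  # placeholder checkpoint id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Vat de volgende tekst samen: ..."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# For a ChatML-trained model, generation should stop at <|im_end|>.
# If it does not, the model runs to max_new_tokens on every request.
im_end_id = tokenizer.convert_tokens_to_ids("<|im_end|>")

start = time.time()
output = model.generate(inputs, max_new_tokens=512, eos_token_id=im_end_id)
new_tokens = output.shape[-1] - inputs.shape[-1]
print(f"Generated {new_tokens} tokens in {time.time() - start:.1f}s")
print(tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True))
```

Comparing the number of generated tokens between the base model and the finetune should show whether the slowdown is per-token speed or just output length.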
That is weird; the model architecture is exactly the same, only the weights are different.