Instructions needed 🙏

#1
by tomasmcm - opened

Thanks for uploading the model converted to ONNX!
But how do we run it? If I use the same code from https://huggingface.co/Xenova/Qwen1.5-0.5B-Chat but change the model to Xenova/OpenELM-450M-Instruct, I get "Error: Unsupported model type: openelm" at AutoModelForCausalLM.from_pretrained.
Do I need to set it to a different model type manually?

Hi there! This is currently experimental in the v3 branch, but I can't seem to get good output from these models (in either Python or JavaScript, which produce identical results).

If you have time, would you mind putting together a version in Python (with AutoModelForCausalLM.from_pretrained and the chat templating) that works well? That would help a ton! :)
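
Something along these lines is roughly what I mean (an untested sketch: it assumes the original apple/OpenELM-450M-Instruct weights with trust_remote_code, and that the tokenizer and chat template from this repo also load in Python; swap in whichever tokenizer the config actually expects):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: the tokenizer (and its chat template) from this ONNX repo also loads in Python;
# if not, the OpenELM model card pairs the model with the Llama 2 tokenizer instead.
tokenizer = AutoTokenizer.from_pretrained("Xenova/OpenELM-450M-Instruct")

# Assumption: the original PyTorch weights from the apple repo, which require trust_remote_code.
model = AutoModelForCausalLM.from_pretrained("apple/OpenELM-450M-Instruct", trust_remote_code=True)

chat = [
    {"role": "user", "content": "Tell me a joke"},
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False)

# The template already inserts <s>, so don't add special tokens a second time here.
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)

outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))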

Hello! Yes, in the meantime I found this discussion, which also mentions the v3 branch: https://huggingface.co/Xenova/Phi-3-mini-4k-instruct/discussions/2

I was able to get this to generate using this code and the v3 branch:

import { env, AutoModelForCausalLM, AutoTokenizer } from '@xenova/transformers';

// Disable the loading of remote models from the Hugging Face Hub:
env.allowRemoteModels = false;

// Specify a custom location for models (defaults to '/models/').
env.localModelPath = './models/';

// Model to load (the tokenizer and model are created directly rather than via a pipeline)
const model_id = 'Xenova/OpenELM-450M-Instruct';

const tokenizer = await AutoTokenizer.from_pretrained(model_id, {
  legacy: true, // TODO: update config
});

const chat = [
  { "role": "user", "content": "Tell me a joke" },
];
const prompt = tokenizer.apply_chat_template(chat, { tokenize: false });
console.log(prompt);
// <s>[INST] Tell me a joke [/INST]

const inputs = tokenizer(prompt);

const model = await AutoModelForCausalLM.from_pretrained(model_id, {
  dtype: 'q8',
  // device: 'webgpu', // NOTE: webgpu produces incorrect results
  // use_external_data_format: true,
  revision: 'refs/pr/3',
});

(async () => {
  const start = performance.now();
  const outputs = await model.generate({
    ...inputs,
    max_new_tokens: 128,
    temperature: 0.8, // NOTE: only takes effect if sampling is enabled (e.g. do_sample: true); greedy decoding ignores it
  });
  const end = performance.now();
  console.log(tokenizer.decode(outputs[0], { skip_special_tokens: true }));
  /*
[INST] Tell me a joke [/INST]
This is a joke I wrote a while back that I thought you guys might enjoy:
A man walks into a bar.
"Hey guys, I've got a story for you."
The barkeep: "Alright, tell me a joke."
Man: "My first job out of college, I worked at a bar for a couple months. One night, a couple of guys were hanging out with their friends, and they decided to go out for a drink. They ordered a couple of beers, and as they were sipping their drinks,
  */

  console.log('Execution Time:', end - start);
})()

However, it does not seem to work very well once you start adding more messages to the history. I'm not sure the model actually uses the Llama template as it's defined in the config.
At some point it started hallucinating and replied Your name is [NAME] and you're a [SYS] [INST] [TYPE] [CLASS] [VERSION] [VERSION_NUMBER] [VERSION_STRING] [VERSION_NUMBER_LATEST] [VERSION_DATE] [VERSION_DATE_LATEST] [REVISION_NUMBER] [REVISION_NUMBER_LATEST] [REVISION_DATE] [REVISION_DATE_LATEST] [REVISION_VERSION] [REVISION_VERSION_LATEST] [REVISION_NUMBER_LATEST] [REVISION_NUMBER_LATEST_VERSION] [REVISION_VERSION_STRING] [REVISION_VERSION_STRING_LATEST] [REVISION_VERSION_STRING_LATEST_VERSION] [REVISION_DATE_STRING] [REVISION_DATE_STRING_LATEST] [REVISION_DATE_STRING_LATEST_VERSION] [REVISION_VERSION_STRING_LATEST_VERSION] [REVISION_VERSION_STRING_LATEST_VERSION_LATEST]
So it seems it does not understand [SYS] and [INST] as dividers 🤔
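
For reference, this is roughly what the standard Llama 2 chat template is supposed to produce for a multi-turn history (sketched in Python; the expected output below is the generic Llama 2 format, which I'm assuming matches the template in this repo's config — and note that Llama 2 marks system prompts with <<SYS>> ... <</SYS>> inside the first [INST] block rather than a standalone [SYS] tag):

from transformers import AutoTokenizer

# Assumption: the chat template defined in this repo's tokenizer config is the Llama 2 one.
tokenizer = AutoTokenizer.from_pretrained("Xenova/OpenELM-450M-Instruct")

chat = [
    {"role": "user", "content": "Tell me a joke"},
    {"role": "assistant", "content": "Why did the chicken cross the road?"},
    {"role": "user", "content": "I don't know, why?"},
]
print(tokenizer.apply_chat_template(chat, tokenize=False))
# Expected with the standard Llama 2 template:
# <s>[INST] Tell me a joke [/INST] Why did the chicken cross the road? </s><s>[INST] I don't know, why? [/INST]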

Right, and this is behaviour I've noticed in the Python versions too. There's a bit of confusion around what the correct chat template is.
