Understanding Core ML conversion of Llama 2 7B

#4
by kharish89 - opened

Could you kindly provide more details on the hardware used and the conversion process, in a blog or guide style? Many community members could benefit from the learnings.

Core ML Projects org

I'll publish a guide focused on conversion in a few days!

In addition, we need to provide some of the pieces required to perform text generation with the converted model: tokenizers, text generation strategies, etc. Working on it!

Thanks @pcuenq , I'm looking forward to it ❤️
Where will you publish it?

That would be amazing! TIA.

Thanks for the work

Hi! Thanks a lot for your work! Where can the guide be found?

Amazing work! Is there a variant of the diffusers app adapted for querying Llama 2?

Core ML Projects org

In case you didn't see it, we published swift-transformers and this post a couple of weeks ago: https://huggingface.co/blog/swift-coreml-llm

Please, let us know if that's helpful, or if you'd like us to dive in more depth on any of the topics :)

Complete noob here, but would it be possible to show how to run the Core ML model? I am attempting to build a stock app that can process news and give a summary on a stock, but when I load the model, it requires an attention mask, and the inputs are in the form of an integer array. I'm not sure how to use the tokenizer with Core ML for this.

Thanks for sharing!

Core ML Projects org
edited Aug 25, 2023

@Ovats Perhaps this section in the blog post could help! It covers how to do tokenization in Swift with swift-transformers.

import Tokenizers

func testTokenizer() async throws {
    let tokenizer = try await AutoTokenizer.from(pretrained: "pcuenq/Llama-2-7b-chat-coreml")
    let inputIds = tokenizer("Today she took a train to the West")
    assert(inputIds == [1, 20628, 1183, 3614, 263, 7945, 304, 278, 3122])
}

The swift-transformers library is still new though, and @pcuenq will be making improvements to it to make it even easier! Perhaps he can add some extra context here too.

Core ML Projects org

Hi @Ovats !

The swift-transformers library will deal with many of those details automatically. I would recommend you take a look at the swift-chat example app. It simply calls generate with a prompt and a configuration object, and swift-transformers does the rest. Under the hood, it will:

  • Tokenize the prompt, using code similar to what @Xenova posted above.
  • Invoke the model repeatedly, because language models produce one token at a time. For example, the greedySearch generation method uses a loop to get the most probable token each time, and it appends it to the output.
  • Prepare a suitable attention mask when necessary (not all models require it).
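The steps in the list above can be sketched in plain Swift. This is only an illustration of a greedy loop, not the swift-transformers implementation: the `NextTokenLogits` closure, the `greedySearch` signature, and the toy model are all hypothetical stand-ins for a real Core ML model and its tokenizer output.

```swift
// Hypothetical stand-in for a Core ML language model: it takes the token ids
// seen so far plus an attention mask, and returns one logit per vocabulary
// entry for the next token.
typealias NextTokenLogits = (_ inputIds: [Int], _ attentionMask: [Int]) -> [Float]

/// Index of the largest logit, i.e. the most probable next token.
func argmax(_ logits: [Float]) -> Int {
    var best = 0
    for (index, value) in logits.enumerated() where value > logits[best] {
        best = index
    }
    return best
}

/// Greedy search: invoke the model repeatedly, appending the top token each step.
func greedySearch(
    promptIds: [Int],
    maxNewTokens: Int,
    eosTokenId: Int,
    model: NextTokenLogits
) -> [Int] {
    var tokens = promptIds
    for _ in 0..<maxNewTokens {
        // No padding in this sketch, so every position is attended to.
        let attentionMask = [Int](repeating: 1, count: tokens.count)
        let next = argmax(model(tokens, attentionMask))
        tokens.append(next)
        // Stop early once the end-of-sequence token is produced.
        if next == eosTokenId { break }
    }
    return tokens
}

// Toy model over a 5-token vocabulary: always predicts (last token + 1) % 5.
let toyModel: NextTokenLogits = { inputIds, _ in
    var logits = [Float](repeating: 0, count: 5)
    logits[((inputIds.last ?? 0) + 1) % 5] = 1
    return logits
}

let output = greedySearch(promptIds: [0, 1], maxNewTokens: 10, eosTokenId: 4, model: toyModel)
print(output) // prints "[0, 1, 2, 3, 4]"
```

In a real app, the prompt ids would come from the tokenizer and the closure would wrap a call to the converted model; the generated ids would then be decoded back to text.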

Please, let us know if that helps!

@pcuenq

How can we use this model in swift-chat and target the ANE?

There's a great ANE repo here that discusses ways to get a model running on the ANE, but it doesn't appear to be guaranteed. Whether a model actually uses the ANE is largely a black box. But you can try!

https://github.com/hollance/neural-engine
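One thing you can do on the Core ML side is ask for the Neural Engine when loading the model. Below is a minimal sketch, assuming macOS 13 / iOS 16 or later and a compiled model on disk; the function names and `modelURL` are placeholders, and the setting is a request that Core ML may or may not honor for each layer.

```swift
import CoreML
import Foundation

/// Builds a configuration that asks Core ML to use the CPU and Neural Engine only.
@available(macOS 13.0, iOS 16.0, *)
func aneConfiguration() -> MLModelConfiguration {
    let configuration = MLModelConfiguration()
    // A preference, not a guarantee: unsupported layers still fall back to the CPU.
    configuration.computeUnits = .cpuAndNeuralEngine
    return configuration
}

/// Loads a compiled model (.mlmodelc) with the configuration above.
/// `modelURL` is a placeholder for wherever your compiled model lives.
@available(macOS 13.0, iOS 16.0, *)
func loadModelPreferringANE(at modelURL: URL) throws -> MLModel {
    return try MLModel(contentsOf: modelURL, configuration: aneConfiguration())
}
```

Profiling the loaded model in Xcode is the usual way to check which compute units were actually used.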

@pcuenq do you have the code or scripts you used to convert the llama2-hf model to Core ML?

Currently getting stuck here: https://github.com/huggingface/exporters/issues/76 (as well as any Llama2 based model)
