Any plans to release a llama.cpp-compatible Saiga version?

#1
by fikavec - opened

I don't have a GPU, but I really liked both the speed and the answers of "llama_13b_ru_turbo_alpaca_lora_llamacpp" on CPU. I would like to try Saiga as well. Can you tell me how to merge this model with LLaMA 13B and then convert it to ggml format and quantize it to 4 bits without a GPU? Thank you very much for the models!

Hi!
For previous models, the procedure was this:

  1. Converting to a native format with https://github.com/tloen/alpaca-lora/blob/main/export_state_dict_checkpoint.py (see the sketch after this list for an equivalent merge)
  2. Converting to ggml fp16 with convert.py from https://github.com/ggerganov/llama.cpp
  3. Quantization with "quantize" binary from https://github.com/ggerganov/llama.cpp
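
For step 1, here is a minimal Python sketch of an equivalent merge using peft's merge_and_unload instead of the export script; the local paths and the adapter repo id below are assumptions, so adjust them to your setup:

```python
# Sketch: merge a Saiga LoRA adapter into the base LLaMA weights with peft,
# as an alternative to alpaca-lora's export_state_dict_checkpoint.py.
# Paths and the adapter repo id are assumptions; adjust to your setup.
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

base = LlamaForCausalLM.from_pretrained(
    "path/to/llama-13b-hf",            # HF-format LLaMA 13B (assumed path)
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
model = PeftModel.from_pretrained(base, "IlyaGusev/saiga_13b_lora")  # assumed adapter id
model = model.merge_and_unload()       # bake the LoRA deltas into the base weights
model.save_pretrained("saiga-13b-merged")

tokenizer = LlamaTokenizer.from_pretrained("path/to/llama-13b-hf")
tokenizer.save_pretrained("saiga-13b-merged")

# Steps 2-3: run llama.cpp's convert.py on "saiga-13b-merged" to get a ggml
# fp16 file, then the "quantize" binary to produce a 4-bit model.
```

This should run without a GPU given enough RAM; if your PyTorch build does not support half-precision matmuls on CPU, load in torch.float32 instead.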

However, I had some strange problems with the latest version of the "llama_13b_ru_turbo_alpaca_lora" model, and I'm not sure it will work as expected for this model.
I do plan to test it myself, but probably only after releasing the versions with OpenAssistant data integration.

Thanks, I was able to merge and convert "saiga_13b_lora" to fp16 (a 25 GB .bin model) and fp32 (a 55 GB .bin model).
When testing, it turned out that if I ask my questions directly, without a prompt template, the results are unstable and much worse than with the following prompt:
<start>system\nТы — Сайга, русскоязычный автоматический ассистент. Ты разговариваешь с людьми и помогаешь им. <end>\n<start>user\nMY QUESTION HERE<end>\n<start>
However, if I use the recommended prompt format above (with MY QUESTION HERE replaced by my question), the model does not stop after answering and continues, up to n_predict ("max_new_tokens": 1024 from generation_config.json), generating random tasks and dialogues in the format:

<start>user\nRANDOM HERE<end>\n
<start>bot\nRANDOM HERE<end>\n
...and so on, up to 1024 tokens.

Is there any way to make the model stop generating after answering my question, instead of continuing with a random, invented continuation of the dialogue?
The model itself seemed more knowledgeable than "llama_13b_ru_turbo_alpaca_lora_llamacpp". The latter works very well and is recommended on https://github.com/tloen/alpaca-lora (strangely, only the 7B version, not the 13B), and it has no problem with excessive generation, although its max_new_tokens is not enough for code or long answers, which break off mid-sentence.

Is there any way to make the model stop generating after answering my question, instead of continuing with a random, invented continuation of the dialogue?

The current version requires an EOS token different from the default one. Please refer to the Colab notebook; there is an example of how to use it. I'll fix this in the next version.
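
If you are running the quantized model through llama-cpp-python, here is a minimal sketch of cutting generation off at the <end> marker; the model file name, the trailing "bot" role, and the exact stop strings are assumptions based on the prompt format quoted above, not the settings from the Colab:

```python
# Sketch: stop generation at the <end> marker instead of running to max_new_tokens.
# The file name, the "bot" role suffix, and the stop strings are assumptions
# based on the prompt format quoted earlier in this thread.
from llama_cpp import Llama

llm = Llama(model_path="saiga-13b-q4_0.bin", n_ctx=2048)  # assumed file name

prompt = (
    "<start>system\nТы — Сайга, русскоязычный автоматический ассистент. "
    "Ты разговариваешь с людьми и помогаешь им.<end>\n"
    "<start>user\nMY QUESTION HERE<end>\n"
    "<start>bot\n"
)

out = llm(prompt, max_tokens=1024, stop=["<end>", "<start>"])
print(out["choices"][0]["text"])
```

The same idea applies to the llama.cpp "main" binary via its reverse-prompt option, or to transformers via a custom stopping criterion on the new EOS token.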

IlyaGusev changed discussion status to closed
