Vocabulary

#4
by NEDIX - opened

Since this project has moved to the Llama 2 architecture, I have been trying to convert this model to the LLaMA GGML format.

I am currently at a dead end because the get_vocab and save_vocabulary methods in tokenization_codegen25.py do not work. Calling get_vocab fails because some vocabulary entries cannot be decoded with the utf-8 encoding that the method specifies.

Possible solutions:
a. Change the encoding on line 169 of tokenization_codegen25.py from utf-8 to latin-1.
b. In the next version of this model, filter non-UTF-8 entries out of the vocabulary.
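
To illustrate both options, here is a minimal Python sketch. It does not use the real tokenizer: the token bytes are made-up examples of the kind of entry I believe is causing the failure (fragments of multi-byte characters), and build_vocab and is_valid_utf8 are hypothetical helper names, not functions from tokenization_codegen25.py.

```python
# Hypothetical token bytes, NOT taken from the model's vocabulary.
# The last two entries are not valid UTF-8 on their own.
token_bytes = [b"def", b" return", b"\xe2\x82", b"\xff"]


def build_vocab(tokens, encoding):
    """Map each decoded token string to its index (hypothetical helper)."""
    return {raw.decode(encoding): idx for idx, raw in enumerate(tokens)}


def is_valid_utf8(raw):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The failure: strict UTF-8 decoding raises on the invalid entries.
try:
    build_vocab(token_bytes, "utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Option a: latin-1 maps every byte to exactly one code point, so decoding
# never fails and the original bytes can be recovered by re-encoding.
latin1_vocab = build_vocab(token_bytes, "latin-1")
round_trip = [token.encode("latin-1") for token in latin1_vocab]

# Option b: drop the entries that are not valid UTF-8.
filtered = [raw for raw in token_bytes if is_valid_utf8(raw)]
```

One caveat with option a: the latin-1 strings are only a byte-preserving representation, so a converter would presumably need to encode them with latin-1 again to get the original token bytes back. Option b shrinks the vocabulary, which would shift token ids unless the model is retrained or the removed ids are kept as placeholders.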
