Bug in tokenize()/detokenize()/tokenize() cycle

#9 · opened by riedgar-ms

The tokenizer for this model doesn't seem to work correctly with multibyte characters that encode to multiple tokens. Specifically:

from huggingface_hub import hf_hub_download

import llama_cpp

repo_id = "bartowski/Meta-Llama-3-8B-Instruct-GGUF"
filename = "Meta-Llama-3-8B-Instruct-IQ3_S.gguf"

downloaded_file = hf_hub_download(repo_id=repo_id, filename=filename)

llama_model = llama_cpp.Llama(model_path=downloaded_file, n_ctx=4096)

print("\n===========================\n")

# A single character whose UTF-8 encoding is three bytes
sample_string = "歪"

sample_bytes = sample_string.encode()
print(f"{sample_bytes=}")

# The three bytes are tokenized to two tokens
tokens = llama_model.tokenize(sample_bytes, add_bos=False, special=True)
print(f"{tokens=}")

# Detokenizing just the first token yields an incomplete UTF-8 sequence
tokenizer = llama_cpp.LlamaTokenizer(llama_model)
first_token = tokenizer.detokenize([tokens[0]])
print(f"{first_token=}")

# Feeding that partial sequence back into tokenize() crashes
tokens_2 = tokenizer.tokenize(first_token, add_bos=False, special=True)
print(f"{tokens_2=}")

This results in a segfault (or OS-equivalent failure) on the final tokenizer.tokenize() call.

The output prior to the segfault is:

sample_bytes=b'\xe6\xad\xaa'
tokens=[15722, 103]
first_token=b'\xe6\xad'

So we can see that sample_string encodes to three bytes, which are represented by two tokens. We can get the byte representation of the first token (which happens to be the first two bytes of the three) and then try to tokenize() just that, but that call fails.
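
To illustrate why those two bytes are awkward on their own (a standalone check, independent of llama_cpp), they do not form a complete UTF-8 sequence:

# b'\xe6\xad' is only the first two bytes of the three-byte encoding of '歪',
# so strict UTF-8 decoding rejects it as incomplete.
partial = b'\xe6\xad'
try:
    partial.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # "unexpected end of data"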

I had expected the final 'print' to be tokens_2=[15722].
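
For reference, here is a caller-side workaround sketch (the detokenize_stream helper is my own, not part of llama_cpp_python): buffer the detokenized bytes with an incremental UTF-8 decoder so an incomplete sequence like b'\xe6\xad' is never handled on its own.

import codecs

# Hypothetical helper: detokenize token by token, but only emit text once the
# accumulated bytes form complete UTF-8 characters.
def detokenize_stream(tokenizer, tokens):
    decoder = codecs.getincrementaldecoder("utf-8")()
    for token in tokens:
        chunk = decoder.decode(tokenizer.detokenize([token]))
        if chunk:
            yield chunk

print(list(detokenize_stream(tokenizer, tokens)))  # should print ['歪'] given the output above

This sidesteps the crash in the repro, but the underlying tokenize()/detokenize()/tokenize() asymmetry still looks like a bug to me.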

I'm using Python 3.12, with llama_cpp_python 0.2.83 on Windows and llama_cpp_python 0.2.82 on Linux.

NB: this was originally reported by a Guidance user; I have adapted their repro case above.
