Loading AutoTokenizer and AutoModelForCausalLM

#3 · opened by nramirezuy

First of all, thank you for converting this to GGUF; it has been a massive help on my LLM learning journey.

I already have this working with LlamaCpp, but I found this post, and apparently I can run these directly with the transformers library.

But when I try to load it with the Auto classes, I run into the following issues:

AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    model_id, gguf_file=filename, low_cpu_mem_usage=True
)
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BartTokenizer'. 
The class this function is called from is 'GPT2TokenizerFast'.
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    tokenizer = AutoTokenizer.from_pretrained(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./lib/python3.12/site-packages/transformers/models/auto/tokenization_auto.py", line 899, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 2110, in from_pretrained
    return cls._from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^
  File "./lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 2336, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./lib/python3.12/site-packages/transformers/models/gpt2/tokenization_gpt2_fast.py", line 100, in __init__
    super().__init__(
  File "./lib/python3.12/site-packages/transformers/tokenization_utils_fast.py", line 120, in __init__
    tokenizer_dict = load_gguf_checkpoint(kwargs.get("vocab_file"))["tokenizer"]
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./lib/python3.12/site-packages/transformers/modeling_gguf_pytorch_utils.py", line 81, in load_gguf_checkpoint
    reader = GGUFReader(gguf_checkpoint_path)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./lib/python3.12/site-packages/gguf/gguf_reader.py", line 85, in __init__
    self.data = np.memmap(path, mode = mode)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./lib/python3.12/site-packages/numpy/core/memmap.py", line 229, in __new__
    f_ctx = open(os_fspath(filename), ('r' if mode == 'c' else mode)+'b')
                 ^^^^^^^^^^^^^^^^^^^
TypeError: expected str, bytes or os.PathLike object, not NoneType
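
(For anyone else hitting the same NoneType error: one workaround that seems to avoid it is loading the tokenizer from the original, non-GGUF repository instead of from the GGUF file, since the vocabulary should be the same. A minimal sketch, where the repo id is just a placeholder for whichever base model this GGUF was converted from:)

from transformers import AutoTokenizer

# Placeholder: the original (non-GGUF) repo this file was converted from.
original_model_id = "org/original-model"

# No gguf_file argument here, so the GGUF metadata code path that fails above
# with "expected str, bytes or os.PathLike object, not NoneType" is never taken.
tokenizer = AutoTokenizer.from_pretrained(original_model_id)
print(tokenizer("Hello world")["input_ids"])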

AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_id, gguf_file=filename, low_cpu_mem_usage=True
)
./lib/python3.12/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(

Converting and de-quantizing GGUF tensors...:   0%|          | 0/291 [00:00<?, ?it/s]
Converting and de-quantizing GGUF tensors...: 100%|██████████| 291/291 [00:29<00:00,  9.96it/s]
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    model = AutoModelForCausalLM.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./lib/python3.12/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./lib/python3.12/site-packages/transformers/modeling_utils.py", line 3754, in from_pretrained
    ) = cls._load_pretrained_model(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./lib/python3.12/site-packages/transformers/modeling_utils.py", line 4059, in _load_pretrained_model
    raise ValueError(
ValueError: The state dictionary of the model you are trying to load is corrupted. Are you sure it was properly saved?

It looks like the binary is missing some configuration? Am I supposed to provide it in some way, or am I just missing something and this isn't supported?
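
(A sanity check that might help isolate whether the problem is this particular file or my setup: run the same two calls against a small GGUF repo. The repo id and filename below are only assumptions for illustration; substitute any small GGUF you know exists.)

from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed example repo/filename; replace with any small GGUF you have access to.
model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
filename = "tinyllama-1.1b-chat-v1.0.Q6_K.gguf"

# Both classes accept gguf_file; transformers de-quantizes the tensors on load.
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))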

I would really appreciate any help! And again, thank you for the GGUF version!

Fascinating... I've never heard of this before, thanks for sharing!

I assume that you're using one of the supported types?

For what it's worth, it LOOKS like the only real purpose of this is to get an "unquantized" version that you can use elsewhere, is that correct? "Now you have access to the full, unquantized version of the model in the PyTorch ecosystem, where you can combine it with a plethora of other tools."

If so, are you just doing this as an experiment for fun, or is there a reason you don't want to use the original safetensors variant?

Would the unquantized full version from the GGUF be as precise as using the full weights to begin with, though? I guess this would only be useful for something like the leaked Miqu that only came as a GGUF, but with access to the original weights I don't really see the point.

Yeah, I think this is just for the case where you only have access to the GGUF and want the full safetensors, and it likely won't be as accurate.

If you download an f16/f32/bf16 GGUF, then you already have the same accuracy as the full safetensors.
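
(Rough sketch of that round trip, in case it's useful; the repo id, filename, and output path are placeholders, and any precision already lost to quantization is of course not recovered:)

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholders for a GGUF-only repo and file.
model_id = "some-org/some-model-GGUF"
filename = "some-model.Q8_0.gguf"

# Loading with gguf_file de-quantizes everything into regular torch tensors...
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)

# ...so saving afterwards produces a standard safetensors checkpoint.
model.save_pretrained("./dequantized-model")
tokenizer.save_pretrained("./dequantized-model")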

I assume that you're using one of the supported types?

Yes, I was using Q8_0.

if so, are you just doing this as an experiment for fun or is there a reason you don't want to use the original safetensors variant?

I just wanted to do inference through the transformers library, mainly to use the tokenizer with LangChain. I was just unaware of what quantization meant; now I know. Thanks!
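
(For reference, something along these lines is what I had in mind: using the tokenizer as the length function for a LangChain text splitter. Just a sketch; the import path can vary between LangChain versions, the repo id is a placeholder, and the chunk sizes are arbitrary.)

from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer

# Placeholder: the original (non-GGUF) repo whose tokenizer matches the model.
tokenizer = AutoTokenizer.from_pretrained("org/original-model")

def token_len(text: str) -> int:
    # Measure length in model tokens rather than characters.
    return len(tokenizer(text)["input_ids"])

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,      # arbitrary sizes, counted in tokens via token_len
    chunk_overlap=64,
    length_function=token_len,
)
chunks = splitter.split_text("Some long document text goes here. " * 200)
print(len(chunks), token_len(chunks[0]))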

BTW, you are missing the Mathematical Reasoning Mode template -> Math Correct User: 10.3 − 7988.8133=<|end_of_turn|>Math Correct Assistant:.
Not that I care about it, but it could be useful to someone else. Looking at the python-llama-cpp code, I noticed the server is ready to pull the Jinja2 template from the GGUF metadata and use it with just a command-line argument:

  --chat_format CHAT_FORMAT
                        Chat format to use.
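
(For completeness, the Python API exposes the same knob; a rough sketch, where the model path is a placeholder and "openchat" is just an example of a built-in chat format name:)

from llama_cpp import Llama

# Path and chat_format are assumptions; chat_format picks a built-in template,
# the same thing the server's --chat_format argument controls.
llm = Llama(model_path="./model.Q8_0.gguf", chat_format="openchat")

reply = llm.create_chat_completion(
    messages=[{"role": "user", "content": "10.3 - 7988.8133 ="}],
    max_tokens=64,
)
print(reply["choices"][0]["message"]["content"])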
