Size Mismatch in safetensors file

#3
by MartialTerran - opened

# Using full precision
# pip install transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM2-135M"
device = "cuda"  # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# for multiple GPUs install accelerate and do:
# model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
inputs = tokenizer.encode("Gravity is", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))

C:\Users\User\OneDrive\Desktop\SmolLM2-135M>DIR
Volume in drive C has no label.
Volume Serial Number is C87F-B607

Directory of C:\Users\User\OneDrive\Desktop\SmolLM2-135M

12/02/2024 11:49 AM    <DIR>          .
12/02/2024 11:46 AM    <DIR>          ..
04/10/2024 11:27 PM 49 CMD_Here.bat
12/02/2024 11:46 AM 2,154,374 HuggingFaceTB_SmolLM2-135M.mhtml
12/02/2024 11:48 AM 1,289 Scripts from Huggingface.txt
12/02/2024 11:49 AM 605 SmolLM2-135M_f32.PY
4 File(s) 2,156,317 bytes
2 Dir(s) 204,501,573,632 bytes free

C:\Users\User\OneDrive\Desktop\SmolLM2-135M>python SmolLM2-135M_f32.PY
C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\huggingface_hub\file_download.py:797: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(
tokenizer_config.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3.66k/3.66k [00:00<?, ?B/s]
C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\huggingface_hub\file_download.py:139: UserWarning: huggingface_hub cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\User\.cache\huggingface\hub\models--HuggingFaceTB--SmolLM2-135M. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the HF_HUB_DISABLE_SYMLINKS_WARNING environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
warnings.warn(message)
vocab.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 801k/801k [00:00<00:00, 2.52MB/s]
merges.txt: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 466k/466k [00:00<00:00, 6.14MB/s]
tokenizer.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2.10M/2.10M [00:00<00:00, 10.6MB/s]
special_tokens_map.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 831/831 [00:00<?, ?B/s]
config.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 704/704 [00:00<?, ?B/s]
C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\huggingface_hub\file_download.py:797: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(
model.safetensors: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 269M/269M [00:23<00:00, 11.6MB/s]

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.3 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):
File "C:\Users\User\OneDrive\Desktop\SmolLM2-135M\SmolLM2-135M_f32.PY", line 8, in <module>
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\models\auto\auto_factory.py", line 484, in from_pretrained
return model_class.from_pretrained(
File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\modeling_utils.py", line 2604, in from_pretrained
state_dict = load_state_dict(resolved_archive_file)
File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\modeling_utils.py", line 461, in load_state_dict
return safe_load_file(checkpoint_file)
File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\safetensors\torch.py", line 315, in load_file
result[k] = f.get_tensor(k)
File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\storage.py", line 234, in getitem
return super().getitem(*args, **kwargs)
C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\storage.py:234: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at ..\torch\csrc\utils\tensor_numpy.cpp:84.)
return super().__getitem__(*args, **kwargs)
Traceback (most recent call last):
File "C:\Users\User\OneDrive\Desktop\SmolLM2-135M\SmolLM2-135M_f32.PY", line 8, in
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\models\auto\auto_factory.py", line 484, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\modeling_utils.py", line 2881, in from_pretrained
) = cls._load_pretrained_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\modeling_utils.py", line 3278, in _load_pretrained_model
raise RuntimeError(f"Error(s) in loading state_dict for {model.class.name}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
size mismatch for model.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([192, 576]) from checkpoint, the shape in current model is torch.Size([576, 576]).
[etcetera]
size mismatch for model.layers.29.self_attn.k_proj.weight: copying a param with shape torch.Size([192, 576]) from checkpoint, the shape in current model is torch.Size([576, 576]).
size mismatch for model.layers.29.self_attn.v_proj.weight: copying a param with shape torch.Size([192, 576]) from checkpoint, the shape in current model is torch.Size([576, 576]).
You may consider adding ignore_mismatched_sizes=True in the model from_pretrained method.

C:\Users\User\OneDrive\Desktop\SmolLM2-135M>
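
One way to confirm what the downloaded model.safetensors actually stores, independent of how transformers rebuilds the model, is to read the tensor shapes straight from the file. This is a minimal sketch, assuming the huggingface_hub and safetensors packages are installed; the tensor names are taken from the error messages above:

from huggingface_hub import hf_hub_download
from safetensors import safe_open

# Download (or reuse the cached) model.safetensors and print a few projection shapes.
path = hf_hub_download("HuggingFaceTB/SmolLM2-135M", "model.safetensors")
with safe_open(path, framework="pt") as f:
    for name in ("model.layers.0.self_attn.q_proj.weight",
                 "model.layers.0.self_attn.k_proj.weight",
                 "model.layers.0.self_attn.v_proj.weight"):
        print(name, f.get_slice(name).get_shape())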

How do I find the actual model.py script for checkpoint = "HuggingFaceTB/SmolLM2-135M"?

Response from the generative-AI assistant:
You can't directly find a standalone model.py script for a Hugging Face model hosted like that. The model architecture and code are part of the model repository itself. Here's how you can find the relevant code:

Go to the Model Page: Open your web browser and navigate to the model card page on Hugging Face Hub: https://huggingface.co/HuggingFaceTB/SmolLM2-135M

Check the "Files and versions" Tab: On the model card, look for the "Files and versions" tab (it might be a dropdown or a separate button). Click on it. This will show you a list of all the files and folders within the model repository.

Locate the Modeling File: Search for a Python file that likely contains the model definition. Common names include:

modeling_smol_lm2.py

modeling_smol.py

modeling.py

smol_lm2.py (less common but possible)
The exact file name might vary. Look for files that suggest they contain model architecture code.

View the Code: Once you've found the likely candidate file, click on it. You'll be able to view the code directly in the browser on the Hugging Face Hub.

Identify the Model Class: Within the modeling file, look for a class that defines the model architecture. It will usually inherit from a base class like transformers.PreTrainedModel, transformers.modeling_utils.PreTrainedModel, torch.nn.Module, or a custom base class specific to the model implementation. The class name will probably be something like:

SmolLM2ForCausalLM

SmolLM2Model
Pay attention to the class name and any config parameters it takes in its __init__ method.

Example (Illustrative):

Let's say you find a file named modeling_smol.py and inside it, you see this class:

from transformers import PreTrainedModel, ...

class SmolLM2ForCausalLM(PreTrainedModel):
    config_class = SmolLM2Config

    def __init__(self, config):
        # ... model architecture definition ...
        pass

    def forward(self, input_ids, ...):
        # ...
        pass

In your model.py script (the one you're writing to load and use the model), you would then use:

import torch
from transformers import AutoTokenizer
from modeling_smol import SmolLM2ForCausalLM  # assuming modeling_smol.py is in the same dir as your script

checkpoint = "HuggingFaceTB/SmolLM2-135M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

model = SmolLM2ForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16, device_map="auto", offload_folder="offload", offload_state_dict=True)

# ... rest of your code to use the model ...

Remember that you need to have the transformers library installed (pip install transformers). You would also need to download modeling_smol.py and place it in the same directory as your script (or adjust the import path accordingly) if you don't want to rely on dynamically loading the architecture from the model repository.

So, where is the SmolLM2ForCausalLM.py that supports these SmolLM2 model parameters, and that would help in debugging the AutoModel script that fails to run the model because of the size mismatch? Where? Where? Where? Why publish weights without publishing a corresponding model.py and train.py to support the use of the weights, fine-tuning, or at least debugging????

Hugging Face TB Research org

Note that SmolLM2 uses a Llama architecture, and you can find the modelling file here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py

Please let us know which version of transformers you are running when you see that error. Can you try upgrading to the latest version to see if the error persists?

pip install --upgrade transformers
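
To confirm which transformers version is actually being picked up (a minimal check; any equivalent works):

import transformers
print(transformers.__version__)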

Thank you. As I recall, generative-AI analysis suggested to me that the dimension mismatch arises from the factor-of-three(?) reduction in the number of unique k-projection rows across the attention heads (re-use of the same k-vector by groups of three query heads per layer?). So the published model safetensors file stores k-projection and v-projection matrices with one third as many rows as the model constructed on my machine expects. This is indicated by the (num_key_value_heads) hyperparameter in the config? As I understand the mismatch, the published safetensors file is shaped for Grouped-Query Attention (GQA) as implemented in modeling_llama.py (the key and value projections produce a smaller number of unique vectors (num_key_value_heads) compared to the query vectors (num_heads)), while the model instantiated locally apparently is not. As you can see, the mismatch only applies to the K and V projection matrices, but NOT to the full-sized Q projection matrix:
size mismatch for model.layers.29.self_attn.k_proj.weight: copying a param with shape torch.Size([192, 576]) from checkpoint, the shape in current model is torch.Size([576, 576]).
size mismatch for model.layers.29.self_attn.v_proj.weight: copying a param with shape torch.Size([192, 576]) from checkpoint, the shape in current model is torch.Size([576, 576]).
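
The arithmetic behind those two shapes can be checked directly from the published config. This is a minimal sketch; the attribute names are standard LlamaConfig fields, head_dim is derived rather than assumed, and the printed numbers are whatever the Hub config actually contains:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("HuggingFaceTB/SmolLM2-135M")
head_dim = config.hidden_size // config.num_attention_heads

# q_proj keeps one head_dim slice per query head; k_proj/v_proj keep one per key/value head.
print("q_proj out_features:", config.num_attention_heads * head_dim)        # the 576 in the "current model" shape
print("k_proj/v_proj out_features:", config.num_key_value_heads * head_dim) # the 192 in the checkpoint shape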

As I recall, the dimension mismatch occurred with every version of transformers I tried, and when I upgraded to the latest version, this (unrelated?) additional warning appeared (but did not seem to make any difference):
A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.3 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

According to AI-analysis of
https://raw.githubusercontent.com/huggingface/transformers/refs/heads/main/src/transformers/models/llama/modeling_llama.py

The code supports the use of the same k-vector in multiple heads within the same layer through a technique called Grouped-Query Attention (GQA). Here's where it happens and how it works:

1. Configuration:

The LlamaConfig class allows setting num_key_value_heads which is different from num_attention_heads. This indicates that the number of key-value heads can be smaller than the number of query heads, implying that multiple query heads will share the same key-value pair.

2. Projection in LlamaAttention:

In the LlamaAttention class, the linear projections for keys (k_proj) and values (v_proj) are defined to project the hidden states to a dimension based on num_key_value_heads:

self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias)
self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias)
3. Repetition in repeat_kv function and LlamaAttention.forward:

The repeat_kv function is designed to repeat the key and value tensors to match the number of query heads.

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """
    This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
    num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
    """
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
Inside LlamaAttention.forward, repeat_kv is called after the key and value states are computed:

key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
self.num_key_value_groups is calculated as self.num_heads // self.num_key_value_heads, representing how many times each key-value pair needs to be repeated.

How it works (GQA):

The input hidden states are projected into query, key, and value spaces. However, the key and value projections produce a smaller number of heads (num_key_value_heads) compared to the query heads (num_heads).

The repeat_kv function then repeats each key and value head num_key_value_groups times. This effectively makes the key-value pairs available to multiple query heads.

The attention mechanism then proceeds as usual, but now multiple query heads are attending to the same key-value pair.

In essence, the repeat_kv function is the core of how GQA is implemented in this code, allowing the sharing of k-vectors (and v-vectors) among multiple query heads within the same layer. This reduces the memory footprint and computational cost compared to standard Multi-Head Attention (MHA) where each head has its own unique key and value.
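
As a concrete illustration of that shape transformation, here is a standalone sketch using illustrative sizes (3 key/value heads repeated 3 times to match 9 query heads, chosen only to mirror the 192-vs-576 shapes quoted above):

import torch

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    # Same logic as the repeat_kv shown above, reproduced so this snippet runs on its own.
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)

key_states = torch.randn(1, 3, 10, 64)  # (batch, num_key_value_heads, seq_len, head_dim)
print(repeat_kv(key_states, 3).shape)   # torch.Size([1, 9, 10, 64]) -- one KV slice per query head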

According to https://huggingface.co/docs/transformers/main/model_doc/llama
and
https://huggingface.co/docs/transformers/main/model_doc/llama2
the Grouped-Query Attention (GQA) feature was introduced in "Llama2".
So, the original Llama modeling file (models/llama/modeling_llama.py) will not support the Grouped-Query Attention (GQA) feature?

So, please provide a URL to the modeling_llama2.py

Hugging Face TB Research org

All llama models use that same modeling_llama.py. Please confirm your transformers version:

transformers-cli env
