Error raised when `use_cache=True`

#23
by wjfwzzc - opened

transformers version: 4.33.2

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-1", trust_remote_code=True, torch_dtype="auto", use_cache=True
)

raises the following error:

File /usr/local/lib/python3.9/dist-packages/transformers/models/auto/auto_factory.py:558, in _BaseAutoModelClass.from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
    556     else:
    557         cls.register(config.__class__, model_class, exist_ok=True)
--> 558     return model_class.from_pretrained(
    559         pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, **kwargs
    560     )
    561 elif type(config) in cls._model_mapping.keys():
    562     model_class = _get_model_class(config, cls._model_mapping)

File /usr/local/lib/python3.9/dist-packages/transformers/modeling_utils.py:2966, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, *model_args, **kwargs)
   2963     init_contexts.append(init_empty_weights())
   2965 with ContextManagers(init_contexts):
-> 2966     model = cls(config, *model_args, **model_kwargs)
   2968 # Check first if we are `from_pt`
   2969 if use_keep_in_fp32_modules:

TypeError: __init__() got an unexpected keyword argument 'use_cache'
Microsoft org

Hey @wjfwzzc, thanks for your issue!

It seems there is an issue with the propagation of unused kwargs when using remote code, cc @ArthurZ.
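For reference, setting the flag on the config object before instantiation may also sidestep the kwarg propagation issue (a hedged sketch, not verified against the phi-1 remote code; it assumes the remote configuration honors a use_cache attribute):

from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("microsoft/phi-1", trust_remote_code=True)
config.use_cache = True  # set on the config rather than passing it as a model kwarg
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-1", config=config, trust_remote_code=True, torch_dtype="auto"
)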

To do what you're trying to do, you could define a GenerationConfig locally with use_cache set to True:

from transformers import GenerationConfig

generation_config = GenerationConfig(use_cache=True)

You can then pass this to the generate method:

>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True, torch_dtype="auto")
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
>>> inputs = tokenizer('''```python
... def print_prime(n):
...     """
...     Print all primes between 1 and n
...     """''', return_tensors="pt", return_attention_mask=False)


>>> model.generate(**inputs, max_length=200, generation_config=generation_config)
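If useful, the returned token ids can be decoded back to text (a small follow-up sketch reusing the names from the snippet above):

>>> output_ids = model.generate(**inputs, max_length=200, generation_config=generation_config)
>>> print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])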

Please let me know if that works for you!

Hi @lysandre, thanks for your help; it works for me!
Nevertheless, I'm still confused about the attention_mask. It seems that return_attention_mask=True raises

ValueError: The following `model_kwargs` are not used by the model: ['attention_mask'] (note: typos in the generate arguments will also show up in this list)

But how can I do batched inference with padding without an attention mask?

Microsoft org

Hey @wjfwzzc , Phi is being contributed to transformers in this PR: https://github.com/huggingface/transformers/pull/26170

This should enable leveraging the attention mask to perform batch inference.

Microsoft org

Hello @wjfwzzc !

I just added support for attention_mask in the forward pass, so you should be able to perform batched inference. In the meantime, this will serve as a stopgap until Phi is contributed to transformers (which I hugely appreciate!).
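For what it's worth, once the updated remote code accepts attention_mask, batched generation with padding could look roughly like this (a sketch under that assumption; the prompts below are illustrative, and the EOS token is reused for padding since the tokenizer ships without a pad token):

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)

# no pad token by default, so reuse EOS and pad on the left
# (left padding is generally preferred for decoder-only generation)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

prompts = ["def add(a, b):", "def print_prime(n):"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)  # includes attention_mask

output_ids = model.generate(**inputs, max_length=100, pad_token_id=tokenizer.pad_token_id)
for text in tokenizer.batch_decode(output_ids, skip_special_tokens=True):
    print(text)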

gugarosa changed discussion status to closed
