The tokenizer suddenly fails to load. Is this a version incompatibility caused by an upgrade?

#67
by Jiiao - opened

AttributeError Traceback (most recent call last)
测试.ipynb Cell 19 line 4
1 # vpm_resampler_embedtokens_weight = torch.load(f"{model_dir}/vpm_resampler_embedtokens.pt")
2
3 # msg = model.load_state_dict(vpm_resampler_embedtokens_weight, strict=False)
----> 4 tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)

File /usr/local/lib/python3.9/dist-packages/transformers/models/auto/tokenization_auto.py:877, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
875 else:
876 class_ref = tokenizer_auto_map[0]
--> 877 tokenizer_class = get_class_from_dynamic_module(class_ref, pretrained_model_name_or_path, **kwargs)
878 _ = kwargs.pop("code_revision", None)
879 if os.path.isdir(pretrained_model_name_or_path):

File /usr/local/lib/python3.9/dist-packages/transformers/dynamic_module_utils.py:514, in get_class_from_dynamic_module(class_reference, pretrained_model_name_or_path, cache_dir, force_download, resume_download, proxies, token, revision, local_files_only, repo_type, code_revision, **kwargs)
501 # And lastly we get the class inside our newly created module
502 final_module = get_cached_module_file(
503 repo_id,
504 module_file + ".py",
(...)
512 repo_type=repo_type,
513 )
--> 514 return get_class_in_module(class_name, final_module)

File /usr/local/lib/python3.9/dist-packages/transformers/dynamic_module_utils.py:213, in get_class_in_module(class_name, module_path)
211 # reload in both cases
212 module_spec.loader.exec_module(module)
--> 213 return getattr(module, class_name)

AttributeError: module 'transformers_modules.openbmb.MiniCPM-Llama3-V-2_5.287e3f85192a7c4acf2564fc6bda0637439a9d78.modeling_minicpmv' has no attribute 'PreTrainedTokenizerFastWrapper'

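Does your model_dir contain a dot, for example MiniCPM-Llama3-V-2_5.287e3f85192a7c4acf2564fc6bda0637439a9d78? Because of how Hugging Face's dynamic import mechanism works, a dot in model_dir causes a dynamic import error. Please replace the dot with another character.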
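A quick check for this can be sketched as follows. The helper names `has_dot_in_dir_name` and `safe_dir_name` are made up for illustration, and the sketch assumes that only the last path component of a local model_dir ends up in the dynamic module name.

```python
import os


def has_dot_in_dir_name(model_dir: str) -> bool:
    """Return True if the last component of model_dir contains a dot."""
    # Normalize first so a trailing slash does not hide the directory name.
    name = os.path.basename(os.path.normpath(model_dir))
    return "." in name


def safe_dir_name(model_dir: str, replacement: str = "_") -> str:
    """Suggest a dot-free path by rewriting only the last path component."""
    parent, name = os.path.split(os.path.normpath(model_dir))
    return os.path.join(parent, name.replace(".", replacement))
```

If `has_dot_in_dir_name(model_dir)` returns True, renaming the directory to the path suggested by `safe_dir_name(model_dir)` before calling `from_pretrained` is one way to apply the advice above.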


Hi, thanks for your reply!
There is no dot in model_dir. Here is my loading code:

        self.model = AutoPeftModelForCausalLM.from_pretrained(
            model_dir,
            device_map='cuda:0',
            trust_remote_code=True,
            torch_dtype=torch.float16,
        ).eval()

        vpm_resampler_embedtokens_weight = torch.load(f"{model_dir}/vpm_resampler_embedtokens.pt")

        self.msg = self.model.load_state_dict(vpm_resampler_embedtokens_weight, strict=False)
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)

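Since the commit from 7 days ago (287e3f8), running this code raises the error above. My current understanding is this: loading the model with PEFT basically requires trust_remote_code=True, so the latest remote code is pulled every time, i.e. transformers_modules.openbmb.MiniCPM-Llama3-V-2_5.287e3f85192a7c4acf2564fc6bda0637439a9d78.modeling_minicpmv. In the most recent commit, PreTrainedTokenizerFastWrapper is no longer part of the tokenizer definition, so loading fails. To fix it myself I would have to clone the whole repository and patch it, and then I could no longer follow the latest code. Could you fix this on your side, or is there a way to avoid it?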
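One possible way to avoid picking up new remote code automatically is to pin the code to a specific commit. The sketch below only builds the keyword arguments; `pinned_tokenizer_kwargs` is a hypothetical helper, and the commit to pin is something you have to supply yourself. It relies on the `code_revision` argument that appears in the traceback above, and whether pinning is enough for a given checkpoint is an assumption that needs testing.

```python
def pinned_tokenizer_kwargs(commit_hash: str) -> dict:
    """Build from_pretrained keyword arguments that pin the remote code."""
    if not commit_hash:
        raise ValueError("commit_hash must be a non-empty commit SHA, branch, or tag")
    return {
        "trust_remote_code": True,
        # Revision of the remote Python code fetched from the Hub.
        "code_revision": commit_hash,
    }


# Usage sketch (needs transformers installed; KNOWN_GOOD_COMMIT is a placeholder
# for a commit you have verified works with your LoRA checkpoint):
#
#   from transformers import AutoTokenizer
#   tokenizer = AutoTokenizer.from_pretrained(
#       model_dir, **pinned_tokenizer_kwargs(KNOWN_GOOD_COMMIT)
#   )
```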

OpenBMB org

It looks like the code and JSON files in model_dir are not fully consistent with the latest commit (287e3f8). Please check again.


Got it... thanks!

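For anyone who runs into this later, here is the concrete fix: edit the tokenizer_config.json in the LoRA directory that was trained with the older version of MiniCPM.
Only the following two lines need to change: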

[Screenshot: image.png]
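To locate the lines that need changing in your own setup, one option is to compare the tokenizer_config.json in the LoRA directory against the copy from an up-to-date checkout of the base model repository. This is a minimal sketch: `diff_json_configs` and both paths in the usage note are hypothetical, and it only compares top-level keys.

```python
import json


def diff_json_configs(old_path: str, new_path: str) -> dict:
    """Return {key: (old_value, new_value)} for top-level keys that differ.

    A key that is absent from one file is reported with None on that side.
    """
    with open(old_path, encoding="utf-8") as f:
        old = json.load(f)
    with open(new_path, encoding="utf-8") as f:
        new = json.load(f)

    diffs = {}
    for key in sorted(set(old) | set(new)):
        old_val, new_val = old.get(key), new.get(key)
        if old_val != new_val:
            diffs[key] = (old_val, new_val)
    return diffs
```

For example, `diff_json_configs("lora_dir/tokenizer_config.json", "base_repo/tokenizer_config.json")` returns the entries whose values differ, which you can then copy from the newer file into the LoRA directory's copy.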

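Thanks to the MiniCPM team for all their work! That said, I'd suggest keeping backward compatibility in mind for future maintenance :) This tokenizer issue really took me a long time to track down.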

finalf0 changed discussion status to closed
