Finetuning with axolotl: tokenizer error
I'm working on finetuning this model with axolotl, but I'm getting the following error during dataset preprocessing with axolotl.cli.preprocess.
The same dataset worked with stablelm-zephyr-3b.
TypeError: cannot pickle 'builtins.CoreBPE' object
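For context on where this error comes from: multiprocess dataset preprocessing has to pickle the tokenize function, and with it the tokenizer it closes over, to ship them to worker processes. Any attribute backed by a native extension object without pickle support trips exactly this kind of TypeError. A stdlib-only illustration, where NativeHandle is a hypothetical stand-in for tiktoken's CoreBPE:

```python
import pickle


class NativeHandle:
    """Hypothetical stand-in for an extension object like CoreBPE."""

    def __reduce__(self):
        # Extension types without pickle support fail this way.
        raise TypeError("cannot pickle 'NativeHandle' object")


class Tokenizer:
    """Toy tokenizer whose backend is unpicklable."""

    def __init__(self):
        self.core = NativeHandle()


try:
    pickle.dumps(Tokenizer())
except TypeError as e:
    print(e)  # cannot pickle 'NativeHandle' object
```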
Hi @interstellarninja! This model uses a different tokenizer from stablelm-zephyr-3b, based on tiktoken. Which version of datasets are you on? I came across a similar issue that suggests updating to the latest: https://github.com/huggingface/datasets/issues/5769
If this doesn't work, I'll try to repro on axolotl ASAP. Thanks for reporting!
Hi Jon - I see that it uses "Arcade100kTokenizer".
I'm using datasets version 2.16.1, which is the latest. I think I should open an issue on the axolotl repo.
@interstellarninja A simple trick I used in my training script that you can refer to:

import copyreg
import functools
import tiktoken

def pickle_Encoding(enc):
    # Rebuild the Encoding from its constructor arguments instead of
    # trying to pickle the underlying CoreBPE object directly.
    return (functools.partial(tiktoken.core.Encoding, enc.name, pat_str=enc._pat_str, mergeable_ranks=enc._mergeable_ranks, special_tokens=enc._special_tokens), ())

copyreg.pickle(tiktoken.core.Encoding, pickle_Encoding)

Add this before your tokenize function - I guess it is a map function using Hugging Face datasets. Hope it helps.
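The trick above works because pickle consults copyreg's dispatch table for a type before falling back to default instance pickling, so the registered function can rebuild the object from picklable constructor arguments. A self-contained sketch of the same pattern, using a hypothetical Ruler class as a stand-in for an unpicklable type like tiktoken's Encoding:

```python
import copyreg
import functools
import pickle


class Ruler:
    """Stand-in for a class holding an unpicklable attribute."""

    def __init__(self, name, scale):
        self.name = name
        self.scale = scale
        # A lambda attribute makes instances unpicklable by default.
        self._fn = lambda x: x * scale

    def measure(self, x):
        return self._fn(x)


def pickle_ruler(r):
    # Recreate the object from its constructor arguments, mirroring
    # the pickle_Encoding workaround above.
    return (functools.partial(Ruler, r.name, r.scale), ())


# Register the reducer; pickle now uses it for every Ruler instance.
copyreg.pickle(Ruler, pickle_ruler)

r = Ruler("cm", 10)
clone = pickle.loads(pickle.dumps(r))
print(clone.name, clone.measure(3))  # cm 30
```

Without the copyreg.pickle registration, pickle.dumps(r) would fail on the lambda attribute, just like it fails on CoreBPE.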
@interstellarninja The tokenizer has been updated to support pickling. Let us know if you run into any further issues. Thanks for raising this!