finetuning with axolotl error with tokenizer

#1
by interstellarninja - opened

I'm working on finetuning this model with axolotl, but I'm getting the following error during dataset preprocessing with axolotl.cli.preprocess.
The same dataset worked with stablelm-zephyr-3b.

TypeError: cannot pickle 'builtins.CoreBPE' object

Stability AI org

Hi, @interstellarninja! This model uses a different tokenizer from stablelm-zephyr-3b, one based on tiktoken. Which version of datasets are you on? I came across a similar issue where updating to the latest version resolved it: https://github.com/huggingface/datasets/issues/5769

If this doesn't work, I'll try to repro on axolotl ASAP. Thanks for reporting!

Hi jon - I see that it uses "Arcade100kTokenizer".

I'm using datasets version 2.16.1, which is the latest. I think I should open an issue on the axolotl repo.

Stability AI org

@interstellarninja Here's a simple trick I used in my training script that you can refer to:

import copyreg
import functools

import tiktoken

def pickle_Encoding(enc):
    # Rebuild the Encoding from its constituent parts instead of trying to
    # pickle the underlying Rust CoreBPE object directly.
    return (
        functools.partial(
            tiktoken.core.Encoding,
            enc.name,
            pat_str=enc._pat_str,
            mergeable_ranks=enc._mergeable_ranks,
            special_tokens=enc._special_tokens,
        ),
        (),
    )

copyreg.pickle(tiktoken.core.Encoding, pickle_Encoding)

Add this before your tokenization function; I assume it runs through a Hugging Face datasets map call. Hope it helps.
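For context on why this works: copyreg.pickle registers a reducer that pickle consults before falling back to the object's own reduction protocol, and the reducer rebuilds the object from its constructor arguments, so the non-picklable native handle is never serialized. Here's a minimal stdlib-only sketch of the same mechanism; the Encoder and NativeHandle classes are hypothetical stand-ins for illustration, not tiktoken's actual API:

```python
import copyreg
import functools
import pickle

class NativeHandle:
    """Hypothetical stand-in for a native (e.g. Rust-backed) object."""
    def __reduce__(self):
        # Mimics what CoreBPE does: refuses to be pickled.
        raise TypeError("cannot pickle 'NativeHandle' object")

class Encoder:
    """Hypothetical stand-in for tiktoken.core.Encoding."""
    def __init__(self, name):
        self.name = name
        self._handle = NativeHandle()  # the part pickle would choke on

def pickle_encoder(enc):
    # Tell pickle to reconstruct the Encoder from its constructor args,
    # sidestepping the native handle entirely. __init__ recreates it
    # fresh on the unpickling side.
    return (functools.partial(Encoder, enc.name), ())

copyreg.pickle(Encoder, pickle_encoder)

enc = Encoder("arcade100k")
restored = pickle.loads(pickle.dumps(enc))
print(restored.name)  # arcade100k
```

Note that the restored object is built by calling the constructor again, so any state not captured in the reducer's arguments is reset rather than preserved.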

Stability AI org

@interstellarninja The tokenizer has been updated to support pickling. Let us know if you run into any further issues. Thanks for raising this!

jon-tow changed discussion status to closed
