finetuning with axolotl error with tokenizer

#1
by interstellarninja - opened

I'm working on finetuning this model with axolotl, but I'm getting the following error during dataset preprocessing with axolotl.cli.preprocess.
The same dataset worked with stablelm-zephyr-3b.

TypeError: cannot pickle 'builtins.CoreBPE' object

Stability AI org

Hi, @interstellarninja! This model uses a different tokenizer from stablelm-zephyr-3b, one based on tiktoken. Which version of datasets are you on? I came across a similar issue where updating to the latest version resolved it: https://github.com/huggingface/datasets/issues/5769

If this doesn't work, I'll try to repro on axolotl ASAP. Thanks for reporting!

Hi jon - I see that it uses "Arcade100kTokenizer".

I'm using datasets version 2.16.1, which is the latest. I think I should open an issue on the axolotl repo.

Stability AI org

@interstellarninja Here's a simple trick I used in my training script that you can refer to:

import copyreg
import functools

import tiktoken

def pickle_Encoding(enc):
    # Rebuild the Encoding from its constituent parts instead of trying to
    # pickle the underlying Rust CoreBPE object directly.
    return (
        functools.partial(
            tiktoken.core.Encoding,
            enc.name,
            pat_str=enc._pat_str,
            mergeable_ranks=enc._mergeable_ranks,
            special_tokens=enc._special_tokens,
        ),
        (),
    )

copyreg.pickle(tiktoken.core.Encoding, pickle_Encoding)

Add this before your tokenization function; I assume it runs through a Hugging Face datasets map call. Hope it helps.
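For context on why this works: copyreg.pickle registers a reducer that pickle consults before falling back to the object's own reduction protocol, and the reducer rebuilds the object from its constructor arguments, so the non-picklable native handle is never serialized. Here's a minimal stdlib-only sketch of the same mechanism; the Encoder and NativeHandle classes are hypothetical stand-ins for illustration, not tiktoken's actual API:

```python
import copyreg
import functools
import pickle

class NativeHandle:
    """Hypothetical stand-in for a native (e.g. Rust-backed) object."""
    def __reduce__(self):
        # Mimics what CoreBPE does: refuses to be pickled.
        raise TypeError("cannot pickle 'NativeHandle' object")

class Encoder:
    """Hypothetical stand-in for tiktoken.core.Encoding."""
    def __init__(self, name):
        self.name = name
        self._handle = NativeHandle()  # the part pickle would choke on

def pickle_encoder(enc):
    # Tell pickle to reconstruct the Encoder from its constructor args,
    # sidestepping the native handle entirely. __init__ recreates it
    # fresh on the unpickling side.
    return (functools.partial(Encoder, enc.name), ())

copyreg.pickle(Encoder, pickle_encoder)

enc = Encoder("arcade100k")
restored = pickle.loads(pickle.dumps(enc))
print(restored.name)  # arcade100k
```

Note that the restored object is built by calling the constructor again, so any state not captured in the reducer's arguments is reset rather than preserved.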

Stability AI org

@interstellarninja The tokenizer has been updated to support pickling. Let us know if you run into any further issues. Thanks for raising this!

jon-tow changed discussion status to closed
