Which tokenizer has been saved and how to use it?

#1
by javiermartinezcebrian - opened

Before these HF-compatible Mamba models existed, I saw that people (and the Mamba repo itself, in /eval) used the EleutherAI/gpt-neox-20b tokenizer. Is that the tokenizer saved here in the HF repo for the Mamba models? Thanks!
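For reference, this is the kind of quick check I put together to compare the tokenizer saved in this repo against the GPT-NeoX one (just a rough sanity check on class, vocabulary size, special tokens, and a sample encoding, nothing authoritative):

```python
from transformers import AutoTokenizer

# Load the tokenizer shipped with the HF Mamba repo and the GPT-NeoX one
mamba_tok = AutoTokenizer.from_pretrained("state-spaces/mamba-2.8b-hf")
neox_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

# Compare class, vocabulary size, special tokens, and a sample encoding
print(type(mamba_tok).__name__, type(neox_tok).__name__)
print(len(mamba_tok), len(neox_tok))
print(mamba_tok.special_tokens_map, neox_tok.special_tokens_map)
sample = "Which tokenizer has been saved?"
print(mamba_tok(sample).input_ids == neox_tok(sample).input_ids)
```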

@ArthurZ @amyeroberts @koayon @tridao @albertgu I would like to add:
I have seen that for generation tasks the pad_token is often set to the eos_token. In the original Mamba repository, add_special_tokens=False is used when tokenizing. I also see discussions about using DataCollatorForLanguageModeling versus DataCollatorForSeq2Seq, and it seems typical to use left padding for generation tasks (see the sketch below).
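To make that concrete, here is a minimal sketch of those conventions as I understand them (my assumptions about the setup, not an official recipe; the sample texts are made up):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("state-spaces/mamba-2.8b-hf")
tokenizer.pad_token = tokenizer.eos_token   # the pad_token = eos_token convention
tokenizer.padding_side = "left"             # left padding, often suggested for generation

# add_special_tokens=False when tokenizing, as in the original Mamba repository
texts = ["example training sample one", "a second, slightly longer training sample"]
features = [tokenizer(t, add_special_tokens=False) for t in texts]

# Causal-LM collator: mlm=False pads the batch and copies input_ids into labels
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
batch = collator(features)
print(batch.keys())  # input_ids, attention_mask, labels
```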

In the state-spaces/mamba-2.8b-hf tokenizer I see that a pad_token and an eos_token are already defined: given those points, how should I use this HF Mamba model for training and for generation? Currently I am not setting pad_token=eos_token; I am setting add_special_tokens=False, using DataCollatorForLanguageModeling, and using left padding, roughly as in the generation sketch below.
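On the generation side, my current setup looks roughly like this (a sketch only; I keep the pad_token already defined in the saved tokenizer and just switch to left padding; the prompts are placeholders):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("state-spaces/mamba-2.8b-hf", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("state-spaces/mamba-2.8b-hf")

prompts = ["Mamba is a state-space model that", "The capital of France is"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True, add_special_tokens=False)

with torch.no_grad():
    # not 100% sure the attention_mask is actually consumed by the Mamba forward pass
    outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```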

some refs:
https://github.com/huggingface/transformers/issues/22794
https://medium.com/@geronimo7/mamba-a-shallow-dive-into-a-new-architecture-for-llms-54c70ade5957
https://github.com/havenhq/mamba-chat/blob/main/train_mamba.py

Thanks!


Thanks @nielsr! That covers which tokenizer was saved; what about the other points on how to use it with the pretrained Mamba models?
