Further Tuning

#3
by ChasapasK - opened

Hello, congrats for the great work, guys.
Can you please share the tokenizer's necessary adjustments for further tuning?
Thank you

Institute for Language and Speech Processing org

Hi @KwstasC, can you share what kind of adjustments you need? The tokenizer uploaded with the model should be ready to use for fine-tuning.

Of course. What about the padding side (left or right) and the bos and eos tokens?

Institute for Language and Speech Processing org
edited Apr 9

Thank you for your interest in our work.
The tokenizer provided with the model works out of the box for any fine-tuning task.

The default padding side is left
The bos token is <s> with id 1
The eos token is </s> with id 2

In general, any further information about the default configuration can be found either in tokenizer_config.json or by loading the tokenizer and inspecting it.
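As a minimal sketch of how to check these defaults yourself: the helper below is hypothetical (not part of the model repo) and only reads standard attributes that Hugging Face tokenizers expose. MODEL_ID in the usage comment is a placeholder for the model's repo id.

```python
def summarize_tokenizer(tokenizer):
    """Collect padding side and bos/eos settings from an already-loaded tokenizer.

    Hypothetical helper: it only reads attributes that Hugging Face
    tokenizers expose (padding_side, bos_token, bos_token_id, ...).
    """
    return {
        "padding_side": tokenizer.padding_side,
        "bos": (tokenizer.bos_token, tokenizer.bos_token_id),
        "eos": (tokenizer.eos_token, tokenizer.eos_token_id),
    }

# Typical use (needs the transformers package and access to the model repo):
#   from transformers import AutoTokenizer
#   tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)  # MODEL_ID: placeholder
#   print(summarize_tokenizer(tokenizer))
```

If the values printed differ from the ones listed above, the tokenizer_config.json in the repo is the place to look.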

Great. Thanks for the reply!

LVouk changed discussion status to closed

Hello, I have another question regarding data preparation before tuning Meltemi. Is the expected format the same as for Mistral 7B Instruct, like {"text": " [INST] Instruction [/INST] Answer "}, or does the model expect a different format?
Thanks

Institute for Language and Speech Processing org
edited Apr 11

It's the Zephyr template, which can be used as follows (it's the tokenizer's chat template as we provide it):

# Assumes `tokenizer` is the tokenizer shipped with the model, already loaded.
messages = [
    {"role": "system", "content": "Είσαι το Μελτέμι, ένα γλωσσικό μοντέλο για την ελληνική γλώσσα. Είσαι ιδιαίτερα βοηθητικό προς την χρήστρια ή τον χρήστη και δίνεις σύντομες αλλά επαρκώς περιεκτικές απαντήσεις. Απάντα με προσοχή, ευγένεια, αμεροληψία, ειλικρίνεια και σεβασμό προς την χρήστρια ή τον χρήστη."},
    {"role": "user", "content": "Πες μου αν έχεις συνείδηση."},
]

prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# The prompt ends up looking like this
#
# <|system|>
# Είσαι το Μελτέμι, ένα γλωσσικό μοντέλο για την ελληνική γλώσσα. Είσαι ιδιαίτερα βοηθητικό προς την χρήστρια ή τον χρήστη και δίνεις σύντομες αλλά επαρκώς περιεκτικές απαντήσεις. Απάντα με προσοχή, ευγένεια, αμεροληψία, ειλικρίνεια και σεβασμό προς την χρήστρια ή τον χρήστη.</s>
# <|user|>
# Πες μου αν έχεις συνείδηση.</s>
# <|assistant|>
#

Note that the template (and, by extension, applying it with tokenize=True) doesn't add a beginning-of-sequence token. If you want to fine-tune with a bos token (which is advisable for Meltemi), you need to apply the chat template with tokenize=False and tokenize at a later stage (as is most commonly done), or handle the chat template accordingly.
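One possible way to do the two-step approach described above, as a rough sketch rather than the official recipe: render the template to text first, tokenize without automatic special tokens, then prepend the bos id explicitly so it is never duplicated. The function name encode_chat_with_bos is made up for illustration; apply_chat_template, bos_token_id and the add_special_tokens argument are standard Hugging Face tokenizer features.

```python
def encode_chat_with_bos(tokenizer, messages, add_generation_prompt=False):
    """Render the chat template as text, tokenize it, and make sure bos comes first.

    Hypothetical helper; `tokenizer` is assumed to be a Hugging Face tokenizer
    that ships a chat template (as the Meltemi tokenizer does).
    """
    # Step 1: chat template -> plain string (the template itself adds no bos).
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=add_generation_prompt
    )
    # Step 2: tokenize without automatic special tokens, so bos handling is explicit.
    ids = list(tokenizer(text, add_special_tokens=False)["input_ids"])
    # Step 3: prepend bos only if it is not already the first token.
    bos_id = tokenizer.bos_token_id
    if bos_id is not None and (not ids or ids[0] != bos_id):
        ids = [bos_id] + ids
    return ids
```

For training examples you would usually leave add_generation_prompt=False and include the assistant turn in messages; for inference prompts, set it to True as in the snippet above.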
