Further Tuning

by ChasapasK - opened Apr 9, 2024

Apr 9, 2024

Hello, congrats on the great work, guys.
Can you please share the tokenizer adjustments needed for further tuning?
Thank you

geopar

Institute for Language and Speech Processing org Apr 9, 2024

Hi @KwstasC, can you share what kind of adjustments you need? The tokenizer uploaded with the model should be ready to use for fine-tuning.

ChasapasK

Apr 9, 2024

Of course. What about the padding side (left or right) and the bos and eos tokens?

LVouk

Institute for Language and Speech Processing org Apr 9, 2024

•

edited Apr 9, 2024

Thank you for your interest in our work.
The tokenizer provided with the model works out of the box for any fine-tuning task.

The default padding side is left
The bos token is <s> with id 1
The eos token is </s> with id 2

In general, any further information about the default configuration can be found either in tokenizer_config.json or by loading the tokenizer and inspecting it.
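
To check these defaults without loading the model, a minimal sketch like the following could read them straight from tokenizer_config.json. The helper name `summarize_tokenizer_config` is hypothetical, and the key names (`padding_side`, `bos_token`, `eos_token`) are assumed to follow the layout commonly seen in Hugging Face tokenizer configs, where a special token may be stored either as a plain string or as a dict with a `content` field.

```python
import json
from pathlib import Path


def summarize_tokenizer_config(path):
    """Return the padding side and bos/eos tokens from a tokenizer_config.json file.

    Hypothetical helper: the key names are assumptions about the config layout.
    """
    config = json.loads(Path(path).read_text(encoding="utf-8"))

    def token_text(value):
        # A special token may be a plain string or a dict with a "content" field.
        if isinstance(value, dict):
            return value.get("content")
        return value

    return {
        "padding_side": config.get("padding_side"),
        "bos_token": token_text(config.get("bos_token")),
        "eos_token": token_text(config.get("eos_token")),
    }
```

With the tokenizer already loaded through the transformers library, the same values should be available as the `padding_side`, `bos_token`, `bos_token_id`, `eos_token` and `eos_token_id` attributes of the tokenizer object.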

ChasapasK

Apr 9, 2024

Great. Thanks for the reply!

LVouk changed discussion status to closed Apr 11, 2024

Kwstas

Apr 11, 2024

•

edited Apr 11, 2024

Hello, I have another question regarding data preparation before tuning Meltemi. Is the expected format the same as for Mistral 7B Instruct, like {"text": "<s>[INST] Instruction [/INST] Answer</s>"}, or does the model expect a different format?
Thanks

LVouk

Institute for Language and Speech Processing org Apr 11, 2024

•

edited Apr 11, 2024

It's the Zephyr template, which is the tokenizer's chat template as we provide it. It can be used as follows:

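messages = [
    # System prompt (Greek): "You are Meltemi, a language model for the Greek language. You are
    # particularly helpful to the user and give short but sufficiently comprehensive answers.
    # Answer with care, politeness, impartiality, honesty and respect toward the user."
    {"role": "system", "content": "Είσαι το Μελτέμι, ένα γλωσσικό μοντέλο για την ελληνική γλώσσα. Είσαι ιδιαίτερα βοηθητικό προς την χρήστρια ή τον χρήστη και δίνεις σύντομες αλλά επαρκώς περιεκτικές απαντήσεις. Απάντα με προσοχή, ευγένεια, αμεροληψία, ειλικρίνεια και σεβασμό προς την χρήστρια ή τον χρήστη."},
    # User message (Greek): "Tell me if you have consciousness."
    {"role": "user", "content": "Πες μου αν έχεις συνείδηση."},
]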

prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# The prompt ends up looking like this
#
# <|system|>
# Είσαι το Μελτέμι, ένα γλωσσικό μοντέλο για την ελληνική γλώσσα. Είσαι ιδιαίτερα βοηθητικό προς την χρήστρια ή τον χρήστη και δίνεις σύντομες αλλά επαρκώς περιεκτικές απαντήσεις. Απάντα με προσοχή, ευγένεια, αμεροληψία, ειλικρίνεια και σεβασμό προς την χρήστρια ή τον χρήστη.</s>
# <|user|>
# Πες μου αν έχεις συνείδηση.</s>
# <|assistant|>
#
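
For preparing {"text": ...} records like the one asked about, a rough pure-Python sketch of the layout shown above might look like this. The function names `render_zephyr` and `to_training_record` are hypothetical, and the assistant turn is assumed to follow the same pattern as the system and user turns (role header, newline, content, `</s>`, newline). The output of `tokenizer.apply_chat_template` remains the authoritative reference for exact whitespace.

```python
import json

EOS = "</s>"  # eos token as stated earlier in this thread


def render_zephyr(messages, add_generation_prompt=False):
    """Render messages in the Zephyr-style layout shown above (no bos token added)."""
    parts = [f"<|{m['role']}|>\n{m['content']}{EOS}\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn for the model to complete at inference time.
        parts.append("<|assistant|>\n")
    return "".join(parts)


def to_training_record(messages):
    """Build one {"text": ...} record, e.g. one line of a JSONL fine-tuning file."""
    return {"text": render_zephyr(messages)}


example = [
    {"role": "user", "content": "Instruction"},
    {"role": "assistant", "content": "Answer"},
]
print(json.dumps(to_training_record(example), ensure_ascii=False))
```

Training examples include the assistant's answer, so `add_generation_prompt` is left at False there. It is only set to True when prompting the model for a new reply.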

Note that the template (and, by extension, applying it with tokenize=True) doesn't add a beginning-of-sequence token. If you want to fine-tune with a bos token (which is advisable for Meltemi), you need to apply the chat template with tokenize=False and tokenize at a later stage (as is most commonly done), or handle the chat template accordingly.
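
One way to do the "tokenize at a later stage" step is sketched below. The helper name `encode_with_bos` is hypothetical. It assumes a Hugging Face tokenizer object that exposes `apply_chat_template`, is callable on text, and has a `bos_token_id` attribute. Whether the tokenizer prepends bos by itself when called with `add_special_tokens=True` depends on its configuration, so the sketch checks the first id and prepends bos only if it is missing.

```python
def encode_with_bos(tokenizer, messages):
    """Apply the chat template as text, then tokenize so that a bos token leads the sequence."""
    # Step 1: render the conversation as plain text (the template itself adds no bos).
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False
    )
    # Step 2: tokenize the text; many Llama/Mistral-style tokenizers prepend bos here.
    input_ids = list(tokenizer(text, add_special_tokens=True)["input_ids"])
    # Step 3: fallback, prepend bos manually if the tokenizer did not add it.
    bos_id = tokenizer.bos_token_id
    if bos_id is not None and (not input_ids or input_ids[0] != bos_id):
        input_ids = [bos_id] + input_ids
    return input_ids
```

Checking the first id rather than always prepending avoids ending up with two bos tokens when the tokenizer already adds one.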

Geo

Jun 21, 2024

•

edited Jun 21, 2024

I also have a question regarding data preparation before tuning Meltemi.
What format should the data have in order to fine-tune Meltemi to answer questions using only information from a provided document?
