Multi-turn input template

#6
by linzeqipku - opened

Hi team, thank you for the great work.
I'm wondering how to use this model for calculating the rewards of multi-turn messages (i.e., what is the chat_template to convert messages to input string).
For example,

messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi"},
    {"role": "user", "content": "How are you?"},
    {"role": "assistant", "content": "I'm fine, thank you. And you?"}
]

I'm also wondering how the rewards of multi-turn messages are modeled during training. For example, does the model score the quality of only the last assistant message, or of all assistant messages?

OpenBMB org
edited Jun 8

Hi,

Thanks for your kind words.

For all kinds of messages, we format them directly with the tokenizer.apply_chat_template() function.
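For illustration, here is a minimal sketch of what a chat template does to a message list. The ChatML-style markers used below are an assumption for demonstration; the actual template is defined in the model's tokenizer config and is applied via tokenizer.apply_chat_template(), which may produce different special tokens.

```python
# Toy ChatML-style template, similar in spirit to what
# tokenizer.apply_chat_template() produces. The real template comes from
# the tokenizer config and may use different markers.
def apply_chat_template(messages, add_generation_prompt=False):
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi"},
    {"role": "user", "content": "How are you?"},
    {"role": "assistant", "content": "I'm fine, thank you. And you?"},
]
print(apply_chat_template(messages))
```

In practice you would just call `tokenizer.apply_chat_template(messages, tokenize=False)` on the model's own tokenizer rather than hand-rolling the template.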

In RM training, the attention masks of the context (messages[:-1] in your example) were set to 0, and only the masks of the last assistant message (messages[-1]) were set to 1. That is, the model was trained to score the last assistant message given single or multiple turns of context.
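The masking scheme above can be sketched as follows. This is a toy illustration that counts whitespace-split "tokens"; a real setup would build the mask over the subword tokens produced by the model's tokenizer.

```python
# Sketch of the RM training mask described above: 0 for all context tokens,
# 1 only for tokens of the final assistant message.
# (Toy whitespace "tokenizer" for illustration only.)
def reward_mask(messages):
    mask = []
    for i, m in enumerate(messages):
        n_tokens = len(m["content"].split())
        is_last_assistant = (i == len(messages) - 1 and m["role"] == "assistant")
        mask.extend([1 if is_last_assistant else 0] * n_tokens)
    return mask

messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi"},
    {"role": "user", "content": "How are you?"},
    {"role": "assistant", "content": "I'm fine, thank you. And you?"},
]
print(reward_mask(messages))
# → [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```

Only the six tokens of the final assistant turn contribute; everything before it is treated purely as context.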

Hope this helps :)
