Where is the example?

#1 opened by chuangzhidian

Where is the README? I want to try it.

Hi, thanks for your interest; it uses the same model interface as https://huggingface.co/OpenAssistant/reward-model-deberta-v3-large-v2
The model was trained to predict the better of two answers.
A forward pass of the frozen model provides a loss that can be used as a signal for RLHF; there is some code out there to do that, but I didn't explore it.
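For reference, here is a minimal sketch of scoring a (question, answer) pair, assuming the same sequence-classification interface as the DeBERTa reward model linked above (the repo id below is that model's; swap in this one's if it differs, and the example texts are made up):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumption: same interface as the linked DeBERTa-v3 reward model.
model_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(model_name)
reward_model.eval()

question = "Explain nuclear fusion like I am five."
answer = "Nuclear fusion is when two tiny atoms squeeze together and release energy, like in the sun."

# The model scores a (question, answer) pair; higher means "better answer".
inputs = tokenizer(question, answer, return_tensors="pt")
with torch.no_grad():
    score = reward_model(**inputs).logits[0].item()
print(score)
```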

Thank you, you're being very helpful. :)
It seems to me that the output is higher than that of the model you mentioned earlier (https://huggingface.co/OpenAssistant/reward-model-deberta-v3-large-v2).
Is it true in general that your model is better in terms of discriminating performance?

The absolute value of the output doesn't really matter; it's all about the gradient, which is related to the difference in values for different inputs.
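In other words, only the comparison between scores for two candidate answers to the same question carries information. A quick sketch, reusing `question`, `tokenizer`, and `reward_model` from the snippet above (the candidate answers are made up):

```python
def score(question, answer):
    inputs = tokenizer(question, answer, return_tensors="pt")
    with torch.no_grad():
        return reward_model(**inputs).logits[0].item()

good = "The sun squeezes hydrogen atoms together until they fuse into helium, releasing energy."
bad = "Fusion is when you split an atom in half to release energy."

# The ordering and gap between the two scores is what matters,
# not either absolute value on its own.
print(score(question, good) - score(question, bad))
```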

So the absolute value of the output doesn't matter much, only the comparison between the two? Thank you for your prompt attention to this matter.

Yes. The reward model provides a score for how "good" some generated text is. We can then optimize a generator so that it generates "better" text by using the reward model's output as the negative of the loss.
See the RLHF paper by OpenAI.
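A very rough sketch of that idea, using a plain REINFORCE-style update rather than the PPO used in the actual RLHF papers. `generator`, `gen_tokenizer`, `optimizer`, and `prompt` are placeholders for your own generator setup; `tokenizer` and `reward_model` are the frozen reward model objects from the earlier snippet:

```python
import torch

def reinforce_step(generator, gen_tokenizer, optimizer, prompt):
    # Sample an answer from the current generator (a causal LM).
    enc = gen_tokenizer(prompt, return_tensors="pt")
    gen_ids = generator.generate(**enc, do_sample=True, max_new_tokens=64)
    answer = gen_tokenizer.decode(gen_ids[0, enc.input_ids.shape[1]:],
                                  skip_special_tokens=True)

    # The frozen reward model scores the (prompt, answer) pair; no gradients here.
    with torch.no_grad():
        reward = reward_model(
            **tokenizer(prompt, answer, return_tensors="pt")).logits[0].item()

    # Re-run the generator to get a differentiable log-probability proxy:
    # the negative mean token NLL over the sampled sequence.
    out = generator(input_ids=gen_ids, labels=gen_ids)
    seq_logp = -out.loss

    # Higher reward -> push the sampled sequence's log-probability up.
    loss = -reward * seq_logp
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```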
