truthfulqa

#1
by pankajmathur - opened

Just curious, does the training dataset include the TruthfulQA dataset that is used during evaluation for the Open LLM Leaderboard?

If so, is it still fair to use TruthfulQA as a metric to evaluate this model? Compared with other 3B (or even 65B/70B) models, its TruthfulQA score is very high.
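For anyone who wants to verify, here is a minimal sketch of a verbatim-overlap contamination check. The TruthfulQA load call is the real Hugging Face dataset as used by the leaderboard; the training file path `train.jsonl` and its `question` field are hypothetical placeholders for whatever schema the actual training data uses.

```python
# Minimal contamination check: do any TruthfulQA eval questions appear
# verbatim in the training data? Assumes the training set is a JSONL file
# with a "question" field (hypothetical name -- adjust to the real schema).
import json
from datasets import load_dataset

# TruthfulQA multiple-choice config, as evaluated on the Open LLM Leaderboard.
tqa = load_dataset("truthful_qa", "multiple_choice", split="validation")
eval_questions = {q.strip().lower() for q in tqa["question"]}

overlap = 0
with open("train.jsonl") as f:  # hypothetical path to the training data
    for line in f:
        record = json.loads(line)
        if record.get("question", "").strip().lower() in eval_questions:
            overlap += 1

print(f"{overlap} of {len(eval_questions)} eval questions found in training data")
```

This only catches exact matches after lowercasing; paraphrased or reformatted contamination would need fuzzier matching (e.g. n-gram overlap).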


The interesting part is that, according to your config.json, the base model is RedPajama-INCITE-Chat-3B-v1, which scores higher than your model on all other evals besides truthfulqa_mc. So yeah, I think something is not quite right here.


Oh, there is the issue.
I used parts of the ARC and TruthfulQA datasets.
I recently (about two weeks ago) discovered that my format for answering questions in ARC was incorrect. The answer should be the letter of the correct choice, like this:

<Question>
A. <first option>
B. <second option>
<Letter of the correct answer>

But instead of the above, I used this format:

<Question>
A. <first option>
B. <second option>
<text of the option instead of one letter>

That's why it gets such a bad ARC score. I will fix this issue as soon as possible and update the dataset and the model.
Thanks for reporting it!
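For reference, here is a minimal sketch of the corrected formatting, using the public ai2_arc dataset from the Hub. The prompt template is just an illustration of the fix described above, not the exact preprocessing script.

```python
# Sketch of the corrected ARC formatting: the target is the LETTER of the
# correct choice, not the full answer text. Uses the public ai2_arc dataset.
from datasets import load_dataset

arc = load_dataset("ai2_arc", "ARC-Challenge", split="train")

def format_example(example):
    lines = [example["question"]]
    # Each choice is rendered as "<label>. <text>", e.g. "A. gravity".
    for label, text in zip(example["choices"]["label"], example["choices"]["text"]):
        lines.append(f"{label}. {text}")
    # Correct: end with the answer key (e.g. "B"), matching the eval format.
    lines.append(example["answerKey"])
    return "\n".join(lines)

print(format_example(arc[0]))
```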

Thanks for the details, @Fredithefish.

@clefourrier: you may want to look into this too.
