Unreliable Benchmarks. Definitely worse than LLaMA2-13b

#5
by anon7463435254 - opened

These benchmarks, declaring better results than those obtained by Microsoft itself, are unreliable, considering that the model is not even able to answer simple reasoning questions that other models like LLaMA2, Nous Hermes, and WizardLM handle correctly. Below is one of the many questions I'm referring to:

[screenshots: the reasoning question and the model's incorrect answer]

According to your benchmarks, should we expect the original Orca to also fail to answer such questions correctly? I don't think that's likely.
In conclusion, either these benchmarks are untrue or the original Microsoft Orca model will be incredibly disappointing.

Let's wait for the 100% version to see if it answers the same question properly.

I don't know what parameters you set up in the space; locally I get this answer:
[screenshot: the model's correct answer when run locally]
Also, I guess you already know this, but you should dial the temperature down to 0.01 if you want more precise answers, and even then you can still sometimes get a different answer if you don't enable the cache.
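
For reference, here's a minimal sketch of what I mean, assuming the model is loaded with Hugging Face transformers (the checkpoint ID and the question are placeholders I've filled in myself, not the space's exact setup):

```python
# Minimal sketch of low-temperature generation with the cache enabled.
# The checkpoint ID and the question below are placeholders, not the space's exact setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Open-Orca/OpenOrca-Preview1-13B"  # assumption; substitute the actual checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "If I have three apples and I eat two pears, how many apples do I have?"  # placeholder question
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.01,  # near-greedy, so answers are much more repeatable
    use_cache=True,    # without the cache you can still see answers drift between runs
)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```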

It's partially their fault for providing a default Gradio space whose settings make it easy to get bad answers from the model.

Yeah, I get the correct answer too for this style of common-sense question (with the default Gradio settings in the space).

[screenshot (2023-08-04): the model's correct answer in the space]

winglian changed discussion status to closed

I am getting the same/similar results here as @anon7463435254. Additionally, the model seems to perform somewhat worse at coding tasks than other models I have tried (LLaMA 2 13B Chat, WizardLM 13B v1.2).

I have tried a context length of both 2048 and 4096 (not 100% sure which is correct; changing it had no effect, although of course the prompt was under 2048 tokens anyway), with temperature 0.8, top_p 0.95, top_k 40. Changing the temperature to 0.3 had no effect.

I am using the prompt template `<|user|> <|user-message|><|end_of_turn|><|bot|> <|bot-message|><|end_of_turn|>`. I also tried putting newlines between the turns, etc. (the documentation contradicts itself as to which exact format is correct), and there was no change. I am using the system prompt "You are a helpful assistant. Please answer all questions as truthfully as possible to the best of your knowledge and ability." Using the system prompt from the documentation instead makes the model produce very verbose reasoning, but it arrives at the same (incorrect) answer.
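
For concreteness, here is roughly how I'm assembling the two variants I tried. The role markers are the placeholders from the template above, and the question and the position of the system prompt relative to the turns are my own assumptions:

```python
# Rough sketch of the two prompt layouts I tried. "<|user|>" and "<|bot|>" stand for
# whatever role strings the template substitutes; the question and the placement of
# the system prompt are my own assumptions, not an official reference.
system_prompt = (
    "You are a helpful assistant. Please answer all questions as truthfully "
    "as possible to the best of your knowledge and ability."
)
question = "If I have three apples and I eat two pears, how many apples do I have?"  # placeholder

# Variant A: turns run together on one line.
prompt_a = f"{system_prompt}<|end_of_turn|><|user|> {question}<|end_of_turn|><|bot|> "

# Variant B: newlines between the turns (the documentation is ambiguous about this).
prompt_b = f"{system_prompt}<|end_of_turn|>\n<|user|> {question}<|end_of_turn|>\n<|bot|> "

print(prompt_a)
print(prompt_b)
```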

In a few cases it said things such as "2 apples (which equals 2 oranges)", so I suspect it is incorrectly inferring that apples and oranges are equivalent for the purposes of the question, instead of realising that the question is misleading.

Probably unrelated, but in text-generation-webui the model doesn't always end generation correctly at the end of its turn. I am using the prompt template above, and I have also tried adding `<|end_of_turn|>` to the stop tokens on the parameters page. The model does not always output the `<|end_of_turn|>` text before it starts writing as the user; sometimes "<|end_of_turn|>" appears in the actual message in the conversation and then the model continues writing as the user instead of generation stopping. So I suspect that either text-generation-webui is not tokenising the `<|end_of_turn|>` marker correctly, or the model is actually writing `< | end _ of _ turn | >` as separate characters/tokens instead of generating a single special token.

[screenshot: the model continuing past <|end_of_turn|> and writing as the user]
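
In case it's useful, this is the kind of check I'd run to see whether the marker really encodes to a single token (the checkpoint ID is my assumption):

```python
# Quick check of whether <|end_of_turn|> encodes to a single special token
# or gets split into many pieces. The checkpoint ID is an assumption.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Open-Orca/OpenOrca-Preview1-13B")

ids = tokenizer.encode("<|end_of_turn|>", add_special_tokens=False)
print(ids)                                   # one ID -> single special token; several IDs -> it's being split
print(tokenizer.convert_ids_to_tokens(ids))  # shows exactly how the marker gets broken up

# If it really is a single token, it could also be passed to generate() as an
# extra end-of-sequence ID so generation stops at the end of the assistant's turn:
# model.generate(**inputs, eos_token_id=[tokenizer.eos_token_id, ids[0]])
```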
