Eval results

by theblackcat102 - opened

It would be interesting to know how this model performs compared to others in terms of CoT and world-knowledge use (mainly because of the expanded FF layer).

chargoddard/llama2-22b 37.48
vicuna-13B v1.3 35.78
WizardLM-13B-V1.1 39.59
llama-v1-13b 36.52

Still running MMLU, but the scores on all the subtasks do seem similar to llama-v2-13b.

Updated MMLU scores:
WizardLM-13B-V1.1 49.95
vicuna-13B v1.3 52.1
llama-v1-13b 46.2
chargoddard/llama2-22b 53.60
llama-v2-13b 55.75

Thanks for running these! It’s great to have actual benchmark scores. I’d call this a win - the fact that the MMLU score is only slightly lower than llama-v2-13b’s (53.60 vs. 55.75) is very promising. The amount of rehabilitation training done to this model was fairly minimal. I’m hopeful that this will shine with some actual training.
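
For anyone who wants to put a number on "slightly", here is a minimal sketch that computes each model's MMLU gap relative to llama-v2-13b. The scores are copied from the post above; the names `mmlu`, `BASELINE`, and `gaps_vs_baseline` are made up for this illustration and are not part of any eval harness.

```python
# MMLU scores as posted earlier in this thread.
mmlu = {
    "WizardLM-13B-V1.1": 49.95,
    "vicuna-13B v1.3": 52.1,
    "llama-v1-13b": 46.2,
    "chargoddard/llama2-22b": 53.60,
    "llama-v2-13b": 55.75,
}

# Assumption: llama-v2-13b is the reference point, as in the comment above.
BASELINE = "llama-v2-13b"


def gaps_vs_baseline(scores, baseline):
    """Return {model: (gap in points, gap as a percentage of the baseline)}."""
    base = scores[baseline]
    return {
        name: (round(score - base, 2), round(100 * (score - base) / base, 2))
        for name, score in scores.items()
        if name != baseline
    }


gaps = gaps_vs_baseline(mmlu, BASELINE)

# Print the models from smallest to largest shortfall.
for name, (points, percent) in sorted(
    gaps.items(), key=lambda item: item[1][0], reverse=True
):
    print(f"{name:<24} {points:+6.2f} pts  ({percent:+.2f}%)")
```

With these numbers, chargoddard/llama2-22b comes out 2.15 points below llama-v2-13b (about 3.9% relative) and ahead of the other three 13B models listed.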
