lmsys/vicuna-33b-v1.3 · Bigger is NOT always better...

MrDevolver

Jul 10, 2023

zjmiller

Jul 12, 2023

Any guesses what might be going on here?

MrDevolver

Jul 12, 2023

Any guesses what might be going on here?

What do you mean? It's a simple comparison of the outputs to the same prompt from two different AI models, one bigger than the other.

zjmiller

Jul 12, 2023

I just meant any guesses as to what explains the difference.

MrDevolver

Jul 12, 2023

edited Jul 12, 2023

I just meant any guesses as to what explains the difference.

Honestly, I have no idea. The output is usually random, and I believe that if you tried long enough, the big model would eventually produce a good answer. Still, it feels like the number of really bad answers is pretty high across the board. Try the chatglm model, for example, and it will almost never be tricked by this question.

I did a little test yesterday. I was interested in seeing which models would produce good enough answers, so I scored them. I picked the few that seemed best and then tested them all against each other. It turns out a few were good, but sadly most of them weren't. I wasn't really strict in my testing: I accepted even weird math if the result itself was good enough for my purpose, which is storytelling. But when a model produced something like what you see on the left, or something similar with a final age guess that's obviously incorrect because it's < 43 (yeah, that really happened, and more often than I would like), I even gave it a negative score.

I wrapped up the results of my private test and concluded that chatglm was the best. The mpt model was second, but it sometimes produced such wacky results that I would disqualify it for that alone if I wanted to be really strict lol.

Then I tried something a little different. I rephrased the same prompt so that the correct answer was super obvious and the model had no choice but to answer correctly, and then all of the models got it right. So I concluded that some models have better common sense (if that's applicable here at all) than others.
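For anyone who wants to repeat this kind of informal test over several samples per model, here is a minimal sketch of the scoring idea. Everything in it is hypothetical: the names `score_answer`, `tally`, and `MIN_PLAUSIBLE_AGE`, the model labels, and the sample answers are made up for illustration, and reading the last number in a reply as the final age guess is a crude assumption. The post only says that a guess below 43 is obviously wrong; treating anything at or above 43 as acceptable is a simplification.

```python
import re
from collections import defaultdict

# From the post: a final age guess below 43 is obviously incorrect.
# Assumption: anything at or above this value counts as acceptable.
MIN_PLAUSIBLE_AGE = 43


def score_answer(text):
    """Score one reply: +1 plausible age, -1 impossible age, 0 if no number is found."""
    numbers = re.findall(r"\d+", text)
    if not numbers:
        return 0
    # Assumption: the last number in the reply is the model's final age guess.
    final_age = int(numbers[-1])
    return 1 if final_age >= MIN_PLAUSIBLE_AGE else -1


def tally(samples):
    """Sum scores per model over (model_name, answer_text) pairs."""
    totals = defaultdict(int)
    for model, answer in samples:
        totals[model] += score_answer(answer)
    return dict(totals)


# Made-up replies, several per model, so one lucky or unlucky output does not decide the result.
samples = [
    ("model_a", "She is probably around 45 years old."),
    ("model_a", "So the answer is 30."),
    ("model_b", "I would guess 50."),
    ("model_b", "She must be at least 43."),
]

print(tally(samples))  # {'model_a': 0, 'model_b': 2}
```

Collecting more replies per model and comparing the totals gives a steadier picture than judging a single output from each.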

practical-dreamer

Jul 29, 2023

sample size of one