Why is this LLM so good at DROP?

#8
by Phil337 - opened

After the new tests were added, this is now the highest-scoring 7B LLM on Hugging Face, thanks primarily to a DROP score that's ~35 points higher than the previous leader's.

Intrigued, I ran my personal tests on SynthIA 1.3 and 2, and while they both performed above average across the board, they didn't score higher in any testing category.

So my question is, what makes this LLM perform so well on DROP?

It’s probably the Tree-of-Thought capability. It’s trained to deconstruct the question as a tree structure and backtrack when needed. Maybe that gives it the comprehension needed to answer questions about long texts/paragraphs. SynthIA models in general are good at long-form conversations: the longer the context, the better the responses.
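For anyone who wants to try that style of prompting, here's a minimal sketch using a SynthIA checkpoint via transformers. The checkpoint name, the system prompt, and the SYSTEM/USER/ASSISTANT template below are assumptions based on my recollection of the model card, so double-check the card for the exact format:

```python
# Minimal sketch of Tree-of-Thought-style prompting with SynthIA.
# The checkpoint name, system prompt, and prompt template are assumptions
# taken from the SynthIA model card -- verify them before relying on this.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "migtissera/SynthIA-7B-v2.0"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

system = (
    "Elaborate on the topic using a Tree of Thoughts and backtrack when "
    "necessary to construct a clear, cohesive Chain of Thought reasoning."
)
question = "Summarize the key events in the passage above, then answer: who scored last?"

# Assumed SynthIA prompt template (SYSTEM/USER/ASSISTANT).
prompt = f"SYSTEM: {system}\nUSER: {question}\nASSISTANT:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```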

Just for my info, what kind of tests did you run?

Thanks for responding. I'm new to this, but that made sense to me. There was one part of a test SynthIA did better at, and that was long-prompt storytelling. The stories weren't as good, but it adhered to the instructions all the way to the end of the stories, which is rare.

To answer your question, progressively longer story and poem prompts, such as limericks and sonnets, are one testing category. This is primarily because LLMs have a tendency to force shopworn story elements into every story, even when they're inappropriate and result in blatant contradictions. For example, if prompted to have someone get caught stealing money from a counter, the LLM will say 'he heard footsteps coming down the hall', yet moments later still have him get caught red-handed grabbing the money. When I ask why, it always says 'to build suspense'. And when I point out that getting a heads-up like hearing footsteps contradicts being caught red-handed, it can usually explain the contradiction.

Anyway, I also devised a list of tricky questions on various topics, especially pop culture because it's a blind spot for most LLMs. Smart LLMs like GPT5 almost always get them right, while dumb LLMs like Falcon almost always get them wrong. For example, "Which character did Meg Ryan play in Joe Versus the Volcano?" is a trick question because she played three different roles. Another question is "What Meghan Trainor song is the lyric 'So don't be thinking I'll be home and baking apple pies' from?". Since small LLMs don't contain precise lyrics, only the gist of songs, the dumb ones always name her most popular song (e.g. All About That Bass) or something else. But the smart ones find the obvious connection between the lyric and the song title (Dear Future Husband).
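As a rough illustration (not my actual harness), this kind of trick-question check can be scripted. The `generate` callback and the expected-answer keywords below are assumptions drawn from the examples above:

```python
# Hypothetical sketch of a trick-question harness. `generate(prompt)` is an
# assumed stand-in for whatever model is under test; questions and keywords
# come from the examples in this thread.
trick_questions = [
    {
        "question": "Which character did Meg Ryan play in Joe Versus the Volcano?",
        # A strong answer should mention that she played three roles
        # (DeDe, Angelica, and Patricia). Keyword matching is crude but
        # good enough for a quick pass/fail count.
        "keywords": ["three", "DeDe", "Angelica", "Patricia"],
    },
    {
        "question": ("What Meghan Trainor song is the lyric "
                     "\"So don't be thinking I'll be home and baking apple pies\" from?"),
        "keywords": ["Dear Future Husband"],
    },
]

def score(generate):
    """Count how many trick questions the model's answer touches correctly."""
    passed = 0
    for item in trick_questions:
        answer = generate(item["question"]).lower()
        if any(keyword.lower() in answer for keyword in item["keywords"]):
            passed += 1
    return passed, len(trick_questions)
```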

Another category is censorship and moralizing: not things that should be censored (e.g. stealing a car, or celebrity information illegally released via phone hacking), but things that are perfectly legal yet contentious. For example, there's a solo scene performed by Naomi Watts in Mulholland Drive that made a lot of waves and was used to symbolize the character's distress. This is a good test because while even censored LLMs usually identify sex scenes in movies, including male solo scenes, they do all kinds of weird things when asked about female solo scenes (e.g. denying that they occurred, lecturing you about respecting the privacy of celebrities even though millions saw the scene, and so on).

The most interesting test is how an LLM responds to logic. That is, when an LLM makes a logical mistake, I correct it, then ask it to respond. Stupid LLMs, such as those trained primarily on multi-turn conversations, just return irrelevant pre-packaged nonsense. But smart LLMs trained on multi-step instruction/explanation data like Orca, though usually only if cleaned of alignment like Dolphin and SynthIA, will often catch their error. And when prompted, they can explain why it was an error with different words, examples..., proving that they weren't just giving in and humoring the user, but actually processed why they were wrong.
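To make that protocol concrete, here's a hypothetical sketch of the correction loop; `chat(messages)` is an assumed stand-in for any chat-style API, not any specific library:

```python
# Hypothetical sketch of the logic-correction test: pose a question, correct
# the model's mistake, then probe whether the follow-up actually engages with
# the correction rather than returning pre-packaged agreement.
def correction_test(chat, question, correction, probe="Why was that an error?"):
    messages = [{"role": "user", "content": question}]
    first = chat(messages)                      # initial (possibly flawed) answer
    messages.append({"role": "assistant", "content": first})
    messages.append({"role": "user", "content": correction})
    acknowledgement = chat(messages)            # does it catch the error?
    messages.append({"role": "assistant", "content": acknowledgement})
    messages.append({"role": "user", "content": probe})
    explanation = chat(messages)                # can it re-explain in its own words?
    return first, acknowledgement, explanation
```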
