This LLM can reason through stubbornness, censorship and alignment, unlike any other I've tested.

#7
by Phil337 - opened

In short, in a single paragraph-long response it comprehended the logic, conceded that it had been wrong, then asked why everybody, including the director of the movie in question, said otherwise. Try it yourself (see below).

I carefully constructed questions to test the primary failure modes of LLMs, such as hallucinations and censorship.

Since even aligned LLMs reliably answered questions about sex scenes in movies, including male solo scenes but commonly not female solo scenes, I added a question involving one of the most widely known and discussed female scenes in order to rule out ignorance and shine a light on censorship (Naomi Watts in Mulholland Drive).

Across multiple conversations it stubbornly said no such scene exists. Even when I said I was watching the scene from the Mulholland Drive DVD as we spoke, it suggested I might be watching the wrong movie. Finally, after I described the scene, it admitted there's a scene matching my description, but insisted it was a symbolic scene and hence not a self-pleasuring scene. And despite this being a blatantly illogical statement, it stuck to it no matter what I tried.

So I asked: if a movie has a scene in which an actress jumps into the ocean to symbolize the washing away of her old life as she transitions into her future life, does the movie still contain a scene in which the actress jumps into the ocean? It answered yes.

Then I described the Mulholland Drive scene again and asked: does the fact that the scene communicates a symbolic meaning change the fact that the movie contains a self-pleasuring scene?

It responded by asking me to describe the scene in as much detail as possible, so I did (crying, interrupted by a ringing phone...). It then said it was aware of the scene, added details about it (proving it wasn't just humoring me), conceded that I was right, explained why, then asked why everybody, including the director, misrepresented the scene.

Again, I've never witnessed this before, even with GPT-4. It was fed blatantly illogical information; for example: jumping into the ocean symbolized the character washing away her old life; therefore, there's no scene in the movie in which she jumps into the ocean. It stuck to either this illogical notion or to the scene simply not existing at all, no matter what I tried. But getting it to apply the relevant logic through a different example not only worked, it also left it confused as to why all the information it had about the movie misrepresented the scene.

My best guess for why this is happening is that censorship and alignment at multiple levels (not just in the data selected for training, but from the director himself), plus whatever made it through your unalignment attempts, created a lie (no such scene exists), followed by a blatantly illogical defense of that lie (the scene doesn't exist because it's making a symbolic point) and a stubborn refusal to concede otherwise. However, since Mistral contains millions of logical statements, it can understand the applicable logic, concede defeat, and, most surprisingly, ask about the odd contradiction of there being a very explicit scene that clearly depicts masturbation while all the information it has says otherwise.

Cognitive Computations org

Really nice to hear this.

This would make an excellent blog post or YouTube video.
